Apache Pig Load Function Bag as input possible? - hadoop

if I write a custom Load Function with the constructor
MyLoadFunction(String someOptions, DataBag myBag)
How can I execute this function with piglatin?
X = load 'foo.txt' using MyLoadFunction('myString', myBagAlias);
this does not work, is it even possible?
thanks

I'm not sure your need is suitable for Pig. Pig is all about loading up a lot of data and then putting that data through a pipeline. It sounds like you want something more procedural, to load a small amount of data, do some processing, make a decision based on that, and follow that algorithm to completion.
So I'm not sure this is the best way for you to go, but you can try writing a UDF that will access HBase and grab the data you need. LOAD is inappropriate here because LOAD does not return a bag, it returns a relation that Pig expects you to put through some transformations. But you can pass a bag as input to a UDF, and then inside that UDF to do the HBase lookup and processing you want to do.
A more Pig-ish way of doing things would be to load all of the relevant HBase data into one or more relations, and then do a JOIN as appropriate to combine the pieces of data you want together.

Related

Spark streaming get pre computed features

I am trying to use spark streaming to deal with some order stream, I have some previous computed features for maybe a buyer_id for order in the stream.
I need to get these features while the Spark Streaming is running.
Now, I stored the buyer_id features in a hive table and load it into and RDD and
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
another way to deal with this is maybe save the features in to a hbase table. and fire a get on every buyer_id.
which one is better ? or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each Batch Inteverval time.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program in a Broadcast variable or,even better, in a final variable. Either this, or create an RDD before the DStream and keep it as reference (which looks like what you are doing now), although remember to cache it (always if you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.

How do I store data in multiple, partitioned files on HDFS using Pig

I've got a pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, but I'd like to do so in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or subsets of the data without having to ingest the full output of my pig job, as that is a significant amount of data.
For example, if the schema of my relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd really like to be able to partition this data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only ids. This would allow me to easily subset my data for subsequent analysis without duplicating data.
Is such a thing possible, even with a custom StoreFunc? Is there a different approach that I should be taking to accomplish this goal?
I'm pretty new to Pig, so any help or general suggestions about my approach would be greatly appreciated.
Thanks in advance.
Multistore wasn't a perfect fit for what I was trying to do, but it proved a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function that parsed the group tuple, using each of the items to build up the HDFS path, and then parsed the bag of ids, writing one ID per line into the result file.

Merge multiple document categorizer models in OpenNLP

I am trying to write a map-reduce implementation of Document Categorizer using OpenNLP.
During the training phase, I am planning to read a large amount of files and create a model file as result of the map-reduce computation(may be a chain of jobs). I will distribute the files to different mappers, I would create a number of model files as result of this step. Now, I wish to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend/modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far fetched, I request for suggestions to do this by generating document samples corresponding to the input files as output of map-reduce step and reducing them to model files by feeding them to document categorizer trainer.
Thanks!
I've done this before, and my approach was to not have each reducer produce the model, but rather only produce the properly formatted data.
Rather than use a category as a key, which separates all the categories Just use a single key and make the value the proper format (cat sample newline) then in the single reducer you can read in that data as (a string) a bytearrayinputstream and train the model. Of course this is not the only way. You wouldn't have to modify opennlp at all to do this.
Simply put, my recommendation is to use a single job that behaves like this:
Map: read in your data, create category label and sample pair. Use a key called 'ALL' and context.write each pair with that key .
Reduce: use a stringbuilder to concat all the cat: sample pairs into the proper training format. Convert the string into a bytearrayinputstream and feed the training API . Write the model somewhere.
Problem may occur that your samples data is too huge to send to one node. If so, you can write the values to A nosql db and read then in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, then at classification time write z wrapper that tests data across them all and Getz The best from each one..... Lots of options.
HTH

Pig HbaseStorage customization

How can I customize HbaseStorage for pig script? Actually I want to perform some business logic on the data before loading it to the pig script. It would be something like custom storage on top of HbaseStorage.
e.g I've my row key has structure like this A_B_C. Currently, I'm passing A_B_C key in HbaseStorage in my pig script but I want to perform some logic like filtering etc against key like A_B_C_D before serving input data to actual pig script. How is it possible
You may have to end up looking at the HBaseStorage java class and implementing your own classes based on that. Depending on how the HBaseStorage and associated classes have been written, this could vary from being easy (just extend HBaseStorage itself and overwrite where necessary) to a real headache.
You then have to ensure that the .jar containing your code is on the pig classpath.
I find HbaseStorage to be a real pain, so I write regular Java MR jobs to query HBase and create custom sequence files, which I then use from Pig with a simple custom loader. I find this saves a ton of time since the sequence file can be re-used many times throughout the day to get quick results, rather than scanning everything in Hbase for every Pig script.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in hive. I find myself currently working on something that it outside the scope though. I have a mapreduce written, but it requires a lot of direct user inputs for information that could easily be scraped from Hive. That said, when I query hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information, or better yet, get it directly in a more direct manor?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using it's custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.

Resources