Apache Pig - Transform data bag to set of rows - hadoop

I have a Pig relation containing data like this:
(123,{(1),(2),(3)},{(0.5),(0.6),(0.7)})
I want to generate records in the format below:
123,1,0.5
123,2,0.6
123,3,0.7
I am able to do this when the data above contains a single bag, but I cannot work out how to generate the required output when there are multiple bags.
Any suggestions?
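One way to handle multiple bags is to number the elements of each bag and then join on that position, so the bags stay aligned instead of producing a cross product. The sketch below is only a starting point: it assumes Apache DataFu's Enumerate UDF is available (the DataFu jar must be REGISTERed first) and that the relation above is called data with fields id, b1 and b2 (placeholder names):

-- Enumerate appends a position to every tuple in a bag; '1' makes the numbering 1-based.
DEFINE Enumerate datafu.pig.bags.Enumerate('1');

-- data: (id, b1, b2), e.g. (123,{(1),(2),(3)},{(0.5),(0.6),(0.7)})
a = FOREACH data GENERATE id, FLATTEN(Enumerate(b1)) AS (v1, i1);  -- (123,1,1) (123,2,2) (123,3,3)
b = FOREACH data GENERATE id, FLATTEN(Enumerate(b2)) AS (v2, i2);  -- (123,0.5,1) (123,0.6,2) (123,0.7,3)

-- Pair up the elements that sit at the same position in their respective bags.
j = JOIN a BY (id, i1), b BY (id, i2);
result = FOREACH j GENERATE a::id AS id, v1, v2;  -- 123,1,0.5  123,2,0.6  123,3,0.7

With a single bag a plain FLATTEN is enough; the join is only needed here to keep the two bags zipped together by position.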

Related

SpoolDirectory to Hbase using Flume

I am able to transfer my data from the spooling directory to HBase, but the data is in JSON format and I want its fields to end up in separate columns. I am using a Kafka channel.
Please find attached a screenshot of my HBase column.
If you look at the picture, status, category and sub_category should be different columns.
I created the HBase table like this: create 'table_name','details'. So they all sit in the same column family, but how do I split the JSON data into different columns? Any thoughts?

Nifi denormalize json message and storing in Hive

I am consuming a nested JSON Kafka message and, after some transformations, storing it in Hive.
The JSON contains several nested arrays, and the requirement is to denormalize it so that each array element becomes a separate row in the Hive table. Would JoltTransform or SplitJson work, or do I need to write a Groovy script for this?
Sample input -
{"TxnMessage": {"HeaderData": {"EventCatg": "F"},"PostInrlTxn": {"Key": {"Acnt": "1234567890","Date": "20181018"},"Id": "3456","AdDa": {"Area": [{"HgmId": "","HntAm": 0},{"HgmId": "","HntAm": 0}]},"escTx": "Reload","seTb": {"seEnt": [{"seId": "CAKE","rCd": 678},{"seId": "","rCd": 0}]},"Bal": 6766}}}
Expected Output -
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.AdDa.Area.HgmId":"","TxnMessage.PostInrlTxn.AdDa.Area.HntAm":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"CAKE","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":678,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}
{"TxnMessage.PostInrlTxn.seTb.seEnt.seId":"","TxnMessage.PostInrlTxn.seTb.seEnt.rCd":0,"TxnMessage.HeaderData.EventCatg":"F","TxnMessage.PostInrlTxn.Key.Acnt":"1234567890","TxnMessage.PostInrlTxn.Key.Date":"20181018","TxnMessage.PostInrlTxn.Id":"3456","TxnMessage.PostInrlTxn.escTx":"Reload","TxnMessage.PostInrlTxn.Bal":6766}

Pass parameter from spark to input format

We have files in a specific format in HDFS, and we want to process the data extracted from them within Spark. We have started to write an input format in order to create the RDD; this way we hope to be able to create an RDD from the whole file.
However, each job only has to process a small subset of the data contained in the file, and I know how to extract that subset far more efficiently than by filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my input format (the same way the Hive context does)?
Edit:
My file format is NetCDF, which stores huge matrices efficiently for multidimensional data, for example x, y, z and time. A first approach would be to extract all values from the matrix and produce one RDD row per value. Instead, I'd like my input format to extract only a small subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a short time period. I need to pass the time period to the input format, which will then retrieve only the values I'm interested in.
I guess the Hive context does this when you pass it a SQL query: only the values matching the query are present in the RDD, not all the lines of the files.

Pig - HBASE Query - Dynamic Columns - Convert Map to Bag

I am attempting to use Pig to query an HBase table whose columns are dynamic (the column names are not known at query time). This makes referencing the returned map[] of key-value pairs by key infeasible, so I would like to convert each map returned by the query into a bag of key-value tuples.
How do I do this?
I have seen one example (in Python) that ranks high in search results and converts each map's key-value pairs into a bag of tuples. See below.
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # Pig passes the map[] in as a Python dict; dict.items() returns its (key, value) pairs.
    return map_dict.items()
However, in trying to follow this example, I do not know where the .items() function comes from. Is there a way to achieve this in pure Pig Latin with out-of-the-box UDFs?
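For reference, here is a minimal sketch of how such a Jython UDF is typically wired up from Pig Latin; the file name udfs.py, the table name table_name and the column family cf1 are placeholders:

-- Register the Python file as a Jython UDF namespace.
REGISTER 'udfs.py' USING jython AS myudfs;

-- Load the whole column family as a map[]; '-loadKey true' also returns the row key.
rows = LOAD 'hbase://table_name'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*', '-loadKey true')
       AS (rowkey:chararray, cols:map[]);

-- Turn each map into a bag of (key, value) tuples and flatten it into one row per column.
kv = FOREACH rows GENERATE rowkey, FLATTEN(myudfs.bag_of_tuples(cols)) AS (key, value);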

Not able to store data into HBase using Pig when I don't know the number of columns in a file

I have a text file with N columns (N is not fixed; in the future it may be N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store the above data into HBase using Pig without writing a UDF. How can I store this kind of data without knowing the number of columns in the file?
Put the columns into a map; then you can store with cf1:*, where cf1 is your column family.
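A minimal sketch of the storage side of that suggestion, assuming the variable-length rows have already been packed into a map[] (building that map in the first place generally needs a small UDF or a custom loader) and using table_name and cf1 as placeholders:

-- Assume a relation shaped as (rowkey:chararray, cols:map[]), e.g. ('3', ['c1'#'D','c2'#'E','c3'#'F']).
-- With 'cf1:*', HBaseStorage writes every map entry as its own column in family cf1,
-- so the number of columns per row does not have to be known in advance.
STORE data_as_map INTO 'hbase://table_name'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*');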
