Store some fields from PIG to Hbase - hadoop

I am trying to extract part of a string and store it in HBase columns.
File contents:
msgType1 Person xyz has opened Internet:www.google.com from IP:192.123.123.123 for duration 00:15:00
msgType2 Person xyz denied for opening Internet:202.x.x.x from IP:192.123.123.123 reason:unautheticated
msgType1 Person xyz has opened Internet:202.x.x.x from IP:192.123.123.123 for duration 00:15:00
The pattern of messages for each msgType is fixed. Now I am trying to store the person name, destination, source, duration, etc. in HBase.
I am trying to write a script in PIG to do this task.
But I am stuck at the extraction part (extracting the IP or website name from the 'Internet:202.x.x.x' token inside the string).
I tried a regular expression, but it is not working for me. The regex always throws this error:
ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT as multiple or none of them fit. Please use an explicit cast.
Is there any other way to extract these values and store them to HBase, in PIG or otherwise?

How do you use the REGEX_EXTRACT function? Have you seen the REGEX_EXTRACT_ALL function? According to the documentation (http://pig.apache.org/docs/r0.9.2/func.html#regex-extract-all), it should be like this:
test = LOAD 'test.csv' USING org.apache.pig.builtin.PigStorage(',') AS (key:chararray, value:chararray);
test = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL (value, '(\\S+):(\\S+)')) as (match1:chararray, match2:chararray);
DUMP test;
My file looks like this:
1,a:b
2,c:d
3,
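Regarding the original question's ERROR 1045: that error usually means Pig cannot pick an overload because the argument's type is bytearray, and an explicit cast to chararray resolves it. A minimal sketch against the sample log lines (the file name, field name, and regexes here are assumptions, not from the question's actual script):

```pig
-- Sketch, assuming each log line is loaded as one chararray field.
-- The explicit (chararray) cast is what resolves ERROR 1045 when the
-- field would otherwise be inferred as bytearray.
logs = LOAD 'messages.txt' AS (line:chararray);
parsed = FOREACH logs GENERATE
    REGEX_EXTRACT((chararray)line, 'Person (\\S+)', 1)   AS person,
    REGEX_EXTRACT((chararray)line, 'Internet:(\\S+)', 1) AS destination,
    REGEX_EXTRACT((chararray)line, 'IP:(\\S+)', 1)       AS source;
DUMP parsed;
```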

I know it's easy to be lazy and not take the step, but you really should use a user-defined function here. Pig is good as a data flow language and not much else, so in order to get the full power out of it, you are going to need to use a lot of UDFs to go through text and do more complicated operations.
The UDF will take a single string as a parameter, then return a tuple that represents (person, destination, source, duration). To use it, you'll do:
A = LOAD ...
...
B = FOREACH A GENERATE MyParseUDF(logline);
...
STORE B INTO ...
You didn't mention what your HBase row key was, but be sure that's the first element in the relation before storing it.
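To complete the picture, storing the parsed relation into HBase could look roughly like this. This is only a sketch: MyParseUDF is the hypothetical UDF described above, and the table name and column family are made up.

```pig
-- Sketch: MyParseUDF, 'logtable' and the 'log' column family are placeholders.
-- HBaseStorage uses the first field of the relation as the HBase row key.
A = LOAD 'logs.txt' AS (logline:chararray);
B = FOREACH A GENERATE FLATTEN(MyParseUDF(logline))
        AS (person:chararray, destination:chararray,
            source:chararray, duration:chararray);
STORE B INTO 'hbase://logtable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'log:destination log:source log:duration');
```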

Related

Spring Batch, read whole csv file before reading line by line

I want to read a CSV file, enrich each row with data from an external system, and then write the new, enriched CSV to some directory.
To get the data from the external system, I need to pass each row one by one and get the new columns back.
But to query the external system with each row, I need to pass a value which I first get from the external system by sending all the values of a particular column.
e.g. - my csv file is -
name, value, age
10,v1,12
11,v2,13
So to enrich the rows, I first need to compute the total age (i.e. 12 + 13), get a value for that total from the external system, and then send that total along with each row to the external system to get the enriched value.
I am doing this using Spring Batch, but with FlatFileItemReader I can read only one line at a time. How can I refer to the whole column before that?
Please help.
Thanks
There are two ways to do this.
OPTION 1
Go for this option if you are okay with holding all the records in memory. It depends entirely on how many records you need in order to calculate the total age.
Reader (custom reader):
Write the logic to read one line at a time.
Return null from read() only once all the lines needed for calculating the total age have been read.
NOTE: A reader will loop on the read() method until it returns null.
Processor: You will get the full list of records. Calculate the total age.
Connect to the external system and get the value. Form the records which need to be written and return them from the process method.
NOTE: You can return all the records modified by a particular field, or merge them into a single record. This is entirely your choice.
Writer: Write the records.
OPTION 2
Go for this if option 1 is not feasible.
Step 1: read all the lines, calculate the total age, and pass the value to the next step.
Step 2: read all the lines again, apply the required updates to the records, and write them.

How to rename the fields in a relation

Consider the following code :
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
The relation ranked has two fields: the line number and the text. The text is called line and can be referred to by that alias, but the line number generated by RANK has no alias. As a consequence, the only way I can refer to it is as $0.
How can I give $0 a name, so that I can refer to it more easily once it has been joined to another data set and is no longer $0?
What you want to do is define a schema for your data. The easiest way to do so is to use the AS keyword, just like you're doing with LOAD.
You can define a schema with three operators: LOAD, STREAM, and FOREACH.
Here, the easiest way would be the following:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
renamed_ranked = FOREACH ranked GENERATE $0 AS rank, $1;
You can find more information in the associated documentation.
It is also good to know that this operation won't add an iteration to your script. As @ArnonRotem-Gal-Oz said:
Pig doesn't perform the action in a serial manner i.e. it doesn't do all the ranking and then does another iteration on all the records. The pig optimizer will do the rename when it assigns the rank. You can see a similar behaviour explained in the pig cookbook.
You can add a projection with FOREACH as
named_ranked = FOREACH ranked GENERATE $0 as r,*;

Storing data with big and dynamic groupings / paths

I currently have the following pig script (column list truncated for brevity):
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
)
SPLIT inputData INTO site0 IF (SITE_ID_COL == 0), site3 IF (SITE_ID_COL == 3), site15 IF (SITE_ID_COL == 15);
STORE site0 INTO 'pigsplit1/0/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/0/','2', 'bz2', '\\x7F');
STORE site3 INTO 'pigsplit1/3/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/3/','2', 'bz2', '\\x7F');
STORE site15 INTO 'pigsplit1/15/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/15/','2', 'bz2', '\\x7F');
And it works great for what I wanted it to do, but there are actually at least 22 possible site IDs, and I'm not certain there aren't more. I'd like to dynamically create the splits and store into paths based on that column. Is the easiest way to do this a two-step usage of the MultiStorage UDF, first splitting by the site ID and then loading all those results and splitting by the date? That seems inefficient. Can I somehow do it through GROUP BYs? It seems like I should be able to GROUP BY the site ID, then flatten each row and run MultiStorage on that, but I'm not sure how to concatenate the GROUP into the path.
The MultiStorage UDF is not set up to divide inputs on two different fields, but that's essentially what you're doing -- the use of SPLIT is just to emulate MultiStorage with two parameters. In that case, I'd recommend the following:
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
)
dataWithKey = FOREACH inputData GENERATE CONCAT(CONCAT((chararray)SITE_ID_COL, '-'), EXTRACT_DATE_COL), *;
STORE dataWithKey INTO 'tmpDir' USING org.apache.pig.piggybank.storage.MultiStorage('tmpDir', '0', 'bz2', '\\x7F');
Then go over your output with a simple script to list all the files in your output directories, extract the site and date IDs, and move them to appropriate locations with whatever structure you like.
Not the most elegant workaround, but it could work all right for you. One thing to watch out for is the separator you choose in your key might not be allowed (it might only be alphanumeric). Also, you'll be stuck with that extra field in your output data.
I've actually submitted a patch to the MultiStorage module to allow splitting on multiple tuple fields rather than only one, resulting in a dynamic output tree.
https://issues.apache.org/jira/browse/PIG-3258
It hasn't gotten much attention yet, but I'm using it in production with no issues.

assigning IDs to hadoop/PIG output data

I am working on a PIG script which performs heavy-duty data processing on raw transactions and derives various transaction patterns.
Say one of the patterns is: find all accounts that received cross-border transactions in a day (with the total number and amount of transactions).
My expected output should be two data files
1) Rollup data - like account A1 received 50 transactions from country AU.
2) Raw transactions - all above 50 transactions for A1.
My PIG script is currently creating output data source in following format
Account Country TotalTxns RawTransactions
A1 AU 50 [(Txn1), (Txn2), (Txn3)....(Txn50)]
A2 JP 30 [(Txn1), (Txn2)....(Txn30)]
Now the question is: when I get this data out of the Hadoop system (into some DB), I want to establish a link between my rollup record (A1, AU, 50) and all 50 raw transactions (e.g. ID 1 for the rollup record used as a foreign key for all 50 associated Txns).
I understand that Hadoop, being distributed, should not be used for assigning IDs, but are there any options where I can assign non-unique IDs (they need not be sequential), or some other way to link this data?
EDIT (after using Enumerate from DataFu)
Here is the PIG script:
register /UDF/datafu-0.0.8.jar
define Enumerate datafu.pig.bags.Enumerate('1');
data_txn = LOAD './txndata' USING PigStorage(',') AS (txnid:int, sndr_acct:int,sndr_cntry:chararray, rcvr_acct:int, rcvr_cntry:chararray);
data_txn1 = GROUP data_txn ALL;
data_txn2 = FOREACH data_txn1 GENERATE flatten(Enumerate(data_txn));
dump data_txn2;
After running this, I am getting:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at datafu.pig.bags.Enumerate.enumerateBag(Enumerate.java:89)
at datafu.pig.bags.Enumerate.accumulate(Enumerate.java:104)
....
I often assign random ids in Hadoop jobs. You just need to ensure you generate ids which contain a sufficient number of random bits to ensure the probability of collisions is sufficiently small (http://en.wikipedia.org/wiki/Birthday_problem).
As a rule of thumb I use 3*log(n) random bits where n = # of ids that need to be generated.
In many cases Java's UUID.randomUUID() will be sufficient.
http://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates
What is unique in your rows? It appears that account ID and country code are what you have grouped by in your Pig script, so why not make a composite key with those? Something like
CONCAT(CONCAT(account, '-'), country)
Of course, you could write a UDF to make this more elegant. If you need a numeric ID, try writing a UDF which will create the string as above, and then call its hashCode() method. This will not guarantee uniqueness of course, but you said that was all right. You can always construct your own method of translating a string to an integer that is unique.
But that said, why do you need a single ID key? If you want to join the fields of two tables later, you can join on more than one field at a time.
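The composite-key idea above can be sketched in Pig directly. The schema below is taken from the question's EDIT section; the cast is needed because CONCAT expects chararray arguments:

```pig
-- Sketch: composite rowkey built from the grouping fields
-- (schema copied from the question's script).
txns = LOAD 'txndata' USING PigStorage(',')
    AS (txnid:int, sndr_acct:int, sndr_cntry:chararray,
        rcvr_acct:int, rcvr_cntry:chararray);
byAcct = GROUP txns BY (rcvr_acct, rcvr_cntry);
rollup = FOREACH byAcct GENERATE
    -- CONCAT takes chararrays, so cast the numeric account id first
    CONCAT(CONCAT((chararray)group.rcvr_acct, '-'), group.rcvr_cntry) AS id,
    COUNT(txns) AS total_txns,
    txns AS raw_txns;
```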
DataFu had a bug in Enumerate which was fixed in 0.0.9, so use 0.0.9 or later.
In case your IDs are numbers and you cannot use UUIDs or other string-based IDs:
There is a library of UDFs from LinkedIn (DataFu) with a very useful UDF, Enumerate. What you can do is group all records into a single bag and pass that bag to Enumerate. Here is the code from the top of my head:
-- register the DataFu jar and define the Enumerate UDF
define Enumerate datafu.pig.bags.Enumerate('1');
inpt = load '....' ....;
allGrp = group inpt all;
withIds = foreach allGrp generate flatten(Enumerate(inpt));

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map each of those ids using another file that contains 2 columns of mappings (column 1 is our data's id, column 2 is a 3rd party's id):
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value, iterate over the ids, and return the first mapped value found in the passed-in mappings (thinking that is how it would logically work).
We are loading the data in PIG like this:
data = LOAD 'input.txt' USING PigStorage() AS (name:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
`there is an error during parsing: Invalid alias mappings in [data::title: chararray, data::category: chararray]`
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?
Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?
EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just wanted to map the first. The UDF will iterate over the list of our category ids - and will return when it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
I don't think you can do it this way in Pig.
A solution similar to what you want would be to load the mapping file inside the UDF and then process each record in a FOREACH. An example is available in the PiggyBank LookupInFiles UDF. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');
data = LOAD 'input.txt' USING PigStorage() AS (name:chararray, category:chararray);
output = FOREACH data GENERATE title, MAP_PRODUCT(category);
This will work if your mapping file is not too big. If it does not fit in memory, you will have to partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native join plus a nested FOREACH with ORDER BY/LIMIT 1 for each product.
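The join-based alternative mentioned above could be sketched roughly as follows. This is an assumption-laden sketch: TOKENIZE splits the id list on whitespace, and the nested LIMIT picks an arbitrary matching id rather than necessarily the first one in the list (also note that nested LIMIT requires a newer Pig release than the 0.6 used in the question):

```pig
-- Sketch of the join alternative: one mapped id per product.
data = LOAD 'input.txt' USING PigStorage() AS (name:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
-- one row per (product, category id)
exploded = FOREACH data GENERATE name, FLATTEN(TOKENIZE(category)) AS ourId;
joined = JOIN exploded BY ourId, mappings BY ourId;
byName = GROUP joined BY exploded::name;
result = FOREACH byName {
    one = LIMIT joined 1;
    GENERATE group AS name, FLATTEN(one.mappings::theirId) AS theirId;
};
```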
