Get xml Values in Pig Latin - hadoop

I am using Pig Latin to process a large XML dump, and I am trying to get the value of an XML node. The file looks like this:
<username>Shujaat</username>
I want to get the value Shujaat. I tried the piggybank XMLLoader, but it returns the XML tags together with their values. The code is:
register piggybank.jar;
A = load 'username.xml' using org.apache.pig.piggybank.storage.XMLLoader('username')
as (x: chararray);
B = foreach A generate x;
This code gives me the username tags as well as the values; I only need the values. Any idea how to do that? I came across regular expressions, but I don't know much about them.
Thanks

The example element you gave can be extracted with:
B = foreach A GENERATE REGEX_EXTRACT(x, '<username>(.*)</username>', 1)
AS name:chararray;
A nested element like this:
<user>
<id>456</id>
<username>Taylor</username>
</user>
can be extracted with something like this:
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
'<user>\\n\\s*<id>(.*)</id>\\n\\s*<username>(.*)</username>\\n\\s*</user>'))
as (id: int, name:chararray);
(456,Taylor)
You will definitely need to define a more sophisticated regex that suits all of your needs, e.g. one that handles empty elements, attributes, and so on. Another option is to write a custom UDF that uses Java libraries to parse the content of the XML so that you can avoid writing (over)complicated, error-prone regular expressions.
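To illustrate the parser-based alternative, here is a minimal sketch in plain Python of the logic such a UDF might contain. It uses the standard library's xml.etree.ElementTree rather than a Java library, and the function name extract_fields is hypothetical; it is not wired into Pig, it only shows how a parser copes with whitespace and empty elements without a hand-written regex.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_fragment, field_names):
    """Return the text of each named element, or None if it is missing or empty."""
    root = ET.fromstring(xml_fragment)
    values = []
    for name in field_names:
        # If the fragment's root is the requested element (e.g. a lone
        # <username> element), use it; otherwise look for a direct child.
        node = root if root.tag == name else root.find(name)
        text = node.text if node is not None else None
        values.append(text.strip() if text and text.strip() else None)
    return tuple(values)


record = """<user>
  <id>456</id>
  <username>Taylor</username>
</user>"""

print(extract_fields(record, ["id", "username"]))
print(extract_fields("<username>Shujaat</username>", ["username"]))
```

Note that the values come back as strings; casting the id to an int would still be up to the caller or the Pig schema.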

Related

Pig:FLATTEN keyword

I am a little confused about the use of the FLATTEN keyword in Pig.
Consider the below dataset:
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
Without using the FLATTEN I can access a field (suppose firstname) like this:
display_firstname = FOREACH tuple_record GENERATE details.firstname;
Now, using the FLATTEN keyword:
flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
DESCRIBE gives me this:
flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
And hence I can access the fields present directly without dereferencing like this:
display_record = FOREACH flatten_record GENERATE firstname;
My questions about the FLATTEN keyword are:
1) Which of the two ways (with or without FLATTEN) is the more efficient way of achieving the same output?
2) Are there any scenarios where the desired output can't be achieved without FLATTEN?
I am totally confused; please clarify its use and the scenarios in which I should use it.
Sometimes you have data in a bag or a tuple and you want to remove that level of nesting.
For example, when you want to rearrange your data on the fly and group by a particular field, you need a way to pull those entries out of the bag.
As per Pig documentation:
The FLATTEN operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and bags in
a way that a UDF cannot. Flatten un-nests tuples as well as bags. The
idea is the same, but the operation and result is different for each
type of structure.
For more details, see the linked Pig documentation, which explains the usage of FLATTEN clearly with examples.
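The difference between flattening a tuple and flattening a bag can be modelled in a few lines of plain Python. This is only an illustration of the semantics, not Pig code: a row is modelled as a Python tuple, a bag as a list of tuples, and the helper names flatten_tuple and flatten_bag are made up for this sketch.

```python
def flatten_tuple(row, index):
    """Model FLATTEN on a tuple field: splice its fields into the enclosing row."""
    return row[:index] + tuple(row[index]) + row[index + 1:]


def flatten_bag(row, index):
    """Model FLATTEN on a bag field: emit one output row per tuple in the bag."""
    return [row[:index] + tuple(item) + row[index + 1:] for item in row[index]]


# A row whose only field is a tuple, like tuple_record in the question.
nested = (("Ann", "Lee", 30, "F"),)
print(flatten_tuple(nested, 0))

# A grouped row: a key followed by a bag of tuples.
grouped = ("a", [("b", "c"), ("d", "e")])
print(flatten_bag(grouped, 1))
```

Flattening a tuple keeps one row and only removes a level of nesting, while flattening a bag multiplies rows; in this model an empty bag produces no rows at all.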

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you need to process the data. The argument can be one or more column names, as in FAKEUDF(column1, column2, ...); for all columns you can pass *, as in FAKEUDF(*); or you can pass the relation name. Inside the UDF you read the column values from the input tuple, e.g. tuple.get(index). Be careful about what you pass as the argument, because the processing depends on it; it can even be a DataBag.
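As a rough model of what the UDF body does with its input, here is a plain-Python sketch. It is not a registered Pig UDF: the function fake_udf is a hypothetical stand-in for FAKEUDF, the input tuple is modelled as a Python tuple, and indexing stands in for tuple.get(index) in a Java UDF. Joining the fields with a pipe is just an arbitrary example of processing.

```python
def fake_udf(input_tuple):
    """Hypothetical stand-in for FAKEUDF: join every field with a pipe."""
    if input_tuple is None or len(input_tuple) == 0:
        return None
    parts = []
    for index in range(len(input_tuple)):
        value = input_tuple[index]  # analogous to tuple.get(index) in Java
        # Null fields are common in Pig data, so handle them explicitly.
        parts.append("" if value is None else str(value))
    return "|".join(parts)


print(fake_udf(("x", "y")))
print(fake_udf(("x", None)))
```

The point of the sketch is that the UDF sees whatever was passed positionally, so its logic has to match the shape of the argument (plain fields, a nested tuple, or a bag).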

How to load a file with a JSON array per line in Pig Latin

An existing script creates text files with an array of JSON objects per line, e.g.,
[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…
I would like to load this data in Pig, exploding the arrays and processing each individual object.
I have looked at using the JsonLoader in Twitter’s Elephant Bird to no avail. It doesn’t complain about the JSON, but I get “Successfully read 0 records” when running the following:
register '/tmp/elephant-bird/core/target/elephant-bird-core-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/usr/local/lib/json-simple-1.1.1.jar';
a = load '/path/to/file.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
dump a;
I have also tried loading the file as normal, treating each line as a single chararray column, and then trying to parse that as JSON, but I can’t find a pre-existing UDF that does the trick.
Any ideas?
Like Donald said, you should use a UDF here. Here at Xplenty we wrote JsonStringToBag to complement ElephantBird's JsonStringToMap.
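For readers who want to see the idea rather than a specific library, the following plain-Python sketch shows the kind of logic a "JSON string to bag" UDF would need. It is not the Xplenty or Elephant Bird implementation; the function name json_array_to_bag is hypothetical, and it assumes each object carries the foo and bar keys from the sample data in the question.

```python
import json


def json_array_to_bag(line):
    """Parse one line holding a JSON array; return a list of (foo, bar) tuples."""
    if line is None or not line.strip():
        return []
    try:
        items = json.loads(line)
    except ValueError:
        return []  # skip malformed lines rather than failing the whole job
    if not isinstance(items, list):
        return []
    # One tuple per JSON object; missing keys become None.
    return [(obj.get("foo"), obj.get("bar")) for obj in items if isinstance(obj, dict)]


print(json_array_to_bag('[{"foo":1,"bar":2},{"foo":3,"bar":4}]'))
```

In Pig, the bag returned by such a UDF could then be passed through FLATTEN to get one row per object.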

Apache Pig - Is it possible to serialize a variable?

Let's take the wordCount example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the entire bag each time we want to execute the script?
Thanks.
STORE bag_words INTO 'some-output-directory';
Then read it in later to skip the foreach generate, flatten, tokenize.
You can output any alias in pig using the STORE command: you could use standard formats (like CSV) or write your own PigLoader class to implement any specific behaviour. You can then LOAD this output in a separate script, thus bypassing the initial LOAD.

Extract ordered tuple values from a bag

In pig I massaged my data into something like:
(a,{(b,c),(d,e),(f,g)})
(h,{(i,j),(k,l)})
where the first item is the group and the bag are other values related to the group. I would like to get it into the following format:
(a,b,c,d,e,f,g)
(h,i,j,k,l)
I got to where I am now with
grunt> j = foreach G {
>> o = order myvar by second;
>> generate group, o.(first,second);
>> };
So the tuples in the bag are currently ordered. If I do something like mystuff = foreach j generate group, flatten($1); I get it all flattened and un-grouped.
Is this possible in pig, and if so what command should I be looking at?
There is no way that I know of to do what you want out of the box. You really need to use a user-defined function for this. I know it sucks because you have to write Java or Python code, but you'll find several situations where Pig just doesn't go far enough. Pig can be considered a data flow language and not so much of a programming language, which is why UDFs play such an important role: they bridge the gap.
My suggestion is you write a UDF that takes in the group and value bag as parameters. Do the ordering/sorting in the UDF and also the flattening.
The other thing you want to be careful about is that now your rows will have different numbers of columns and Pig doesn't really like this. If you are just immediately outputting it, you can probably get away with this. You might want to consider having your UDF write out the list in a tab-delimited string or something that is preformatted. This isn't that big of a deal... feel free to ignore my advice here.
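The suggested UDF logic can be sketched in plain Python as follows. This is an illustration under assumptions, not a drop-in Pig UDF: the bag is modelled as a list of tuples, the names flatten_group and as_delimited are hypothetical, and sorting on the second field mirrors the "order myvar by second" step from the question.

```python
def flatten_group(group, bag, sort_index=1):
    """Sort the bag's tuples by one field, then splice them after the group key."""
    ordered = sorted(bag, key=lambda item: item[sort_index])
    row = [group]
    for item in ordered:
        row.extend(item)
    return tuple(row)


def as_delimited(row, delimiter="\t"):
    """Render a variable-width row as one preformatted string."""
    return delimiter.join(str(value) for value in row)


row = flatten_group("a", [("d", "e"), ("b", "c"), ("f", "g")])
print(row)
print(as_delimited(row))
```

Returning the delimited string sidesteps the variable-column problem mentioned above, since every output row then has a single chararray field.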
