Pig XMLLoader: Cannot parse XML (Converting to CSV) - hadoop

I have the following XML data:
<CompactData><my:DataSet><my:Series VAL="A" AMOUNT_TYPE="FI" IDENTIFIER="1"><my:Obs AMT="24.25" UNIT_MEASURE="KG"></my:Obs></my:Series><my:Series VAL="B" AMOUNT_TYPE="GI" IDENTIFIER="2"><my:Obs AMT="21.22" UNIT_MEASURE="KG"></my:Obs></my:Series></my:DataSet></CompactData>
I am trying to convert it to a CSV format using the following commands in PIG:
A = LOAD '/testing/mydata.xml' using org.apache.pig.piggybank.storage.XMLLoader('CompactData') as (x:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"></my:Obs></my:Series>')) AS (val:chararray,amount_type:chararray,identifier:chararray,amt:chararray,unit_measure:chararray);
Putting the regex <my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"><\/my:Obs><\/my:Series> into RegExr gives two perfect matches, but Pig just does not want to work with it: it always gives me an empty result, whereas I expect the following:
A,FI,1,24.25,KG
B,GI,2,21.22,KG
Update 1: This seems most likely related to the issue mentioned here: Pig xmlloader error when loading tag with colon

Assuming your code does not give an error, I can think of 3 potential problems here:
Your regex is not called
Your regex (in pig) is not returning the expected result
The output of your regex is not shown
To deal with the situation, I would recommend the following steps:
Create a pig program that successfully uses regex to find 'b' in 'aba'
Create a pig program that successfully finds both occurrences of 'a' in 'aba'
Create a pig program that successfully finds the first occurrence of 'a' in 'aba'
Keep 'growing' this solution gradually until you reach your actual solution
If you still get stuck, please share the last solution that worked, and the first one that didn't work (including input!). A sketch of the first two steps is shown below.
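For instance, a minimal sketch of the first two steps might look like this (it reuses the LOAD from the question purely to have a relation to iterate over; the 'aba' literals stand in for the real data):
A = LOAD '/testing/mydata.xml' USING org.apache.pig.piggybank.storage.XMLLoader('CompactData') AS (x:chararray);
-- Step 1: extract 'b' from the literal 'aba'
step1 = FOREACH A GENERATE REGEX_EXTRACT('aba', '.*(b).*', 1) AS b_found;
-- Step 2: extract both occurrences of 'a' from 'aba'
step2 = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL('aba', '(a).*(a)')) AS (first_a:chararray, second_a:chararray);
DUMP step2;
-- As far as I can tell, REGEX_EXTRACT_ALL only returns groups when the pattern
-- matches the entire input string, so when you grow this towards the real data,
-- make sure the pattern covers the whole loaded chararray (including the
-- surrounding CompactData/DataSet tags), not just the my:Series fragment.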

Related

Pig load files using tuple's field

I need help with the following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of one of their fields (we can obviously transform this into tuples with only the file-path field, or into a single tuple whose one field is a delimiter-separated, say comma-separated, string of paths).
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script to use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param "input={hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}"
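Something like this could glue those pieces together (a rough sketch; the file_list path is hypothetical, and it assumes the list of paths is small enough to pull back to the client):
# build the glob from a file that has one HDFS path per line, then hand it to Pig
glob="{$(hdfs dfs -cat hdfs://localhost:9000/user/kailashgupta/file_list/part-* | paste -sd, -)}"
pig -f script.pig -param "input=$glob"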
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's location from the Configuration's mapred.input.dir property), then update the same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped InputFormat (e.g. in my case TextInputFormat).
Note: 1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. One doubt may arise in your mind: what if the LOAD statement executes before the STORE? This will not happen, because if STORE and LOAD use the same file in the script, Pig ensures that the jobs are executed in the right sequence. For more detail, you may read the section Store-load sequences on the Pig Wiki.
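To make that last paragraph concrete, here is a rough sketch of what MyInputFormat's getSplits could look like (the class layout and the assumption that some_temporary_file holds one HDFS path per line are mine; mapred.input.dir is the property name mentioned above and is a deprecated but still-working alias on newer Hadoop):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();

        // At this point mapred.input.dir still points at some_temporary_file
        Path listDir = new Path(conf.get("mapred.input.dir"));
        FileSystem fs = listDir.getFileSystem(conf);

        // Collect the real input paths: one HDFS path per line of the stored relation
        StringBuilder realPaths = new StringBuilder();
        for (FileStatus part : fs.listStatus(listDir)) {
            if (part.getPath().getName().startsWith("_")) {
                continue; // skip _SUCCESS and other bookkeeping files
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(part.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                if (realPaths.length() > 0) {
                    realPaths.append(",");
                }
                realPaths.append(line.trim());
            }
            reader.close();
        }

        // Re-point the input dir at the real files and let TextInputFormat split them
        conf.set("mapred.input.dir", realPaths.toString());
        return super.getSplits(context);
    }
}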

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you are processing your requirement. The argument can be the name of one or more columns, like FAKEUDF(column1, column2, ...), or you can pass all columns with *, like FAKEUDF(*), or you can pass a relation name. Inside the UDF, you take the column values out of the tuple with tuple.get(index). Be careful about what you send as an argument, because the processing depends on it; it can even be a DataBag.
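On the UDF side, the tuple.get(index) part looks roughly like this (the class name and the concatenation logic are made-up placeholders for whatever FAKEUDF really does):
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FakeUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        // With FAKEUDF(*) or FAKEUDF(f1, f2), f1 arrives at index 0 and f2 at index 1
        String f1 = (String) input.get(0);
        String f2 = (String) input.get(1);
        return f1 + "|" + f2;
    }
}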

How to load a file with a JSON array per line in Pig Latin

An existing script creates text files with an array of JSON objects per line, e.g.,
[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…
I would like to load this data in Pig, exploding the arrays and processing each individual object.
I have looked at using the JsonLoader in Twitter’s Elephant Bird to no avail. It doesn’t complain about the JSON, but I get “Successfully read 0 records” when running the following:
register '/tmp/elephant-bird/core/target/elephant-bird-core-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/usr/local/lib/json-simple-1.1.1.jar';
a = load '/path/to/file.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
dump a;
I have also tried loading the file as normal, treating each line as a single-column chararray, and then trying to parse that as JSON, but I can't find a pre-existing UDF which seems to do the trick.
Any ideas?
Like Donald said, you should use a UDF here. Here at Xplenty we wrote JsonStringToBag to complement ElephantBird's JsonStringToMap.
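I don't have the exact package path at hand, so treat this as a rough sketch of the wiring rather than copy-paste code (the second jar path, the com.xplenty... name for JsonStringToBag, and the exact shape of its output are guesses):
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/path/to/xplenty-udfs.jar';  -- hypothetical jar containing JsonStringToBag
lines   = LOAD '/path/to/file.json' USING TextLoader() AS (line:chararray);
-- explode the per-line array into one record per JSON object
objects = FOREACH lines GENERATE FLATTEN(com.xplenty.pig.JsonStringToBag(line)) AS json;
-- assuming each exploded element behaves like a map, pull fields out with #
fields  = FOREACH objects GENERATE (int) json#'foo' AS foo, (int) json#'bar' AS bar;
DUMP fields;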

PIG Latin : While loading how to discard the first line in any file?

I've been using Pig for some time and want to know how to skip the first line while loading a file. I have a file which has a header, so I should ignore the first line and move on to the next lines to do the processing on the date columns and all. How do I go about this?
Thanks
If you have pig version 0.11 you could try this:
input_file = load 'input' USING PigStorage(',') as (row1:chararray, row2:chararray);
ranked = rank input_file;
NoHeader = Filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;
The New_input_file relation should have your data without the header. Note that the rank operator is new in Pig 0.11, so this will not work with earlier versions.
EDIT: note that this solution only works with a single file; if you are instead loading a directory, try something else instead.
The given solution will work well if you just load a single file. However, if you load all the files in a directory (which is also possible, by simply making sure that the input is a directory path), the given solution will only cut off the header of the first file.
To remove the header in each file, you will probably want to use CSVExcelStorage:
my_input = load 'inputfileordir' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'default', 'NOCHANGE', 'SKIP_INPUT_HEADER');

Hive - map not sending parameters to custom map script?

I'm trying to use the MAP clause with Hive, but I'm tripping over the syntax and not finding many examples of my use case around. I used the MAP clause before when I had to process one of the columns of a table using an external script.
I had a Python script called, say, run, that took one command-line parameter and spat out three space-separated values. So I just did:
FROM(MAP
tablename.columnName
USING
'run' AS
result1, result2, result3
FROM
tablename
) map_output
INSERT OVERWRITE TABLE results SELECT *;
Now I have a Python script that receives a lot more parameters; I tried a few things that didn't work and couldn't find examples covering this. I did the obvious thing:
FROM
(MAP
numAgents, alpha, beta, burnin, nsteps, thin
USING
'runAuthorityMCMC' AS numAgents, alpha, beta, energy, avgDegree, maxDegree, accept
FROM
parameters
) map_output
INSERT OVERWRITE TABLE results SELECT *;
But I got the error A user-supplied transfrom script has exited with error code 2 instead of 0. When I run runAuthorityMCMC with 6 command-line parameters sampled from that table, it works perfectly well.
It seems to me it's trying to run the script without passing the parameters at all. In one of the error messages I got exactly the output I would expect if that were the case. What is the correct syntax to do what I'm trying to do?
EDIT:
Confirming - this was part of the error message:
usage: runAuthorityMCMC [-h]
numAgents normalizedBrainCapacity ecologicalPressure
burnInSteps monteCarloSteps thiningRatio
runAuthorityMCMC: error: too few arguments
Which is exactly the output I'd expect with too few arguments. The script should take six arguments.
Ok, perhaps there is a difference of vocabulary here, but Hive doesn't send the values as "arguments" to the script. They are read in through standard input (which is different from passing something as an argument). Also, you can try sending the data to /bin/cat to see what's actually being sent to the script. If my memory serves me right, the values are sent tab-separated, and the result emitted from the script is also expected to be tab-separated.
Try printing stuff to stdout (or stderr) in your script; you will see the result in your jobtracker logs. That will help you debug.
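For example, a quick way to see exactly what reaches the script is to swap it out for /bin/cat (cat_output, the column aliases c1..c6, and the LIMIT are just for illustration; the table and column names are from the question):
FROM (
  MAP numAgents, alpha, beta, burnin, nsteps, thin
  USING '/bin/cat'
  AS c1, c2, c3, c4, c5, c6
  FROM parameters
) cat_output
SELECT * LIMIT 10;
If the six values come back as tab-separated rows there, the MAP clause is delivering them fine, and the script just needs to read them from standard input rather than from command-line arguments.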
