Apache Pig not parsing a tuple fully - hadoop

I have a file called data that looks like this (note there are tabs after 'personA'):
personA (1, 2, 3)
personB (2, 1, 34)
And I have an Apache Pig script like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;
The output of which makes sense:
(1)
(2)
However if I change the schema of the script to be like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Then the output I get is this:
(1, 2, 3)
(2, 1, 34)
It looks like the first (and only) element in this tuple is a bytearray; i.e., it's not parsing the input text 1, 2, 3 into a tuple.
In the future my input will have an unknown and variable number of elements in the nodes item, so I can't just write out a:int, ….
Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?

Pig does not accept what you are passing in as valid. The default load function, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct with the parentheses and commas you have in the text. Your options are:
Reformat your file to be tab delimited: personA 1 2 3
Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want (see the sketch after this list).
Write your own custom loader.
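As a rough illustration of the TextLoader option, here is a minimal sketch that uses the built-in REGEX_EXTRACT and STRSPLIT functions instead of a custom UDF; the regular expressions and alias names are assumptions for illustration, not from the original post:
A = LOAD 'data' USING TextLoader() AS (line:chararray);
-- pull the name off the front, then split whatever sits between the parentheses
B = FOREACH A GENERATE
REGEX_EXTRACT(line, '^(\\S+)', 1) AS name,
STRSPLIT(REGEX_EXTRACT(line, '\\(([^)]*)\\)', 1), ',\\s*') AS nodes;
C = FOREACH B GENERATE name, nodes.$0;
DUMP C;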

This is no longer a limitation. Pig now parses tuples in the input file, treating the comma as the field separator. I tried this in Apache Pig version 0.15.0.
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Output I get is:
(1)
(2)

Here is another way of tackling this issue, although I know the answers above are more efficient.
data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);
data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(',''),'\\)','') AS field2;
data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;
data = FOREACH data GENERATE name, fieldTuple.$0,fieldTuple.$1, fieldTuple.$2 ;
Load field2 as chararray
Remove parentheses
Split field2 by comma (it gives you a tuple with 3 fields in it)
Get values by index
I know it is hacky; I just wanted to provide another way of doing this.

Related

tokenize fields using pig script for records having no delimiter

I have fields C1C2C3C4 (no delimiter present) in a raw file, and I have to generate output which should look like C1,C2,C3,C4 using a Pig script.
Given: the size of C1 = C2 = C3 = C4 = 4 bytes.
This should be straightforward with these steps:
Load the data as is
Generate four new columns, using the SUBSTRING function
For example, since SUBSTRING is zero-based with an exclusive end index, you should be able to extract c2 as:
SUBSTRING(inputstring, 4, 8)
Extending Dennis's answer.
Assuming the field is stored as a chararray, and that each field is 4 bytes wide as stated:
A = LOAD 'data.txt' AS (f1:chararray);
B = FOREACH A GENERATE
-- SUBSTRING is zero-based with an exclusive end index
SUBSTRING(f1,0,4) AS A1,
SUBSTRING(f1,4,8) AS A2,
SUBSTRING(f1,8,12) AS A3,
SUBSTRING(f1,12,16) AS A4;
DUMP B;
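To get the comma-separated output the question asks for, one option (a sketch; the output path is an assumption) is to store the result with a comma delimiter:
STORE B INTO 'output' USING PigStorage(',');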

Reverse the group data as a different record using Pig

Split the group record into different records.
For example:
Input: (A,(3,2,3))
Output, in 3 new lines:
A,3
A,2
A,3
Can anyone let me know how to do this, please?
The problem is that when you convert the ArrayList output to a tuple, it becomes difficult to achieve what you want, so I recommend this approach, which makes it easy to get the output.
In your UDF code, instead of creating an ArrayList, append the output into a comma-separated string and return that to the Pig script.
Your final output from the UDF should be a string, i.e., "3,2,3".
Then use the code below to get the result:
C = FOREACH B GENERATE $0, NewRollingCount(BagToString($1)) AS rollingCnt;
D = FOREACH C GENERATE $0, FLATTEN(TOKENIZE(rollingCnt));
DUMP D;
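If the tuple arity is known, a pure-Pig alternative is to turn the tuple into a bag and flatten it. This is a minimal sketch assuming the record is loaded as (name, nodes) with exactly three fields; the file name and aliases are assumptions:
A = LOAD 'data' AS (name:chararray, nodes:tuple(a:int, b:int, c:int));
-- FLATTEN on a bag emits one output row per element
B = FOREACH A GENERATE name, FLATTEN(TOBAG(nodes.a, nodes.b, nodes.c));
DUMP B;
-- (A,3)
-- (A,2)
-- (A,3)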

Merging of two part files with header as only first line Hadoop

How can I merge two or more part files in Hadoop into a single file, such that the merged output contains all the data but only one header, on the first line of the merged output?
File 1
column1|column2|column3
20000|newyork|john
30000|sydney|joseph
File n
column1|column2|column3
60000|delhi|mike
30000|sydney|joseph
Merged output should be
column1|column2|column3
20000|newyork|john
30000|sydney|joseph
60000|delhi|mike
30000|sydney|joseph
Is there any easy way to do this using the hadoop fs -cat command, or by any other method?
Method 1:
Leaving the headers on is fairly complicated without creating an index or rank, since in Pig a collection of tuples is unsorted. Here's what a Pig job looks like, using rank and order by to place the header on top.
header_ranked.pig
HEADER = LOAD 'header.txt' USING PigStorage('|') AS (b0:int,b1:chararray,b2:chararray,b3:chararray);
H1 = LOAD 'header_test' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
F_H1 = FILTER H1 BY NOT (c1 MATCHES 'column1' AND c2 MATCHES 'column2' AND c3 MATCHES 'column3');
R_H1 = RANK F_H1 by c1 DESC DENSE;
U = UNION R_H1, HEADER;
O = ORDER U by rank_F_H1;
F = FOREACH O GENERATE c1,c2,c3;
dump F;
The two sample files, each containing 2 records and a header line, were placed in a directory called header_test. Additionally, in order for this program to work, I had to create a header file in the following format:
header.txt
0|column1|column2|column3
Walking through the code, the file containing the headers (slightly modified to include an additional column, which is the rank value of 0) is loaded into the HEADER alias.
Next the actual data is loaded into the H1 alias, as it grabs all files under the header_test directory.
F_H1 filters out all headers from the data. If you had 20 files that were loaded into H1 from the header_test directory, those 20 headers would now be filtered out of the data.
R_H1 creates a rank on the filtered data, in descending order and without skipping any numbers.
U effectively concatenates the ranked filtered data with the 0|column1|column2|column3 header line.
O orders the data by the rank, so that the header (which has a rank of 0) appears on top.
And finally, F gets rid of the ranking, leaving the clean tuples.
Results
(column1,column2,column3)
(60000,delhi,mike)
(30000,sydney,joseph)
(30000,sydney,joseph)
(20000,newyork,john)
Method 2:
Basically, leave the headers on one file, strip them from the rest, and then mash them together. I'm not sure it'll stay sorted, though; I haven't tested it thoroughly.
H1 = LOAD 'header_test/header1.txt' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
H2 = LOAD 'header_test/header2.txt' USING PigStorage('|') AS (d1:chararray,d2:chararray,d3:chararray);
F_H2 = FILTER H2 BY NOT (d1 MATCHES 'column1' AND d2 MATCHES 'column2' AND d3 MATCHES 'column3');
U = UNION H1, F_H2;
dump U;
Results
(column1,column2,column3)
(20000,newyork,john)
(30000,sydney,joseph)
(60000,delhi,mike)
(30000,sydney,joseph)

generate a different number of columns based on input number

Suppose I have some XML data that has an unknown number of sub-nodes. Is there a method that allows me to input the number of sub-nodes into the program as a parameter and have it process them? My current code is something like this:
SourceXML = LOAD '$input' using org.apache.pig.piggybank.storage.XMLLoader('$TopNode') as test:chararray;
test2 = LIMIT SourceXML 3;
test3 = FOREACH test2 GENERATE REGEX_EXTRACT(test,'<$tag1>(.*)</$tag1>',1),
REGEX_EXTRACT(test,'<$tag2>(.*)</$tag2>',1);
dump test3;
However, I may not know in advance how many simple elements there are in the target data (how many $tag# there are). I am hoping to use a .txt file containing parameters that looks something like this:
input=/inputpath/lowerlevelsofpath
numberSimpleElements=3
tag1=tag1name
tag2=tag2name
tag3=tag3name
with a REGEX_EXTRACT being done for each tag in the input file.
Any ideas on how to accomplish this?
You could do the following:
Split the text by some regex, so that each row has one value.
Generate a (tag, value) pair for each row.
Join the (tag, value) pairs with your list of tags (see the sketch after this list).
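Here is a minimal sketch of that approach, assuming each simple element sits on its own line; the file names, aliases, and regular expressions are assumptions, not from the original post:
tags = LOAD 'tags.txt' AS (tag:chararray); -- one tag name per line
raw = LOAD '$input' USING TextLoader() AS (line:chararray);
-- extract a (tag, value) pair from each <tag>value</tag> line
pairs = FOREACH raw GENERATE
REGEX_EXTRACT(line, '<(\\w+)>', 1) AS tag,
REGEX_EXTRACT(line, '>([^<]*)<', 1) AS value;
clean = FILTER pairs BY tag IS NOT NULL;
-- keep only the tags listed in the parameter file
J = JOIN clean BY tag, tags BY tag;
DUMP J;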

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently, Pig maps need the key to be a chararray (string) literal that you supply, not a variable containing a string. So in map#key, the key has to be a constant string that you supply (e.g., map#'keyvalue').
The typical use case is to load a complex data structure in which one of the elements is a key/value pair; later, in a FOREACH statement, you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
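For example, here is a minimal sketch of a constant-key lookup; the file name and schema are assumptions:
A = LOAD 'complex.txt' AS (id:int, props:map[]);
B = FOREACH A GENERATE id, props#'clientIp' AS clientIp; -- constant key
DUMP B;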
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and its even parameters to values. Unfortunately I haven't found it to be that useful, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
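As a minimal sketch, TOMAP applied to the 2-column CSV from the question would look like this (the alias names are assumptions):
m = LOAD 'map.csv' USING PigStorage(',') AS (key:chararray, val:chararray);
b = FOREACH m GENERATE TOMAP(key, val) AS M; -- one single-entry map per row
c = FOREACH b GENERATE M#'1'; -- lookups still require a constant key
DUMP c;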
Great question!
I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. On the other hand, maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a Pig map. The key needs to be constant; e.g., myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.
If you think about it, you do not need a map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g., data1 has 2 columns, A and B, and data2 has 3 columns, X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a map (from a traditional language) that maps a value of column Z to a value of column B (via column A). So data1 essentially represents a map A -> B, as the sketch below shows.
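A minimal sketch of JOIN used as a map lookup (the file names and schemas are assumptions):
data1 = LOAD 'data1.txt' AS (A:chararray, B:chararray); -- the "map": A -> B
data2 = LOAD 'data2.txt' AS (X:chararray, Y:chararray, Z:chararray);
J = JOIN data2 BY Z, data1 BY A; -- looks up B for each Z via A
looked_up = FOREACH J GENERATE X, Y, Z, B;
DUMP looked_up;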
So why do we need maps in Pig?
Usually Hadoop data are dumps of different data sources from traditional languages. If the original data sources contain maps, the HDFS data will contain a corresponding map.
How can one handle map data?
There are really 2 use cases:
Map keys are constants.
e.g., HttpRequest header data contains time, server, and clientIp as the keys in the map. To access the value of a particular key, one can access it with a constant key, e.g., header#'clientIp'.
Map keys are variables.
In these cases, you would most probably want to JOIN the map keys with some other data set. I usually convert the map to a bag using a UDF, MapToBag, which converts the map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other data sets.
I hope this helps.
1) If you want to load map data, it should look like "[programming#SQL,rdbms#Oracle]".
2) If you want to load tuple data, it should look like "(first_name_1234,middle_initial_1234,last_name_1234)".
3) If you want to load bag data, it should look like "{(project_4567_1),(project_4567_2),(project_4567_3)}".
My file pigtest.csv looks like this:
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
My schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in terms of relations, where the map is just one field of one record. Then you can apply some operations on the relations, like joining the two sets, data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple look up:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data, then convert and store it in key/value format to read back later:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
Then read it back as map data.
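Note that to re-load a stored field as a map, the text needs to be in Pig's map literal format ([key#value]). A minimal sketch under that assumption (the aliases, paths, and CONCAT formatting are mine):
A = LOAD 'somedata.csv' USING PigStorage(',') AS (key:chararray, val:chararray);
-- write each row as a map literal, e.g. [1#2]
B = FOREACH A GENERATE CONCAT('[', CONCAT(key, CONCAT('#', CONCAT(val, ']'))));
STORE B INTO 'folder';
M = LOAD 'folder' AS (m:map[]); -- PigStorage parses the map literal back into a map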
