I am new to Hadoop and I am working on a program where the input to the map function is a file whose records look like this:
ID: value:
3 sd
37 g
5675 gk
68 oi
My file is about 10 gigabytes, and I want to renumber these IDs in descending order. I don't want to change the values.
My output must be like this:
ID: value:
5675 sd
68 g
37 gk
3 oi
I want to do this on a cluster of nodes. How can I do that?
I think I need a global variable for the renumbering, and I can't have one across a cluster. What can I do?
You can do one map/reduce to sort the IDs; then you'd have a file with the IDs in descending order.
You can then write a second map/reduce that joins that file with the unsorted file, where the mappers emit a sequence number (which can be calculated from the split size, to facilitate multiple mappers). The mappers that go over the first file will emit "1 sd", "2 g", etc., and the mappers that process the IDs file will emit "1 5675", "2 68". The reducer will then join the files.
Here's an (untested) Pig 0.11 script that does something along these lines:
A = LOAD 'data' AS (id:long, value:chararray);
ID_RAW = FOREACH A GENERATE id;
DATA_RAW = FOREACH A GENERATE value;
ID_SORT = RANK ID_RAW BY id DESC DENSE;
DATA_SORT = RANK DATA_RAW DENSE;
ID_DATA = JOIN ID_SORT BY $0, DATA_SORT BY $0;
RESULT = FOREACH ID_DATA GENERATE ID_SORT::id, DATA_SORT::value;
STORE RESULT INTO 'output';
Before I say this, I like Arnon's answer for using Hadoop.
But since this is a small file (10 GB is not that big) and you only need to run it once, I would personally just write a small script.
Assuming a tab-delimited file:
sort -rn myfile.txt > myfile.sorted.txt
paste myfile.sorted.txt myfile.txt | cut -f1,4 > newFile.txt
That might take a long time, certainly longer than using Hadoop, but it is simple and it works.
Related
How do I join three large datasets in a single MapReduce program? All three files cannot fit in memory. File1 has key K1, File2 has key K2, and File3 has both K1 and K2. I want to join File1 and File2 by referencing File3. Please let me know if there is a technique to do this. Thanks in advance!
You cannot do it with a single MapReduce job, since you have to create two separate custom writable classes: one for input 1 and input 2 (the key has to be field K1 or K2 plus a join identifier) and a separate one for input 3 (the key has to be fields K1 and K2 plus a join identifier). So it requires two MapReduce jobs.
MR 1 (join input 1 and input 3 based on key K1):
Mapper 1 map output: (K,V) => ((K1, input1), value)
Mapper 2 map output: (K,V) => ((K1, input3), value)
Append/join the datasets in the reducers.
MR 2 (join input 2 and the output of MR 1 based on key K2):
Mapper 1 map output: (K,V) => ((K2, input2), value)
Mapper 2 map output: (K,V) => ((K2, outputOfMR1), value)
Append/join the datasets in the reducer.
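If Pig is an option, the same two-step join is only a few lines. Here is an untested sketch; the file and field names are made up:
F1 = LOAD 'file1' AS (k1:chararray, v1:chararray);
F2 = LOAD 'file2' AS (k2:chararray, v2:chararray);
F3 = LOAD 'file3' AS (k1:chararray, k2:chararray);
-- step 1: join File1 to File3 on K1 (the first MR job)
S1 = JOIN F1 BY k1, F3 BY k1;
-- step 2: join that result to File2 on K2 (the second MR job)
S2 = JOIN S1 BY F3::k2, F2 BY k2;
STORE S2 INTO 'joined';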
Split a grouped record into different records. For example:
Input: (A,(3,2,3))
Output, in 3 new lines:
A,3
A,2
A,3
Can anyone let me know how to do this, please?
The problem is that when you convert the ArrayList output to a tuple, it becomes difficult to achieve what you want, so I recommend this approach; it makes it easy to get the output.
In your UDF code, instead of creating an ArrayList, append the output into a comma-separated string and return that to the Pig script.
Your final output from the UDF should be a string, i.e. "3,2,3".
Then use the code below to get the result:
C = FOREACH B GENERATE $0, NewRollingCount(BagToString($1)) AS rollingCnt;
D = FOREACH C GENERATE $0, FLATTEN(TOKENIZE(rollingCnt));
DUMP D;
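If the tuple arity is fixed, a UDF-free alternative is TOBAG plus FLATTEN. Here is an untested sketch; the relation and field names are made up:
X = LOAD 'grouped' AS (name:chararray, t:tuple(a:int, b:int, c:int));
-- TOBAG wraps each field in its own tuple; FLATTEN then emits one row per element
Y = FOREACH X GENERATE name, FLATTEN(TOBAG(t.a, t.b, t.c));
DUMP Y;
-- (A,3)
-- (A,2)
-- (A,3)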
I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but the actual data will be sorted by time.
I would like to generate a set of records to feed a graph-plotting function that needs a continuous time series. I would like to fill in the missing values; i.e., if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate an entry for "2013-07-09T19:18Z" with a predefined value.
My thoughts on doing this:
Use MIN and MAX to find the start and end dates in the series.
Write a UDF which takes min and max and returns a relation with the missing timestamps.
Join the above two relations.
I cannot get my head around how to implement this in Pig, though. I would appreciate any help.
Thanks!
Generate another file, using a script outside Pig, with all timestamps between MIN and MAX, including MIN and MAX. Load this as a second dataset. Here is a sample that I created from your dataset. Note that I filled in only a few gaps, not all of them.
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
Do a COGROUP on the original dataset and the generated dataset above. Use a nested FOREACH ... GENERATE to write the output dataset: if the first dataset is empty, use the values from the second set to generate the output record, else use the first dataset. Here is the piece of code I used on these two datasets:
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
x = COUNT(Org_Set);
y = (x == 0 ? (Default_set.fl1, Default_set.fl2) : (Org_Set.fl1, Org_Set.fl2));
GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
If you need further clarification or help, let me know.
In addition to @Rags' answer, you could use the STREAM x THROUGH command and a simple awk script (similar to this one) to generate the date range once you have the min and max dates. Something similar to the following (untested! You might need to put the awk script on a single line with semicolons as command delimiters, or better, ship it as a script file):
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
split($1,s,"/")
split($2,e,"/")
st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
for (i = st; i <= et; i += 60*60*24) print strftime("%Y/%m/%d", i)
}'`;
I have a file called data that looks like this (note there is a tab after 'personA'):
personA (1, 2, 3)
personB (2, 1, 34)
And I have an Apache pig script like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;
The output of which makes sense:
(1)
(2)
However if I change the schema of the script to be like this:
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Then the output I get is this:
(1, 2, 3)
(2, 1, 34)
It looks like the first (and only) element in this tuple is a bytearray; i.e., Pig is not parsing the input text 1, 2, 3 into a tuple.
In the future my input will have an unknown and variable number of elements in the nodes item, so I can't just write out a:int, ….
Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?
Pig does not accept what you are passing in as valid. The default loading scheme, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct with the parentheses and commas you have in the text. Your options are:
Reformat your file to be tab delimited: personA 1 2 3
Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want (see the sketch after this list).
Write your own custom loader.
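For instance, the second option might look something like this untested sketch, where ParseNodes is a hypothetical UDF you would have to write yourself:
raw = LOAD 'data' USING TextLoader() AS (line:chararray);
-- ParseNodes (hypothetical) would split the line on the tab and turn
-- the parenthesized list into a tuple (or bag) of values
parsed = FOREACH raw GENERATE FLATTEN(ParseNodes(line));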
This is no longer a limitation. Pig now parses tuples in the input file, treating the comma as the field separator. I tried this in Apache Pig version 0.15.0.
A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;
Output I get is:
(1)
(2)
Here is another way of tackling this issue, although I know the answers above are more efficient.
data = LOAD 'data' USING PigStorage() AS (name:chararray, field2:chararray);
data = FOREACH data GENERATE name, REPLACE(REPLACE(field2, '\\(',''),'\\)','') AS field2;
data = FOREACH data GENERATE name, STRSPLIT(field2, '\\,') AS fieldTuple;
data = FOREACH data GENERATE name, fieldTuple.$0,fieldTuple.$1, fieldTuple.$2 ;
Load field2 as chararray
Remove parentheses
Split field2 by comma (it gives you a tuple with 3 fields in it)
Get values by index
I know it is hacky; I just wanted to provide another way of doing this.
I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently Pig maps need the key to be a constant chararray (string) that you supply, not a variable which contains a string. So in map#key, the key has to be a constant string that you supply (e.g. map#'keyvalue').
The typical use case is to load a complex data structure in which one of the elements is a key/value pair; later, in a FOREACH statement, you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
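For example (sketch; the file and field names here are made up):
requests = LOAD 'headers' AS (h:map[]);
-- constant key: allowed
ips = FOREACH requests GENERATE h#'clientIp';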
In Pig version 0.10.0 there is a new function available called TOMAP (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and its even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
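For what it's worth, a minimal TOMAP example might look like this (untested sketch; the input names are made up):
A = LOAD 'pairs' AS (k:chararray, v:int);
-- builds a one-entry map per record; lookups still require a constant key
B = FOREACH A GENERATE TOMAP(k, v) AS m;
C = FOREACH B GENERATE m#'somekey';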
Great question!
I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. On the other hand, maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a Pig map. The key needs to be a constant, e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.
If you think about it, you do not need a map in Pig. One usually loads data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g. data1 has 2 columns, A and B, and data2 has 3 columns, X, Y, and Z. If you JOIN data1 BY A with data2 BY Z, JOIN does the work of a map (from a traditional language), mapping a value of column Z to a value of column B (via column A). So data1 essentially represents a map A -> B.
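In Pig, that lookup is just a JOIN (sketch, using the column names from the example above):
data1 = LOAD 'data1' AS (A:chararray, B:chararray);
data2 = LOAD 'data2' AS (X:chararray, Y:chararray, Z:chararray);
-- JOIN plays the role of map.get(): each Z value "looks up" its B via A
joined = JOIN data2 BY Z, data1 BY A;
result = FOREACH joined GENERATE data2::X, data2::Y, data1::B;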
So why do we need maps in Pig?
Usually Hadoop data are dumps of different data sources from traditional languages. If the original data sources contain maps, the HDFS data will contain a corresponding map.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants, e.g. HttpRequest header data contains time, server, and clientIp as the keys in a map. To access the value of a particular key, one accesses it with a constant key, e.g. header#'clientIp'.
Map keys are variables. In these cases, you would most probably want to JOIN the map keys with some other dataset. I usually convert the map to a bag using the UDF MapToBag, which converts map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other datasets.
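A sketch of that pattern (note MapToBag is the answerer's own UDF, not a Pig builtin; assume it returns bag{(key:chararray, value:chararray)}):
logs = LOAD 'logs' AS (h:map[]);
-- explode each map into (key, value) rows, then join on the key
kv = FOREACH logs GENERATE FLATTEN(MapToBag(h)) AS (key:chararray, value:chararray);
other = LOAD 'other' AS (k:chararray, payload:chararray);
joined = JOIN kv BY key, other BY k;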
I hope this helps.
1) If you want to load map data, it should look like "[programming#SQL,rdbms#Oracle]".
2) If you want to load tuple data, it should look like "(first_name_1234,middle_initial_1234,last_name_1234)".
3) If you want to load bag data, it should look like "{(project_4567_1),(project_4567_2),(project_4567_3)}".
My file pigtest.csv looks like this:
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
My script:
a = LOAD 'pigtest.csv' USING PigStorage('|')
    AS (employee_id:int,
        email:chararray,
        name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray),
        project_list:bag{project: tuple(project_name:chararray)},
        skills:map[chararray]);
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming';
dump b;
I think you need to think in terms of relations, where the map is just one field of one record. Then you can apply operations on the relations, like joining the two sets, data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple lookup:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data, then convert and store it in key/value format to read back for later use:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
Then read it back as map data.
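One caveat: PigStorage only parses a field back into a map when the field is wrapped in brackets ([key#value]), so a round trip might look more like this (untested sketch; file names are made up):
data = LOAD 'somedata.csv' USING PigStorage(',') AS (key:chararray, value:chararray);
-- wrap each pair as [key#value] so it can be re-read as a map field
wrapped = FOREACH data GENERATE CONCAT('[', CONCAT(key, CONCAT('#', CONCAT(value, ']'))));
STORE wrapped INTO 'folder';
mapped = LOAD 'folder' AS (m:map[]);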