Which delimiter do I need in PigStorage() to load the file produced by the following query?
INSERT OVERWRITE DIRECTORY 'doop'
select a.* from cdr.cell_tower_info
The output of the above query looks like this:
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
Haryana Ambala 404 20 80 30021 76.76746 30.373488 404-20-80-30021
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
I am working on CDR analysis. First I need to retrieve some fields from a table with a SELECT and save the result into an HDFS folder; that result then needs to be loaded with Pig for further analysis.
Try this:
CREATE TABLE cell_tower_info
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'doop'
AS
SELECT a.* FROM cdr.cell_tower_info a;
Hive's default field delimiter is Ctrl-A (\001).
I think PigStorage('\\u0001') should work for reading the Hive output data in Pig.
Alternatively, the Hive table can be defined with fields terminated by '\t', so that the result can be used directly in Pig.
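For example, a minimal Pig sketch for the Ctrl-A case (the directory 'doop' comes from the query above; the column names and types are assumptions based on the sample output):

```pig
-- Load the Ctrl-A delimited files written by INSERT OVERWRITE DIRECTORY 'doop'
A = LOAD 'doop' USING PigStorage('\\u0001')
    AS (state:chararray, district:chararray, mcc:int, mnc:int, lac:int,
        cell:int, lon:double, lat:double, tower_id:chararray);
DUMP A;
```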
Can you try this?
input.txt
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
Haryana Ambala 404 20 80 30021 76.76746 30.373488 404-20-80-30021
Haryana Ambala 404 20 80 37591 76.76746 30.373488 404-20-80-37591
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\s+')) AS (f1:chararray,f2:chararray,f3:int,f4:int,f5:int,f6:int,f7:double,f8:double,f9:chararray);
C = FOREACH B GENERATE f1,f2,f8,f9;
DUMP C;
Output:
(Haryana,Ambala,30.373488,404-20-80-37591)
(Haryana,Ambala,30.373488,404-20-80-30021)
(Haryana,Ambala,30.373488,404-20-80-37591)
If your data is stored in a Hive table, then:
X = LOAD 'hdfs://localhost:9000/user/hive/warehouse/input.txt' USING PigStorage('\t') AS (line:chararray);
Y = FOREACH X GENERATE FLATTEN(STRSPLIT(line,'\\s+')) AS (A1:chararray,A2:chararray,A3:int,A4:int,A5:int,A6:int,A7:double,A8:double,A9:chararray);
Z = FOREACH Y GENERATE A1,A2,A8,A9;
DUMP Z;
In my case the port is 9000; use the value for your system.
I am facing a strange problem with Pig's GENERATE: if I do not use the first field, the generated data seems to be wrong. Is this the expected behaviour?
a = load '/input/temp2.txt' using PigStorage(' ','-tagFile') as (fname:chararray,line:chararray) ;
grunt> b = foreach a generate $1;
grunt> dump b;
(temp2.txt)
(temp2.txt)
grunt> c = foreach a generate $0,$1;
grunt> dump c;
(temp2.txt,field1,field2)
(temp2.txt,field1,field22)
$ cat temp2.txt
field1,field2
field1,field22
pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
In the example I was expecting dump b to return the data file values instead of the file name.
In your example you use PigStorage(' ','-tagFile'), so each line is split by spaces.
Then:
$0 -> field1,field2
$1 -> nothing
Just use PigStorage(',','-tagFile') instead.
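With the comma delimiter the tag and the data fields line up; a minimal sketch (the three-field schema is an assumption based on the sample file above):

```pig
a = LOAD '/input/temp2.txt' USING PigStorage(',', '-tagFile')
    AS (fname:chararray, f1:chararray, f2:chararray);
b = FOREACH a GENERATE f1, f2;  -- the data fields, without the file-name tag
DUMP b;
```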
I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file, and so on; the last 25 lines should go to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the number of records in the file will be in the millions, and the previous steps that generate the file to be split already use Pig.
I did a bit of digging on this, because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really. In this example I was using the Country sample database offered for download on dev.mysql.com:
grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';
Then if we have a look at the directory in hdfs:
[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r-- 3 hive hdfs 0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r-- 3 hive hdfs 3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r-- 3 hive hdfs 4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r-- 3 hive hdfs 4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
Hope that helps.
You can use some of the below Pig features to achieve your desired result.
SPLIT function http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
MultiStorage class : https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Write custom PIG storage : https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions
You have to provide some condition based on your data.
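For example, SPLIT routes records into several relations based on conditions; a minimal sketch using RANK to cut a 100-line file into quarters (the file names and the line count are assumptions matching the question):

```pig
A = LOAD 'file' USING PigStorage() AS (line:chararray);
B = RANK A;  -- adds a 1-based rank_A column
SPLIT B INTO
    q1 IF rank_A <= 25,
    q2 IF rank_A > 25 AND rank_A <= 50,
    q3 IF rank_A > 50 AND rank_A <= 75,
    q4 IF rank_A > 75;
STORE q1 INTO 'file1';
STORE q2 INTO 'file2';
STORE q3 INTO 'file3';
STORE q4 INTO 'file4';
```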
This could do it, but there may be a better option.
A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
My requirement changed a bit: I have to store only the first 25% of the data into one file and the rest into another file. Here is the Pig script that worked for me.
ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1 <= $0/4;
rest_file = filter with_max by $1 > $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');
I want to load a CSV (just comma-separated) file into my HBase table. With the help of some googled articles I am now able to load an entire row (line) into HBase, i.e. all the values in a row get stored as a single column, but I want to split each row on the comma delimiter (,) and store the values in different columns of the HBase table's column family.
Please help me solve this issue. Any suggestions are appreciated.
Following are the input file, agent configuration file, and HBase output I am currently using.
1) input file
8600000US00601,00601,006015-DigitZCTA,0063-DigitZCTA,11102
8600000US00602,00602,006025-DigitZCTA,0063-DigitZCTA,12869
8600000US00603,00603,006035-DigitZCTA,0063-DigitZCTA,12423
8600000US00604,00604,006045-DigitZCTA,0063-DigitZCTA,33548
8600000US00606,00606,006065-DigitZCTA,0063-DigitZCTA,10603
2) agent configuration file
agent.sources = spool
agent.channels = fileChannel2
agent.sinks = sink2
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /home/cloudera/Desktop/flume
agent.sources.spool.fileSuffix = .completed
agent.sources.spool.channels = fileChannel2
#agent.sources.spool.deletePolicy = immediate
agent.sinks.sink2.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.sink2.channel = fileChannel2
agent.sinks.sink2.table = sample
agent.sinks.sink2.columnFamily = s1
agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent.sinks.sink1.serializer.regex = "\"([^\"]+)\""
agent.sinks.sink2.serializer.regexIgnoreCase = true
agent.sinks.sink1.serializer.colNames =col1,col2,col3,col4,col5
agent.sinks.sink2.batchSize = 100
agent.channels.fileChannel2.type=memory
3) HBase output
hbase(main):009:0> scan 'sample'
ROW COLUMN+CELL
1431064328720-0LalKGmSf3-1 column=s1:payload, timestamp=1431064335428, value=8600000US00602,00602,006025-DigitZCTA,0063-DigitZCTA,12869
1431064328720-0LalKGmSf3-2 column=s1:payload, timestamp=1431064335428, value=8600000US00603,00603,006035-DigitZCTA,0063-DigitZCTA,12423
1431064328720-0LalKGmSf3-3 column=s1:payload, timestamp=1431064335428, value=8600000US00604,00604,006045-DigitZCTA,0063-DigitZCTA,33548
1431064328721-0LalKGmSf3-4 column=s1:payload, timestamp=1431064335428, value=8600000US00606,00606,006065-DigitZCTA,0063-DigitZCTA,10603
4 row(s) in 0.0570 seconds
hbase(main):010:0>
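For reference, RegexHbaseEventSerializer splits each event body into HBase columns using regex capture groups, one column per group. A minimal sketch of the relevant sink properties for five comma-separated fields (the regex and column names here are assumptions; also note that all serializer properties must refer to the same sink name, while the configuration above mixes sink1 and sink2):

```properties
agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# one capture group per CSV field -> one column per field in family s1
agent.sinks.sink2.serializer.regex = ([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)
agent.sinks.sink2.serializer.colNames = col1,col2,col3,col4,col5
```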
My source is a log file having "þ" as the delimiter. I am trying to read this file in Pig. Please look at the options I tried.
Option 1 :
Using PigStorage("þ") - This doesn't work out, as it can't handle Unicode characters.
Option 2 :
I tried reading the lines as strings and splitting each line on "þ". This also doesn't work out, as STRSPLIT leaves out the last field because it has "\n" at the end.
I can see multiple similar questions on the web, but I am unable to find a solution.
Kindly direct me on this.
Thorn Details :
http://www.fileformat.info/info/unicode/char/fe/index.htm
Is this the solution you are expecting?
input.txt:
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
helloþworldþhelloþworld
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)'));
dump B;
Output:
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
(hello,world,hello,world)
Added 2nd option with different datatypes:
input.txt
helloþ1234þ1970-01-01T00:00:00.000+00:00þworld
helloþ4567þ1990-01-01T00:00:00.000+00:00þworld
helloþ8901þ2001-01-01T00:00:00.000+00:00þworld
helloþ9876þ2014-01-01T00:00:00.000+00:00þworld
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)þ(.*)þ(.*)þ(.*)')) as (f1:chararray,f2:long,f3:datetime,f4:chararray);
DUMP B;
DESCRIBE B;
Output:
(hello,1234,1970-01-01T00:00:00.000+00:00,world)
(hello,4567,1990-01-01T00:00:00.000+00:00,world)
(hello,8901,2001-01-01T00:00:00.000+00:00,world)
(hello,9876,2014-01-01T00:00:00.000+00:00,world)
B: {f1: chararray,f2: long,f3: datetime,f4: chararray}
Another thorn symbol Ã¾:
input.txt
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
1077Ã¾04-01-2014þ04-30-2014þ0þ0.0þ0
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)Ã¾(.*)þ(.*)þ(.*)þ(.*)þ(.*)')) as (f1:long,f2:datetime,f3:datetime,f4:int,f5:double,f6:int);
DUMP B;
describe B;
Output:
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
(1077,04-01-2014,04-30-2014,0,0.0,0)
B: {f1: long,f2: datetime,f3: datetime,f4: int,f5: double,f6: int}
This should work (replace the Unicode code point with the one that works for you; this one is for capital thorn):
A = LOAD 'input' USING TextLoader() AS (f1:chararray);
B = FOREACH A GENERATE STRSPLIT(f1, '\\u00DE', -1);
I don't see why the last field should be left out.
Somehow, this does not work:
A = LOAD 'input' USING PigStorage('\u00DE');
I have an input file like this:
481295b2-30c7-4191-8c14-4e513c7e7577,1362974399,56973118825,56950298471,true
67912962-dd84-46fa-84ef-a2fba12c2423,1362974399,56950556676,56982431507,false
cc68e779-4798-405b-8596-c34dfb9b66da,1362974399,56999223677,56998032823,true
37a1cc9b-8846-4cba-91dd-19e85edbab00,1362974399,56954667454,56981867544,false
4116c384-3693-4909-a8cc-19090d418aa5,1362974399,56986027804,56978169216,true
I only need the lines where the last field is "true". So I use the following Pig Latin:
records = LOAD 'test/test.csv' USING PigStorage(',');
A = FILTER records BY $4 'true';
DUMP A;
The problem is with the second command; I always get this error:
2013-08-07 16:48:11,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 25> mismatched input ''true'' expecting SEMI_COLON
Why? I also tried "$4 == 'true'" but it still doesn't work. Could anyone tell me how to do this simple thing?
How about:
A = FILTER records BY $4 == 'true';
Also, if you know how many fields the data will have beforehand, you should give it a schema. Something like:
records = LOAD 'test/test.csv' USING PigStorage(',')
AS (val1: chararray, val2: int, val3: long, val4: long, bool: chararray);
Or whatever names/types fit your needs.
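Putting the schema and the filter together, a minimal sketch (the column names are assumptions; note the 11-digit numbers need long, not int):

```pig
records = LOAD 'test/test.csv' USING PigStorage(',')
    AS (id:chararray, ts:int, from_num:long, to_num:long, flag:chararray);
A = FILTER records BY flag == 'true';
DUMP A;
```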