Concatenating files with Timestamp in Pig

How can I concatenate a timestamp with the output generated by Pig? I need to save the output generated by Pig to an additional folder with a timestamp so that it can be used as historical data later. I tried to use CurrentTime(), but it gave me an error like this:
2015-03-31 19:29:58,249 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 1> Cannot expand macro 'CurrentTime'. Reason: Macro must be defined before expansion.
How do I define this macro?
Here is the code :
A = load '/user/root/b2.out';
X = FILTER A BY ($2 == 'Error') OR ($2=='Info') OR ($2=='Warning') OR ($2=='Critical');
D = FOREACH X GENERATE $0,$2,$4,$6,$8;
store D into CONCAT('/user/root/ELABD/finalout',CurrentTime());

CONCAT can only be used inside a relation (i.e. in a FOREACH statement), so you cannot use it to construct an output file location.
Two possible solutions here, I think:
Use a %declare statement in your Pig script that runs a shell command such as date to get the current time, and use the result as a parameter, e.g.
%declare DATETIME `date +%Y-%m-%dT%H-%M-%S`
...
store D into '/user/root/ELABD/finalout/$DATETIME';
Alternatively, use something like Oozie to schedule your Pig jobs and have Oozie generate the output location based on the date/time.
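A wrapper can also compute the timestamp outside Pig and hand it over with -param, so that the script reads it as $DATETIME. The following is a minimal Python sketch of that idea. It assumes the script stores into '/user/root/ELABD/finalout/$DATETIME' and has no %declare line of its own; the function name build_pig_command and the file name script.pig are illustrative, not part of the original question.

```python
from datetime import datetime


def build_pig_command(script, now=None):
    """Return the argv list for running `script` with a DATETIME parameter."""
    # Same layout as `date +%Y-%m-%dT%H-%M-%S`; it avoids colons in the path.
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    return ["pig", "-param", f"DATETIME={stamp}", "-f", script]


if __name__ == "__main__":
    # Only prints the command; hand the list to subprocess.run(cmd, check=True)
    # on a machine where the pig client is installed to actually run it.
    print(" ".join(build_pig_command("script.pig")))
```

Passing the list to subprocess.run without a shell means no quoting of the parameter value is needed.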

Related

How to use current timestamp as filename for Hive output

I'm using this code to write the results of a Hive query to the specified file:
INSERT OVERWRITE DIRECTORY '/user/test.user/test.csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '"' STORED AS TEXTFILE
SELECT
...
However, I don't want the filename to be test.csv but the Unix timestamp, e.g. 1517213651.csv or something like that.
I understand I can't use the concat function to manipulate the filename, but that is as far as I got.
How do I get the timestamp of the moment of query execution to be the filename of my output?
EDIT: We're using Cloudera.

Another option is to put the Hive insert inside a shell script. Define a date variable in the script and then use that variable to define the output file.
TIMESTAMP_VAR=$(date +"%Y-%m-%d-%H-%M-%S")
FILENAME_VAR=/user/test/${TIMESTAMP_VAR}.csv
You can manipulate the timestamp layout in numerous ways.
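The same approach can be sketched in Python by building the query text with the timestamp already substituted and passing it to hive -e. This is only an illustration under assumptions: build_hive_command is a hypothetical helper, the SELECT statement is a placeholder for the real query, and the base directory is taken from the question. Note that INSERT OVERWRITE DIRECTORY names a directory, so the ".csv" path becomes a directory containing the result files.

```python
import time


def build_hive_command(select_sql, base_dir="/user/test.user", now=None):
    """Return the argv list for a `hive -e` call that writes to a timestamped path."""
    # Unix epoch seconds (for example 1517213651), as asked for in the question.
    stamp = int(time.time() if now is None else now)
    target = f"{base_dir}/{stamp}.csv"
    query = (
        f"INSERT OVERWRITE DIRECTORY '{target}' "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' ESCAPED BY '\"' "
        "STORED AS TEXTFILE "
        f"{select_sql}"
    )
    return ["hive", "-e", query]


if __name__ == "__main__":
    # Placeholder query; replace it with the real SELECT statement.
    print(build_hive_command("SELECT * FROM some_table")[2])
```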

You have to add TalendDate.getDate("CCYYMMDD") to the file path:
"/File1/Output_File_" + TalendDate.getDate("CCYYMMDD") + ".csv"

Multi-line JSON read using Apache PIG

I have a JSON file and want to read using Apache Pig.
I tried using the regular JsonLoader, but it looks like JsonLoader works only with single-line JSON. Then I tried Elephant-Bird, but I still cannot get correct results. Can anyone suggest a solution?
Input :
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
Note: I don't want to convert the input into a single line.
Script:
A = LOAD 'input' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'employees');
Dump B;
Expected result should be :
([firstName#John,lastName#Doe])
([firstName#Anna,lastName#Smith])
([firstName#Peter,lastName#Jones])

As mentioned in the comments by siva, the answer is basically that you do need to change your input to a single line.
JsonLoader and the elephant-bird loader only work with single-line
JSON; they will not work with multi-line input. You need to convert your
input to a single line before passing it to Pig. One workaround is to write
a shell script that uses the sed command to collapse the multi-line input
into a single line and then calls the Pig script.
This link will help you call Pig from a shell script.
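As an alternative to sed, the conversion step can be sketched in Python: parse the document and write it back on one line. This is a minimal sketch, assuming the whole file holds a single JSON document that fits in memory; the function name to_single_line is illustrative.

```python
import json


def to_single_line(json_text):
    """Parse a (possibly multi-line) JSON document and re-serialize it on one line."""
    # Compact separators drop the optional whitespace; key order is preserved.
    return json.dumps(json.loads(json_text), separators=(",", ":"))


if __name__ == "__main__":
    multi_line = """{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"}
    ]}"""
    print(to_single_line(multi_line))
```

Parsing instead of deleting newlines with a text tool also validates the input, so malformed JSON fails before the Pig job starts.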

How does Pig use Hadoop Globs in a 'load' statement?

As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities).
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example: Assume I have the following files in S3:
mybucket/a/b/ (0 bytes)
mybucket/a/b/myfile.log (>0 bytes)
mybucket/a/b/yourfile.log (>0 bytes)
If I use a LOAD statement like this in my pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... )
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?

This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our Pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
For example, in the example above, we list "mybucket/a":
hadoop fs -lsr s3://mybucket/a
Which returns a list of files, plus other metadata. We can then create the glob from that data:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... )
This requires a bit more front-end work, but allows us to specifically target the files we're interested in and avoid 0-byte files.
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an exception "Unable to create input slice".
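The glob-building step can be sketched in Python as follows. It is a minimal sketch, assuming the listing has already been parsed into (path, size in bytes) pairs; build_glob and the sizes in the sample listing are made up for illustration. It does not address the length limit mentioned in the update.

```python
def build_glob(prefix, entries):
    """Build a brace glob under `prefix` from (path, size_in_bytes) pairs.

    Entries of size 0 and entries outside the prefix are skipped.
    """
    suffixes = [
        path[len(prefix):]
        for path, size in entries
        if size > 0 and path.startswith(prefix)
    ]
    if not suffixes:
        raise ValueError("no non-empty files under " + prefix)
    return prefix + "{" + ",".join(suffixes) + "}"


if __name__ == "__main__":
    # Hypothetical listing; the sizes are invented for the example.
    listing = [
        ("s3://mybucket/a/b/", 0),
        ("s3://mybucket/a/b/myfile.log", 1024),
        ("s3://mybucket/a/b/yourfile.log", 2048),
    ]
    print(build_glob("s3://mybucket/a/b", listing))
```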

Hadoop Pig: Passing Command Line Arguments

Is there a way to do this? E.g., pass the name of the file to be processed, etc.?

This showed up in another question, but you can indicate the input parameter on the command line and use that when you are loading, for example:
Command Line:
pig -f script.pig -param input=somefile.txt
script.pig:
raw = LOAD '$input' AS (...);
Note that if you are using the Amazon Web Services Elastic Map Reduce then the '$input' is what is passed to the script for any input you provide.

You can use ...
1. -param (-p) if there are few parameters
2. -param_file (-m) if there are a lot of parameters
You can use either approach depending on the nature of your command-line arguments. I use -param when I am developing and testing my scripts. Once a Pig script is ready for batch processing or for running through crontab, I use -param_file so that, if any change is required, I can easily update the params.init file.
man pig will show you all available options.
-m, -param_file path to the parameter file
-p, -param key value pair of the form param=val
Here is sample code ...
students.txt (input data)
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
params.init (file to hold all parameters)
fileName='hdfs://horton/user/jgosalia/students.txt'
cityName='Chennai'
filter.pig
students = LOAD '$fileName' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
students = FILTER students BY city == '$cityName';
DUMP students;
OPT 1: Using params on command line (-param or -p) & Output
pig -param fileName='hdfs://horton/user/jgosalia/students.txt' -param cityName='Chennai' filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
OPT 2: Using params file on command line (-param_file or -m) & Output
pig -param_file params.init filter.pig
... Trimming the logs ...
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
NOTE: use absolute path for file paths (both as parameters and when giving param file path to -param_file (-m)).

It is simple to pass parameters to a Pig script.
First, mark your variables in Pig using '$', for example $input_file. Then pass the parameters to your script using pig -param input_file='/path/to/data'
For more information, look here: http://wiki.apache.org/pig/ParameterSubstitution

Yes.
You can pass parameters as command-line options using Pig's -param option.
--customparam.pig
--load hdfs/local fs data
original = load '$input' using PigStorage('$delimiter');
--filter a specific field value into another bag
filtered = foreach original generate $split;
--storing data into hdfs/local fs
store filtered into '$output';
pig -x local -f customparam.pig -param input=Pig.csv -param output=OUT/pig -param delimiter="," -param split='$1'
For more info: check this

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario-
Pig version used 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed above, one of the directory names is a date stamp.
Problem: I want to load files from a date range say from 20100810 to 20100813.
I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);
The following works with hadoop:
hadoop fs -ls /user/training/test/{20100810..20100813}
But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?
Error log follows:
Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
cheers

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):
shell:
pig -f script.pig -param input=/user/training/test/{20100810..20100812}
script.pig:
temp = LOAD '$input' USING SomeLoader() AS (...);

Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
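Generating the list from a date range could look like the following Python sketch. The function name date_glob is illustrative, and the yyyymmdd layout is taken from the directory names in the question.

```python
from datetime import datetime, timedelta


def date_glob(start, end, fmt="%Y%m%d"):
    """Return '{d1,d2,...,dN}' covering every day from start to end inclusive."""
    first = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    if last < first:
        raise ValueError("end date is before start date")
    days = (last - first).days
    stamps = [(first + timedelta(days=i)).strftime(fmt) for i in range(days + 1)]
    return "{" + ",".join(stamps) + "}"


if __name__ == "__main__":
    # The resulting path can be passed to the script as a quoted -param value.
    print("/user/training/test/" + date_glob("20100810", "20100813"))
```

Because the list uses commas rather than '..', it stays within the glob syntax that Hadoop understands.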

I ran across this answer when I was having trouble creating a file glob in a script and then passing it as a parameter into a Pig script.
None of the current answers applied to my situation, but I did find a general answer that might be helpful here.
In my case, the shell expanded the glob and then passed the result into the script, which understandably caused problems for the Pig parser.
Surrounding the glob in double quotes protects it from being expanded by the shell and passes it as-is into the command.
WON'T WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6
WILL WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6
I hope this saves someone some pain and agony.

This works:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader()
but this does not:
temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader()
If you want a date range that spans, say, 300 days, passing the full list to LOAD is inelegant, to say the least. I came up with the following, and it works.
Say you want to load data from 2012-10-08 to today, 2013-02-14. You can do
temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader()
and then filter afterwards:
filtered = FILTER temp BY (the_date>='2012-10-08')

temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100819.
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100812.
If the variable part is in the middle of the file path, concatenate the subfolder name or use '*' to match all files.

I found that this problem is caused by the Linux shell. The shell expands
{20100810..20100812}
to
20100810 20100811 20100812,
so the command you actually run is
bin/hadoop fs -ls 20100810 20100811 20100812
The HDFS API, however, does not expand the expression for you.

Thanks to dave campbell.
Some of the answers above are wrong, even though they got some votes.
The following are my test results:
Works
pig -f test.pig -param input="/test_{20120713,20120714}.txt"
Cannot have space before or after "," in the expression
pig -f test.pig -param input="/test_201207*.txt"
pig -f test.pig -param input="/test_2012071?.txt"
pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
Cannot have space before or after "," in the expression
Doesn't Work
pig -f test.pig -param input="/test_{20120713..20120714}.txt"
pig -f test.pig -param input=/test_{20120713,20120714}.txt
pig -f test.pig -param input=/test_{20120713..20120714}.txt

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
You probably don't: this can be done using a custom Load UDF, or you can try rethinking your directory structure (this will work well if your ranges are mostly static).
Additionally, Pig accepts parameters, which may help (perhaps you could write a function that loads the data for one day and unions it into the resulting set, but I don't know whether that is possible).
Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution. You then just have to pass the list to Pig, and this should work fine.

To add to Romain's answer: if you just want to parameterize the date, the shell command looks like this:
pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig
pig:
temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);
Please note the quotes.

Pig supports HDFS glob status,
so I think Pig can handle the pattern
/user/training/test/{20100810,20100811,20100812}.
Could you paste the error logs?

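Here's a script I use to generate a list of dates and then pass that list to the Pig script as a parameter. It's a bit tricky, but it works for me.
For example:
DT=20180101   # first date in the range
DAYS=6        # example value: number of days to add after DT
DT_LIST=''
for ((i=0; i<=$DAYS; i++))
do
d=$(date +%Y%m%d -d "${DT} +$i days");
DT_LIST=${DT_LIST}$d','
done
# drop the trailing comma
size=${#DT_LIST}
DT_LIST=${DT_LIST:0:size-1}
# double quotes let the shell expand DT_LIST; the braces turn the list into a Hadoop glob
pig -p input_data="xxx/yyy/{${DT_LIST}}" script.pig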
