How does Pig use Hadoop Globs in a 'load' statement? - hadoop

As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities).
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example: Assume I have the following files in S3:
mybucket/a/b/ (0 bytes)
mybucket/a/b/myfile.log (>0 bytes)
mybucket/a/b/yourfile.log (>0 bytes)
If I use a LOAD statement like this in my pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... )
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?

This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
For example, in the example above, we list "mybucket/a":
hadoop fs -lsr s3://mybucket/a
Which returns a list of files, plus other metadata. We can then create the glob from that data:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... )
This requires a bit more front-end work, but allows us to specifically target the files we're interested in and avoid 0-byte files.
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an exception "Unable to create input slice".
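The glob-building step described above could be sketched in Python as follows. This is only an illustration: the function name build_brace_glob and the file sizes are hypothetical, and it assumes the listing has already been parsed into (path, size) pairs.

```python
def build_brace_glob(prefix, entries):
    """Build a brace glob under `prefix` from (path, size) pairs.

    Entries with size 0 (such as the directory marker) are skipped, so the
    resulting pattern names only non-empty files.
    """
    names = [
        path[len(prefix):]
        for path, size in entries
        if size > 0 and path.startswith(prefix)
    ]
    if not names:
        raise ValueError("no non-empty files found under " + prefix)
    return prefix + "{" + ",".join(names) + "}"


# Hypothetical listing: the sizes are made up for illustration.
listing = [
    ("s3://mybucket/a/b/", 0),
    ("s3://mybucket/a/b/myfile.log", 120),
    ("s3://mybucket/a/b/yourfile.log", 98),
]
print(build_brace_glob("s3://mybucket/a/b", listing))
```

As the update above notes, a very long pattern can still fail inside Pig, so a sketch like this only helps while the list of files stays short.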

How to implement file-substitution macro in bash?

I have a set of text files and a set of GoLang files. The GoLang files contain directives such as the following:
//go:embed hello.txt
var s string
I want to write a bash script which takes the above code and substitutes the following in its place:
var s string = "<contents of hello.txt>"
Specifically, I want the bash script to go through all GoLang source files and replace all go:embed/string declaration pairs with a string defined to be the contents of the file specified in the embed directive.
I'm wondering if there is an existing program which can be configured to do the above. Otherwise, I'm planning on writing the algorithm myself.
Further explanation:
I am trying to replicate GoLang's embed directive (https://tip.golang.org/pkg/embed/).
We are not yet on GoLang 1.16, so we cannot use this functionality, but we are replicating it as closely as possible so that moving over to the standard implementation is as painless as possible.
Below is an attempt at solving your problem:
for i in file1 file2; do
awk '/^\/\/go:embed /{f=$2;next}/^var/&&f{printf"%s = \"",$0;system("cat "f);print"\"";f=0;next}1' < "$i" > "$i.new"
done
The awk script prints all normal lines. When it encounters an embed directive, that line is skipped and the file name is remembered in the variable f. A subsequent line starting with var is then extended with the contents of the remembered file (using the system call "cat").
Beware, there are no error checks at all, no attempt to fix quotes and whatever. So for practical use - unless the file contents you are about to embed are known to be good-natured - you probably have to take a more sophisticated approach.
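One possible "more sophisticated approach" is sketched below in Python. It is not the answerer's method. It assumes the directive and the declaration look exactly like the example in the question, and it uses json.dumps to escape quotes, backslashes and newlines, on the assumption that the result is an acceptable Go string literal for ordinary text. The names inline_embeds and read_file are hypothetical.

```python
import json
import re

EMBED_RE = re.compile(r"^//go:embed\s+(\S+)\s*$")
VAR_RE = re.compile(r"^var\s+\w+\s+string\s*$")


def inline_embeds(go_source, read_file):
    """Replace each go:embed/var pair with an initialised string variable.

    `read_file` is a callable that returns the text of the named file.
    """
    out = []
    pending_name = None   # file named by the last unconsumed directive
    pending_line = None   # the directive line itself, in case we must keep it
    for line in go_source.splitlines():
        match = EMBED_RE.match(line)
        if match:
            if pending_line is not None:
                out.append(pending_line)
            pending_name, pending_line = match.group(1), line
            continue
        if pending_name is not None and VAR_RE.match(line):
            # Escape the contents so quotes and newlines stay inside the literal.
            literal = json.dumps(read_file(pending_name), ensure_ascii=False)
            out.append(line.rstrip() + " = " + literal)
            pending_name = pending_line = None
            continue
        if pending_line is not None:
            # Directive not followed by a string variable: keep it untouched.
            out.append(pending_line)
            pending_name = pending_line = None
        out.append(line)
    if pending_line is not None:
        out.append(pending_line)
    return "\n".join(out) + "\n"


source = "package main\n//go:embed hello.txt\nvar s string\n"
print(inline_embeds(source, lambda name: 'hi "there"\n'))
```

In a real run, read_file would open the file relative to the Go source file's directory, and the result would be written to a new file rather than over the original.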

Sed replace unusual file extension arising from gmv

As a result of using gmv on a large nested directory to flatten it, I have a number of duplicate files separated out and with the extensions "._1_" "._2_" etc ( .... ._n_ )
eg "a.pdf.\_1\_"
ie its
a(dot)pdf(dot)(back slash)1(back slash)
as opposed to
a(dot)pdf(dot)1
which I want to reduce it back to "a.pdf"
I tried something like
sed -i .bak "s|.\_1\_||" *
which is usually reliable and doesn't require escape characters. However, it's giving me
"error: illegal byte sequence"
Grateful for help to fix. This is on Mac OSX terminal. Ideally I'd like a generic solution to fix ._*_ forms where the * varies 1 to 9
There are two challenges here.
How to deal with the duplicate basenames. The suffixes '1', '2', ... were most likely added to designate different sections of a single file (perhaps different pages of a PDF, etc.). A rename that strips the suffixes may cause some important files to disappear.
How to deal with the "error: illegal byte sequence", which indicates that some special (Unicode) characters are part of the file name. These are usually characters with byte values >= 0xc0, which cannot be decoded according to the current locale. The fact that the file names are escaped (as per the OP's "a.pdf.\_1\_") may hint at additional characters that are not displayed (assuming the escapes were not added by the OP).
The proposed solution is to rename the file and place the 'sequence' part, which makes the file unique, BEFORE the extension, so that the extension can still be used to determine the file type.
a.pdf.1 => a.1.pdf
The rename command to perform this task is:
rename 's/(.*).pdf.(_.*_)/$1$2.pdf/' *.pdf._*_
Adjust the file name list as needed, and use -n to verify before running.
rename -n s/.\_1\_// *.*_1_
works (remove the -n once tested).
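For systems without the rename command, the same idea (moving the sequence number in front of the extension, as in a.pdf.1 => a.1.pdf) could be sketched in Python. The function names are hypothetical, and the pattern assumes the suffix is exactly "._N_" with N from 1 to 9, as described in the question.

```python
import os
import re

# name.ext._N_  ->  name.N.ext   (N is a single digit 1-9)
SUFFIX_RE = re.compile(r"^(?P<stem>.+)\.(?P<ext>[^.]+)\._(?P<n>[1-9])_$")


def plan_renames(names):
    """Map each matching file name to its proposed new name."""
    plan = {}
    for name in names:
        match = SUFFIX_RE.match(name)
        if match:
            plan[name] = "{stem}.{n}.{ext}".format(**match.groupdict())
    return plan


def apply_renames(directory, dry_run=True):
    """Rename files in `directory`; with dry_run=True only print the plan."""
    for old, new in sorted(plan_renames(os.listdir(directory)).items()):
        target = os.path.join(directory, new)
        if os.path.exists(target):
            print("skipping", old, "->", new, "(target exists)")
        elif dry_run:
            print("would rename", old, "->", new)
        else:
            os.rename(os.path.join(directory, old), target)


print(plan_renames(["a.pdf._1_", "a.pdf", "b.c.txt._9_"]))
```

Like rename -n, the dry_run default prints what would happen without touching any file.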

Update the tar.bz2 compressed file

We have hundreds of files in a trx_date.tar.bz2 compressed archive, containing requests and responses. The structure of trx_date.tar.bz2 is: trx_date.tar, whose trx_date directory contains the files log1, log2 and log3. These hold XML requests with some sensitive information that I would like to mask with a default value. The request has a tag like <number>1234567</number>, and I want to mask it in the log file, i.e. update it to <number>3333333</number>.
I am able to grep it using:
Number1=bzcat $LOGDIR/$LOG_FORMAT | grep "<number>[0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9]"
How can we overwrite those values in the log files using a shell script?
The log file contains requests and responses, with tags like <number>123456</number> as well as other tags. I want to read every line of the log file, replace the value of that specific tag with 333333, and save the result into the same file. We also have an info tag with 333333, but I don't want to consider that.
In principle, you cannot do directly what you want (without extracting the file from your .tar.bz2 compressed archive), since a .tar.bz2 file is a bzip2-ed compression of a tar archive. So the only good solution would be to extract files from the archive, do the modification on the extracted files (e.g. with sed(1) or awk), and recreate an archive from it. Using sed on one particular textual file to replace a pattern like <number>[0-9]*</number> by <number>0000000</number> is easy. Writing a bash for loop to iterate that on several files is easy. So combine both approaches, or write a tiny shell or Python script doing that (on the extracted files).
In practice (but that is risky and I don't recommend it) you could hope that <number> digits </number> occurs only in the files inside the tar archive that you want to modify in place. You could then perhaps replace such sequences (directly in the uncompressed tar archive), using e.g. sed(1), with other sequences of the same byte length (read more about the tar format: metadata such as file sizes appears in textual form, padded with NUL bytes).
You might also consider using tardy, a tar post-processor (that you need to install).
I strongly recommend extracting the tar archive, operating on the extracted files, and then recreating the archive. Of course, you need enough disk space, and you have to estimate it. But tell your manager that disk space is cheap, generally cheaper than your labor costs.
PS. The command given in your question is really wrong and does not do what you dream of. Read more about redirection, pipelines, globbing, and unix shells. Read carefully the documentation of Bash (notably basic shell features, shell expansion, command substitution). Read also the documentation of each command that you want to use, e.g. tar(1), grep(1), sed(1), etc. Read the relevant man-pages(7), perhaps with the man(1) command.
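A tiny Python script along the lines recommended above might look like the sketch below. It is a variant of extract, modify and recreate: it reads each member of the archive and writes the masked copy into a new archive instead of unpacking to disk. The function name mask_archive and the output path are hypothetical, the <number> tag and the 3333333 default are taken from the question, and each file is held in memory while it is processed.

```python
import io
import re
import tarfile

NUMBER_RE = re.compile(rb"<number>\d+</number>")


def mask_archive(src_path, dst_path, replacement=b"<number>3333333</number>"):
    """Copy a .tar.bz2 archive, masking every <number>...</number> value."""
    with tarfile.open(src_path, "r:bz2") as src, \
            tarfile.open(dst_path, "w:bz2") as dst:
        for member in src:
            if not member.isfile():
                # Directories, links, etc. carry no data to rewrite.
                dst.addfile(member)
                continue
            data = src.extractfile(member).read()
            data = NUMBER_RE.sub(replacement, data)
            # The masked value may have a different length, so update the size
            # recorded in the header before writing the member.
            member.size = len(data)
            dst.addfile(member, io.BytesIO(data))
```

A call such as mask_archive("trx_date.tar.bz2", "trx_date.masked.tar.bz2") leaves the original untouched, so the result can be checked before the old archive is replaced.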

zipping multiple files with matching pattern in NiFi

I'm trying to compress a list of files generated by the previous processor. The names are random, but the start and end of each name repeat.
Ex:
part-00000-1dfde626-2a4f-4bc2-aa43-eaf3c940b2a8-c000.csv
part-00000-547c93da-088e-46c4-a478-a41aabfef9ea-c000.csv
I'm trying to zip all the files into one single file using the ExecuteStreamCommand processor. Following are my command and its arguments, which don't work:
command: /bin/zip
Argument: finalCompressedFile.zip;part.*csv
The regex part.*csv does match all the generated file names. But I suspect the * is getting passed to the bash shell as a literal. If I give a single full file name, it does the job, but then I won't be compressing all the files.
Any idea on this?

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario-
Pig version used 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed above, one of the directory names is a date stamp.
Problem: I want to load files from a date range say from 20100810 to 20100813.
I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);
The following works with hadoop:
hadoop fs -ls /user/training/test/{20100810..20100813}
But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?
Error log follows:
Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
cheers
As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):
shell:
pig -f script.pig -param input=/user/training/test/{20100810..20100812}
script.pig:
temp = LOAD '$input' USING SomeLoader() AS (...);
Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's glob syntax is covered in its own documentation, and it does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script which generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated list route, and it's working fine (CDH3 distribution).
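A wrapper of the kind mentioned above could generate the list like this. It is a minimal sketch: the function name date_range_glob is hypothetical, and the base path is the one from the question.

```python
from datetime import date, timedelta


def date_range_glob(base, start, end):
    """Return base/{YYYYMMDD,...} covering start..end inclusive."""
    if end < start:
        raise ValueError("end date is before start date")
    days = (end - start).days
    stamps = [
        (start + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days + 1)
    ]
    # No spaces around the commas: the pattern is passed on verbatim.
    return base + "/{" + ",".join(stamps) + "}"


print(date_range_glob("/user/training/test", date(2010, 8, 10), date(2010, 8, 13)))
```

The printed pattern can then be passed to the script as a quoted Pig parameter, so that the shell does not try to expand the braces itself.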
I ran across this answer when I was having trouble trying to create a file glob in a script and then pass it as a parameter into a pig script.
None of the current answers applied to my situation, but I did find a general answer that might be helpful here.
In my case, the shell expansion was happening and the result was then passed into the script, which understandably caused problems with the pig parser.
Simply surrounding the glob in double quotes protects it from being expanded by the shell and passes it as is into the command.
WON'T WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6
WILL WORK
$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6
I hope this saves someone some pain and agony.
So this works:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader()
but this does not work:
temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader()
If you want a date range that spans, say, 300 days, passing a full list to LOAD is inelegant, to say the least. I came up with this and it works.
Say you want to load data from 2012-10-08 to today, 2013-02-14. What you can do is
temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader()
then do a filter after that
filtered = FILTER temp BY (the_date>='2012-10-08')
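The month-prefix pattern used in this approach can also be generated rather than typed. The sketch below is one way to do that in Python. The function name month_prefix_glob is hypothetical, and unlike the hand-written pattern it lists every month instead of collapsing a year into 2013*.

```python
from datetime import date


def month_prefix_glob(base, start, end):
    """Return base/{YYYYMM*,...} with one wildcard per month in start..end."""
    if end < start:
        raise ValueError("end date is before start date")
    patterns = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        patterns.append("%04d%02d*" % (year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return base + "/{" + ",".join(patterns) + "}"


print(month_prefix_glob("/user/training/test", date(2012, 10, 8), date(2013, 2, 14)))
```

The pattern still over-selects at both ends of the range (whole first and last months), so the FILTER step on the date remains necessary.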
temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100819.
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100812.
If the variable is in the middle of the file path, concatenate the subfolder name or use '*' for all files.
I found that this problem is caused by the Linux shell. The shell expands
{20100810..20100812}
to
20100810 20100811 20100812,
so the command you actually run is
bin/hadoop fs -ls 20100810 20100811 20100812
The HDFS API, however, does not expand the expression for you.
Thanks to dave campbell.
Some of the answers above are wrong, even though they got some votes.
Following is my test result:
Works
pig -f test.pig -param input="/test_{20120713,20120714}.txt"
Cannot have space before or after "," in the expression
pig -f test.pig -param input="/test_201207*.txt"
pig -f test.pig -param input="/test_2012071?.txt"
pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
Cannot have space before or after "," in the expression
Doesn't Work
pig -f test.pig -param input="/test_{20120713..20120714}.txt"
pig -f test.pig -param input=/test_{20120713,20120714}.txt
pig -f test.pig -param input=/test_{20120713..20120714}.txt
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
Probably you don't. This can be done using a custom Load UDF, or you could try rethinking your directory structure (this will work well if your ranges are mostly static).
Additionally: Pig accepts parameters, so maybe this would help you (maybe you could write a function that loads data from one day and unions it into the resulting set, but I don't know if that's possible).
Edit: writing a simple python or bash script that generates the list of dates (folders) is probably the easiest solution. You then just have to pass it to Pig, and this should work fine.
To Romain's answer, if you want to just parameterize the date, the shell will run like this:
pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig
pig:
temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);
Please note the quotes.
Pig supports the globStatus of HDFS,
so I think Pig can handle the pattern
/user/training/test/{20100810,20100811,20100812}.
Could you paste the error logs?
Here's a script I'm using to generate a list of dates, and then put this list to pig script params. Very tricky, but works for me.
For example:
DT=20180101
DT_LIST=''
for ((i=0; i<=$DAYS; i++))
do
d=$(date +%Y%m%d -d "${DT} +$i days");
DT_LIST=${DT_LIST}$d','
done
size=${#DT_LIST}
DT_LIST=${DT_LIST:0:size-1}
pig -p input_data="xxx/yyy/{${DT_LIST}}" script.pig

Resources