Let's say I'm opening a large (several GB) file that is too big to read into memory all at once.
If it's a csv file, we would use pandas:
import pandas as pd
for chunk in pd.read_csv('path/filename', chunksize=10**7):
    # save chunk to disk
    pass
Or we could do something similar line by line:
with open(fn) as file:
    for line in file:
        # save line to disk, e.g. df=pd.concat([df, line_data]), then save the df
        pass
How does one "chunk" data with an awk script? Awk will parse and process text into whatever format you want, but I don't know how to "chunk" with awk. I can write a script script1.awk and run it on my data, but as far as I can tell this processes the entire file at once.
Related question, with more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?
awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can define what a record is with the RS (record separator) variable. Each code block is conditionally executed on the current record before the next one is read:
$ awk '/pattern/{print "MATCHED", $0 > "output"}' file
The above script reads one line at a time from the input file and, if that line matches pattern, saves the line to the file output, prefixed with MATCHED, before reading the next line.
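For readers coming from the Python snippets in the question, the same record-at-a-time behaviour can be sketched in Python. This is only an illustration of the idea, not a replacement for awk; the function names `filter_records` and `filter_file` are made up for this example.

```python
import re
from typing import Iterable, Iterator


def filter_records(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield 'MATCHED <line>' for each line that matches pattern, one record at a time."""
    regex = re.compile(pattern)
    for line in lines:
        record = line.rstrip("\n")
        if regex.search(record):
            yield "MATCHED " + record


def filter_file(src: str, dst: str, pattern: str) -> int:
    """Stream src to dst without loading the whole file; return the number of matches."""
    count = 0
    with open(src) as infile, open(dst, "w") as outfile:
        # Iterating over the file object reads one line at a time.
        for out in filter_records(infile, pattern):
            outfile.write(out + "\n")
            count += 1
    return count
```

Because only one line is held in memory at a time, the size of the input file should not matter, which is the same property the awk one-liner has.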
I'm trying to merge Avro files into one big file. The problem is that the concat command does not accept a wildcard:
hadoop jar avro-tools.jar concat /input/part* /output/bigfile.avro
I get:
Exception in thread "main" java.io.FileNotFoundException: File does
not exist: /input/part*
I tried quoting the path with "" and '' but had no luck.
I quickly checked Avro's source code (1.7.7) and it seems that concat does not support glob patterns (basically, they call FileSystem.open() on each argument except the last one).
This means that you have to provide all the filenames explicitly as arguments. It is cumbersome, but the following command should do what you want:
IN=$(hadoop fs -ls /input/part* | awk '{printf "%s ", $NF}')
hadoop jar avro-tools.jar concat ${IN} /output/bigfile.avro
Support for glob patterns would be a nice addition to this command.
Instead of hadoop jar avro-tools.jar one can run java -jar avro-tools.jar, since you don't need hadoop for this operation.
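The same two steps (list the matching files, then pass them all to concat) could also be driven from Python. The sketch below is only an assumption about how one might wrap the commands shown above; the helper names are hypothetical, and it assumes that paths contain no whitespace and that `hadoop` is on the PATH.

```python
import subprocess
from typing import List


def paths_from_ls(ls_output: str) -> List[str]:
    """Take the last whitespace-separated field of each listing line as the path."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip blank lines and a "Found N items" header, if the listing prints one.
        if not fields or fields[0] == "Found":
            continue
        paths.append(fields[-1])
    return paths


def build_concat_command(inputs: List[str], output: str,
                         jar: str = "avro-tools.jar") -> List[str]:
    """Build the argument list for concat, with the output file last."""
    if not inputs:
        raise ValueError("no input files matched")
    return ["hadoop", "jar", jar, "concat", *inputs, output]


def concat_avro(glob: str, output: str) -> None:
    """List the files matching glob on HDFS, then concatenate them."""
    listing = subprocess.run(
        ["hadoop", "fs", "-ls", glob],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(build_concat_command(paths_from_ls(listing), output), check=True)
```

Something like `concat_avro("/input/part*", "/output/bigfile.avro")` would then mirror the shell version; raising an error on an empty listing avoids calling concat with only an output file.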
I have a file named "filelist.txt" whose content is a list of files that I want to read into my Pig script. For example, it can be organized as:
file1.txt
file2.txt
...
filen.txt
Some of the solutions I have seen use regular expressions, but the filenames follow no particular format, so the only option is to read the filenames from filelist.txt.
Each of these files contains the actual data I want to read. For example, file1 could contain:
value1
value2
value3
So how can I read the values from all these files in my Pig script?
There is currently no way to do this in pure Pig. The best you can do in pure Pig is use its builtin globbing, which you can find information about here. It is fairly flexible, but it doesn't sound like it will be enough for your purposes.
The other solution I can think of, if you can get that file in your local environment, is to use some sort of wrapper (I would recommend python). In that script you can read the file and generate a pig script to read those lines. Here is how that logic would work:
def addLoads(filesToRead, schema, delim='\\t'):
    newLines = []
    with open(filesToRead, 'r') as infile:
        for n, f in enumerate(infile):
            # strip the trailing newline so it does not end up inside the quoted path
            newLines.append("input{} = LOAD '{}' USING PigStorage('{}') AS {};".format(n, f.strip(), delim, schema))
    # aliases are numbered from 0, matching enumerate above
    to_union = [ 'input{}'.format(i) for i in range(len(newLines)) ]
    newLines.append('loaded_lines = UNION {} ;'.format(', '.join(to_union)))
    return '\n'.join(newLines)
Prepend the result to the Pig script you load from disk, and make sure that the rest of the script uses loaded_lines as its starting point.
You have to write a Pig LoadFunc and override setLocation:
@Override
public void setLocation(String location, Job job) throws IOException {
    // Read the location that holds all the input file names and convert them into a comma-separated string.
    FileInputFormat.setInputPaths(job, [comma-separated list]);
}
Here the location will be a comma-separated list of your files.
Problem: I have two folders (one is the Delta folder, where the files get updated, and the other is the Original folder, where the original files exist). Every time a file is updated in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: The file names in the Delta folder and the Original folder are unique, but the content of the files may differ. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so, my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
Then I need to get the list of files that are unique in both folders and loop over them.
Then I need to loop over each file to read each line from a property file (1.properties).
Then I need to read each line (delta-line="account.org.com.email=New-Email") from a property file in the Delta folder and split the line on the delimiter "=" into two string variables.
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email;)
Then I need to read each line (orig-line="account.org.com.email=Old-Email") from a property file in the Original folder and split the line on the delimiter "=" into two string variables.
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;)
If delta-line-string1 == orig-line-string1, then update $orig-line with $delta-line.
i.e:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in original folder/1.properties with
account.org.com.email=New-Email
Once the loop has gone through all the lines in a file, it moves on to the next file. The loop continues until it has finished all the unique files in a folder.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it works, but it takes a long time (4 minutes) to finish each file, because it goes through three loops for every line: splitting the line, finding the variable in the other file, and replacing the line.
I am wondering if there is any way to reduce the number of loops so that the script runs faster.
With paste and awk :
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1 :
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output :
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command if sorting is not important :
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
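The single-pass idea behind the awk command (load every key into an array, letting later files overwrite earlier ones) can also be sketched in Python. This is a minimal sketch that assumes plain `key=value` lines; the function names are made up for illustration, and lines without an `=` (comments, blank lines) are simply dropped.

```python
from typing import Dict, Iterable, List


def parse_properties(lines: Iterable[str]) -> Dict[str, str]:
    """Parse key=value lines into a dict, splitting on the first '=' only."""
    props: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        key, value = line.split("=", 1)
        props[key] = value
    return props


def merge_properties(original: Iterable[str], delta: Iterable[str]) -> List[str]:
    """Return the original lines with values overridden by the delta lines."""
    merged = parse_properties(original)
    # update() keeps the original key order and appends keys that are new in delta.
    merged.update(parse_properties(delta))
    return [f"{key}={value}" for key, value in merged.items()]
```

Each file is read once and lookups are dictionary accesses, so the per-line nested loops from the question disappear. Unlike the `for (x in arr)` loop in awk, a Python dict keeps insertion order, so the original line order is preserved.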
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape & / and \ as mentioned in this answer.
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
I have the following scenario-
Pig version used 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed above, one of the directory names is a date stamp.
Problem: I want to load files from a date range say from 20100810 to 20100813.
I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);
The following works with hadoop:
hadoop fs -ls /user/training/test/{20100810..20100813}
But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?
Error log follows:
Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
cheers
As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):
shell:
pig -f script.pig -param input=/user/training/test/{20100810..20100812}
script.pig:
temp = LOAD '$input' USING SomeLoader() AS (...);
Pig is processing your file name pattern using the hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script which generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated list route, and it's working fine (CDH3 distribution).
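As an illustration of the wrapper-script option, here is a minimal Python sketch that turns a 'from' and 'to' date into a comma-separated brace list of the kind hadoop's glob syntax accepts. The function name `date_glob` and the `%Y%m%d` default are assumptions chosen to match the directory names in the question.

```python
from datetime import datetime, timedelta


def date_glob(start: str, end: str, fmt: str = "%Y%m%d") -> str:
    """Return '{d1,d2,...,dN}' covering every day from start to end, inclusive."""
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date is before start date")
    days = (last - first).days
    # Real calendar arithmetic, so month and year boundaries are handled.
    stamps = [(first + timedelta(days=i)).strftime(fmt) for i in range(days + 1)]
    return "{" + ",".join(stamps) + "}"


if __name__ == "__main__":
    print("/user/training/test/" + date_glob("20100810", "20100813"))
```

The printed path could then be passed to the Pig script as a parameter. Unlike a numeric `{20100810..20100813}` range, this never produces impossible dates such as 20100832 when the range crosses a month boundary.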
I ran across this question when I was having trouble creating a file glob in a script and then passing it as a parameter into a Pig script.
None of the current answers applied to my situation, but I did find a general answer that might be helpful here.
In my case, the shell expansion was happening before the value was passed into the script, which understandably caused problems with the Pig parser.
Simply surrounding the glob in double quotes protects it from being expanded by the shell and passes it as is into the command.
WON'T WORK:
$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6
WILL WORK
$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6
I hope this saves someone some pain and agony.
So since this works:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader()
but this does not work:
temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader()
If you want a date range that spans, say, 300 days, passing a full list to LOAD is not elegant, to say the least. I came up with the following, and it works.
Say you want to load data from 2012-10-08 to today 2013-02-14, what you can do is
temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader()
then do a filter after that
filtered = FILTER temp BY (the_date>='2012-10-08')
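For long ranges, the month-level wildcard list itself can be generated rather than typed. The following is a hedged Python sketch of that idea; `month_prefix_glob` is a hypothetical helper, and it emits one `YYYYMM*` pattern per month instead of collapsing whole years into patterns like `2013*`.

```python
from datetime import date


def month_prefix_glob(start: date, end: date) -> str:
    """Return a glob such as '{201210*,201211*}' covering every month from start to end."""
    if end < start:
        raise ValueError("end is before start")
    patterns = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        patterns.append(f"{year:04d}{month:02d}*")
        # Advance by one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return "{" + ",".join(patterns) + "}"
```

The glob deliberately over-selects (it includes the whole first and last month), so the FILTER step on the date is still needed to trim the result to the exact range.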
temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100819.
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
loads the data for 20100810 through 20100812.
If the variable part is in the middle of the file path, concatenate the subfolder name or use '*' to match all files.
I found that this problem is caused by the Linux shell. The shell expands
{20100810..20100812}
to
20100810 20100811 20100812,
so the command you actually run is
bin/hadoop fs -ls 20100810 20100811 20100812
The HDFS API, however, does not expand the expression for you.
Thanks to dave campbell.
Some of the answers above are wrong even though they got some votes.
Following is my test result:
Works
pig -f test.pig -param input="/test_{20120713,20120714}.txt"
Cannot have space before or after "," in the expression
pig -f test.pig -param input="/test_201207*.txt"
pig -f test.pig -param input="/test_2012071?.txt"
pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
Cannot have space before or after "," in the expression
Doesn't Work
pig -f test.pig -param input="/test_{20120713..20120714}.txt"
pig -f test.pig -param input=/test_{20120713,20120714}.txt
pig -f test.pig -param input=/test_{20120713..20120714}.txt
Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?
Probably not. This can be done with a custom Load UDF, or you could try rethinking your directory structure (this will work well if your ranges are mostly static).
Additionally: Pig accepts parameters, so maybe this would help you (maybe you could write a function that loads the data for one day and unions it into the resulting set, but I don't know if that's possible).
Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution. You then just have to pass it to Pig, and this should work fine.
To Romain's answer, if you want to just parameterize the date, the shell will run like this:
pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig
pig:
temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);
Please note the quotes.
Pig supports HDFS glob patterns (globStatus),
so I think Pig can handle the pattern
/user/training/test/{20100810,20100811,20100812}.
Could you paste the error logs?
Here's a script I'm using to generate a list of dates and then pass this list to the Pig script as a parameter. It's a bit tricky, but it works for me.
For example:
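DT=20180101
DAYS=7  # number of days after DT to include (example value; not set in the original)
DT_LIST=''
for ((i=0; i<=DAYS; i++))
do
    # relies on GNU date's -d option
    d=$(date +%Y%m%d -d "${DT} +$i days")
    DT_LIST=${DT_LIST}$d','
done
# drop the trailing comma
size=${#DT_LIST}
DT_LIST=${DT_LIST:0:size-1}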
pig -p input_data=xxx/yyy/'${DT_LIST}' script.pig