How to read multiple files listed in a file in Apache Pig?

I have a file named "filelist.txt" whose content is a list of files that I want to read in my Pig script. For example, it can be organized as:
file1.txt
file2.txt
...
filen.txt
Some of the suggested solutions rely on regular expressions, but there is no particular pattern in the filenames; the only option is to read the filenames from filelist.txt.
Each of these files contains the actual data I want to read. For example, file1.txt might contain:
value1
value2
value3
So how can I read all of these files' values in my Pig script?

There is currently no way to do this in pure Pig. The best you can do in pure Pig is its built-in globbing, which is documented in the Pig Latin reference. It is fairly flexible, but it doesn't sound like it will be enough for your purposes.
The other solution I can think of, if you can get that file into your local environment, is to use some sort of wrapper script (I would recommend Python). In that script you can read the file list and generate a Pig script that loads those files. Here is how that logic would work:
def addLoads(filesToRead, schema, delim='\\t'):
    newLines = []
    with open(filesToRead, 'r') as infile:
        for n, f in enumerate(infile):
            # strip the trailing newline so it doesn't end up inside the quoted path
            newLines.append("input{} = LOAD '{}' USING PigStorage('{}') AS {};".format(n, f.strip(), delim, schema))
    # the aliases are numbered from 0, so enumerate them the same way
    to_union = ['input{}'.format(i) for i in range(len(newLines))]
    newLines.append('loaded_lines = UNION {};'.format(', '.join(to_union)))
    return '\n'.join(newLines)
Prepend the generated statements to the Pig script you load from disk, and make sure that the rest of the script uses loaded_lines as its starting relation.
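For instance, a minimal driver (the file and schema names here are hypothetical) could glue the generated header onto an existing script body:

header = addLoads('filelist.txt', '(value:chararray)')
with open('script_body.pig') as f:
    body = f.read()
with open('generated.pig', 'w') as out:
    out.write(header + '\n' + body)

You would then run generated.pig with pig as usual.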

You can write a custom Pig LoadFunc and override setLocation:
@Override
public void setLocation(String location, Job job) throws IOException {
    // Read the location where you have all the input file names and
    // convert that into a comma-separated string.
    FileInputFormat.setInputPaths(job, [comma-separated list]);
}
Here, the path list you pass to setInputPaths is a comma-separated list of your files.

Related

Populate a value in a particular column in csv

I have a folder containing 50 Excel sheets in CSV format. I have to populate a particular value, say "XYZ", in one column of all the sheets in that folder.
I am new to Unix and have already looked at a couple of related pages. Can anyone please provide a sample script to begin with?
For example :
Let's say column C in this case:
A       B       C
ASFD    2535
BDFG    64486
DFGC    336846
I want to update column C to value "XYZ".
Thanks.
I would export those files into CSV format:
- with semicolon as the field separator
- possibly leaving out the column headers (otherwise see the note below)
Then the following combination of a shell loop and sed could more or less do the trick already:
#! /bin/sh
for i in *.csv
do
    sed -i -e "s/$/;XYZ/" "$i"
done
-i means to edit the file in place; the substitution appends the value to every line.
-e specifies the regular expression for the substitution.
If the CSV files also contain a header line, you can use a similar script that puts "C" instead of "XYZ" on the first line only.
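If you need quoting-aware handling rather than a blind text substitution, a short Python sketch does the same job (assuming the same *.csv files in the current directory, with ';' as the separator):

import csv
import glob

for path in glob.glob('*.csv'):
    # read every row and append the new value as an extra field
    with open(path, newline='') as f:
        rows = [row + ['XYZ'] for row in csv.reader(f, delimiter=';')]
    # write the rows back in place
    with open(path, 'w', newline='') as f:
        csv.writer(f, delimiter=';').writerows(rows)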

Combining many files columnwise, use first column only once

I have to combine a lot of similar CSV files into one file. They are stored in many different subdirectories, but the individual CSV files have the same name.
I need to append them columnwise, but I need the first "name" column only once: the first column of the first CSV file stays, and it is removed from all the following files. Referring to a similar question, I tried the following command, iterating through all the subdirectories while the final file sits in the main directory (it starts out as a copy of one of the many CSV files, so that it already contains the "name" column):
for i in */; do paste final_table.csv <(cut -f 2- "$i"single_table.csv) > final_table.csv ; done
However it seems like paste does not work when one of the input files is also the output file.
How would I solve this correctly?
Don't overwrite the file you're reading input from. Instead, mv/rename it to an intermediate name, let your script read from that file, and write output to a file with the original name. Remove the input file when complete.
Alternatively, choose an intermediate name for the output file, write all input to it, and only after all input has been processed, mv/rename the output file to the final name.
As the intermediate name, appending a temporary file name ending (an "extension") to the original name can be useful.
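A rough Python sketch of that second approach for this task (file names assumed, tab-separated input): all output goes to a temporary file first, and the original is replaced only after everything has been read.

import os
import tempfile

def paste_columns(final_path, extra_path):
    with open(final_path) as left, open(extra_path) as right, \
         tempfile.NamedTemporaryFile('w', delete=False, dir='.') as out:
        for l, r in zip(left, right):
            # keep all of the left file, drop the first (name) column on the right
            out.write(l.rstrip('\n') + '\t' + r.split('\t', 1)[1])
    # only now replace the original, after all input has been consumed
    os.replace(out.name, final_path)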
The sponge utility from the moreutils package is what I always use for this kind of situation:
for i in */; do
    paste final_table.csv <(cut -f 2- "$i"single_table.csv) | sponge final_table.csv
done
sponge quite simply "soaks up" standard input and writes it to the filename you give it afterwards. It is written specifically for situations like this, to avoid the need for you to create (and then remember to delete) a temporary file.
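If sponge isn't installed, its core behaviour is easy to picture; a minimal Python equivalent (just the idea, not the real implementation) would be:

import sys

# soak up ALL of standard input before touching the output file,
# so reading and writing the same file can't collide
data = sys.stdin.buffer.read()
with open(sys.argv[1], 'wb') as f:
    f.write(data)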

Load multiple files with PigLatin (Hadoop)

I have an HDFS file containing a list of CSV files with the same format. I need to be able to LOAD them together with Pig. E.g.:
/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv
They cannot be globbed, as their names are "random" hashes and their directories might contain other CSV files.
As suggested in other comments, you can do this by pre-processing the file. Suppose your HDFS file is called file_list.txt, then you can do the following:
pig -param flist="$(hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}')" script.pig
The awk code gets rid of the newline characters and uses commas to separate the file names.
In your script (called script.pig in my example), you should use parameter substitution to load the data:
data = LOAD '$flist';
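If you'd rather avoid awk, the same preprocessing works as a small Python sketch (this assumes the hdfs CLI is on your PATH and the list file name from above):

import subprocess

# read the file list out of HDFS
listing = subprocess.run(['hdfs', 'dfs', '-cat', 'file_list.txt'],
                         capture_output=True, text=True, check=True).stdout
# join the names with commas, ready for -param flist=...
print(','.join(listing.split()))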
Globbing is more flexible than simple wildcards: you can enumerate explicit paths inside braces. Use this:
LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv';
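If you want to build such a brace expression automatically from a list of paths, a hypothetical helper (illustrative only, not a Pig API) might look like:

import os

def brace_glob(paths):
    # factor out the longest common character prefix and
    # put the distinct remainders inside {...}
    prefix = os.path.commonprefix(paths)
    alternatives = ','.join(p[len(prefix):] for p in paths)
    return '{}{{{}}}'.format(prefix, alternatives)

For the paths above, this yields a single glob string you can paste into LOAD.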

Replace last line of XML file

Looking for help creating a script that will replace the last line of an XML file with a tag. I have a few hundred files so I'm looking for something that will process them in a loop. I've managed to rename the files sequentially like this:
posts1.xml
posts2.xml
posts3.xml
etc...
to make it easier to loop through. But I have no idea how to write a script to do this. I'm open to using either Linux or Windows (but I would guess that Linux is better for this kind of task).
So if you want to append a line to every file:
sed -i '$a<YOUR_SHINY_NEW_TAG>' *xml
To replace the last line:
sed -i '$s/.*/<YOUR_SHINY_NEW_TAG>/' *xml
But do note, sed is not the ideal tool to modify xml.
XMLStarlet is a command-line toolkit for performing XML parsing and manipulations. Note that as an XML-aware toolkit, it'll respect XML structure, character encoding and entity substitution.
Check out its ed subcommand to see how to modify documents. You can wrap this in a standard bash loop.
e.g. in a doc consisting of a chain of <elem>s, you can add a following <added>5</added>:
mkdir new
for x in *.xml; do
    xmlstarlet ed -a "//elem[count(//elem)]" -t elem -n added -v 5 "$x" > "new/$x"
done
Linux way using sed:
To edit the last line of the file in place, you can use sed:
sed -i '$s_pattern_replacement_' filename
To change the whole line to "replacement" use $s_.*_replacement_. Be sure to escape any _'s in replacement with a \.
To loop over files, just use for:
for f in /path/posts*.xml; do sed -i '$s_.*_replacement_' "$f"; done
This, however, is a dirty approach: it's not aware of the XML structure, and XML itself attaches no meaning to line breaks. You have to be sure the last line of each file contains exactly what you expect it to.
It makes little to no difference whether you're on Linux, Windows or MacOS
The question is what language do you want to use?
The following is an example in C# (not optimized; read it as pseudocode):
string rootDirectory = @"c:\myfiles";
var files = Directory.GetFiles(rootDirectory, "*.xml");
foreach (var file in files)
{
    var lines = File.ReadAllLines(file);
    // replace the last line with the new content
    lines[lines.Length - 1] = "whatever you want here";
    File.WriteAllLines(file, lines);
}
You can compile this and run it on Windows, Linux, etc..
Or you could do the same in Python.
Of course this method does not actually parse the XML,
but you just wanted to replace the last line right?
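In Python, the same line-based approach might look like this (posts*.xml matches the renaming scheme above; the tag is a placeholder):

import glob

for path in glob.glob('posts*.xml'):
    with open(path) as f:
        lines = f.read().splitlines()
    # overwrite the last line with the new tag
    lines[-1] = '<YOUR_SHINY_NEW_TAG>'
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')

The same caveat applies: this edits lines, not XML structure.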

Concatenating strings fails when read from certain files

I have a web application that is deployed to a server. I am trying to create a script that, among other things, reads the current version of the web application from a properties file that is deployed along with it.
The file looks like this:
//other content
version=[version number]
build=[buildnumber]
//other content
I want to create a variable that looks like this: version-buildnumber
Here is my script for it:
VERSION_FILE=myfile
VERSION_LINE="$(grep "version=" $VERSION_FILE)"
VERSION=${VERSION_LINE#$"version="}
BUILDNUMBER_LINE=$(grep "build=" $VERSION_FILE)
BUILDNUMBER=${BUILDNUMBER_LINE#$"build="}
THEVERSION=${VERSION}-${BUILDNUMBER}
The strange thing is that this works in some cases but not in others.
The problem I get is when I am trying to concatenate the strings (i.e. the last line above). In some cases it works perfectly, but in others characters from one string replace the characters from the other instead of being placed afterwards.
It does not work in these cases:
When I read from the deployed file
If I copy the deployed file to another location and read from there
It does work in these cases:
If I write a file from scratch and read from that one.
If I create my own file and then copy the content from the deployed file into my created file.
I find this very strange. Is there someone out there recognizing this?
It is likely that your files have carriage returns in them. You can fix that by running dos2unix on the file.
You may also be able to do it on the fly on the strings you're retrieving.
Here are a couple of ways:
Do it with sed instead of grep:
VERSION_LINE="$(sed -n "/version=/{s///;s/\r//g;p}" $VERSION_FILE)"
and you won't need the Bash parameter expansion to strip the "version=".
OR
Do the grep as you have it now and do a second parameter expansion to strip the carriage return.
VERSION=${VERSION_LINE#$"version="}
VERSION=${VERSION//$'\r'}
By the way, I recommend habitually using lowercase or mixed case variable names in order to reduce the chance of name collisions.
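For comparison, a carriage-return-tolerant version of the same extraction as a Python sketch (the properties file name is assumed):

props = {}
with open('myfile') as f:
    for line in f:
        # strip() also removes the trailing \r that trips up the shell version
        key, sep, value = line.strip().partition('=')
        if sep:
            props[key] = value
print('{}-{}'.format(props.get('version'), props.get('build')))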
Given this foo.txt:
//other content
version=[version number]
build=[buildnumber]
//other content
you can extract a version-build string more easily with awk:
awk -F'=' '$1 == "version" { version = $2}; $1 == "build" { build = $2}; END { print version"-"build}' foo.txt
I don't know why your script doesn't work. Can you provide an example of erroneous output?
From this sentence:
In some cases it works perfectly, but in others characters from one string replace the characters from the other instead of being placed afterwards.
I can't understand what's actually going on (I'm not a native English speaker so it's probably my fault).
