Hadoop FileUtil copymerge - Ignore header - hadoop

While writing out from spark to HDFS, depending upon the header setting, each file has a header. So when calling copymerge in FileUtil we get duplicated headers in merged file. Is there a way to retain header from 1st file and ignore others.

If you are planning to merge it as a single file and then fetch it on to your local file system, you can use getmerge.
getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.
Now to remove the headers, you should have an idea of how your header looks like.
Suppose if your header looks like:
HDR20171227
You can use:
sed -i '1,${/^HDR/d}' "${final_filename}"
where final_filename is the name of the file on local FS.
This will delete all lines that start with HDR in your file and occur after the first line.
If you are unsure about the header, you can first store it in a variable using
header=$(head -1 "${final_filename}" )
And then proceed to delete it using sed.

Related

Shell Script - Iterate through each line in text file and rename HDFS file

I have a text file in HDFS which would have records like below. The number of lines in file may vary every time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the below hdfs directory and file structure like below.
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using shell script I am able to count the number of files in HDFS directory and number of lines available in my HDFS text file. Only if the count matches between the number of files in directory and number of records in my text file, I am going to proceed further with the process.
Now, i am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
Second file should be renamed to hdfs://myfolder/file_name_2.csv and third file to hdfs://myfolder/file_name_3.csv
I have difficulty in looping through both the text file and also the files in HDFS directory.
Is there an optimal way to achieve this using shell script.
You cannot do this directly from HDFS, you'd need to stream the file contents, then issue individual move commands.
e.g.
#!/bin/sh
COUNTER = 0
for file in $(hdfs dfs -cat file.txt)
do
NAME = $(sed $file ...) # replace text, as needed. TODO: extract the extension
hdfs dfs -mv file "$NAME_${COUNTER}.csv" # 'csv' for example - make sure the extension isn't duplicated!!
COUNTER = $((COUNTER + 1)
done

how to insert header file as first line into data file in HDFS without using getmerge(performance issue while copying to local)?

I am trying to insert header.txt as first line into data.txt without using getmerge. Getmerge copies to local and inserts into third file. But I want in HDFS only
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want output in Data.txt file only like below :
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest me whether can we implement in HDFS only ?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, header, data);
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I got other way like :
Hadoop fs cat hdfs_path/header.txt hdfs_path/data.txt | Hadoop fs -put - hdfs_path/Merged.txt
This is having drawback as cat command reads complete data which impacts performance.

How to combine url with filename from file

Text file (filename: listing.txt) with names of files as its contents:
ace.pdf
123.pdf
hello.pdf
Wanted to download these files from url http://www.myurl.com/
In bash, tried to merged these together and download the files using wget eg:
http://www.myurl.com/ace.pdf
http://www.myurl.com/123.pdf
http://www.myurl.com/hello.pdf
Tried variations of the following but without success:
for i in $(cat listing.txt); do wget http://www.myurl.com/$i; done
No need to use cat and loop. You can use xargs for this:
xargs -I {} wget http://www.myurl.com/{} < listing.txt
Actually, wget has options which can avoid loops & external programs completely.
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)
If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html
is not specified, then file should consist of a series of URLs, one per line.
However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by
specifying --base=url on the command line.
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
-B URL
--base=URL
Resolves relative links using URL as the point of reference, when reading links from an HTML file specified via the -i/--input-file option (together with --force-html, or when the input file was fetched remotely from a
server describing it as HTML). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value for the "href" attribute.
For instance, if you specify http://foo/bar/a.html for URL, and Wget reads ../baz/b.html from the input file, it would be resolved to http://foo/baz/b.html.
Thus,
$ cat listing.txt
ace.pdf
123.pdf
hello.pdf
$ wget -B http://www.myurl.com/ -i listing.txt
This will download all the 3 files.

SQLLDR file path argument

I have more than 30 files to load the data.
The path changes at every run in those files. So the path becomes
INFILE "/home/dmf/Cycle7Data/ITEM_IMAGE.csv"
INFILE "/home/dmf/Cycle8Data/ITEM_IMAGE.csv"
The file names change on every control file (SUPPLIER.csv)
Is there any way to pass the File path in a variable, or set any Env. Variable?
So that the control file is not edited everytime
You can pass the data file name on the command line; from the documentation:
DATA specifies the name of the data file containing the data to be loaded. If you do not specify a file extension or file type, then the default is .dat.
If you specify a data file on the command line and also specify data files in the control file with INFILE, then the data specified on the command line is processed first. The first data file specified in the control file is ignored. All other data files specified in the control file are processed.
So pass the relevant file name with each invocation, e.g.
sqlldr user/passwd control=myfile.ctl data=/home/dmf/Cycle7Data/ITEM_IMAGE.csv
If you have lots of files to load from a directory you could have a shell script that loops over the directory contents and passes each file name in turn to an SQL*Loader session.

Edit file contents of folder before compress it

I want to compress the contents of a folder. The catch is that I need to modify the content of a file before compressing it. The modification should not alter the contents in the original folder but should be their in the compressed file
So far I was able to figure out altering file contents using sed command-
sed 's:/site_media/folder1/::g' index.html >index.html1
where /site_media/folder1/ is the string which I want to replace with empty string. Currently this code is creating another file named index.html1 as I don't want to make the changes inplace for the file index.html.
I tried pipelining this command with the zip command as follows
sed 's:/site_media/folder1/::g' folder1/index.html > index1.html |zip zips/folder1.zip folder1/
but I am not getting any contents when I unzip the file folder1.zip. Also the modified file in the compressed folder should be named index.html (and not index.html1)
You want to do two things. Then, say command1 && command2. In your case:
sed '...' folder1/index.html > index1.html && zip zips/folder1.zip folder1/
If you pipe commands, you use the output of the first to feed the second, which is something you don't want in this case.

Resources