Shell Script - Iterate through each line in text file and rename HDFS file - shell

I have a text file in HDFS with records like the ones below. The number of lines in the file may vary each time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the following HDFS directory and file structure:
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using a shell script, I am able to count the number of files in the HDFS directory and the number of lines in my HDFS text file. I proceed further with the process only if the two counts match.
Now I am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
The second file should be renamed to hdfs://myfolder/file_name_2.csv and the third file to hdfs://myfolder/file_name_3.csv.
I am having difficulty looping through both the text file and the files in the HDFS directory at the same time.
Is there an optimal way to achieve this with a shell script?

You cannot do this directly in HDFS; you'd need to stream the file contents and then issue individual move commands.
e.g.
#!/bin/sh
COUNTER=1
for name in $(hdfs dfs -cat file.txt)
do
# 'name' is the Nth record of the text file; run it through sed here first if it
# still needs editing, and make sure the .csv extension isn't duplicated!
hdfs dfs -mv "/myfolder/file${COUNTER}.csv" "/myfolder/${name}.csv"
COUNTER=$((COUNTER + 1))
done
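If the existing files don't follow a simple file1.csv, file2.csv, ... pattern, another option is to pair the sorted directory listing with the name file line by line. A minimal sketch, assuming bash (for process substitution), that hdfs dfs -ls -C is available in your Hadoop version, and that the sort order of the listing matches the order of names in the text file:
#!/bin/bash
# Pair the Nth existing file with the Nth name from the text file, then rename.
paste -d ' ' \
    <(hdfs dfs -ls -C '/myfolder/*.csv' | sort) \
    <(hdfs dfs -cat /myfile.txt) |
while read -r src new_name; do
    hdfs dfs -mv "$src" "/myfolder/${new_name}.csv"
done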

Related

Read csv and update csv

I have a CSV file containing a list of Hadoop file paths, so I read each path and copy the corresponding file with hadoop fs -get. That part works fine. But I would like to mark the second column of the CSV once each file has been copied to the destination folder, something like a flag. How do I edit the second column inside the while loop and save it back to the same CSV?
Input.csv
path,flag
file1path,
file2path,
So after copying each file, I want to mark its flag as Y in the same file.
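A minimal sketch of one way to do this, assuming the Input.csv layout shown above (with a header line), a hypothetical destination folder /destination/folder, and that rewriting the CSV through a temporary file is acceptable:
#!/bin/bash
# Sketch only: marks the flag column as Y for every path that is copied successfully.
input="Input.csv"
tmp="${input}.tmp"

{
    IFS= read -r header && printf '%s\n' "$header"   # keep the path,flag header line
    while IFS=',' read -r path flag; do
        # </dev/null keeps the copy command from consuming the loop's stdin
        if hadoop fs -get "$path" /destination/folder/ </dev/null; then
            flag="Y"
        fi
        printf '%s,%s\n' "$path" "$flag"
    done
} < "$input" > "$tmp" && mv "$tmp" "$input"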

How to insert a header file as the first line of a data file in HDFS without using getmerge (performance issue when copying to local)?

I am trying to insert Header.txt as the first line of Data.txt without using getmerge, which copies to the local file system and writes into a third file. I want the result to stay in HDFS.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want the output in the Data.txt file only, like below:
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Can this be implemented in HDFS only?
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, new Path[] { header, data });
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way, like:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has a drawback: the cat command reads the complete data, which impacts performance.

Bash Script to read CSV file and search directory for files to copy

I'm working on a bash script to read a CSV file (comma-delimited). The file contains parts of the names of files in another directory. I then need to take these names, use them to search the directory, and copy the matching files to a new folder.
I am able to read the CSV file. However, the CSV file only contains part of each file name, so I need to use wildcards to search the directory. I have been unable to get the wildcards to work.
CSV File Format (in notepad):
12
13
14
15
Example file names in target directory:
IXI12_asfds.nii
IXI13_asdscds.nii
IXI14_aswe32fds.nii
IXI15_asf432ds.nii
The prefix of all the files is the same: IXI. The CSV file contains the unique number for each target file, which appears right after the prefix. The middle portion of each filename is unique to that file.
#!/bin/bash
# CSV file with comma-delimited numbers.
# CSV file only contains part of the file name. Need to add IXI to the
# beginning, and search with a wildcard at the end.
input="CSV_file.csv"
while IFS=',' read -r file_name1
do
name=(IXI$file_name1)
cp $name*.nii /newfolder
done < "$input"
The error I keep getting says that no folder with the appropriate name can be identified.
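A minimal sketch of how this is often handled, assuming the CSV may carry Windows carriage returns (it was edited in Notepad) and that /newfolder already exists:
#!/bin/bash
# Sketch only: strip any carriage return from each field, then let the shell
# expand the wildcard outside the quotes.
input="CSV_file.csv"
while IFS=',' read -r file_name1 _; do
    file_name1="${file_name1//$'\r'/}"   # remove a trailing CR from a Windows-edited file
    [ -z "$file_name1" ] && continue
    cp "IXI${file_name1}"*.nii /newfolder/
done < "$input"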

Hadoop FileUtil copymerge - Ignore header

When writing out from Spark to HDFS, each part file may have a header, depending on the header setting. So when calling copyMerge in FileUtil we get duplicated headers in the merged file. Is there a way to retain the header from the first file and ignore the others?
If you are planning to merge it into a single file and then fetch it onto your local file system, you can use getmerge.
getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.
Now, to remove the headers, you need to know what your header looks like.
Suppose if your header looks like:
HDR20171227
You can use:
sed -i '2,${/^HDR/d}' "${final_filename}"
where final_filename is the name of the file on local FS.
This will delete all lines that start with HDR in your file and occur after the first line.
If you are unsure about the header, you can first store it in a variable using
header=$(head -1 "${final_filename}" )
And then proceed to delete it using sed.
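For example, a sketch using awk instead of sed, assuming the header line repeats verbatim in the merged file and that final_filename is a file on the local file system:
final_filename="merged.csv"               # hypothetical local file name
header=$(head -1 "${final_filename}")
# Keep the first line, drop every later line that matches the header exactly.
awk -v hdr="$header" 'NR == 1 || $0 != hdr' "${final_filename}" > "${final_filename}.tmp" \
    && mv "${final_filename}.tmp" "${final_filename}"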

Merging MapReduce output

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory /output-complete/, where the 20140102-r-00000.txt from the update replaces the original file in the /output directory and the "-r-0000x" suffix is removed from every file name. The two original directories will then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands? Do I need to create a Java program to traverse both directories and do the logic?
You can use Pig:
get_data = LOAD '/output*/20140102*.txt' USING Loader();
STORE get_data INTO '/output-complete/20140102.txt';
or an HDFS command:
hadoop fs -cat '/output*/20140102*.txt' | hadoop fs -put - /output-complete/20140102.txt
If the single quotes don't work, try double quotes.
You can also use the hadoop fs -getmerge command for merging HDFS files.
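If you want to stay with HDFS shell commands only, here is a rough sketch, assuming the date is everything before the "-r-" suffix, that hdfs dfs -ls -C is available in your Hadoop version, and that /output-update should win over /output for the same date:
#!/bin/sh
# Sketch only: copies every part file to /output-complete/<date>.txt,
# letting later directories overwrite earlier ones for the same date.
hdfs dfs -mkdir -p /output-complete
for dir in /output /output-update; do
    for f in $(hdfs dfs -ls -C "$dir"); do
        base=$(basename "$f")
        date="${base%%-r-*}"              # strip the -r-0000x suffix
        hdfs dfs -cp -f "$f" "/output-complete/${date}.txt"
    done
done
# The sources can then be removed with 'hdfs dfs -rm' if the original
# directories really should end up empty.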
