Batch rename in hadoop - bash

How can I rename all files in a hdfs directory to have a .lzo extension? .lzo.index files should not be renamed.
For example, this directory listing:
file0.lzo file0.lzo.index file0.lzo_copy_1
could be renamed to:
file0.lzo file0.lzo.index file0.lzo_copy_1.lzo
These files are lzo compressed, and I need them to have the .lzo extension to be recognized by hadoop.

If you don't want to write Java code for this, I think using the command-line HDFS API is your best bet:
mv in Hadoop
hadoop fs -mv URI [URI …] <dest>
You can get the paths using a small one-liner:
% hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $8}'
/user/foo/bar/blacklist
/user/foo/bar/books-eng
...
The awk filter removes directories from the output. Now you can put these files into a variable:
% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $8}')
and rename each file:
% for f in $files; do hadoop fs -mv $f $f.lzo; done
You can also use awk to filter the files by other criteria. For example, the following (untested) variant should exclude files whose paths match the regex nolzo; this way you can write flexible filters.
% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d|nolzo/ {print $8}' )
Test whether it works by replacing the hadoop command with echo:
$ for f in $files; do echo $f $f.lzo; done
Edit: Updated examples to use awk instead of sed for more reliable output.
The "right" way to do it is probably using the HDFS Java API .. However using the shell is probably faster and more flexible for most jobs.

When I had to rename many files I was searching for an efficient solution and stumbled over this question and thi-duong-nguyen's remark that renaming many files is very slow. I implemented a Java solution for batch rename operations which I can highly recommend, since it is orders of magnitude faster. The basic idea is to use org.apache.hadoop.fs.FileSystem's rename() method:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://master:8020");
FileSystem dfs = FileSystem.get(conf);
dfs.rename(from, to);
where from and to are org.apache.hadoop.fs.Path objects. The easiest way is to create a list of files to be renamed (including their new name) and feed this list to the Java program.
I have published the complete implementation, which reads such a mapping from STDIN. It renamed 100 files in less than four seconds (the same time was required to rename 7000 files!), whereas the hdfs dfs -mv based approach described before requires 4 minutes to rename 100 files.
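To produce such a mapping, a shell one-liner in the spirit of the earlier answer works; this is only a sketch, and the jar and class names below are placeholders rather than the published implementation's actual names:
hadoop fs -ls /user/foo/bar | awk '!/^d/ && $8 !~ /\.lzo(\.index)?$/ {print $8, $8".lzo"}' | hadoop jar batch-rename.jar BatchRename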

We created a utility to do bulk renaming of files in HDFS: https://github.com/tenaris/hdfs-rename. The tool is limited, but if you want you can contribute to improve it with recursion, awk regex syntax and so on.

Related

Writing output to a text file using the Hadoop grep command

I have a file in HDFS - /user//SimpleDir/SimpleFile.txt and I'm trying to use the grep command to search for "MapReduce" in that file and print the results to a different file (simpleoutput.txt) in the same directory.
Any help is much appreciated!
You can try the following command:
hadoop fs -cat /user/SimpleDir/SimpleFile.txt | grep -i Mapreduce > /user/SimpleDir/simpleoutput.txt
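Note that the > redirection above writes simpleoutput.txt to the local filesystem, not to HDFS. If the output should land in the same HDFS directory, as the question asks, one option (a sketch relying on put reading from stdin when the source is -) is:
hadoop fs -cat /user/SimpleDir/SimpleFile.txt | grep -i MapReduce | hadoop fs -put - /user/SimpleDir/simpleoutput.txt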

How to delete the most recently created files in multiple HDFS directories?

I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.
For a single day, I may have 3 files as such, and I want to only remove the newfile. I can tell it's new because of the update timestamp when I use hadoop fs -ls:
/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801
I have many dates, so I'll have to complete this for event_date2, event_date3, etc etc, always removing the 'new_file_20191114' from each date.
The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.
I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.
As mentioned, you have already got the list of files that need to be deleted. Create a simple script and redirect the output to a temp file, like this:
hdfs dfs -ls /tmp | sort -k6,7 > files.txt
Please note that sort -k6,7 lists all the files in ascending timestamp order, so the newest files end up at the bottom. You surely don't want to delete everything, so you can select just the n newest files that need to be deleted, say 100, by updating the command to:
hdfs dfs -ls /tmp | sort -k6,7 | tail -100 | awk '{print $8}' > files.txt
Or, if you know the exact timestamp of your new files, you can try the command below:
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
Then read that file and delete the files one by one:
while read file; do
    hdfs dfs -rm "$file"
    echo "Deleted $file" >> deleted_files.txt   # this is to track which files have been deleted
done < files.txt
So your complete script can look like this:
#!/bin/bash
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
while read file; do
    hdfs dfs -rm "$file"
    echo "Deleted $file" >> deleted_files.txt   # this is to track which files have been deleted
done < files.txt
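Since the new files in the question all share a distinctive name suffix (newfile_20191114), a sketch of an alternative is to filter on the filename instead of the timestamp, which covers every event_date in one pass (the directory below is the question's example path; add -R to -ls if the dates live in nested partition directories):
hdfs dfs -ls /this/is/my_directory/ | awk '/newfile_20191114/ {print $8}' > files.txt
The same while loop as above can then delete everything listed in files.txt.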

Copy files from local to HDFS in alphabetical order - Sort

I need to copy files from local file system to HDFS through shell script. Suppose I have two files in my local system
fewInfo.tsv.gz
fewInfo.txt
In the above case, fewInfo.tsv.gz should be copied first (s comes before x) to HDFS and then fewInfo.txt should be copied. Is this possible?
Anyone aware of the internal structure as to how the "put" command works when multiple files are being copied to HDFS?
Hadoop version I am using is Hadoop 2.5.0-cdh5.3.1.
You could loop through the directory in order to find all files, sort the files and then execute the hdfs copy. The advantage would be that you can specify the constraints for the sort (e.g. by filename, date, order, etc.). There are many options to perform this. One would be to use the find command:
find /some/directory -maxdepth 1 -type f | sort | while IFS= read -r filename; do hdfs dfs -copyFromLocal "$filename" hdfs://target/dir/; done
-maxdepth 1 argument prevents find from recursively descending into any subdirectories. (If you want such nested directories to get processed, you can omit this.)
-type f specifies that only plain files will be processed.
sort defines that the found files will be sorted. Here you have the possibility to extend by reverse order, sort for modification date, etc.
while IFS= read -r filename loops through the found files. Setting IFS to the empty string in the loop preserves leading and trailing whitespace in the filenames. The -r option prevents read from treating backslashes as escape characters.
hdfs dfs -copyFromLocal "$filename" hdfs://target/dir/ takes the sorted filenames and copies them from the local directory to the HDFS directory. Alternatively, you can also use hadoop fs -put "$filename" hdfs://target/dir/.
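If the copy order should follow modification date rather than the filename, a sketch assuming GNU find (-printf is a GNU extension) prefixes each path with its epoch timestamp, sorts numerically and strips the prefix again:
find /some/directory -maxdepth 1 -type f -printf '%T@ %p\n' | sort -n | cut -d' ' -f2- | while IFS= read -r filename; do hdfs dfs -copyFromLocal "$filename" hdfs://target/dir/; done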

How to find if there are new files in a directory on HDFS (Hadoop) every 4 min using shell script

I have a directory on HDFS, e.g. /user/customers. In this directory I am dumping a customer data file every 3 minutes. I want to write a shell script which will check this folder, and if a new file is available, put that file's data into HBASE (I have already figured out how to put the data into HBASE). But I am very new to shell scripting, and I want to know how I can get the new file name.
My hadoop command to put the data of file in HBASE is as follows:
hadoop jar /opt/mapr/hbase/hbase-0.94.12/hbase-0.94.12-mapr-1310.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cust:phno,cust:name,cust:memebershiptype /user/tablename customer.csv
Now the idea is to replace this customer.csv file name with the name of the file that was most recently dumped in the folder, and then run this command.
So, if I am not wrong, I will need a cron job to do the scheduling part. But first I need the logic for getting the new file name into the above command. The later part to learn is crontab, for scheduling it every 4 minutes.
Please guide me, experts.
Try this script; it will give you the idea. Basically, I first list out the files and store them in customer_all_files.txt. The diff command finds the new files and stores them in need_to_process.txt. The for loop then passes each new file name to the import and records it in the list of already processed files. It's very simple, go through it.
hadoop fs -ls hdfs://IPNamenode/user/customers/ | sed '1d;s/  */ /g' | cut -d' ' -f8 | xargs -n 1 basename > /home/givepath/customer_all_files.txt
diff /home/givepath/customer_all_files.txt /home/givepath/customer_processedfiles.txt > /home/givepath/need_to_process.txt
for line in `awk '{ print $2 }' /home/givepath/need_to_process.txt`;
do
echo "$line"
hadoop jar /opt/mapr/hbase/hbase-0.94.12/hbase-0.94.12-mapr-1310.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cust:phno,cust:name,cust:memebershiptype /user/tablename $line
echo "$line" >> /home/givepath/customer_already_processedfiles.txt
done
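If parsing the diff output ever proves fragile, a sketch of an alternative (using the same hypothetical /home/givepath files) is to keep both lists sorted and let comm print only the names that have not been processed yet:
comm -23 <(sort /home/givepath/customer_all_files.txt) <(sort /home/givepath/customer_processedfiles.txt) > /home/givepath/need_to_process.txt
With comm the file contains plain names, so the loop can read it with a simple while read line instead of awk '{ print $2 }'.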
Renaming part:
Do all your csv files have the same name, customer.csv? If yes, you need to rename them while uploading each file into HDFS.
Crontab part:
You can run your shell script every 4 minutes by using:
*/4 * * * * /your/shell/script/path
Add this line by typing crontab -e in terminal.

Why is there no 'hadoop fs -head' shell command?

A fast method for inspecting files on HDFS is to use tail:
~$ hadoop fs -tail /path/to/file
This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command head does not appear to be part of the shell command collections. I find this very surprising.
My hypothesis is that since HDFS is built for very fast streaming reads of very large files, there is some access-oriented issue that affects head. This makes me hesitant to try reading the head of a file at all. Does anyone have an answer?
I would say it's more to do with efficiency - a head can easily be replicated by piping the output of a hadoop fs -cat through the linux head command.
hadoop fs -cat /path/to/file | head
This is efficient, as head will close the underlying stream after the desired number of lines has been output.
Using tail in this manner would be considerably less efficient - as you'd have to stream over the entire file (all HDFS blocks) to find the final x number of lines.
hadoop fs -cat /path/to/file | tail
The hadoop fs -tail command as you note works on the last kilobyte - hadoop can efficiently find the last block and skip to the position of the final kilobyte, then stream the output. Piping via tail can't easily do this.
Starting with version 3.1.0 we now have it:
Usage: hadoop fs -head URI
Displays first kilobyte of the file to stdout.
See the Hadoop FileSystemShell documentation.
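For example (the path is just a placeholder):
~$ hadoop fs -head /path/to/file
On clusters older than 3.1.0, the cat | head pipe shown above remains the way to get the same effect.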
hdfs dfs -cat /path | head
is a good way to solve the problem.
You can try the following command:
hadoop fs -cat /path | head -n 10
where 10 can be replaced with the number of lines to view.
In Hadoop v2:
hdfs dfs -cat /file/path | head
In Hadoop v1 and v3:
hadoop fs -cat /file/path | head
