Get last 5 lines of a file in Hadoop (HDFS) - hadoop

I have several files in my Hadoop cluster (on HDFS). I want to see the last 5 lines of every file. Is there a simple command to do so?

If you want to see exactly the last 5 lines of a file in HDFS (no more and no less), you can use the following command, but it's not very efficient because the whole file is streamed to the client:
hadoop fs -cat /your/file/with/path | tail -n 5
Here's a more efficient command within hadoop, but it returns the last kilobyte of the data, not a user-specified number of lines:
hadoop fs -tail /your/file/with/path
Here's a reference to the hadoop tail command: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#tail
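Since the question asks about every file, the first command can be wrapped in a loop. This is only a minimal sketch: it assumes the files sit directly under one HDFS directory, that the path is the eighth column of the hadoop fs -ls output, and that paths contain no spaces. The function name hdfs_tail_each is made up for illustration and is not a Hadoop command.

```shell
# Print the last N lines (default 5) of every file directly under an HDFS directory.
# hdfs_tail_each is a hypothetical helper name, not part of Hadoop.
hdfs_tail_each() {
    dir=$1
    n=${2:-5}
    # Lines starting with "-" are regular files; column 8 is assumed to be the path
    hadoop fs -ls "$dir" | awk '/^-/ {print $8}' | while read -r path; do
        echo "==> $path <=="
        # Streams the whole file through the client, so this is slow for large files
        hadoop fs -cat "$path" < /dev/null | tail -n "$n"
    done
}
```

Called as hdfs_tail_each /your/dir, it prints a header line for each file followed by its last 5 lines; a second argument changes the line count.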

Related

Hadoop error when outputting the grep results to a new file in a different directory

I'm trying to read the contents of a few files, use grep to find the lines matching my search query, and then output the results into a folder in another directory. I get the error "No such file or directory". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hadoop fs -cat sends output to your local machine, and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data; it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then write the new file back to HDFS, I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, using something like Spark or Hive to transform the data efficiently. Failing that, just do a hadoop fs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked:
hadoop fs -cat /FinalDataset/c*.txt | grep '2015-01-*' | hadoop fs -put - /energydata/2015/01/output.txt

Splitting a file on Hadoop

I have an 8.8G file on the Hadoop cluster from which I'm trying to extract certain lines for testing purposes.
Since Apache Hadoop 2.6.0 has no split command, how can I do this without having to download the file?
If the file was on a linux server I would've used:
$ csplit filename %2015-07-17%
The previous command works as desired, is something close to that possible on Hadoop?
You could use a combination of unix and hdfs commands.
hadoop fs -cat filename.dat | head -n 250 > /redirect/filename
Or, if the last kilobyte of the file suffices, you could use this:
hadoop fs -tail filename.dat > /redirect/filename
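To get closer to what csplit filename %2015-07-17% does (skip everything before the first matching line and keep the rest), the same pipe approach can be combined with sed. This is a rough sketch, assuming the pattern contains no / characters and that the result should land back on HDFS; hdfs_from_pattern is a hypothetical helper name. The data still streams through the client machine, but nothing is written to local disk.

```shell
# Copy everything from the first line matching a pattern to the end of the file
# into a new HDFS file, without writing to local disk.
# hdfs_from_pattern is a hypothetical helper name, not part of Hadoop.
hdfs_from_pattern() {
    src=$1
    pattern=$2
    dest=$3
    # sed -n prints only the range from the first matching line through the last line ($)
    hadoop fs -cat "$src" | sed -n "/$pattern/,\$p" | hadoop fs -put - "$dest"
}
```

For example, hdfs_from_pattern filename.dat '2015-07-17' filename-test.dat, where filename-test.dat is a placeholder output name.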

processing result of hdfs command output

It is probably a question about stream processing, but I am not able to find an elegant solution using awk.
I am running an m/r job scheduled to run once a day, but there can be multiple HDFS directories on which it needs to run. For example, if 3 input directories were uploaded to HDFS for the day, then 3 m/r jobs need to run, one for each directory.
So I need a solution where I can extract filenames from the results of:
hdfs dfs -ls /user/xxx/17-03-15*
Then iterate over the filenames, launching one m/r job for each file.
Thanks
Browsing more on the issue, I found that Hadoop provides a configuration setting for this. Here are details.
Also, I was just having a syntax issue, and this simple awk command did what I wanted:
files=$(hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}')
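The iteration step from the question could then look roughly like the sketch below. It assumes the path is the eighth column of the listing and that paths contain no spaces; run_per_hdfs_path is a hypothetical helper name, and the command it runs is whatever the caller supplies.

```shell
# Run a command once per HDFS path matching a pattern.
# run_per_hdfs_path is a hypothetical helper name; the job command is supplied by the caller.
run_per_hdfs_path() {
    pattern=$1
    shift
    # Skip the "Found N items" header, which has fewer than eight columns
    hdfs dfs -ls "$pattern" | awk 'NF >= 8 {print $8}' | while read -r path; do
        # "$@" is the job command; the path is appended as its last argument
        "$@" "$path" < /dev/null
    done
}
```

A dry run such as run_per_hdfs_path '/user/hduser/17-03-15*' echo would launch prints one line per path; replacing the echo part with the real job command launches one job per path. Quoting the pattern leaves the glob expansion to HDFS rather than the local shell.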

Editing a multi million row file on Hadoop cluster

I am trying to edit a large file on Hadoop cluster and trim white spaces and special characters like ¦,*,#," etc from the file.
I don't want to copyToLocal and use sed, as I have thousands of such files to edit.
MapReduce is perfect for this. Good thing you have it in HDFS!
You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.
$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input MyLargeFiles \
-output outputdir \
-mapper "sed ..."
This will fire up a MapReduce job that applies your sed command to every line in the entire file. Since there are 1000s of files, you'll have several mapper tasks hitting the files at once. The data will also go right back into the cluster.
Note that I set the number of reducers to 0 here. That's because a reducer is not really needed. If you want your output to be one file, then use one reducer, but don't specify -reducer. I think that uses the identity reducer and effectively just creates one output file with one reducer. The mapper-only version is definitely faster.
Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream it through the node and push it back up without hitting disk. Here's an example:
$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt
The - in hadoop fs -put tells it to take data from stdin instead of a file.
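The sed expression is left open above. Purely as an illustration, a filter along these lines would strip the characters named in the question and trim whitespace at both ends of each line. The character list and the reading of "trim" as leading and trailing whitespace are assumptions, and clean_lines is a hypothetical name.

```shell
# Hypothetical line filter: remove the characters ¦ * # " and trim
# leading and trailing whitespace on every line read from stdin.
clean_lines() {
    # ¦ is handled in its own expression so the multi-byte character
    # is matched as a whole regardless of locale
    sed -e 's/¦//g' \
        -e 's/[*#"]//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//'
}
```

It slots into the pipe version as hadoop fs -cat MyLargeFile.txt | clean_lines | hadoop fs -put - outputfile.txt. With Hadoop Streaming the shell function itself is not visible to the mapper tasks, so the same sed expressions would have to go into a script that is shipped with the job.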

How can I concatenate two files in hadoop into one using Hadoop FS shell?

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again, but I still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them but have no luck with the getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error comes from trying to redirect the standard output of the command back to HDFS; the shell redirect writes to the local file system instead. There are ways you can do this, using the hadoop fs -put command with the source argument being a hyphen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS.
Unfortunately, there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that:
in a custom MapReduce job with a single reducer and a custom mapper and reducer that retain the file ordering (remember each line will be sorted by key, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
via the FsShell commands, depending on your network topology, i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job doing the same (as everything has to go to one machine anyway, so why not your local console?).
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)
Syntax:
for i in $(hadoop fs -ls <folder> | awk '{print $8}'); do hadoop fs -cat "$i/*" | hadoop fs -put - "$i/<outputfilename>"; done
e.g.:
for i in $(hadoop fs -ls my-job-folder | awk '{print $8}'); do hadoop fs -cat "$i/*" | hadoop fs -put - "$i/output.csv"; done
Explanation:
You loop over all the subfolders and cat each folder's contents into an output file on HDFS.
