Splitting a file on Hadoop - hadoop

I have an 8.8G file on the hadoop cluster that I'm trying to extract certain lines for testing purpose.
Seeing that Apache Hadoop 2.6.0 have no split command, how am I able to do it without having to download the file.
If the file was on a linux server I would've used:
$ csplit filename %2015-07-17%
The previous command works as desired, is something close to that possible on Hadoop?

You could use a combination of unix and hdfs commands.
hadoop fs -cat filename.dat | head -250 > /redirect/filename
Or if last KB of the file is suffice you could use this.
hadoop fs -tail filename.dat > /redirect/filename

Related

Hadoop error when outputting the grep results to a new file in a different directory

I'm trying to read the contents of a few files and using grep find the lines with the my search query and then output the results into a folder in another directory. I get an error "No such file or directory exists". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hdfs fs -cat sends output to your local machine and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data, it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then re-write the new file to HDFS then I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, by using something like Spark or Hive to transform data efficiently. Or failing that just do a hadoop dfs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked
hadoop fs -cat /FinalDataset/c*.txt | grep 2015-01-* | hadoop fs -put - /energydata/2015/01/output.txt

How do I remove a file from HDFS

I am learning Hadoop and I have never worked on Unix before . So, I am facing a problem here . What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
now I am gonna put a ready made file with name file.txt in HDFS
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in hdfs since it shows up on running -ls command.
Now , I want to remove this file from HDFS . How should i do this ? What command should i use?
If you run the command hadoop fs -usage you'll get a look at what commands the filesystem supports and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the commands is simply -rm with -rf specified for recursively removing folders. Read the command descriptions and try them out.

Get last 5 lines of a file in Hadoop (HDFS)

I have several files in my Hadoop cluster (on HDFS). I want to see the last 5 lines of every file. Is there a simple command to do so?
If you want to see the last 5 lines specifically (and not any more or any less) of a file in HDFS, you can use the following command but its not very efficient:
hadoop fs -cat /your/file/with/path | tail -5
Here's a more efficient command within hadoop, but it returns the last kilobyte of the data, not a user-specified number of lines:
hadoop fs -tail /your/file/with/path
Here's a reference to the hadoop tail command : http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#tail

How to read a .deflate file in hadoop

I got some pig generated files with part-r-00000.deflate extension. I know this is a compressed file. How do I generate a normal file in a readable format. When I used hadoop fs -text, I cannot get plaintext output. The output is still binary. How can I fix this problem?
You might be using a quite old Hadoop version (e.g: 0.20.0) in which fs -text can't inflate the compressed file.
As a workaround you may try this one-liner (based on this answer):
hadoop fs -text file.deflate | perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
you can decompress on the fly by using this command
hdfs dfs -text file.deflate | hdfs dfs -put - uncompressed_destination_file

How can I concatenate two files in hadoop into one using Hadoop FS shell?

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
**/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/**
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again -- i still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them but have no luck with the getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to re-direct the standard output of the command back to HDFS. There are ways you can do this, using the hadoop fs -put command with the source argument being a hypen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
Unforntunatley there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of hadoop, that is disabled by default and potentially buggy), without having to copy the files to one machine and then back into HDFS, whether you do that in
a custom map reduce job with a single reducer and a custom mapper reducer that retains the file ordering (remember each line will be sorted by the keys, so you key will need to be some combination of the input file name and line number, and the value will be the line itself)
via the FsShell commands, depending on your network topology - i.e. does your client console have a good speed connection to the datanodes? This certainly is the least effort on your part, and will probably complete quicker than a MR job to do the same (as everything has to go to one machine anyway, so why not your local console?)
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on hdfs and you want to concatenate files in each of those folders, you can use a shell script to do this. (note: this is not very effective and can be slow)
Syntax :
for i in `hadoop fs -ls <folder>| cut -d' ' -f19` ;do `hadoop fs -cat $i/* | suy hadoop fs -put - $i/<outputfilename>`; done
eg:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - $i/output.csv`; done
Explanation:
So you basically loop over all the files and cat each of the folders contents into an output file on the hdfs.

Resources