Hadoop RC file format: merge small files in HDFS

I am looking for a way to combine the small RC files generated by a MapReduce program.
What is the best way to merge small RC files into one large RC file?

You can try the getmerge command. It takes a source directory and a destination file as input and concatenates the files in the source directory into the destination file.
For example, if the Hive table name is search_combined_rc, you can get the combined RC file as a single file.
hadoop fs -getmerge /user/hive/warehouse/dev.db/search_combined_rc/ /localdata/destinationfilename
Since RCFiles cannot be opened with the tools that open typical sequence files, you can use the rcfilecat tool to display the contents of RCFiles. You will need to move the file back from the local directory to HDFS first.
hive --service rcfilecat /hdfsfilelocation
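Putting the pieces together, a minimal round trip following the approach above might look like this; the local path and the merged file name are placeholders for this example.
# merge all RC files under the table directory into one local file
hadoop fs -getmerge /user/hive/warehouse/dev.db/search_combined_rc/ /localdata/search_combined_rc_merged
# move the merged file back from the local filesystem into HDFS
hadoop fs -put /localdata/search_combined_rc_merged /user/hive/warehouse/dev.db/search_combined_rc_merged
# inspect the contents of the merged RCFile
hive --service rcfilecat /user/hive/warehouse/dev.db/search_combined_rc_merged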

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of the local file system?

I am learning Hadoop and have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use a single reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being limited to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
There is no easy way to do this directly in HDFS, but the trick below works. It is not an ideal solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
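For example, to collapse a job's part files into a single HDFS file (the paths below are only illustrative):
# stream every part file and write the combined stream back into HDFS as one file
hadoop fs -cat /user/hadoop/job_output/part-* | hadoop fs -put - /user/hadoop/job_output_merged
# sanity-check the result
hadoop fs -ls /user/hadoop/job_output_merged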

What does copyToLocal in the hadoop environment return?

I have a table in HDFS with the current path of /apps/hive/warehouse/ratings. I tried to download this to my local file system with the copyToLocal function in Hadoop.
The call worked and showed no errors, but when I go to check, the downloaded table is just a folder containing a file.
Do you know what is the proper function call to download the table from HDFS as a CSV file?
This is the command that I am using at the moment
hadoop fs -copyToLocal /apps/hive/warehouse/ratings /home/maria_dev
This was to check what type of file I had.
You can try
hadoop fs -get /apps/hive/warehouse/ratings /home/maria_dev
And after the file is in your local file system, you can rename it to whatever you want and add your preferred file extension.
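If the table is stored as plain delimited text (Hive's default for text tables), one possible way to end up with a CSV is the sketch below; the ^A-to-comma conversion is an assumption about the table's format and will not work for other storage formats or delimiters.
# merge the table's part files into one local file
hadoop fs -getmerge /apps/hive/warehouse/ratings /home/maria_dev/ratings_raw
# convert Hive's default ^A (\001) field delimiter to commas (assumes a default text table)
tr '\001' ',' < /home/maria_dev/ratings_raw > /home/maria_dev/ratings.csv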

How to edit a txt file inside HDFS from the terminal?

Is there any way to modify a txt file inside HDFS directly from the terminal?
Assume I have "my_text_file.txt" and I would like to modify it inside HDFS using a command like the one below.
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know what "XXXX" could be, if any such command exists.
Please note that I don't want to make modification in local and then copy it to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS works on the principle of "write once, read many". So if you want to edit a file, make the changes in a local copy and then move it back to HDFS.
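A minimal sketch of that workflow, using the file name from the question (the local path is a placeholder):
# copy the file from HDFS to the local filesystem
hdfs dfs -get user/my_text_file.txt /tmp/my_text_file.txt
# edit the local copy with any editor
vi /tmp/my_text_file.txt
# push the edited copy back, overwriting the original (-f forces the overwrite)
hdfs dfs -put -f /tmp/my_text_file.txt user/my_text_file.txt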
Currently, as explained by @BruceWayne, this is not possible. It would be very difficult to edit files stored in HDFS because the files are distributed across the cluster, and the HDFS shell commands available in the terminal do not support in-place editing.
You could edit a file by locating its block data on each datanode in the cluster, but that would be troublesome.
Alternatively, you can install HUE. With HUE you can edit files in HDFS through a web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". But nowadays you can edit files using the Hue file browser in Cloudera.

How does getMerge work in Hadoop?

I would like to know how the getmerge command works at the OS/HDFS level. Will it copy each and every byte/block from one file to another file, or is it just a simple file descriptor change? How costly an operation is it?
getmerge
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
So, to answer your question,
Will it copy each and every byte/block from one file to another file
Yes, and no. It will find every HDFS block containing the files in the given source directory and concatenate them together into a single file on your local filesystem.
a simple file descriptor change
Not sure what you mean by that. getmerge doesn't change any file descriptors; it is just reading data from HDFS to your local filesystem.
How costly an operation is it?
Expect it to be as costly as manually cat-ing all the files in an HDFS directory. The same operation for
hadoop fs -getmerge /tmp/ /home/user/myfile
could be achieved by doing
hadoop fs -cat /tmp/* > /home/user/myfile
The costly part is fetching many file pointers and transferring those records over the network to your local disk.
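As a side note, the optional addnl argument from the usage above appends a newline after each concatenated file, which helps when the part files do not end with one. In recent Hadoop releases this is exposed as the -nl flag rather than a positional argument:
# add a newline after each file while merging (paths are illustrative)
hadoop fs -getmerge -nl /tmp/ /home/user/myfile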

Is there a tool to continuously copy contents of a directory to HDFS as they are?

I tried using Flume's spooling directory source and HDFS sink. But this does not serve my purpose because the files are read by Flume and then written to HDFS as part files that can be rolled by size/time (please correct me if I've got this wrong). Is there a tool that continuously does something like an HDFS put on all files that are dumped into the spool directory?
If I understood your question correctly, you have a directory into which files keep arriving, and you want to move those files to HDFS without reading them. HDFS copyFromLocal will solve your issue; you just need some logic that finds the recent files in the directory and runs the copyFromLocal command to copy them into HDFS.
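A naive sketch of such logic is a shell loop that periodically pushes new files and marks them as uploaded; the directory paths, the polling interval, and the .uploaded marker convention are all assumptions made for this example.
#!/bin/bash
# watch a local spool directory and push any new file into HDFS
SPOOL_DIR=/data/spool
HDFS_DIR=/user/hadoop/landing
while true; do
  for f in "$SPOOL_DIR"/*; do
    [ -f "$f" ] || continue
    # never upload the marker files themselves
    case "$f" in *.uploaded) continue ;; esac
    # skip files that were already uploaded (tracked with a local marker file)
    [ -e "$f.uploaded" ] && continue
    hadoop fs -copyFromLocal "$f" "$HDFS_DIR"/ && touch "$f.uploaded"
  done
  sleep 10
done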
