I am a newbie to Hadoop and MapReduce.
I need to process a zip file in my project using MapReduce: the input has to be a zip file, and the output may be a zip or a text file.
Can anyone give me a sample or suggest a link for that?
Thanks,
varadhan.S
I am also currently working on Hadoop and MapReduce. There is no need to specify anything special for zip files; Hadoop automatically unzips and processes them. The output I am using, though, is text. I am currently processing a huge number of compressed files, where each tar.gz file contains one text file.
Regards
Balaram
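To make the zip case concrete, here is a minimal map-only sketch (an illustration, not an existing library class, and it assumes the zip entries are text files): the job's input is a plain text file listing the HDFS paths of the zip archives, one per line, and each mapper opens its archive with the standard java.util.zip classes and writes the entries' lines out as text. Since a zip archive is not splittable, each archive is handled by a single map task.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: each input line is the HDFS path of one zip archive.
    // The mapper streams every entry and emits its lines, keyed by "archive!entry".
    public class ZipEntryMapper extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Path zipPath = new Path(value.toString().trim());
        FileSystem fs = zipPath.getFileSystem(context.getConfiguration());

        // Wrap the raw HDFS stream in a ZipInputStream and walk the entries.
        try (ZipInputStream zis = new ZipInputStream(fs.open(zipPath))) {
          ZipEntry entry;
          while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
              continue;
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(zis));
            String line;
            while ((line = reader.readLine()) != null) {
              context.write(new Text(zipPath.getName() + "!" + entry.getName()), new Text(line));
            }
          }
        }
      }
    }

The driver would simply point TextInputFormat (or NLineInputFormat, to get one archive per task) at the path list and set the number of reducers to zero, so the output is plain text.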
Related
I have (huge) zip files (not gzip) sitting on HDFS. These files each contain multiple files. Is there any way, other than pulling them to the local file system, to list the files inside a zip file, like zipinfo does on Linux?
HDFS does not support processing zip files.
I understand that listing the files in a zip may seem too simple a task to write Java for, but you may want to try processing them with MapReduce. Try ZipFileInputFormat.
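If all you actually need is a zipinfo-style listing, a small client that streams only the entry headers from HDFS is enough, and nothing gets copied to the local file system. This is just a sketch (the class name and invocation are assumptions), to be run with hadoop jar so the cluster configuration is on the classpath:

    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Print the entry names of a zip archive stored on HDFS, zipinfo-style,
    // without pulling the archive to the local file system.
    public class HdfsZipLister {
      public static void main(String[] args) throws Exception {
        Path zipPath = new Path(args[0]);   // e.g. /data/archive.zip
        FileSystem fs = zipPath.getFileSystem(new Configuration());
        try (ZipInputStream zis = new ZipInputStream(fs.open(zipPath))) {
          ZipEntry entry;
          while ((entry = zis.getNextEntry()) != null) {
            // Entry sizes are often unavailable when streaming, so only names are printed.
            System.out.println(entry.getName());
          }
        }
      }
    }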
I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, and then transferring it into HDFS using the put command before performing operations on it.
But there may be situations where the file is very large and downloading it to the local file system first is not the right approach. I want the file to be downloaded directly into HDFS without using the local file system at all.
Can anyone suggest the best way to proceed?
If there are any errors in my question, please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept, for example, an input file containing a list of files to be downloaded, then download them and stream out the results. Note that this requires your cluster to have open access to the internet, which is generally not desirable.
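For completeness, here is a minimal sketch of that idea (the class name and arguments are illustrative, not an existing tool): the bytes flow from the HTTP connection straight into an HDFS output stream, and the same copy loop could sit inside a map() method so the download runs on cluster nodes rather than on your workstation.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Stream a download straight into HDFS: nothing is ever written to local disk.
    public class UrlToHdfs {
      public static void main(String[] args) throws Exception {
        String sourceUrl = args[0];        // e.g. http://server.com/my/file
        Path target = new Path(args[1]);   // e.g. /user/me/destination/file

        Configuration conf = new Configuration();
        FileSystem fs = target.getFileSystem(conf);

        try (InputStream in = new URL(sourceUrl).openStream();
             FSDataOutputStream out = fs.create(target, true /* overwrite */)) {
          IOUtils.copyBytes(in, out, 4096);   // copy with a 4 KB buffer; streams closed by try-with-resources
        }
      }
    }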
Is there any way to modify a text file inside HDFS directly from the terminal?
Assume I have "my_text_file.txt" and I would like to modify it inside HDFS using a command like the one below.
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know what "XXXX" would be, if such a command exists.
Please note that I don't want to make the modification locally and then copy the file to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS follows the "write once, read many" model, so if you want to edit a file, make the changes in your local copy and then move it to HDFS.
As explained by #BruceWayne, it is currently not possible. Editing files stored in HDFS from the terminal would be very difficult because the files are distributed across the cluster, and the HDFS shell only supports a fixed set of commands.
You could edit them by locating the block data on each datanode in the cluster, but that would be troublesome.
Alternatively, you can install Hue. With Hue you can edit files in HDFS through the web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". Nowadays, though, you can edit a file using the Hue file browser in Cloudera.
I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!
The solution I've arrived at is to use distcp to bring these files out of their directories before using s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all of your files and pass its path to s3distcp. In the manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument.
More information can be found here.
I am looking for a way to combine small RC files generated by a MapReduce program.
What is the best way to merge small RC files into one large RC file?
You can try the getmerge command. It takes a source directory and a destination file as input and concatenates the files in the source directory into the destination file.
For example, if the Hive table name is search_combined_rc, you can merge its RC files into a single file:
hadoop fs -getmerge /user/hive/warehouse/dev.db/search_combined_rc/ /localdata/destinationfilename
Since RCFiles cannot be opened with the tools that open typical sequence files, you can try the rcfilecat tool to display the contents of RCFiles. You will need to move the file back from the local directory to HDFS first:
hive --service rcfilecat /hdfsfilelocation