Hadoop MapReduce: -archives option not unpacking archives

Hope you can help me. I've got a head-scratching problem with Hadoop map-reduce. I've been using the "-files" option successfully on a map-reduce job with Hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.
I have created 3 files - a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the hadoop job via
hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>
The files are indeed there and well-formed:
% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop 6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop 282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt
I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:
String key = Long.toString(lineNumber.get());
String[] files = {
    "alice.txt",
    "bob.zip",
    "claire.tar",
    "bdir",
    "cdir",
    "b1.txt",
    "b2.txt",
    "bdir/b2.txt",
    "c1.txt",
    "c2.txt",
    "cdir/c2.txt"
};
String fName = files[(int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val));
The support routine 'codeFile' is:
private String codeFile(String fName) {
    Vector<String> clauses = new Vector<String>();
    clauses.add(fName);
    File f = new File(fName);
    if (!f.exists()) {
        clauses.add("nonexistent");
    } else {
        if (f.canRead()) clauses.add("readable");
        if (f.canWrite()) clauses.add("writable");
        if (f.canExecute()) clauses.add("executable");
        if (f.isDirectory()) clauses.add("dir");
        if (f.isFile()) clauses.add("file");
    }
    return Joiner.on(',').join(clauses);
}
This uses the Guava 'Joiner' class.
The output values from the mapper look like this:
alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent
So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.

The distributed cache doesn't unpack the archive files into the local working directory of your task - there's a location on each task tracker for the job as a whole, and the archives are unpacked there.
You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
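A minimal sketch of that lookup with the old mapred API (as used in Hadoop 1.x); the field name and the error handling here are illustrative, not taken from the original code:
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

private Path[] localArchives;

public void configure(JobConf job) {
    try {
        // Each entry is the task-tracker-local directory into which one archive was unpacked.
        localArchives = DistributedCache.getLocalCacheArchives(job);
    } catch (IOException e) {
        throw new RuntimeException("could not read distributed cache locations", e);
    }
}
You can then resolve names like "c1.txt" against those local paths instead of against the task's working directory.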
You can use symbolic linking when defining the -files and -archives generic options; a symlink will then be created in the local working directory of the map/reduce tasks, which makes this easier:
hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
-archives ./etc/bob.zip#bob,./etc/claire.tar#claire
And then you can use the fragment names in your mapper when trying to open files in the archive:
new File("bob").isDirectory() == true

Related

Merging small files in hadoop

I have a directory (Final Dir) in HDFS into which some files (ex: 10 MB each) are loaded every minute.
After some time I want to combine all the small files into a large file (ex: 100 MB). But the user is continuously pushing files to Final Dir; it is a continuous process.
So for the first time I need to combine the first 10 files into a large file (ex: large.txt) and save that file back to Final Dir.
Now my question is: how will I get the next 10 files, excluding the first 10 files?
Can someone please help me?
Here is one more alternative. This is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: use your input folder as a buffer to receive small files, push them to a tmp directory in a timely fashion, merge them, and push the result back to input.
step 1: create a tmp directory
hadoop fs -mkdir tmp
step 2: move all the small files to the tmp directory at a point in time
hadoop fs -mv input/*.txt tmp
step 3: merge the small files with the help of the hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
step 4: move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
step 5: remove output
hadoop fs -rm -R output/
step 6: remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script covering steps 2 to 6 and schedule it to run at regular intervals (maybe every minute, based on your need) to merge the smaller files.
Steps to schedule a cron job for merging small files
step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6)
important note: you need to specify the absolute path of the hadoop binary in the script so that cron can find it
#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
step 2: schedule the script with cron to run every minute
a) edit crontab by choosing an editor
>crontab -e
b) add the following line at the end and exit from the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job will be scheduled to run every minute.
Hope this was helpful.
@Andrew pointed you to a solution that was appropriate 6 years ago, in a batch-oriented world.
But it's 2016, you have a micro-batch data flow running and require a non-blocking solution.
That's how I would do it:
create an EXTERNAL table with 3 partitions, mapped on 3 directories, e.g. new_data, reorg and history
feed the new files into new_data
implement a job to run the batch compaction, and run it periodically
Now the batch compaction logic:
make sure that no SELECT query will be executed while the compaction is running, else it would return duplicates
select all files that are ripe for compaction (define your own criteria) and move them from the new_data directory to reorg (see the sketch after this list)
merge the content of all these reorg files, into a new file in history dir (feel free to GZip it on the fly, Hive will recognize the .gz extension)
drop the files in reorg
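A minimal sketch of that select-and-move step using the HDFS FileSystem API; the partition directory layout and the age-based "ripeness" criterion below are just placeholder assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path newData = new Path("/data/blahblah/stage=new_data");
Path reorg = new Path("/data/blahblah/stage=reorg");
// "Ripe" here means older than 10 minutes -- substitute your own criteria.
long cutoff = System.currentTimeMillis() - 10 * 60 * 1000L;
for (FileStatus f : fs.listStatus(newData)) {
    if (f.isFile() && f.getModificationTime() < cutoff) {
        fs.rename(f.getPath(), new Path(reorg, f.getPath().getName()));
    }
}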
So it's basically the old 2010 story, except that your existing data flow can continue dumping new files into new_data while the compaction is safely running in separate directories. And in case the compaction job crashes, you can safely investigate / clean-up / resume the compaction without compromising the data flow.
By the way, I am not a big fan of the 2010 solution based on a "Hadoop Streaming" job -- on one hand, "streaming" has a very different meaning now; on the other hand, "Hadoop Streaming" was useful in the old days but is now off the radar; on the gripping hand [*] you can do it quite simply with a Hive query, e.g.
INSERT INTO TABLE blahblah PARTITION (stage='history')
SELECT a, b, c, d
FROM blahblah
WHERE stage='reorg'
;
With a couple of SET some.property = somevalue before that query, you can define what compression codec will be applied on the result file(s), how many file(s) you want (or more precisely, how big you want the files to be - Hive will run the merge accordingly), etc.
Look into https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties under hive.merge.mapfiles and hive.merge.mapredfiles (or hive.merge.tezfiles if you use TEZ) and hive.merge.smallfiles.avgsize and then hive.exec.compress.output and mapreduce.output.fileoutputformat.compress.codec -- plus hive.hadoop.supports.splittable.combineinputformat to reduce the number of Map containers since your input files are quite small.
[*] very old SF reference here :-)

Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS have no corruption after they are copied. But is checking the checksum necessary?
I read that the client does a checksum before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS, using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, checksum generated on the HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content (most likely the two files were written with different block sizes or bytes-per-checksum settings, which the HDFS file checksum incorporates).
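If you only need to compare content programmatically rather than rely on the block-level checksum, here is a minimal Java sketch along the lines of the md5sum approach above (the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MD5Hash;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Hash the actual byte streams, so the result matches the piped md5sum output.
String md5a = MD5Hash.digest(fs.open(new Path("/project1/file.txt"))).toString();
String md5b = MD5Hash.digest(fs.open(new Path("/project2/file.txt"))).toString();
System.out.println(md5a.equals(md5b) ? "identical content" : "content differs");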
Checksum for a file can be calculated using hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer : Hadoop documentation for more details
So if you want to compare file1 in both Linux and HDFS, you can use the above utility.
I wrote a library with which you can calculate the checksum of local file, just the way hadoop does it on hdfs files.
So, you can compare the checksum to cross check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via API
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(0)
HDFS does a CRC check. For each and every file it creates a .crc file to make sure there is no corruption.

Hadoop WordCount Output

I am new to hadoop and am running some of the examples to become more familiar with it. I ran wordcount and when I went to check the output hadoop fs -cat outt I got 3 directories instead of the usual one named outt/part-00000. Here are the directories I have:
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 15 2014-07-11 20:13 outt/part-r-00000
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/part-r-00001
When I do hadoop fs -cat outt/_SUCCESS and hadoop fs -cat outt/part-r-00001, nothing appears. However, when I do hadoop fs -cat outt/part-r-00000 I get: record_count 1.
My file just says "Hello World" so I am expecting the result: Hello 1 World 1.
Does anyone know how to get the correct output?
1.) The _SUCCESS and part-r-00000/1 entries are not directories but files. A directory is more like a set of files and other directories.
2.) The _SUCCESS file is automatically created by Hadoop if the submitted job is performed successfully by all the nodes and reducers and the result set is complete.
3.) If you are getting two part files, it implies that you have two reducers in your job description. Check the code for a statement like job.setNumReduceTasks(2);. The part named 00000 is the output of the first reducer and 00001 is the output of the second reducer. The 'r' implies that the output is from a reducer. If you see 'm' instead of 'r', it means you don't have a reducer and the job is a map-only job.
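If you would rather end up with a single part file, a one-line change in the driver is enough (assuming the standard Job setup):
job.setNumReduceTasks(1);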
When you do hadoop fs -cat outt/part-r-00000 and get output like record_count 1, it probably means you are counting the number of lines in the input file rather than the words.
Once you read a line, you need to tokenize it and take each word (token) out of it.
Here is sample code (word is a Text field and one is an IntWritable(1), as in the standard WordCount mapper):
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one);
}
You can find the full code here: WordCount
Here, instead of StringTokenizer, you can use the split method of the Java String API.
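For example, a minimal version of the same loop using split (a sketch, assuming the usual word and one fields of the WordCount mapper):
for (String token : value.toString().split("\\s+")) {
    if (token.isEmpty()) continue; // leading whitespace yields an empty first token
    word.set(token);
    context.write(word, one);
}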

Hadoop MapReduce: with fixed number of input files?

For a Map Reduce job:
My input directory has around 1000 files, and each file contains some GBs of data.
For example, /MyFolder/MyResults/in_data/20140710/ contains 1000 files.
When I give the input path as /MyFolder/MyResults/in_data/20140710, it takes all 1000 files to process.
I would like to run a job taking only 200 files at a time. How can we do this?
Here my command to execute:
hadoop jar wholefile.jar com.form1.WholeFileInputDriver -libjars myref.jar -D mapred.reduce.tasks=15 /MyFolder/MyResults/in_data/20140710/ <<Output>>
Can anyone help me with how to run a job with a batch size for the input files?
Thanks in advance
-Vim
A simple way would be to modify your driver to take only 200 files as input out of all the files in that directory. Something like this:
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] files = fs.globStatus(new Path("/MyFolder/MyResults/in_data/20140710/*"));
for (int i = 0; i < Math.min(200, files.length); i++) {
    FileInputFormat.addInputPath(job, files[i].getPath());
}

Hadoop copy a directory?

Is there an HDFS API that can copy an entire local directory to the HDFS? I found an API for copying files but is there one for directories?
Use the Hadoop FS shell. Specifically:
$ hadoop fs -copyFromLocal /path/to/local hdfs:///path/to/hdfs
If you want to do it programmatically, create two FileSystems (one Local and one HDFS) and use the FileUtil class
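A minimal sketch of that programmatic approach (the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem local = FileSystem.getLocal(conf);
FileSystem hdfs = FileSystem.get(conf);
// Recursively copies the local directory into HDFS; 'false' keeps the source in place.
FileUtil.copy(local, new Path("/path/to/local"), hdfs, new Path("/path/to/hdfs"), false, conf);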
I tried copying from the directory using
/hadoop/core/bin/hadoop fs -copyFromLocal /home/grad04/lopez/TPCDSkew/ /export/hadoop1/lopez/Join/TPCDSkew
It gave me an error saying "Target is a directory". I then modified it to
/hadoop/core/bin/hadoop fs -copyFromLocal /home/grad04/lopez/TPCDSkew/*.* /export/hadoop1/lopez/Join/TPCDSkew
and it works.
In Hadoop version:
Hadoop 2.4.0.2.1.1.0-390
(And probably later; I have only tested this specific version as it is the one I have)
You can copy entire directories recursively without any special notation using copyFromLocal e.g.,:
hadoop fs -copyFromLocal /path/on/disk /path/on/hdfs
which works even when /path/on/disk is a directory containing subdirectories and files.
You can also use the put command:
$ hadoop fs -put /local/path hdfs:/path
For programmers, you can also use copyFromLocalFile. Here is an example:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hdfsConfig = new Configuration
val hdfsURI = "hdfs://127.0.0.1:9000/hdfsData"
val hdfs = FileSystem.get(new URI(hdfsURI), hdfsConfig)
val oriPath = new Path("#your_localpath/customer.csv")
val targetFile = new Path("hdfs://your_hdfspath/customer.csv")
hdfs.copyFromLocalFile(oriPath, targetFile)
