Write to different files using hadoop streaming - hadoop

I'm currently processing about 300 GB of log files on a 10-server Hadoop cluster. My data is saved in folders named YYYYMMDD so each day can be accessed quickly.
My problem is that I just found out today that the timestamps in my log files are in DST local time (GMT -0400) instead of UTC as expected. In short, this means that logs/20110926/*.log.lzo contains elements from 2011-09-26 04:00 to 2011-09-27 20:00 and it's pretty much ruining any map/reduce job run on that data (e.g. generating statistics).
Is there a way to do a map/reduce job to re-split every log file correctly? From what I can tell, there doesn't seem to be a way using streaming to send certain records to output file A and the rest of the records to output file B.
Here is the command I currently use:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-D mapred.reduce.tasks=15 -D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
-mapper map-ppi.php -reducer reduce-ppi.php \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-file map-ppi.php -file reduce-ppi.php \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"
I don't know anything about Java or creating custom classes. I did try the code posted at http://blog.aggregateknowledge.com/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ (pretty much copy/pasted what was on there) but I couldn't get it to work at all. No matter what I tried, I would get a "-outputformat : class not found" error.
Thank you very much for your time and help :).

"From what I can tell, there doesn't seem to be a way using streaming to send certain records to output file A and the rest of the records to output file B."
By using a custom Partitioner, you can specify which key goes to which reducer. By default the HashPartitioner is used. It looks like the only other Partitioner that Streaming supports out of the box is KeyFieldBasedPartitioner.
The Hadoop Streaming documentation has more details about using the KeyFieldBasedPartitioner. You do not need to know Java to configure the KeyFieldBasedPartitioner with Streaming.
"Is there a way to do a map/reduce job to re-split every log file correctly?"
You should be able to write an MR job to re-split the files, but I think a Partitioner should solve the problem.
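For a partitioner to route records by day, the mapper first has to emit the corrected day as the key. Here is a minimal sketch of that step in Python. It assumes each log line starts with a 19-character timestamp such as "2011-09-26 23:30:00" written at GMT-0400; the real log layout is not shown in the question, so both the format and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps were written at a fixed GMT-0400 offset.
LOCAL_OFFSET = timezone(timedelta(hours=-4))
# Assumption: timestamps look like "2011-09-26 23:30:00".
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def utc_day_key(local_ts):
    """Return the UTC day (YYYYMMDD) for a timestamp recorded at GMT-0400."""
    local = datetime.strptime(local_ts, TS_FORMAT).replace(tzinfo=LOCAL_OFFSET)
    return local.astimezone(timezone.utc).strftime("%Y%m%d")


def rekey(line):
    """Prefix a log line with its UTC day, tab-separated, as a streaming key."""
    # Assumption: the timestamp is the first 19 characters of the line.
    return utc_day_key(line[:19]) + "\t" + line
```

A streaming mapper would apply rekey to every line read from standard input and print the result, so that the first key field is the UTC day the partitioner can work with.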

A custom MultipleOutputFormat and Partitioner seem like the correct way to split your data by day.
As the author of that post, sorry that you had such a rough time. It sounds like if you were getting a "class not found" error, there was some issue with your custom output format not being found after you included it with "-libjars".


Storm - Writing to HDFS using compression

I want to store all the raw data coming into my Storm topology in an HDFS cluster.
This is JSON or binary data, arriving at a rate of 2k/sec.
I was trying to use the HDFS bolt (http://storm.apache.org/releases/0.10.0/storm-hdfs.html), but the normal HDFS bolt does not allow compression.
Compression is only possible using the Sequence File Bolt.
I don't want to use sequence files, as I don't have a real key.
Plus, I already have Cassandra for storing my key/value data and serving my requests.
Storing my raw data in Cassandra just takes too much disk space (overhead); debating that is not the purpose of this post.
Can anyone help me with that?
Can I use the Java Hadoop client to achieve that?
Does anyone have a code snippet for it?
Okay, there is no way to compress on the fly like I wanted.
But I have found a solution, and I'm sharing it here in case someone needs it.
This problem is not specific to Storm; it is a more general Hadoop question.
All my data is written using the HdfsBolt:
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
// Synchronize the data buffer with the filesystem every 1000 tuples
// (should be made configurable)
SyncPolicy syncPolicy = new CountSyncPolicy(1000);
// Rotate data files when they reach 10 MB
// (should be made configurable)
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(10.0f, FileSizeRotationPolicy.Units.MB);
// Use default, Storm-generated file names
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/datadir/in_progress");
// Instantiate the HdfsBolt; rotated files are moved to /datadir/finished
HdfsBolt bolt = new HdfsBolt()
.withFsUrl("hdfs://namenode:8020") // placeholder URL, not from the original post
.withFileNameFormat(fileNameFormat)
.withRecordFormat(format)
.withRotationPolicy(rotationPolicy)
.withSyncPolicy(syncPolicy)
.addRotationAction(new MoveFileAction().withDestination("/datadir/finished"));
This gives me one file per executor of my bolt. Not easy to handle, but it's okay :)
Then I schedule automatic compression using Hadoop streaming (in a cron job on the namenode or something like that):
hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input /datadir/finished \
-output /datadir/archives \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
Here I still have one issue:
Each input file is compressed into its own archive.
So each 10 MB input file (one per worker) is compressed into a single gzip (or bzip2) file of about 1 MB. This produces a lot of small files, which is a problem in Hadoop.
To solve this issue, I will try to look at the Hadoop archive (HAR) functionality.
I also need to purge the already compressed files from /datadir/finished.
I hope to get some feedback from you guys.
Keep in touch

Mapreduce - Right way to confirm whether the file is split or not

We had a lot of XML files and we wanted to process each XML file in a single mapper task, for the obvious reason that it makes the processing (parsing) simpler.
We wrote a MapReduce program to achieve that by overriding the isSplitable method of the input format class. It seems to be working fine.
However, we wanted to confirm that one mapper is used to process one XML file. Is there a way to confirm this by looking at the logs produced by the driver program, or in any other way?
To answer your question: just check the mapper count.
It should be equal to your number of input files.
Example: if the job has three input files, the mapper count should be 3.
Here is the command:
mapred job -counter job_1449114544347_0001 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS
You can get many more details using the mapred job -counter command. You can check videos 54 and 55 from this playlist. They cover counters in detail.

Spark/Hadoop throws exception for large LZO files

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.:
In the spark-shell I'm running a job that counts the lines in the files. If I count the lines individually for each file, there is no problem, e.g. like this:
// Works fine
If I use a wild-card to load all the files with a one-liner, I get two kinds of exceptions.
// One-liner throws exceptions
The exceptions are:
java.lang.InternalError: lzo1x_decompress_safe returned: -6
at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
java.io.IOException: Compressed length 1362309683 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)
It seems to me that the solution is hinted by the text given with the last exception, but I don't know how to proceed. Is there a limit to how big LZO files are allowed to be, or what is the issue?
My question is: Can I run Spark queries that load all LZO-compressed files in an S3 folder, without getting I/O related exceptions?
There are 66 files of roughly 200MB per file.
The exception only occurs when running Spark with Hadoop2 core libs (AMI 3.1.0). When running with Hadoop1 core libs (AMI 2.4.5), things work fine. Both cases were tested with Spark 1.0.1.
kgeyti's answer works fine, but:
LzoTextInputFormat introduces a performance hit, since it checks for an .index file for each LZO file. This can be especially painful with many LZO files on S3 (I've experienced up to several minutes delay, caused by thousands of requests to S3).
If you know up front that your LZO files are not splittable, a more performant solution is to create a custom, non-splittable input format:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
class NonSplittableTextInputFormat extends TextInputFormat {
  // Never split a file: each LZO file is read from start to end by one task
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}
and read the files like this:
I haven't run into this specific issue myself, but it looks like .textFile expects files to be splittable, much like Cedrik's problem of Hive insisting on using CombineFileInputFormat.
You could either index your lzo files, or try using the LzoTextInputFormat - I'd be interested to hear if that works better on EMR:
.map(_._2.toString) // if you just want a RDD[String] without writing a new InputFormat
Yesterday we deployed Hive on an EMR cluster and had the same problem with some LZO files in S3 that had been read without any problem by another, non-EMR cluster. After some digging in the logs I noticed that the map tasks read the S3 files in 250 MB chunks, although the files are definitely not splittable.
It turned out that the parameter mapreduce.input.fileinputformat.split.maxsize was set to 250000000 (~250 MB). That resulted in LZO opening a stream from the middle of a file and ultimately hitting a corrupt LZO block.
I set the parameter mapreduce.input.fileinputformat.split.maxsize=2000000000, which is larger than the maximum file size of our input data, and everything works now.
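The effect of that limit can be pictured with a small Python sketch. It is a deliberately simplified model that assumes the reader cuts a file purely by the maximum split size; the real split computation also takes block size and minimum split size into account, and the function names are made up for illustration.

```python
import math


def splits_per_file(file_sizes, max_split_size):
    """Number of chunks each file is cut into if only max_split_size matters."""
    return [max(1, math.ceil(size / max_split_size)) for size in file_sizes]


def safe_max_split_size(file_sizes):
    """Smallest limit that keeps every file in a single chunk."""
    return max(file_sizes)
```

Under this model a 600 MB file with a 250000000 limit is cut into three chunks, and two of them start in the middle of the LZO stream, while any limit at or above the largest file size leaves every file whole.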
I'm not sure exactly how this carries over to Spark, but changing the InputFormat might help, since that seems to be the problem in the first place, as mentioned in How Amazon EMR Hive Differs from Apache Hive.

How to control the number of hadoop streaming output files

Here is the detail:
The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output.
In the input path, there are 20,000 files from part-00000 to part-19999, each file is about 64MB.
What I want to do is to write a hadoop streaming job to merge these 20,000 files into 10,000 files.
Is there a way to merge these 20,000 files to 10,000 files using hadoop streaming job? Or, in other words, Is there a way to control the number of hadoop streaming output files?
Thanks in advance!
It looks like right now you have a map-only streaming job. The behavior with a map-only job is to have one output file per map task. There isn't much you can do about changing this behavior.
You can exploit the way MapReduce works by adding the reduce phase so that it has 10,000 reducers. Then, each reducer will output one file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000... it won't be just two files concatenated. To do this, use the -D mapred.reduce.tasks=10000 flag in your command line args.
This is probably the default behavior, but you can also specify the identity reducer as your reducer. This doesn't do anything other than pass on the record, which is what I think you want here. Use this flag to do this: -reducer org.apache.hadoop.mapred.lib.IdentityReducer
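To picture why the records end up scattered rather than concatenated file by file, here is a rough Python sketch of hash partitioning. It uses CRC32 as a stand-in hash, which is an assumption for illustration and not the hash Hadoop itself applies; the function names are hypothetical as well.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick a reducer index for a key (stand-in hash, not Hadoop's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def output_file_counts(keys, num_reducers):
    """Count how many records would land in each reducer's output file."""
    return Counter(partition(k, num_reducers) for k in keys)
```

Each key always maps to the same reducer, but neighbouring records from one input file generally map to different reducers, which is why the output is a reshuffle of the input and not a pairwise merge.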

Hadoop Streaming with SequenceFile (on AWS)

I have a large number of Hadoop SequenceFiles which I would like to process using Hadoop on AWS. Most of my existing code is written in Ruby, and so I would like to use Hadoop Streaming along with my custom Ruby Mapper and Reducer scripts on Amazon EMR.
I cannot find any documentation on how to integrate Sequence Files with Hadoop Streaming, and how the input will be provided to my Ruby scripts. I'd appreciate some instructions on how to launch jobs (either directly on EMR, or just a normal Hadoop command line) to make use of SequenceFiles and some information on how to expect the data to be provided to my script.
--Edit: I had previously referred to StreamFiles rather than SequenceFiles by mistake. I think the documentation for my data was incorrect, but apologies. The answer is easy with the change.
The answer to this is to specify the input format as a command line argument to Hadoop.
-inputformat SequenceFileAsTextInputFormat
Chances are that you want the SequenceFile as text, but there is also SequenceFileAsBinaryInputFormat if that's more appropriate.
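As for what the script receives: my understanding is that with the text variant each record arrives on standard input as one line, with the key and value converted to strings and separated by a tab. Treat that as an assumption to verify against your data. A minimal sketch of the parsing step, written in Python for brevity (the same split works in a Ruby script), with a hypothetical function name:

```python
def parse_record(line):
    """Split one streaming input line into (key, value) at the first tab."""
    key, _sep, value = line.rstrip("\n").partition("\t")
    return key, value
```

Splitting only at the first tab keeps any tabs inside the value intact, and a line without a tab yields an empty value.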
Not sure if this is what you're asking for, but the command to use ruby map reduce scripts with the hadoop command line would look something like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
You can (and should) use a combiner with big data sets. Add it with the -combiner option. The combiner runs on the mapper output, and its output is what gets sent to the reducers (but there is no guarantee how many times it will be called, if at all). Your input is split (according to the standard Hadoop protocol) and each split feeds directly into a mapper. The example is from O'Reilly's Hadoop: The Definitive Guide, 3rd Edition. It has some very good information on streaming, and a section dedicated to streaming with Ruby.
