Hadoop Streaming with SequenceFile (on AWS) - ruby

I have a large number of Hadoop SequenceFiles which I would like to process using Hadoop on AWS. Most of my existing code is written in Ruby, and so I would like to use Hadoop Streaming along with my custom Ruby Mapper and Reducer scripts on Amazon EMR.
I cannot find any documentation on how to integrate Sequence Files with Hadoop Streaming, and how the input will be provided to my Ruby scripts. I'd appreciate some instructions on how to launch jobs (either directly on EMR, or just a normal Hadoop command line) to make use of SequenceFiles and some information on how to expect the data to be provided to my script.
Edit: I had previously referred to StreamFiles rather than SequenceFiles by mistake. I think the documentation for my data was incorrect; apologies. With that correction, the answer is straightforward.

The answer to this is to specify the input format as a command line argument to Hadoop.
-inputformat SequenceFileAsTextInputFormat
The chances are that you want the SequenceFile as text, but there is also SequenceFileAsBinaryInputFormat if that's more appropriate.
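For what it's worth, with SequenceFileAsTextInputFormat each record reaches your Ruby script on stdin as the key's toString(), a tab, and the value's toString(), one record per line. Here is a small sketch (not from the question, and the path argument is just whatever file you want to inspect) using the standard SequenceFile.Reader API to print the same text your mapper should expect:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFilePreview {
    public static void main(String[] args) throws IOException {
        // args[0] is a placeholder: point it at one of your SequenceFiles
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        // Streaming with SequenceFileAsTextInputFormat feeds your mapper
        // key.toString() <TAB> value.toString(), one record per line.
        while (reader.next(key, value)) {
            System.out.println(key.toString() + "\t" + value.toString());
        }
        reader.close();
    }
}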

Not sure if this is what you're asking for, but the command to run Ruby map/reduce scripts from the Hadoop command line would look something like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
You can (and should) use a combiner with big data sets; add it with the -combiner option. The combiner runs on your mapper's output before it goes to the reducers (but there is no guarantee how many times it will be called, if at all). Otherwise your input is split (according to the standard Hadoop splitting behaviour) and fed directly to your mapper. The example is from O'Reilly's Hadoop: The Definitive Guide, 3rd Edition, which has some very good information on streaming, including a section dedicated to streaming with Ruby.

Related

Storm - Writing to HDFS using compression

I want to store all my raw data incoming in my storm topology in a HDFS cluster.
This is JSON or binary data, incoming at a rate of 2k / secs.
I was trying to use the HDFS bolt (http://storm.apache.org/releases/0.10.0/storm-hdfs.html), but the normal HDFS bolt does not allow compression.
Compression is only possible using the Sequence File Bolt.
I don't want to use SequenceFiles, as I don't have a real key.
Plus, I already have Cassandra for storing my key/value data and serving my requests.
It just takes too much disk space (overhead) to store my raw data in Cassandra (debating that is not the objective of this post).
Can anyone help me with that?
Can I use the Java Hadoop client to achieve this?
Does anyone have a code snippet for that?
Okay, there is no way to compress on the fly like I wanted.
But I have found a solution, and I am sharing it here in case someone needs it.
This problem is not specific to Storm; it is a more general Hadoop question.
All my data is written using the HdfsBolt:
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
// Synchronize the data buffer with the filesystem every 1000 tuples
// (this should be made configurable)
SyncPolicy syncPolicy = new CountSyncPolicy(1000);
// Rotate data files when they reach ten MB
// (this should be made configurable)
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(10.0f, FileSizeRotationPolicy.Units.MB);
// Use default, Storm-generated file names
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/datadir/in_progress") ;
// Instantiate the HdfsBolt
HdfsBolt bolt = new HdfsBolt()
.withFsUrl("hdfs://"+dfsHost+":"+dfsPort)
.withFileNameFormat(fileNameFormat)
.withRecordFormat(format)
.withRotationPolicy(rotationPolicy)
.withSyncPolicy(syncPolicy)
.addRotationAction(new MoveFileAction().withDestination("/datadir/finished"));
This gives me one file per executor of my bolt. Not easy to handle, but it's okay :)
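As context for the one-file-per-executor remark, here is a hedged sketch of how such a bolt is typically wired into a topology; the spout id "raw-spout" and the parallelism hint of 4 are placeholders, and TopologyBuilder lives in backtype.storm.topology on Storm 0.10 (org.apache.storm.topology on 1.x):

TopologyBuilder builder = new TopologyBuilder();
// "raw-spout" is a placeholder for the component emitting the raw JSON/binary tuples.
// The parallelism hint (4 here) sets the number of HdfsBolt executors;
// each executor keeps its own open file, hence one file per executor.
builder.setBolt("hdfs-bolt", bolt, 4).shuffleGrouping("raw-spout");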
Then I schedule automatic compression using Hadoop Streaming (from a cron job on the namenode, or something like that):
hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input /datadir/finished \
-output /datadir/archives \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
Here I still have one issue:
Each input file is compressed into its own archive.
So each 10 MB input file (one per worker) is compressed into one gzip (or bzip2) file of about 1 MB. This produces many small files, and small files are a problem in Hadoop.
To solve this issue, I will look at the Hadoop Archive (HAR) functionality.
I also need to purge the already-compressed files in /datadir/finished.
I hope to get feedback from you guys.
Keep in touch
Regards,
Bastien

Hadoop input format for Hadoop Streaming: Wikihadoop InputFormat

I wonder whether there are any differences between the InputFormats for Hadoop and Hadoop Streaming. Do the InputFormats for Hadoop Streaming also work for Hadoop, and vice versa?
I am asking this because I found a special InputFormat for the Wikipedia dump files, the wikihadoop InputFormat, and it is described as an InputFormat for Hadoop Streaming. Why only for Hadoop Streaming, and not for Hadoop?
Best
As far as I know, there is no difference in how inputs are processed between Hadoop streaming jobs and regular MapReduce jobs written in Java.
The inheritance tree for StreamWikiDumpInputFormat is...
* InputFormat
  * FileInputFormat
    * KeyValueTextInputFormat
      * StreamWikiDumpInputFormat
And since it eventually implements InputFormat, it can be used in regular MapReduce jobs.
No. The type of MR job (streaming or Java) is not the criterion for using (or developing) an InputFormat. An InputFormat is just an InputFormat and will work for both streaming and Java MR jobs. It is the type of data you are going to process that determines which InputFormat you use (or develop). Hadoop natively provides different types of InputFormats, which are normally sufficient for your needs, but sometimes your data is in such a state that none of them can handle it.
Having said that, it is still possible to process that data with MR, and this is where you end up writing your own custom InputFormat, such as the one you have mentioned above.
And I don't know why they have emphasized Hadoop Streaming so much. It is just a Java class which does everything an InputFormat should do and implements everything that makes it eligible to do so. #climbage has made a very valid point about this. So it can be used with any MR job, streaming or Java.
There is no difference between the usual input formats and the ones that were developed for Hadoop Streaming.
When the author says that the format was developed for Hadoop Streaming, the only thing she means is that her input format produces objects with meaningful toString methods. That's it.
For example, when I develop an input format for use with Hadoop Streaming, I try to avoid BinaryWritable and use Text instead.
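To illustrate the point (a hedged sketch, not code from wikihadoop): Streaming writes key.toString() and value.toString() to the mapper script's stdin, so a custom Writable only plays nicely with Streaming if its toString produces something a script can parse. PageRecord here is a made-up value type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical value type, for illustration only.
public class PageRecord implements Writable {
    private long pageId;
    private String title;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(pageId);
        out.writeUTF(title);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        pageId = in.readLong();
        title = in.readUTF();
    }

    // Streaming hands this string to the mapper script, so keep it parseable.
    @Override
    public String toString() {
        return pageId + "\t" + title;
    }
}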

Write to different files using hadoop streaming

I'm currently processing about 300 GB of log files on a 10-server Hadoop cluster. My data is saved in folders named YYMMDD so each day can be accessed quickly.
My problem is that I just found out today that the timestamps in my log files are in DST (GMT -0400) instead of UTC as expected. In short, this means that logs/20110926/*.log.lzo contains elements from 2011-09-26 04:00 to 2011-09-27 20:00, and it's pretty much ruining any map/reduce done on that data (i.e. generating statistics).
Is there a way to do a map/reduce job to re-split every log file correctly? From what I can tell, there doesn't seem to be a way using streaming to send certain records to output file A and the rest of the records to output file B.
Here is the command I currently use:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-D mapred.reduce.tasks=15 -D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
-mapper map-ppi.php -reducer reduce-ppi.php \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-file map-ppi.php -file reduce-ppi.php \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"
I don't know anything about java and/or creating custom classes. I did try the code posted at http://blog.aggregateknowledge.com/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ (pretty much copy/pasted what was on there) but I couldn't get it to work at all. No matter what I tried, I would get a "-outputformat : class not found" error.
Thank you very much for your time and help :).
From what I can tell, there doesn't seem to be a way using streaming to send certain records to output file A and the rest of the records to output file B.
By using a custom Partitioner, you can specify which key goes to which reducer. By default the HashPartitioner is used. Looks like the only other Partitioner Streaming supports is KeyFieldBasedPartitioner.
You can find more details about the KeyFieldBasedPartitioner in the context of Streaming here. You need not know Java to configure the KeyFieldBasedPartitioner with Streaming.
Is there a way to do a map/reduce job to re-split every log file correctly?
You should be able to write a MR job to re-split the files, but I think Partitioner should solve the problem.
A custom MultipleOutputFormat and Partitioner seems like the correct way to split your data by day.
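As a rough illustration only (not code from any of these answers), a custom Partitioner for the old mapred API could send every record for a given day to the same reducer. DayPartitioner and the assumption that the mapper emits keys starting with a YYYYMMDD prefix are both made up for this sketch:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: assumes the mapper emits keys that start with YYYYMMDD.
public class DayPartitioner implements Partitioner<Text, Text> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed for this sketch
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // All records for the same day go to the same reducer.
        String day = key.toString().substring(0, 8);
        return (day.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

With Streaming, a compiled class like this would be passed via the -partitioner option and shipped on the job classpath (for example with -libjars, as the answer below mentions).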
As the author of that post, sorry that you had such a rough time. It sounds like if you were getting a "class not found" error, there was some issue with your custom output format not being found after you included it with "-libjars".

Using mahout and hadoop

I am a newbie trying to understand how Mahout and Hadoop can be used for collaborative filtering. I have a single-node Cassandra setup and I want to fetch data from Cassandra.
Where can I find clear installation steps for Hadoop first, and then for Mahout to work with Cassandra?
(I think this is the same question you just asked on user@mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest you not use it, for simplicity. It's a "necessary evil" for scaling past a certain point.
You can have data in Cassandra, but you will want to be able to read it into memory. If you can dump it as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
Then, your two needs are easily answered:
* This is not even a recommendation problem. Just pick an implementation of UserSimilarity, and use it to compare a user to all others, and pick the ones with the highest similarity. (Wrapping it with CachingUserSimilarity will help a lot.)
* This is just a recommender problem. Use a GenericUserBasedRecommender with your UserSimilarity and DataModel and you're done.
It of course can get much more complex than this, but this is a fine start point.
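For concreteness, here is a minimal sketch of the non-Hadoop route described above using Mahout's Taste API, assuming you can dump your Cassandra data to a CSV of userID,itemID,preference lines; the file name ratings.csv, the neighborhood size of 50, and the user ID 1234 are placeholders:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommender {
    public static void main(String[] args) throws Exception {
        // "ratings.csv" is a placeholder: lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Wrap the similarity in a cache, as suggested above
        UserSimilarity similarity =
            new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);

        // Consider the 50 nearest users (50 is an arbitrary choice)
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(50, similarity, model);

        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 10 recommendations for user 1234
        List<RecommendedItem> items = recommender.recommend(1234L, 10);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}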
If later you use Hadoop, yes you have to set up Hadoop according to its instructions. There is no Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes which invokes the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd need to just understand Hadoop.
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.
The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to a mahout file format. Copied it into Hadoop filesystem.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop, onto the hadoop cluster's control box. No other mahout stuff on there.
Ran this: (assumes hadoop is configured; which I confirm with dfs -ls /user/danbri )
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!
You can follow the tutorial below to learn; it is easy to understand and clearly covers the basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/

How do I control the output file names and content of a Hadoop Streaming job?

Is there a way to control the output filenames of a Hadoop Streaming job?
Specifically, I would like my job's output file names and content to be organized by the key the reducer outputs: each file would only contain values for one key, and its name would be the key.
Update:
Just found the answer: using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
I haven't seen any samples for this out there...
Can anyone point me to a Hadoop Streaming sample that makes use of a custom output format Java class?
Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
When using Hadoop Streaming, since only one JAR is supported, you actually have to fork the streaming jar and put your new output format classes in it for streaming jobs to be able to reference them...
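For illustration, here is a hedged sketch (not code from the linked Javadoc) of such a class, using the old org.apache.hadoop.mapred API that classic Streaming expects; KeyBasedOutputFormat is a made-up name:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical output format: writes each key's values to a file named after the key.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part-NNNNN file name; replace it with the key itself.
        return key.toString();
    }
}

Once the class is on the streaming job's classpath, you would reference it with -outputformat KeyBasedOutputFormat.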
EDIT:
As of Hadoop 0.20.2 this class has been deprecated, and you should now use:
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
In general, Hadoop would have you consider the entire directory to be the output, and not an individual file. There's no way to directly control the filename, whether using Streaming or regular Java jobs.
However, nothing is stopping you from doing this splitting and renaming yourself, after the job has finished. You can $HADOOP dfs -cat path/to/your/output/directory/part-*, and pipe that to a script of yours that splits content up by keys and writes it to new files.
