Using Mahout and Hadoop

I am a newbie trying to understand how Mahout and Hadoop are used for collaborative filtering. I have a single-node Cassandra setup and I want to fetch data from Cassandra.
Where can I find clear installation steps, first for Hadoop and then for Mahout, to work with Cassandra?

(I think this is the same question you just asked on user@mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest not using it, for simplicity's sake. It's a "necessary evil" for scaling past a certain point.
You can have the data in Cassandra, but you will want to be able to read it into memory. If you can dump it as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
Then, your two needs are easily answered:
1. This is not even a recommendation problem. Just pick an implementation of UserSimilarity, and use it to compare a user to all others, and pick the ones with highest similarity. (Wrapping with CachingUserSimilarity will help a lot.)
2. This is just a recommender problem. Use a GenericUserBasedRecommender with your UserSimilarity and DataModel and you're done.
It can of course get much more complex than this, but this is a fine starting point.
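To make that concrete, here is a minimal non-Hadoop sketch, assuming the Cassandra data has been dumped to a CSV file of userID,itemID,value lines (the file name, neighborhood size, and user ID below are hypothetical):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class CassandraDumpRecommender {
  public static void main(String[] args) throws Exception {
    // One "userID,itemID,value" line per preference, dumped from Cassandra.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Need 1: similar users -- pick a UserSimilarity and cache it.
    UserSimilarity similarity =
        new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);

    // Need 2: recommendations -- a user-based recommender over that similarity.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    long userID = 123L;  // hypothetical user ID from the dump
    long[] similarUsers = recommender.mostSimilarUserIDs(userID, 10);
    List<RecommendedItem> recommendations = recommender.recommend(userID, 5);
    System.out.println(java.util.Arrays.toString(similarUsers));
    System.out.println(recommendations);
  }
}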
If later you use Hadoop, yes you have to set up Hadoop according to its instructions. There is no Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes which invokes the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd need to just understand Hadoop.
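For completeness, a hedged sketch of what that looks like on the command line (the jar is the examples job jar from a Mahout build, the HDFS paths are hypothetical, and RecommenderJob's options vary by Mahout version, so check them for yours):

hadoop jar mahout-examples-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input /user/you/ratings.csv --output /user/you/recommendations \
  --similarityClassname SIMILARITY_COOCCURRENCE --numRecommendations 10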
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.

The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
Used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to Mahout's file format, then copied it into the Hadoop filesystem.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop onto the Hadoop cluster's control box. No other Mahout pieces are on there.
Ran this (assuming Hadoop is configured, which I confirm with dfs -ls /user/danbri):
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!

You can follow the tutorial below to learn; it is easy to understand and clearly covers the basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/

Related

Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp

I am trying to move data between two hadoop clusters using distcp. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic, which according to the documentation, 'allows faster data-nodes to copy more bytes than slower nodes'.
I am setting the number of mappers to 400. When I launch the job, I see this error: java.io.IOException: Too many chunks created with splitRatio:2, numMaps:400. Reduce numMaps or decrease split-ratio to proceed.
When I googled it, I found this link: https://issues.apache.org/jira/browse/MAPREDUCE-5402
In that ticket the author asks for a feature where distcp.dynamic.max.chunks.tolerable can be increased to resolve the issue.
The ticket says the issue was resolved in version 2.5.0. The Hadoop version I am using is 2.7.3, so I believe it should be possible for me to increase the value of distcp.dynamic.max.chunks.tolerable.
However, I am not sure how to increase it. Can this configuration be set for a single distcp job by passing it on the command line, the way -Dmapreduce.job.queuename is, or do I have to update it in mapred-site.xml?
Also, does this approach work well when there is a large number of small files? Are there any other parameters I can use to make it faster? Any help would be appreciated.
Thank you.
I was able to figure it out. The parameters can be passed on the distcp command line instead of having to update mapred-site.xml:
hadoop distcp -Ddistcp.dynamic.recordsPerChunk=50 -Ddistcp.dynamic.max.chunks.tolerable=10000 -skipcrccheck -m 400 -prbugc -update -strategy dynamic "hdfs://source" "hdfs://target"

Spark RDD problems

I am starting with Spark and have never worked with Hadoop. I have 10 iMacs on which I have installed Spark 1.6.1 pre-built for Hadoop 2.6. I downloaded the precompiled version and just copied the extracted contents into /usr/local/spark/. I set up all the environment variables (SCALA_HOME, changes to PATH, and the other Spark conf). I am able to run both spark-shell and pyspark (with Anaconda's Python).
I have set up the standalone cluster and all the nodes show up in my web UI. Then, using the Python shell (run on the cluster, not locally), I followed this link's Python-interpreter word-count example.
This is the code I used:
from operator import add

def tokenize(text):
    return text.split()

text = sc.textFile("Testing/shakespeare.txt")
words = text.flatMap(tokenize)
wc = words.map(lambda x: (x, 1))
counts = wc.reduceByKey(add)
counts.saveAsTextFile("wc")
It gives me an error that the file shakespeare.txt was not found on the slave nodes. Searching around, I understood that if I am not using HDFS then the file should be present on each slave node at the same path. Here is the stack trace - github gist
Now, I have a few questions:
Isn't an RDD supposed to be distributed? That is, shouldn't it have distributed the file to all the nodes (when the action was run on the RDD) instead of requiring me to distribute it?
I downloaded Spark built for Hadoop 2.6, but none of the Hadoop commands are available to set up HDFS. I extracted the Hadoop jar file found in spark/lib hoping to find some executables, but there was nothing. So, what Hadoop-related files are provided in the Spark download?
Lastly, how can I run a distributed application (with spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to set up HDFS, what extra steps are required? And how do I set up HDFS here?
If you read the Spark Programming Guide, you will find the answer to your first question:
To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
Remember that transformations are executed on the Spark workers (see link, slide n.21).
Regarding your second question: Spark ships only the libraries, as you can see, for talking to a Hadoop infrastructure. You need to set up the Hadoop cluster itself first (HDFS and so on) in order to use it (with the libs in Spark); have a look at Hadoop Cluster Setup.
To answer your last question, I hope that the official documentation helps, in particular Spark Standalone.
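To make the last two points concrete, here is a rough sketch of the steps once Hadoop is installed and HDFS is running per the guide above; the host names and paths are hypothetical (7077 is the default standalone-master port, 8020 the usual NameNode RPC port):

hdfs dfs -mkdir -p /user/you/Testing
hdfs dfs -put Testing/shakespeare.txt /user/you/Testing/
spark-submit --master spark://master-host:7077 wordcount.py

Inside wordcount.py you would then read the file with an HDFS URI, e.g. sc.textFile("hdfs://namenode-host:8020/user/you/Testing/shakespeare.txt"), so that each worker fetches its own partitions from HDFS instead of expecting a copy at a local path.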

Hadoop Spark (MapR) - how does addFile work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files that I want to process with Spark.
In the book Fast Data Processing with Spark:
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this. Will Spark create a copy of the file on each node?
What I want is for it to read the files present in a given directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the working directory on each of the nodes; inside a task you can also resolve its absolute path with SparkFiles.get("GeoIP.dat").
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Merge HDFS files without going through the network

I could do this:
hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But this makes the HDFS data stream through the network (down to the client machine and back up). Is there a way to tell HDFS to merge the files on the cluster itself?
I have a problem similar to yours.
Here is an article with a number of options for merging HDFS files, but all of them have some specifics. None of them met my requirements, but I hope it can help you:
HDFS concat (in the API, FileSystem.concat()). A fairly recent API. It requires the original files to have their last block full; a sketch of a call follows this list.
MapReduce jobs: I will probably end up with a solution based on this technology, but it is slow to set up.
copyMerge - as far as I can see, this will again copy the data, but I have not checked the details yet.
File crush - again, looks like MapReduce.
So the main result is: if MapReduce setup speed suits you, there is no problem. If you have real-time requirements, things get complex.
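Since concat is the only option on that list that does not move data at all (the NameNode simply re-links the source files' blocks onto the target), here is a minimal Java sketch of calling it; the paths are hypothetical, and the block-size restriction mentioned above still applies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatParts {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster configured in core-site.xml / hdfs-site.xml on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical layout: append the later part files onto part-00000.
    Path target = new Path("/path/to/result/of/many/reducers/part-00000");
    Path[] sources = new Path[] {
        new Path("/path/to/result/of/many/reducers/part-00001"),
        new Path("/path/to/result/of/many/reducers/part-00002")
    };

    // Metadata-only operation on the NameNode; no file data crosses the network.
    fs.concat(target, sources);
  }
}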
One of my 'crazy' ideas is to use HBase coprocessor mechanics (endpoints) and file-block locality information for this, as I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction

Hadoop Streaming with SequenceFile (on AWS)

I have a large number of Hadoop SequenceFiles which I would like to process using Hadoop on AWS. Most of my existing code is written in Ruby, and so I would like to use Hadoop Streaming along with my custom Ruby Mapper and Reducer scripts on Amazon EMR.
I cannot find any documentation on how to integrate Sequence Files with Hadoop Streaming, and how the input will be provided to my Ruby scripts. I'd appreciate some instructions on how to launch jobs (either directly on EMR, or just a normal Hadoop command line) to make use of SequenceFiles and some information on how to expect the data to be provided to my script.
--Edit: I had previously referred to StreamFiles rather than SequenceFiles by mistake. I think the documentation for my data was incorrect; apologies. The answer is easy with that change.
The answer to this is to specify the input format as a command line argument to Hadoop.
-inputformat SequenceFileAsTextInputFormat
The chances are that you want the SequenceFile as text, but there is also SequenceFileAsBinaryInputFormat if that's more appropriate.
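To show where the flag goes, here is a hedged sketch of a full streaming invocation (the streaming jar path, input/output paths, and script names are placeholders; on EMR you would pass the same arguments to a streaming step):

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -inputformat SequenceFileAsTextInputFormat \
  -input /path/to/sequence/files \
  -output output \
  -mapper my_mapper.rb \
  -reducer my_reducer.rb

With SequenceFileAsTextInputFormat, each record arrives on your mapper's stdin as a tab-separated line of key and value, both rendered as text.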
Not sure if this is what you're asking for, but the command to use Ruby map and reduce scripts with the Hadoop command line would look something like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
You can (and should) use a combiner with big data sets; add it with the -combiner option. The combiner runs on each mapper's output and its output feeds into your reducer (but there is no guarantee how many times it will be called, if at all). Your input is split (according to the standard Hadoop input format) and fed directly to your mapper. The example is from O'Reilly's Hadoop: The Definitive Guide, 3rd Edition, which has some very good information on streaming, and a section dedicated to streaming with Ruby.
