Spark RDD problems - Hadoop

I am starting with Spark and have never worked with Hadoop. I have 10 iMacs on which I have installed Spark 1.6.1 with Hadoop 2.6. I downloaded the precompiled version and just copied the extracted contents into /usr/local/spark/. I set up all the environment variables (SCALA_HOME, changes to PATH, and the other Spark conf).
I have set up the standalone cluster; all the nodes show up in my web UI. Now, using the Python shell (run on the cluster, not locally), I followed this link's Python interpreter word count example.
This is the code I have used
from operator import add
def tokenize(text):
    return text.split()
text = sc.textFile("Testing/shakespeare.txt")
words = text.flatMap(tokenize)
wc = words.map(lambda x: (x,1))
counts = wc.reduceByKey(add)
counts.saveAsTextFile("wc")
It gives me an error that the file shakespeare.txt was not found on the slave nodes. Searching around, I understood that if I am not using HDFS then the file should be present on each slave node at the same path. Here is the stack trace - github gist
Now, I have a few questions:
Isn't an RDD supposed to be distributed? That is, shouldn't it have distributed the file across all the nodes (when the action was run on the RDD) instead of requiring me to distribute it myself?
I downloaded Spark built for Hadoop 2.6, but none of the Hadoop commands are available to set up HDFS. I extracted the Hadoop jar file found in spark/lib hoping to find some executable, but there was nothing. So, what Hadoop-related files are provided in the Spark download?
Lastly, how can I run a distributed application (spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to create an HDFS, what extra steps are required? And how can I create an HDFS here?

If you read the Spark Programming Guide, you will find the answer to your first question:
To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
Remember that transformations are executed on the Spark workers (see link, slide n.21).
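In practical terms, the path you pass to sc.textFile must be readable from every worker, not just from the driver. A minimal sketch of the two usual options (the local path and the NameNode address are hypothetical):
# Option 1: no HDFS - the same file must exist at the same local path on every node
text = sc.textFile("file:///usr/local/spark/Testing/shakespeare.txt")

# Option 2: HDFS - the file is stored as blocks that every worker can read
text = sc.textFile("hdfs://namenode-host:9000/user/me/shakespeare.txt")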
Regarding your second question, Spark ships only the client libraries needed to talk to a Hadoop infrastructure, as you can see. You need to set up the Hadoop cluster itself (HDFS, etc.) first in order to use it from Spark: have a look at Hadoop Cluster Setup.
To answer your last question, I hope that the official documentation helps, in particular Spark Standalone.

Related

Why is Spark setting partitions to the file size in bytes?

I have a very simple pyspark program that is supposed to read CSV files from S3:
r = sc.textFile('s3a://some-bucket/some-file.csv')
.map(etc... you know the drill...)
This was failing when running a local Spark node (it works in EMR). I was getting OOM errors and GC crashes. Upon further inspection, I realized that the number of partitions was insanely high. In this particular case r.getNumPartitions() would return 2358041.
I realized that that's exactly the size of my file in bytes. This, of course, makes Spark crash miserably.
I've tried different configurations, like changing mapred.min.split.size:
conf = SparkConf()
conf.setAppName('iRank {}'.format(datetime.now()))
conf.set("mapred.min.split.size", "536870912")
conf.set("mapred.max.split.size", "536870912")
conf.set("mapreduce.input.fileinputformat.split.minsize", "536870912")
I've also tried using repartition, and passing a partitions argument to textFile, to no avail.
I would love to know what makes Spark think that it's a good idea to derive the number of partitions from the file size.
In general it doesn't. As nicely explained by eliasah in his answer to Spark RDD default number of partitions, it uses the max of minPartitions (2 if not provided) and the number of splits computed by the Hadoop input format.
The latter will be unreasonably high only if instructed to be by the configuration, which suggests that some configuration file is interfering with your program.
The other possible problem with your code is that you use the wrong configuration. Hadoop options should be set on the hadoopConfiguration, not the Spark configuration. Since it looks like you use Python, you have to go through the private JavaSparkContext instance:
sc = ... # type: SparkContext
sc._jsc.hadoopConfiguration().setInt("mapred.min.split.size", min_value)
sc._jsc.hadoopConfiguration().setInt("mapred.max.split.size", max_value)
There was actually a bug in Hadoop 2.6 that would do this: the initial S3A release didn't provide a block size for Spark to use when splitting input, and the default of "0" meant one byte per job.
Later versions should all take fs.s3a.block.size as the config option specifying the block size... something like 33554432 (= 32 MB) would be a start.
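For example, going through the same private JavaSparkContext handle shown in the previous answer (a sketch; 33554432 = 32 MB is just a starting point):
sc._jsc.hadoopConfiguration().set("fs.s3a.block.size", "33554432")  # 32 MB
r = sc.textFile('s3a://some-bucket/some-file.csv')
print(r.getNumPartitions())  # now roughly file size / 32 MB, not the file size in bytes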
If you are using Hadoop 2.6.x, don't use S3A. That's my recommendation.

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to get started:
1. If the file I am trying to copy into HDFS is very big and cannot be accommodated by the commodity hardware available in my Hadoop cluster, what can be done? Will the copy wait until space frees up, or will there be an error?
2. How can I predict well in advance that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
3. How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
4. How many blocks does a node hold? If I assume a node is a machine with 500 GB of storage (HDD), 1 GB of RAM and a dual-core processor, and each block is configured at 64 MB, is it roughly 500 GB / 64 MB? (A rough calculation is sketched after these questions.)
5. If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node, and how can I find out?
6. How can I find which record/row of the input file ended up in which of the multiple files Hadoop split it into?
7. What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
8. How do I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest methods for finding answers to all of the above queries.
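For the block-count arithmetic in question 4, taking the 500 GB of storage and the 64 MB block size from the question (and ignoring replication), a rough estimate looks like this:
disk_bytes = 500 * 1024**3        # 500 GB of local storage on the node
block_bytes = 64 * 1024**2        # 64 MB HDFS block size
print(disk_bytes // block_bytes)  # -> 8000 blocks at most on that node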

Adding new files to a running hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the job, and gradually add the remaining 9 GB over time. I wonder if this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs as they arrive.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/
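With Spark, for instance, Spark Streaming can pick up files as they are moved into a monitored directory, so you could start with the first 1 GB and keep processing the rest as it lands. A minimal sketch (the HDFS directory is hypothetical):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="incremental-wordcount")
ssc = StreamingContext(sc, 60)                    # 60-second micro-batches

# Every new file moved into this directory becomes part of the next batch
lines = ssc.textFileStream("hdfs:///data/incoming")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()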

Hadoop Spark (Mapr) - AddFile how does it work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS containing hundreds of files that I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this: will Spark create a copy of the file on each node? What I want is for it to read the files present in a given directory (if that directory is present on that node). Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
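For reference, in PySpark those two entry points look roughly like this (the HDFS path is hypothetical):
rdd = sc.parallelize([1, 2, 3, 4])             # distribute an in-memory collection across the cluster
text = sc.textFile("hdfs:///data/logs/*.txt")  # each file/block becomes one or more partitions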
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
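In PySpark the same pattern looks roughly like this; SparkFiles.get resolves the local path of a file shipped with addFile on each executor (GeoIP.dat mirrors the example above, and the lookup body is left as a placeholder):
from pyspark import SparkFiles

sc.addFile("/path/to/GeoIP.dat")

def lookup(ip):
    # Runs on the workers; addFile has already copied the file there
    local_path = SparkFiles.get("GeoIP.dat")
    # ... open local_path with your GeoIP library and look up ip ...
    return local_path  # placeholder so the sketch runs end to end

ips = sc.parallelize(["1.2.3.4", "5.6.7.8"])
print(ips.map(lookup).collect())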
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Using mahout and hadoop

I am a newbie trying to understand how Mahout and Hadoop can be used for collaborative filtering. I have a single-node Cassandra setup and I want to fetch data from Cassandra.
Where can I find clear installation steps for hadoop first and then mahout to work with cassandra?
(I think this is the same question you just asked on user@mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest you not use it, for simplicity. It's a "necessary evil" for scaling past a certain point.
You can have data on Cassandra but you will want to be able to read it into memory. If you can dump as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
Then, your two needs are easily answered:
1. This is not even a recommendation problem. Just pick an implementation of UserSimilarity, and use it to compare a user to all others, and pick the ones with the highest similarity. (Wrapping it with CachingUserSimilarity will help a lot.)
2. This is just a recommender problem. Use a GenericUserBasedRecommender with your UserSimilarity and DataModel and you're done.
It of course can get much more complex than this, but this is a fine start point.
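Conceptually, the first point boils down to something like the following plain-Python sketch (this is not Mahout code; Mahout's UserSimilarity implementations plus CachingUserSimilarity do the same job, only faster and at scale):
from math import sqrt

def cosine(u, v):
    # u, v: dicts mapping item id -> preference value, one per user
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

def most_similar_users(target_id, all_users, k=10):
    # Compare one user to every other user and keep the k with the highest similarity
    scores = [(other_id, cosine(all_users[target_id], prefs))
              for other_id, prefs in all_users.items() if other_id != target_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]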
If later you use Hadoop, yes you have to set up Hadoop according to its instructions. There is no Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes which invokes the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd need to just understand Hadoop.
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.
The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
Used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to Mahout's file format, and copied it into the Hadoop filesystem.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop onto the Hadoop cluster's control box. No other Mahout stuff on there.
Ran this (assumes Hadoop is configured, which I confirm with dfs -ls /user/danbri):
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!
You can follow the tutorial below to learn; it is easy to understand and clearly covers the basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/
