I have launched a 10-node cluster with the EC2 script in Spark standalone mode. I am accessing data in S3 buckets from within the PySpark shell, but when I perform transformations on the RDD, only one node is ever used. For example, the code below reads in data from the Common Crawl corpus:
bucket = ("s3n://#aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/"
"/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10"
"-180-212-248.ec2.internal.warc.gz")
data = sc.textFile(bucket)
data.count()
When I run this, only one of my 10 slaves processes the data. I know this because only one slave (213) has any logs of the activity when viewed from the Spark web console. When I view the activity in Ganglia, this same node (213) is the only slave with a spike in memory usage while the job was running.
Furthermore, I get exactly the same performance when I run the same script on an EC2 cluster with only one slave. I am using Spark 1.1.0, and any help or advice is greatly appreciated.
...ec2.internal.warc.gz
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
(Note, however, that Spark can load 10 gzipped files in parallel just fine; it's just that each of those 10 files can only be loaded by 1 task. You can still get parallelism across files, just not within a file.)
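As a hedged illustration (the glob path here is made up), pointing textFile at several gzipped files gives you one partition per matched file, so the files are processed in parallel even though each individual file is still read by a single task:
many = sc.textFile("s3n://my-bucket/segments/*.warc.gz")  # hypothetical glob: one partition per gzipped file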
You can confirm that you only have 1 partition by checking the number of partitions in your RDD explicitly:
data.getNumPartitions()
The upper bound on the number of tasks that can run in parallel on an RDD is the number of partitions in the RDD or the number of slave cores in your cluster, whichever is lower.
In your case, it's the number of RDD partitions. You can increase that by repartitioning your RDD as follows:
data = sc.textFile(bucket).repartition(sc.defaultParallelism * 3)
Why sc.defaultParallelism * 3?
The Spark Tuning guide recommends having 2-3 tasks per core, and sc.defaultParallelism gives you the number of cores in your cluster.
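Putting it together, a minimal sketch (the multiplier just follows the guide's rule of thumb and is not a measured value):
data = sc.textFile(bucket)  # one partition, because the gzipped file cannot be split
data = data.repartition(sc.defaultParallelism * 3)  # shuffle the lines across the whole cluster
data.getNumPartitions()  # now 3 x the number of cores
data.count()  # the file is still read by one task, but everything after the shuffle runs on all the slaves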
Related
I have a job that needs to access Parquet files on HDFS, and I would like to minimise the network activity. So far I have the HDFS DataNodes and Spark workers started on the same machines, but when I launch my job, the data locality is always ANY when it should be NODE_LOCAL, since the data is distributed among all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?
The property you are looking for is spark.locality.wait. If you increase its value, tasks will execute more locally, as Spark won't send the data to other workers just because the worker holding the data is busy. However, setting the value too high might result in longer execution times, because you do not utilise the workers efficiently.
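For example, a minimal PySpark sketch (the value is illustrative; Spark 1.x reads it as milliseconds, while newer versions also accept suffixed values such as "10s"):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-example")  # hypothetical app name
        .set("spark.locality.wait", "10000"))  # wait up to ~10 s for a NODE_LOCAL slot before falling back
sc = SparkContext(conf=conf)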
Also have a look here:
http://spark.apache.org/docs/latest/configuration.html
I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use TeraSort for the benchmark.
However, when I run the benchmark, only one out of the four machines is under load, while the other three are completely idle.
If I run the test another time, a different machine is completely under load while the other three are idle.
When I create the dataset with TeraGen everything works just fine; the load is evenly distributed among all four machines.
What could be wrong with this configuration?
Thanks
I hope your cluster is laid out properly as 4 nodes (1 NameNode, 1 Secondary NameNode, 2 DataNodes).
The process flow starts with the NameNode, and the JobTracker schedules the job onto the TaskTrackers that hold the data blocks.
How much the DataNodes are used depends on a few factors, such as the replication factor, the number of mappers, and the number of blocks.
If there are many blocks, they will be placed evenly across all the DataNodes in your cluster. If the replication factor is 2, then each block will be available on both DataNodes, so both can run the mappers that deal with those blocks.
If you have two blocks for a file, two mappers will run simultaneously on the DataNodes and utilize the resources properly.
In your case, it seems the block size is the problem. Try reducing it so that there are at least 2 blocks; utilization will then be higher, and so will the performance.
Hadoop can be tuned to your needs with the settings below.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
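For reference, a hedged hdfs-site.xml sketch (the values are only illustrative, not tuned recommendations; newer Hadoop releases spell the second property dfs.blocksize):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>  <!-- keep each block on both data nodes -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>  <!-- 64 MB blocks, so the input yields several splits -->
  </property>
</configuration>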
Good luck !!!
I have been reading up on Hadoop, YARN, and Spark. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: The client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client then submits a job (map-reduce) to YARN, which copies the jar to the available DataNodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: Given a job, an input, and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is more suited for micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how the two come together during an actual workflow. For example, when a client has a job to submit (read a huge file and do a bunch of transformations), what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100 GB text file. Please include as many details as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100 GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different nodes in the cluster.
When the file is read with a Hadoop InputFormat, a list of splits with their locations is obtained first. Then one task is created per split, so you will get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them will process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there are always only a map and a reduce phase.
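A hedged PySpark illustration of the same point (the HDFS path is hypothetical):
logs = sc.textFile("hdfs:///data/big-file.txt")  # hypothetical 100 GB file on HDFS
logs.getNumPartitions()  # roughly one partition per 128 MB block, i.e. about 800 here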
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirements (memory, processors) is submitted, it is YARN's responsibility to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers, and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker can process multiple tasks in parallel (depending on the configured number of cores per executor).
For example, if you set 8 cores per Spark executor and ask for 100 executors, YARN tries to allocate 101 containers in the cluster to run the 100 Spark workers plus 1 container for the Spark driver. Each of the workers will then process 8 tasks in parallel (because of the 8 cores).
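As a hedged sketch of how such a request is expressed from PySpark (the numbers and app name are illustrative; spark.executor.instances is the configuration property behind spark-submit's --num-executors flag and applies when running on YARN):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("yarn-example")  # hypothetical app name
        .setMaster("yarn-client")  # run against a YARN cluster in client mode
        .set("spark.executor.instances", "100")  # ask YARN for 100 executor containers
        .set("spark.executor.cores", "8")  # 8 parallel tasks per executor
        .set("spark.executor.memory", "4g"))  # memory per executor container
sc = SparkContext(conf=conf)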
My Spark cluster has 1 master and 3 workers (on 4 separate machines, each machine with 1 core), and other settings are as in the picture below, where spark.cores.max is set to 3, and spark.executor.cores also 3 (in pic-1)
But when I submit my job to the Spark cluster, I can see from the Spark web UI that only one executor is used (judging by the used memory and RDD blocks in pic-2), not all of the executors. In this case the processing speed is much slower than I expected.
Since I've set the max cores to 3, shouldn't all the executors be used for this job?
How do I configure Spark to distribute the current job to all the executors, instead of having only one executor run it?
Thanks a lot.
------------------pic-1: [screenshot of the Spark configuration settings]
------------------pic-2: [screenshot of the executors page in the Spark web UI]
You said you are running two receivers; what kind of receivers are they (Kafka, HDFS, Twitter)?
Which Spark version are you using?
In my experience, if you are using any receiver other than the file receiver, it will occupy 1 core permanently.
So when you say you have 2 receivers, 2 cores will be permanently used for receiving the data, and you are left with only 1 core that is doing the work.
Please post the Spark master homepage screenshot as well, along with the job's Streaming page screenshot.
In Spark Streaming, only 1 receiver is launched per input stream to get the data from the input source into an RDD.
Repartitioning the data after the first transformation can increase parallelism.
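A hedged PySpark Streaming sketch of that pattern (the socket source, host, and port are hypothetical): the single receiver pins one core, and the repartition spreads each received batch over the remaining executors:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-repartition-example")
ssc = StreamingContext(sc, 5)  # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)  # the receiver occupies one core while it runs
counts = lines.repartition(3).count()  # spread each batch across 3 partitions before counting
counts.pprint()
ssc.start()
ssc.awaitTermination()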
I am new to the Hadoop and Hive world.
I have written a Hive query which processes 189 million rows (a 40 GB file). While the query is executing, it runs on a single machine and generates many map and reduce tasks. Is that the expected behavior?
I have read in many articles that Hadoop is a distributed processing framework. My understanding was that Hadoop will split your job into multiple tasks, distribute those tasks to different nodes, and once the tasks finish, the reducer will join the output. Please correct me if I am wrong.
I have 1 master and 2 slave nodes. I am using Hadoop 2.2.0 and Hive 0.12.0.
Your understanding of Hive is correct: Hive translates your query into a Hadoop job, which in turn gets split into multiple tasks and distributed to the nodes: map > sort & shuffle > reduce/aggregate > return to the Hive CLI.
If you have 2 slave nodes, Hive will split its workload across the two, provided your cluster is properly configured.
That being said, if your input file is not splittable (for example, it's a GZIP-compressed file), Hadoop will not be able to split/parallelize the work, and you will be stuck with a single input split, and thus a single mapper, limiting the workload to a single machine.
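As a hedged illustration (the table name is hypothetical): when a query runs, the Hive CLI prints the number of mappers and reducers of the underlying MapReduce job, which shows how far the work was actually split:
SELECT COUNT(*) FROM page_views;  -- hypothetical table; check "number of mappers" in the CLI output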
Thank you all for your quick reply.
You are all correct; my job is converted into different tasks and distributed to the nodes.
When I check the Hadoop web UI, at the first level it shows the job running on a single node, but when I drill down further it shows the mappers and reducers and where they are running.
Thanks :)