50-60 gb of data in spark standalone mode - hadoop

I am trying to analyze around 50-60 gb of data. I thought of using spark to do that, but I do not have access to multiple nodes in a cluster. Can this level of processing be done using spark standalone mode ? If yes, I would like to know the estimated time required to process the data.Thanks!

Short answer: yes.
Spark will partition this file in many smaller chunks. In your case only few chunks will be executed at a time. These few chunks should fit in memory (you need to play with the configurations to get this right)
To summarize, you will be able to do it, but it would be faster if you had more memory/cores so you can processes more things in parallel.

Related

Spark Executors hang after Out of Memory

I have a spark application running on EMR (16 nodes, 1 master, 15 core, r3.2xlarge instances). For spark executor configuration, we use dynamic Allocation.
While loading the data into the RDD, I see that sometimes when there's a huge amount of data (700 Gb), then Spark runs Out of Memory, but it does not fail the App. Rather the app sits there hung. I'm not sure why this happens but here is my theory :-
We use dataframes which might be caching things.
The spark flag spark.dynamicAllocation.cachedExecutorIdleTimeout is set to infinity
My theory is that it might be caching things while creating dataframes but the cache is never relinquished and this leads to a Spark hang.
There are two solutions
Increase cluster size (worse case)
Figure out a way to add a timeout to Spark app.
Programatically kill the EMR step (could not find an API which does this)
Any leads about how to go about it ?
There could be two other possibilities. Either the partitions are too big, or you have sever skewness (size of partitions varies a lot).
Try to increase the number of partitions (anf hence, reduce their size) using repartition. This will randomly reshuffle the data throughout your executors (good to reduce skewness, but slow). Ideally, I like my partitions to be around 64Mo, depending on your machines.

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines with 122 GB and 16CPUs and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there any other thing I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage gets completed locally in some nodes faster than in others. Driven from that observation, here is what I would try:
Cache the RDD that you process. Do not forget to unpersist it, when you don't need it anymore.
Understanding caching, persisting in Spark.
Check if the partitions are balanced, which doesn't seem to be
the case (that would explain why some local stages complete much
earlier than others). Having balanced partitions is the holy grail
in distributed-computing, isn't it? :)
How to balance my data across the partitions?
Reducing the communications costs, i.e. use less workers than you
use, and see what happens. Of course that heavily depends on your
application. You see, sometimes communication costs become so big,
they dominate, so using less machines for example, speeds up the
job. However, I would do that, only if steps 1 and 2 would not suffice.
Without any more info it would seem that at some point of the computation your data gets spilled to the disk because there is no more space in memory.
It's just a guess, you should check your Spark UI.

Spark SQL performance with Simple Scans

I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), and I am running my queries through Spark SQL using hive context.
After checking the performance tuning documents on the spark page and some clips from latest spark summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long so the overhead of starting multiple JVMs, one per worker is much less than the total query times). As I monitor the job progress, I realized that while the Worker memory is 2.5GB, the executors (one per worker) have max memory of 512MB (which is default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
Trying to give max available memory on each worker to its only executor, but I observed that my queries are running slower than the prev case (default 512MB). Changing 2.5g to 1g improved the performance time, it is close to but still worse than 512MB case. I guess what I am missing here is what is the relationship between the "WORKER_MEMORY" and 'executor.memory'.
Isn't it the case that WORKER tries to split this memory among its executors (in my case its only executor) ? Or there are other stuff being done worker which need memory ?
What other important parameters I need to look into and tune at this point to get the best response time out of my HW ? (I have read about Kryo serializer, and I am about trying that - I am mainly concerned about memory related settings and also knobs related to parallelism of my jobs). As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower) while both are scanning the exact same table & file format. That is why I believe I am missing some params by leaving them as defaults.
Any hint/suggestion would be highly appreciated.
Spark_worker_cores is shared across the instances. Increase the cores to say 8 - then you should see the kind of behavior (and performance) that you had anticipated.

Why is Spark fast when word count? [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
Test case: word counting in 6G data in 20+ seconds by Spark.
I understand MapReduce, FP and stream programming models, but couldn’t figure out the word counting is so amazing fast.
I think it’s an I/O intensive computing in this case, and it’s impossible to scan 6G files in 20+ seconds. I guess there is index is performed before word counting, like Lucene does. The magic should be in RDD (Resilient Distributed Datasets) design which I don’t understand well enough.
I appreciate if anyone could explain RDD for the word counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs which is not fast. Spark job startup (on existing Spark cluster) causes existing JVM to fork new task threads, which is times faster than starting JVM
Next, no indexing and no magic. 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks are residing on different JBOD HDDs. Each of them would deliver you 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data to disks a number of times. It puts map output to the disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disks next time on reduce side before the reduce itself is executed. Spark puts the data to HDDs only once during the shuffle phase, and the reference Spark implementation recommends to increase the filesystem write cache not to make this 'shuffle' data hit the disks
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question
Other than the factors mentioned by 0x0FFF, local combining of results also makes spark run word count more efficiently. Spark, by default, combines results on each node before sending the results to other nodes.
In case of word count job, Spark calculates the count for each word on a node and then sends the results to other nodes. This reduces the amount of data to be transferred over network. To achieve the same functionality in Hadoop Map-reduce, you need to specify combiner class job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action. But Spark needs a lot of memory
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed.
Since Spark uses in-memory, there's no synchronisation barrier that's slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground-up to as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, but with richer optimizations under the hood that supports to the speed of spark.
Using RDDs Spark simplifies complex operations like join and groupBy and in the backend, you’re dealing with fragmented data. That fragmentation is what enables Spark to execute in parallel.
Spark allows to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Sparks speed.
Hope this helps.

Why submitting job to mapreduce takes so much time in General?

So usually for 20 node cluster submitting job to process 3GB(200 splits) of data takes about 30sec and actual execution about 1m.
I want to understand what is the bottleneck in job submitting process and understand next quote
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some process I'm aware:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that helps understand this latency:
HDFS stores your files as data chunk distributed on multiple machines called datanodes
M/R runs multiple programs called mapper on each of the data chunks or blocks. The (key,value) output of these mappers are compiled together as result by reducers. (Think of summing various results from multiple mappers)
Each mapper and reducer is a full fledged program that is spawned on these distributed system. It does take time to spawn a full fledged programs, even if let us say they did nothing (No-OP map reduce programs).
When the size of data to be processed becomes very big, these spawn times become insignificant and that is when Hadoop shines.
If you were to process a file with a 1000 lines content then you are better of using a normal file read and process program. Hadoop infrastructure to spawn a process on a distributed system will not yield any benefit but will only contribute to the additional overhead of locating datanodes containing relevant data chunks, starting the processing programs on them, tracking and collecting results.
Now expand that to 100 of Peta Bytes of data and these overheads looks completely insignificant compared to time it would take to process them. Parallelization of the processors (mappers and reducers) will show it's advantage here.
So before analyzing the performance of your M/R, you should first look to benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRbench loops a small job a number of times
Checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited
To run this program, try the following (Check the correct approach for latest versions:
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are less than the HDFS block size then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block. That means if you have 30 5KB files, then Hadoop may end up spawning 30 mappers eventually per block even if the size of file is small. This is a real wastage as each program overhead is significant compared to the time it would spend processing the small sized file.
As far as I know, there is no single bottleneck which causes the job run latency; if there was, it would have been solved a long time ago.
There are a number of steps which takes time, and there are reasons why the process is slow. I will try to list them and estimate where I can:
Run hadoop client. It is running Java, and I think about 1 second overhead can be assumed.
Put job into the queue and let the current scheduler to run the job. I am not sure what is overhead, but, because of async nature of the process some latency should exists.
Calculating splits.
Running and syncronizing tasks. Here we face with the fact that TaskTrackes poll the JobTracker, and not opposite. I think it is done for the scalability sake. It mean that when JobTracker wants to execute some task, it do not call task tracker, but wait that approprieate tracker will ping it to get the job. Task trackers can not ping JobTracker to frequently, otherwise they will kill it in large clusters.
Running tasks. Without JVM reuse it takes about 3 seconds, with it overhead is about 1 seconds per task.
Client poll job tracker for the results (at least I think so) and it also add some latency to getting information that job is finished.
I have seen similar issue and I can state the solution to be broken in following steps :
When the HDFS stores too many small files with fixed chunk size, there will be issues on efficiency in HDFS, the best way would be to remove all unnecessary files and small files having data. Try again.
Try with the data nodes and name nodes:
Stop all the services using stop-all.sh.
Format name-node
Reboot machine
Start all services using start-all.sh
Check data and name nodes.
Try installing lower version of hadoop (hadoop 2.5.2) which worked in two cases and it worked in hit and trial.

Resources