Caching small RDDs with Spark takes too long and Spark seems frozen - spark-streaming

I use Spark with caching via the persist method. I have several RDDs that I cache, but some of them are pretty small (about 300 KB). Most of the time it works well and the whole job usually takes about 1 s, but sometimes it takes about 40 s to store 300 KB in the cache.
If I go to Spark UI -> Cache, I can see the percentage increase up to 83% (250 KB) and then stall for a while. If I check the event timeline in the Spark UI, I can see that when this happens there is one node where tasks take a very long time. That node can be any node in the cluster; it's not always the same one.
In the Spark executor logs I can see that it takes about 40 s to store 3.7 KB when this problem occurs:
INFO 2018-08-23 12:46:58 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1705_23 locally
INFO 2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.memory.MemoryStore: Block rdd_1692_7 stored as bytes in memory (estimated size 3.7 KB, free 1048.0 MB)
INFO 2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1692_7 locally
I have tried MEMORY_ONLY, MEMORY_ONLY_SER, and so on, with the same results. I have checked disk I/O (although with MEMORY_ONLY I guess that shouldn't matter) and I can't see any problem. This happens randomly, but in roughly 25% of the jobs.
Any idea what could be happening?
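For reference, a minimal sketch of the caching pattern described here, assuming an already-built smallRdd (the name is illustrative):

import org.apache.spark.storage.StorageLevel

// smallRdd is a hypothetical small RDD (~300 KB), as in the question above.
val cached = smallRdd.persist(StorageLevel.MEMORY_ONLY) // or MEMORY_ONLY_SER
cached.count()     // force materialization so the block shows up in Spark UI -> Cache
// ... reuse `cached` across jobs ...
cached.unpersist() // release the blocks when done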

Related

Shuffle phase lasts too long - Hadoop

I have an MR job in which the shuffle phase lasts too long.
At first I thought it was because I'm emitting a lot of data from the Mapper (around 5 GB). I addressed that by adding a Combiner, thus emitting less data to the Reducer. But the shuffle phase did not shorten, as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized to use a Combiner. Unfortunately the shuffle phase is still the same.
My only remaining thought is that it could be because I'm using a single Reducer. But this shouldn't be the case, since I'm not emitting a lot of data when using the Combiner or when combining in the Mapper.
Here are my stats and all the counters for my Hadoop (YARN) job: [screenshots of the job statistics and counters omitted]
I should also add that this runs on a small cluster of 4 machines. Each has 8 GB of RAM (2 GB reserved) and 12 virtual cores (2 reserved).
These are virtual machines. At first they were all on a single host, but then I split them 2-2 across two hosts. So at first they all shared one HDD; now there are two machines per disk. Between them is a gigabit network.
And here are more stats:
The whole memory is occupied.
The CPU is constantly under pressure while the job runs (the picture shows CPU for two consecutive runs of the same job).
My question is: why is the shuffle time so long, and how do I fix it? I also don't understand why there was no speedup even though I dramatically reduced the amount of data emitted from the Mapper.
A few observations:
For a job of 30 minutes, the GC time is too high (try reusing objects rather than creating a new one on each call to the map()/reduce() methods; see the sketch after this list).
The average map time is far too high at 16 minutes. What are you doing in your map()?
YARN memory is at 99%; this signifies that you are running too many services on your HDP cluster and the RAM is not sufficient to support that many services.
Please increase the YARN container memory; give it at least 1 GB.
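On the object-reuse point, here is a hedged sketch of what that looks like in a mapper (a hypothetical word-count-style mapper; the point is that the Writable instances are allocated once per task and reused on every call):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class ReusingMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  // Allocated once per task instead of once per record, which cuts GC pressure.
  private val outKey = new Text()
  private val one = new IntWritable(1)

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").foreach { token =>
      outKey.set(token)          // reuse the same Text object
      context.write(outKey, one) // Hadoop copies the key/value during write
    }
  }
}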
This looks like a GC + overscheduled cluster problem.

Tuning Hadoop job execution on YARN

A bit of intro: I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30 MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and the NameNode.
Now I'm testing my algorithm with increasing amounts of data (300 MB, 3 GB). I'm looking for pointers on how to tune my mini-cluster. Specifically, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory per container, reserved memory per container, sort allocation memory, map memory, and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, for the following reasons:
I run a job on a dataset of 30 MB (I set the block size for this job to 8 MB, since the data is small and the processing is intensive) - execution time: 30 minutes.
I run the same job with the same dataset multiplied 10 times - 300 MB (same block size, 8 MB) - execution time: 2 hours.
Now the same amount of data - 300 MB - but with a block size of 128 MB: the same execution time, maybe even a bit over 2 hours.
The HDFS block size is 128 MB, so I thought this would cause a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is not good, hence things cannot improve even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN configuration to allocate at most 33% of the maximum YARN memory per job; this can be altered based on your requirements (the default value is 1):
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further info on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/
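As a rough, hedged illustration of the container sizing asked about above, for nodes like these (8 GB RAM with 2 GB reserved) one might start from values like the following; the numbers are illustrative only, and these properties normally live in yarn-site.xml and mapred-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration

// Illustrative starting point, not a definitive recipe.
val conf = new Configuration()
conf.set("yarn.nodemanager.resource.memory-mb", "6144")  // 8 GB node minus 2 GB reserved
conf.set("yarn.scheduler.minimum-allocation-mb", "1024") // smallest container YARN grants
conf.set("yarn.scheduler.maximum-allocation-mb", "6144") // largest container YARN grants
conf.set("mapreduce.map.memory.mb", "1536")              // container size for map tasks
conf.set("mapreduce.reduce.memory.mb", "3072")           // container size for reduce tasks
conf.set("mapreduce.map.java.opts", "-Xmx1228m")         // ~80% of the map container
conf.set("mapreduce.reduce.java.opts", "-Xmx2457m")      // ~80% of the reduce container
conf.set("mapreduce.task.io.sort.mb", "512")             // sort buffer; must fit in the heap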

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines, each with 122 GB of RAM and 16 CPUs, and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage gets completed locally on some nodes faster than on others. Based on that observation, here is what I would try:
1. Cache the RDD that you process. Don't forget to unpersist it when you don't need it anymore. See: Understanding caching, persisting in Spark.
2. Check whether the partitions are balanced, which doesn't seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail of distributed computing, isn't it? :) See: How to balance my data across the partitions? (and the sketch after this list).
3. Reduce the communication costs, i.e. use fewer workers than you currently do, and see what happens. Of course that heavily depends on your application. Sometimes communication costs become so large that they dominate, so using fewer machines can speed up the job. However, I would do that only if steps 1 and 2 do not suffice.
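A hedged sketch of point 2, where rdd stands in for the dataset being processed: inspect the per-partition record counts, and repartition if they are badly skewed.

// Count records per partition; this is a full pass over the data, so use it as a diagnostic.
val counts = rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }.collect()
counts.sortBy(_._2).foreach { case (i, n) => println(s"partition $i: $n records") }

// If the counts are badly skewed, rebalance with a full shuffle.
val balanced = rdd.repartition(numPartitions = 200) // illustrative partition count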
Without any more info, it would seem that at some point in the computation your data gets spilled to disk because there is no more space in memory.
It's just a guess; you should check your Spark UI.

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120 GB RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning), and each row has around 100 KB of data after serialization. The job always gets stuck in repartitioning; namely, it constantly hits the following errors and retries:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...
I've tried to identify the problem but it seems like both memory and disk consumption of the machine throwing these errors are below 50%. I've also tried different configurations, including:
let driver/executor memory use 60% of total memory.
let Netty prioritize the JVM shuffle buffer.
increase the shuffle streaming buffer to 128m.
use KryoSerializer and max out all buffers.
increase the shuffle memoryFraction to 0.4.
But none of them worked. The small job always triggers the same series of errors and maxes out the retries (up to 1000 times). How do you troubleshoot this kind of situation?
Thanks a lot if you have any clue.
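For reference, a hedged sketch of how some of the settings attempted above might be expressed (Spark 1.x-era property names; the values come from the list above, and this is a record of the attempts rather than a fix):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m") // "max out all buffers"
  .set("spark.shuffle.memoryFraction", "0.4")     // the shuffle memoryFraction above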
Check your log for an error similar to this:
ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated
Every time you get this error it is because you lost an executor. As for why you lost the executor, that is another story; again, check your log for clues.
One thing to note: YARN can kill your job if it thinks you are using "too much memory".
Check for something like this:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
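A hedged sketch of that knob (the value is illustrative; the advice above is to keep raising it until the failures stop):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Off-heap headroom YARN grants each executor beyond spark.executor.memory, in MB.
  .set("spark.yarn.executor.memoryOverhead", "2048") // illustrative; increase as needed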
I was also getting the error
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
and looking further in the log I found
Container killed on request. Exit code is 143
After searching for that exit code, I realized it's mainly related to memory allocation. So I checked the amount of memory I had configured for the executors. I found that by mistake I had configured 7g for the driver and only 1g for the executor. After increasing the executor memory, my Spark job ran successfully.
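A hedged sketch of the corrected allocation (illustrative values; note that in client mode the driver memory must be passed to spark-submit as --driver-memory, since the driver JVM is already running by the time SparkConf is read):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "2g")   // was mistakenly 7g
  .set("spark.executor.memory", "6g") // was mistakenly 1g; the executors do the heavy lifting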
It seems that the changeQueue operation I performed may have caused this problem: the server was changed after I changed the queue.

Hadoop - Reduce the number of Spilled Records

I have an Ubuntu VM running in standalone/pseudo-distributed mode with 4 GB RAM and 4 cores.
Everything is set to default except:
io.file.buffer.size=65536
io.sort.factor=50
io.sort.mb=500
mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4
This of course will not be a production machine, but I am fiddling with it to get to grips with fine tuning.
My problem is that when I run my benchmark Hadoop Streaming job (getting the distinct records over a 1.8 GB text file) I get quite a lot of spilled records, and the above tweaks don't seem to reduce the spills. I have also noticed that when I monitor the memory usage in Ubuntu's System Monitor, it never gets fully used and never goes above 2.2 GB.
I have looked at changing HADOOP_HEAPSIZE, mapred.map.child.java.opts, and mapred.reduce.child.java.opts, but I am not sure what to set these to, as the defaults seem as though they should be enough.
Is there a setting I am missing that will allow Hadoop to utilise the remaining RAM and therefore reduce spilled records (hopefully speeding up jobs), or is this normal behaviour?
Many Thanks!
In addition to increasing memory, have you considered whether you can run a combiner for your task after the map step? That would compress and reduce the number of records that need to be kept in memory or spilled.
Unfortunately, when you are using Streaming, it seems the combiner has to be coded in Java and can't be written in whatever language you're using (see the sketch below).
http://wiki.apache.org/hadoop/HadoopStreaming
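For what it's worth, a hedged sketch of wiring a combiner into a JVM MapReduce job; MyReducer is hypothetical, and a reducer may only double as a combiner if its operation is associative and commutative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// MyReducer is a hypothetical Reducer whose merge step is associative and
// commutative, so it can safely run map-side as a combiner.
val job = Job.getInstance(new Configuration(), "distinct-records")
job.setCombinerClass(classOf[MyReducer])
job.setReducerClass(classOf[MyReducer])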
The default memory assigned to a map/reduce task is 200 MB. You can increase that value with -Dmapred.child.java.opts=-Xmx512M.
Anyway, this is very interesting material about Hadoop tuning: Hadoop Performance.
Hope it helps!
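For completeness, a hedged sketch of bumping the sort buffer and the task heap together, using the old property names this question already uses (illustrative values; io.sort.mb must fit comfortably inside the child heap, or the bigger buffer just trades spills for OOMs):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("io.sort.mb", "500")                       // map-side sort buffer, in MB
conf.set("io.sort.spill.percent", "0.80")           // start spilling at 80% full
conf.set("mapred.map.child.java.opts", "-Xmx1024m") // heap comfortably above io.sort.mb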
