Debug failed shuffles in Hadoop MapReduce

I am seeing that as the size of the input file increases, the number of failed shuffles increases and the job completion time grows non-linearly.
e.g.
75GB took 1h
86GB took 5h
I also see the average shuffle time increase tenfold.
e.g.
75GB: 4min
85GB: 41min
Can someone point me in a direction to debug this?

If you are sure your algorithms are correct, disk-volume partitioning or fragmentation problems may start to appear somewhere past that 75 GB threshold, since you are probably using the same filesystem to cache the intermediate results.
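One cheap thing to check in that direction is whether the local directories that hold intermediate map output still have room once the input grows. A minimal sketch, assuming a hypothetical mapred.local.dir path and a rough 1:1 ratio of map output to input size (both are assumptions, adjust them to your job):

import shutil

# Hypothetical local directory for intermediate map output; check the
# mapred.local.dir setting in your cluster configuration for the real path.
LOCAL_DIR = "/tmp/hadoop/mapred/local"

# Rough guess: intermediate shuffle data is on the order of the input size
# (more if your map output is larger than its input).
input_size_gb = 86
estimated_intermediate_gb = input_size_gb * 1.0

free_gb = shutil.disk_usage(LOCAL_DIR).free / 1e9
print(f"free on {LOCAL_DIR}: {free_gb:.1f} GB, "
      f"estimated intermediate data: {estimated_intermediate_gb:.1f} GB")
if free_gb < estimated_intermediate_gb:
    print("Intermediate data may not fit; expect failed fetches and retries.")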

Related

How to avoid Kafka latency spikes caused by log segment flush

We're experiencing big latency spikes (two orders of magnitude) at the 99th percentile in our Kafka deployment. We googled a bit and found that this is a pretty well-documented phenomenon: https://issues.apache.org/jira/browse/KAFKA-9693
In the ticket, the suggested "solution" is to disable log flush, but that's hardly acceptable if you care about data consistency.
We've tried tuning log sizes, flush intervals, etc., but that only delays the log flush and does nothing about the magnitude of the spike.
Question
Is there any real solution/workaround to this problem? To be clear, I'm asking how to bring the spike down to a minimum.
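When iterating on flush-related broker settings such as log.flush.interval.messages and log.flush.interval.ms, it helps to have a repeatable measurement of the tail latency. A minimal sketch using the kafka-python client; the broker address, topic name, payload size, and message count are placeholders, not details from the original post:

import time
import statistics
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

latencies_ms = []
for _ in range(10_000):
    t0 = time.perf_counter()
    # Synchronous send: wait for the broker ack so we measure end-to-end latency.
    producer.send("latency-test", b"x" * 512).get(timeout=30)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

# statistics.quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
print("p99 produce latency: %.1f ms" % statistics.quantiles(latencies_ms, n=100)[98])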

How to improve AWS Glue's performance?

I have a simple job on AWS Glue that takes more than 25 minutes. I changed the number of DPUs from 10 to 100 (the max allowed), and the job still takes 13 minutes.
Any other suggestions on improving the performance?
I've noticed the same behavior.
My understanding is that the job time includes spinning up an EMR cluster, which takes several minutes. So if that takes, say, 8 minutes (just a guess), then your actual job time went from 17 minutes to 5.
Unless CPU or memory was a bottleneck for your existing job, adding more DPUs (i.e. more CPU and memory) won't benefit your job significantly. At the very least the benefits will not be linear: 10 times more DPUs doesn't mean the job will run 10 times faster.
I suggest that you gradually increase the number of DPUs and look at the performance gains; you will notice that after a certain point adding more DPUs has no major impact on performance, and that is probably the right amount of DPUs for your job.
Can we take a look at your job? Sometimes simple may not be performant. We've found that simple things like the DynamicFrame.map transformation can be really slow, and you might be better off registering a temporary table and mapping your data using the SQLContext.
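A rough sketch of that "temp table + SQL" approach, using the standard Glue PySpark APIs; the catalog database/table and field names here are made up for illustration:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Hypothetical catalog entries; replace with your own source.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

# Convert to a Spark DataFrame, register a temp view, and express the
# row-level transformation in SQL instead of a per-record Python map.
dyf.toDF().createOrReplaceTempView("tmp_events")
mapped_df = spark.sql("""
    SELECT id, lower(user_name) AS user_lower, amount * 100 AS amount_cents
    FROM tmp_events
""")

# Back to a DynamicFrame if downstream Glue transforms or sinks need one.
mapped_dyf = DynamicFrame.fromDF(mapped_df, glue_context, "mapped_dyf")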

How long does it take to process the file if I have only one worker node?

Let's say I have data split into 25 blocks and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? And what about 15 nodes? Will the time change if we change the replication factor to 3?
I could really use some help.
First of all, I would advise reading some scientific papers on the topic (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time has a very strong relation to the amount of data you want to process (which makes sense). On our cluster, it takes on average around 7-8 seconds for a Mapper to read a 128 MB block. There are several factors you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which more or less determines the time Hadoop requires for shuffling.
What the Reducer is doing. Does it do iterative processing? (This might be slow!)
What the resource configuration is (how many Mappers and Reducers are allowed to run on the same machine).
Finally, are there other jobs running simultaneously? (This can slow jobs down significantly, since your Reducer slots can sit occupied waiting for data instead of doing useful work.)
So even for a single machine you can see how complex the task of predicting job execution time is. During my study I concluded that on average one machine is capable of processing 20-50 MB/second (the rate is calculated as total input size / total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster, for example). The processing rate differs between use cases and is greatly influenced by the input size and, more importantly, by the amount of data produced by the Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments, you will on average see improved performance, but once again from my study I could conclude that the scaling is not linear: you will need to fit, for your own infrastructure, a model with the respective variables that approximates the job execution time (a toy version of such a model is sketched after this answer).
Just to give you an idea, I will share some of the results. The rate when executing a given use case on 1 node was ~46 MB/second, for 2 nodes it was ~73 MB/second, and for 3 nodes it was ~85 MB/second (in my case the replication factor was equal to the number of nodes).
The problem is complex and requires time, patience, and some analytical skill to solve. Have fun!
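As a starting point for such a model, here is a minimal sketch of a first-order map-phase estimate applied to the numbers in the question (25 blocks, 5 minutes per block). The number of map slots per node and the fixed startup overhead are assumptions to be fitted, and shuffle/reduce time, stragglers, and contention are ignored entirely; changing the replication factor mainly affects data locality, which this model does not capture.

import math

def estimate_job_minutes(blocks, nodes, map_slots_per_node=2,
                         minutes_per_block=5.0, overhead_minutes=1.0):
    # Blocks are processed in waves of (nodes * map_slots_per_node)
    # parallel mappers; each wave takes roughly minutes_per_block.
    waves = math.ceil(blocks / (nodes * map_slots_per_node))
    return waves * minutes_per_block + overhead_minutes

print(estimate_job_minutes(25, nodes=1))    # 13 waves -> about 66 minutes
print(estimate_job_minutes(25, nodes=15))   # 1 wave   -> about 6 minutes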

MongoDB insert performance with 2nd index

I'm trying to insert about 250 million documents that are each roughly 400 bytes into MongoDB 3.0 with WiredTiger. I need to search on only one short string key, _user_lower. Although I'm using WiredTiger now, which is much better than MMAPv1, I did use MMAPv1 first and had similar issues.
My server (a very cheap VPS) has:
250 GB magnetic disk
1 GB RAM
2 GB Swap
2.1 GHz single-core CPU
I know that this machine is really slow, and I'm asking it to do something a bit unrealistic. But I'm confused about how it started so fast with one index, and the second just ruined the performance:
I inserted all the data that I had at the time (about 250M rows) without any index except on _id. This performed very well, considering my awful hardware:
Approximately 5000 inserts per second (totally acceptable)
This rate was nearly constant for the 14 hours it took to complete
The index size on _id once complete was nearly 2.5GB. Note that this is more than double my physical RAM.
The RES of the process didn't exceed 450 MB according to mongostat.
No swapping
top seemed to indicate that CPU time wasn't all being spent waiting for the disk (so a significant amount was spent in userspace, presumably with WiredTiger in the snappy code)
Then I built a (non-unique) index on the only field I need to query by, _user_lower. This took 7.7 hours, which is fine since that's a one-time deal. The index ended up being 1.6 GB, which seems really low to me when compared to the _id index. The RES went up to about 750 MB.
Then, I downloaded a new data set to load. It was only 102 MB (238 K documents). I loaded it in the same way, using mongoimport, but this time:
Only 80 inserts per second (slower at times)
RES stayed at around 750 MB
top says almost 100% of the CPU was spent waiting for IO
Of course, load went through the roof.
I could understand a sizable performance hit, since that index has to be updated. But I didn't expect this much. I've read all over the place that my indexes should fit in RAM, but the performance was great during the initial insert, where the index quickly outgrew my memory.
Can I optimize the _user_lower index at all? I don't know what this would even mean, but maybe only index the first few characters? I'm definitely willing to halve the query performance in exchange for tripling the insert performance.
What accounts for the massive performance hit? How do I fix it without new hardware? I'm not really attached to MongoDB, so alternatives that don't have these performance characteristics are fine. I have an idea that just uses flat files which would probably work but I don't want to write all that code.
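For reference, a minimal sketch of what the "index only the first few characters" idea could look like: MongoDB has no native string prefix index, so the usual workaround is to store a short prefix in its own field and index that instead. The collection, field names, and prefix length below are made up, and the query has to filter on the full value as well:

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["users"]
coll.create_index([("_user_prefix", ASCENDING)])   # shorter keys -> smaller index

def insert_user(doc):
    doc["_user_prefix"] = doc["_user_lower"][:4]
    coll.insert_one(doc)

def find_user(user_lower):
    # The prefix index narrows the candidates; the exact match filters the rest.
    return coll.find({"_user_prefix": user_lower[:4], "_user_lower": user_lower})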
When adding new items to a collection, the database will have to keep the index up-to-date. Since the index in MongoDB is a B-Tree by default, that means it will have to insert an item in the tree. While that isn't a particularly expensive operation in the best case, it comes with two potential performance problems:
performance jitter: from time to time, the B-Tree bucket might be full, requiring a bucket split and hence a lot more operations than the 'simple' insert
the insert destination must be readily available
In this case, the latter is likely to cause trouble: because the insertion of a name hits a random node in the tree (i.e., the name insertion doesn't follow a pattern) and your RAM is smaller than the index, chances are high that the destination must be fetched from disk. Unfortunately, disk seeks are orders of magnitude slower than main-memory references. If you're unlucky, resolving the destination requires yet another disk seek, so a single insert needs multiple disk reads before MongoDB can even begin writing. That can take hundreds of milliseconds, and with spinning disks or some contention on typical IaaS infrastructure, even seconds.
Because ObjectIds are generated monotonically (the timestamp is the most significant part), the insertion always happens at the end and it is possible to keep the destination largely in RAM. Performance jitter, i.e. problem 1 might still be an issue since a bucket split might require a disk seek, but it happens so rarely compared to the first case that it doesn't wreck average performance, which should explain the observed behavior.
Also, when the bucket is filled by a monotonically increasing value, MongoDB will split the bucket when it is 90% filled; with random insertion, splits will happen a lot earlier, at 50%, so the tree is a little more 'dense' in that case.
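One small experiment consistent with this reasoning (an assumption on my part, not something stated in the answer) is to sort each import batch by the indexed key before inserting, so that consecutive index updates land in nearby B-tree nodes rather than hopping randomly around a structure that no longer fits in 1 GB of RAM. A sketch with pymongo, with placeholder collection and field names:

from pymongo import MongoClient

coll = MongoClient()["mydb"]["users"]

def load_batch(docs):
    # Sort by the indexed field to improve locality of index updates.
    docs = sorted(docs, key=lambda d: d["_user_lower"])
    # ordered=False lets the server continue past individual failures.
    coll.insert_many(docs, ordered=False)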

How could I tell if my hadoop config parameter io.sort.factor is too small or too big?

After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we came to the conclusion that our 6-node Hadoop cluster could use some tuning, and io.sort.factor seems to be a good candidate, as it controls an important tradeoff. We're planning on tweaking and testing, but planning ahead and knowing what to expect and what to watch for seems reasonable.
It's currently set to 10. How would we know that it's causing too many merges? When we raise it, how would we know it's causing too many files to be opened?
Note that we can't follow the blog's log extracts directly, as it was written for CDH3b2 while we're working on CDH3u2, and they have changed...
There are a few tradeoffs to consider.
The number of seeks being done when merging files. If you increase the merge factor too high, then the seek cost on disk will exceed the savings from doing a parallel merge (note that the OS cache might mitigate this somewhat).
Increasing the sort factor decreases the amount of data in each partition. I believe the number is io.sort.mb / io.sort.factor for each partition of sorted data. I believe the general rule of thumb is to have io.sort.mb = 10 * io.sort.factor (this is based on the seek latency of the disk relative to the transfer speed, I believe; I'm sure this could be tuned better if it were your bottleneck). If you keep these in line with each other, the seek overhead from merging should be minimized (a back-of-the-envelope calculation is sketched at the end of this answer).
If you increase io.sort.mb, then you increase memory pressure on the cluster, leaving less memory available for job tasks. Memory usage for sorting is mapper tasks * io.sort.mb, so you could find yourself causing extra GCs if this is too high.
Essentially,
If you find yourself swapping heavily, then there's a good chance you have set the sort factor too high.
If the ratio between io.sort.mb and io.sort.factor isn't correct, then you may need to change io.sort.mb (if you have the memory) or lower the sort factor.
If you find that you are spending more time in your mappers than in your reducers, then you may want to increase the number of map tasks and decrease the sort factor (assuming there is memory pressure).
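To make the tradeoffs above concrete, a back-of-the-envelope sketch; all input values are examples rather than measurements from this cluster, and the merge-pass formula is the usual approximation for multi-pass merging, not an exact model of Hadoop's merger:

import math

io_sort_mb = 100          # map-side sort buffer
io_sort_factor = 10       # streams merged at once
mappers_per_node = 8

# Rule of thumb from the answer: io.sort.mb ~= 10 * io.sort.factor.
print("suggested io.sort.mb:", 10 * io_sort_factor)

# Rough size of each sorted partition produced before merging.
print("MB per sorted partition:", io_sort_mb / io_sort_factor)

# Memory pressure per node from the map-side sort buffers alone.
print("sort-buffer MB per node:", mappers_per_node * io_sort_mb)

# If a map task spills N runs, merging them takes about
# ceil(log_{io.sort.factor}(N)) passes over the data.
spill_files = 64
print("merge passes for", spill_files, "spills:",
      math.ceil(math.log(spill_files, io_sort_factor)))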
