How to improve AWS Glue's performance?

I have a simple AWS Glue job that takes more than 25 minutes. I changed the number of DPUs from 10 to 100 (the max allowed), but the job still takes 13 minutes.
Any other suggestions on improving the performance?

I've noticed the same behavior.
My understanding is that the job time includes spinning up an EMR cluster, which takes several minutes. So if spin-up takes, say, 8 minutes (just a guess), then your actual compute time went from 17 minutes down to 5.

Unless CPU or memory was a bottleneck for your existing job, adding more DPUs (i.e. more CPU and memory) won't benefit it significantly. At the very least the benefit will not be linear: 10 times more DPUs doesn't mean the job will run 10 times faster.
I suggest gradually increasing the number of DPUs and measuring the gain at each step; you will notice that beyond a certain point adding more DPUs has no major impact on performance, and that is probably the right amount of DPUs for your job.
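As an illustration, here is a rough sketch of such a sweep using boto3 (the job name and DPU values are hypothetical; note that billing scales with capacity, so stop once the gains flatten out):

    import time
    import boto3

    glue = boto3.client("glue")
    JOB_NAME = "my-glue-job"  # hypothetical job name

    for dpus in (10, 20, 40, 80):
        run_id = glue.start_job_run(JobName=JOB_NAME, MaxCapacity=float(dpus))["JobRunId"]
        # Poll until the run reaches a terminal state.
        while True:
            run = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]
            if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED"):
                break
            time.sleep(30)
        # ExecutionTime is reported in seconds.
        print(dpus, "DPUs ->", run.get("ExecutionTime"), "s")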

Can we take a look at your job? Sometimes simple isn't performant. We've found that seemingly simple things like the DynamicFrame.map transformation are really slow, and you may be better off registering a temp table and mapping your data through the SQLContext instead.
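For instance, a minimal sketch of that workaround (the column names are made up; it assumes dyf is an existing DynamicFrame and glueContext the job's GlueContext):

    from awsglue.dynamicframe import DynamicFrame

    df = dyf.toDF()                        # convert to a plain Spark DataFrame
    df.createOrReplaceTempView("tmp_tbl")  # register a temp table

    # Express the per-record mapping in SQL rather than DynamicFrame.map
    mapped = df.sql_ctx.sql("SELECT id, upper(name) AS name FROM tmp_tbl")

    result = DynamicFrame.fromDF(mapped, glueContext, "mapped")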

Related

Spark stage performance: GC time very high for just a few tasks

I'm trying to tune a Spark application to reduce overall execution time, but I'm seeing strange behaviour during one stage.
Just 14 of 120 tasks need around 20 minutes to finish; the others complete in 4 or 5 minutes.
Looking at the Spark UI, the partitioning seems good; the only difference I see is that GC time is very high for those 14 tasks.
I've attached an image of the situation.
Any ideas for finding a fix for this performance problem?
I had a similar problem and resolved it by using the parallel collector instead of G1GC. You can add the following options to the executors' additional Java options in the submit request:
-XX:+UseParallelGC -XX:+UseParallelOldGC
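If you build the Spark session yourself, a minimal PySpark sketch of the same setting looks like this (in a managed submit you would pass the equivalent spark.executor.extraJavaOptions configuration instead):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf().set(
        "spark.executor.extraJavaOptions",
        "-XX:+UseParallelGC -XX:+UseParallelOldGC",
    )
    spark = SparkSession.builder.config(conf=conf).getOrCreate()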

How long does it take to process a file if I have only one worker node?

Let's say I have a dataset with 25 blocks and a replication factor of 1. A mapper needs about 5 minutes to read and process a single block. How can I calculate the time for one worker node? What about 15 nodes? Will the time change if we raise the replication factor to 3?
I really need help here.
First of all I would advise reading some scientific papers on the topic (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time is very strongly related to the amount of data you want to process (which makes sense). On our cluster, it takes a mapper around 7-8 seconds on average to read a 128 MB block. To predict overall execution time, there are several factors you need to consider:
How much data the mappers produce, which more or less determines the time Hadoop needs for shuffling.
What is the reducer doing? Does it do iterative processing? (That can be slow!)
How are the resources configured? (How many mappers and reducers are allowed to run on the same machine?)
Finally, are there other jobs running simultaneously? (They can slow things down significantly, since reducer slots may sit occupied waiting for data instead of doing useful work.)
So already for one machine you can see the complexity of predicting job execution time. During my study I concluded that on average one machine is capable of processing 20-50 MB/second, where the rate is calculated as: total input size / total job running time. That rate includes staging time (when your application starts and uploads required files to the cluster, for example). It differs between use cases and is greatly influenced by the input size and, more importantly, by the amount of data the mappers produce (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
When you start scaling your experiments you will see improved performance on average, but once again, from my study, the scaling is not linear; you need to fit a model with the respective variables to your own infrastructure to approximate job execution time.
Just to give you an idea, I will share part of my results. The rate when executing a particular use case on 1 node was ~46 MB/second, on 2 nodes ~73 MB/second, and on 3 nodes ~85 MB/second (in my case the replication factor was equal to the number of nodes).
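To make the back-of-the-envelope math concrete, here is a small Python sketch applying the rate formula above to the original question's numbers (it assumes one mapper slot per node and ignores shuffle and reduce time entirely, so treat it as an illustration, not a prediction):

    # Naive estimate from the question: 25 blocks, 5 minutes per block.
    blocks, minutes_per_block = 25, 5
    for nodes in (1, 15):
        waves = -(-blocks // nodes)  # ceiling division: waves of mappers
        print(f"{nodes} node(s): ~{waves * minutes_per_block} min")
    # -> 1 node: ~125 min; 15 nodes: ~10 min (two waves of mappers)

    # My measured rates (MB/s) from above, showing the non-linear scaling:
    measured = {1: 46, 2: 73, 3: 85}
    for n, rate in measured.items():
        print(f"{n} node(s): {rate / (n * measured[1]):.0%} of linear scaling")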
The problem is complex and requires time, patience, and some analytical skill to solve. Have fun!

Performance issue with batch insertion into MarkLogic

I have a requirement to insert 10,000 docs into MarkLogic in less than 10 seconds.
I tested on a single-node MarkLogic server as follows:
use xdmp:spawn to pass each doc-insertion task to the task server;
use xdmp:document-insert without specifying a forest explicitly;
the task server has 8 threads to process tasks;
we have CPF enabled.
The performance is very bad: it took 2 minutes to create the 10,000 docs.
I'm sure performance would be better in a cluster environment, but I'm not sure whether it could finish in under 10 seconds.
Please advise on ways to improve the performance.
I would start by gathering more information. What version of MarkLogic is this? What OS is it running on? What's the CPU? RAM? What's the storage subsystem? How many forests are attached to the database?
Then gather OS-level metrics, to see if one of the subsystems is an obvious bottleneck. For now I won't speculate beyond that.
If you need a fast load, I wouldn't use xdmp:spawn for each individual document, nor use CPF. That said, 2 minutes for 10k docs doesn't necessarily sound slow. On the other hand, I have reached up to 3k docs/sec, but without range indexes, transforms, or anything like that, and with a very fast disk (e.g. SSD).
HTH!
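As an illustration of batching outside the task server, here is a rough Python sketch that pushes documents through the MarkLogic REST API with a small thread pool (host, port, credentials, and URIs are all made up, and it assumes a REST instance is already set up; a dedicated bulk-load tool such as mlcp would be the more usual choice):

    import json
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from requests.auth import HTTPDigestAuth

    BASE = "http://localhost:8000/v1/documents"  # hypothetical REST instance
    AUTH = HTTPDigestAuth("admin", "admin")      # hypothetical credentials

    def insert(i):
        # One PUT per document; reusing a session per thread would cut handshakes.
        r = requests.put(
            BASE,
            params={"uri": f"/load-test/doc-{i}.json"},
            data=json.dumps({"n": i}),
            headers={"Content-Type": "application/json"},
            auth=AUTH,
        )
        r.raise_for_status()

    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(insert, range(10_000)))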
Assuming a 2-socket server, 128-256 GB of RAM, and fast IO (400-800 MB/sec sustained):
an appropriate number of forests (12 primary, or 6 primary/6 secondary);
more than 8 threads, assuming enough cores;
CPF off.
Turn on performance history and look at the metrics, and you will see where the bottleneck is.
An SSD is not required, just IO throughput, which multiple spinning disks provide without issue.

Why do Pentaho jobs run through Kitchen use a lot of CPU resources?

Could you please give a brief explanation of what happens when kitchen.bat calls a job?
I can only guess that it instantiates the job, and that is probably why Task Manager spikes whenever I call 5 jobs at the same time. After a couple of seconds, the spikes wind down.
Or maybe not? Are there other reasons why calling jobs through Kitchen uses a lot of resources?
Are there ways to save CPU while still taking advantage of parallelism (calling the jobs all at the same time)? Are there optimizations that can be done?
How exactly are you calling 5 jobs at the same time? In a shell script? In that case the spike is because you're starting 5 JVMs at once, and starting a JVM is relatively expensive. There should be no need to do this: you can do it all in one JVM and handle the parallelisation inside the job, as sketched below.
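For illustration (the job file paths are hypothetical, and the /file option syntax shown is the Windows form; adjust for your setup), the difference is between launching five JVMs:

    kitchen.bat /file:C:\jobs\job1.kjb
    (... job2 through job4 likewise ...)
    kitchen.bat /file:C:\jobs\job5.kjb

and launching one parent job, built in Spoon with its entries set to run in parallel, started once:

    kitchen.bat /file:C:\jobs\run_all_parallel.kjb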
Kitchen itself doesn't use a lot of resources. If your transformation has a large number of steps, getting it going can take some time, but not ages.
Is this really a problem? Why does it matter if your CPU spikes for a couple of seconds? The point of parallelism is generally to max out the CPU/box/resource!

Optimal number of Resque workers for maximum performance

I am using Resque to achieve cheap parallelism in my academic research: I split huge tasks into relatively small independent portions and submit them to Resque. These tasks do some heavy lifting, making extensive use of both the database (MongoDB, if that matters) and the CPU.
All of this works extremely slowly: for a relatively small portion of my dataset, 1,000 jobs get created, and 14 hours of constant work by 2 workers is enough to finish only ~800 of them. As you might suspect, this speed is more than frustrating.
I have a quad-core processor (a Core i5 of some sort, not high-end), and apart from the Mongo instance and the Resque workers, nothing else occupies the CPU for any considerable period.
Now that you know my story, all I am asking is: how do I squeeze the maximum out of this setup? I believe that 3 workers + 1 Mongo instance would quickly fill up all the cores, but at the same time Mongo doesn't have to work all the time.
