Understanding Apache Spark Web UI performance metrics

I'm new to Spark and I'm trying to understand the metrics in the Web UI that relate to my Spark application (developed through the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them give a general overview of the Web UI: definitions of stage/job/task, how to tell when something is not working properly (e.g. unbalanced work between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the things in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood well, the default max partition size is 128 MB. Since my dataset is 1378 MB, I get 11 tasks that each work on roughly 128 MB, right? And since in the first stage I did some filtering (before applying the groupBy), tasks write to memory, so Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy. Is the number of tasks related to the shape of my dataset, or to its size?
UPDATE: I found this question, spark.sql.shuffle.partitions of 200 default partitions conundrum, and some other questions, but now I'm wondering whether there is a specific reason for it to be 200.
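For reference, here is roughly what my query looks like and how the shuffle partition count could be changed (a minimal Java sketch; the input path is hypothetical, Col1/Col2 are the columns from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShufflePartitionsDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("shuffle-partitions-demo")
                .getOrCreate();

        // 200 is simply the default of spark.sql.shuffle.partitions;
        // lowering it changes the task count of every post-shuffle stage.
        spark.conf().set("spark.sql.shuffle.partitions", "48");

        // Hypothetical input path; the 11 read tasks come from splitting
        // the ~1378 MB input into ~128 MB partitions.
        Dataset<Row> ds = spark.read().parquet("/path/to/dataset");

        ds.groupBy("Col1", "Col2")
          .count()
          .orderBy("Col1", "Col2")
          .show(); // show() triggers the job; 20 rows by default
    }
}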
Stage 0
Why do some tasks have result serialization time here? If I understood well, serialization is related to the output, so to actions like show(), count(), collect(), etc. But those actions are not present in this stage (it comes before the groupBy).
Stage 1
Is it normal that result serialization time is such a huge part here? I called show() (which takes 20 rows by default, and there is an orderBy), so do all tasks run in parallel and that one serialized all of its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount of Shuffle Read Time. Again, is it something related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it here: this is stage 1, and it was already present in stage 0.
Job 1 - caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9 MB / 765 [input size / records], while in the first query it shows 1378.8 MB / 7647431 [input size / records].
I guess it has 11 tasks because the size of the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really related to records/rows, right?
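For context, this is roughly how the dataset is cached and reused across the queries (a minimal Java sketch with a hypothetical input path). A plausible explanation, still to be verified, is that later jobs read from the in-memory cache, so the input size they report refers to the cached blocks rather than the original file:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheAcrossQueriesDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cache-demo")
                .getOrCreate();

        // Hypothetical input path; cache() marks the dataset for in-memory storage.
        Dataset<Row> ds = spark.read().parquet("/path/to/dataset").cache();

        // The first action reads the full file (1378.8 MB) and populates the cache.
        ds.groupBy("Col1", "Col2").count().orderBy("Col1", "Col2").show();

        // Subsequent queries are served from the cache, so the Web UI reports
        // the cached data as their input rather than the original file.
        ds.filter("Col1 IS NOT NULL").count();
        ds.select("Col2").distinct().count();
    }
}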
Thanks for reading.

Related

Design Pattern - Spring KafkaListener processing 1 million records in 1 hour

My Spring Boot application is going to listen to 1 million records an hour from a Kafka broker. The entire processing logic for each message takes 1-1.5 seconds, including a database insert. The broker has 64 partitions, which is also the concurrency of my @KafkaListener.
My current code is only able to process 90 records per minute in a lower environment, where I am listening to around 50k records an hour. Below is the code; all other config parameters like max.poll.records etc. are default values:
@KafkaListener(id = "xyz-listener", concurrency = "64", topics = "my-topic")
public void listener(String record) {
    // processing logic
}
I do get "it is likely that the consumer was kicked out of the group" 7-8 times an hour. I think both of these issues can be solved by isolating the listener method and multithreading the processing of each message, but I am not sure how to do that.
There are a few points to consider here. First, 64 consumers seems a bit too much for a single application to handle consistently.
Considering each poll by default fetches up to 500 records per consumer at a time, your app might be getting overloaded and causing the consumers to get kicked out of the group if a single batch takes longer than the 5 minute default of max.poll.interval.ms to be processed.
So first, I'd consider scaling the application horizontally so that each application handles a smaller number of partitions/threads.
A second way to increase throughput would be using a batch listener, and handling processing and DB insertions in batches as you can see in this answer.
Using both, you should be processing a sensible amount of work in parallel per app, and should be able to achieve your desired throughput.
Of course, you should load test each approach with different figures to have proper metrics.
EDIT: Addressing your comment, if you want to achieve this throughput I wouldn't give up on batch processing just yet. If you do the DB operations row by row you'll need a lot more resources for the same performance.
If your rule engine doesn't do any I/O you can iterate each record from the batch through it without losing performance.
About data consistency, you can try some strategies. For example, you can have a lock to ensure that even through a rebalance only one instance will process a given batch of records at a given time - or perhaps there's a more idiomatic way of handling that in Kafka using the rebalance hooks.
With that in place, you can batch load all the information you need to filter out duplicated / outdated records when you receive the records, iterate each record through the rule engine in memory, and then batch persist all results, to then release the lock.
Of course, it's hard to come up with an ideal strategy without knowing more details about the process. The point is by doing that you should be able to handle around 10x more records within each instance, so I'd definitely give it a shot.
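To make the batch idea concrete, here is a rough sketch of what a batch listener with a batched DB insert could look like (the listener id, topic, table name and concurrency are illustrative, and a batch-enabled listener container is assumed, e.g. spring.kafka.listener.type=batch):

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class BatchRecordListener {

    private final JdbcTemplate jdbcTemplate;

    public BatchRecordListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Receives a whole poll (up to max.poll.records messages) per invocation.
    @KafkaListener(id = "xyz-batch-listener", topics = "my-topic", concurrency = "8")
    public void listen(List<String> records) {
        // Run the in-memory rule engine per record here if needed...

        // ...then persist all results in a single batched round trip
        // instead of one insert per row.
        jdbcTemplate.batchUpdate(
                "INSERT INTO results (payload) VALUES (?)", // hypothetical table
                records.stream()
                       .map(r -> new Object[] { r })
                       .collect(Collectors.toList()));
    }
}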

NiFi: MergeRecord is generating duplicates

I'm having some trouble with the MergeRecord processor in NiFi. You can see the whole NiFi flow below: I'm getting a JSON array from an API, then I split it, apply some filters, and then I want to build the JSON array again.
Nifi workflow
I'm able to build the right JSON array from all the chunks, but the problem is that the processor keeps generating data indefinitely. When I execute the job step by step (by starting/stopping every processor one by one) everything is fine, but when MergeRecord is running it keeps generating the same data even if I stop the beginning of the flow (so there are no more inputs...).
You can see a screenshot below of the data stacking up in the "merged" box.
data stacked
I scheduled this processor every 10 sec, and after 30 sec you can see that it executed 3 times and generated the same file 3 times, even though there is no more data upstream. It's weird because when I look at the "original" box of the processor I can see the right original amount of data (18.43 KB). But the merged part is still increasing...
Here is the configuration of the MergeRecord:
configuration
I suppose that I'm missing something, but I don't know what!
Thank you for your help,
Regards,
Thomas

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming back again and again, even though the query only returns 2 documents. The query is {"qty":{$gt:10}}.
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

running data science clustering job with a parse.com background job

So I have a mobile app that wants to use Parse for login, user data, content, etc., but I also need to run an hourly k-means clustering job on my entire user data set. I was looking at Parse background jobs as a possible solution. My question: since the clustering algorithms will probably take up a lot of memory (they will need to load all the users into memory), will it be possible or useful to use Parse for this, or to run map-reduce jobs with the background jobs... or is this really beyond the means of Parse, and should I look at setting up my own backend instead of using a backend-as-a-service?
Parse offers background jobs https://parse.com/docs/cloud_code_guide#jobs
As long as the job completes within 15 minutes, it is fine.
However, there is an open bug on Parse that hasn't been fixed over the last 1.5 months. We believe that this is actually a memory issue (and if that's the case you might run into it, as well). Here's the bug ID: https://developers.facebook.com/bugs/1586656868273252/

Writing high volume reducer output to HBase

I have a Hadoop MapReduce job whose output is a row-id with a Put/Delete operation for that row-id. Due to the nature of the problem, the output is rather high volume. We have tried several methods to get this data back into HBase and they have all failed...
Table Reducer
This is way too slow, since it seems that it must do a full round trip for every row. Due to how the keys sort for our reducer step, the row-id is not likely to be on the same node as the reducer.
completebulkload
This seems to take a long time (never completes) and there is no real indication of why. Both IO and CPU show very low usage.
Am I missing something obvious?
I saw from your self-answer that you solved your problem, but for completeness I'd mention that there's another option - writing directly to HBase. We have a setup where we stream data into HBase, and with proper key and region splitting we get to more than 15,000 1K records per second per node.
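As a rough illustration of that direct-write path (the table, column family and qualifier names below are made up), something along these lines, using the HBase client's BufferedMutator to batch puts client-side:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectHBaseWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // BufferedMutator buffers Puts client-side and flushes them in
             // batches, avoiding one round trip per row.
             BufferedMutator mutator =
                     connection.getBufferedMutator(TableName.valueOf("my_table"))) {

            for (int i = 0; i < 100_000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.addColumn(Bytes.toBytes("cf"),
                              Bytes.toBytes("q"),
                              Bytes.toBytes("value-" + i));
                mutator.mutate(put);
            }
            mutator.flush(); // push any remaining buffered mutations
        }
    }
}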
CompleteBulkLoad was the right answer. Per @DonaldMiner's suggestion I dug deeper and found out that the CompleteBulkLoad process was running as "hbase", which resulted in a permission denied error when trying to move/rename/delete the source files. The implementation appears to retry for a long time before giving an error message; up to 30 minutes in our case.
Giving the hbase user write access to the files resolved the issue.
