Minimal Spark session/configuration for optimal unit test performance?

Spark 2.1.0 and Scala 2.11.
Our project has hundreds of unit tests that perform relatively simple operations such as creating data sets of 3 or 4 objects and performing simple transformations on them. Many of these tests take as long as 5-10 seconds to run, which with hundreds of tests adds up to many minutes and is becoming a problem for our CI build. The operations are so simple that I wonder if there is a Spark configuration we can use to speed things up.
For example simply creating a data set like this:
// toDS() needs the session's implicits in scope (e.g. import spark.implicits._)
val histData = Seq(
  FooType(id = "id1", code = "code1", orgId = 1L),
  FooType(id = "id2", code = "code2", orgId = 1L),
  FooType(id = "id3", code = "code3", orgId = 1L)
).toDS()
takes 800 msec (FooType is a case class). After creating 2 or 3 data sets like this and a few filter / map / join operations (I really don't think those details matter but if you do let me know and I will post them), the collect() takes 1000-2000 msec. Add up a few operations like this and a test can take 5-10 seconds.
For the unit tests we are only concerned with the functional aspects of the code; we do not need threading, scaling, caching, on-disk storage, etc. The test data is small (usually less than 1 KB) and is created in memory (not read from disk or any external source), and assertions are performed on the transformed objects in memory. I understand that behind the scenes Spark may be invoking the DAGScheduler, code generator, etc., and I wonder if there is a way to execute the jobs without that functionality. Or, if that does have to be done, is there a way to do it once at the start of the unit test suite and reuse it throughout?
The session is created with something like this:
session = SparkSession.builder.config("spark.sql.shuffle.partitions", "10").getOrCreate()
and the same session is used in each unit test. We invoke the Spark API methods directly from the unit tests, so there is no spark-submit or separate process or job; everything runs in the JVM created by either the IDE or Gradle to run the unit tests.
It just seems to me that these operations should be taking a few msec each, and I'm looking for a way to pare down the Spark configuration so it evaluates everything in memory in the fastest way possible. Thanks for any tips or ideas.

Related

Input formats and strategies for Junit testing

My Java application has almost 800 JUnit 5 test cases, and my team is adding around 30 new cases a week. This component is the heart of the whole system, so every case needs a lot of input data objects. Initially it was designed to read the input data objects (the main objects) from JSON files, which means reading at least 2 or 3 JSON files in the #beforeClass method of every test class. We are expected to add more workflows this year, and reading the input data from JSON files does not look like a viable solution. These many reads also cause performance issues and increase the build time.
Can someone suggest alternative approaches to preparing the input data for JUnit test cases?
We would rather not prepare the data manually in each test class, as this would be a huge effort: one of the JSON files in each test case is around 200 lines of data. (We use Gradle 6.x and JUnit 5.)

Control parallelism in Apache Beam Dataflow pipeline

We are experimenting with Apache Beam (using the Go SDK) and Dataflow to parallelize one of our time-consuming tasks. For a little more context, we have a caching job that takes some queries, runs them against a database, and caches the results. Each database query may take from a few seconds to many minutes, and we want to run them in parallel for quicker task completion.
We created a simple pipeline that looks like this:
// Create initial PCollection.
startLoad := beam.Create(s, "InitialLoadToStartPipeline")
// Emits a unit of work along with query and date range.
cachePayloads := beam.ParDo(s, &getCachePayloadsFn{Config: config}, startLoad)
// Emits a cache response which includes errCode, errMsg, time etc.
cacheResponses := beam.ParDo(s, &cacheQueryDoFn{Config: config}, cachePayloads)
...
The number of units that getCachePayloadsFn emits is not large: mostly in the hundreds, and a few thousand at most in production.
Now the issue is that cacheQueryDoFn is not executed in parallel; the queries run sequentially, one by one. We confirmed this by logging the goroutine ID, process ID, start and end time, etc. in StartBundle and ProcessElement of the caching function, and there is no overlap in execution.
We want the queries to always run in parallel, even if there are just 10 of them. From our understanding of the documentation, Beam creates bundles from the overall input; those bundles run in parallel, and elements within a bundle run sequentially. Is there a way to control the number of bundles created from the load, or any other way to increase parallelism?
Things we tried:
Keeping num_workers=2 and autoscaling_algorithm=None. It starts two VMs but runs the Setup method to initialize the DoFn on only one VM and uses that one for the entire load.
Found the sdk_worker_parallelism option here, but we are not sure how to set it correctly. We tried setting it with beam.PipelineOptions.Set("sdk_worker_parallelism", "50"), with no effect.
By default, the Create is not parallel and all the DoFns are being fused into the same stage as the Create, so they also have no parallelism. See https://beam.apache.org/documentation/runtime/model/#dependent-parallellism for some more info on this.
You can explicitly force a fusion break with the Reshuffle transform.

Understanding Apache Spark Web UI performance metrics

I'm new to Spark and I'm trying to understand the Web UI metrics related to my Spark application (developed with the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them gave a general overview of the Web UI: the definition of stage/job/task, how to tell when something is not working properly (e.g. unbalanced work between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the things in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood correctly, the default maximum partition size is 128 MB. Since my dataset is 1378 MB, I get 11 tasks that each work on up to 128 MB, right? And since I did some filtering in the first stage (before applying groupBy), the tasks write in memory, so Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy. Is the number of tasks related to the contents of my dataset or to its size?
UPDATE: I found this spark.sql.shuffle.partitions of 200 default partitions conundrum and some other questions, but now I'm wondering if there is a specific reason for it to be 200?
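As a rough sanity check on the task counts discussed above, the 11 tasks in the first stage match a ceiling division of the input size by a 128 MB split size. The sketch below is plain Java arithmetic, not a Spark API; the class and method names are made up for illustration, and it assumes a flat 128 MB split while ignoring anything else that may influence how Spark sizes input partitions. The 200 tasks in the second stage do not come from this calculation; per the update above, they correspond to the default value of spark.sql.shuffle.partitions.

```java
// Hypothetical helper for illustration only; not part of any Spark API.
class PartitionEstimate {

    // Number of fixed-size splits needed to cover the input (ceiling division).
    static long inputTasks(double inputMb, double splitMb) {
        return (long) Math.ceil(inputMb / splitMb);
    }

    public static void main(String[] args) {
        // 1378.8 MB of input with an assumed 128 MB split size.
        System.out.println(inputTasks(1378.8, 128.0)); // prints 11
    }
}
```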
Stage 0
Why do some tasks have result serialization here? If I understood correctly, serialization is related to the output, so to any show(), count(), collect(), etc. But those actions are not present in this stage (it is before the groupBy).
Stage 1
Is it normal that result serialization time takes up such a large part? I called show() (which takes 20 rows by default, and there is an orderBy), so all tasks ran in parallel and that one task serialized all its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount of Shuffle Read Time. Again, is it something related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it here, since this is stage 1 and it was already present in stage 0.
Job 1- caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9MB / 765 [input size/records] ... In the first query it shows 1378.8 MB / 7647431 [input size/records].
I guess that it has 11 tasks since the size of the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really related to records/rows, right?
Thanks for reading.

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that lets its users optionally run long-running processes 'in the background'. An example would be generating a long-running report, or deleting thousands of objects at once.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
ThreadFactoryBuilder()
.setNameFormat(clientId + "-BackgroundTask-%d")
.setDaemon(true)
.setPriority(Thread.MIN_PRIORITY)
.build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I make my webserver always prioritize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: it should never happen that a user has to wait a long time while browsing the site just because a lot of background tasks are executing. As you can see above, I tried to achieve this by setting .setPriority(Thread.MIN_PRIORITY). However, that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
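The per-customer setup described above could be sketched roughly like this, using only java.util.concurrent. The class name ClientExecutors and its method names are hypothetical, and a plain ThreadFactory lambda stands in for the ThreadFactoryBuilder used earlier. Note that this only caps how many background tasks each customer runs at once; it does not by itself keep the CPU free for incoming requests.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical registry: one fixed-size pool per client, created lazily.
class ClientExecutors {
    private static final int THREADS_PER_CLIENT = 5;

    private final ConcurrentHashMap<String, ExecutorService> pools =
            new ConcurrentHashMap<>();

    // Returns the client's pool, creating it on first use.
    ExecutorService forClient(String clientId) {
        return pools.computeIfAbsent(clientId, id ->
                Executors.newFixedThreadPool(THREADS_PER_CLIENT, daemonFactory(id)));
    }

    // Daemon threads at minimum priority, named after the client.
    private static ThreadFactory daemonFactory(String clientId) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread thread = new Thread(runnable,
                    clientId + "-BackgroundTask-" + counter.getAndIncrement());
            thread.setDaemon(true);
            thread.setPriority(Thread.MIN_PRIORITY);
            return thread;
        };
    }

    // Stops accepting new tasks on every pool; already submitted tasks still run.
    void shutdownAll() {
        pools.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) throws Exception {
        ClientExecutors executors = new ClientExecutors();
        Future<String> threadName = executors.forClient("acme")
                .submit(() -> Thread.currentThread().getName());
        System.out.println(threadName.get());
        executors.shutdownAll();
    }
}
```

Each customer can then have at most 5 tasks running at a time, and further submissions wait in that customer's own queue rather than competing with other customers for a shared pool.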
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices, and while it sounds great, I see a big challenge in splitting the necessary parts out of our monolithic application, mostly because nearly every operation might turn into a long-running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. wouldn't the server running the microservice suffer the same performance degradation? Well, the only good thing would be that the rest of the web app would no longer suffer from it.
I've read some posts about introducing Thread.sleep(1) or Thread.sleep in general into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control about how much cpu would be used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would strongly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-serving application. In general, I think it is an anti-pattern to handle both batch and online (or long-running and short-running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

How to write rows asynchronously in Spark Streaming application to speed up batch execution?

I have a Spark job where I need to write the output of a SQL query every micro-batch. The write is an expensive operation performance-wise and is causing the batch execution time to exceed the batch interval.
I am looking for ways to improve the performance of write.
Is doing the write action in a separate thread asynchronously like shown below a good option?
Would this cause any side effects because Spark itself executes in a distributed manner?
Are there other/better ways of speeding up the write?
import java.util.concurrent.Executors

// Create a fixed thread pool to execute asynchronous tasks
val executorService = Executors.newFixedThreadPool(2)
dstream.foreachRDD { rdd =>
  import org.apache.spark.sql._
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  import spark.sql
  val records = rdd.toDF("record")
  records.createOrReplaceTempView("records")
  val result = spark.sql("select * from records")
  // Submit an asynchronous task to write
  executorService.submit {
    new Runnable {
      override def run(): Unit = {
        result.write.parquet(output)
      }
    }
  }
}
1 - Is doing the write action in a separate thread asynchronously like shown below a good option?
No. The key to understanding the issue here is to ask 'who is doing the write'. The write is done by the resources allocated for your job on the executors in a cluster. Placing the write command on an async thread pool is like adding a new office manager to an office with a fixed staff. Will two managers be able to do more work than one alone, given that they have to share the same staff? Well, one reasonable answer is "only if the first manager was not giving them enough work, so there's some free capacity".
Going back to our cluster, we are dealing with a write operation that is heavy on IO. Parallelizing write jobs will lead to contention for IO resources, making each independent job longer. Initially, our job might look better than the 'single manager version', but trouble will eventually hit us.
I've made a chart that attempts to illustrate how that works. Note that the parallel jobs will take longer proportionally to the amount of time that they are concurrent in the timeline.
Once we reach that point where jobs start getting delayed, we have an unstable job that will eventually fail.
2- Would this cause any side effects because Spark itself executes in a distributed manner?
Some effects I can think of:
Probably higher cluster load and IO contention.
Jobs are queuing on the thread pool queue instead of on the Spark Streaming queue. We lose the ability to monitor our job through the Spark UI and monitoring API, as the delays are 'hidden' and all is fine from the Spark Streaming point of view.
3- Are there other/better ways of speeding up the write?
(ordered from cheap to expensive)
If you are appending to a parquet file, create a new file often. Appending gets expensive with time.
Increase your batch interval or use Window operations to write larger chunks of Parquet. Parquet likes large files.
Tune the partition and distribution of your data => make sure that Spark can do the write in parallel
Increase cluster resources, add more nodes if necessary
Use faster storage
Is doing the write action in a separate thread asynchronously like shown below a good option?
Yes. It's certainly something to consider when optimizing expensive queries and saving their results to external data stores.
Would this cause any side effects because Spark itself executes in a distributed manner?
Don't think so. SparkContext is thread-safe and promotes this kind of query execution.
Are there other/better ways of speeding up the write?
YES! That's the key to understanding when to use the other options (above). By default, Spark applications run in FIFO scheduling mode.
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
That means that to make room for executing multiple writes asynchronously and in parallel, you should configure your Spark application to use FAIR scheduling mode (using the spark.scheduler.mode property).
You will have to configure so-called Fair Scheduler Pools to "partition" executor resources (CPU and memory) into pools that you can assign to jobs using the spark.scheduler.pool property.
Quoting Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.
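To make the difference between the two quoted scheduling modes concrete, here is a toy model in plain Java. It is not Spark's scheduler: it assumes a single execution slot, tasks that each take one time step, and hypothetical class and method names. It only shows why a short job submitted behind a long one finishes much earlier when tasks are handed out in turn.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model for illustration only; this is not how Spark is implemented.
class SchedulingToy {

    // One slot runs every task of job 0, then every task of job 1, and so on.
    // Returns the time step at which each job finishes.
    static int[] fifoFinishTimes(int[] tasksPerJob) {
        int[] finish = new int[tasksPerJob.length];
        int clock = 0;
        for (int job = 0; job < tasksPerJob.length; job++) {
            clock += tasksPerJob[job];
            finish[job] = clock;
        }
        return finish;
    }

    // The same slot, but it runs one task per unfinished job in turn.
    static int[] roundRobinFinishTimes(int[] tasksPerJob) {
        int[] remaining = tasksPerJob.clone();
        int[] finish = new int[tasksPerJob.length];
        Deque<Integer> queue = new ArrayDeque<>();
        for (int job = 0; job < remaining.length; job++) {
            if (remaining[job] > 0) {
                queue.add(job);
            }
        }
        int clock = 0;
        while (!queue.isEmpty()) {
            int job = queue.poll();
            clock++;                     // run one task of this job
            remaining[job]--;
            if (remaining[job] == 0) {
                finish[job] = clock;     // job is done
            } else {
                queue.add(job);          // back of the line for its next task
            }
        }
        return finish;
    }

    public static void main(String[] args) {
        int[] jobs = {100, 3}; // a long job submitted first, then a short one
        System.out.println("FIFO:        " + Arrays.toString(fifoFinishTimes(jobs)));
        System.out.println("Round robin: " + Arrays.toString(roundRobinFinishTimes(jobs)));
    }
}
```

With a 100-task job queued ahead of a 3-task job, the short job finishes at step 103 under FIFO but at step 6 under round robin, while the long job finishes at step 100 and step 103 respectively.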

Resources