How to write rows asynchronously in Spark Streaming application to speed up batch execution? - performance

I have a spark job where I need to write the output of the SQL query every micro-batch. Write is a expensive operation perf wise and is causing the batch execution time to exceed the batch interval.
I am looking for ways to improve the performance of write.
Is doing the write action in a separate thread asynchronously like shown below a good option?
Would this cause any side effects because Spark itself executes in a distributed manner?
Are there other/better ways of speeding up the write?
// Create a fixed thread pool to execute asynchronous tasks
val executorService = Executors.newFixedThreadPool(2)
dstream.foreachRDD { rdd =>
import org.apache.spark.sql._
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
import spark.implicits._
import spark.sql
val records = rdd.toDF("record")
records.createOrReplaceTempView("records")
val result = spark.sql("select * from records")
// Submit a asynchronous task to write
executorService.submit {
new Runnable {
override def run(): Unit = {
result.write.parquet(output)
}
}
}
}

1 - Is doing the write action in a separate thread asynchronously like shown below a good option?
No. The key to understand the issue here is to ask 'who is doing the write'. The write is done by the resources allocated for your job on the executors in a cluster. Placing the write command on an async threadpool is like adding a new office manager to an office with a fixed staff. Will two managers be able to do more work than one alone given that they have to share the same staff? Well, one reasonable answer is "only if the first manager was not giving them enough work, so there's some free capacity".
Going back to our cluster, we are dealing with a write operation that is heavy on IO. Parallelizing write jobs will lead to contention for IO resources, making each independent job longer. Initially, our job might look better than the 'single manager version', but trouble will eventually hit us.
I've made a chart that attempts to illustrate how that works. Note that the parallel jobs will take longer proportionally to the amount of time that they are concurrent in the timeline.
Once we reach that point where jobs start getting delayed, we have an unstable job that will eventually fail.
2- Would this cause any side effects because Spark itself executes in a distributed manner?
Some effects I can think of:
Probably higher cluster load and IO contention.
Jobs are queuing on the Threadpool queue instead of on the Spark Streaming Queue. We loose the ability to monitor our job through the Spark UI and monitoring API, as the delays are 'hidden' and all is fine from the Spark Streaming point of view.
3- Are there other/better ways of speeding up the write?
(ordered from cheap to expensive)
If you are appending to a parquet file, create a new file often. Appending gets expensive with time.
Increase your batch interval or use Window operations to write larger chunks of Parquet. Parquet likes large files
Tune the partition and distribution of your data => make sure that Spark can do the write in parallel
Increase cluster resources, add more nodes if necessary
Use faster storage

Is doing the write action in a separate thread asynchronously like shown below a good option?
Yes. It's certainly something to consider when optimizing expensive queries and saving their results to external data stores.
Would this cause any side effects because Spark itself executes in a distributed manner?
Don't think so. SparkContext is thread-safe and promotes this kind of query execution.
Are there other/better ways of speeding up the write?
YES! That's the key to understand when to use the other (above) options. By default, Spark applications run in FIFO scheduling mode.
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
That means that to make a room for executing multiple writes asynchronously and in parallel you should configure your Spark application to use FAIR scheduling mode (using spark.scheduler.mode property).
You will have to configure so-called Fair Scheduler Pools to "partition" executor resources (CPU and memory) into pools that you can assign to jobs using spark.scheduler.pool property.
Quoting Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.

Related

Control parallelism in Apache Beam Dataflow pipeline

We are experimenting with Apache Beam (using Go SDK) and Dataflow to parallelize one of our time consuming tasks. For little more context, we have caching job which takes some queries, runs it across database and caches them. Each database query may take few seconds to many minutes and we want to run those in parallel for quicker task completion.
Created a simple pipeline that looks like this:
// Create initial PCollection.
startLoad := beam.Create(s, "InitialLoadToStartPipeline")
// Emits a unit of work along with query and date range.
cachePayloads := beam.ParDo(s, &getCachePayloadsFn{Config: config}, startLoad)
// Emits a cache response which includes errCode, errMsg, time etc.
cacheResponses := beam.ParDo(s, &cacheQueryDoFn{Config: config}, cachePayloads)
...
The number units which getCachePayloadsFn emits are not a lot and will be mostly in hundreds and max few thousands in production.
Now the issue is cacheQueryDoFn is not getting executed in parallel and queries are getting executed sequentially one by one. We confirmed this by putting logs in StartBundle and ProcessElement by logging goroutine id, process id, start and end time etc in caching function to confirm that there is no overlap in execution.
We would want to run the queries always in parallel even if there are just 10 queries. From our understanding and documentations, it creates bundles from the overall input and those bundles run in parallel and within bundle it runs sequentially. Is there a way to control the number of bundles from the load or any way to increase parallelism?
Things we tried:
Keeping num_workers=2 and autoscaling_algorithm=None. It starts two VMs but runs Setup method to initialize DoFn on only one VM and uses that for entire load.
Found sdk_worker_parallelism option here. But not sure how to correctly set it. Tried setting it with beam.PipelineOptions.Set("sdk_worker_parallelism", "50"). No effect.
By default, the Create is not parallel and all the DoFns are being fused into the same stage as the Create, so they also have no parallelism. See https://beam.apache.org/documentation/runtime/model/#dependent-parallellism for some more info on this.
You can explicitly force a fusion break with the Reshuffle transform.

Spring Scheduling Quartz and thousands of jobs

According to the business logic of my Spring Boot application with Quartz Scheduling and MongoDB as Job persistent storage, every user of the system can create the postponed job that must be executed at some point in time. The user chooses the time when it must be executed.
Right now I'm thinking about the approach where every user will create a dedicated JobDetail for every postponed job, something like this:
schedulerFactoryBean.getScheduler().addJob(jobDetail(), true, true);
The issue I can potentially see here, that with this approach I can quickly create thousands of jobs in Quartz scheduler. Previously I never scheduled such amount of jobs in Spring Scheduling with Quartz and don't know how the system will handle it. Is it a good idea to implement the system in such way and will Spring Scheduling Quartz handle such amount of jobs without problems?
Yes, Quartz itself can handle thousands of jobs and triggers without any issues.
If you are going to have many jobs executing concurrently, just make sure that you configure Quartz with a sufficient number of worker threads. The number of worker threads should be typically equal to the maximum number of jobs that can be running concurrently + some small buffer (10% or so) just in case.
From what you write I assume that your jobs will be one-off jobs, i.e. each job will be executed only once. If that is the case, Quartz can automatically discard your jobs as soon as they finish executing unless your jobs are marked as durable. Quartz automatically removes non-durable jobs if they are not scheduled to run in the future. This feature may help you reduce the total number of registered jobs.
I hope this helps. If not, please ask.

How to keep webserver responsive while executing many asynchronous background tasks

I am working on a web application that provides its users to optionally execute long-running processes 'in background'. An example would be some long-running report generation, or deleting thousands of objects simultaneously.
I've implemented this using an ExecutorService defined as FixedThreadPool using a ThreadFactory. The ThreadFactory is built like this:
ThreadFactoryBuilder()
.setNameFormat(clientId + "-BackgroundTask-%d")
.setDaemon(true)
.setPriority(Thread.MIN_PRIORITY)
.build()
I execute the task like this:
Future<TaskStatus> future = clientExecutors.get(clientId).submit(
backgroundTask::execute);
taskFutures.put(backgroundTask.getTaskId(), future);
How can I enforce my webserver to always priorize handling new incoming requests (as fast as possible) over executing background tasks?
In other words: It should never ever happen, that a user has to wait long time while browsing the site, just because there are a lot of background-tasks executing. As you can see from above, I tried to do this by setting .setPriority(Thread.MIN_PRIORITY). However that does not seem to be sufficient.
Furthermore, as for now, I've set some arbitrary value for the FixedThreadPool size (10) and use it globally for the entire background-handling of the application (and all its customers).
Instead I would like to define a threadpool for each customer, to make sure each customer has the same privilege to run a certain amount of tasks in the background. Say, each customer has a FixedThreadPool of size 5, and on the server I'll have a max. of 50 different customers. That would add up to 250 running background tasks at the same time.
The most important requirement here is: it does not matter, how long these background-tasks need to execute (say 2 minutes, or 20 minutes). What is important, is that each customer has the ability to send 5 tasks to be executed in background, and each of those are worked on equally.
I've tested running 30 cpu-intensive background tasks and it turns out that while these are running and cpu is near 100%, new incoming requests take a very long time to be handled.
So obviously, I am doing it wrong.
Update 12.09.2017
I've read about microservices and while it sounds great I see a great challenge in splitting the necessary parts from our monolithic application. Mostly because nearly every operation might turn into a long running process given a big enough data selection.
Furthermore, wouldn't I run into the same problem with my microservice, i.e. the server running the microservice would suffer the same performance degradation. Well the only good thing would, that the rest of the web app would not suffer from it anymore.
I've read some posts about introducing Thread.sleep(1) or Thread.sleep in general into CPU-heavy operations to reduce the amount of CPU used in these operations. I've also read about someone who introduced this as an aspect so that he can even change the amount of time waited dynamically in order to have some control about how much cpu would be used.
However, my gut tells me that ain't right either. What do you think about introducing Thread.sleep to lower the amount of CPU used for a task? Is this common practice? If not, what would be the right approach?
I would highly consider changing your system architecture to offload these long-running requests to a separate instance instead of running them in-process with the general request-service application. In general I think it is an anti-pattern to handle both batch / online (or long / short running) processing in the same application instance.
Ideally you'd build a standalone microservice to handle these requests, but you could also simply just deploy X instances of your existing application, and configure your load balancer to route requests to the long running invocation paths (e.g. POST /myapp/longrunningjob) only to the instances dedicated to running these long-running processes.

Does it matter where I submit hadoop jobs from?

Does it have any measurable effect on resources whether I submit a bunch of hadoop jobs from different client servers or all from the same one? I would think not since all the work is done in the cluster. Is this correct?
The only thing which is resource intensive on the client submitting to the Hadoop cluster is the calculation of the input splits. When the input data is huge or when too many jobs are submitted from the same client then because of the input split calculations, the job submission might become a bit slow.
I am not able to recall the Hadoop release or the parameter, but a configurable parameter was included to move the calculation of the input splits from the client submitting a job to the Hadoop cluster.
It really shouldn't matter where you submit your jobs from. The client itself doesn't do much, it uses RPC protocol to contact the services, and then just sits idle until the job is finished.
Also, the most important is what kind of scheduler you use to allocate resource, which is probably going to make the most significant difference and decide which resources to allocate to which job. More on job scheduling here.
I don't think you can move the input split calculation into Job Tracker in 'Classic' version. In YARN, you can move it using
"yarn.app.mapreduce.am.compute-splits-in-cluster"
I am guessing, Hadoop people didn't want to overload Job tracker with input split creation. Similar to the design decision of not assigning too much work for Namenode in HDFS.
In YARN, every job gets its own Application Master, so no worries about overloading a SPOF/bottleneck master like job tracker.
In reference to the original question, the client job would have to reach out to the namenode to get the block locations (I have see parts of code on block storage class calling data node for some meta data...not sure whether these happen during input split creation or in task tracker node) . This can become an issue if you are handling a lot of jobs on the same client node.
If you are using YARN, there would be a slight performance increase if all these communications happen inside the cluster.
Need to check how Oozie handles this issue.
Hopefully, this helps!
Arun

Is hadoop's job ThreadSafe?

Anyone knows if org.apache.hadoop.mapreduce.Job is thread-safe? In my application I create a thread for each job, and then waitForCompletion. And I have another monitor thread that checks every job's state with isComplete.
Is that safe? Are jobs thread-safe? Documentation doesn't seem to mention anything about it...
Thanks
Udi
Unlike the others, I also use threads to submit jobs in parallel and wait for their completion. You just have to use a job class instance per thread. If you share same job instances over multiple threads, you have to take care of the synchronization by yourself.
Why would you want to write a separate thread for each job? What exactly is your use case?
You can run multiple jobs in your Hadoop cluster. Do you have dependencies between the multiple jobs?
Suppose you have 10 jobs running. 1 job fails then would you need to re-run the 9 successful tasks.
Finally, job tracker will take care of scheduling multiple jobs on the Hadoop cluster. If you do not have dependencies then you should not be worried about thread safety. If you have dependencies then you may need to re-think your design.
Yes they are.. Actually the files is split in blocks and each block is executed on a separate node. all the map tasks run in parallel and then are fed to the the reducer after they are done. There is no question of synchronization as you would think about in multi threaded program. In multi threaded program all the threads are running on the same box and since they share some of the data you have to synchronize them
Just in case you need another kind of parallelism on the map task level, you should override run() method in your mapper and work with multiple threads there. Default implementation calls setup(), then map() times number of records to process, and finally it calls cleanup() method once.
Hope this helps someone!
If you are checking whether the jobs have finished I think you are a bit confused about how Map reduce works. You ought to be letting Hadoop do that for itself.

Resources