Flink CPU configuration - parallel-processing

The official Flink documentation states that for each CPU core you should allocate one slot and increase the parallelism by one accordingly.
One of my custom operators needs more than 1 CPU for its computation (this is how it works in Heron). My system has 2.5 CPUs available, but Flink only uses 1. How can I configure Flink to use more CPU with only 1 slot?

"One slot per core" is merely a rule of thumb. Nothing enforces this.
Each subtask (an instance of an operator chain) is single-threaded, but the slots within a task manager, and the task managers within a machine or container, will use all of the resources made available to them.

You could use one slot but run a multi-threaded operation inside your Flink code.
For example, in Scala:
env.setParallelism(1)
class myMap extends RichMapFunction {
  override def map(...) = {
    // Scala parallel collection; other languages have similar tools, e.g. OpenMP in C/C++ or multi-threading in Java
    data.toParArray()...
  }
}
The Flink operator then has a parallelism of 1, while the internal parallelism is controlled by your map method.
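For Java users, the same idea can be sketched with a parallel stream. This is a minimal standalone sketch that uses only the JDK, not the Flink API: the class ParallelInsideMap, the method expensive, and the assumption that one incoming record carries a batch of values are all hypothetical stand-ins for whatever your map method really receives.

```java
import java.util.List;
import java.util.stream.Collectors;

// Standalone sketch: plain JDK only, no Flink classes involved.
class ParallelInsideMap {

    // Hypothetical stand-in for a CPU-heavy computation on one value.
    // The loop only burns cycles; the result is simply x squared.
    static long expensive(int x) {
        long acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc += (long) x * x;
        }
        return acc / 1_000;
    }

    // What the body of a map() call could do for one record that carries
    // a batch of values: fan the work out over several threads, then hand
    // a single result back to the (single-threaded) caller.
    static List<Long> map(List<Integer> batch) {
        return batch.parallelStream()
                .map(ParallelInsideMap::expensive)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(map(List.of(1, 2, 3, 4)));
    }
}
```

Parallel streams run on the JVM-wide common fork-join pool, so those threads are shared with anything else in the same JVM that uses it. If that matters for your job, a dedicated thread pool owned by the operator is the more controlled option.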

Related

How are threads of Processors invoked in Nifi flow?

I'm trying to learn how to write a custom NiFi processor, and according to the documentation the processor should be thread-safe. What I want to understand is this: if I have, say, 100 flow files queued for my custom processor, will the processor's onTrigger method (assume I haven't put @TriggerSerially on the processor) be triggered 100 times, in 100 separate threads (whether concurrently or not), or is it possible for one flow file to be used as input to more than one thread running onTrigger?
I apologize if I didn't articulate the question well, but essentially: is it possible for my processor's onTrigger method to be triggered more times than there are flow files queued as input to the processor?
The number of threads executing a processor is based on the number of concurrent tasks on the scheduling tab, which defaults to 1. If you increase this to 2, then 2 threads execute the onTrigger method concurrently. A single flow file will only be processed by one of these threads.
The @TriggerSerially annotation prevents you from increasing the concurrent tasks, so it guarantees there is never concurrent execution. A common use case is a source processor that pulls data from somewhere; typically you wouldn't want to pull the same data twice concurrently.
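The behaviour described above can be modelled with a plain thread pool. This is a toy model under stated assumptions, not NiFi code: the class ConcurrentTasksModel, the method run, and the use of integer ids in place of flow files are all made up for illustration. It only shows the general pattern of several workers draining one queue, where each queued item is handed to exactly one worker.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model only: none of these names come from the NiFi API.
class ConcurrentTasksModel {

    // Starts `concurrentTasks` worker threads against one shared queue and
    // returns, per flow file id, how many times that id was processed.
    static Map<Integer, Integer> run(int flowFiles, int concurrentTasks) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int id = 0; id < flowFiles; id++) {
            queue.add(id);
        }

        Map<Integer, Integer> processed = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(concurrentTasks);
        for (int t = 0; t < concurrentTasks; t++) {
            pool.submit(() -> {
                Integer id;
                // poll() removes the head of the queue, so each queued id
                // is handed to exactly one of the competing workers.
                while ((id = queue.poll()) != null) {
                    processed.merge(id, 1, Integer::sum);
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> processed = run(100, 2);
        System.out.println("distinct flow files processed: " + processed.size());
    }
}
```

Setting concurrentTasks to 1 in this model corresponds to the serial case: the same 100 items are processed, just by a single thread.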

How to give less memory (-Xmx) to the logwriter process and more to the worker process?

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
With this version of Storm, when running a topology, I see two processes for each worker: one for the logwriter and one for the worker itself, as mentioned here as well.
The problem is that I want to give less memory to the logwriter process (say -Xmx512m) and more to the worker process (say -Xmx12g). I have not been able to achieve this, although I have tried several config options.
There is the config topology.worker.logwriter.childopts, which is picked up by the logwriter process. But when I set worker.childopts, that also gets applied to both the logwriter and the worker process. Similarly, I tried topology.worker.childopts, and that also gets applied to both processes.
How can I give less memory to the logwriter process and more to the worker process?
As suggested by Rohde Døssing here, and as I had also observed, the logwriter process ends up with both Java parameters: -Xmx512m and, a few parameters later, -Xmx10g. In my case it takes the first one and uses only 512 MB of memory, so it is not causing me any problem as such.
However, as suggested here, it is not a good thing to have duplicate parameters on a process, because another system might apply them in a different order. So Storm should fix this.

Apache Flink, more threads than Kafka partitions

The data flow is simple like
kafka -> some logic -> kafka
and 'some logic' is a bottleneck here so I want to use more threads/tasks to increase throughput instead of increasing kafka partitions (currently 3). Order between input and output topics doesn't matter here.
This is easy with Apache Storm: I can just increase the parallelism of the bolt that runs the logic. How can I do it with Flink? More generally, is there a simple way to use different parallelism for different stages in Flink?
This is quite simple in Flink. You can specify the parallelism of each operator using the setParallelism() method:
DataStream<String> rawEvents = env
    .addSource(new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), props));
DataStream<String> mappedEvents = rawEvents
    .flatMap(new Tokenizer())
    .setParallelism(64); // set parallelism to 64

How to write rows asynchronously in Spark Streaming application to speed up batch execution?

I have a Spark job where I need to write the output of a SQL query every micro-batch. The write is an expensive operation performance-wise and is causing the batch execution time to exceed the batch interval.
I am looking for ways to improve the performance of write.
Is doing the write action in a separate thread asynchronously like shown below a good option?
Would this cause any side effects because Spark itself executes in a distributed manner?
Are there other/better ways of speeding up the write?
import java.util.concurrent.Executors

// Create a fixed thread pool to execute asynchronous tasks
val executorService = Executors.newFixedThreadPool(2)
dstream.foreachRDD { rdd =>
  import org.apache.spark.sql._
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  import spark.sql
  val records = rdd.toDF("record")
  records.createOrReplaceTempView("records")
  val result = spark.sql("select * from records")
  // Submit an asynchronous task to write
  executorService.submit {
    new Runnable {
      override def run(): Unit = {
        result.write.parquet(output)
      }
    }
  }
}
1 - Is doing the write action in a separate thread asynchronously like shown below a good option?
No. The key to understanding the issue here is to ask 'who is doing the write'. The write is done by the resources allocated for your job on the executors in a cluster. Placing the write command on an async thread pool is like adding a new office manager to an office with a fixed staff. Will two managers be able to do more work than one alone, given that they have to share the same staff? Well, one reasonable answer is "only if the first manager was not giving them enough work, so there's some free capacity".
Going back to our cluster, we are dealing with a write operation that is heavy on IO. Parallelizing write jobs will lead to contention for IO resources, making each independent job longer. Initially, our job might look better than the 'single manager version', but trouble will eventually hit us.
I've made a chart that attempts to illustrate how that works. Note that the parallel jobs will take longer proportionally to the amount of time that they are concurrent in the timeline.
Once we reach that point where jobs start getting delayed, we have an unstable job that will eventually fail.
2- Would this cause any side effects because Spark itself executes in a distributed manner?
Some effects I can think of:
Probably higher cluster load and IO contention.
Jobs queue up on the thread pool queue instead of on the Spark Streaming queue. We lose the ability to monitor our job through the Spark UI and monitoring API, as the delays are 'hidden' and all looks fine from the Spark Streaming point of view.
3- Are there other/better ways of speeding up the write?
(ordered from cheap to expensive)
If you are appending to a parquet file, create a new file often. Appending gets expensive with time.
Increase your batch interval or use window operations to write larger chunks of Parquet. Parquet likes large files.
Tune the partition and distribution of your data => make sure that Spark can do the write in parallel
Increase cluster resources, add more nodes if necessary
Use faster storage
Is doing the write action in a separate thread asynchronously like shown below a good option?
Yes. It's certainly something to consider when optimizing expensive queries and saving their results to external data stores.
Would this cause any side effects because Spark itself executes in a distributed manner?
Don't think so. SparkContext is thread-safe and promotes this kind of query execution.
Are there other/better ways of speeding up the write?
YES! That's the key to understand when to use the other (above) options. By default, Spark applications run in FIFO scheduling mode.
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
That means that to make room for executing multiple writes asynchronously and in parallel, you should configure your Spark application to use FAIR scheduling mode (using the spark.scheduler.mode property).
You will have to configure so-called Fair Scheduler Pools to "partition" executor resources (CPU and memory) into pools that you can assign to jobs using spark.scheduler.pool property.
Quoting Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.

Is hadoop's job ThreadSafe?

Does anyone know if org.apache.hadoop.mapreduce.Job is thread-safe? In my application I create a thread for each job and then call waitForCompletion. I also have a separate monitor thread that checks every job's state with isComplete.
Is that safe? Are jobs thread-safe? Documentation doesn't seem to mention anything about it...
Thanks
Udi
Unlike the others, I also use threads to submit jobs in parallel and wait for their completion. You just have to use one Job instance per thread. If you share the same Job instance across multiple threads, you have to take care of the synchronization yourself.
Why would you want to write a separate thread for each job? What exactly is your use case?
You can run multiple jobs in your Hadoop cluster. Do you have dependencies between the multiple jobs?
Suppose you have 10 jobs running. If 1 job fails, would you need to re-run the 9 successful ones?
Finally, job tracker will take care of scheduling multiple jobs on the Hadoop cluster. If you do not have dependencies then you should not be worried about thread safety. If you have dependencies then you may need to re-think your design.
Yes, they are. Actually, the file is split into blocks and each block is processed on a separate node. All the map tasks run in parallel and their output is fed to the reducer after they are done. There is no question of synchronization in the sense you would think about it in a multi-threaded program. In a multi-threaded program all the threads run on the same box, and since they share some of the data you have to synchronize them.
Just in case you need another kind of parallelism at the map task level, you can override the run() method in your mapper and work with multiple threads there. The default implementation calls setup(), then map() once for each record to process, and finally cleanup() once.
Hope this helps someone!
If you are checking whether the jobs have finished I think you are a bit confused about how Map reduce works. You ought to be letting Hadoop do that for itself.
