Flink workflow parallelism with custom source - parallel-processing

I have a workflow constructed in Flink that consists of a custom source, a series of maps/flatmaps and a sink.
The run() method of my custom source iterates through the files stored in a folder and collects, through the collect() method of the context, the name and the contents of each file (I have a custom object that stores this info in two fields).
I then have a series of maps/flatMaps transforming these objects, which are then written to files by a custom sink. The execution graph, as produced in Flink's Web UI, is the following:
I have a cluster of 2 workers, each set up with 6 slots (they both have 6 cores, too). I set the parallelism to 12. From the execution graph I see that the parallelism of the source is 1, while the rest of the workflow has parallelism 12.
When I run the workflow (I have around 15K files in the dedicated folder) I monitor the resources of my workers using htop. All the cores reach up to 100% utilisation most of the time, but roughly every 30 minutes, 8-10 of the cores go idle for about 2-3 minutes.
My questions are the following:
I understand that the source runs with parallelism 1, which I believe is normal when reading from local storage (my files are located in the same directory on each worker, since I don't know which worker will be selected to execute the source). Is that indeed normal? Could you please explain why this is the case?
The rest of my workflow is executed with parallelism 12, which looks correct, since by checking the task managers' logs I get prints from all the slots (e.g., .... [Flat Map -> Map -> Map -> Sink: Unnamed (3/12)] INFO ...., .... [Flat Map -> Map -> Map -> Sink: Unnamed (5/12)] INFO ...., etc.). What I don't understand, though, is this: if one slot is executing the source and I have 12 slots in my cluster, how is the rest of the workflow executed by 12 slots? Is one slot acting as both the source and one instance of the rest of the workflow? If yes, how are the resources for this specific slot allocated? Could someone explain the steps taking place in this workflow? For example (this might be wrong):
Slot 1 reads files and forwards them to available slots (2 to 12)
Slot 1 forwards one file to itself and stops reading until it finishes its job
When done, slot 1 reads more files and forwards them to slots that became available
I believe what I describe above is wrong but I give it as an example to better explain my question
Why do most of the cores go idle roughly every 30 minutes, for about 3 minutes at a time?

To answer the specific question about parallelizing your read, I would do the following...
Implement your custom source by extending RichParallelSourceFunction (a plain RichSourceFunction always runs with parallelism 1, so it cannot be spread across sub-tasks).
In your open() method, call getRuntimeContext().getNumberOfParallelSubtasks() to get the total parallelism and call getRuntimeContext().getIndexOfThisSubtask() to get the index of the sub-task being initialized.
In your run() method, as you iterate over files, take the hashCode() of each file name modulo the total parallelism (keeping the result non-negative). If this equals your sub-task's index, then you process that file.
In this way you can spread the work out over 12 sub-tasks, without having sub-tasks try to process the same file.
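A minimal sketch of this scheme, assuming the files live in a local directory passed to the constructor and emitting (file name, contents) pairs as a Tuple2; substitute your own two-field object where appropriate:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Emits (file name, file contents) pairs; swap Tuple2 for your own POJO if you prefer.
public class PartitionedFileSource extends RichParallelSourceFunction<Tuple2<String, String>> {

    private final String directory;
    private volatile boolean running = true;

    private int subtaskIndex;
    private int totalSubtasks;

    public PartitionedFileSource(String directory) {
        this.directory = directory;
    }

    @Override
    public void open(Configuration parameters) {
        // Each parallel instance learns its own index and the total parallelism.
        subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
        totalSubtasks = getRuntimeContext().getNumberOfParallelSubtasks();
    }

    @Override
    public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
        File[] files = new File(directory).listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (!running) {
                break;
            }
            // floorMod keeps the result non-negative even when hashCode() is negative.
            int owner = Math.floorMod(file.getName().hashCode(), totalSubtasks);
            if (owner == subtaskIndex) {
                String contents = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
                ctx.collect(Tuple2.of(file.getName(), contents));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}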

The single-consumer setup limits the overall throughput of your pipeline to the performance of that one consumer. Additionally, it introduces a heavy shuffle to all slots: all the data read by the consumer also gets serialized on the consumer's slot, which is additional CPU load. In contrast, making the consumer parallelism equal to the map/flatMap parallelism would allow Flink to chain the source -> map operations and avoid the shuffle.
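As a rough sketch of that point (reusing the hypothetical PartitionedFileSource from the earlier answer's sketch; the map body is only a placeholder, not your actual operators), matching the source parallelism to the rest of the pipeline lets the operators chain:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

public class ChainedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(12);

        env.addSource(new PartitionedFileSource("/path/to/files")) // parallel source, runs with parallelism 12
           .map(new MapFunction<Tuple2<String, String>, Tuple2<String, String>>() {
               @Override
               public Tuple2<String, String> map(Tuple2<String, String> value) {
                   return value; // placeholder for the real transformations
               }
           })
           // Source, map and sink share the same parallelism and use forward partitioning,
           // so Flink chains them into one task per slot: no shuffle, no extra serialization.
           .addSink(new PrintSinkFunction<>());

        env.execute("chained-file-pipeline");
    }
}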
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. So, in your case, slot 1 has both the consumer and map/flatMap tasks, while the other slots have only map/flatMap tasks. See here for more details: https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/runtime.html#task-slots-and-resources. Also, you can actually view the instances of each subtask in the Web UI.
Do you have checkpointing enabled? If yes, and the interval is 30 minutes, then that is probably when the state gets snapshotted.
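If checkpointing is enabled somewhere in the job code, it would look roughly like this (the interval value here is only illustrative):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalCheck {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // A 30-minute checkpoint interval would line up with the periodic dips you observe.
        env.enableCheckpointing(30 * 60 * 1000L);
        System.out.println("Checkpoint interval (ms): "
                + env.getCheckpointConfig().getCheckpointInterval());
    }
}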

Related

About the effect of parallelism in StormCrawler

I am currently working on a StormCrawler based project. We have a fixed and limited amount of bandwidth for fetching pages from the web. We have 8 workers with a large parallelism hint for the different bolts in the topology (i.e. 50), so lots of threads are created for fetching pages. Is there any relation between an increasing number of fetch_error results and an increasing parallelism_hint in the project? How can I determine a good value for the parallelism_hint in StormCrawler?
The parallelism hint is not something that should be applied to all bolts indiscriminately.
Ideally, you need one instance of FetcherBolt per worker, so in your case 8. As you've probably read in the WIKI or seen in the conf, the FetcherBolt handles internal threads for fetching. This is determined by the config fetcher.threads.number which is set to 50 in the archetypes' configurations (assuming this is what you used as a starting point).
Using too many FetcherBolt instances is counterproductive. It is better to change the value of fetcher.threads.number instead. If you have 50 Fetcher instances with a default number of threads of 50, that would give you 2500 fetching threads which might be too much for your available bandwidth.
As I mentioned before, you want 1 FetcherBolt per worker; the number of internal fetching threads per bolt depends on your bandwidth. There is no hard rule for this, it depends on your situation.
One constant I have observed, however, is the ratio of parsing bolts to fetcher bolts: usually 4 parsers per fetcher works fine. Run Storm in deployed mode and check the capacity value for the parser bolts in the UI. If the value is 1 or above, try using more instances and see if it affects the capacity.
In any case, not all bolts need the same level of parallelism.
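To make those numbers concrete, here is a rough sketch of where the knobs live (values are illustrative only; the spout, URL partitioning and groupings of a real StormCrawler topology are omitted, and package names may differ between Storm/StormCrawler versions):

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;

public class ParallelismSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setNumWorkers(8);
        // 8 FetcherBolt instances x 50 internal threads = 400 fetching threads,
        // instead of 50 x 50 = 2500 with an indiscriminate parallelism hint of 50.
        conf.put("fetcher.threads.number", 50);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setBolt("fetch", new FetcherBolt(), 8);      // 1 per worker
        builder.setBolt("parse", new JSoupParserBolt(), 32); // ~4 parsers per fetcher
    }
}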

how to optimally use nifi wait processor

I am currently creating a flow where I will be merging the results of 10K HTTP responses. I have a couple of questions (please refer to the image below; I am numbering my questions as per the image).
1) As the queue is becoming too long, is it OK to set "concurrent tasks" to 10 for InvokeHTTP? What should drive this? The number of cores on the server?
2) Wait is showing quite a big number. Is this just the number of bytes it is writing, or is it using that much memory? If it is just a write, then I might be OK... but if it is some internal queue, then I may soon run out of memory.
Does it make sense to reduce this number by increasing "Run Schedule" from 0 to, say, 20 sec?
3) What exactly is "Back Pressure Data Size Threshold"? The value is set at 1 GB. Does it mean that if the size of the flow files in the queue exceeds that, NiFi will start dropping them? Or will it somehow stop the upstream processor from running?
1) Yes, increasing concurrent tasks on InvokeHTTP would probably make sense. I wouldn't jump right to 10, but would test increasing from 1 to 2, 2 to 3, etc. until it seems to be working better. Concurrent tasks is the number of threads that can concurrently execute the processor; the total number of threads for your NiFi instance is defined in the controller settings (top-right menu) under Timer Driven threads, and you should set the timer-driven threads based on the number of CPUs/cores you have.
2) The stats on the processor are totals for the last 5 mins, so "In" is the total size of all the flow files that have come into the processor in the last 5 mins. You can see "Out" is almost the same number, which means almost all the flow files that came in have also been transferred out.
3) Back-pressure stops the upstream processor from executing until the back-pressure threshold is reduced. The data size threshold is saying "when the total size of all flow files in the queue exceeds 1 GB, stop executing the upstream processor so that no more data enters the queue while the downstream processor works off the queue". In the case of a self-loop connection, I think back-pressure won't stop the processor from executing; otherwise it would end up in a dead-lock where it can't produce more data but also can't work off the queue. In any case, data is never dropped unless you set flow file expiration on the queue.

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number, and treating all jobs equally. It's picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?
Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely far into the future, so if you set even some really rough runtimes (within an order of magnitude of the true runtime), you may get some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.

Individual Spark tasks consume more computation time when more cores are assigned

I am running a Spark job with an input file of size 6.6G (HDFS) with master as local. My Spark job with 53 partitions completed more quickly when I assigned local[6] than local[2]; however, each individual task takes more computation time when more cores are assigned. Say, if I assign 1 core (local[1]) then each task takes 3 secs, whereas the same goes up to 12 seconds if I assign 6 cores (local[6]). Where does the time get wasted? The Spark UI shows an increase in computation time for each task in the local[6] case, and I couldn't understand why the same code takes different computation time when more cores are assigned.
Update:
I can see more %iowait in the iostat output with local[6] than with local[1]. Please let me know whether this is the only reason or if there are other possible reasons. I also wonder why this iowait is not reported in the Spark UI; the increase shows up as computing time rather than as iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max
With spark.task.cpus, each task gets assigned more cores, but it doesn't necessarily use them. If your process is single-threaded, it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max it is simply an overhead issue from transferring data around at the same time.
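For reference, a minimal sketch of where the two settings mentioned above are applied (the values are illustrative only):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TaskCpusSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("task-cpus-sketch")
                .setMaster("local[6]")        // number of scheduler threads in local mode
                .set("spark.task.cpus", "1"); // cores reserved per task (not automatically used)
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // With local[6] and spark.task.cpus=1, up to 6 tasks run concurrently;
            // with spark.task.cpus=6 only one task would run at a time.
            System.out.println("default parallelism: " + sc.defaultParallelism());
        }
    }
}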

Is the number of map tasks determined by the number of nodes in Hadoop when the input is small, or is part of the hardware left idle?

I have a basic mapreduce question.
My input consists of many small files and I have designed a custom CombinedFileInputFormat (which is working properly).
The size of all files together is only about 100 MB for 20,000 files, but processing an individual file takes a couple of minutes (it's a heavy indexing problem), therefore I want as many map tasks as possible. Will Hadoop take care of this, or do I have to enforce it, and how? In the latter case my first guess would be to manipulate the maximum split size, but I am not sure if I am on the right track. Any help greatly appreciated! (Suggestions on how best to set the split size in the latter case are also helpful.)
Some extra information to be more clear:
There is, however, another reason I wanted to process multiple files per task, and that is that I want to be able to use combiners. The output of a single task only produces unique keys, but between several files there might be substantial overlap. By processing multiple files with the same map task I can implement a combiner or make use of in-mapper combining. This would definitely limit the amount of IO. The fact is that although a single file is only a couple of kilobytes in size, its output is roughly 30 * 10^6 key-value pairs, which easily leads to a couple of gigabytes.
I don't think there is another way to allow combining (or in-mapper combining) if you have only one file per maptask?
Regards, Dieter
To get the best utilization for your long-running map tasks, you'll probably want each file to run in its own task rather than using your implementation of CombineInputFormat.
Using a combine input format is usually advisable when you have small files that are quickly processed, as it takes longer to instantiate the map task (JVM, config etc.) than it does to process the file itself. You can alleviate this by configuring 'JVM reuse', but still, for CPU-bound tasks (as opposed to IO-bound tasks) you'll just want to run a map task for each input file.
You will however need your Job Tracker to have a good chunk of memory allocated to it so it can manage and track the 20k map tasks created.
Edit: In response to your updated question, if you want to use a combined input format then you'll need to set the configuration properties for min / max split size per node / rack. Hadoop won't be able to do anything more intelligent than try to keep files that are data-local or rack-local together in the same map task.
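For example, with the newer mapreduce API the relevant properties look roughly like this (the exact keys differ on older mapred-API clusters, and the values below are illustrative, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombineSplitConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the size of each combined split so only a handful of the small,
        // CPU-heavy files end up in any one map task.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 16 * 1024);
        // Per-node / per-rack minimums control how aggressively local files are grouped.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 0);
        conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 0);

        Job job = Job.getInstance(conf, "small-file-indexing");
        // job.setInputFormatClass(...): plug in the custom combined input format here.
    }
}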

Resources