I am trying to implement a serverless MapReduce using AWS Lambda functions for small payloads and low latency. I want to use multiple Lambda functions for the Map, and another Lambda function for the Reduce. Each Map function is expected to take less than a second to complete, and to produce a payload smaller than 100KB per function. I also want the Reduce function to be started at the same time as the Map functions, so that it can start reducing map results as soon as the first ones become available.
My question is the following: what is the best way to send notifications and payloads from the Map functions to the Reduce function, knowing that a single Reduce function must receive notifications and payloads from all the Map functions during its lifespan (about 5 seconds)?
The Reduce function could poll an SQS queue, but I need to run multiple MapReduce jobs in parallel, and I am concerned about the latency resulting from the creation of one dedicated SQS queue for each MapReduce job. Is there a better way to approach this problem?
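For illustration, the receive loop on the Reduce side might look like the following minimal sketch. Here `receive_batch` is a hypothetical stand-in for whatever transport is chosen (e.g. an SQS `receive_message` call), and `NUM_MAPPERS` is assumed to be known in advance:

```python
import json

NUM_MAPPERS = 4  # assumption: the Reduce knows how many Map results to expect

def reduce_job(receive_batch, combine, initial):
    """Poll for map results until all NUM_MAPPERS payloads have arrived,
    folding each one into the running result as soon as it shows up."""
    result = initial
    received = 0
    while received < NUM_MAPPERS:
        for payload in receive_batch():  # e.g. sqs.receive_message(...) in practice
            result = combine(result, json.loads(payload))
            received += 1
    return result

# Usage with an in-memory stand-in for the queue (some polls return nothing):
batches = [[json.dumps([1, 2])], [], [json.dumps([3]), json.dumps([4, 5])], [json.dumps([6])]]
recv = lambda it=iter(batches): next(it, [])
total = reduce_job(recv, lambda acc, xs: acc + sum(xs), 0)
print(total)  # 21
```

The same loop works regardless of which transport delivers the payloads; only `receive_batch` changes.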
I have a workflow constructed in Flink that consists of a custom source, a series of maps/flatmaps and a sink.
The run() method of my custom source iterates through the files stored in a folder and collects, through the collect() method of the context, the name and the contents of each file (I have a custom object that stores this info in two fields).
I then have a series of maps/flatmaps transforming these objects, which are then printed into files using a custom sink. The execution graph, as produced in Flink's Web UI, is the following:
I have a cluster of 2 workers, set up with 6 slots each (they both have 6 cores, too). I set the parallelism to 12. From the execution graph I see that the parallelism of the source is 1, while the rest of the workflow has parallelism 12.
When I run the workflow (I have around 15K files in the dedicated folder) I monitor the resources of my workers using htop. All the cores reach 100% utilisation most of the time, but roughly every 30 minutes or so, 8-10 of the cores become idle for about 2-3 minutes.
My questions are the following:
I understand that the source runs with parallelism 1, which I believe is normal when reading from local storage (my files are located in the same directory on each worker, as I don't know which worker will be selected to execute the source). Is this indeed normal? Could you please explain why this is the case?
The rest of my workflow is executed with parallelism 12, which looks correct: checking the task managers' logs I get prints from all the slots (e.g., .... [Flat Map -> Map -> Map -> Sink: Unnamed (3/12)] INFO ...., .... [Flat Map -> Map -> Map -> Sink: Unnamed (5/12)] INFO ...., etc.). What I don't understand, though, is this: if one slot is executing the source role and I have 12 slots in my cluster, how is the rest of the workflow executed by 12 slots? Is one slot acting as both the source and one instance of the rest of the workflow? If yes, how are the resources for this specific slot allocated? Could someone explain the steps involved in this workflow? For example (this might be wrong):
Slot 1 reads files and forwards them to available slots (2 to 12)
Slot 1 forwards one file to itself and stops reading until it finishes its job
When done, slot 1 reads more files and forwards them to slots that became available
I believe what I describe above is wrong but I give it as an example to better explain my question
Why do I see this idle state on the majority of the cores roughly every 30 minutes, lasting for about 3 minutes?
To answer the specific question about parallelizing your read, I would do the following...
Implement your custom source by extending the RichSourceFunction.
In your open() method, call getRuntimeContext().getNumberOfParallelSubtasks() to get the total parallelism and call getRuntimeContext().getIndexOfThisSubtask() to get the index of the sub-task being initialized.
In your run() method, as you iterate over files, get the hashCode() of each file name, modulo the total parallelism. If this is equal to your sub-task's index, then you process it.
In this way you can spread the work out over 12 sub-tasks, without having sub-tasks try to process the same file.
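The selection logic, sketched here in Python rather than the Flink Java API (with `subtask_index` and `parallelism` standing in for the values returned by `getIndexOfThisSubtask()` and `getNumberOfParallelSubtasks()`):

```python
def files_for_subtask(filenames, subtask_index, parallelism):
    """Each sub-task processes only the files whose name hashes to its own
    index, so every file is handled by exactly one sub-task."""
    # hash() stands in for Java's String.hashCode(); any stable hash works.
    return [f for f in filenames if hash(f) % parallelism == subtask_index]

files = [f"doc_{i}.txt" for i in range(100)]
parts = [files_for_subtask(files, i, 12) for i in range(12)]
assert sorted(sum(parts, [])) == sorted(files)  # full, disjoint coverage
```

Because every sub-task applies the same deterministic hash, no coordination between sub-tasks is needed.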
The single-consumer setup limits the overall throughput of your pipeline to the performance of that one consumer. Additionally, it introduces a heavy shuffle to all slots: all the data read by the consumer gets serialized on the consumer's slot as well, which is additional CPU load. In contrast, having the consumer parallelism equal to the map/flat map parallelism would allow Flink to chain the source -> map operations and avoid the shuffle.
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they are from the same job. The result is that one slot may hold an entire pipeline of the job. So, in your case, slot 1 has both the consumer and map/flat map tasks, and the other slots have only map/flat map tasks. See here for more details: https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/runtime.html#task-slots-and-resources. Also, you can view the instances of each subtask in the Web UI.
Do you have checkpointing enabled? If yes, and the interval is 30 minutes, then that is probably when the state gets snapshotted.
We usually use Hadoop MapReduce and its ecosystem (like Hive) for batch processing. But I would like to know whether there is any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let's try to implement a real-time app using Hadoop. To understand the scenario, let's consider a temperature sensor. Assuming the sensor keeps working, we will keep getting new readings, so the data will never stop.
We should not wait for the data to finish, as that will never happen. Then maybe we should run the analysis periodically (e.g. every hour). We can run Spark every hour and process the last hour's data.
What if, every hour, we need the last 24 hours' analysis? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly aggregates, store them, and combine them to compute the 24-hour result. That will work, but I will have to write code to do it.
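The "store hourly, combine into daily" idea can be sketched in a few lines; sums are used here as the example aggregate:

```python
from collections import deque

class RollingDay:
    """Keep the last 24 hourly sums; the 24-hour total is just their sum,
    so each hour we only need to process one hour of raw data."""
    def __init__(self):
        self.hours = deque(maxlen=24)  # the oldest hour drops off automatically

    def close_hour(self, readings):
        self.hours.append(sum(readings))  # store this hour's aggregate
        return sum(self.hours)            # current 24-hour total

day = RollingDay()
totals = [day.close_hour([h, h]) for h in range(25)]
print(totals[-1])  # hours 1..24 are retained: sum(2*h for h in 1..24) = 600
```

This avoids reprocessing 24 hours of raw data each hour, at the cost of the extra bookkeeping code the text mentions.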
Our problems have just begun. Let us list a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for the data to arrive at the storage? Now you cannot start the job at your boundary; you need to watch the disk and trigger the job once the data has arrived for the hour boundary.
Well, you can run Hadoop fast. Will the job finish within 1 second? Can we write the data to disk, read it, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen screw.
Stream Processing
The right tool for this kind of problem is called "stream processing". Here "stream" refers to the data stream: the sequence of data that will continue to come. Stream processing can watch the data as it comes in, process it, and respond to it within milliseconds.
Following are reasons why we would want to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch, and then worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
Stream processing naturally fits time-series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing can handle this easily. If you take a step back and consider, the most common continuous data series are time-series data. For example, almost all IoT data is time-series data. Hence, it makes sense to use a programming model that fits it naturally.
Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in, spreading the processing over time. Hence stream processing can get by with a lot less hardware than batch processing.
Sometimes the data is huge and it is not even possible to store it all. Stream processing lets you handle large, firehose-scale data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits) and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
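A toy version of the per-event responsiveness described above: the alarm fires as each reading arrives, not an hour later. The threshold and readings are made up for illustration.

```python
THRESHOLD = 100.0  # assumed alarm threshold, in whatever unit the sensor reports

def process_stream(readings, on_alarm):
    """React to each reading the moment it arrives instead of batching."""
    for i, temp in enumerate(readings):
        if temp > THRESHOLD:
            on_alarm(i, temp)  # fires immediately, per event

alarms = []
process_stream([95.0, 98.5, 101.2, 97.0, 103.8], lambda i, t: alarms.append((i, t)))
print(alarms)  # [(2, 101.2), (4, 103.8)]
```

Real stream processors add windows, state, and fault tolerance on top of this loop, but the per-event reaction is the core idea.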
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data with batch processing, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here's the good reference link to understand a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
Kafka and Druid documentation is a good place to understand strength of both technologies. Here are their documentation links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid
I have a Step Functions setup that is spawning Lambda functions. The Lambda functions are being spawned too fast and causing other services to throttle, so I would like Step Functions to rate-limit the number of jobs it kicks off at a given time.
How do I best approach this?
I would try setting this limit on the Lambda function side using the concurrent execution limit: https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html. This way you can limit the maximum number of concurrent executions of that specific Lambda function, while leaving the unreserved concurrency pool for the rest of your functions.
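For reference, setting a reserved concurrency limit with boto3 might look like this; the function name and the limit of 10 are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap this function at 10 concurrent executions; further Step Functions
# invocations are throttled until one of the 10 in flight finishes.
lambda_client.put_function_concurrency(
    FunctionName="my-map-function",    # placeholder name
    ReservedConcurrentExecutions=10,   # placeholder limit
)
```

The same setting is available in the Lambda console and via `aws lambda put-function-concurrency` in the CLI.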
I'm looking for a creative and efficient way to flatten write bursts to DynamoDB.
I have 4 cron jobs that run every 3 minutes, each on its own thread. Due to reasons I can't control, they start at the same time.
Part of each job is to write a few thousand rows to DynamoDB. This normally takes 10 to 30 seconds using batch writes.
Because of the timing, the 4 jobs do the writing in parallel.
I'm looking for the most efficient way to distribute the writes over time.
I don't want to add resources if not necessary. The solution probably involves some kind of cache and an additional cron job.
I have memcache available. However, there is probably something more efficient than writing to memcache and reading it back.
Maybe a log file on the server?
What would you do?
It's PHP with Apache on Ubuntu.
An established pattern, especially if you just need the writes to get there eventually, is to put your records into an SQS queue first and have a background task that reads messages from SQS and writes them to DynamoDB at a maximum prescribed rate. This is useful when you don't want to pay for high write throughput just to match your peak periods of writes to the database.
SQS has the benefit of being able to accept messages at almost any scale, and yet you can reduce your DynamoDB costs by writing rows at a low, predictable pace.
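The pacing side of that background task can be as simple as sleeping between batches. The sketch below keeps the rate math separate from the AWS calls; `fetch_batch` and `write_batch` are stand-ins for an SQS `receive_message` and a DynamoDB batch write, so the logic stays testable without AWS:

```python
import time

def drain(fetch_batch, write_batch, max_writes_per_second):
    """Pull batches from a queue and write them at a capped rate."""
    while True:
        batch = fetch_batch()  # e.g. sqs.receive_message(...) in practice
        if not batch:
            break
        write_batch(batch)     # e.g. a DynamoDB batch_writer in practice
        time.sleep(len(batch) / max_writes_per_second)  # pace to the cap

written = []
batches = iter([[1, 2, 3], [4, 5], []])
drain(lambda: next(batches), written.extend, max_writes_per_second=1000)
print(written)  # [1, 2, 3, 4, 5]
```

With the rate capped below your provisioned write capacity, the 4 bursty producers never translate into a burst against the table.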
I would like to make a service call for each row of a file. Our source file is larger than 50 GB, and iterating over that many rows might take a long time. Is there any built-in feature for this, or does a MapReduce program need to be written to make a call to the service for each row? MapReduce at least offers some parallelization. Is there an existing tool already built for this requirement?
The basic requirement for MapReduce is that tasks should run in parallel without any impact on individual results. If your service call is independent of everything else, you can use MapReduce. I think a map-only job will suffice to take care of reading each row and making the service call. However, you also need to think about the other side of the map: what are you going to do with the result of each service call, and eventually with the map output? That part decides whether you need a reducer.
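Outside Hadoop, a map-only version of "one service call per row" can be sketched with a thread pool; `call_service` here is a dummy stand-in for the real (network-bound) service:

```python
from concurrent.futures import ThreadPoolExecutor

def call_service(row):
    """Dummy stand-in for the real service call."""
    return len(row)

def process_rows(rows, workers=8):
    # Map phase only: each row's call is independent, so no reduce step
    # is needed; pool.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_service, rows))

results = process_rows(["alpha", "be", "gamma!"])
print(results)  # [5, 2, 6]
```

For a 50 GB file you would stream rows rather than load the list into memory, but the independence of the calls is what makes the map-only structure work.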