Flattening DynamoDB write bursts - caching

I'm looking for a creative and efficient way to flatten write bursts to DynamoDB.
I have 4 cron jobs that run every 3 minutes, each on its own thread. For reasons I can't control, they start at the same time.
Part of each job is to write a few thousand rows to DynamoDB. This normally takes 10 to 30 seconds using batch writes.
Because of the timing, the 4 jobs do their writing in parallel.
I'm looking for the most efficient way to distribute the writes over time.
I don't want to add resources if not necessary. The solution probably includes some kind of cache and an additional cron job.
I have memcache available. However, there is probably something more efficient than writing to memcache and reading it back.
Maybe a log file on the server?
What would you do?
It's PHP with Apache on Ubuntu.

An established pattern, especially if you just need the writes to get there eventually, is to put your records into an SQS queue first and have a background task read messages from SQS and write them to DynamoDB at a maximum prescribed rate. This is useful when you don't want to pay for write throughput high enough to match your peak write periods.
SQS can accept messages at almost any scale, and you can reduce your DynamoDB costs by writing rows at a low, predictable pace.
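The drain side of that pattern could look roughly like the sketch below. It is in Python rather than PHP and makes several assumptions: a plain in-memory deque stands in for the SQS queue, `write_batch` is a hypothetical callable that would wrap the real DynamoDB batch write, and the pacing values are made up. The default batch size of 25 matches the per-request item limit of DynamoDB's BatchWriteItem.

```python
import time
from collections import deque


def drain_at_fixed_rate(queue, write_batch, batch_size=25,
                        batches_per_second=2.0, sleep=time.sleep):
    """Drain `queue` in small batches, pausing between batches.

    `queue` is a deque standing in for the real message queue and
    `write_batch` is a hypothetical callable that performs the write.
    Returns the number of records written.
    """
    interval = 1.0 / batches_per_second
    written = 0
    while queue:
        # Take at most `batch_size` records off the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        write_batch(batch)
        written += len(batch)
        if queue:
            # Pace the writes so the table sees a flat, predictable rate.
            sleep(interval)
    return written
```

With these defaults the writer never exceeds 50 records per second, however many records the four jobs enqueue at once.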


Can I use the same Redis instance for task queue and cache?

I've read responses to a couple of similar questions on Stack Overflow, and although it seems like sharing a single instance for two purposes is fine, I would like to know the potential downsides.
My main concern is the cache filling up the memory and slowing down or breaking the task queue. Is this possible? I use caching heavily, so should I be worried about this scenario?
Theoretically, you can use the same Redis instance for both the task queue and caching.
There are some downsides:
Longer query time
High memory usage
High CPU usage
Any fail-safe task queue makes a lot of Redis calls to move a task from one data structure to another and for other actions. Check how many Redis calls your task queue makes per second for 1 queue and for N queues. If the number of Redis queries is proportional to the number of queues, check whether your Redis server can handle that many requests.
Since you're using the same Redis instance for the task queue and the cache, the number of entries in your cache could be very large, so make sure it does not go beyond its memory limit. Losing cache data is fine, but you should not lose task queue data.
Due to the large number of queries, CPU utilization will increase. Watch for CPU spikes and make sure it does not approach 90% or so.
Given that you're going to use the same Redis server for the task queue, you should enable backups for the Redis server so that you can restore tasks from a backup. Note that a backup will likely cover the whole dataset, not only the task queues.
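As a rough way to do the "how many Redis calls per second" check, a back-of-the-envelope estimate might look like this. The function name, its parameters, and the split into per-task calls plus per-queue polling are all assumptions for illustration; the real numbers depend on your task queue library and should be measured.

```python
def estimated_redis_ops_per_second(tasks_per_second, calls_per_task,
                                   queues=1, polls_per_queue_per_second=0.0):
    """Estimate the Redis load generated by a task queue.

    Assumes each task costs a fixed number of Redis calls and each
    queue is polled at a fixed rate, which is a simplification.
    """
    task_ops = tasks_per_second * calls_per_task
    polling_ops = queues * polls_per_queue_per_second
    return task_ops + polling_ops
```

Comparing the result for 1 queue and for N queues shows whether the load grows with the number of queues, and it can then be added to the cache traffic the same instance already serves.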

Can we use Hadoop MapReduce for real-time data process?

Hadoop MapReduce and its ecosystem (like Hive) are usually used for batch processing. But I would like to know whether there is any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let’s try to implement a real-time App using Hadoop. To understand the scenario, let’s consider a temperature sensor. Assuming the sensor continues to work, we will keep getting the new readings. So data will never stop.
We should not wait for data to finish, as it will never happen. Then maybe we should continue to do analysis periodically (e.g. every hour). We can run Spark every hour and get the last hour data.
What if every hour, we need the last 24 hours analysis? Should we reprocess the last 24 hours data every hour? Maybe we can calculate the hourly data, store it, and use them to calculate 24 hours data from. It will work, but I will have to write code to do it.
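The "calculate the hourly data, store it, and combine" idea can be sketched in a few lines. This is only an illustration in plain Python, not Hadoop or Spark code, and the class and method names are made up: each hour is reduced to a small summary, and the 24-hour figures are derived from the stored summaries instead of from the raw readings.

```python
from collections import deque


class HourlyRollup:
    """Keep one small summary per hour and combine the last N hours."""

    def __init__(self, hours=24):
        # Each bucket is (count, total, maximum); old hours drop off the left.
        self.buckets = deque(maxlen=hours)

    def add_hour(self, readings):
        readings = list(readings)
        self.buckets.append(
            (len(readings), sum(readings), max(readings, default=None))
        )

    def summary(self):
        count = sum(c for c, _, _ in self.buckets)
        if count == 0:
            return None
        total = sum(t for _, t, _ in self.buckets)
        peak = max(m for _, _, m in self.buckets if m is not None)
        return {"count": count, "mean": total / count, "max": peak}
```

This works for aggregates that can be combined from partial results (count, sum, max); something like an exact median would still need the raw data.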
Our problems have just begun. Let us go through a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary while it takes a few seconds for data to arrive at the storage? Now you cannot start the job at your boundary; you need to watch the disk and trigger the job when data has arrived for the hour boundary.
Well, you can run Hadoop fast. Will the job finish within 1 second? Can we write the data to the disk, read the data, process it, produce the results, and recombine with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen-wrench screw.
Stream Processing
The right tool for this kind of problem is called “Stream Processing”. Here “Stream” refers to the data stream: the sequence of data that will continue to come. “Stream Processing” can watch the data as it comes in, process it, and respond to it in milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch, this often takes minutes.
Stream processing naturally fits with time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing can handle this easily. If you take a step back and consider, most continuous data series are time series data. For example, almost all IoT data are time series data. Hence, it makes sense to use a programming model that fits naturally.
Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
Sometimes data is huge and it is not even possible to store it. Stream processing lets you handle large firehose-style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
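To make the contrast with batch concrete, here is a minimal sketch of the stream-processing style for the temperature sensor, written in plain Python rather than with any streaming framework. The class name, the window length, and the threshold are assumptions for illustration. Each reading is handled the moment it arrives, the sliding window is kept up to date incrementally, and an alert can be raised immediately instead of at the end of the hour. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowMonitor:
    """Process readings one at a time over a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def on_event(self, timestamp, value):
        """Handle one reading; return (window_mean, alert)."""
        self.events.append((timestamp, value))
        self.total += value
        # Expire readings that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        mean = self.total / len(self.events)
        return mean, value > self.threshold
```

Only the readings inside the window are retained, which is also why the streaming approach needs so little storage compared with reprocessing the last 24 hours of data.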
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data and batch processing, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here's the good reference link to understand a similar use case:
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are their documentation links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid

DynamoDB Stream to Lambda slow/unusable

I've connected a lambda to a DynamoDB table via a stream. When a record is written to the table, it triggers the lambda. The traffic is very bursty, so nothing might happen for a while, then I'll write several thousand records.
What I'm seeing is a few lambda instances will be triggered, but not enough to handle the burst. Then at random times, the number of lambda instances will jump an order of magnitude or two (from 2 to 90 or more), and it will catch up. The problem is the jump might not occur for 30 minutes or more.
I'm seeing the records written to the table very quickly (seconds). The processing of 20 records by the lambda shouldn't take more than 2 minutes. It seems like the lambdas are spending most of their time sitting around waiting for records to show up. The record key for the table is a GUID.
Things I've tried
Playing with the number of records to make sure there's no lambda timeouts (20 seems to be conservative, but 100 causes timeouts)
Moving the lambda to a different subnet
Batching the writes to the table (~500-1000 records in a batch)
Breaking up the writes in hopes it would trigger more lambdas (~20-100 records in a batch)
Increasing the lambda memory to the max (3GB)
Reducing memory to just above what is used (1 GB allocated, 300 MB used)
Is there a better pattern to be using? Should I skip the stream and just write SNS messages? I don't care about order, but would prefer to not run the job more than once.
So here's what I found out.
It looks like the problem is contention on the DynamoDB stream by the lambda instances.
My solution was to skip the DynamoDB stream and post to an SNS topic instead. The lambdas pick up the messages and scale much better. Times have gone from hours to seconds.
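The answer does not show how the messages were posted, so the following is only one possible shape of it. It assumes records are grouped into messages of at most 20, the batch size the question found safe. `publish` is a hypothetical callable standing in for whatever actually sends the message, for example a thin wrapper around the boto3 SNS client's `publish` call.

```python
import json


def publish_in_chunks(records, publish, chunk_size=20):
    """Send `records` as JSON messages of at most `chunk_size` records.

    `publish` is a hypothetical callable taking the message body as a
    string. Returns the number of messages sent.
    """
    sent = 0
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        publish(json.dumps(chunk))
        sent += 1
    return sent
```

Each message then triggers its own lambda invocation, so a burst of several thousand records fans out without waiting on the stream.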

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
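The storage estimate above can be checked with a few lines of arithmetic. The sketch below uses the stated upper bounds (100 entries per queue, 250 bytes per string), so it gives a ceiling rather than an expected value; the function name is made up for illustration.

```python
import math


def raw_bytes(keys, entries_per_queue=100, bytes_per_string=250):
    """Upper bound on payload bytes, ignoring all data structure overhead."""
    return keys * entries_per_queue * bytes_per_string


low = raw_bytes(1_000_000)    # one million keys
high = raw_bytes(10_000_000)  # ten million keys

# Express both bounds as powers of two.
low_exponent = math.log2(low)
high_exponent = math.log2(high)
```

The lower case comes out between 2^34 and 2^35 bytes and the upper case between 2^37 and 2^38 bytes, that is, tens to hundreds of gigabytes before any overhead.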
Ultimately, it all boils down to how you define the efficiency needed on the part of the controller, e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
Given the current memory consumption, some form of persistent storage seems a reasonable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again; otherwise, caching is unlikely to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), or 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
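For reference, the relational mapping mentioned above does not have to be elaborate. The sketch below uses Python's built-in sqlite3 module; the table layout and the class and method names are assumptions, not something from the question. Each job is one row, the category is an indexed column, and the autoincrementing id preserves queue order within a category.

```python
import sqlite3


class DiskQueues:
    """Hash-of-queues stored in SQLite instead of RAM (illustrative sketch)."""

    def __init__(self, path=":memory:"):
        # Pass a file path to keep the data on disk; ":memory:" is for demos.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " category TEXT NOT NULL,"
            " job TEXT NOT NULL)"
        )
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS jobs_by_category ON jobs (category, id)"
        )

    def push(self, category, job):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO jobs (category, job) VALUES (?, ?)", (category, job)
            )

    def pop(self, category):
        """Remove and return the oldest job in `category`, or None if empty."""
        with self.db:
            row = self.db.execute(
                "SELECT id, job FROM jobs WHERE category = ? ORDER BY id LIMIT 1",
                (category,),
            ).fetchone()
            if row is None:
                return None
            self.db.execute("DELETE FROM jobs WHERE id = ?", (row[0],))
            return row[1]
```

Whether this is fast enough depends on the throughput questions raised in this answer; with jobs taking tens of seconds each, a single indexed lookup per hand-out is unlikely to be the bottleneck.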
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
An in-memory database like Redis doesn't solve the memory hog problem unless you set up a cluster of machines that each hold a part of the overall data. This makes sense only if keeping all data in memory is needed due to low response time requirements. Yet, given the nature of your jobs, which take tens of seconds to complete, response times with respect to the workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.

Performance of Resque jobs

My Resque job basically takes a params hash and stores it in the DB. In the process it does several reads and writes.
These R/Ws take approx. 5 ms in total on my local machine and a little bit more on Heroku (I guess it's because of the shared DB).
However, the rate at which the queue is processed is very low, about 2-3 jobs per second. What could be causing this?
Thank you.
Check for a new job, lock a job, do the job, mark it as completed, look for a new job.
You might find that the negotiation to get a new job, accessing Redis, etc. is causing a lot of overhead. If your task is only 5 ms long, it can probably live inside the request-response cycle. Background jobs are great when running a task would extend the response time considerably; very small jobs generally aren't worth the effort involved.
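The numbers in the question make the size of that overhead easy to estimate. The sketch below is written in Python purely for the arithmetic and assumes a single worker processing jobs one after another; the function name is made up.

```python
def per_job_overhead_ms(jobs_per_second, work_ms):
    """Time per job cycle that is not spent on the job's own work."""
    cycle_ms = 1000.0 / jobs_per_second
    return cycle_ms - work_ms


overhead_at_2 = per_job_overhead_ms(2, 5)      # 2 jobs per second
overhead_at_2_5 = per_job_overhead_ms(2.5, 5)  # 2.5 jobs per second
```

At 2 to 3 jobs per second a cycle takes roughly 330 to 500 ms, so with 5 ms of actual work nearly all of the time goes to everything around the job rather than the job itself.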
