I'm building a web-based CRON service using DynamoDB and Lambda. While I don't currently have the following problem, I'm curious about how I could solve it if it arises.
The architecture works like this:
Lambda A - query for all tasks that should occur in the current minute
Lambda A - for each task, increment a counter on the document
Lambda B - listen for the stream event for each document and run the actual CRON task
As far as I can tell, Lambda B should be scalable - AWS should run as many instances as needed to process all the stream events (I think).
But for Lambda A, say I have 1 billion documents that need to be processed each minute.
When I query for each minute's tasks, the Lambda will need to make multiple requests in order to fetch & update all the documents.
How could I architect the system such that all the documents get processed in < 60 seconds?
You're right, Lambda A would have to do a monster scan/query which wouldn't scale.
One way to make this work would be to partition your cron items so that you can invoke multiple Lambdas in parallel (i.e. fan out the work) instead of just one (Lambda A), so that each one handles a partition (or set of partitions) instead of the whole thing.
How you achieve this depends on what your current primary key looks like and how else you expect to query these items. Here's one solution:
cronID | rangeKey | jobInfo | counter
1001 | 72_2020-05-05T13:58:00 | foo | 4
1002 | 99_2020-05-05T14:05:00 | bar | 42
1003 | 01_2020-05-05T14:05:00 | baz | 0
1004 | 13_2020-05-05T14:10:00 | blah | 2
1005 | 42_2020-05-05T13:25:00 | 42 | 99
I've added a random prefix (00-99) to the rangeKey, so you can have different lambdas query different sets of items in parallel based on that prefix.
In this example you could invoke 100 lambdas each minute (the "Lambda A" types), with each handling a single prefix set. Or you could have say 5 lambdas, with each handling a range of 20 prefixes. You could even dynamically scale the number of lambda invocations up and down depending on load, without having to update the prefixes in your data in your table.
Since these lambdas are basically the same, you could just invoke lambda A the required number of times, injecting the appropriate prefix(es) for each one as a config.
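A minimal sketch of that fan-out with boto3, assuming a small coordinator that runs once per minute; the worker function name and the payload shape are illustrative, not fixed:

import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(minute_iso, num_workers=5, prefixes_per_worker=20):
    """Asynchronously invoke one 'Lambda A' worker per block of prefixes."""
    for i in range(num_workers):
        start = i * prefixes_per_worker
        payload = {
            "minute": minute_iso,  # e.g. "2020-05-05T14:05:00"
            "prefixes": [f"{p:02d}" for p in range(start, start + prefixes_per_worker)],
        }
        lambda_client.invoke(
            FunctionName="lambda-a-worker",  # hypothetical worker function
            InvocationType="Event",          # async, so the workers run in parallel
            Payload=json.dumps(payload),
        )

Scaling up or down is then just a matter of changing num_workers and prefixes_per_worker.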
EDIT
Re the 1MB page limit in your comment: you'll get a LastEvaluatedKey back if your query results have been limited. Your lambda can execute queries in a loop, passing the LastEvaluatedKey value back as ExclusiveStartKey until you've got all the result pages.
You'll still need to be careful about running time (and catch errors so you can retry, since this is not atomic), but fanning out your lambdas as above will deal with the running time if you fan out widely enough.
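A sketch of that pagination loop with boto3; it assumes a table (or GSI) whose partition key is the prefix and whose sort key is the scheduled time, since a Query always needs an equality condition on the partition key (all names here are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("cron_tasks")  # hypothetical table

def fetch_all_for_prefix(prefix, minute_iso):
    """Follow LastEvaluatedKey until every page for this prefix is read."""
    items = []
    kwargs = {
        "KeyConditionExpression": Key("prefix").eq(prefix)
        & Key("scheduledAt").eq(minute_iso)
    }
    while True:
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        # Resume the next page where the last one ended.
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]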
I'm not sure about your project, but it looks like what you're asking about is already covered in the AWS DynamoDB documentation:
When you create a new provisioned table in Amazon DynamoDB, you must specify its provisioned throughput capacity. This is the amount of read and write activity that the table can support. DynamoDB uses this information to reserve sufficient system resources to meet your throughput requirements.

You can create an on-demand mode table instead so that you don't have to manage any capacity settings for servers, storage, or throughput. DynamoDB instantly accommodates your workloads as they ramp up or down to any previously reached traffic level. If a workload's traffic level hits a new peak, DynamoDB adapts rapidly to accommodate the workload.

You can optionally allow DynamoDB auto scaling to manage your table's throughput capacity. However, you still must provide initial settings for read and write capacity when you create the table. DynamoDB auto scaling uses these initial settings as a starting point, and then adjusts them dynamically in response to your application's requirements.

As your application data and access requirements change, you might need to adjust your table's throughput settings. If you're using DynamoDB auto scaling, the throughput settings are automatically adjusted in response to actual workloads. You can also use the UpdateTable operation to manually adjust your table's throughput capacity. You might decide to do this if you need to bulk-load data from an existing data store into your new DynamoDB table. You could create the table with a large write throughput setting and then reduce this setting after the bulk data load is complete.

You specify throughput requirements in terms of capacity units—the amount of data your application needs to read or write per second. You can modify these settings later, if needed, or enable DynamoDB auto scaling to modify them automatically.
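For example, with boto3 you could create the table in on-demand mode (or switch an existing provisioned table over), so there is no capacity to manage at all; the table and key names below are placeholders:

import boto3

dynamodb = boto3.client("dynamodb")

# Create a table in on-demand mode: no read/write capacity settings needed.
dynamodb.create_table(
    TableName="cron_tasks",  # placeholder name
    AttributeDefinitions=[
        {"AttributeName": "cronID", "AttributeType": "S"},
        {"AttributeName": "rangeKey", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "cronID", "KeyType": "HASH"},
        {"AttributeName": "rangeKey", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand instead of provisioned
)

# Or flip an existing provisioned table to on-demand.
dynamodb.update_table(TableName="cron_tasks", BillingMode="PAY_PER_REQUEST")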
I hope this helps clear up your doubt.
Related
I have a ReportGeneration lambda that takes requests from clients and adds the following entries to a DDB table.
Customer ID <hash key>
ReportGenerationRequestID(UUID) <sort key>
ExecutionStartTime
ReportExecutionStatus <workflow status>
I have enabled a DDB stream trigger on this table, and creating an entry in this table triggers the report generation workflow. This is a multi-step workflow that takes a while to complete.
Where ReportExecutionStatus is the status of the report processing workflow.
I am supposed to maintain the history of all report generation requests that a customer has initiated.
Now, what I am trying to do is avoid concurrent processing requests by the same customer: if a report for a customer is already being generated, don't create another record in DDB.
Option considered:
Query DDB for the customer ID (consistent read):
- From the list, see if any entry is either InProgress or Scheduled
- If not, then create a new one (consistent write)
- Otherwise, return the already existing entry
Issue: if the customer clicks twice in a split second to generate a report, two lambdas can be triggered, causing 2 entries in DDB, and two parallel workflows can be initiated, something that I don't want.
Can someone recommend the best approach to ensure that there are no concurrent executions (2 workflows) for the same report from the same customer?
In short, when one execution is in progress, another one should not start.
You can use a ConditionExpression to only create the entry if it doesn't already exist; if you need to check different items, then you can use DynamoDB Transactions to check if another item already exists and, if not, create your item.
Those would be the ways to do it with DynamoDB, getting higher consistency.
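A sketch of the ConditionExpression route, using a per-customer "lock" item in a separate table so the existence check works regardless of the UUID sort key in your main table; the table and attribute names are illustrative:

import boto3
from botocore.exceptions import ClientError

locks = boto3.resource("dynamodb").Table("report_locks")  # hypothetical table

def try_acquire(customer_id):
    """Create the lock item only if this customer has no report in flight."""
    try:
        locks.put_item(
            Item={"CustomerID": customer_id, "Status": "InProgress"},
            ConditionExpression="attribute_not_exists(CustomerID)",
        )
        return True   # we won the race; start the workflow
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another request got there first; do nothing
        raise

When the workflow completes, delete the lock item so the customer's next request can go through.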
Another option would be to use SQS FIFO queues. You can group messages by customer ID; then you wouldn't have concurrent processing of messages for the same customer. Additionally, with this SQS solution you get all the advantages of SQS, like automated retry mechanisms or a dead-letter queue.
Limiting the number of concurrent Lambda executions is not possible as far as I know. That is the whole point of AWS Lambda, to easily scale and run multiple Lambdas concurrently.
That said, there is probably a better solution for your problem using a DynamoDB feature called "Strongly Consistent Reads".
By default, reads from DynamoDB (if you use the AWS SDK) are eventually consistent, causing the behaviour you observed: two writes to the same table are made, but your Lambda was only able to notice one of those writes.
If you use Strongly consistent reads, the documentation states:
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
So your Lambda needs to do a strongly consistent read on your table to check if the customer already has a job running. If there is already a job running, the Lambda does not create a new job.
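A sketch of that check with boto3; the attribute names follow the question, while the table name is an assumption:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("report_requests")  # hypothetical name

def has_running_job(customer_id):
    """Strongly consistent read: reflects all prior successful writes."""
    resp = table.query(
        KeyConditionExpression=Key("CustomerID").eq(customer_id),
        ConsistentRead=True,
    )
    return any(
        item.get("ReportExecutionStatus") in ("InProgress", "Scheduled")
        for item in resp["Items"]
    )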
My program receives thousands of events per second of different types. For example, 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses in 1 minute, 1 hour, 1 day, and so on. So I need event counts for the last minute, hour, or day for every user, and I want it to behave like a sliding window. In this case, the type of event is the user address.
I started using a time series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts in a minute or an hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't want events retrieved from the database, because they are just simple addresses. I just want to count them as fast as possible in different time intervals. I want to get the number of events of type x in a specific time interval (for example, the past hour).
I don't need to store the statistics on disk, so maybe a data structure to keep event counts in different time intervals is good for me. On the other hand, I need it to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM would not be a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough details on the event input format and how events can be delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
Rough description of suggested pattern:
Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine
Create a materialized view with per-minute aggregation in another memory-based table EventsPerMinute with the Buffer engine
Do the same for hourly aggregation of data in EventsPerHour
Optionally, use Grafana with the clickhouse datasource plugin to build dashboards
In ClickHouse, a Buffer storage engine not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
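A rough sketch of this pattern with the clickhouse-driver Python package; the column layout is assumed from the question, the Buffer parameters are placeholders to tune, and this variant keeps the aggregates on disk with SummingMergeTree rather than the memory-only layout described above:

from clickhouse_driver import Client  # pip install clickhouse-driver

ch = Client("localhost")

# Raw events; the Buffer table in front absorbs high-rate inserts in RAM
# and flushes them to this table in large blocks.
ch.execute("""
    CREATE TABLE events (ts DateTime, address String)
    ENGINE = MergeTree() ORDER BY (address, ts)
""")
ch.execute("""
    CREATE TABLE events_buffer AS events
    ENGINE = Buffer(default, events, 16, 10, 100, 10000, 1000000,
                    10000000, 100000000)
""")

# Per-minute counts per address, maintained as blocks arrive in events.
ch.execute("""
    CREATE MATERIALIZED VIEW events_per_minute
    ENGINE = SummingMergeTree() ORDER BY (address, minute)
    AS SELECT address, toStartOfMinute(ts) AS minute, count() AS hits
    FROM events GROUP BY address, minute
""")

# The application inserts into events_buffer; a trailing one-hour count is:
rows = ch.execute(
    "SELECT sum(hits) FROM events_per_minute"
    " WHERE address = %(a)s AND minute >= now() - 3600",
    {"a": "10.10.10.1"},
)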
At 100K events/second you may need some kind of shaper/load balancer in front of database.
You could consider a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test. You can think about your data structure as well: construct an hour map for each address and put each event into its hour bucket. When the hour passes, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour bucket and count the events in that hour's list.
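A pure-Python sketch of that bucket idea at minute granularity; in production the dict could live in a Hazelcast map instead of local RAM:

import time
from collections import defaultdict, deque

WINDOW_MINUTES = 60  # keep one hour of per-minute buckets

# address -> deque of (minute_epoch, count), oldest first
buckets = defaultdict(deque)

def record(address, now=None):
    """Count one event in the current minute's bucket for this address."""
    minute = int(now or time.time()) // 60
    dq = buckets[address]
    if dq and dq[-1][0] == minute:
        dq[-1] = (minute, dq[-1][1] + 1)
    else:
        dq.append((minute, 1))
    # Drop buckets that have slid out of the window.
    while dq and dq[0][0] <= minute - WINDOW_MINUTES:
        dq.popleft()

def count_last(address, minutes, now=None):
    """Events for this address in the trailing sliding window."""
    minute = int(now or time.time()) // 60
    return sum(c for m, c in buckets[address] if m > minute - minutes)

Memory stays bounded because each address holds at most WINDOW_MINUTES counters, not individual events.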
I have a requirement of moving documents from one storage area to another and planning to use Bulk Movement Jobs under Sweep Jobs in FileNet P8 v5.2.1.
My filter criteria is obviously (and only) the storage area id as I want to target a specific storage area and move the content to another storage area(kinda like archiving) without altering the security, relationship containment, document class etc.
When I run the job, though I have only around 100,000 objects in the storage area that I am targeting, the job's Examined Objects field shows 500M objects, and it took around 15 hours to move the objects. The DBA analyzed this situation and told me that though I have all the necessary indexes created on the docversion table (as per the FileNet documentation), the job is still doing a full table scan.
Why would something like this happen?
What additional indexes can be used and how would that be helpful?
Is there a better way to do this with less time consumption?
This is only for questions 2 and 3.
For indexes, you can use this documentation: https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.0/com.ibm.p8.performance.doc/p8ppt237.htm
You can improve the performance of your jobs if you split all the documents via the "Policy controlled batch size" option (as I remember) on the "Sweeps subsystem" tab in the Domain settings.
Use Time Slot management:
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc179.htm?lang=ru
and the Filter Timelimit option:
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc203.htm?lang=ru
In short, you just split all your documents into portions and process them at separate times and in separate threads.
I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using insert-or-update to maintain a table in which you keep pointers to the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read this excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from the Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch most recent lines for Resource A and B, here's what your entity structure would look like:
PartitionKey: Date/Time (in ticks), reversed, i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reversed ticks are required so that the most recent entries show up at the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
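A sketch of computing that reversed-ticks PartitionKey in Python; the .NET constants are standard, and zero-padding keeps string order equal to numeric order:

from datetime import datetime, timezone, timedelta

MAX_TICKS = 3_155_378_975_999_999_999        # DateTime.MaxValue.Ticks in .NET
DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def reversed_ticks_key(accessed_at):
    """PartitionKey string that makes the most recent access sort first."""
    micros = (accessed_at - DOTNET_EPOCH) // timedelta(microseconds=1)
    ticks = micros * 10                      # one .NET tick = 100 ns
    return f"{MAX_TICKS - ticks:019d}"       # zero-pad to 19 digits

entity = {
    "PartitionKey": reversed_ticks_key(datetime.now(timezone.utc)),
    "RowKey": "ResourceA",
    "AccessDate": datetime.now(timezone.utc).isoformat(),
    "User": "UserA",
}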
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
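A sketch of that consumer logic with the azure-data-tables package (the queue plumbing is omitted, and the table and property names are illustrative):

from azure.data.tables import TableClient, UpdateMode
from azure.core.exceptions import ResourceNotFoundError

table = TableClient.from_connection_string(
    "<connection string>", table_name="ResourceAccessCounts")

def apply_message(day_key, resource, accessed_at_iso):
    """Fold one 'resource accessed' queue message into the counts table."""
    try:
        entity = table.get_entity(partition_key=day_key, row_key=resource)
        entity["AccessCount"] += 1
        # Only move the timestamp forward; ISO-8601 strings compare correctly.
        if entity["LastAccessDateTime"] < accessed_at_iso:
            entity["LastAccessDateTime"] = accessed_at_iso
        table.update_entity(entity, mode=UpdateMode.REPLACE)
    except ResourceNotFoundError:
        table.create_entity({
            "PartitionKey": day_key,
            "RowKey": resource,
            "AccessCount": 1,
            "LastAccessDateTime": accessed_at_iso,
        })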
Now, to find the most accessed resources in a time span, you simply query this table. Assuming there is a limited number of resources (say in the 100s), you can get this information from the table in as little as one request. Since you're dealing with a small amount of data, you can simply download it on the client side and order it any way you see fit. However, to see the access details for a particular resource, you would have to fetch the detailed data (1,000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms; I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions is. Because Azure table storage doesn't (yet?) offer aggregate functions, I can't simply go .Sum(). To get around that, I'm going to:
Sum the values of the transactions in my app before I pass them to Azure.
I'll then pass the result of the sum into Azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by incrementing the value of RunningTotal each time I get new transactions (sketched below).
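A minimal sketch of that increment, again with azure-data-tables; the entity layout is mine, not from the original design:

from azure.data.tables import TableClient, UpdateMode

table = TableClient.from_connection_string(
    "<connection string>", table_name="Accounts")

def add_transactions(account_id, amounts):
    """Fold new transaction amounts into the precomputed RunningTotal."""
    entity = table.get_entity(partition_key=account_id, row_key="summary")
    entity["RunningTotal"] += sum(amounts)
    table.update_entity(entity, mode=UpdateMode.MERGE)

Reading the total back is then a single point read instead of a scan over every transaction.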
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'll almost be using table storage as a long-term cache rather than a database.
I'm pretty new to the field of big data and am currently stuck on a fundamental decision.
For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well. (Single data center, 4 nodes.)
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. The log entries should be queried by different time ranges in the first instance. As recommended, I started to "model my queries" with a wide-row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional detail:
column keys are a composition of timestamp+uuid, to be unique and to avoid overwrites;
log entries for a specific time are stored close together on a node via their identical partition key;
Thus log entries are stored in short time intervals per row, for example every log entry for 2015-01-01 01:05 with a precision of one minute. Queries are not really performed as range queries with a < operator; rather, entries are selected as blocks of a specified minute.
Range-based queries succeed with a decent response time, which is fine for me.
Question:
In the next step we want to gain additional information via queries that are mainly focused on the IP fields. For example: select all the entries which have IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.
So obviously the current model is pretty much unusable for additional IP-focused CQL queries. The problem is not finding a possible solution, but rather choosing among the various technologies that could provide one:
Try to solve the problem with standalone C* solutions. (Build a second model and administer the same data in a different shape)
Choose additional technologies like Spark...
Switch to HDFS/Hadoop - Cassandra/Hadoop solution...
and so on
With my lack of knowledge in this field, it is pretty hard to find the best path to take, especially since I have the feeling that using a cluster computing framework would be an excessive solution.
As I understood your question, your table schema looks like this:
create table logs (
    minute timestamp,
    id timeuuid,
    ips list<text>,
    message text,
    primary key (minute, id)
);
With this simple schema, you:
can fetch all logs for a specific minute.
can fetch short inter-minute ranges of log events.
want to query the dataset by IP.
From my point of view, there are multiple ways of implementing this idea:
create a secondary index on the IP addresses. But in C* you will lose the ability to also query by timestamp: C* cannot merge primary and secondary indexes the way mysql/pgsql can.
denormalize the data: write your log events to two tables at once, the first optimized for timestamp queries (minute+ts as PK), the second for IP-based queries (IP+ts as PK); see the sketch below.
use Spark for analytical queries. But Spark will need to perform a (full?) table scan (in a nifty distributed map-reduce way, but nevertheless a table scan) each time it extracts the data you've requested, so all your queries will take a long time to finish. This approach can cause problems if you plan to have a lot of low-latency queries.
use an external index like ElasticSearch for querying, and C* for storing the data.
In my opinion, the C* way of doing such things is to have a set of separate tables for different queries. This will give you the ability to perform blazing-fast queries, at the cost of increased storage.
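For the denormalization option (2), here's a sketch with the Python cassandra-driver; the second table's layout is one possible choice, keyed so a single IP's events cluster together:

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("logs_ks")  # hypothetical keyspace

# Second copy of each event, keyed for IP lookups.
session.execute("""
    CREATE TABLE IF NOT EXISTS logs_by_ip (
        ip text,
        id timeuuid,
        message text,
        PRIMARY KEY (ip, id)
    )
""")

def store(minute, event_id, ips, message):
    """Write each event to both tables at ingest time."""
    session.execute(
        "INSERT INTO logs (minute, id, ips, message) VALUES (%s, %s, %s, %s)",
        (minute, event_id, ips, message))
    for ip in ips:
        session.execute(
            "INSERT INTO logs_by_ip (ip, id, message) VALUES (%s, %s, %s)",
            (ip, event_id, message))

# All events seen from one IP, time-ordered by the timeuuid clustering column:
rows = session.execute(
    "SELECT id, message FROM logs_by_ip WHERE ip = %s", ("10.10.10.1",))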