I have an RDS database that serves as the source of truth. The challenge I have is keeping this database partially synced to Redis so that a server app can use it. The sync would only ever go in one direction, but I can't wrap my head around what tools to use to make these syncs happen, preferably in an optimized way. In other words, rather than loading the entire data set, it would be great if only deltas were synced.
I hope someone can give some insight on how this can be done.
Thank you!
Most RDBMS provide a way to subscribe to the transaction log, allowing you to put in place a "change data capture" (CDC) event stream.
In this case you can subscribe to the database events and push each changed or updated record into Redis.
You can, for example, use Debezium to capture the events; as you can see, the Debezium community has connectors for various data sources.
---------          ------------       ---------
| RDBMS |========>>| Debezium |======>| Redis |
---------          ------------       ---------
This demonstration (mostly Java) shows this pattern (a little richer, since it uses Redis Streams and an intermediate state): the events are captured so that inserts, updates, and deletes in MySQL are propagated to Redis.
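For illustration, here is a minimal Python sketch of the consuming side, assuming Debezium publishes change events to a Kafka topic named "dbserver1.inventory.customers" and that the Redis key layout (customer:<id> hashes) is our own choice rather than anything prescribed by Debezium:

import json

from kafka import KafkaConsumer   # pip install kafka-python
import redis                      # pip install redis

# Consume the Debezium change-event topic; tombstone messages arrive with a None value.
consumer = KafkaConsumer(
    "dbserver1.inventory.customers",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)
r = redis.Redis(host="localhost", port=6379)

for message in consumer:
    event = message.value
    if event is None:                         # tombstone emitted after a delete; nothing to do
        continue
    payload = event.get("payload", event)     # envelope differs if schemas are enabled
    op = payload["op"]                        # "c" create, "u" update, "d" delete, "r" snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        r.hset(f"customer:{row['id']}", mapping={k: str(v) for k, v in row.items()})
    elif op == "d":
        row = payload["before"]
        r.delete(f"customer:{row['id']}")

Because only the changed rows flow through the topic, this naturally gives you the delta-only sync you asked about.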
Another option, which does not match your need but is interesting, is a "write-behind" cache. In this case you update the cache in Redis, and Redis pushes the update to the RDBMS using RedisGears.
You can find more information about this write-behind pattern with RedisGears in this GitHub repo.
---------          ---------       ---------
| Redis |========>>| Gears |======>| RDBMS |
---------          ---------       ---------
Related
I'm building a web-based CRON service using DynamoDB and Lambda. While I don't currently have the following problem, I'm curious about how I could solve it if it arises.
The architecture works like this:
Lambda A - query for all tasks that should occur in the current minute
Lambda A - for each task, increment a counter on the document
Lambda B - listen for the stream event for each document and run the actual CRON task
As far as I can tell, Lambda B should be scalable - AWS should run as many instances as needed to process all the stream events (I think).
But for Lambda A, say I have 1 billion documents that need to be processed each minute.
When I query for each minute's tasks, the Lambda will need to make multiple requests in order to fetch & update all the documents.
How could I architect the system such that all the documents get processed in < 60 seconds?
You're right, Lambda A would have to do a monster scan/query which wouldn't scale.
One way to make this work would be to partition your cron items so that you can invoke multiple Lambdas in parallel (i.e. fan out the work) instead of just one (Lambda A), with each one handling a partition (or set of partitions) instead of the whole thing.
How you achieve this depends on what your current primary key looks like and how else you expect to query these items. Here's one solution:
cronID | rangeKey               | jobInfo | counter
1001   | 72_2020-05-05T13:58:00 | foo     | 4
1002   | 99_2020-05-05T14:05:00 | bar     | 42
1003   | 01_2020-05-05T14:05:00 | baz     | 0
1004   | 13_2020-05-05T14:10:00 | blah    | 2
1005   | 42_2020-05-05T13:25:00 | 42      | 99
I've added a random prefix (00-99) to the rangeKey, so you can have different lambdas query different sets of items in parallel based on that prefix.
In this example you could invoke 100 lambdas each minute (the "Lambda A" types), with each handling a single prefix set. Or you could have say 5 lambdas, with each handling a range of 20 prefixes. You could even dynamically scale the number of lambda invocations up and down depending on load, without having to update the prefixes in your data in your table.
Since these lambdas are basically the same, you could just invoke lambda A the required number of times, injecting the appropriate prefix(es) for each one as a config.
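As a rough sketch of that fan-out (the worker function name "cron-worker-a" and the payload shape are assumptions, not part of the question):

import json

import boto3

lambda_client = boto3.client("lambda")

NUM_WORKERS = 5                                   # e.g. 5 workers, 20 prefixes each
PREFIXES = [f"{i:02d}" for i in range(100)]       # the 00-99 prefixes from the table above

def fan_out(minute_iso: str) -> None:
    chunk = len(PREFIXES) // NUM_WORKERS
    for i in range(NUM_WORKERS):
        payload = {
            "minute": minute_iso,                 # e.g. "2020-05-05T13:58:00"
            "prefixes": PREFIXES[i * chunk:(i + 1) * chunk],
        }
        lambda_client.invoke(
            FunctionName="cron-worker-a",         # hypothetical name for the "Lambda A" worker
            InvocationType="Event",               # asynchronous, fire-and-forget
            Payload=json.dumps(payload).encode("utf-8"),
        )

Scaling up or down is then just a matter of changing NUM_WORKERS; the data in the table stays untouched.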
EDIT
Re the 1MB page limit in your comment, you'll get a LastEvaluatedKey back if your query has been limited. Your lambda can execute queries in a loop, passing the LastEvaluatedKey value back as ExclusiveStartKey until you've got all the result pages.
You'll still need to be careful about running time (and catch errors so you can retry, since this is not atomic), but fanning out your Lambdas as above will deal with the running time if you fan out widely enough.
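A minimal sketch of that pagination loop in Python, assuming for illustration a table named "cron-items" with a GSI named "schedule-index" keyed on a "prefix" partition key and a "scheduledAt" sort key (none of these names come from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("cron-items")

def query_all_pages(prefix: str, minute_iso: str) -> list:
    items = []
    kwargs = {
        "IndexName": "schedule-index",
        "KeyConditionExpression": Key("prefix").eq(prefix) & Key("scheduledAt").eq(minute_iso),
    }
    while True:
        page = table.query(**kwargs)
        items.extend(page["Items"])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break                                  # no more pages
        kwargs["ExclusiveStartKey"] = last_key     # resume where the previous page stopped
    return items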
I'm not sure about your project, but it looks like what you are asking about is already covered in the AWS DynamoDB documentation; read here:
When you create a new provisioned table in Amazon DynamoDB, you must specify its provisioned throughput capacity. This is the amount of read and write activity that the table can support. DynamoDB uses this information to reserve sufficient system resources to meet your throughput requirements.

You can create an on-demand mode table instead so that you don't have to manage any capacity settings for servers, storage, or throughput. DynamoDB instantly accommodates your workloads as they ramp up or down to any previously reached traffic level. If a workload’s traffic level hits a new peak, DynamoDB adapts rapidly to accommodate the workload.
For more information

You can optionally allow DynamoDB auto scaling to manage your table's throughput capacity. However, you still must provide initial settings for read and write capacity when you create the table. DynamoDB auto scaling uses these initial settings as a starting point, and then adjusts them dynamically in response to your application's requirements.

As your application data and access requirements change, you might need to adjust your table's throughput settings. If you're using DynamoDB auto scaling, the throughput settings are automatically adjusted in response to actual workloads. You can also use the UpdateTable operation to manually adjust your table's throughput capacity. You might decide to do this if you need to bulk-load data from an existing data store into your new DynamoDB table. You could create the table with a large write throughput setting and then reduce this setting after the bulk data load is complete.

You specify throughput requirements in terms of capacity units—the amount of data your application needs to read or write per second. You can modify these settings later, if needed, or enable DynamoDB auto scaling to modify them automatically.
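For a concrete illustration of the on-demand option quoted above, a minimal boto3 sketch (the table and attribute names are placeholders, not from the question):

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="cron-items",
    AttributeDefinitions=[
        {"AttributeName": "cronID", "AttributeType": "N"},
        {"AttributeName": "rangeKey", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "cronID", "KeyType": "HASH"},
        {"AttributeName": "rangeKey", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",    # on-demand: no read/write capacity to provision or tune
)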
I hope this helps clear up your doubt.
So we have a web application using .NET with a Cassandra/Spark combo to produce online reports.
Currently we grab all relevant data from Cassandra and render it inside a table through a JavaScript plugin that also sorts it (depending on column clicked).
E.g.
PK = PARTITION KEY | CK = CLUSTERING KEY
   PK    PK      CK
-------------------------------------
| user | date  | application | time |
-------------------------------------
| A    | 17500 | app1        | 3000 |
| A    | 17500 | calc        | 1000 |
| A    | 17500 | word        | 5000 |
-------------------------------------
However, the data coming back is getting increasingly large, so we needed to develop some sort of pagination to avoid long request and front-end loading times.
The column users are most likely to sort by is time, which unfortunately is not part of the clustering key and therefore cannot be used with the ORDER BY clause.
A solution we came up with was creating a 'ranking' table with the same data
E.g.
   PK    PK     CK
--------------------------------------------
| user | date  | rank | application | time |
--------------------------------------------
| A    | 17500 | 1    | word        | 5000 |
| A    | 17500 | 2    | app1        | 3000 |
| A    | 17500 | 3    | calc        | 1000 |
--------------------------------------------
...but this would put a lot more load onto Spark as the data gathered for 'time' is constantly incrementing and therefore changing the rank.
We could also order the results server-side, cache and retrieve limited data through ajax calls, but this method significantly increases memory load on the server (especially if many users are using the system at once).
Perhaps I'm overthinking this and there is a simple Cassandra table construction that could be used instead.
What would be the best way to solve this problem?
EDIT (15th Dec 2017):
Came across something in Cassandra called materialized views, which seem to be able to order on non-key columns by promoting them to clustering keys. This would be great for grabbing the top N rows and sorting, but not for pagination.
EDIT (18th Dec 2017):
The DataStax C# driver allows for pagination of returned results. The paging state can be saved and resumed when needed. This, together with the materialized views, would complete the pagination.
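For what it's worth, here is a sketch of that combination using the DataStax Python driver (the C# driver exposes the same fetch-size and paging-state mechanism); the keyspace, table, and view names are assumptions:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("reports")

# A materialized view that promotes "time" to a clustering key so rows come back sorted.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS report_by_time AS
        SELECT user, date, application, time FROM report
        WHERE user IS NOT NULL AND date IS NOT NULL
          AND application IS NOT NULL AND time IS NOT NULL
        PRIMARY KEY ((user, date), time, application)
        WITH CLUSTERING ORDER BY (time DESC)
""")

# First page: fetch_size is the page size; paging_state marks where the page ended.
stmt = SimpleStatement(
    "SELECT application, time FROM report_by_time WHERE user = %s AND date = %s",
    fetch_size=50,
)
first_page = session.execute(stmt, ("A", 17500))
rows = first_page.current_rows
saved_state = first_page.paging_state      # persist this (e.g. in the user's web session)

# A later request resumes from the saved state to get the next page.
next_page = session.execute(stmt, ("A", 17500), paging_state=saved_state)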
EDIT (19th Dec 2017)
Haven't really delved into the pit of DataFrames through Spark - once set up, they are incredibly fast to sort and filter on, treating it like SQL.
Key words: once set up. Found they took an average of around 7 seconds to create.
EDIT (29 Mar 2018)
Hit a snag with the current solution (materialized view + limiting results). The materialized view needs to update constantly, resulting in craploads of tombstones. This means: bad cluster performance.
See Sorting Results by Non-Clustering Key and Tombstones When Updating.
Back to square 1. sigh
EDIT (22 Aug 2018)
Through vigorous research, it appears the way to go is implementing a Solr solution. Solr allows for powerful and fast indexed searches as well as paging. The blog post 'Avoid pitfalls in scaling Cassandra' is a good resource from a Walmart dev that explains how they did paging using 'sharding'.
Been a good while since asking this question but wanted to post some information about the current solution.
Partition keys are 'key'.
Designing the database so only the data you want returned is returned.
Filtering on the exact partition key instead of also filtering on clustering keys improved the performance of the cluster tremendously. We now use only 1 table with a single partition key instead of 100s of tables with composite keys. Sharding was also implemented.
KAFKA Data Streaming & Caching
One of the largest pitfalls we faced was the huge amount of pressure the database was under from our constantly updating data, which often inserted duplicate rows. This created problems with memtable sizes and flush times, which often saw nodes falling over. https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
So we decided to change to streaming instead of batch processing (Spark).
Kafka streaming is so fast that no Cassandra querying is done while topics still need to be kept in memory. Optimised Kafka topics stream into an intermediate caching system, which sorts the data using LINQ (C#) and keeps it there until a certain period of time has passed. Data is retrieved from this cache for paging.
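A rough sketch of that shape in Python (the real system uses C# and LINQ; the topic name, record layout, and TTL are assumptions):

import json
import time

from kafka import KafkaConsumer   # pip install kafka-python

CACHE_TTL_SECONDS = 300
cache = []                        # list of (arrival_time, record) tuples held in memory

def serve_page(page: int, page_size: int = 25):
    """Return one page of cached records, sorted by their 'time' field descending."""
    now = time.time()
    live = [rec for ts, rec in cache if now - ts < CACHE_TTL_SECONDS]   # drop expired entries
    live.sort(key=lambda rec: rec["time"], reverse=True)
    return live[page * page_size:(page + 1) * page_size]

# In practice the consumer loop runs on its own thread/process; shown inline for brevity.
consumer = KafkaConsumer(
    "report-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    cache.append((time.time(), message.value))    # cache the record with its arrival time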
Spark streaming would have also worked for us, but Kafka just fit better. Here's a good article on the difference and what might be better for you:
https://www.knowledgehut.com/blog/big-data/kafka-vs-spark
I'm currently creating a program that imports all the groups and feeds from Facebook that the user wants.
I used to use the Graph API with OAuth and this works very well.
But I reached the point where I realized that one request can't handle the import of 1000 groups plus their feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
Restful
Questions
What's your experience with this?
Would Hadoop be the right solution?
You can use neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data
http://www.neo4j.org/
1) Decide on the structure of your nodes, relationships, and their properties; accordingly, you need to create an API that will get data from Facebook and store it in Neo4j.
I have used neo4j in 3 big projects, and it is best for graph data.
2) Create a cron job that will get data from Facebook and store it in Neo4j.
I think implementing MySQL as a graph database is not a good idea; for large data, Neo4j is the better option.
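A minimal Python sketch of steps 1 and 2, assuming the Bolt endpoint, credentials, Graph API endpoint, and node/relationship names shown here (they are illustrative, not prescriptive):

import requests                    # pip install requests
from neo4j import GraphDatabase    # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def import_groups(user_id: str, access_token: str) -> None:
    # Pull the user's groups from the Graph API (permissions and fields are assumptions).
    resp = requests.get(
        f"https://graph.facebook.com/{user_id}/groups",
        params={"access_token": access_token},
    )
    groups = resp.json().get("data", [])
    with driver.session() as session:
        for group in groups:
            # MERGE makes the import idempotent, so the cron run can be repeated safely.
            session.run(
                "MERGE (u:User {id: $uid}) "
                "MERGE (g:Group {id: $gid}) SET g.name = $name "
                "MERGE (u)-[:MEMBER_OF]->(g)",
                uid=user_id, gid=group["id"], name=group.get("name"),
            )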
Interestingly, you have already designed the appropriate solution yourself. So in fact you need the following components:
a relational database, since you want to request data in a structured, quick way
-> from experience I would stress the need for a fully normalized data model (in your case with tables users, groups, users2groups), and for 4-byte surrogate keys over the larger keys from Facebook (for back-referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)) (see the sketch after this list)
-> never, ever establish unique columns based on strings longer than 8 bytes; unique bulk inserts can be done based on hashes + string checking via insert...select
-> once you have that settled you could also look into sparse matrices (see Wikipedia) and bitmaps to optimize your users2groups table (however, I have learned that this is an extra that should not keep you from coming up with a first version soon)
a cron job that is run periodically
-> ideally within the caps Facebook gives you (so if they rule that you may not request more often than once per second, stick to that - not more, but also try to come as close as possible to the cap); invest some time in getting the management of this settled, since different types of requests may need to be fired (requests for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done during development - so if I were you I would stick to a high-level programming language that does not bother too much with variable type juggling and that also comes with broad support for associative arrays, such as PHP, and I would program the thing myself
-> I have had good experiences with setting up the cron job as a web page with output buffering deactivated (for PHP look at ob_end_flush(void)) - easy to test, and the cron job can be triggered via curl; if you channel status output through your own function (e.g. with time stamps) it also becomes flexible enough to run either via browser or via the command line -> which means efficient testing + efficient production running
your user UI, which only requests your database and never, ever, never the external system API
lots of memory, to keep your performance high (optimally: all your data + index data fits into the database memory/cache dedicated to the database)
-> if you use MySQL as the database you should look into innodb_flush_log_at_trx_commit=0 and innodb_buffer_pool_size (just google them if interested)
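As mentioned in the index bullet above, here is a Python sketch of the hash-based lookup; the table and column names ("users", "name", "name_hash") and the connection details are assumptions:

import zlib

import mysql.connector             # pip install mysql-connector-python

def find_users(conn, search: str):
    name_hash = zlib.crc32(search.lower().encode("utf-8"))   # 4-byte hash, cheap to index
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT id, name FROM users WHERE name_hash = %s AND name = %s",
        (name_hash, search),       # the hash narrows the index scan, the string check confirms
    )
    return cur.fetchall()

conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="fb_import")
print(find_users(conn, "Some Group Name"))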
Hadoop is a file system layer - it could help you with availability. However I would put this into the category of "sparse matrix", which is nothing that stops you from coming up with a solution. From my experience availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like Neo4j from the other answer, so I wondered what I could learn for my future projects. My experience with MySQL is that RAM is usually the biggest constraint. Increasing your RAM so that you can load the full database can gain you performance improvements by a factor of 2-1000, depending on where you are coming from. Everything else, such as index improvements and structure, somehow follows. So if I had to make up a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MySQL. It stores records efficiently, but needs to run joins to derive relations (which Neo4j does not require to that extent). Join costs are usually low with the right indexes, and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html Neo4j needs to add extra management data for the property separation. For big-data projects that management data adds up, and in full-load-to-memory setups it requires you to buy more memory. Performance-wise, both options are at their best here. Much further down the line you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In the worst case MySQL will even put indexes on disk (at least partly), which can result in massive read delays. In comparison, with NEO4J you could perform a 'direct jump from node to node', which should - at least in theory - be faster.
We are using oracle CQN for change notifications for specific queries.
This is working fine for all the inserts and updates. The problem is deletes: on delete, the notification is sent with the ROWID amongst other details. We cannot use the ROWID to look up the row any more because it has been deleted.
Is there a way to get more data in a CQN notification regarding the deleted row ?
I'm afraid not.
My understanding is that this service is tailored to allow servers or clients to implement caches. In that case the cached table or view is supposed to be loaded in memory, including the rowid; upon a notification, the cache manager that subscribed to the CQN service is supposed to invalidate the rows affected by the rowid list (or fetch them again in advance).
Real-life example: this can be useful for real-time databases like an Intelligent Network (i.e. to manage prepaid subscribers on a telecom network), in which callers need to be put through ASAP. The machine in charge of authorizing the calls (the SCP; there are several of them across the whole territory) is usually an in-memory database, and the real persistent db lives on another node (the SDP, at a central datacenter). The SDP, with its on-disk db, receives life-cycle events and balance refill events and notifies the subscribing SCPs.
You might have a different usage model.
I had this problem too. Instead of deleting a row, I used an "Active" column and changed it from "YES" to "NO".
I have a table of non trivial size on a DB2 database that is updated X times a day per user input in another application. This table is also read by my web-app to display some info to another set of users. I have a large number of users on my web app and they need to do lots of fuzzy string lookups with data that is up-to-the-minute accurate. So, I need a server side cache to do my fuzzy logic on and to keep the DB from getting hammered.
So, what's the best option? I would hate to pull the entire table every minute when the data changes so rarely. I could set up a trigger to update a timestamp in a smaller table and poll that to see if I need to refresh my cache, but that seems hacky too.
Ideally I would like to have DB2 tell my web-app when something changes, or at least provide a very lightweight mechanism to detect data level changes.
I think if your web application is running in WebSphere, setting up MQ would be a pretty good solution.
You could write triggers that use the MQ Series routines to add things to a queue, and your web app could subscribe to the queue and listen for updates.
If your web app is not in WebSphere then you could still look at this option but it might be more difficult.
A simple solution could be to have a timestamp (somewhere) for the latest change to the table.
The timestamp could be located in a small table/view that is updated either by the application that updates the big table or by an update trigger on the big table.
The update trigger's only task would be to update the "helper" timestamp with the current timestamp.
Then the webapp only checks this timestamp.
If the timestamp is newer than what the webapp has, the data is reread from the big table.
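For illustration, a rough Python sketch of the webapp-side check (the driver usage, connection string, and the "cache_marker"/"last_change" names are assumptions):

import ibm_db_dbi                  # pip install ibm_db

conn = ibm_db_dbi.connect("DATABASE=mydb;HOSTNAME=localhost;PORT=50000;UID=user;PWD=pw", "", "")
last_seen = None

def refresh_if_stale(load_cache):
    """Reload the cache only if the helper timestamp moved since our last check."""
    global last_seen
    cur = conn.cursor()
    cur.execute("SELECT last_change FROM cache_marker")
    (latest,) = cur.fetchone()
    if last_seen is None or latest > last_seen:    # something changed since the last poll
        load_cache()                               # re-read the big table into the cache
        last_seen = latest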
A "low-tech"-solution thats fairly non intrusive to the exsisting system.
Hope this solution fits your setup.
Regards
Sigersted
Having the database push a message to your webapp is certainly doable via a variety of mechanisms (like mqseries, etc). Similar and easier is to write a java stored procedure that gets kicked off by the trigger and hands the data to your cache-maintenance interface. But both of these solutions involve a lot of versioning dependencies, etc that could be a real PITA.
Another option might be to reconsider the entire approach. Is it possible that instead of maintaining a cache on your app's side you could perform your text searching on the original table?
But my suggestion is to do as you (and the other poster) mention - and just update a timestamp in a single-row table purposed to do this, then have your web-app poll that table. Similarly you could just push the changed rows to this small table - and have your cache-maintenance program pull from this table. Either of these is very simple to implement - and should be very reliable.