So we have a .NET web application that uses a Cassandra / Spark combination to produce online reports.
Currently we grab all the relevant data from Cassandra and render it in a table through a JavaScript plugin that also sorts it (depending on the column clicked).
E.g.
PK = PARTITION KEY | CK = CLUSTERING KEY
PK PK CK
-------------------------------------
| user | date | application | time |
-------------------------------------
| A | 17500 | app1 | 3000 |
| A | 17500 | calc | 1000 |
| A | 17500 | word | 5000 |
-------------------------------------
However, the data coming back is growing ever larger, so we needed to develop some form of pagination to avoid long request and front-end loading times.
The column users are most likely to sort by is time, which unfortunately is not part of the clustering key, so the ORDER BY clause cannot be used on it.
A solution we came up with was to create a 'ranking' table containing the same data:
E.g.
PK PK CK
--------------------------------------------
| user | date | rank | application | time |
--------------------------------------------
| A | 17500 | 1 | word | 5000 |
| A | 17500 | 2 | app1 | 3000 |
| A | 17500 | 3 | calc | 1000 |
--------------------------------------------
...but this would put a lot more load onto Spark, as the data gathered for 'time' is constantly incrementing and therefore changing the rank.
We could also order the results server-side, cache them, and retrieve limited data through AJAX calls, but this method significantly increases the memory load on the server (especially if many users are using the system at once).
Perhaps I'm overthinking this and there is a simple Cassandra table construction that could be used instead.
What would be the best way to solve this problem?
EDIT (15th Dec 2017):
Came across something in Cassandra called materialized views, which can expose non-key columns as clustering keys. This would be great for grabbing a top number of rows, already sorted, but not for pagination.
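As a reference point only, here is a rough sketch of what such a materialized view could look like for the example table above, created through the DataStax C# driver. The keyspace name and the app_usage / app_usage_by_time table and view names are invented for illustration; the real schema will differ.

    using Cassandra;

    class MaterializedViewSetup
    {
        static void Main()
        {
            var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
            var session = cluster.Connect("reports");   // hypothetical keyspace

            // 'time' becomes a clustering key in the view, so rows come back sorted by it.
            session.Execute(@"
                CREATE MATERIALIZED VIEW IF NOT EXISTS app_usage_by_time AS
                    SELECT user, date, application, time
                    FROM app_usage
                    WHERE user IS NOT NULL AND date IS NOT NULL
                      AND application IS NOT NULL AND time IS NOT NULL
                    PRIMARY KEY ((user, date), time, application)
                    WITH CLUSTERING ORDER BY (time DESC)");

            // Grabbing a sorted top N per user/day is now a plain LIMIT query.
            var top = session.Execute(
                "SELECT application, time FROM app_usage_by_time " +
                "WHERE user = 'A' AND date = 17500 LIMIT 10");
        }
    }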
EDIT (18th Dec 2017):
The DataStax C# driver allows for pagination of the results returned. The paging state can be saved and resumed when needed. Together with the materialized views, this would complete the pagination.
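A minimal sketch of how that saved paging state can be used with the DataStax C# driver, reusing the view from the previous sketch; how you serialize and ship the paging state between HTTP requests is up to you.

    using Cassandra;

    class PagedQuery
    {
        // Returns one page of rows plus the paging state to resume from.
        static (RowSet Page, byte[] NextState) GetPage(ISession session, byte[] pagingState)
        {
            var statement = new SimpleStatement(
                    "SELECT application, time FROM app_usage_by_time WHERE user = ? AND date = ?",
                    "A", 17500)
                .SetPageSize(50)        // rows per page sent to the front end
                .SetAutoPage(false);    // fetch exactly one page per request

            if (pagingState != null)
                statement.SetPagingState(pagingState);  // resume where the last page ended

            var rs = session.Execute(statement);
            return (rs, rs.PagingState);                // PagingState is null on the last page
        }
    }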
EDIT (19th Dec 2017):
Hadn't really delved into Spark DataFrames before; once set up, they are incredibly fast to sort and filter on, treating the data much like SQL.
Key words: once set up. We found they took an average of around 7 seconds to create.
EDIT (29 Mar 2018):
Hit a snag with the current solution (materialized view + limited results). The materialized view needs to be updated constantly, resulting in craploads of tombstones. This means: bad cluster performance.
See Sorting Results by Non-Clustering Key and Tombstones When Updating.
Back to square 1. sigh
EDIT (22 Aug 2018):
Through vigorous research, it appears the way to go is implementing a Solr solution. Solr allows for powerful and fast indexed searches as well as paging. The blog post 'Avoid pitfalls in scaling Cassandra', from a Walmart developer, is a good resource that explains how they did paging using 'sharding'.
Been a good while since asking this question, but I wanted to post some information about the current solution.
Partition keys are 'key'.
Design the database so that only the data you want returned is returned.
Filtering on the exact partition key, instead of also filtering on clustering keys, improved the performance of the cluster tremendously. We now use only one table with a single partition key instead of hundreds of tables with composite keys. Sharding was also implemented.
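As a hedged illustration of what "one table, an exact partition key, plus sharding" can look like, the sketch below encodes user, day and a shard bucket into a single partition key column. The table name, column names and shard count are made up; the real schema differs.

    using Cassandra;

    class ShardedReportTable
    {
        const int ShardCount = 10; // hypothetical number of shards per user/day

        static void CreateTable(ISession session)
        {
            // One table whose single partition key column already encodes
            // user, day and shard, so every read targets exact partitions.
            session.Execute(@"
                CREATE TABLE IF NOT EXISTS report_rows (
                    bucket      text,     -- e.g. 'A:17500:3'  (user:date:shard)
                    application text,
                    time        bigint,
                    PRIMARY KEY (bucket, application))");
        }

        // Deterministic shard choice; readers fan out over buckets 0..ShardCount-1.
        static string BucketFor(string user, int date, string application) =>
            $"{user}:{date}:{(StableHash(application) & 0x7fffffff) % ShardCount}";

        static int StableHash(string s)
        {
            unchecked
            {
                var h = 23;
                foreach (var c in s) h = h * 31 + c;
                return h;
            }
        }
    }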
Kafka Data Streaming & Caching
One of the largest pitfalls we faced was simply the huge amount of pressure the database was under from our constantly updating data, which often inserted duplicate rows. This created problems with memtable sizes and flush times, which often saw nodes falling over. https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
So we decided to change to streaming instead of batch processing (Spark).
Kafka streaming is so fast that no Cassandra querying is done until topics no longer need to be kept in memory. Optimised Kafka topics stream into an intermediate caching system, which sorts the data using LINQ (C#) and keeps it there until a certain period of time has passed. Data is retrieved from this cache for paging.
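A very rough sketch of that flow, assuming the Confluent.Kafka client and a plain in-memory cache; the topic name, message format and class names are invented, and expiry and locking are left out for brevity.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using Confluent.Kafka;

    class UsageCache
    {
        // key: "user:date", value: rows kept in memory until they expire.
        readonly ConcurrentDictionary<string, List<(string App, long Time)>> _rows =
            new ConcurrentDictionary<string, List<(string App, long Time)>>();

        public void Consume(CancellationToken ct)
        {
            var config = new ConsumerConfig
            {
                BootstrapServers = "localhost:9092",
                GroupId = "report-cache"
            };

            using var consumer = new ConsumerBuilder<string, string>(config).Build();
            consumer.Subscribe("app-usage");   // hypothetical topic name

            while (!ct.IsCancellationRequested)
            {
                var msg = consumer.Consume(ct);
                var parts = msg.Message.Value.Split('|');   // assumed "application|time" payload
                _rows.GetOrAdd(msg.Message.Key, _ => new List<(string, long)>())
                     .Add((parts[0], long.Parse(parts[1])));
            }
        }

        // Paging straight out of the cache, sorted with LINQ.
        public IEnumerable<(string App, long Time)> GetPage(string userDate, int page, int pageSize) =>
            _rows.TryGetValue(userDate, out var rows)
                ? rows.OrderByDescending(r => r.Time).Skip(page * pageSize).Take(pageSize).ToList()
                : Enumerable.Empty<(string App, long Time)>();
    }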
Spark streaming would have also worked for us, but Kafka just fit better. Here's a good article on the difference and what might be better for you:
https://www.knowledgehut.com/blog/big-data/kafka-vs-spark
Related
I have an RDS database that serves as the source of truth. A challenge I have is keeping this database partially synced to Redis so the data is available for a server app to use. This would be a one-way sync, always going in one direction, but I can't wrap my head around what tools to use to make these syncs happen, preferably in an optimized way. In other words, rather than loading the entire data set, it would be great if only deltas were synced.
I hope someone can give some insight on how this can be done.
Thank you!
Most RDBMSs provide a way to subscribe to the transaction log, allowing you to put in place a "change data capture" (CDC) event stream.
In this case you can subscribe to the database's events and put each changed or updated record into Redis.
You can, for example, use Debezium to capture the events; as you can see, the Debezium community has connectors for various data sources.
---------            ------------            ---------
| RDBMS | =========> | Debezium | =========> | Redis |
---------            ------------            ---------
This demonstration (mostly Java) shows this approach (a little richer, since it uses Redis Streams and an intermediate state): inserts/updates/deletes in MySQL are captured with this method and the information is sent to Redis.
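For example, a minimal hedged sketch of the consuming side in C#, assuming Debezium is already publishing change events to a Kafka topic (the dbserver1.inventory.customers topic name follows the Debezium tutorial convention), using Confluent.Kafka plus StackExchange.Redis; the exact key/value formats depend on your connector configuration.

    using Confluent.Kafka;
    using StackExchange.Redis;

    class CdcToRedis
    {
        static void Main()
        {
            var redis = ConnectionMultiplexer.Connect("localhost:6379").GetDatabase();

            var config = new ConsumerConfig
            {
                BootstrapServers = "localhost:9092",
                GroupId = "rds-to-redis-sync",
                AutoOffsetReset = AutoOffsetReset.Earliest
            };

            using var consumer = new ConsumerBuilder<string, string>(config).Build();
            consumer.Subscribe("dbserver1.inventory.customers"); // Debezium topic: server.schema.table

            while (true)
            {
                var result = consumer.Consume();
                // Debezium sends the row key as the message key and the change
                // envelope (before/after images) as the value; here we simply
                // mirror the latest value into Redis. A null value ("tombstone")
                // means the row was deleted.
                if (result.Message.Value == null)
                    redis.KeyDelete(result.Message.Key);
                else
                    redis.StringSet(result.Message.Key, result.Message.Value);
            }
        }
    }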
Another option, which does not match your need but is interesting, is to do a "write-behind" cache. In this case you update the cache in Redis, and Redis pushes the update to the RDBMS using Gears.
You can find more information about this "write-behind" pattern with RedisGears in this GitHub repo.
---------            ---------            ---------
| Redis | =========> | Gears | =========> | RDBMS |
---------            ---------            ---------
I'm building a web-based CRON service using DynamoDB and Lambda. While I don't currently have the following problem, I'm curious about how I could solve it if it arises.
The architecture works like this:
Lambda A - query for all tasks that should occur in the current minute
Lambda A - for each task, increment a counter on the document
Lambda B - listen for the stream event for each document and run the actual CRON task
As far as I can tell, Lambda B should be scalable - AWS should run as many instances as needed to process all the stream events (I think).
But for Lambda A, say I have 1 billion documents that need to be processed each minute.
When I query for each minute's tasks, the Lambda will need to make multiple requests in order to fetch & update all the documents.
How could I architect the system such that all the documents get processed in < 60 seconds?
You're right, Lambda A would have to do a monster scan/query which wouldn't scale.
One way to architect this would be to partition your cron items so that you can invoke multiple lambdas in parallel (i.e. fan out the work) instead of just one (Lambda A), so that each one handles a partition (or set of partitions) instead of the whole thing.
How you achieve this depends on what your current primary key looks like and how else you expect to query these items. Here's one solution:
cronID | rangeKey | jobInfo | counter
1001 | 72_2020-05-05T13:58:00 | foo | 4
1002 | 99_2020-05-05T14:05:00 | bar | 42
1003 | 01_2020-05-05T14:05:00 | baz | 0
1004 | 13_2020-05-05T14:10:00 | blah | 2
1005 | 42_2020-05-05T13:25:00 | 42 | 99
I've added a random prefix (00-99) to the rangeKey, so you can have different lambdas query different sets of items in parallel based on that prefix.
In this example you could invoke 100 lambdas each minute (the "Lambda A" type), with each handling a single prefix set. Or you could have, say, 5 lambdas, with each handling a range of 20 prefixes. You could even dynamically scale the number of lambda invocations up and down depending on load, without having to update the prefixes in your table data.
Since these lambdas are basically the same, you could just invoke lambda A the required number of times, injecting the appropriate prefix(es) for each one as a config.
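A hedged sketch of that fan-out with the AWS SDK for .NET; the cron-worker function name and the payload shape are invented, and each asynchronous invocation receives the prefix range it is responsible for.

    using System.Text.Json;
    using System.Threading.Tasks;
    using Amazon.Lambda;
    using Amazon.Lambda.Model;

    class CronFanOut
    {
        static async Task FanOutAsync()
        {
            var lambda = new AmazonLambdaClient();
            const int workers = 5;                // 5 workers x 20 prefixes = 00..99
            const int prefixesPerWorker = 20;

            for (var w = 0; w < workers; w++)
            {
                var payload = JsonSerializer.Serialize(new
                {
                    prefixFrom = w * prefixesPerWorker,         // e.g. 0, 20, 40, ...
                    prefixTo = (w + 1) * prefixesPerWorker - 1  // e.g. 19, 39, 59, ...
                });

                // "Event" = asynchronous invoke, so all workers run in parallel.
                await lambda.InvokeAsync(new InvokeRequest
                {
                    FunctionName = "cron-worker",   // hypothetical "Lambda A" worker
                    InvocationType = InvocationType.Event,
                    Payload = payload
                });
            }
        }
    }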
EDIT
Re the 1MB page limit in your comment, you'll get a LastEvaluatedKey back if your query has been limited. Your lambda can execute queries in a loop, passing the LastEvaluatedKey value back as ExclusiveStartKey until you've got all the result pages.
You'll still need to be careful of running time (and of catching errors to retry, since this is not atomic), but fanning out your lambdas as above will deal with the running time if you fan out widely enough.
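A hedged sketch of that loop with the AWS SDK for .NET; the table name, key names and key condition are placeholders for whatever your real schema uses.

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    class CronQuery
    {
        static async Task<List<Dictionary<string, AttributeValue>>> GetAllPagesAsync(
            IAmazonDynamoDB db, string cronId)
        {
            var items = new List<Dictionary<string, AttributeValue>>();
            Dictionary<string, AttributeValue> lastKey = null;

            do
            {
                var response = await db.QueryAsync(new QueryRequest
                {
                    TableName = "cron-tasks",                       // hypothetical table
                    KeyConditionExpression = "cronID = :id",        // placeholder condition
                    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
                    {
                        [":id"] = new AttributeValue { S = cronId }
                    },
                    ExclusiveStartKey = lastKey                     // null on the first page
                });

                items.AddRange(response.Items);
                lastKey = response.LastEvaluatedKey;                // empty when no more pages
            } while (lastKey != null && lastKey.Count > 0);

            return items;
        }
    }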
I'm not sure about your project, but it looks like what you are asking about is already covered in the AWS DynamoDB documentation; read here:
When you create a new provisioned table in Amazon DynamoDB, you must
specify its provisioned throughput capacity. This is the amount of
read and write activity that the table can support. DynamoDB uses this
information to reserve sufficient system resources to meet your
throughput requirements.
You can create an on-demand mode table instead so that you don't have
to manage any capacity settings for servers, storage, or throughput.
DynamoDB instantly accommodates your workloads as they ramp up or down
to any previously reached traffic level. If a workload’s traffic level
hits a new peak, DynamoDB adapts rapidly to accommodate the workload.
For more information
You can optionally allow DynamoDB auto scaling to manage your table's
throughput capacity. However, you still must provide initial settings
for read and write capacity when you create the table. DynamoDB auto
scaling uses these initial settings as a starting point, and then
adjusts them dynamically in response to your application's
requirements
As your application data and access requirements change, you might
need to adjust your table's throughput settings. If you're using
DynamoDB auto scaling, the throughput settings are automatically
adjusted in response to actual workloads. You can also use the
UpdateTable operation to manually adjust your table's throughput
capacity. You might decide to do this if you need to bulk-load data
from an existing data store into your new DynamoDB table. You could
create the table with a large write throughput setting and then reduce
this setting after the bulk data load is complete.
You specify throughput requirements in terms of capacity units—the
amount of data your application needs to read or write per second. You
can modify these settings later, if needed, or enable DynamoDB auto
scaling to modify them automatically.
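For example, the bulk-load scenario described above could be handled roughly like this with the AWS SDK for .NET; the table name and capacity numbers are arbitrary.

    using System.Threading.Tasks;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    class ThroughputAdjuster
    {
        static async Task BulkLoadWindowAsync(IAmazonDynamoDB db)
        {
            // Raise write capacity before the bulk load...
            await db.UpdateTableAsync(new UpdateTableRequest
            {
                TableName = "my-table",
                ProvisionedThroughput = new ProvisionedThroughput
                {
                    ReadCapacityUnits = 5,
                    WriteCapacityUnits = 1000
                }
            });

            // ... run the bulk load here ...

            // ...then scale the writes back down afterwards.
            await db.UpdateTableAsync(new UpdateTableRequest
            {
                TableName = "my-table",
                ProvisionedThroughput = new ProvisionedThroughput
                {
                    ReadCapacityUnits = 5,
                    WriteCapacityUnits = 10
                }
            });
        }
    }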
I hope this helps clear up your doubt.
I am fairly new to Redis; so far I really like it. I started to wonder, however, whether it is better performance-wise to use a single query that returns one large object (storing information as JSON), or to use more, smaller queries that return smaller objects?
Redis is a single-threaded application. Queries are executed strictly one by one.
The answer depends on your needs and on the size of the query response. If you try to get a large number of keys in one query (with MULTI or with a Lua script) you may block your server from accepting new queries. A single query, on the other hand, keeps the total time as small as possible.
Each query consists of:
Parse the query.
Get the data.
Send it over the network.
For example:
---------------------------------------------------------------------------> time
         |              |                             |                  |
client   send query(Q)  |                             |                  got it(G)!
redis                   execute(E, server blocked)    send response(SR)
However, if you issue a lot of small queries, the total time to get the information back is longer:
---------------------------------------------------------------------------> time
         |    |    |    |           |    |    |    |
client   Q    |    |    G    ...    Q    |    |    G    ...
redis         E    SR   idle             E    SR   idle
The answer is (if you have a highly loaded system):
If you need to receive data for several tens of keys, prefer to fetch them in a few batched queries (see the sketch below).
If each of your keys holds a large amount of data (many kilobytes, for example), use many smaller queries.
Also, if you want to store JSON, consider always using some kind of serialization or compression (for example MessagePack or LZ compression) to minimize memory consumption.
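A hedged sketch of the two extremes with StackExchange.Redis (the key names are invented): batching keys into one multi-key GET keeps round trips down, while per-key GETs keep each individual reply small.

    using System.Linq;
    using StackExchange.Redis;

    class RedisFetch
    {
        static void Main()
        {
            var db = ConnectionMultiplexer.Connect("localhost:6379").GetDatabase();
            var keys = Enumerable.Range(0, 50).Select(i => (RedisKey)$"user:{i}").ToArray();

            // One round trip for many small values (MGET under the hood).
            RedisValue[] batched = db.StringGet(keys);

            // Many small commands; better when each value is large, so no single
            // reply blocks the server for long.
            foreach (var key in keys)
            {
                RedisValue one = db.StringGet(key);
                // ... process one value at a time ...
            }
        }
    }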
I'm pretty new to the field of big data and am currently stuck on a fundamental decision.
For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well (single data center, 4 nodes).
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. In the first instance, the log entries should be queried by different time ranges. As recommended, I started to "model my queries" using a wide-row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional details:
column keys are a composition of timestamp+uuid, to be unique and to avoid overwrites;
log entries for a specific time are stored close together on a node thanks to their identical partition key.
Thus log entries are stored in short time intervals per row, for example every log entry for 2015-01-01 01:05 with minute precision. Queries are not really performed as range queries with a < operator; rather, entries are selected as blocks of a specified minute.
Range-based queries return within a decent response time, which is fine for me.
Question:
In the next step we want to gain additional information through queries that are mainly focused on the IP fields, for example: select all entries that have IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.
So obviously the current model is pretty much unusable for additional IP-focused CQL queries. The problem is not finding a possible solution, but rather choosing among the various technologies that could provide one:
Try to solve the problem with a standalone C* solution (build a second model and maintain the same data in a different shape).
Choose an additional technology like Spark...
Switch to an HDFS/Hadoop - Cassandra/Hadoop solution...
and so on
With my lack of knowledge in this field, it is pretty hard to find the best way to take, especially with the feeling that using a cluster computing framework would be an excessive solution.
As I understood your question, your table schema looks like this:
create table logs (
  minute timestamp,
  id timeuuid,
  ips list<text>,
  message text,
  primary key (minute, id)
);
With this simple schema, you:
can fetch all logs for a specific minute.
can fetch short inter-minute ranges of log events.
want to query the dataset by IP.
From my point of view, there are multiple ways of implementing this idea:
create a secondary index on the IP addresses. But in C* you will lose the ability to also filter by timestamp: C* cannot merge primary and secondary indexes (the way MySQL/PostgreSQL can).
denormalize the data. Write your log events to two tables at once, the first optimized for timestamp queries (minute+ts as PK), the second for IP-based queries (IP+ts as PK). A sketch follows below.
use Spark for analytical queries. But Spark will need to perform a (full?) table scan (in a nifty distributed map-reduce way, but nevertheless a table scan) each time it extracts the data you've requested, so all your queries will take a long time to finish. This can cause problems if you plan to run a lot of low-latency queries.
use an external index like Elasticsearch for querying, and C* for storing the data.
In my opinion, the C* way of doing such things is to have a set of separate tables for different queries. It gives you the ability to perform blazing-fast queries, at the cost of increased storage.
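As a hedged sketch of option 2 (denormalization) with the DataStax C# driver: the same event is written to two tables, one per query pattern, so each query hits a single partition. Table and column names are illustrative only and the real schema will differ.

    using System;
    using System.Net;
    using Cassandra;

    class DenormalizedLogWriter
    {
        readonly ISession _session;
        readonly PreparedStatement _byMinute;
        readonly PreparedStatement _byIp;

        public DenormalizedLogWriter(ISession session)
        {
            _session = session;

            // Same events, two layouts: one keyed for time-range queries,
            // one keyed for IP lookups.
            session.Execute(@"CREATE TABLE IF NOT EXISTS logs_by_minute (
                minute timestamp, id timeuuid, ip1 inet, ip2 inet, message text,
                PRIMARY KEY (minute, id))");
            session.Execute(@"CREATE TABLE IF NOT EXISTS logs_by_ip (
                ip1 inet, ip2 inet, id timeuuid, minute timestamp, message text,
                PRIMARY KEY ((ip1, ip2), id))");

            _byMinute = session.Prepare(
                "INSERT INTO logs_by_minute (minute, id, ip1, ip2, message) VALUES (?, ?, ?, ?, ?)");
            _byIp = session.Prepare(
                "INSERT INTO logs_by_ip (ip1, ip2, id, minute, message) VALUES (?, ?, ?, ?, ?)");
        }

        // Every log event is written twice, once per table.
        public void Write(DateTimeOffset minute, IPAddress ip1, IPAddress ip2, string message)
        {
            var id = TimeUuid.NewId();
            _session.Execute(_byMinute.Bind(minute, id, ip1, ip2, message));
            _session.Execute(_byIp.Bind(ip1, ip2, id, minute, message));
        }
    }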
How can I calculate query time and see the query execution plan in Couchbase? Are there any utilities like Oracle's EXPLAIN PLAN and tkprof for Couchbase?
edit:
I am trying to see which database performs best for my data, so I am experimenting with MySQL, MongoDB and Couchbase. I have tried three different numbers of entries: 10k, 20k and 40k.
With MySQL I can see the query time using "set profiling = 1". With this setting I ran queries under three scenarios: 1) without indexing the primary key, 2) after indexing the primary key, 3) running the same query a second time (to see the effect of query caching).
Similarly, I ran the same tests with MongoDB and summarized my results in a table. I would like to run the same tests with Couchbase to see how well it performs. I tried searching the net but couldn't find anything I could follow to get similar results.
Below is the table I have (all times are in milliseconds). The second row of each entry, in parentheses, shows the query time for the second run.
Records Count      MySQL                      MongoDB              CouchBase
              ________________________    __________________     ____________
              Without     | With           Without | With         With Index
              Index       | Index          Index   | Index
10K           62.27325    | 8.537          3311    | 33
              (33.3135)   | (3.27825)      (7)     | (0)
20K           108.4075    | 23.238         132     | 39
              (80.90525)  | (4.576)        (17)    | (0)
40K           155.074     | 26.26725       48      | 10
              (110.42)    | (10.037)       (42)    | (0)
For Couchbase I would want to know both the performance when retrieving a document by its key (similar functionality to memcached) and the query time when using its views.
You have to understand that Couchbase works differently from RDBMSs such as Oracle. Couchbase offers two ways for you to retrieve your data:
1) Key lookup: you know the key(s) of the document(s) that you want to retrieve.
2) Define MapReduce jobs called views, which create indexes allowing you to query your data on attributes other than the key.
Couchbase documents are always consistent, but views are not: they are eventually consistent (although you have the ability to change this).
As the Couchbase documentation states:
Views are updated when the document data is persisted to disk. There is a delay between creating or updating the document, and the document being updated within the view.
So query time really depends on a variety of factors: can the view data be stale? How large is the data emitted by the index, and what are the current workload and DB size? Couchbase provides the following three values for the stale flag, which control how you want to access view data (see the sketch after the list); false means the index has to be updated before returning the result, so it can potentially be slow.
false : Force a view update before returning data
ok : Allow stale views
update_after : Allow stale view, update view after it has been accessed
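As a hedged illustration only: the exact API depends heavily on your SDK version (the names below follow the 2.x .NET SDK), and the bucket, design-document and view names are made up. It shows the two access paths and a simple way to time them.

    using System;
    using System.Diagnostics;
    using Couchbase;
    using Couchbase.Configuration.Client;
    using Couchbase.Views;

    class CouchbaseTiming
    {
        static void Main()
        {
            var cluster = new Cluster(new ClientConfiguration());
            var bucket = cluster.OpenBucket("default");
            var sw = Stopwatch.StartNew();

            // 1) Key lookup: memcached-style get by document key.
            var doc = bucket.Get<string>("user::1001");
            Console.WriteLine($"Key lookup took {sw.ElapsedMilliseconds} ms");

            // 2) View query: stale=false forces the index to catch up before
            //    returning, which is the slowest (but freshest) option.
            sw.Restart();
            var query = bucket.CreateQuery("users", "by_city")   // design doc, view name
                .Stale(StaleState.False)
                .Limit(10);
            var rows = bucket.Query<dynamic>(query);
            Console.WriteLine($"View query took {sw.ElapsedMilliseconds} ms");
        }
    }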
Please check out the official documentation for more in-depth answers: http://docs.couchbase.com/couchbase-manual-2.2/#views-and-indexes
You can also check out this interesting article on caching views: http://blog.couchbase.com/caching-queries-couchbase-high-performance
Currently in development at Couchbase is N1QL, effectively the Couchbase version of SQL. It will have an EXPLAIN statement available, but I believe it won't be released until late 2014.
A blog post introducing N1QL
http://blog.couchbase.com/n1ql-it-makes-cents
A cheat sheet for N1QL
http://www.couchbase.com/communities/sites/default/files/Couchbase-N1QL-CheatSheet.pdf
And where you can download the dev preview if you want to play with N1QL
http://www.couchbase.com/communities/n1ql
Also check out the cbstats tool, http://docs.couchbase.com/couchbase-manual-2.2/#cbstats-tool; it gives a high-level overview of persistence rates, updates, key misses, etc.