We have real-time data coming into our system, and online queries that we need to serve. In order to serve these queries faster, we do some pre-processing of the data.
Now my question is: how do I preprocess this real-time data? There should be a way for me to figure out whether the data has already been processed or not. To make that distinction, I have the following approaches:
I can have a flag that marks the data as processed or unprocessed, based on which I can decide whether or not to process it
I can have a column family where I insert the data with a TTL, plus a topic in a message bus like Kafka that gives me the row identifier in Cassandra, so that I can process that row in Cassandra (see the sketch after this list)
I can have a column family per day and a topic in a message bus like Kafka that gives me the row identifier within the corresponding column family
I can have a keyspace per day and a topic in a message bus like Kafka that gives me the row identifier within the corresponding column family
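To make option 2 concrete, this is roughly what I have in mind (keyspace, table, column names and the TTL value are just placeholders):

// hypothetical table holding rows that still need processing
CREATE TABLE realtime.pending_events (
    event_id text PRIMARY KEY,
    payload text
);

// the row expires on its own after 24 hours; the event_id would also be
// published to the Kafka topic so the consumer knows which row to process
INSERT INTO realtime.pending_events (event_id, payload)
VALUES ('event-0001', 'raw event data')
USING TTL 86400;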
I read somewhere that as the number of deletions increases, the number of tombstones increases, which results in slow query times. Now I am not sure which of the above four approaches to choose, or whether there is a better way to solve this.
According to the DataStax blog, the third option might be a better fit.
Cassandra Anti-patterns
Related
I was asked this question in an interview. The setup: assume we are getting millions of events, each with a timestamp and other details. The system design must let the end user query the most frequent records over the last 10 minutes, 9 hours, or maybe 3 months.
An event can be seen as follows:
event_type: {CRUD + Search}
event_info: xxx
timestamp : ts...
The easiest way to figure this out is to look at how other stream-processing or map-reduce libraries do it (and I have a feeling your interviewers have seen these libraries). It's basically real-time map-reduce (you can look up how that works as well).
I will outline two techniques for event processing. In reality most companies need to do both.
New school Stream processing (real time)
Let's assume for now they don't want the actual events but the more likely case of aggregates (I think that was the intent of your question).
An example stream-processing project is pipelinedb (there is a description of how it works at the bottom of their home page).
Events go into a queue/ring buffer.
A worker process reads those events in batches and rolls them up into partial buckets or windows.
Finally, there is a combiner or reducer which takes the micro-batches and actually does the updating. An example would be event counts. Because we are using a queue, events come in ordered, and depending on the queue we might be able to have multiple consumers doing the combining operation.
So if you want minute counts, you would do rollups per minute and store only the sum of the events for that minute. This turns out to be fairly small space-wise, so you can keep it in memory.
If you wanted those counts for a day, a month, or even a year, you would just add up all the minute-count buckets.
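As a rough sketch of what those buckets could look like (plain Postgres-flavored SQL with made-up names; a real stream processor would keep the minute buckets in memory):

-- per-minute rollup bucket: one row per minute and event type
CREATE TABLE minute_counts (
    bucket_minute timestamp NOT NULL,
    event_type    text      NOT NULL,
    event_count   bigint    NOT NULL,
    PRIMARY KEY (bucket_minute, event_type)
);

-- daily counts are just the sum of the minute buckets
SELECT date_trunc('day', bucket_minute) AS bucket_day,
       event_type,
       sum(event_count) AS event_count
FROM minute_counts
GROUP BY 1, 2;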
Now there is of course a major problem with this technique: you need to know a priori which aggregates and pivots you would like to collect.
But you get extremely fast lookup of results.
Old school data warehousing (partitioning) and Map Reduce (batch processing)
Now let's assume they do want the actual events for a certain time period. This is expensive, because if you store all the events in one place, lookup and retrieval are difficult. But if you use the fact that time is hierarchical, you can store the events in a tree of tuples.
The reason you would want the actual events is that you are doing ad hoc querying and are willing to wait for the queries to run.
You need some sort of queue for the stream of events.
A worker reads the queue and partitions the events based on time. For example, you would have a partition for a certain day. This is akin to sharding. Many storage systems have support for this (e.g. Postgres partitions; see the sketch after this list).
When you want the events for a given period, you union the relevant partitions.
The partitioning is essentially hierarchical (minutes < hours < days, etc.), which means you can do tree-like operations on them.
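A sketch of that partition step with Postgres 10+ declarative partitioning (table and partition names are illustrative):

-- parent table partitioned by event time
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    event_type text        NOT NULL,
    event_info text
) PARTITION BY RANGE (event_time);

-- one partition per day; queries constrained on event_time only touch the matching partitions
CREATE TABLE events_2017_01_01 PARTITION OF events
    FOR VALUES FROM ('2017-01-01') TO ('2017-01-02');
CREATE TABLE events_2017_01_02 PARTITION OF events
    FOR VALUES FROM ('2017-01-02') TO ('2017-01-03');

Unioning the day partitions for a query is then just a WHERE clause on event_time; the planner prunes the partitions outside the range.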
There are storage systems built specifically for such events (time-series data), where this kind of partitioning and indexing is automatic and fast. These are called TSDBs, which you can google for more info.
An example TSDB product would be influxdb.
Now, going back to the fact that time (or at least how humans represent it) is organized tree-like, we can perform parallelized operations. This is because a tree is a DAG (directed acyclic graph). With a DAG you can do some analysis and basically recursively operate on the branches (also known as fork/join).
An example of a generic parallel storage product would be citusdb.
Now of course this method has a massive drawback: it is expensive! Even if you make it fast by increasing the number of nodes, you will have to pay for those nodes (distributed shards). And in theory the performance should scale linearly, but in practice this does not happen (I will save you the details).
I think you will need to persist the data to disk, because:
the query duration is very open-ended, and data might be lost due to unforeseen circumstances like the process being killed, machine failure, etc.
you can't keep all the events in memory due to memory constraints (millions of events)
I would suggest using MySQL as the data store, with the timestamp as one of the index keys. But two events might have the same timestamp, so make a composite index out of an auto-increment id + timestamp.
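For example, an illustrative schema along those lines (the non-key columns are just a guess at your event fields):

-- auto-increment id keeps rows unique even when two events share a timestamp;
-- the secondary index leads with timestamp so the range queries below can use it
CREATE TABLE `events` (
    `id`         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `event_type` VARCHAR(32)     NOT NULL,
    `event_info` TEXT,
    `timestamp`  DATETIME        NOT NULL,
    PRIMARY KEY (`id`),
    KEY `idx_timestamp_id` (`timestamp`, `id`)
);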
Advantages of MySQL:
Super-reliable with replication
Supports all kinds of CRUD operations and queries
For each query, you basically select the range of timestamps you need.
First, count the number of events satisfying the query:
select count(*) from `events` where timestamp >= x and timestamp <= y;
If too many events satisfy the query, query them in batches.
select * from `events` where timestamp >= x and timestamp <= y limit 1000 offset 0;
select * from `events` where timestamp >= x and timestamp <= y limit 1000 offset 1000;
and so on, as long as the offset is less than the count of events returned by the first query.
I am quite new to "Big Data" technologies, especially Cassandra, so I need your advice for the task I have to do. I have been looking at the Datastax examples about handling time series, and at different discussions here about this topic, but if you think I might have missed something, feel free to tell me.
Here is my problem.
I need to store and analyze data coming from about 100 sensor stations that we are testing. In each sensor station, we have several thousand sensors. For each station, we run several tests (about 10, each one lasting about 2h30), during which the sensors record information every millisecond (values can be boolean, integer or float). The records of each test are kept on the station during the test, then they are sent to me once the test is completed. That means about 10 GB for each test (each parameter represents about 1 MB of information).
Here is a schema to illustrate the hierarchy:
Hierarchy description
Right now, I have access to a small Hadoop cluster with Spark and Cassandra for testing. I may be able to install other tools, but I would really appreciate being able to keep working with Spark/Cassandra.
My question is: what could be the best data model for storing then analyzing the information coming from these sensors?
By “analyzing”, I mean:
find the min, max, and average value of a specific parameter recorded by a specific sensor on a specific station; or find those values for a specific parameter across all stations; or find those values for a specific parameter only when one or two other parameters of the same station are above a given limit
plot the evolution of one or more parameters to compare them visually (the same parameter on different stations, or different parameters on the same station)
do some correlation analysis between parameters or stations (e.g. to find out whether a sensor is not working).
I was thinking of putting all the information in a Cassandra Table with the following data model:
CREATE TABLE data_stations (
station text, // station ID
test int, // test ID
parameter text, // name of recorded parameter/sensor
tps timestamp, // timestamp
val float, // measured value
PRIMARY KEY ((station, test, parameter), tps)
);
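For instance, the kind of read I have in mind against this model would look like this (identifiers and dates are made up):

// all values of one parameter on one station/test over a 5-minute window
SELECT tps, val
FROM data_stations
WHERE station = 'station_01'
  AND test = 3
  AND parameter = 'temperature'
  AND tps >= '2016-06-01 10:00:00'
  AND tps <  '2016-06-01 10:05:00';

// and, if the Cassandra version has native aggregates (2.2+), maybe even
SELECT min(val), max(val), avg(val)
FROM data_stations
WHERE station = 'station_01' AND test = 3 AND parameter = 'temperature';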
However, I don’t know if one table would be able to handle all the data: a quick calculation gives 10^14 rows for the preceding data model (100 stations × 10 tests × 10,000 parameters × 9,000,000 ms (2h30 in milliseconds) ≈ 10^14), even if each partition is “only” 9,000,000 rows.
Other ideas were to split the data into different tables (e.g. one table per station, or one table per test per station, etc.). I don’t know what to choose or how, so any advice is welcome!
Thank you very much for your time and help, if you need more information or details I would be glad to tell you more.
Piar
You are on the right track; Cassandra can handle such data. You may store all the data you want in column families and use Apache Spark over Cassandra to do the required aggregations.
I feel Apache Spark is a good fit for your use case, as it could be used both for aggregations and for calculating correlations.
You may also check out Apache Hive, as it can work with/query data in HDFS directly (through external tables).
Check these:
Cassandra - Max. size of wide rows?
Limitations of Cassandra
I am trying to improve the performance of my database, whose simplified set-up is the following:
EDIT
One table with 3 columns (id_device, timestamp, data) and a composite btree index on (id_device, timestamp)
1k devices sending data every minute
The inserts are quite fast, since PostgreSQL merely writes the rows in the order they are received. However, when trying to fetch many rows with consecutive timestamps for a given device, the query is not so fast. The way I understand it, due to the way the data is collected, there is rarely more than one row for a given device on each page of the table. Therefore, if I want to get 10k rows with consecutive timestamps for a given device, PostgreSQL has to fetch 10k pages from disk. Besides, since this operation can be done for any of the 1k devices, those pages are not going to be kept in RAM.
I have tried to CLUSTER the table, and it indeed solves the performance issue, but this operation is incredibly long (~1 day) and it locks the entire table, so I discarded this solution.
I have read about partitioning, but that would mean a lot of scripting if I need to add a new table every time a new device is connected, and it seems to me a bit bug-prone.
I am rather confident that this set-up is not particularly original, so is there any advice I could use?
Thanks for reading,
Guillaume
I'm guessing your index also has low selectivity, because you're indexing device_id first (of which there are only 1000 distinct values) and not timestamp first.
It depends on what you do with the data you fetch, but maybe the solution could be batching the operation, such as fetching the data for a predetermined period and processing it for all 1000 devices in one go.
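A minimal sketch of that idea (the table name is a placeholder for whatever your table is called):

-- fetch one hour of data for all devices in a single pass,
-- instead of one random-I/O-heavy query per device
SELECT id_device, timestamp, data
FROM measurements
WHERE timestamp >= '2016-01-01 00:00'
  AND timestamp <  '2016-01-01 01:00'
ORDER BY id_device, timestamp;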
Currently, I have a Cassandra column family with large rows of data, say more than 100,000. Now I'd like to remove all the data in this column family, and the following problem came up:
After all the data is removed and I execute a lookup query on this column family, Cassandra takes tens of seconds to return an empty result. And the time cost increases linearly with the size of the original data.
It is caused by the tombstones created while deleting data from the Cassandra database. The lookup speed won't recover to normal until the next GC is fired. See Cassandra Distributed Deletes.
Because such query operations are used frequently in my system, I cannot bear the huge latency of up to a few seconds.
Would you please give me a solution to this problem?
This sounds like a very bad way to use a database: populate it, empty it, repeat. One way you can solve your problem is by using a different CF name each time: when you empty the data and start repopulating it, create a new column family, use that one, and just drop the old column family. However, this is hacky.
I'd suggest using compaction (which gets rid of all the tombstones it can detect) to solve your problem. It is CPU-intensive, but it's better than waiting tens of seconds for queries to respond. You can make the task less intensive on your machine by specifying the particular keyspace and column family you want to compact:
./nodetool compact <ks_name> <cf_name>
Ritchard's point is a good one: gc_grace_seconds is set to 10 days by default, so you will probably have to tweak this to allow compaction to get rid of the tombstones.
#Fify
If your column family is frequently modified (read, then update, then read the update again...), you should use the leveled compaction strategy.
To make deleted columns get removed more quickly, change the gc_grace_seconds property of your column family (a sketch follows below).
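For example (keyspace and column family names are placeholders):

ALTER TABLE my_keyspace.my_cf
  WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
  AND gc_grace_seconds = 3600;  // tombstones become eligible for removal after 1 hour

Keep in mind that gc_grace_seconds also bounds how long you have to run repairs before a missed delete can resurface, so don't lower it blindly.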
I'm getting around 1000 distinct events per second (4-node cluster). After each event I need to increment some counters. My question is: is it better to have a normal column family with only one column, where all the counters are stored as a comma-separated string (example: "1,3,5,6,0,2"), or is it better to create a counter column family with multiple columns? I read in some documentation that a counter column family can do reads and writes with consistency level ONE, which is fast for reading. I don't really care much about write performance.
I think this depends on how you are receiving events and on your latency requirements.
If you are receiving them from multiple sources concurrently and need to write the data as soon as possible, counters would seem to be the better approach. With one big column, you would need to serialize all writes to that column as well as read the current value first. This could also unnecessarily complicate your application code. If performance is a problem, you could try to enable the row cache for your counter column family. I have never tried to cache a counter column family, but I don't see any docs saying it is not supported. You can try it and check the JMX stats to see if it's working.
If you are receiving events single-threaded and can do something like read the data for 1000 events and then write once to Cassandra while keeping the current counter values in memory, then a single column might be fine. But you need to realize that if you only need to read a few counter values at a time, you'll be fetching a lot of unnecessary data on every read. Unless you do some tests showing that one column performs significantly better, I would favor counters.
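For reference, a minimal counter column family and the matching increment in CQL (names are illustrative):

// counter tables may only contain counter columns besides the primary key
CREATE TABLE event_counters (
    event_type text PRIMARY KEY,
    hits counter
);

// run after each event; note that counter updates are not idempotent under retries
UPDATE event_counters SET hits = hits + 1 WHERE event_type = 'search';

// reads can fetch just the counters you need
SELECT hits FROM event_counters WHERE event_type = 'search';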