Which time series database supports these specific requirements? - elasticsearch

We have a database with more than a billion daily statistical records. Each record has multiple metrics (m1 through m10), and several immutable tags.
Record can also be associated with zero or more groups. The idea was to use multiple tags (e.g. g1, g2) to indicate the belonging of specific record to specific group.
Our data is stored on daily level, and most time-series databases are really optimized for more granular data. This represents a problem when we want to produce monthly, or quarterly graphs (e.g. InfluxDB have maximum aggregation period of 7d). We need a database that is really optimized for day-level data points and can produce quick aggregations on month/quarter/year level.
Furthermore, the relationship between records and groups is mutable. We need the database to support the batch update of records (pseudo: ADD TAG group1 TO records WHERE record_id: 101), or at least fast deletion/reinserting of updated data. This operation should be relatively fast.
We need something that can produce near-real-time results when aggregating data across tens of millions (filtered) records.
Our original solution is based on elasticsearch and it works quite well, but wanted to explore alternatives in time-series databases niche. Can anyone recommend a time-series database that supports these features?

Try ClickHouse. It is optimized for real-time processing and querying big amounts of data. We successfully used it to store hundreds of billions of records per day on a 15-node cluster. ClickHouse is able to scan billions of records per second per CPU core and its query performance scales linearly with the number of available CPU cores.
ClickHouse also supports infrequent data updates, so you can update groups for particular rows.
If you want more tradituonal TSDB, then take a look at VictoriaMetrics. It is built on architecture ideas from ClickHouse, so it is fast and provides good on-disk data compression.

Related

Maximum size of database for DQ mode in Power BI

I am using a database worth of 500 GBs. I want to visualize different columns to study the relationship between them using Power BI. However, there are performance issues while loading graphs.
I am using in DQ mode.
Its annoying to wait for 10 minutes for each visual to load.
Could anyone tell me if its a good idea to use Power BI for visualisation/making dashboard out of 500GBs of data?
What is the maximum limit of database we can use in DQ mode to create visuals efficiently?
DQ doesn't have a defined limit, MS have shown demos using a Petabyte database in this case for long running queries on a database, you have a few options.
Understand what queries are being run, and optimise your indexing strategy, maybe for example add a covering index
Optimise your data source, by using a column store index to move it in memory
Create database or table(s) with a the necessary subset of data from your main data.
Examine what objects are being used, and remove nested logic, views on top of views etc, with scalar conditions etc
The petabyte example by MS also used aggregation mode (Mentioned by WB in their answer) to store a subset of the data
I have used Direct Query to sit over data sources that have been around the 200GB range, however these have been mostly standard Star Schema data warehouses, or a defined reporting table, both which had the relevant indexes, covering indexes, or Column Store Indexes to allow more efficient retrieval of data. Direct Query Mode will slow down due to the number of query's that it has the do on the data source based on the measure, relationships and the connection overhead. Another can be the number of visuals on page, as each visual is a query and each one has to run on the data source.
You might want to look at aggregates in Power BI. You can basically import aggregate tables to Power BI that would satisfy needs for most of your visuals and resort to Direct Query for details that you might rarely need. When properly configured, aggregations will be cached and visuals that hit the aggregation will make use of that while those that don't will seamlessly query the DQ source.
Also, VertiPaq engine with its columnar store is quite efficient at compresses data. So given some smart modelling (get rid of unneeded high cardinality columns), you might actually end up with a much smaller model than your original data for all import.
Your mileage may vary.
As to the dataset limit itself, I believe it's 1GB/dataset when uploading to the service.

cassandra vs elastic search vs any other design suggestions

We have a need to run analytics queries on the data stored in rds. And that's becoming very very slow because of group by queries and ever increasing size of the tables.
For example we have following 3 tables in RDS :
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
each of the tables have very high amount of data and are being updated several times a minute as the new data comes in.
Now we want to run aggregation queries like :
select name from alm, group, con where alm.group_id=group.id and alm.con_id=con.id group by name, group.type, con.ip
We also want users to run custom aggregation queries in the future as opposed to the fix query provided by us in future.
So far the options we are considering are moving to either Cassandra, Elasticsearch or Dynamo db so that aggregation would be faster. Can someone guide as to how to go about this problem ? Or any crumbs of experience ? Anybody know any technologies have severe advantage over others ?
Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.
Elastic search is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data but it can't answer queries that require complex cross index operations. It is also not as robust (rebuilding indexes may be needed from time to time)
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently
if you need online analytics, the most common approach is to pre-compute the aggregations and update as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query
Don't be afraid to mix and match though. Relational databases have their purpose as do non-relationals. There is no silver bullet though.
One more options is Column-oriented databases, this kind of DB is more suitable for 'analytics' cases when you have many data fields and you want to perform aggregations or extract some subset of fields for big amount of data.
Recently Yandex ClickHouse becomes very popular and there is Column Oriented service from Amazon - Redshift. Also there are several other solutions
Store in parquet and use spark, partition efficiently

Reasons against using Elasticsearch as an OLAP cube

At first glance, it seems that with Elasticsearch as a backend it is easy and fast to build reports with pivot-like functionality as used in traditional business intelligence environments.
By "pivot-like" I mean that in SQL-terms, data is grouped by one to two dimensions, filtered, ordered by one or two dimensions and aggregated by several metrics e.g. with sum or count.
By "easy" I mean that with a sufficiently large cluster, no pre-aggregation of the data is required, which saves ETLs and data engineering time.
By "fast" I mean that due to Elasticsearch's near real time capability report latency can be reduced in many instances, when compared to traditional business intelligence systems.
Are there any reasons, not to use Elasticsearch for the above purpose?
ElasticSearch is a great alternative to a cube, we use it for that same purpose today. One huge benefit is that with a cube you need to know what dimensions you want to create reports on. With ES you just shove in more and more data and figure out later how you want to report on it.
At our company we regularly have data go through the following life cycle.
record is written to SQL
primary key from SQL is written to RabbitMQ
we respond back to the customer very quickly
When Rabbit has time, it uses the primary key to gather up all the data we want to report on
That data is written to ElasticSearch
A word of advice: If you think you might want to report on it, get it from the beginning. Inserting 1M rows into ES is very easy, updating 1M rows is a bigger pain.

How to achieve Data Sharding in Endeca (data partitioning)

Currently Oracle Commerce Guided Search (Endeca) supports only language specific partitions (i.e., One MDEX per Language). For systems with huge data volume base (say ~100 million records of ~200 stores), does anyone successfully implemented data partitioning (sharding) based on logical group of data (i.e., One MDEX per group-of-stores) so that the large set of data can be divided into smaller sets of data?
If so, what precautions to be taken while indexing data and strategies for querying the Assembler?
Don't think this is possible. Endeca used to support the Adgidx which allowed you to split or shard the mdex but that is no longer supported. Oracles justification for removing this is that with multithreading and multi-core processors it is no longer necessary. Apache Solr, however, supports sharing
The large set of data can be broken into smaller sets, where each set would be attributed to a property, say record.type, which would identify the different sets. So, basically we are normalizing the records in the Endeca index.
Now, while querying endeca, we can use the concept of record relationship navigation queries, using record-record relationships by applying a relationship filter, to bring back records of different types.
However, you might have to obtain a RRN license to enable the RRN feature in the mdex engine.

Doing analytical queries on large dynamic sets of data

I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally - there should be a way to correlate individual data units to an aggregated results i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed ; but this is expected).
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases, and would like to hear your suggestions. I have taken a look at Hive/Pig which fit the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files when compared to an indexed database (sql or nosql).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data which has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net
There is also Shard-Query which can be used in combination with InnoDB and MySQL partitioning, as well as supporting spreading data over many machines. This will satisfy both high update rates and will provide query parallelism for fast aggregation.
Of course, you can combine the two together.

Resources