Prometheus metric and label naming for high cardinality dimensions - metrics

I want to report a metric for each item that is viewed in our system. We have tens of millions of items.
In the Prometheus documentation it warns not to label high cardinality metrics
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
So in this case what is the best practice?

The practice is to use a logs-based system such as the ELK stack, rather than a metrics-based system like Prometheus or Graphite.

Related

methods to calculate a representative index value using multiple different types of metrics

I am trying to create a index metric of multiple metric to show a representative score for a specific category.
example:
In the datacenter, there are multiple components such as compute, storage, power, space, network, firewalls
each of these have multiple NODES and associated metrics like usage, performance, health, number of xxx processed etc. and these metrics are over time.
Another example:
Developers have multiple activities, like checkin, number of line modified, bug fixed per checkin, bugs, failed regression
Is there a mathematical model that can take these metrics & some additional configurations like importance, relevance and create a single score that I can track over time - and then if I wish to double click to understand blips on the graph, I can look at it by looking at the individual metrics ?
We have ELK stack collecting all this datacenter data and other ITSMC tools that have support metrics, SLA etc.
are there any applications that do this ? sort of like the tracker of S&P500 or FINTECH stocks etc. like the index fund that tracks multiple types of stocks based on invested $$.
thanks

Which time series database supports these specific requirements?

We have a database with more than a billion daily statistical records. Each record has multiple metrics (m1 through m10), and several immutable tags.
Record can also be associated with zero or more groups. The idea was to use multiple tags (e.g. g1, g2) to indicate the belonging of specific record to specific group.
Our data is stored on daily level, and most time-series databases are really optimized for more granular data. This represents a problem when we want to produce monthly, or quarterly graphs (e.g. InfluxDB have maximum aggregation period of 7d). We need a database that is really optimized for day-level data points and can produce quick aggregations on month/quarter/year level.
Furthermore, the relationship between records and groups is mutable. We need the database to support the batch update of records (pseudo: ADD TAG group1 TO records WHERE record_id: 101), or at least fast deletion/reinserting of updated data. This operation should be relatively fast.
We need something that can produce near-real-time results when aggregating data across tens of millions (filtered) records.
Our original solution is based on elasticsearch and it works quite well, but wanted to explore alternatives in time-series databases niche. Can anyone recommend a time-series database that supports these features?
Try ClickHouse. It is optimized for real-time processing and querying big amounts of data. We successfully used it to store hundreds of billions of records per day on a 15-node cluster. ClickHouse is able to scan billions of records per second per CPU core and its query performance scales linearly with the number of available CPU cores.
ClickHouse also supports infrequent data updates, so you can update groups for particular rows.
If you want more tradituonal TSDB, then take a look at VictoriaMetrics. It is built on architecture ideas from ClickHouse, so it is fast and provides good on-disk data compression.

How to design a system in which we can query top results in last n hours

I was asked this question in an interview. The details were that assume we are getting millions of events. Each event has a timestamp and other details. The systems design requires ability to enable end user to query most frequent records in last 10 minutes or 9 hours or may be 3 months.
Event can be seen as following
event_type: {CRUD + Search}
event_info: xxx
timestamp : ts...
The easiest way to to figure out this is to look at how other stream processing or map reduce libraries do this (and I have feeling your interviewers have seen these libraries). Its basically real time map reduce (you can lookup how that works as well).
I will outline two techniques for event processing. In reality most companies need to do both.
New school Stream processing (real time)
Lets assume for now they don't want the actual events but the more likely case of aggregates (I think that was the intent of your question)
An example stream processing project is pipelinedb (they have how it works on the bottom of their home page).
Events go into use a queue/ring buffer
A worker process reads those events in batches and rolls them up into partial buckets or window.
Finally there is combiner or reducer which takes the micro batches and actually does the updating. An example would be event counts. Because we are using a queue from above events come in ordered and depending on the queue we might be able to have multiple consumers that do the combing operation.
So if you want minute counts you would do rollups per minute and only store the sum of the events for that minute. This turns out to be fairly small space wise so you can store this in memory.
If you wanted those counts for month or day or even year you would just add up all the minute count buckets.
Now there is of course a major problem with this technique. You need to know what aggregates and pivots you would like to collect a priori.
But you get extremely fast look up of results.
Old school data warehousing (partitioning) and Map Reduce (batch processed)
Now lets assume they do want the actual events for a certain time period. This is expensive because if you store all the events in one place the lookup and retrieval is difficult. But if you use the fact that time is hierarchal you can store the events in a tree of tuples.
Reasons you would want the actual events is because you are doing adhoc querying and are willing to wait for the queries to perform.
You need some sort of queue for the stream of events.
A worker reads the queue and partitions the events based on time. For example you would have a partition for a certain day. This is akin to sharding. Many storage systems have support for this (e.g postgres partitions).
When you want a certain number of events over a period you union the partitions.
The partitioning is essentially hierarchal (minutes < hours < days etc) which means you can do tree like operations on them.
There are certain ways to store such events which is called time series data such that the partitioning index is automatic and fast. These are called TSDBs of which you can google for more info.
An example TSDB product would be influxdb.
Now going back to the fact that time (or at least how humans represent it) is organized tree like we can we can preform parallelization operations. This because a tree is DAG (directed acyclic graph). With a DAG you can do some analysis and basically recursively operate on the branches (also known as fork/join).
An example generic parallel storage product would citusdb.
Now of course this method has a massive draw back. It is expensive! Even if you make it fast by increasing the number of nodes you will have to pay for those nodes (distributed shards). An in theory the performance should scale linearly but in practice this does not happen (I will save you the details).
I think you will need to persist the data to the disk as
the query duration is super vague, and data might be loss due to some unforeseen circumstances like process killed, machine failure etc.
you can't keep all the events in memory due to memory
constraints(millions of events)
I would suggest using mysql as the data store with taking timestamp as one of the index key. But two events might have same timestamp. So make a composite index key with auto-increment id + timestamp.
Advantages of Mysql:
Super-reliable with replication
Support all kinds of CRUD operations and queries
On each query you can basically get the range of the timestamps as per your need.
First count the no. of events satisfying the query.
select count(*) from `events` where timestamp >= x and timestamp <=y.
If too many events satisfy the query, query them in batches.
select * from 'events' where timestamp >= x and timestamp <=y limit 1000 offset 0;
select * from 'events' where timestamp >= x and timestamp <=y limit 1000 offset 1000;
and so on.. till offset <= count of events matching the first query.

Google datastore - index a date created field without having a hotspot

I am using Google Datastore and will need to query it to retrieve some entities. These entities will need to be sorted by newest to oldest. My first thought was to have a date_created property which contains a timestamp. I would then index this field and sort on this field. The problem with this approach is it will cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously sorting data on dates is properly the most common sorting performed on a database. If I can't index timestamps, is there another way I can accomplish being able to sort my queires from newest to oldest without hotspots?
As you note, indexing monotonically changed values doesn't scale and can lead to hotspots. Whether you are potentially impacted by this depends on your particular usage.
As a general rule, the hotspotting point of this pattern is 500 writes per second. If you know you're definitely going to stay under that you probably don't need to worry.
If you do need higher than 500 writes per second, but have a upper limit in mind, you could attempt a sharded approach. Basically, if you upper on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) at the start. This creates n random key ranges that each can perform up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client side merging of the result streams.

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seems to be some documents which point to 'rolling indexes' ref1 ref2 ref3 whereby I may create a single index (ea. index named tweets1 -> N) on-the-fly. When one index fills up, I can simply add a new machine, with a new index, and add it to the same cluster and alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?
A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard to get your query performance, then decide how many shards per index you want to have and then multiply those numbers and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster for parallelizing the query that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then elasticsearch can perform id based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >=1.4.0) it will use predictable ids (flake ids) already. This also speeds up indexing. Random ids create a worst case scenario.
If your queries are going to be time based then it will have to search the entire set of indices for each query under this scheme. For time based queries you want to roll your indices over based on some amount of time (e.g. each day or month depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range.

Resources