With Elasticsearch I know I can do some nice time series data queries and get mean/max etc
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html
Is it possible, though, to only include the 90th percentile in that calculation, and in Kibana in particular?
Any thoughts on how this could be done?
Elasticsearch doesn't currently support percentiles (including median).
Percentiles are much harder to compute than simple statistics in a distributed environment. Let's assume you have 2 shards. If you ask both of them for the sum of their values and the number of values, you can compute the global average: ($sum1 + $sum2) / ($value_count1 + $value_count2).
On the other hand, if you want to compute the median, the only way to compute it accurately is to get all values from both shards, sort them and take the median. This would require lots of memory and network bandwidth.
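A tiny Python sketch of that difference (the shard contents are made-up numbers): the mean can be merged from per-shard partial results, while an exact median needs every value in one place.

    # Two hypothetical shards holding values of a numeric field.
    shard1 = [3.0, 9.0, 1.0, 7.0]
    shard2 = [4.0, 8.0, 2.0]

    # Mean: each shard only ships (sum, count), which is cheap to merge.
    sum1, count1 = sum(shard1), len(shard1)
    sum2, count2 = sum(shard2), len(shard2)
    mean = (sum1 + sum2) / (count1 + count2)

    # Exact median: all values must be pulled to one node and sorted.
    values = sorted(shard1 + shard2)
    mid = len(values) // 2
    median = values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

    print(mean, median)  # 4.857..., 4.0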
Fortunately, there are algorithms that can compute good approximations of percentiles with limited memory usage. We are in particular looking into t-digest, so it is quite likely that (approximate) percentiles will be supported in a future release of Elasticsearch.
Related
We have a cluster which stores 1.5M records, with a total size of 3.5GB. Every 30 minutes around 2-5K records get updated or created. Up until now, after mass indexing the pre-existing data, we were force merging to bring the number of segments down from 30-35 to 1, which greatly improves search performance. After a few days the number of segments normally rises and levels out at about 7 or 8, and performance is still OK.
The issue is that we plan to scale our data to around 80GB. My concern is that if we force merge after the initial mass index, the resulting segment will be larger than 5GB, at which point Elasticsearch will no longer consider it for automatic merging, and performance will decrease. Without force merge, though, I believe the number of segments will be too high.
Is there a way to make Elasticsearch merge more aggressively, without calling the force merge API? We have no users in the evenings or on weekends, so ideally we could mass index and then give it all weekend to merge the segments down to a lower number, with no concern for search performance during that time.
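For reference, the force merge call described in the question is a single HTTP request against the force merge API; a minimal sketch with Python's requests (the host and the index name my-index are placeholders):

    import requests

    ES = "http://localhost:9200"  # placeholder host

    # Merge the index down to a single segment, as described above.
    resp = requests.post(f"{ES}/my-index/_forcemerge", params={"max_num_segments": 1})
    print(resp.json())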
I am currently playing around with Elasticsearch (ES). We are ingesting sensor data, and over 3 years we have accumulated approximately 1,000,000,000 documents in one index, making the index about 50GB in size. Indexing performance is not that important, as new data only arrives every 15 minutes per sensor on average, so I want to focus on search and aggregation performance. We are running a front end showing basically a dashboard of average values from last week compared to one year before, etc.
I am using ES on AWS and after performance on one machine was quite slow, I spun up a cluster with 3 data nodes (each 2 cores, 8 GB mem), and gave the index 3 primary shards and one replica. Throwing computing power at the data certainly improved the situation and more power would help more, but my question is:
Would splitting the index, for example by month, increase performance? Or to be more specific: is querying (especially by date) a smaller index faster if I adjust the queries adequately, or does ES already 'know' where to find specific dates in a shard?
(I know about other benefits of having smaller indices, like being able to roll over and keep only a specific time interval, etc.)
1/ Elasticsearch only knows where to find a specific date in an index if your index is sorted by your date field. You can check the documentation on index sorting.
In your use case, it can improve search performance drastically. And since all the data will be added at the "end of the index" because it is date sorted, you should not see much indexing overhead.
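A sketch of creating such a date-sorted index over the REST API (7.x-style mapping; the host, the index name sensor-data and the field names are assumptions). Note that index sorting can only be defined at index creation time.

    import requests

    ES = "http://localhost:9200"  # placeholder host

    body = {
        "settings": {
            "index": {
                "sort.field": "timestamp",  # assumed date field name
                "sort.order": "desc"
            }
        },
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                "sensor_id": {"type": "keyword"},
                "value": {"type": "double"}
            }
        }
    }
    resp = requests.put(f"{ES}/sensor-data", json=body)  # hypothetical index name
    print(resp.json())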
2/ Without an index sort, smaller time-bounded indices will work better (even if you target all your indices), since this often allows rewriting your range query into a match_all / match_none internal query.
For more information about this behavior you should read this blog post:
Instant Aggregations: Rewriting Queries for Fun and Profit
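To illustrate point 2, a dashboard-style aggregation over monthly indices could look like the sketch below (the sensor-* index pattern, field names and host are assumptions); indices whose data lies entirely outside the date range can then be rewritten to match_none internally and skipped almost for free.

    import requests

    ES = "http://localhost:9200"  # placeholder host

    query = {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": "now-7d/d", "lt": "now/d"}}},
        "aggs": {"avg_value": {"avg": {"field": "value"}}},
    }
    # "sensor-*" matches all monthly indices, e.g. sensor-2018-01, sensor-2018-02, ...
    resp = requests.post(f"{ES}/sensor-*/_search", json=query)
    print(resp.json()["aggregations"]["avg_value"])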
What will work faster: one big ZSET with geodata that I query with a 100 m radius using GEORADIUS,
OR
a lot of ZSETs, where each ZSET is responsible for a 100 m x 100 m square covering the whole world and is named after that square, like:
left_corner1_49_2440000_28_5010000
left_corner2_49_2450000_28_5010000
.......
with each set holding everything within the 100 meters to the right and below its corner.
So when searching for the nearest point, I'll just drop the redundant digits from the GPS coordinates: 49.2440408, 28.5011694 becomes 49.2440000, 28.5010000, and this way I'll know the name of the ZSET from which to get all the exact values with 100-meter precision.
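A minimal Python sketch of that bucketing (the cell_ key prefix and the 0.001-degree cell size are illustrative choices, not anything Redis-specific):

    def bucket_key(lat: float, lon: float, cell: float = 0.001) -> str:
        """Truncate coordinates to a ~100 m grid cell and build the ZSET name."""
        lat_t = f"{int(lat / cell) * cell:.7f}".replace(".", "_")
        lon_t = f"{int(lon / cell) * cell:.7f}".replace(".", "_")
        return f"cell_{lat_t}_{lon_t}"

    print(bucket_key(49.2440408, 28.5011694))  # -> cell_49_2440000_28_5010000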
Or, to put the question in a more general form: how are ZSET names stored and accessed in Redis? If I have too many ZSETs, will it impact performance when accessing them?
A precise comparison of these approaches could only be done via a benchmark, and it would be specific to your dataset and configuration. But architecturally speaking, the pros and cons are:
BIG ZSET: less bandwidth and fewer operations (CPU cycles) taken to execute, no problems at the borders (with many ZSETs you can get duplicates there), and you can get throughput with sharding (see the sketch below);
MANY ZSETS: lower latency for other operations (while a command on the big ZSET is running, other commands have to wait), and you can get throughput with sharding AND lower latency with clustering.
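For comparison, the single big ZSET option is only a couple of commands; a sketch with redis-py, using execute_command to sidestep differences in the geo method signatures between client versions (the key and member names are placeholders):

    import redis

    r = redis.Redis()  # assumes a local Redis instance

    # One global geo index (a ZSET under the hood): GEOADD key lon lat member
    r.execute_command("GEOADD", "points", 28.5011694, 49.2440408, "point:1")
    r.execute_command("GEOADD", "points", 28.5050000, 49.2470000, "point:2")

    # All members within 100 m of the query coordinate.
    nearby = r.execute_command("GEORADIUS", "points", 28.5011694, 49.2440408, 100, "m")
    print(nearby)  # [b'point:1']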
As for the bottom-line question: I have not seen your implementation code, but set names are just keys, the same as any other keys you use. This is what the Redis FAQ says about the number of keys:
What is the maximum number of keys a single Redis instance can hold? <...>
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
UPDATE:
Look at what the Redis docs say about GEORADIUS:
Time complexity: O(N+log(M)) where N is the number of elements inside
the bounding box of the circular area delimited by center and radius
and M is the number of items inside the index.
It means that items outside of your query radius add only an O(log(M)) cost to your query: roughly 17 hops for 10M items or 21 hops for 1B items, which is quite affordable. The remaining question is whether you will partition the data between nodes.
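A quick back-of-the-envelope check of those hop counts (the numbers above line up if log is read as the natural logarithm; this is my reading, not something stated in the Redis docs):

    import math

    for m in (10_000_000, 1_000_000_000):
        print(m, math.ceil(math.log(m)))  # ~17 for 10M items, ~21 for 1B items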
What is the maximum number of points that can be written to influxdb (single node) per second? Is it feasible to scale influxdb without going for the paid cluster? And should I consider elasticsearch instead of influxdb for time series data (~3000 bytes/sec/user) if I am expecting around 60 concurrent users?
Depends on hardware.
Limiting factors are:
Cardinality of series in the DB (total unique series)
WAL disk throughput (this could be put on tmpfs if you don't have SSD)
Data disk throughput (use SSD for best results)
RAM (more is better)
CPU for ingestion, indexing and queries
How far a single node can go largely depends on these and on the workload.
For write-heavy workloads of low cardinality, CPU generally tends to run out faster than anything else, assuming SSDs are used and disk I/O has been optimised accordingly.
After that, cardinality is the biggest limiting factor. Schema design plays a huge role, much bigger than number of nodes.
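To make the cardinality point concrete, here is a small plain-Python sketch that builds InfluxDB line-protocol points and counts the resulting unique series (measurement plus tag set); the measurement, tag and field names are made up:

    # Hypothetical schema: one measurement, sensors identified by a tag,
    # readings stored as fields (fields do not add to series cardinality).
    measurement = "sensor_data"
    sensor_ids = [f"s{i}" for i in range(1000)]  # 1,000 sensors

    lines, series = [], set()
    for sensor in sensor_ids:
        # Line protocol: <measurement>,<tag set> <field set> [timestamp]
        lines.append(f"{measurement},sensor_id={sensor} temperature=21.5,humidity=40.2")
        series.add((measurement, f"sensor_id={sensor}"))  # series key = measurement + tag set

    print(len(series))  # 1000 unique series
    # Storing a per-reading value (e.g. a reading id) as a tag instead of a field
    # would create a new series per point and blow up cardinality.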
From some benchmarks I have done, a single node easily scales to ~70K series per second, with CPU being the limiting factor. This was on an old version, though; it is likely higher than that now. Again, it largely depends on data and schema design.
It is feasible to scale it without the paid cluster by adding separate nodes, but not if you want to keep a homogeneous view (a single source of all your data). Scaling vertically (more CPU, RAM) works only as long as cardinality remains consistent, meaning more data points for roughly the same number of series.
InfluxDB suggests that up to 250K writes per second with 25 queries per second on up to 1M unique series is feasible on a single node. See the hardware sizing guidelines.
For the amount of data you have, a single node is more than enough - the size of the data does not matter, the number of series does. Avoid Elasticsearch for time series data - it needs much more infrastructure to handle the same amount of data.
Teradata is built for parallelism.
I believe that with the query below we can measure the Parallel Efficiency of a user's queries:
SELECT
    USERNAME,
    NumOfActiveAMPs,
    ((SUM(AMPCPUTime)) / 1024) / ((SUM(MaxAmpCPUTime) * NumOfActiveAMPs) / 1024) * 100 AS Parallel_Efficiency,
    COUNT(1)
FROM dbc.qrylog
WHERE MaxAmpCPUTime > 0
GROUP BY 1, 2
In an ideal situation, I believe PE can be 100%.
But for various reasons, I see that most PE (rolled up) is usually less than 50%.
What, in your opinion, is a good Parallel Efficiency % that we should try to achieve?
I was told that trying to achieve a high PE (like 60% or more) is also not good for the state of the system, though I am not sure of the reason. Is this true? Your thoughts?
Thanks for sharing your thoughts!
Parallel Efficiency for a given query can be calculated as AMPCPUTime / (MaxAMPCPUTime * (HASHAMP() + 1)), where (MaxAMPCPUTime * (HASHAMP() + 1)) is the ImpactCPU measure, representing the highest CPU consumed by a participating AMP in the query multiplied by the number of AMPs in the configuration. You may find individual workloads are all over the board in their parallel efficiency.
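A worked example of that formula in Python, with made-up CPU numbers for a hypothetical 100-AMP configuration:

    def parallel_efficiency(amp_cpu_time: float, max_amp_cpu_time: float, total_amps: int) -> float:
        """PE = AMPCPUTime / (MaxAMPCPUTime * number of AMPs), expressed as a percentage."""
        impact_cpu = max_amp_cpu_time * total_amps
        return amp_cpu_time / impact_cpu * 100

    # Query that burned 400 CPU seconds in total; the hottest AMP used 8 seconds; 100 AMPs.
    print(parallel_efficiency(400.0, 8.0, 100))  # 50.0 -> half of ideal parallelism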
I sometimes wonder if PE for an individual query would be more accurate if you replaced the total number of AMPs in the system with the number of AMPs used by the query. This metric is available in DBQL and may help balance queries that use PI or USI access paths and are not all-AMP operations.
Parallel efficiency for your overall system can be obtained from ResUsage metrics by dividing the average node utilization by the maximum node utilization. This helps you understand how evenly the system is processing a given workload, but it does not consider how "heavy" that workload might be. Here you are looking for the overall efficiency to be greater than 60%; the closer to 100%, the better the nodes are working together.
I know your inquiry was about individual queries, but I thought sharing details about the PE of your environment would be beneficial as well.