Difference in elasticsearch index size with same data and number of documents - elasticsearch

I have multiple Elasticsearch clusters; every cluster has the same indices with the same data and the same number of documents, yet there is a significant difference in index size.
I tried using the merge API but it isn't helping. The issue is that, because of this, Elasticsearch eventually runs out of space:
{
  "state": "UNASSIGNED",
  "primary": true,
  "node": null,
  "relocating_node": null,
  "shard": 3,
  "index": "local-deals-1624295772015",
  "recovery_source": {
    "type": "EXISTING_STORE"
  },
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2021-08-18T19:14:20.472Z",
    "failed_attempts": 20,
    "delayed": false,
    "details": "shard failure, reason [lucene commit failed], failure IOException[No space left on device]",
    "allocation_status": "deciders_no"
  }
}
I have configured the elasticsearch cluster to not have more than 2 shards per node to improve the query performance.
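For reference, that kind of per-node limit is typically applied with the shard allocation setting below (a sketch only; this assumes the index-level total_shards_per_node setting rather than a cluster-wide one):
PUT local-deals-1624295772015/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}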
Cluster-1 / Cluster-2: (index size comparison screenshots omitted)
Given these two clusters with the same documents, there is a 90% difference in index size, which does not make sense to me. Can someone explain this behavior?
My quick fix is to increase the EBS volume.
Response to @Val's question:
There are multiple documents that are marked for deletion.
"5": {
"health": "yellow",
"status": "open",
"index": "local-deals-1624295772015",
"uuid": "s7QDLtuhRN6HM_VwtVTB0Q",
"pri": "6",
"rep": "1",
"docs.count": "8911560",
"docs.deleted": "18826270",
"store.size": "37gb",
"pri.store.size": "19.9gb"
}

You can indeed try to run _forcemerge. It is not a blocking call; it triggers an asynchronous task that runs in the background until the job is done.
You don't need to wait for the call to return in order to force merge segments.
Also be aware that this will not remove all deleted documents, but a good deal of them, depending on the ratio of deleted documents to total documents.
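For illustration, a minimal call against the index from the question might look like this (a sketch; only_expunge_deletes asks the merge to reclaim space held by deleted documents instead of merging everything down to a single segment):
POST /local-deals-1624295772015/_forcemerge?only_expunge_deletes=true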
You can find more info on the different merge settings in the MergePolicyConfig.java class.

Related

Elasticsearch - query based on event frequency

I have multiple indexes that store user tracking logs; one of them is index-pageview. How can I query the list of users who viewed the page 10 times between 2021-12-11 and 2021-12-13 using the iOS operating system?
Log example:
index: index-pageview
[
  {
    "user_id": 1,
    "session_id": "xxx",
    "timestamp": "2021-12-11 hh:mm:ss",
    "platform": "IOS"
  },
  {
    "user_id": 1,
    "session_id": "yyy",
    "timestamp": "2021-12-13 hh:mm:ss",
    "platform": "Android"
  }
]
You can try building a normal bool query on timestamp and platform and then either a terms aggregation (possibly with min_doc_count: 10) or a collapse on user_id, as sketched below. Both approaches have some limitations though:
the aggregation might be slower (needs benchmarking)
the number of aggregation buckets is limited (to 10k by default)
collapse works on at most size docs at a time (capped at 10k as well), so you might need scrolling and app-side processing
The performance of these might still be pretty poor. If you need to run queries like this very often, I would consider using another storage (SQL? Something fancier?)
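A minimal sketch of the aggregation variant, assuming the field names from the log example and that platform is indexed as a keyword and timestamp as a date:
POST index-pageview/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "platform": "IOS" } },
        { "range": { "timestamp": { "gte": "2021-12-11", "lte": "2021-12-13" } } }
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id",
        "min_doc_count": 10,
        "size": 10000
      }
    }
  }
}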

Moving data from one Elasticsearch index to another with a higher number of shards, or increasing the shard number of an existing index

I am new to Elasticsearch and I have been reading the documentation in order to find a way of increasing the number of shards that my index consists of. Currently my index looks like this:
country_data 0 p STARTED 227 100.7kb 192.168.0.115 $HOSTNAME
country_data 0 r STARTED 227 100.7kb 192.168.0.116 $HOSTNAME
I wanted to increase the number of shards to 5, however I was unable to find a proper way of doing it. I learnt from another Stack Overflow question that I should be able to do it like this:
POST _reindex?slices=5
{
  "source": {
    "index": "country_data"
  },
  "dest": {
    "index": "country_data_new"
  }
}
However, when I did that I got a copy of my country_data with the same number of shards and replicas (1 and 1). I tried to learn more about it in the documentation but all I found is this: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_slices.html
I couldn't find anything in the documentation about increasing the number of shards of an existing index, or about how to move data to a new index with more shards. I would be grateful for any insights into this problem, or at least a website where I could learn how to do it.
This can be done in either of the ways mentioned below.
1st Option: You can use the Elasticsearch Split Index API.
I suggest going through the documentation before proceeding with this method.
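For illustration, a minimal split might look like the sketch below (this assumes a version that has the Split API; the source index must be made read-only first, and the target shard count must be a multiple of the source's, which is trivially satisfied here since country_data has one primary shard):
PUT /country_data/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST /country_data/_split/country_data_new
{
  "settings": {
    "index.number_of_shards": 5
  }
}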
2nd Option: Create a new index with the same mappings and the required number of shards in its settings, then use the Reindex API to copy data from the source index to the destination index.
To create the new Index:
PUT /<NEW_INDEX_NAME>
{
  "settings": {
    "number_of_shards": <REQUIRED_NUMBER_OF_SHARDS>
  },
  "mappings": { <MAPPINGS_OF_SOURCE_INDEX> }
}
If you don't specify the number of shards in the settings while creating an index, it is created with one primary and one replica shard by default.
To Reindex from source to newly created index:
POST _reindex
{
  "source": {
    "index": "<SOURCE_INDEX_NAME>"
  },
  "dest": {
    "index": "<NEW_INDEX_NAME>"
  }
}

How can I show a table with the sum of value x of all children within Kibana

I have an Elasticsearch database with documents stored the following way (a comma separates the documents):
{
  "path": "path/to/data",
  "kind": "type1"
},
{
  "path": "path/to/data/values1",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/values2",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/datasub",
  "kind": "type1"
},
{
  "path": "path/to/data/datasub/values1",
  "kind": "type2",
  "x": 1
}
Now I want to create a table view/chart showing all type2's with the sum of x of all their children.
So I expect the total for path/to/data to be 5 and the total for path/to/data/datasub to be 1.
To consider: the depth of this structure could theoretically be unlimited
I'm running Elasticsearch 7 and Kibana 7 and I want to use the table visualisation to start with, but I would like to be able to use this kind of aggregation throughout multiple visualisations. I have Googled a lot and found all kinds of Elasticsearch queries, but nothing on how to achieve this in Kibana.
All help is much appreciated
For those who run into the same question:
The solution I ended up using is to split the path into tokens prior to importing it into Elasticsearch. So consider a document having a path like "/this/is/a/path". This becomes the following array in the document:
[
"/this",
"/this/is",
"/this/is/a",
"/this/is/a/path"
]
You can then use a terms aggregation on it with various metrics to calculate your desired measurements.
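For illustration, such an aggregation could look like the sketch below, assuming the prefix array is indexed as a keyword field named path_tokens and the metric field is x (both field and index names here are hypothetical):
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_path": {
      "terms": {
        "field": "path_tokens",
        "size": 1000
      },
      "aggs": {
        "total_x": {
          "sum": { "field": "x" }
        }
      }
    }
  }
}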

Joining logstash with parent record

I'm using Logstash to analyze my web server access logs. At this time, it works pretty well. I used a configuration file that produces this kind of data:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "end"
  },
  "object_id": "boreal:12345"
  ...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
  "type": "publication",
  "id": "boreal:12345",
  "sm_title": "The title of the publication",
  "sm_type": "thesis",
  "sm_creator": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  "sm_departement": [
    "UCL/CORE - Center for Operations Research and Econometrics"
  ],
  "sm_date": "2001",
  "ss_state": "A"
  ...
}
And I would like to create a query like "give me all access for 'Smith, John' publications".
As all my data is not in the same index, I can't use a parent-child relation (am I right?).
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. Per month I have more than 1M accesses... In 1 year, I will have more than 15M records in 1 index... and I need to store the web access data for a minimum of 5 years (1M * 12 * 15 = 180M).
I don't think it's a good idea to deal with a single index containing more than 180M records (if I'm wrong, please let me know).
Does a solution to my problem exist? I haven't found any elegant one.
The only one I have at this time, in my Python script, is: a first query to collect all the IDs of 'Smith, John' publications, then a loop over each publication to get all web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES and the response time is not acceptable (more than 7 seconds; not so bad when you know the number of records in ES, but not acceptable for the final user).
Thanks for your help; sorry for my English.
Renaud
An idea would be to use the elasticsearch Logstash filter in order to look up a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field from the document in the publications index that has the same object_id, and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* indices.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "publications"
  query => "id:%{object_id}"
  fields => { "sm_creator" => "author" }
}
As a result, your access log document will look like the one below, and for "give me all access for 'Smith, John' publications" you can simply query the author field (the field the filter copies sm_creator into) in all your logstash indices:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "end"
  },
  "object_id": "boreal:12345",
  "author": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  ...
}
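For example, the final lookup can then be a single search over all the daily indices (a sketch, querying the author field produced by the filter above):
POST logstash-*/_search
{
  "query": {
    "match": {
      "author": "Smith, John"
    }
  }
}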

ElasticSearch + Kibana - Unique count using pre-computed hashes

Update: added the Elasticsearch query and stack trace below.
I want to perform a unique count on my Elasticsearch cluster.
The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
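For reference, the unique count that Kibana issues for such a field boils down to a cardinality aggregation on the hash sub-field; a hand-written equivalent might look like this (the index name is a placeholder):
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_my_prop": {
      "cardinality": {
        "field": "my_prop.hash"
      }
    }
  }
}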
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use unique count on my_prop.hash in Kibana I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
Elasticsearch has a 2g heap size.
The above also fails for a single index with 4 million records.
My questions
Am I missing something in my configuration?
Should I scale up my machine? This does not seem like a scalable solution.
ElasticSearch query
Was generated by Kibana:
http://pastebin.com/hf1yNLhE
ElasticSearch Stack trace
http://pastebin.com/BFTYUsVg
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of hash, so you need to take them out of the heap and put them on disk, meaning using doc_values.
Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and, no, the settings of the main field are not inherited by its sub-fields): "hash": { "type": "murmur3", "index": "no", "doc_values": true }.
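Put together, the mapping from the question would then look like this (a sketch for the pre-5.x string/murmur3 mapping syntax used in the question, with only the hash sub-field changed):
"my_prop": {
  "index": "not_analyzed",
  "fielddata": {
    "format": "doc_values"
  },
  "doc_values": true,
  "type": "string",
  "fields": {
    "hash": {
      "type": "murmur3",
      "index": "no",
      "doc_values": true
    }
  }
}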
