How to design my elastic index for my application?


My application is a survey-creation app where users can create surveys with many questions of different types. Each survey can then be shared with any number of people, whose responses are recorded as below:
{
"id" : 256, // submission id
"timeTaken" : "39.00",
"startTime" : "2020-07-19T05:37:38.873Z",
"state" : "COMPLETED",
"completedTime" : "2020-07-19T05:38:17.873Z",
"deviceType" : "COMPUTER",
"ip" : null,
"account_id" : 2,
"channel_id" : 48,
"contact_id" : null,
"survey_id" : 10,
"trigger_id" : 93,
"trigger_contact_id" : null,
"locked" : false,
"location" : null,
"language" : null,
"submission_id" : 256,
"question_90" : {
"skipped" : false,
"answer_choices" : [ 79 ]
},
"question_122" : {
"skipped" : false,
"otherChoice" : null,
"answer_choices" : [ 115, 113, 111, 110, 114 ]
},
"question_106" : {
"skipped" : false,
"answer_choices" : [
85
]
},
"question_120" : {
"answer_txt": "Great service",
"skipped" : false
},
"question_118" : {
"answer_txt": "Hello people",
"skipped" : false
},
"question_121" : {
"skipped" : false,
"answer_date" : "2020-06-04T20:01:49.783Z",
"answer_timezone" : 330
},
"question_108" : {
"skipped" : false,
"answer_int" : "93"
},
"question_105" : {
"skipped" : false,
"answer_string" : "+1 202 9932219"
},
"question_93" : {
"skipped" : false,
"answer_string" : "Kyra60#yahoo.com"
},
"question_117" : {
"skipped" : false
},
"question_92" : {
"skipped" : false,
"answer_txt" : "composite"
},
"question_107" : {
"skipped" : false,
"answer_bool" : true
}
}
Initially I created one index per survey, but that turned out to be a bad idea: each index was allocated 5 shards, and my application had nearly 20k surveys created by users. Amazon Elasticsearch Service broke down and reported that 60k shards had been created on my 2 nodes.
I now have no idea how to create my index or partition it meaningfully so that it can be queried efficiently later on.
Can anyone share some insights, or ask me more questions so that I can update the question for clarity?

It looks like you are using an Elasticsearch version earlier than 7.x, where the default number of primary shards was 5. The default was changed to 1 in 7.x, and one of the reasons was exactly your problem: a large number of small shards hurts Elasticsearch performance.
Ideally, you should create just one index for all your surveys and roll over to a new index based on time range or size.
You need to have survey_id (the unique identifier of a survey) in that single index. When querying the index, use survey_id in filter context to get better query performance, because filter clauses are not scored and frequently used filters can be cached.
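As a rough illustration of the filter-context advice, the search body for a single survey might be built like this. It is only a sketch: the helper name survey_filter_query and the date bound are made up for the example, and it assumes survey_id is mapped as a numeric or keyword field and startTime as a date.

```python
import json


def survey_filter_query(survey_id, extra_filters=None):
    """Build a search body that restricts hits to one survey using filter context.

    Clauses under bool.filter do not contribute to scoring.
    """
    filters = [{"term": {"survey_id": survey_id}}]
    if extra_filters:
        filters.extend(extra_filters)
    return {"query": {"bool": {"filter": filters}}}


# Hypothetical usage: responses to survey 10 started on or after 1 July 2020.
body = survey_filter_query(
    10,
    [{"range": {"startTime": {"gte": "2020-07-01T00:00:00Z"}}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the single shared index.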

Related

Elasticsearch deleted document reappears using logstash

I am running ES on a single-node cluster for development.
I am deleting a document using the delete API from Kibana. It is deleted for a second and then immediately reappears. Any help would be appreciated.
Here is the API command I use:
DELETE test/_doc/12345
{
"_index" : "test",
"_type" : "_doc",
"_id" : "12345",
"_version" : 231,
"result" : "deleted",
"_shards" : {
"total" : 3,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 899,
"_primary_term" : 1
}
GET test/_count
{
"count" : 3,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
The deleted doc is immediately re-indexed:
GET test/_count
{
"count" : 4,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}

According to the documentation:
...If clean_run is set to true, this value will be ignored and
sql_last_value will be set to Jan 1, 1970
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_state
That may explain why all your data is added again every 10 minutes. Remove clean_run and test again, or check whether the _version field is updated.

I found that it was a data issue. My Logstash jdbc statement checks for modificationdate greater than sql_last_value, and the scheduler is set to run every 10 seconds. The documents that reappeared had a modificationdate in the future; changing it to the current date solved the problem.
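To make the cause easier to see, here is a small simulation of that behaviour. It does not use Logstash at all: it assumes sql_last_value is set to the time of the previous run, and the row ids and dates are invented for the example.

```python
from datetime import datetime, timedelta


def rows_selected(rows, sql_last_value):
    """Mimic a statement like `WHERE modificationdate > :sql_last_value`."""
    return [r["id"] for r in rows if r["modificationdate"] > sql_last_value]


now = datetime(2020, 1, 1, 12, 0, 0)
rows = [
    {"id": "past", "modificationdate": now - timedelta(hours=1)},
    {"id": "12345", "modificationdate": now + timedelta(days=30)},  # future-dated row
]

# Three scheduled runs, 10 seconds apart. Assumption: after each run,
# sql_last_value becomes the time at which that run executed.
sql_last_value = datetime(1970, 1, 1)
picked_per_run = []
for run in range(3):
    run_time = now + timedelta(seconds=10 * run)
    picked_per_run.append(rows_selected(rows, sql_last_value))
    sql_last_value = run_time

print(picked_per_run)
```

The row dated in the past is picked up once, while the future-dated row matches on every run, so a document deleted from the index is indexed again a few seconds later.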

Elasticsearch max of field combined with a unique field

I have an index with two fields:
name: uuid
version: long
I now want to count only the documents (on a very large index with 1 million+ entries) where the version is the highest for that name. For example, a query on an index with the following documents:
{name="a", version=1}
{name="a", version=2}
{name="a", version=3}
{name="b", version=1}
... would return:
count=2
Is this somehow possible? I can not find a solution for this particular problem.

You are effectively describing a count of distinct names, which you can do with a cardinality aggregation.
Request:
GET test1/_search
{
"aggs" : {
"distinct_count" : {
"cardinality" : {
"field" : "name.keyword"
}
}
},
"size": 0
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"distinct_count" : {
"value" : 2
}
}
}
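The reasoning behind this answer can be checked on the sample documents from the question. The sketch below assumes that each name has exactly one document at its highest version; under that assumption, counting "latest" documents is the same as counting distinct names. The function name is made up for the example. Keep in mind that the cardinality aggregation is approximate for large numbers of distinct values, so on a very large index the result may be slightly off.

```python
def count_latest_versions(docs):
    """Count documents whose version is the highest one for their name."""
    latest = {}
    for doc in docs:
        name, version = doc["name"], doc["version"]
        if name not in latest or version > latest[name]:
            latest[name] = version
    # One entry per name, assuming the highest version occurs once per name.
    return len(latest)


docs = [
    {"name": "a", "version": 1},
    {"name": "a", "version": 2},
    {"name": "a", "version": 3},
    {"name": "b", "version": 1},
]

distinct_names = len({doc["name"] for doc in docs})
print(count_latest_versions(docs), distinct_names)
```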

Elasticsearch get snapshot size

I'm looking for a way to get the storage size of a specific Elasticsearch snapshot. The snapshots are located on a shared filesystem.
It seems there is no API for this?

To get the size or status of an Elasticsearch snapshot, call the snapshot status API:
curl -X GET "localhost:9200/_snapshot/my_repository/my_snapshot/_status?pretty"
Note: Replace my_repository and my_snapshot in the curl command above with your own values.
Sample Output:
"snapshots" : [
{
"snapshot" : "index-01",
"repository" : "my_repository",
"uuid" : "OKHNDHSKENGHLEWNALWEERTJNS",
"state" : "SUCCESS",
"include_global_state" : true,
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"indices" : {
"graylog_130" : {
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"shards" : {
"0" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 97,
"size_in_bytes" : 1807163337
},
"total" : {
"file_count" : 271,
"size_in_bytes" : 84885391182
},
"start_time_in_millis" : 1631622334048,
"time_in_millis" : 49607,
"number_of_files" : 97,
"processed_files" : 97,
"total_size_in_bytes" : 1807163337,
"processed_size_in_bytes" : 1807163337
}
},
"1" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 52,
"size_in_bytes" : 6422024582
},
"total" : {
"file_count" : 192,
"size_in_bytes" : 84515939637
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 52,
"processed_files" : 52,
"total_size_in_bytes" : 6422024582,
"processed_size_in_bytes" : 6422024582
}
}
}
}
In the above output, look for
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
}
Now convert size_in_bytes to GB to get the exact size of the snapshot.
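For the conversion step, a minimal sketch in Python could look like this. The two helper names are invented here, and the status dictionary is cut down to just the fields used, with the size taken from the sample output.

```python
def snapshot_total_bytes(status):
    """Sum stats.total.size_in_bytes over the snapshots in a _status response."""
    return sum(s["stats"]["total"]["size_in_bytes"] for s in status["snapshots"])


def bytes_to_gb(num_bytes, binary=False):
    """Convert bytes to GB (10**9 bytes) or, with binary=True, to GiB (2**30 bytes)."""
    return num_bytes / (1024 ** 3 if binary else 1000 ** 3)


# Trimmed-down version of the sample response shown in this answer.
status = {
    "snapshots": [
        {
            "snapshot": "index-01",
            "stats": {"total": {"file_count": 463, "size_in_bytes": 169401330819}},
        }
    ]
}

total = snapshot_total_bytes(status)
print(round(bytes_to_gb(total), 2), "GB")
print(round(bytes_to_gb(total, binary=True), 2), "GiB")
```

Whether to divide by 1000^3 or 1024^3 depends on which unit your storage tooling reports, so the sketch offers both.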

You can get the storage used by an index from the _cat API (primary store size). The first snapshot should be around the size of the index.
For incremental snapshots, it depends. Snapshots are taken at the segment level (index-..), so an incremental snapshot may be much smaller, depending on your indexing. Merges can cause new segments to form, and so on.
https://www.elastic.co/blog/found-elasticsearch-snapshot-and-restore gives a nice overview.

I need an exact figure for the storage used.
For now I use the following approach: separate directories at the index/snapshot level, so that I can get the used storage size at the system level (du command) for a specific index or snapshot.

Count the number of duplicates in elasticsearch

I have an application inserting a numbered sequence of logs into elasticsearch.
Under certain conditions, after stopping my application, I find that in elasticsearch there are more logs than I have actually generated.
This simple aggregation helped me find out that a few duplicates are present:
curl /logstash-*/_search?pretty -d '{
size: 0,
aggs: {
msgnum_terms: {
terms: {
field: "msgnum.raw",
min_doc_count: 2,
size: 0
}
}
}
}'
msgnum is the field containing the numeric sequence. Normally it should be unique and the resulting doc_counts never exceed 1. Instead I get something like:
{
"took" : 33,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 100683,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"msgnum_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "4097",
"doc_count" : 2
}, {
"key" : "4099",
"doc_count" : 2
...
...
...
}, {
"key" : "5704",
"doc_count" : 2
} ]
}
}
}
How can I count the exact number of duplicates in order to make sure that they are the only cause of mismatch between number of generated log lines and number of hits in elasticsearch?
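One possible way to get that number, sketched here in Python, is to sum doc_count - 1 over the returned buckets: each bucket with doc_count n contributes n - 1 surplus documents. This assumes the aggregation returned every duplicated key and that doc_count_error_upper_bound is 0, as in the response above. The bucket list below contains only the three buckets visible in the truncated output, so the result is illustrative rather than the real total.

```python
def count_surplus_docs(buckets):
    """Number of documents beyond the first one for each duplicated key."""
    return sum(bucket["doc_count"] - 1 for bucket in buckets)


# Only the buckets visible in the truncated response; the real list is longer.
buckets = [
    {"key": "4097", "doc_count": 2},
    {"key": "4099", "doc_count": 2},
    {"key": "5704", "doc_count": 2},
]

surplus = count_surplus_docs(buckets)
print(surplus)
```

If hits.total minus this surplus equals the number of generated log lines, duplicates account for the whole mismatch.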

Elastic Search Index Status

I am trying to set up a scripted reindex operation as suggested in: http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
To follow the suggestion of creating a new index, aliasing it, and then deleting the old index, I need a way to tell when the indexing operation on the new index is complete, ideally via the REST interface.
It has 80 million rows to index and can take a few hours.
I can't find anything helpful in the docs.

You can try _stats: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-stats.html
E.g.:
{
"_shards" : {
"total" : 10,
"successful" : 5,
"failed" : 0
},
"_all" : {
"primaries" : {
"docs" : {
"count" : 0,
"deleted" : 0
},
"store" : {
"size_in_bytes" : 575,
"throttle_time_in_millis" : 0
},
"indexing" : {
"index_total" : 0,
"index_time_in_millis" : 0,
"index_current" : 0,
"delete_total" : 0,
"delete_time_in_millis" : 0,
"delete_current" : 0,
"noop_update_total" : 0,
"is_throttled" : false,
"throttle_time_in_millis" : 0
},
I think you can compare _all.total.docs.count and _all.total.indexing.index_current.
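A rough sketch of such a check in Python, working on an already-parsed _stats response, might be the following. The function name and the expected_docs parameter are made up for the example. It reads the primaries section rather than total so that replica copies are not counted twice, and it treats "done" as "expected document count reached and no indexing operation in flight", which is a heuristic rather than a guarantee.

```python
def indexing_looks_complete(stats, expected_docs):
    """Heuristic: all expected docs are visible and nothing is being indexed right now."""
    primaries = stats["_all"]["primaries"]
    docs_indexed = primaries["docs"]["count"]
    in_flight = primaries["indexing"]["index_current"]
    return docs_indexed >= expected_docs and in_flight == 0


# Hypothetical snapshots of the stats during and after a reindex of 80 million rows.
in_progress = {
    "_all": {"primaries": {"docs": {"count": 1200000}, "indexing": {"index_current": 3}}}
}
finished = {
    "_all": {"primaries": {"docs": {"count": 80000000}, "indexing": {"index_current": 0}}}
}

print(indexing_looks_complete(in_progress, 80000000))
print(indexing_looks_complete(finished, 80000000))
```

A script would fetch _stats for the new index periodically, pass the parsed JSON to this function, and switch the alias once it returns True.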
