How to be sure that all documents indexed in ElasticSearch

I have a question about index aliases and zero downtime.
When we put a document into an index, it takes some time until the document is available for search.
How can I check that all documents are available for search before switching from the old index to the new one?

One way to get that information is to fetch the stats of the index (GET your-index/_stats/docs,indexing) and compare the values in the docs and indexing blocks.
...
"_all" : {
  "primaries" : {
    "docs" : {
      "count" : 1234,          <-- searchable docs
      "deleted" : 0
    },
    "indexing" : {
      "index_total" : 1300,    <-- indexed docs
      "index_time_in_millis" : 13,
      ...
    }
    ...
To make all your docs searchable, you can either wait for the next scheduled refresh to kick in, or trigger an index refresh explicitly with the refresh API (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/indices-refresh.html).
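
The comparison described above can be sketched in Python. This is only a minimal illustration that works on an already-fetched _stats response; the alias and index names (my-alias, old-index, new-index) are hypothetical, and the gap is only meaningful under the assumption that every indexing operation created a new document (no updates or deletes) and that the counters were not reset in between.

```python
def unrefreshed_gap(stats: dict) -> int:
    """Return index_total minus docs.count for the primaries of a _stats response."""
    primaries = stats["_all"]["primaries"]
    return primaries["indexing"]["index_total"] - primaries["docs"]["count"]


def alias_switch_body(alias: str, old_index: str, new_index: str) -> dict:
    """Build a POST /_aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Values copied from the stats excerpt above.
sample_stats = {
    "_all": {
        "primaries": {
            "docs": {"count": 1234, "deleted": 0},
            "indexing": {"index_total": 1300, "index_time_in_millis": 13},
        }
    }
}

gap = unrefreshed_gap(sample_stats)
if gap == 0:
    # Hypothetical names; send this body to POST /_aliases to switch over.
    print(alias_switch_body("my-alias", "old-index", "new-index"))
else:
    print(f"{gap} indexed documents are not searchable yet; refresh and check again")
```

Putting the remove and add actions in one _aliases request keeps the switch atomic, so searches never see the alias pointing at no index.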

Related

How to implement Elasticsearch new index creation after every fixed number of days?

How can I have Elasticsearch create a new index after every fixed number of days and, if that is possible, how can I search over all the previous indices? Currently we have only one index, which holds all the data. I looked at the Rollover API of ES; is this the correct way? The problem seems to arise when we want to search for data in previous indices: how can this be done? Any answers are appreciated. Thanks.

Yes, you are on the right path. To search your old indices, you can link multiple indices to one alias using the alias API; then, instead of searching a single index, you search against the unified alias.
Refer to this official example of how to link multiple indices to the same alias (alias1 in the example below):
POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "test1", "alias" : "alias1" } },
    { "add" : { "index" : "test2", "alias" : "alias1" } }
  ]
}
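
When the list of indices grows over time, the request body above can be generated instead of written by hand. The following is a small sketch, not an official client call; the helper names link_indices_body and search_path are made up for illustration.

```python
from typing import Sequence


def link_indices_body(alias: str, indices: Sequence[str]) -> dict:
    """Build a POST /_aliases body that adds every index in `indices` to `alias`."""
    return {
        "actions": [{"add": {"index": name, "alias": alias}} for name in indices]
    }


def search_path(alias: str) -> str:
    """Return the path to search all indices behind the alias in one request."""
    return f"/{alias}/_search"


body = link_indices_body("alias1", ["test1", "test2"])
print(body)
print(search_path("alias1"))
```

A search sent to the alias path is fanned out to every index linked to that alias.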

Elasticsearch "_cat/indices" api update delayed until search?

Elasticsearch (v7.9.2) has an API, _cat/indices, that shows index status. The latest change to docs.count does not seem to be visible until a search or another update is made.
Is this behavior intended as a performance optimization?
And is there any way to keep it always up to date?
Update: how I observed this
I'm using Logstash to import data into ES.
In the browser I have opened http://localhost:9200/_cat/indices?v.
After each import I refresh the browser page, and the count usually changes.
After Logstash finishes and I terminate it, the count on the page is lower than the count in the source DB (e.g. MySQL).
If I then refresh the page again and again, the count does not change.
But once I send a query to the ES index from Postman and refresh the page again, docs.count changes and the total becomes the same as in the source DB.
To summarize the behavior:
At first, docs.count is updated after each import (i.e. insert).
But as the import continues for a while without any query on the index, the page's docs.count stops updating.
Then a query on the index forces docs.count to update to the correct number.
After that, the steps above repeat. It looks like some kind of update that is delayed until necessary, as an optimization.
The index settings from http://localhost:9200/xxx/_settings (as requested in a comment):
{
  "xxx" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "1",
        "provided_name" : "xxx",
        "creation_date" : "1602844600812",
        "analysis" : {
          "analyzer" : {
            "default_search" : {
              "type" : "ik_max_word"
            },
            "default" : {
              "type" : "ik_max_word"
            }
          }
        },
        "number_of_replicas" : "0",
        "uuid" : "qLFMHhyBQNOOs1u_EcJbBg",
        "version" : {
          "created" : "7090299"
        }
      }
    }
  }
}

I see the same issue on ES v7.9.3.
From the official ES docs:
"To get an accurate count of Elasticsearch documents, use the cat count
or count APIs."
The cat count API is accurate on my ES cluster:
GET _cat/count/log-uwsgi-2021?v
epoch      timestamp count
1638855942 05:45:42  500
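
If the count needs to be compared with the source database in a script, the text output above can be parsed. This is a minimal sketch that assumes the response was requested with ?v (so the first line is a header row) and that the text has already been fetched; parse_cat_count is a hypothetical helper name.

```python
def parse_cat_count(text: str) -> int:
    """Extract the `count` column from the text output of GET _cat/count/<index>?v."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    row = lines[1].split()
    # Locate the column by name so the order of the columns does not matter.
    return int(row[header.index("count")])


# Output copied from the answer above.
sample_output = """epoch      timestamp count
1638855942 05:45:42  500
"""

print(parse_cat_count(sample_output))
```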

The latest docs.count is shown once a refresh has occurred.
Refreshes happen periodically, based on the index.refresh_interval setting.
From the documentation: "Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds." This is why docs.count catches up after you send a search request.

Attempting to delete all the data for an Index in Elasticsearch

I am trying to delete all the documents, i.e. data from an index. I am using v6.6 along with the dev tools in Kibana.
In the past, I have done this operation successfully, but now it is saying 'not found':
{
  "_index" : "new-index",
  "_type" : "doc",
  "_id" : "_query",
  "_version" : 1,
  "result" : "not_found",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 313,
  "_primary_term" : 7
}
Here is my Kibana statement:
DELETE /new-index/doc/_query
{
  "query": {
    "match_all": {}
  }
}
Also, here is the index GET operation, which verified that the index exists and has data:
GET new-index/doc/_search
I verified the type is doc, but I can post the whole mapping if needed.

An easier way is to navigate in Kibana to Management -> Elasticsearch -> Index Management, select the indices you would like to delete via the checkboxes, and click Manage index -> Delete index or Flush index, depending on your needs (note that flushing an index does not remove its documents).

I was able to resolve the issue by using a delete by query:
POST new-index/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

Deleting documents is a problematic way to clear data.
It is preferable to delete the index:
DELETE [your-index]
Run this from the Kibana console, then recreate the index from scratch.
An even better approach is to define an index template, so that the index is created automatically when the first document is indexed.

Currently, the only solutions are to either delete the index itself (faster) or use delete-by-query (slower):
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete-by-query.html
POST new-index/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
The Delete API only removes a single document (https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-delete.html), so DELETE /new-index/doc/_query is treated as a request to delete the document whose ID is _query. That is why the response shows "_id" : "_query" and "result" : "not_found".
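
To illustrate why the original DELETE /new-index/doc/_query came back with not_found, here is a small sketch of how a 6.x-style document path breaks down into index, type and ID. It is only an illustration of the interpretation, not Elasticsearch's actual request-routing code, and parse_typed_document_path is a hypothetical helper.

```python
def parse_typed_document_path(path: str) -> dict:
    """Split a 6.x-style /{index}/{type}/{id} document path into its three parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /index/type/id, got {path!r}")
    index, doc_type, doc_id = parts
    return {"_index": index, "_type": doc_type, "_id": doc_id}


# The path from the question: "_query" ends up in the ID position.
print(parse_typed_document_path("/new-index/doc/_query"))
```

The three values match the _index, _type and _id fields of the not_found response in the question, whereas POST new-index/_delete_by_query addresses a dedicated endpoint.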

My guess is that someone changed a field's name, and now the name of that field in the DB (NoSQL) and in Elasticsearch no longer match. So Elasticsearch tried to delete that field, but the field was "not found".
It's not an error I would lose sleep over.

How can I get options for filtering by a field directly from elasticsearch?

I want to populate a filtering field based on the data I have indexed inside Elasticsearch. How can I retrieve this data? For example, my documents inside index "test" and type "doc" could be
{"id":1, "tag":"foo", "name":"foothing"}
{"id":2, "tag":"bar", "name":"barthing"}
{"id":3, "tag":"foo", "name":"something"}
{"id":4, "tag":"quux", "name":"quuxthing"}
I'm looking for something like GET /test/doc/_magic?q=tag that would return [foo,bar,quux] from my data. I don't know what this is called or whether it is even possible. I don't want to load all index entries into memory and do this programmatically; I have millions of documents in the index with around a hundred different tags.
Is this possible with ES?

Yes, that's possible; this is called a terms aggregation.
You can do it like this:
GET /test/doc/_search
{
  "size": 0,
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "tag.keyword",
        "size": 100
      }
    }
  }
}
Note that depending on the cardinality of your tag field, you can increase/decrease the size setting (10 by default).
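
The distinct values come back as buckets under the aggregation name, so turning them into a list of filter options is a short loop. The sketch below works on an already-parsed response; the sample response is hand-written to match the four example documents from the question, not captured from a real cluster, and tag_options is a hypothetical helper name.

```python
def tag_options(search_response: dict, agg_name: str = "tags") -> list:
    """Return the bucket keys of a terms aggregation, in the order ES returned them."""
    buckets = search_response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Illustrative response shape for the four example documents in the question.
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "foo", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "quux", "doc_count": 1},
            ]
        }
    }
}

print(tag_options(sample_response))
```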

ElasticSearch Filtered Aliases Creation - Best Practice

We are planning to use filtered aliases as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
Our input data is going to be a stream, with each line of the stream corresponding to an object we would like to store in ES.
Each object contains an 'id', which we are using for routing and filtering.
QUESTION -
How do we create the aliases and index the data in a performant way?
-- Do we index all the data, keep track of all the unique 'id's, and create the filtered aliases at the very end? OR
-- For each object, do we check whether an alias for that 'id' exists and create one if it doesn't?
I'm leaning towards the first approach. Is it advisable and performant compared to the second approach?
TIA.

Based on our discussion above and after having glanced over the blog article you posted, I'm pretty positive that in your case you don't need aliases at all and the routing key would suffice. Again, only because you have a single index, if you had many indices this would not be true anymore!
You simply need to specify the routing key to use when indexing your document. Until ES 2.0, you can use the _routing field for that purpose, even though it's been deprecated in ES 1.5, but in your case it serves your purpose.
{
  "customer" : {
    "_routing" : {
      "required" : true,
      "path" : "customer_id"    <----- the field you use as the routing key
    },
    "properties": { ... }
  }
}
Then when searching you simply need to specify &routing=<customer_id> in your search URL in addition to your customer id filter (since a given shard can host documents for different customers). Your search will go directly to the shard identified by the given routing key, and thus, only retrieve data from the specified customer.
Using a filtered alias for this brings nothing as the filter and routing key you'd include in your alias definition would not contribute anything additional, since the retrieved documents are already "filtered" (kind of) by the routing key. This is way easier than trying to detect (on each new document to index) if an alias exists or not and create it if it doesn't.
UPDATE:
Now if you absolutely have/want to create filtered aliases, the more performant way would be the first one you mentioned:
First index your daily data
Then run a terms aggregation on your customer_id field with size high enough (i.e. higher than the cardinality of the field, which was ~100 in your case) to make sure you capture all unique customer ids to create your aliases
Loop over all the buckets to retrieve all unique customer ids
Create all aliases in one shot using one action for each customer_id
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid1",
        "routing" : "cid1",
        "filter" : { "term" : { "customer_id" : "cid1" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid2",
        "routing" : "cid2",
        "filter" : { "term" : { "customer_id" : "cid2" } }
      }
    },
    {
      "add" : {
        "index" : "customers",
        "alias" : "alias_cid3",
        "routing" : "cid3",
        "filter" : { "term" : { "customer_id" : "cid3" } }
      }
    },
    ...
  ]
}'
Note that you don't have to worry about whether an alias already exists: the whole command won't fail, and existing aliases are silently left in place.
When this command has run, you'll have all your aliases on your unique index, properly configured with a filter and a routing key.
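
Since the list of actions follows one pattern per customer, the request body can be generated from the customer ids collected from the aggregation buckets. This is a minimal sketch under the same naming scheme as the command above (alias_<id>, the customers index, the customer_id field); filtered_alias_body is a hypothetical helper, and sending the body to POST /_aliases is left to whatever HTTP client you use.

```python
import json


def filtered_alias_body(index: str, customer_ids: list) -> dict:
    """Build one 'add' action per customer id, each with its own routing and filter."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"alias_{cid}",
                    "routing": cid,
                    "filter": {"term": {"customer_id": cid}},
                }
            }
            for cid in customer_ids
        ]
    }


# The ids would normally come from the keys of the terms aggregation buckets.
body = filtered_alias_body("customers", ["cid1", "cid2", "cid3"])
print(json.dumps(body, indent=2))
```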
