How to handle multiple updates / deletes with Elasticsearch?

I need to update or delete several documents.
When I update, I do this:
1. I first search for the documents, setting a large limit for the returned results (let's say, size: 10000).
2. For each of the returned documents, I modify certain values.
3. I resend the whole modified list to Elasticsearch (bulk index).
This is repeated until step 1 no longer returns results.
When I delete, I do this:
1. I first search for the documents, setting a large limit for the returned results (let's say, size: 10000).
2. I delete every found document by sending its _id to Elasticsearch (10000 individual requests).
This is repeated until step 1 no longer returns results.
Is this the right way to make an update?
When I delete, is there a way I can send several ids to delete multiple documents at once?

For your massive index/update operation, if you are not already using it, take a look at the bulk API documentation: it is tailored for exactly this kind of job.
If you want to retrieve lots of documents in small batches, you should use a scan/scroll search instead of from/size. Related information can be found here.
To sum up:
the scroll API loads the results so that you can iterate over them efficiently
the scan search type disables sorting, which is costly
Give it a try; depending on the data volume, it could improve the performance of your batch operations.
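A rough sketch of such a scan/scroll loop (the index name, scroll timeout, page size and query are placeholders, and it assumes an older cluster where the scan search type is still available):
# open the scrolling search; with scan, "size" is the number of hits returned per shard per page
POST /indexName/_search?search_type=scan&scroll=1m&size=1000
{ "query": { "match_all": {} } }
# take the _scroll_id from the response and keep fetching pages until one comes back empty
POST /_search/scroll?scroll=1m
<_scroll_id returned by the previous call>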
For the delete operation, you can use the same _bulk API to send multiple delete operations at once.
The format of each line is the following:
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "1" } }
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "2" } }

For deletion and update, if you want to delete or update by id, you can use the bulk API:
Bulk API
The bulk API makes it possible to perform many index/delete operations
in a single API call. This can greatly increase the indexing speed.
The possible actions are index, create, delete and update. index and
create expect a source on the next line, and have the same semantics
as the op_type parameter to the standard index API (i.e. create will
fail if a document with the same index and type exists already,
whereas index will add or replace a document as necessary). delete
does not expect a source on the following line, and has the same
semantics as the standard delete API. update expects that the partial
doc, upsert and script and its options are specified on the next line.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
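For example, a bulk request mixing a partial-document update and a delete might look like this (the index, type, ids and field name are placeholders):
POST /_bulk
{ "update" : { "_index" : "indexName", "_type" : "typeName", "_id" : "1" } }
{ "doc" : { "field_to_change" : "new value" } }
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "2" } }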
You can also delete by query instead:
Delete By Query API
The delete by query API allows to delete documents from one or more
indices and one or more types based on a query. The query can either
be provided using a simple query string as a parameter, or using the
Query DSL defined within the request body.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
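As a sketch against that older API (index, type, field and value are placeholders):
DELETE /indexName/typeName/_query
{ "query" : { "term" : { "someField" : "someValue" } } }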

Related

Elasticsearch SEARCH-API ignores some existing indices when searching with wildcards

I want to retrieve information about all available indices in my Elasticsearch database. For that I send a request to "<elasticsearch_endpoint>/logs-cfsyslog-*/_search/?format=json".
The body of the request is irrelevant for this problem; I'm simply filtering for a specific value of one field. I would expect the API to return all indices of the last 30 days. However, I only receive some of the available archives. Some that are missing are: 3rd March, 11th-17th and 26th-27th February.
But when I retrieve all available indices with the "_cat" API via
"<elasticsearch_endpoint>/_cat/indices/logs-cfsyslogs-*"
I can see ALL indices that I expect.
I can even specify the exact date that I'm looking for in the search API via:
"<elasticsearch_endpoint>/logs-cfsyslog-2022.03.03/_search/?format=json"
and the API will return the index that I specified.
So why does Elasticsearch not return, for example, the index from 3rd March 2022 when I use the wildcard "*" in the search request?
It may be due to one of the reasons below.
First, the default value of size is 10
Since you are calling "<elasticsearch_endpoint>/logs-cfsyslog-*/_search/?format=json" without passing the size parameter, Elasticsearch returns at most 10 documents in the response. Try the API below and check how many results you get and from which indices.
<elasticsearch_endpoint>/logs-cfsyslog-*/_search/?format=json&size=10000
Second, due to filtering
I'm simply filtering for a specific value of one field.
As you mentioned in the question, you are filtering one field on a specific value, so the filter condition may simply not match any documents in the other indices.
Please check what value you are getting for hits.total in your response and set the size parameter based on that. Please note that Elasticsearch will return at most 10,000 documents.
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
}
}
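For example, a request that sets size explicitly in the body alongside the filter (the field name and value are placeholders for whatever the original filter used):
POST /logs-cfsyslog-*/_search
{
  "size": 10000,
  "query": { "term": { "some_field": "some_value" } }
}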

Bulk insert documents but if it exists only update provided fields

I have an index which contains data as follows:
{
  "some_field": string, -- exists in my database
  "some_other_field": string, -- exists in my database
  "another_field": string -- does NOT exist in my database
}
I have a script which grabs data from a database and performs a bulk insert. However, only some of the fields come from the database, as shown above.
If a document already exists, I still want to update the fields that come from the database, but without overwriting/deleting the field that does not come from the database.
I am using the bulk API to do this, however, I lose all data relating to another_field when running the script. Looking at bulk docs, I can't find any options to simply update an existing doc.
I am unable to share the script, but hope this might be enough information to shine some light on possible solutions.
TLDR;
Yes, it is: use index, as the docs explain:
(Optional, string) Indexes the specified document. If the document exists, replaces the document and increments the version. The following line must contain the source data to be indexed.
But make sure to provide the _id of the document in case of an update.
To understand it, I created a toy project to replay the scenario:
# post a single document
POST /71177773/_doc
{
  "some_field": "data",
  "some_other_field": "data"
}
GET /71177773/_search
# try to "update" with out providing an id
POST /_bulk
{"index":{"_index":"71177773"}}
{"some_field":"data","some_other_field":"data","another_field":"data"}
# 2 Documents exist now
GET /71177773/_search
# Try the same command but provide the _id of the first document
POST /_bulk
{"index":{"_index":"71177773", "_id": "<Id of the document>"}}
{"some_field":"data","some_other_field":"data","another_field":"data"}
# It seems it worked
GET /71177773/_search
If your question was:
Is Elasticsearch smart enough to recognise that I want to update an existing document without providing the _id?
I am afraid it is not possible.

Elasticsearch remove field performance

I'm calling the Update API frequently to remove fields from documents. What is the performance difference between
"script" : "ctx._source[\"name_of_field\"] = null"
and
"script" : "ctx._source.remove(\"name_of_field\")"
To my understanding, the second one deletes the document and reindexes the entire document. What is the behaviour of the first one?

Logstash replace old index

I'm using Logstash to create an Elasticsearch index. The steps are:
1. Logstash starts
2. Data is retrieved with a JDBC input plugin
3. Data is indexed with an Elasticsearch output plugin (with a template that includes an alias)
4. Logstash stops
The first time, I get an index called myindex-1 which can be queried via the alias myindex.
The second time, I get an index called myindex-2 which can be queried via the alias myindex. The first index is now deprecated and I need to delete it just before (or after) step 4.
Do you know how to do this?
First things first, if you know the deprecated index name, then it's just a question of adding a step 5:
curl -XDELETE 'http://localhost:9200/myindex-1'
So you'd wrap your Logstash run in a script with this additional step; to my knowledge there is no option for Logstash to delete an index, as that is simply not its purpose.
But from the way you describe your situation, it seems you're trying to keep the data available during the creation of the new index. Could you elaborate a bit on your use case?
The reason for asking is that with the current procedure, you're likely to end up with duplicate data (old and new versions) during the indexing period.
If there is indeed a need to refresh the data, and assuming that you have an id in the data retrieved from the DB, you might consider another approach: configure two Elasticsearch outputs in your Logstash pipeline, the first one with action set to "delete", targeting the old entry in the previous index, and the second being your standard create into the new index.
Depending on the nature of your data, there might also be other possibilities.
1. Create and populate myindex-2, but don't alias it yet
2. Simultaneously add the alias myindex to myindex-2 and remove it from myindex-1
The REST request for step 2:
POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "myindex-1", "alias" : "myindex" } },
    { "add" : { "index" : "myindex-2", "alias" : "myindex" } }
  ]
}
Documentation here

combine fields of different documents in same index

I have two types of documents in my index:
doc1
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask"
}
doc2
{
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
Now I need a query that filters on the same url field and returns the fields of both documents:
result:
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
The results given by Elasticsearch are always per document: if multiple documents satisfy your query/filter, they will appear as separate documents in the result and are never merged into a single document. Hence merging them on the client side is one option you can use. To avoid fetching the complete documents and get only the relevant fields, you can use "fields" in your query.
If this is not what you need and you still want to narrow down the result from the query itself, you can use a top_hits aggregation. It will give you the complete list of documents under a single bucket, but each hit will also contain the full _source of the document.
Try giving a read to page:
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/search-aggregations-metrics-top-hits-aggregation.html
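A rough sketch of such an aggregation, grouping documents that share the same url value into one bucket (the index name is a placeholder, and it assumes the url field is indexed as not_analyzed so that whole URLs form the buckets):
POST /indexName/_search
{
  "size": 0,
  "aggs": {
    "by_url": {
      "terms": { "field": "url" },
      "aggs": {
        "docs_for_url": { "top_hits": { "size": 10 } }
      }
    }
  }
}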
