Elasticsearch reindex API - not able to copy all the documents

I have set up the destination index new_dest_index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.
I ran the POST command below to copy all the documents from source_index to new_dest_index, but it looks like it runs in the background and copies only some of the documents, not all of the data from source_index.
Can someone please help, and is there a better way to copy from one index to another?
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "new_dest_index"
  }
}

I think this is the best way to copy from one index to another.
The reindex process copies documents in batches (the default batch size is 1,000, configurable via size in the source element) from one index to the other. You are not seeing all the documents in the destination index because the task hasn't finished yet (in the best of cases).
You can always list the running reindex tasks with the cat tasks API:
GET _cat/tasks?v
If you see a reindex task in the output, it hasn't finished and you have to wait a little longer. These processes can take minutes or even hours, depending on the number of documents to copy.
However, if you don't see it listed and the document count in one index does not match the count in the other, the reindex process failed and has to be run again.
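Since the question mentions that _reindex appears to run in the background: by default the call blocks until it finishes (the HTTP client may just time out), but you can run it explicitly as a background task and poll it, as in this sketch using the index names from the question:
POST _reindex?wait_for_completion=false
{
  "source": { "index": "source_index" },
  "dest": { "index": "new_dest_index" }
}

GET _tasks?detailed=true&actions=*reindex
The first call returns a task id, which you can also query directly with GET _tasks/<task id>.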
That last scenario is a bummer when you want to copy all the documents without restrictions.
A way to avoid that last scenario is to reindex with queries. You can, for instance, run a reindex task for all the documents from January to March, another one for documents from April to June, and so on; a sketch follows below.
You can run several reindex tasks as long as their queries don't overlap. Be mindful with this, because having too many tasks could affect the performance or the health of your cluster.
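A minimal sketch of one such slice, assuming a date field named @timestamp (adjust the field name and date range to your mapping):
POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2016-01-01",
          "lt": "2016-04-01"
        }
      }
    }
  },
  "dest": {
    "index": "new_dest_index"
  }
}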
Hope this is helpful! :)

Kevin has already covered the case where the reindex task has not finished yet, so I'll answer for the case where the reindex process has finished.
Note that the _reindex API can cause data inconsistency: it effectively works from a view of source_index taken when it starts, so writes (newly inserted or updated documents) that hit source_index right after _reindex is triggered are not applied to new_dest_index.
For example, before you run the _reindex, you add a document:
PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version1"
}

// response
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": true
}
Then you trigger the _reindex API, and while it is running you update your document:
PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version2"
}

// response
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": false
}
But after the _reindex has finished, you check the version of the document in new_dest_index and still see the old one:
{
  "_index": "new_dest_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "found": true,
  "_source": {
    "id": 3,
    "searchable_name": "version1"
  }
}
The same problem can happen for documents inserted after _reindex is triggered.
One solution is to do the first reindex with the "version_type": "external" setting for new_dest_index, which preserves the document versions from source_index. After you move your write traffic over to new_dest_index, you can run _reindex again from source_index to new_dest_index: with external versioning, only documents whose source version is newer than the destination's are copied, which picks up the updates that were missed after _reindex was first triggered.
You can check these settings in the _reindex docs.
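A minimal sketch of that catch-up pass, using the index names from the question; "conflicts": "proceed" keeps the run going when the destination already holds an equal or newer version:
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "new_dest_index",
    "version_type": "external"
  }
}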

Related

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently of the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": { "type": "keyword" },
        "description": { "type": "text" }
      }
    }
  }
}
Two sample documents:
POST /sample/doc
{
  "name": "Jack Beauregard",
  "description": "An aging hero"
}

POST /sample/doc
{
  "name": "Master Splinter",
  "description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
  "query": {
    "match": { "description": "hero" }
  },
  "_source": false
}
... which gives me:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.22396864,
    "hits": [
      {
        "_index": "sample",
        "_type": "doc",
        "_id": "hoDsm2oB22SyyA49oDe_",
        "_score": 0.22396864
      },
      {
        "_index": "sample",
        "_type": "doc",
        "_id": "h4Dsm2oB22SyyA49xDf8",
        "_score": 0.22227617
      }
    ]
  }
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum over all documents), which would be 3 in this example, because the second document contains the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be information ES comes across during search anyway, right? I mean, it ranks documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (and not combine them), then post-process the complete set of search results and count all the highlights. Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results by setting size to a high enough value, while actually I only want the number of results requested by the client. This would be a lot of overhead!
The inner_hits feature sounds very promising, but it just means that you can handle the hits inside nested documents independently, e.g. to get highlighting for each of them. I already use this for my nested docs, but it doesn't solve this problem because (1) it still operates at the level of whole nested documents rather than individual term matches and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term-frequency calculation in the results of the explain API. See What is Relevance? for a conceptual explanation and the Explain API docs for usage.
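To sketch that approach with the sample index from the question: adding "explain": true to the search request makes each hit carry a scoring explanation tree, and the term-frequency node inside it (a description containing something like "termFreq=2.0") exposes how often the term occurs in each document. You would have to parse those description strings client-side, which is part of why this is impractical:
POST /sample/_search
{
  "explain": true,
  "query": {
    "match": { "description": "hero" }
  },
  "_source": false
}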

Elasticsearch query not returning _scroll_id for scroll query

We have an Elasticsearch cluster which all seems to be working fine, except that scrolling does not work. When I do a query with a ?scroll=1m query string, no _scroll_id is returned in the results.
To check if it was anything to do with the existing Indexes I created a new Index:
PUT scroll_test

POST scroll_test/1
{
  "foo": "bar"
}

POST scroll_test/2
{
  "foo": "baz"
}

POST /scroll_test/_search?scroll=1m
{
  "size": 1,
  "query": {
    "match_all": {}
  }
}
returns
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "scroll_test",
        "_type": "1",
        "_id": "AV0N_R0jl33mdjPtW4uQ",
        "_score": 1,
        "_source": {
          "foo": "bar"
        }
      }
    ]
  }
}
We have just done a rolling upgrade from v5.2 to v5.4.3 (cluster health is now green). Scrolling still does not work after upgrading to v5.4.3.
I am able to execute scroll based queries on a local Elasticsearch v5.4.2 instance.
After reading a lot of other questions, I took away these main ideas:
Aggregations can't scroll: the query I copied from the Kibana "Discover" page's "Inspect" button contained an aggregation. I don't know what it was doing, and I was able to remove it with seemingly fine results.
Don't use scroll, just use search_after: the docs state: "We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT)." A sketch of that flow follows below.
I don't know if aggregations also miss out on search_after, but I am playing it safe by not using them for now.
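For what it's worth, here is a minimal sketch of the search_after + PIT flow the docs describe, on a recent Elasticsearch (7.10 or later); the index name my-index and the page size are assumptions:
# 1) open a point in time against the index
POST /my-index/_pit?keep_alive=1m

# 2) first page: search with the PIT id and a tiebreaker sort
POST /_search
{
  "size": 100,
  "query": { "match_all": {} },
  "pit": { "id": "<pit id returned by step 1>", "keep_alive": "1m" },
  "sort": [ { "_shard_doc": "asc" } ]
}

# 3) subsequent pages: repeat the search, adding the last hit's sort values:
#    "search_after": [<sort values of the last hit>]
Note that the PIT search goes to /_search without an index name, since the PIT id already pins the target indices.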

Removing From ElasticSearch by type last 7 day

I have different logs in Elasticsearch 2.2, separated by 'type'. How can I delete all data of only one type that is older than one week? Thanks.
Example of logs:
{
  "_index": "logstash-2016.02.23",
  "_type": "dns_ns",
  "_id": "AVMOj--RqgDl5Axva2Nt",
  "_score": 1,
  "_source": {
    "@version": "1",
    "@timestamp": "2016-02-23T14:37:07.029Z",
    "type": "dns_ns",
    "host": "11.11.11.11",
    "clientip": "22.22.22.22",
    "queryname": "api.bing.com",
    "zonetype": "Public_zones",
    "querytype": "A",
    "querytype2": "+ED",
    "dnsip": "33.33.33.33"
  },
  "fields": {
    "@timestamp": [
      1456238227029
    ]
  }
}
See here or here on how to delete by query. In Elasticsearch 2.x, you might find the Delete by Query plugin useful; a sketch follows below.
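With that plugin installed, a hedged sketch of deleting only one type older than a week, assuming the @timestamp field from the log sample above and daily logstash-* indices:
DELETE /logstash-*/dns_ns/_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-7d"
      }
    }
  }
}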
Deleting "types" is no longer directly supported in ES 2.x A better plan is to have rolling indexes, that way deleting indexes older than 7 days becomes very easy.
Take the example of logstash, it creates an index for every day. You can then create an alias for logstash so that it queries all indexes. And then when it comes time to delete old data you can simply remove the entire index with:
DELETE logstash-2015-12-16
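A minimal sketch of that alias setup (the alias name all-logstash is assumed; note the wildcard is expanded when the action runs, so newly created daily indices need the alias added too, which an index template can automate):
POST /_aliases
{
  "actions": [
    { "add": { "index": "logstash-*", "alias": "all-logstash" } }
  ]
}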

Elasticsearch does not return existing document sometimes

For an existing document, when I run the following cURL command, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "exists": false
}
When I run the same command a second time, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "_version": 1,
  "exists": true,
  "_source": {
    "id": "573553",
    "name": "hVTHc",
    "price": "21053",
    "desc": "VGNHNXkAAcVblau"
  }
}
I am using Elasticsearch 0.90.11 on Ubuntu 12.04.
Could anyone please help me figure out this problem?
I have seen cases where Elasticsearch shards can get out of sync during network partitions or under very high add/update/delete volume (the same document getting updated/deleted/added within milliseconds of each other, potentially racing). There is no clean way to merge diverged shard copies; instead, you effectively pick a winner at random. One way to check whether this is happening is to repeatedly run a match-all query and see whether the results jump around at all, as in the sketch below.
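For example (a sketch, assuming the index test from the question): run this a few times in a row and compare hits.total. On a healthy index the count is stable; if it alternates between values, different shard copies are answering with different data:
curl -XGET "http://localhost:9200/test/_search" -d '{
  "query": { "match_all": {} },
  "size": 0
}'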
If you want to roll the dice and see what happens, you can set replicas down to 0 and then bump the count back up to whatever value you were using; the replicas are then rebuilt from the primaries.
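A sketch of that replica reset, assuming the index is test and it normally runs one replica:
curl -XPUT "http://localhost:9200/test/_settings" -d '{
  "index": { "number_of_replicas": 0 }
}'

curl -XPUT "http://localhost:9200/test/_settings" -d '{
  "index": { "number_of_replicas": 1 }
}'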
While this may not be the reason for your issue, it is worth noting that this is one of the reasons not to depend on Elasticsearch as your primary source of truth.

ElasticSearch returns document in search but not in GET

I'm doing a search of documents in my index, then subsequently trying to get some of them by _id. Despite the search returning a set of results, some of the documents cannot be retrieved with a simple GET. Worse still, I CAN get the same document with a URI search for ?q=_id:<the id>.
Just for example, running a simple GET
curl -XGET 'http://localhost:9200/keepbusy_process__issuer_application/KeepBusy__Activities__Activity/neHSKSBCSv-OyAYn3IFcew'
Gives me the result:
{
  "_index": "keepbusy_process__issuer_application",
  "_type": "KeepBusy__Activities__Activity",
  "_id": "neHSKSBCSv-OyAYn3IFcew",
  "exists": false
}
But if I do a search with same _id:
curl -XGET 'http://localhost:9200/keepbusy_process__issuer_application/KeepBusy__Activities__Activity/_search?q=_id:neHSKSBCSv-OyAYn3IFcew'
I get the expected result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "keepbusy_process__issuer_application",
        "_type": "KeepBusy__Activities__Activity",
        "_id": "neHSKSBCSv-OyAYn3IFcew",
        "_score": 1.0,
        "_source": {
          "template_uid": "KeepBusy__Activities__Activity.create application",
          "name": "create application",
          "updated_at": "2014-01-08T10:02:33-05:00",
          "updated_at_ms": 1389193353975
        }
      }
    ]
  }
}
I'm indexing documents through the Stretcher Ruby API, and immediately after indexing I'm doing a refresh. My local setup is 2 nodes. I'm running v0.90.9.
There is nothing obvious in the logs to explain why this should fail. I've restarted the cluster and everything appears to start correctly, but the result is the same.
Is there something I'm missing or some way I can further diagnose this issue?
This issue typically occurs when documents are indexed with non-default routing (either set explicitly or derived from the parent's id in the case of parent/child documents). If this is the case, try specifying the correct routing in your GET request.
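For example, a hedged sketch against the document from the question: for parent/child mappings the routing value is usually the parent document's id, so the <parent id> placeholder below is purely illustrative and needs to be replaced with the real value:
curl -XGET 'http://localhost:9200/keepbusy_process__issuer_application/KeepBusy__Activities__Activity/neHSKSBCSv-OyAYn3IFcew?routing=<parent id>'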