Elasticsearch does not return existing document sometimes - elasticsearch

When I run the following cURL command for an existing document, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "exists": false
}
When I run the same command a second time, I get:
root@unassigned-hostname:~# curl -XGET "http://localhost:9200/test/test3/573553/"
result:
{
  "_index": "test",
  "_type": "test3",
  "_id": "573553",
  "_version": 1,
  "exists": true,
  "_source": {
    "id": "573553",
    "name": "hVTHc",
    "price": "21053",
    "desc": "VGNHNXkAAcVblau"
  }
}
I am using Elasticsearch 0.90.11 on Ubuntu 12.04.
Could anyone please help me figure out this problem?

I have seen cases where Elasticsearch shards can get out of sync during network partitions or under very high add/update/delete volume (the same document getting updated/deleted/added within milliseconds of each other, potentially racing). There is no clean way to merge the shards; instead, you just randomly choose a winner. One way to check whether this is the case is to repeatedly run a match-all query and see whether the results jump around at all.
If you want to roll the dice and see what happens, you can set replicas down to 0 and then bump them back up to whatever value you were using.
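For example, a minimal sketch of both checks, assuming the test index from your question and that you originally had one replica:
POST /test/_search
{
  "query": { "match_all": {} }
}
//run this a few times; if the total hit count jumps around, the shards likely disagree
PUT /test/_settings
{
  "index": { "number_of_replicas": 0 }
}
PUT /test/_settings
{
  "index": { "number_of_replicas": 1 }
}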
While this may not be the reason for your issue, it is worth noting that this is one of the reasons not to depend on Elasticsearch as your primary source of truth.

Related

Elasticsearch reindex api -not able to copy all the documents

I have set up the destination index new_dest_index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.
I ran the POST command below to copy all the documents from source_index to new_dest_index, but it looks like it runs in the background and copies only some of the documents, not all the data from source_index.
Can someone please help, and also tell me if there are any better ways to copy from one index to another?
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "new_dest_index"
  }
}
I think this is the best way to copy from one index to another.
The reindex process, if I remember correctly, copies documents in batches of 1,000 at a time from one index to the other. You are not seeing all the documents in the destination index because the task hasn't finished (in the best of cases).
You can always list the reindex tasks with _cat/tasks like:
GET _cat/tasks?v
If you see a reindex task in the output, it hasn't finished and you have to wait a little longer. These processes can take minutes, even hours, depending on the number of documents to copy.
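If you would rather track one specific reindex instead of scanning the whole task list, you can start it with wait_for_completion=false and poll the task id it returns; a rough sketch (the task id shown is made up):
POST _reindex?wait_for_completion=false
{
  "source": { "index": "source_index" },
  "dest": { "index": "new_dest_index" }
}
//responds with something like {"task": "oTUltX4IQMOUUVeiohTt8A:12345"}
GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345
The task status includes counters such as total and created, so you can estimate how far along the copy is.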
However, if you don't see it listed and the document count in one index does not match the number of copied documents in the other, the reindex process failed and has to be run again.
That last scenario is a bummer when you want to copy all the documents without restrictions.
A way to avoid that last scenario is to reindex with queries: you can, for instance, run one reindex task for all the documents from January to March, another one for the documents from April to June, and so on (see the sketch below).
You can run several reindex tasks as long as their ranges don't overlap. Be mindful with this, though, because having too many tasks running at once could affect the performance or the health of your cluster.
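For instance, a sketch of one of those range-restricted reindex calls, assuming a date field named timestamp in source_index (adjust the field name and bounds to your mapping):
POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "range": {
        "timestamp": {
          "gte": "2021-01-01",
          "lt": "2021-04-01"
        }
      }
    }
  },
  "dest": {
    "index": "new_dest_index"
  }
}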
Hope this is helpful! :)
Kevin has already covered the case where the reindex task has not finished yet; I will cover the case where the reindex process has finished.
Note that the _reindex API can cause data-consistency problems: writes to source_index (newly inserted or updated documents) that happen right after _reindex is triggered are not applied to new_dest_index.
For example, before you run the _reindex, you add a document:
PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version1"
}
//responses
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": true
}
Then you trigger the _reindex API, and after triggering _reindex you update your document:
PUT source_index/doc/3
{
  "id": 3,
  "searchable_name": "version2"
}
//responses
{
  "_index": "source_index",
  "_type": "doc",
  "_id": "3",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": false
}
But after the _reindex finishes, when you check the document in new_dest_index, it still holds the old version:
{
  "_index": "new_dest_index",
  "_type": "doc",
  "_id": "3",
  "_version": 1,
  "found": true,
  "_source": {
    "id": 3,
    "searchable_name": "version1"
  }
}
The same problem can happen for documents that are inserted after _reindex is triggered.
One solution is to run the first reindex with version_type set to external for new_dest_index, so that the destination keeps the document versions from source_index; then, after you switch your write traffic over to new_dest_index, you can run _reindex from source_index to new_dest_index once more to pick up the updates that were missed after the first _reindex was triggered.
You can check these settings in the docs here.
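A rough sketch of that approach, reusing the index names from the question:
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "new_dest_index",
    "version_type": "external"
  }
}
With "version_type": "external", the second run only creates documents that are missing in new_dest_index and updates documents whose version in source_index is higher, so it picks up the writes that landed after the first pass without clobbering anything newer in the destination.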

What does @ mean in Elasticsearch documents?

My question is: "What does the @ mean in Elasticsearch documents?" @timestamp automatically gets created along with @version. Why is this and what's the point?
Here is some context... I have a web app that writes logs to files. Then I have Logstash forward these logs to Elasticsearch. Finally, I use Kibana to visualize everything.
Here is an example of one of the documents in elastic search:
{
  "_index": "logstash-2018.02.17",
  "_type": "doc",
  "_id": "0PknomEBajxXe2bTzwxm",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2018-02-17T05:06:13.362Z",
    "source": "source",
    "@version": "1",
    "message": "message",
    "env": "development",
    "host": "127.0.0.1"
  },
  "fields": {
    "@timestamp": [
      "2018-02-17T05:06:13.362Z"
    ]
  },
  "sort": [
    1518843973362
  ]
}
@ fields are usually metadata fields generated by Logstash: @timestamp is the time at which the event was processed by Logstash. Similarly, @version is also added by Logstash to denote the version of the event format.
Here is the reference.
The @ fields are metadata created by Logstash. They are part of the data itself.
More info is here.

Using Timelion in ElasticSearch/Kibana 5.0

I'm trying to visualize a time series in Timelion. I have a few hundred data points in Elasticsearch with this sort of format (I've manually removed some fields which I never meant to use in the time-series plot):
"_index": "foo-2016-11-06",
"_type": "bar",
"_id": "7239171989271733678",
"_score": 1,
"_source": {
"timestamp": "2016-11-06T15:27:37.123581+00:00",
"rank": 2,
}
What I want is quite simply to plot the change in rank over time. I found this post, Kibana Timelion plugin how to specify a field in the elastic search, which seems to describe the same thing, and I understand I should be able to just do .es(metric='sum:rank').
My problem is that no matter how I define my timelion query (even just calling .es(*)), I end up just getting a horizontal line where y=0.
[Screenshot: Timelion chart showing only a flat horizontal line at y=0]
Things I've tried so far:
Changed the timefield in timelion.json from the default "@timestamp" to just "timestamp"
Extending the timeseries window (even into the future)
Set default_index to _all in timelion.json
Queried specific indices that I know contain data
All of them give me the same outcome, which you can see in the attached picture. Does anyone have any idea what might be going on here?
Set up timelion.json as follows:
{
  "quandl": {
    "key": ""
  },
  "es": {
    "timefield": "timestamp",
    "default_index": "_all",
    "allow_url_parameter": false
  },
  "graphite": {
    "url": "https://www.hostedgraphite.com/UID/ACCESS_KEY/graphite"
  },
  "default_interval": "1h",
  "max_buckets": 2000
}
Then set the granularity to 'Auto' and use this Timelion query: .es(index='foo-2016-11-06', metric='max:rank').
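If you prefer not to edit timelion.json at all, the .es() function also accepts the index and timefield as arguments, so a query along these lines should work as well (index and field names taken from the question):
.es(index='foo-2016-11-06', timefield='timestamp', metric='max:rank')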

Removing From ElasticSearch by type last 7 day

I have different logs in Elasticsearch 2.2, separated by 'type'. How can I delete all the data for only one type that is older than one week? Thanks.
Example of logs:
{
  "_index": "logstash-2016.02.23",
  "_type": "dns_ns",
  "_id": "AVMOj--RqgDl5Axva2Nt",
  "_score": 1,
  "_source": {
    "@version": "1",
    "@timestamp": "2016-02-23T14:37:07.029Z",
    "type": "dns_ns",
    "host": "11.11.11.11",
    "clientip": "22.22.22.22",
    "queryname": "api.bing.com",
    "zonetype": "Public_zones",
    "querytype": "A",
    "querytype2": "+ED",
    "dnsip": "33.33.33.33"
  },
  "fields": {
    "@timestamp": [
      1456238227029
    ]
  }
}
See here or here on how to delete by query. In Elasticsearch 2.*, you might find the Delete by Query plugin useful.
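With that plugin installed, a request along these lines should remove only dns_ns documents older than a week (the index pattern and field name are taken from the example document above; treat this as a sketch and verify it on a test index first):
DELETE /logstash-*/dns_ns/_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-7d"
      }
    }
  }
}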
Deleting "types" is no longer directly supported in ES 2.x A better plan is to have rolling indexes, that way deleting indexes older than 7 days becomes very easy.
Take the example of logstash, it creates an index for every day. You can then create an alias for logstash so that it queries all indexes. And then when it comes time to delete old data you can simply remove the entire index with:
DELETE logstash-2015-12-16
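As for the alias mentioned above, a minimal sketch of adding one daily index to a shared alias via the _aliases API (index name borrowed from the example document in the question):
POST /_aliases
{
  "actions": [
    { "add": { "index": "logstash-2016.02.23", "alias": "logstash" } }
  ]
}
Searches against the logstash alias then span every index you have added to it, while deletes still target the individual daily indices.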

Discrepancies in ElasticSearch Results

I have a relatively simple search index built up for simple, plain text queries. No routing, custom analyzers or anything like that. One search instance/node, one index.
There are docs within the index that I have deleted, and the RESTful API confirms that:
GET /INDEX_NAME/person/464
{
  "_index": "INDEX_NAME",
  "_type": "person",
  "_id": "464",
  "exists": false
}
However, the doc is still being returned by a simple search:
POST /INDEX_NAME/person/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "person.offices",
            "query": "Chicago"
          }
        }
      ]
    }
  }
}
One of the rows that is returned:
{
  "_index": "INDEX_NAME",
  "_type": "person",
  "_id": 464,
  "_score": null,
  "fields": [
    ...
  ]
}
I'm new to ElasticSearch and thought I finally had a grasp of the basic concepts before digging deeper. But I'm not sure why a document that isn't accessible via REST is still appearing in the search results.
I'm also running into the reverse issue where docs are returned from the API but they are not being returned in the search. For the sake of clarity I am considering that a separate issue for the time being, but I have a feeling that these two issues might be related.
Part of me wants to delete my index and rebuild it, but I don't want to get into the same situation in a few days (and I'm not sure if that would even help).
Any ideas or pointers on why this discrepancy might be happening? Maybe a process is in some zombie state and elasticsearch just needs to be restarted?
