Elasticsearch _reindex API not copying documents - elasticsearch

I'm trying to upgrade an old Elasticsearch 1.5 index to 6.0. According to the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.0/reindex-upgrade.html),
I can create a new index in 6.0 and then populate it using reindex from remote (https://www.elastic.co/guide/en/elasticsearch/reference/6.0/reindex-upgrade-remote.html).
Both instances are running inside Docker containers; I wanted to test this locally before doing it in production.
I can see that there are documents indexed in my old index:
curl -XGET 'http://localhost:9200/old_index/_search?pretty'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "old_index",
"_type" : "item",
"_id" : "92",
"_score" : 1.0,
"_source":{"user_id":3,"slug":"asdfaisjeilej","name":"lake.jpgasdad","item_type":"image","created_at":"2018-01-23T18:11:30Z","deleted_at":null,"content_length":1252171}
}]}
}
After creating a new index (new_index) in my Elasticsearch 6.0 instance with a slightly different mapping (string types changed to text), I reindex from remote using the following command (note that the 6.0 instance is listening on port 9400):
curl -XPOST 'localhost:9400/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "http://localhost:9200"
},
"index": "old_index"
},
"dest": {
"index": "new_index"
}
}'
I get the following response:
{
"took" : 136,
"timed_out" : false,
"total" : 0,
"updated" : 0,
"created" : 0,
"deleted" : 0,
"batches" : 0,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
So basically, the documents from old_index are not being copied to new_index, and I have no idea why. Is there a step I'm missing? As far as I can tell, I'm following the Elasticsearch docs exactly.

As I mentioned, I had the same issue when migrating from Elasticsearch 2 to Elasticsearch 6, after testing remote reindexing in a staging environment without Docker.
My workaround was to create an instance of the old version (not on Docker), load it from a backup, and reindex from it into an Elasticsearch 6 instance that was also not running on Docker.
If you still want to run Elasticsearch 6 on Docker, you can always mount the data into your container.
I hope you find this helpful.
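A quick sanity check before and after a reindex is to compare document counts on both instances. The sketch below is only an illustration: the helper names parse_count and doc_count are made up here, and it assumes the _count endpoint returns a top-level "count" field, which should hold for both the old and the new instance.

```shell
#!/bin/sh
# Extract the numeric "count" field from a _count response on stdin.
# Hypothetical helper; works on both compact and pretty-printed JSON.
parse_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the document count of one index.
# $1 = base URL, $2 = index name (placeholders for illustration).
doc_count() {
  curl -s "$1/$2/_count" | parse_count
}
```

With the setup from the question this would be called as doc_count http://localhost:9200 old_index and doc_count http://localhost:9400 new_index. Running the first call from inside the Elasticsearch 6 container is worth trying as well, because with Docker's default bridge networking localhost there refers to that container rather than to the host.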

Related

Elasticsearch has_child not returning back all parent documents

Here is the mapping data for both customer and customer_query documents, where customer is the parent and customer_query the child document.
When I run a generic search against all customer_query documents, I get back 127 documents.
However, when I run the following query against the parent
curl -XGET "http://localhost:9200/fts_index/customer/_search" -d'
{
"query": {
"has_child" : {
"type" : "customer_query",
"query" : { "match_all": {} }
}
}
}'
I get back only 23 documents. There should be 127 documents returned, since each customer_query document has a unique parent ID assigned to it that matches a document of the customer type.
When I re-create my customer_query documents, I get a different number of documents back each time, which leads me to think it is some kind of shard issue. I have 5 shards assigned to the index.
{
"took" : 59,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 23,
"max_score" : 1.0,
"hits" : [ {
"_index" : "fts_index",
"_type" : "customer",
"_id" : "7579f2c0-e4e4-4374-82d7-bf4c508fc51d",
"_score" : 1.0,
"_routing" : "8754248f-1c51-46bf-970a-493c349c70a7",
"_parent" : "8754248f-1c51-46bf-970a-493c349c70a7",
....
I can't wrap my head around this issue. Any thoughts on what could be the issue? Is this a routing issue? If so, how do I rectify that with my search?

Entries of ElasticSearch hits

I am performing an ElasticSearch query through
curl -XGET 'http://localhost:9200/_search?pretty' -d '{"_source":["data.js_event","data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
I am ONLY interested in the output of _source, but the response looks something like this:
{"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 71,
"successful" : 71,
"failed" : 0
},
"hits" : {
"total" : 2062326,
"max_score" : 4.8918204,
"hits" : [ {
"_index" : "logstash-2015.11.22",
"_type" : "fluentd",
"_id" : "AVEv_blDT1yShMIEDDmv",
"_score" : 4.8918204,
"_source":{"data":{"js_event":"leave_page","timestamp":"2015-11-22T16:18:47.088Z","uid":"l4Eys1T/rAETpysn7E/Jog==","element":"body"}}
}, {
"_index" : "logstash-2015.11.21",
"_type" : "fluentd",
"_id" : "AVEnZa5nT1yShMIEDDW8",
"_score" : 4.0081544,
"_source":{"data":{"js_event":"hover","timestamp":"2015-11-21T00:15:15.097Z","uid":"E/4Fdl5uETvhQeX/FZIWfQ==","element":"infographic-new-format-selector"}}
},
...
Is there a way to get rid of _index, _type, _id and _score? The reason is that the query is performed on a remote server and I would like to limit the size of downloaded data.
Thanks.
Yes, you can use response filtering (only available since ES 1.6) by specifying what you want in the response (i.e. filter_path=hits.hits._source), and ES will filter the response for you:
curl -XGET 'http://localhost:9200/_search?filter_path=hits.hits._source&pretty' -d '{"_source":["data.js_event","data.timestamp","data.uid","data.element"],"query":{"match":{"event":"first_time_user_event"}}}'
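For scripting, the same request can be wrapped so that the JSON body needs no shell escaping at all. This is a rough sketch, not a definitive implementation: search_url and source_only_search are hypothetical helper names, and the base URL is a placeholder. The body is sent from a here-document via curl's -d @- (read data from standard input).

```shell
#!/bin/sh
# Build a search URL that asks Elasticsearch to return only the given
# response paths. $1 = base URL, $2 = filter_path value (placeholders).
search_url() {
  printf '%s/_search?filter_path=%s\n' "$1" "$2"
}

# Run the query from the question and keep only hits.hits._source.
# The quoted here-document delimiter stops the shell from touching the JSON.
source_only_search() {
  curl -s -XGET "$(search_url "$1" 'hits.hits._source')" \
    -H 'Content-Type: application/json' -d @- <<'JSON'
{
  "_source": ["data.js_event", "data.timestamp", "data.uid", "data.element"],
  "query": { "match": { "event": "first_time_user_event" } }
}
JSON
}
```

It would be called as source_only_search http://localhost:9200. Several paths can be combined in one filter_path value by separating them with commas, for example hits.total,hits.hits._source.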

Elasticsearch CouchDB River no hit

I have a problem with CouchDB and Elasticsearch. I use Docker to set this up. I have a working CouchDB container on the default port. Now I use this container:
registry.hub.docker.com/u/jeko/elasticsearch-river-couchdb/
And I add a new CouchDB connection with this:
curl -X PUT '127.0.0.1:9200/_river/testdb/_meta' -d ' { "type" : "couchdb", "couchdb" : { "host" : "couchdb", "port" : 5984, "db" : "articles", "filter" : null }, "index" : { "index" : "articles", "type" : "articles", "bulk_size" : "100", "bulk_timeout" : "10ms" } }'
to get a working Elasticsearch with the CouchDB river. I then checked the documents with curl host/articles/articles/_search?pretty=true. The hits are empty:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
I turned on debug logging and checked the log file. The output is here: http://pastebin.com/ETkNmJzT
The only conspicuous thing I found is this line: [2015-02-20 14:04:24,554][DEBUG][plugins ] [Arc] [/elasticsearch/plugins/river-couchdb/_site] directory does not exist.
But I don't understand why it doesn't work. I can curl the IP

Unable to search attachment type field in an ElasticSearch indexed document

Search does not return any results although I do have a document that should match the query.
I have the ElasticSearch mapper-attachments plugin installed per https://github.com/elasticsearch/elasticsearch-mapper-attachments. I have also googled the topic and browsed similar questions on Stack Overflow, but have not found an answer.
Here's what I typed into a Windows 7 command prompt:
c:\Java\elasticsearch-1.3.4>curl -XDELETE localhost:9200/tce
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce/contact/_mapping -d{\"contact\":{\"properties\":{\"my_attachment\":{\"type\":\"attachment\"}}}}
{"acknowledged":true}
c:\Java\elasticsearch-1.3.4>curl -XPUT localhost:9200/tce/contact/1 -d{\"my_attachment\":\"SGVsbG8=\"}
{"_index":"tce","_type":"contact","_id":"1","_version":1,"created":true}
c:\Java\elasticsearch-1.3.4>curl localhost:9200/tce/contact/_search?pretty
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "tce",
"_type" : "contact",
"_id" : "1",
"_score" : 1.0,
"_source":{"my_attachment":"SGVsbG8="}
} ]
}
}
c:\Java\elasticsearch-1.3.4>curl localhost:9200/tce/contact/_search?pretty -d{\"query\":{\"term\":{\"my_attachment\":\"Hello\"}}}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Note that the base64 encoded value of "Hello" is "SGVsbG8=", which is the value I have inserted into the "my_attachment" field of the document.
I am assuming that the mapper-attachments plugin has been deployed correctly because I don't get an error executing the mapping command above.
Any help would be greatly appreciated.
What analyzer is running against the my_attachment field?
If it's the standard analyzer (I can't see one listed), then the Hello in the text will be lowercased in the index.
That is, since a term search is not analyzed, try searching for hello:
curl 'localhost:9200/tce/contact/_search?pretty' -d'
{"query":
{"term":
{"my_attachment":"hello"
}}}'
You can also see which terms have been added to the index:
curl 'http://localhost:9200/tce/contact/_search?pretty=true' -d '{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "my_attachment"
}
}
}
}'
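The two claims in this thread (that "SGVsbG8=" is the base64 form of "Hello", and that the indexed term would be lowercased) can be checked locally without Elasticsearch. The sketch below is a rough stand-in only: decode_attachment and lowercase_term are made-up helper names, and the case folding imitates just one step of the standard analyzer, which also tokenizes the text.

```shell
#!/bin/sh
# Decode the base64 payload that was indexed into my_attachment.
decode_attachment() {
  printf '%s' "$1" | base64 --decode
}

# Approximate the lowercasing step of the standard analyzer.
# Real analysis also splits text into tokens; this only folds case.
lowercase_term() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

text=$(decode_attachment 'SGVsbG8=')
term=$(lowercase_term "$text")
# Prints "Hello" as the decoded text and "hello" as the term to search for.
printf 'decoded text: %s\nindexed term: %s\n' "$text" "$term"
```

This is why a term query for Hello finds nothing under that assumption, while a term query for hello, or a match query (which analyzes its input), would be expected to match.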

Elasticsearch index last update time

Is there a way to retrieve from ElasticSearch information on when a specific index was last updated?
My goal is to be able to tell when it was the last time that any documents were inserted/updated/deleted in the index. If this is not possible, is there something I can add in my index modification requests that will provide this information later on?
You can get the modification time from the _timestamp field.
To make it easier to return the timestamp, you can set up Elasticsearch to store it:
curl -XPUT "http://localhost:9200/myindex/mytype/_mapping" -d'
{
"mytype": {
"_timestamp": {
"enabled": "true",
"store": "yes"
}
}
}'
If I insert a document and then query it, I get the timestamp:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?pretty' -d '{
"fields" : ["_timestamp"],
"query": {
"query_string": { "query":"*"}
}
}'
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "myindex",
"_type" : "mytype",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"_timestamp" : 1417599223918
}
} ]
}
}
Updating the existing document:
curl -XPOST "http://localhost:9200/myindex/mytype/1/_update" -d'
{
"doc" : {
"field1": "data",
"field2": "more data"
},
"doc_as_upsert" : true
}'
Re-running the previous query shows me an updated timestamp:
"fields" : {
"_timestamp" : 1417599620167
}
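The _timestamp values above are epoch milliseconds, so they need converting before they read as dates. A minimal sketch, assuming a date command that accepts -d @SECONDS (GNU coreutils does); millis_to_utc is a hypothetical helper name:

```shell
#!/bin/sh
# Convert an epoch timestamp in milliseconds to an ISO 8601 UTC string.
# Assumes date(1) understands "-d @SECONDS", as GNU date does.
millis_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%dT%H:%M:%SZ'
}

millis_to_utc 1417599223918   # first value above: 2014-12-03T09:33:43Z
millis_to_utc 1417599620167   # value after the update: 2014-12-03T09:40:20Z
```

The millisecond part is dropped by the integer division, which is usually fine for a "last modified" check.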
I don't know if anyone is still looking for an equivalent, but here is a workaround using shard stats for users of Elasticsearch 5 and later:
curl -XGET 'http://localhost:9200/_stats?level=shards'
As you'll see, it returns per-index information on commits and/or flushes that you can use to tell whether an index has changed.
I hope this helps someone.
Just looked into a solution for this problem. Recent Elasticsearch versions have a <index>/_recovery API.
This returns a list of shards and a field called stop_time_in_millis which looks like it is a timestamp for the last write to that shard.
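If you want a single number out of that response, one rough way is to take the largest stop_time_in_millis across all shards. This is a sketch under assumptions: latest_stop_time and index_latest_stop_time are hypothetical helpers, the parsing is plain text matching rather than real JSON parsing, and whether the value really tracks the last write is only the guess made above.

```shell
#!/bin/sh
# Read a _recovery response on stdin and print the largest
# stop_time_in_millis value found in it.
latest_stop_time() {
  grep -o '"stop_time_in_millis"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$' \
    | sort -n \
    | tail -n 1
}

# Hypothetical wrapper: $1 = base URL, $2 = index name.
index_latest_stop_time() {
  curl -s "$1/$2/_recovery" | latest_stop_time
}
```

It would be called as index_latest_stop_time http://localhost:9200 myindex, and the result is epoch milliseconds.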

Resources