Finding multiple Elasticsearch documents with same ids, different types - elasticsearch

I need to find out whether any documents with a certain id have already been indexed in my ES database, so that I can delete them before indexing a new document.
The trouble is that I do not know a priori which type they were indexed as.
I found the _mget query, which sounds like it could be what I need, but this quote from the documentation says I only get one (random) hit when searching without a type:
If you don’t set the type and have many documents sharing the same
_id, you will end up getting only the first matching document.
How can I find all documents sharing an _id (possibly more than one, with different _type values, in the same index) without an expensive _search query?
Thanks!

A simple term query on "_id" worked for me.
So I created a trivial index and added two documents each, for two different types:
PUT /test_index
POST /test_index/_bulk
{"index":{"_type":"type1","_id":1}}
{"name":"type1 doc1"}
{"index":{"_type":"type1","_id":2}}
{"name":"type1 doc2"}
{"index":{"_type":"type2","_id":1}}
{"name":"type2 doc1"}
{"index":{"_type":"type2","_id":2}}
{"name":"type2 doc2"}
And this query will return both documents with id 1:
POST /test_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "_id": "1"
        }
      }
    }
  }
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "type1",
"_id": "1",
"_score": 1,
"_source": {
"name": "type1 doc1"
}
},
{
"_index": "test_index",
"_type": "type2",
"_id": "1",
"_score": 1,
"_source": {
"name": "type2 doc1"
}
}
]
}
}
Here's the code I used:
http://sense.qbox.io/gist/a8085b57c22631148dd4c67769307caf6425fd95
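Since the goal in the question is to delete whatever the search finds, here is a minimal Python sketch of the follow-up step. It assumes you already have the parsed search response as a dict shaped like the one above; the function name bulk_delete_body is made up for illustration, and sending the body to the _bulk endpoint is left to whichever HTTP client you use.

```python
import json


def bulk_delete_body(search_response):
    """Turn the hits of a search response into a _bulk body of delete actions."""
    lines = []
    for hit in search_response["hits"]["hits"]:
        # One delete action per hit, addressed by index, type and id.
        action = {
            "delete": {
                "_index": hit["_index"],
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action))
    # The bulk API expects newline-delimited JSON with a trailing newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    response = {
        "hits": {
            "hits": [
                {"_index": "test_index", "_type": "type1", "_id": "1"},
                {"_index": "test_index", "_type": "type2", "_id": "1"},
            ]
        }
    }
    print(bulk_delete_body(response), end="")
```

The resulting text can be sent as the body of POST /_bulk, so every type that holds the id is cleaned up in one request.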

Related

Elasticsearch search for a child and all his sibling documents grouped by parent

I would like to submit a query that matches on child documents and returns the parent together with all of its child documents.
I have parent and child documents in my Elasticsearch index, related through a join field: https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html?baymax=rec&rogue=rec-1&elektra=guide.
My items are divided into groups, and each item in my index is a separate child document. (Note: the children must be searchable separately with different queries, so I can NOT use nested objects.) The parent document contains a few meaningful fields (name, sku, image), so I need to get the parent along with its children.
I've met these requirements with the following query:
GET my_index/_search
{
  "query": {
    "has_child": {
      "type": "child",
      "query": {
        "has_parent": {
          "parent_type": "parent",
          "query": {
            "has_child": {
              "type": "child",
              "query": {
                "multi_match": {
                  "query": "NV1540JR",
                  "fields": [
                    "name",
                    "sku"
                  ]
                }
              }
            }
          }
        }
      },
      "inner_hits": {}
    }
  }
}
It returns the following result, which is exactly what I need:
{
"took": 301,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9GEAT",
"_score": 1.0,
"_source": {
"id": "Az9GEAT",
"name": "Gold Calacatta 2.0",
"sku": "NV1540",
"my_join-field": "parent"
},
"inner_hits": {
"child": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "zx9EEAR",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "zx9EEAR",
"name": "Gold Calacatta 12\" x 24\"",
"sku": "NV1540M-2",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9NEAT",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "Az9NEAT",
"name": "Gold Calacatta 2.0, 24\" x 48\"",
"sku": "NV1540JR",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
}
]
}
}
}
}
]
}
}
Alternatively, I could implement an application-side join by making three separate queries (one to get all matching documents, a second to get the siblings, a third to get the parents) and combining the results in my application. But I'm not sure that would be faster, because of the HTTP request time and the data processing time.
I'm very new to Elasticsearch and can't estimate how bad this query is. How does it affect query performance? Are there other ways to get the desired result? Or how could my query be improved? I'd be glad to hear any suggestions or thoughts! Thanks
In ES it is standard practice to retrieve a list of object ids first and then perform a second request to fetch the complete document set.
You can implement your logic using two queries:
Request (1): fetch all documents satisfying your child search criteria. Select only the child.id and child.parent_id fields so that only index data is loaded and no document _source is read. This request will be relatively fast.
In your application code, determine the unique list of parent_ids and orphaned_child_ids.
Request (2): fetch all documents satisfying the criteria: parent_id in parent_ids OR (parent_id = NULL AND child_id in orphaned_child_ids).
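The application-side step between the two requests could be sketched in Python roughly like this. It is only an illustration under assumptions: the hits of request (1) are assumed to expose the join field the way the question's response does (a "my_join-field" object with a "parent" key), the helper names are invented, and the second request is expressed with the ids and parent_id queries rather than the pseudo-criteria above.

```python
def collect_ids(first_hits):
    """Split the hits of request (1) into unique parent ids and orphaned child ids."""
    parent_ids, orphaned_child_ids = [], []
    for hit in first_hits:
        join = hit.get("_source", {}).get("my_join-field")
        if isinstance(join, dict) and join.get("parent") is not None:
            # A child document: remember its parent once.
            if join["parent"] not in parent_ids:
                parent_ids.append(join["parent"])
        elif isinstance(join, str):
            # The hit is itself a parent document (join field is just its name).
            if hit["_id"] not in parent_ids:
                parent_ids.append(hit["_id"])
        elif hit["_id"] not in orphaned_child_ids:
            # No parent information at all: treat it as an orphaned child.
            orphaned_child_ids.append(hit["_id"])
    return parent_ids, orphaned_child_ids


def build_second_request(parent_ids, orphaned_child_ids, child_type="child"):
    """Build the body of request (2): the parents, their children, and the orphans."""
    should = [{"parent_id": {"type": child_type, "id": pid}} for pid in parent_ids]
    wanted_ids = parent_ids + orphaned_child_ids
    if wanted_ids:
        should.append({"ids": {"values": wanted_ids}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    hits = [
        {"_id": "Az9NEAT", "_source": {"my_join-field": {"name": "child", "parent": "Az9GEAT"}}},
        {"_id": "zx9EEAR", "_source": {"my_join-field": {"name": "child", "parent": "Az9GEAT"}}},
    ]
    parents, orphans = collect_ids(hits)
    print(build_second_request(parents, orphans))
```

Whether this beats the single nested has_child/has_parent query depends on your data, so it is worth measuring both.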

Delete Indexes by index name and type using elasticSearch 2.3.3 in java

I have a Java project in which I index data using Elasticsearch 2.3.3. The indexed documents are of two types.
A search over my indexed documents looks like this:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "movies",
"_id": "uReb0g9KSLKS18sTATdr3A",
"_score": 1,
"_source": {
"genre": "Thriller"
}
},
{
"_index": "test_index",
"_type": "drama",
"_id": "cReb0g9KSKLS18sTATdr3B",
"_score": 1,
"_source": {
"genre": "SuperNatural"
}
},
{
"_index": "index1",
"_type": "drama",
"_id": "cReb0g9KSKLS18sT76ng3B",
"_score": 1,
"_source": {
"genre": "Romance"
}
}
]
}
}
I need to delete only the documents with a particular index name and type.
For example, from the documents above, I want to delete those with index name "test_index" and type "drama".
So the result should look like:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "movies",
"_id": "uReb0g9KSLKS18sTATdr3A",
"_score": 1,
"_source": {
"genre": "Thriller"
}
},
{
"_index": "index1",
"_type": "drama",
"_id": "cReb0g9KSKLS18sT76ng3B",
"_score": 1,
"_source": {
"genre": "Romance"
}
}
]
}
}
Solutions tried:
client.admin().indices().delete(new DeleteIndexRequest("test_index")).actionGet();
But it deletes everything in "test_index", regardless of type.
I have also tried various queries in the Sense beta plugin, like:
DELETE /test_index/drama
It gives the error: No handler found for uri [/test_index/drama] and method [DELETE]
DELETE /test_index/drama/_query?q=_id:*&analyze_wildcard=true
It also doesn't work.
At the time I fire the delete request, the document ids are unknown to me, so I have to delete by index name and type only.
How can I delete the required documents using the Java API?
This was possible before ES 2.0 using the Delete Mapping API; however, since 2.0 the Delete Mapping API no longer exists.
To do this you will have to install the Delete by Query plugin. Then you can simply run a match_all query on your index and type and delete all matching documents.
The query will look something like this:
DELETE /test_index/drama/_query
{
  "query": {
    "match_all": {}
  }
}
Also keep in mind that this will delete the documents in the mapping and not the mapping itself. If you want to remove the mapping too you'll have to reindex without the mapping.
This might be able to help you with the java implementation
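To keep the index and type from being hard-coded, the same request can be assembled programmatically. The sketch below is in Python rather than Java and only builds the method, path and body; the function name is hypothetical, and it assumes the Delete by Query plugin is installed so that the /_query endpoint exists.

```python
import json


def delete_by_query_request(index, doc_type):
    """Build the DELETE request that removes every document of one type in one index."""
    path = "/{}/{}/_query".format(index, doc_type)
    # match_all selects every document; the index/type in the path narrows the scope.
    body = {"query": {"match_all": {}}}
    return "DELETE", path, json.dumps(body)


if __name__ == "__main__":
    method, path, body = delete_by_query_request("test_index", "drama")
    print(method, path)
    print(body)
```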

Custom scoring function in Elasticsearch does not return expected field value

I created a custom scoring function for my documents that just returns the value of the field a for each document. But for some reason, in the example below, the last digits of the _score in the results differ from the last digits of the value of a for each document. What is happening here?
PUT test/doc/1
{
"a": 851459198
}
PUT test/doc/2
{
"a": 984968088
}
GET test/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "doc[\"a\"].value"
        }
      }
    }
  }
}
That will return the following:
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 984968060,
"hits": [
{
"_index": "test",
"_type": "doc",
"_id": "2",
"_score": 984968060,
"_source": {
"a": 984968088
}
},
{
"_index": "test",
"_type": "doc",
"_id": "1",
"_score": 851459200,
"_source": {
"a": 851459198
}
}
]
}
}
Why is the _score different than the value of the field a?
I'm using Elasticsearch 2.1.1
The _score value is internally hard coded as a 32-bit float, which can represent every integer exactly only up to 16777216 (2^24). Therefore, if the scoring function returns a field holding a larger number, the value is rounded to the nearest representable float and its low-order digits are lost. See this github issue
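The rounding can be reproduced outside Elasticsearch. This small Python sketch pushes the two field values from the question through a 32-bit float using the standard struct module; to_float32 is just a helper name chosen here.

```python
import struct


def to_float32(value):
    """Round a number to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


if __name__ == "__main__":
    for a in (851459198, 984968088):
        # Between 2^29 and 2^30 consecutive floats are 64 apart.
        print(a, "->", int(to_float32(a)))
    # prints:
    # 851459198 -> 851459200
    # 984968088 -> 984968064
```

The second value is stored as 984968064; the response shows 984968060, presumably because the float is printed with the fewest decimal digits that still identify it.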

Does the elasticsearch ID have to be unique to a type or to the index?

Elasticsearch allows you to store a _type along with the _index. If I provide my own _id, does it have to be unique across the whole index?
No, the _type and _id only have to be unique together:
PUT so
PUT /so/t1/1
{}
PUT /so/t2/1
{}
GET /so/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "so",
"_type": "t2",
"_id": "1",
"_score": 1,
"_source": {}
},
{
"_index": "so",
"_type": "t1",
"_id": "1",
"_score": 1,
"_source": {}
}
]
}
}
And the reason for that: you'd never get a document from an index without knowing its doctype, and an index-wide query returns documents together with their types and indexes.
Absolutely, there are a few ways of doing it.
The first is using the PUT API, which allows us to specify an ID for a document. So, for the index index and the type type:
curl -XPUT "http://localhost:9200/index/type/1/" -d'
{
  "test": "test"
}'
Which gives me this document:
{
"_index": "index",
"_type": "type",
"_id": "1",
"_score": 1,
"_source": {
"test": "test"
}
}
Another way is to route the ID to a unique field in your mapping. For example, an md5 hash. So, for an index called index with a type called type, we can specify the following mapping:
curl -XPUT "http://localhost:9200/index/_mapping/type" -d'
{
  "type": {
    "_id": {
      "path": "md5"
    },
    "properties": {
      "md5": {
        "type": "string"
      }
    }
  }
}'
This time, I'm going to use the POST API, which normally generates an ID automatically. Because the mapping specifies a path, the ID is taken from the md5 field instead; if you haven't specified a path in your mapping, an ID is generated for you.
curl -XPOST "http://localhost:9200/index/type/" -d'
{
"md5":"00000000000011111111222222223333"
}'
Which gives me the following document in a search:
{
"_index": "index",
"_type": "type",
"_id": "00000000000011111111222222223333",
"_score": 1,
"_source": {
"md5": "00000000000011111111222222223333"
}
}
The second method is generally preferred, because it provides consistency across the index. A perfectly valid id for an index could be 1 like in the example, or dog in another case.
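If you would rather not rely on a mapping setting, a similar effect can be had on the client side by computing the hash yourself and passing it as the explicit ID (the first method). The following Python sketch is that swapped-in variant, not the mapping-based approach above; the function name and the choice to hash the whole document are assumptions made for illustration.

```python
import hashlib
import json


def index_request_with_md5_id(index, doc_type, doc):
    """Build a PUT request whose document ID is the md5 of the document body."""
    # sort_keys makes the serialization, and therefore the hash, independent of key order.
    payload = json.dumps(doc, sort_keys=True)
    doc_id = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return "PUT", "/{}/{}/{}".format(index, doc_type, doc_id), payload


if __name__ == "__main__":
    method, path, payload = index_request_with_md5_id("index", "type", {"test": "test"})
    print(method, path)
    print(payload)
```

Indexing the same content twice then overwrites the existing document instead of creating a duplicate.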

Elasticsearch: get multiple specified documents in one request?

I am new to Elasticsearch and would like to know whether this is possible.
Basically, each of my documents has a unique value in its "code" property. I have the codes of several documents and want to retrieve those documents in one request by supplying all of the codes.
Is this doable in Elasticsearch?
Regards.
Edit
This is the mapping of the field:
"code" : { "type" : "string", "store": "yes", "index": "not_analyzed"},
Two example values of this property:
0Qr7EjzE943Q
GsPVbMMbVr4s
What is the ES syntax to retrieve the two documents in ONE request?
First, you probably don't want "store":"yes" in your mapping, unless you have _source disabled (see this post).
So, I created a simple index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
added the two docs with the bulk API:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"code":"0Qr7EjzE943Q"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"code":"GsPVbMMbVr4s"}
There are a number of ways I could retrieve those two documents. The most straightforward, especially since the field isn't analyzed, is probably a terms query:
POST /test_index/_search
{
  "query": {
    "terms": {
      "code": [
        "0Qr7EjzE943Q",
        "GsPVbMMbVr4s"
      ]
    }
  }
}
both documents are returned:
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.04500804,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.04500804,
"_source": {
"code": "0Qr7EjzE943Q"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.04500804,
"_source": {
"code": "GsPVbMMbVr4s"
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/a3e3e4f05753268086a530b06148c4552bfce324
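When the list of codes comes from application code, the request body can be generated instead of written by hand. Here is a minimal Python sketch, with a made-up function name; it also sets size explicitly, because a search returns only 10 hits by default and a longer list of codes would otherwise be cut off.

```python
import json


def codes_search_body(codes):
    """Build a search body that fetches every document whose code is in the list."""
    codes = list(codes)
    return {
        # Ask for as many hits as codes, since the default page size is 10.
        "size": len(codes),
        "query": {"terms": {"code": codes}},
    }


if __name__ == "__main__":
    body = codes_search_body(["0Qr7EjzE943Q", "GsPVbMMbVr4s"])
    print(json.dumps(body, indent=2))
```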

Resources