Elasticsearch deeper-level parent-child relationship (grandchild)

I need to index 3 levels (or more) of child-parent.
For example, the levels might be an author, a book, and characters from that book.
However, when indexing more than two levels, there is a problem with has_child and has_parent queries and filters.
If I have 5 shards, I get about one fifth of the results when running a "has_parent" query on the lowest level (characters) or a "has_child" query on the second level (books).
My guess is that a book is routed to a shard by its parent id, and so resides together with its parent (author), but a character is routed to a shard based on the hash of the book id, which does not necessarily correspond to the shard the book was actually indexed on.
This means that all characters of books by the same author do not necessarily reside in the same shard (which rather cripples the whole parent-child advantage).
Am I doing something wrong? How can I resolve this? I really need complex queries such as "which authors wrote books with female characters".
I made a gist showing the problem at:
https://gist.github.com/eranid/5299628
Bottom line is, that if I have a mapping:
"author" : {
"properties" : {
"name" : {
"type" : "string"
}
}
},
"book" : {
"_parent" : {
"type" : "author"
},
"properties" : {
"title" : {
"type" : "string"
}
}
},
"character" : {
"_parent" : {
"type" : "book"
},
"properties" : {
"name" : {
"type" : "string"
}
}
}
and a 5-shard index, I cannot run "has_child" and "has_parent" queries reliably.
The query:
curl -XPOST 'http://localhost:9200/index1/character/_search?pretty=true' -d '{
"query": {
"bool": {
"must": [
{
"has_parent": {
"parent_type": "book",
"query": {
"match_all": {}
}
}
}
]
}
}
}'
returns only a fifth (approximately) of the characters.

You are correct: a parent/child relationship can only work when all children of a given parent reside in the same shard as the parent. Elasticsearch achieves this by using the parent id as the routing value. This works well for one level, but it breaks on the second and subsequent levels. In a parent/child/grandchild relationship, parents are routed based on their own ids, children are routed based on their parents' ids (which works), but grandchildren are routed based on the children's ids and can end up in the wrong shards. To demonstrate with an example, let's assume that we are indexing 3 documents:
curl -XPUT 'localhost:9200/test-idx/author/Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless' -d '{...}'
Elasticsearch uses the value Douglas-Adams to calculate routing for the document Douglas-Adams, no surprise here. For the document Mostly-Harmless, Elasticsearch sees that it has the parent Douglas-Adams, so it again uses Douglas-Adams to calculate routing and everything is good: the same routing value means the same shard. But for the document Arthur-Dent, Elasticsearch sees that it has the parent Mostly-Harmless, so it uses the value Mostly-Harmless for routing, and as a result the document Arthur-Dent can end up in the wrong shard.
The solution for this is to explicitly specify routing value for the grandchildren equal to the id of the grandparent:
curl -XPUT 'localhost:9200/test-idx/author/Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless&routing=Douglas-Adams' -d '{...}'
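The routing behaviour described in this answer can be imitated with a small Python sketch. It is only an illustration: CRC32 stands in for Elasticsearch's real routing hash, and the helper names `shard_for` and `route` are made up for this example. What matters is the shape "hash of the routing value, modulo the number of shards".

```python
import zlib

NUM_SHARDS = 5  # matches the five-shard index in the question


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a routing value.

    Assumption: CRC32 is a stand-in for Elasticsearch's actual hash function.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id: str, parent: str = None, routing: str = None) -> int:
    # Explicit routing wins, then the parent id, then the document's own id.
    return shard_for(routing or parent or doc_id)


author = route("Douglas-Adams")
book = route("Mostly-Harmless", parent="Douglas-Adams")
# Default behaviour: the grandchild is routed by the book id, not the author id.
character_default = route("Arthur-Dent", parent="Mostly-Harmless")
# Fix: pass the grandparent id as the routing value.
character_fixed = route("Arthur-Dent", parent="Mostly-Harmless",
                        routing="Douglas-Adams")

print("author:", author, "book:", book)
print("character (default routing):", character_default)
print("character (routing=Douglas-Adams):", character_fixed)
```

With explicit routing, the author, book, and character are guaranteed to share a shard. With default routing, the character lands wherever the hash of Mostly-Harmless points, which matches the author's shard only by chance.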

For grandparent documents, use the document's own _id as the _routing.
For parent documents, use the _parent (the grandparent's _id) as the _routing.
For child documents, use the grandparent's _id as the _routing.

Related

Trino/presto with elastic : how to search nested objects?

I'm new to Trino and I'm trying to use it to query nested objects in Elasticsearch.
This is my mapping in elasticsearch:
{
"product_index": {
"mappings": {
"properties" :{
"id" : { "type" : "keyword"},
"name" : { "type" : "keyword"},
"linked_products" :{
"type": "nested",
"properties" :{
"id" : { "type" : "keyword"}
}
}
}
}
}
}
I need to perform a query on the id field under linked_products.
What is the syntax in Trino to perform a query on the id field?
Do I need to use special definitions on the target index mapping in Elasticsearch to map the nested section for Trino?
=========================================================
Hi,
I will try to add some clarifications to my question.
We are trying to query the data according to the id field.
This is the query in Elastic:
GET product_index/_search
{
"query": {
"nested" : {
"path" : "linked_products",
"query": {
"bool": {
"should" : [
{ "match" : {"linked_products.id" :123}}
]
}
}
}
}
}
We tried to query the id field in 2 ways:
Trino query -
select count(*)
from es_table aaa
where any_match(aaa.linked_products, x-> x.id=123)
When we query on the id field, the pushdown to Elasticsearch doesn't happen and the connector retrieves all the documents into Trino (this only happens with queries on nested documents).
Sending an Elasticsearch query from Trino to Elasticsearch:
SELECT * FROM es.default."$query:"
It works, but when we try to retrieve ids with many documents we get a timeout from the Elasticsearch client.
I don't understand from the documentation whether it is possible to use scrolling with es-query to avoid the timeout problem.
Trino maps nested object type to a ROW the same way that it maps a standard object type during a read. The nested designation itself serves no purpose to Trino since it only determines how the object is stored in Elasticsearch.
Assume we push the following document to your index.
curl -X POST "localhost:9200/product_index/_doc?pretty" \
-H 'Content-Type: application/json' -d'
{
"id": "1",
"name": "foo",
"linked_products": {
"id": "123"
}
}
'
The way you would read this out in Trino would just be to use the standard ROW syntax.
SELECT
id,
name,
linked_products.id
FROM elasticsearch.default.product_index;
Result:
|id |name|id |
|---|----|---|
|1 |foo |123|
This is fine and well, but judging from the fact that the name of your nested object is plural, I'll assume you want to store an array of objects like so.
curl -X POST "localhost:9200/product_index/_doc?pretty" -H 'Content-Type: application/json' -d'
{
"id": "2",
"name": "bar",
"linked_products": [
{
"id": "123"
},
{
"id": "456"
}
]
}
'
If you run the same query as above, with the second document inserted, you'll get the following error.
SQL Error [58]: Query failed (#20210604_202723_00009_nskc4): Expected object for field 'linked_products' of type ROW: [{id=123}, {id=456}] [ArrayList]
This is because Trino has no way of knowing which fields are arrays from the default Elasticsearch mapping. To enable querying over this array, you'll need to follow the instructions in the docs to explicitly identify that field as an array type in Trino using the _meta field. Here is the command that would be used in this example to identify linked_products as an ARRAY.
curl --request PUT \
--url localhost:9200/product_index/_mapping \
--header 'content-type: application/json' \
--data '
{
"_meta": {
"presto":{
"linked_products":{
"isArray":true
}
}
}
}'
Now you will need to account in the SELECT statement for the fact that linked_products is an ARRAY of type ROW. Not every array position will have a value, so you should use the index-safe element_at function to avoid errors.
SELECT
id,
name,
element_at(linked_products, 1).id AS id1,
element_at(linked_products, 2).id AS id2
FROM elasticsearch.default.product_index;
Result:
|id |name|id1|id2 |
|---|----|---|----|
|1 |foo |123|NULL|
|2 |bar |123|456 |
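To make the "index-safe" behaviour concrete, here is a rough Python approximation of what element_at does in the query above. This is not Trino code: the `element_at` and `project` helpers are hypothetical, and the first document is assumed to be exposed as a one-element array once the field is marked as an array.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def element_at(items: Sequence[T], index: int) -> Optional[T]:
    """1-based lookup that yields None instead of failing when out of range."""
    if index == 0:
        raise ValueError("index must not be zero")
    # Positive indexes count from the front, negative ones from the back.
    pos = index - 1 if index > 0 else len(items) + index
    if 0 <= pos < len(items):
        return items[pos]
    return None


# Assumed in-memory copies of the two example documents.
rows = [
    {"id": "1", "name": "foo", "linked_products": [{"id": "123"}]},
    {"id": "2", "name": "bar", "linked_products": [{"id": "123"}, {"id": "456"}]},
]


def project(row: dict) -> tuple:
    first = element_at(row["linked_products"], 1)
    second = element_at(row["linked_products"], 2)
    # A missing element becomes None, mirroring NULL in the result table.
    return (row["id"], row["name"],
            first["id"] if first else None,
            second["id"] if second else None)


result = [project(r) for r in rows]
print(result)
```

A plain subscript on the missing second element would raise an error for the first row, which is the situation element_at avoids.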
=========================================================
Update to answer @gil bob's updated question.
There is currently no support for aggregate pushdown in the Elasticsearch connector, but this is being added in PR 7131.
You can set the elasticsearch.request-timeout property in your elasticsearch.properties file to increase the request timeout as a workaround until pushdown is available. If Elasticsearch takes this long to return results, this will need to be set whether you run the aggregation in Trino or in Elasticsearch.

Is it possible to check that specific data matches the query without loading it to the index?

Imagine that I have a specific data string and a specific query. The simple way to check that the query matches the data is to load the data into the Elastic index and run the online query. But can I do it without putting it into the index?
Maybe there are some open-source libraries that implement the Elasticsearch functionality offline, so I can call something like getScore(data, query)? Or is it possible to implement this by using specific API endpoints?
Thanks in advance!
What you can do is to leverage the percolator type.
What this allows you to do is to store the query instead of the document and then test whether a document would match the stored query.
For instance, you first create an index with a field of type percolator that will contain your query (you also need to add in the mapping any field used by the query so ES knows what their types are):
PUT my_index
{
"mappings": {
"properties": {
"query": {
"type": "percolator"
},
"message": {
"type": "text"
}
}
}
}
Then you can index a real query, like this:
PUT my_index/_doc/match_value
{
"query" : {
"match" : {
"message" : "bonsai tree"
}
}
}
Finally, you can check using the percolate query whether the query you've just stored would match a given document:
GET /my_index/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document" : {
"message" : "A new bonsai tree in the office"
}
}
}
}
So all you need to do is to only store the query (not the documents), and then you can use the percolate query to check if the documents would have been selected by the query you stored, without having to store the documents themselves.
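If you need to test many candidate documents, the percolate request body can be generated rather than written by hand. The sketch below only builds the JSON body shown in this answer. The helper name `percolate_request` is hypothetical, and sending the body to the my_index/_search endpoint is left to whichever HTTP client you use.

```python
import json


def percolate_request(document: dict, field: str = "query") -> dict:
    """Build a percolate search body for one candidate document.

    `field` is the name of the percolator field in the index mapping.
    """
    return {
        "query": {
            "percolate": {
                "field": field,
                "document": document,
            }
        }
    }


body = percolate_request({"message": "A new bonsai tree in the office"})
print(json.dumps(body, indent=2))
```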

How to create a common mapping template for indices?

For the app I created, indices are generated once a week. The type and nature of the data do not vary, which means I need the same mapping for all of these indices. Is it possible in Elasticsearch to apply the same mapping to all the indices as they are created? This would save me the overhead of defining the mapping each time an index is created.
Definitely, you can use what is called an index template. Since your mapping type is stable, that's the perfect condition for using index templates.
It's as easy as creating an index. See below, whenever you want to index a document in an index whose name matches my_*, ES will select that template and create the index for you using the given mappings, settings and aliases:
curl -XPUT localhost:9200/_template/template_1 -d '{
"template" : "my_*",
"settings" : {
"number_of_shards" : 1
},
"aliases" : {
"my_alias" : {}
},
"mappings" : {
"my_type" : {
"properties" : {
"my_field": { "type": "string" }
}
}
}
}'
It's basically the technique used by Logstash when it needs to index new logs for each new day in a new daily index.
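As a quick illustration of which index names the my_* pattern would pick up, here is a Python sketch. It approximates the wildcard with the standard fnmatch module, which is an assumption: fnmatch also understands ? and [...], which go beyond the simple * wildcard used here. The weekly index names are invented for the example.

```python
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "my_*"  # the pattern from the template above


def template_applies(index_name: str, pattern: str = TEMPLATE_PATTERN) -> bool:
    """Return True if a newly created index name matches the template pattern."""
    return fnmatchcase(index_name, pattern)


# Hypothetical weekly index names.
weekly_indices = ["my_2015_w01", "my_2015_w02", "other_2015_w01"]
matched = [name for name in weekly_indices if template_applies(name)]
print(matched)
```

Any index whose name does not start with my_ is created without the template's mappings, so a consistent naming scheme for the weekly indices is what makes this approach work.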
You can use an index template to address your problem. The official documentation can be found here.
A use case of how to apply the same with examples can be found in this blog

Elasticsearch has_child query/filter in Kibana 4

I cannot seem to get the has_child query (or filter) to function in Kibana 4. My code works in elasticsearch directly as a curl script, but not in Kibana 4, yet I understood this was a key feature of the upgrade. Can anybody shed any light?
The curl script as follows works in elasticsearch, returning all of the parents where they have a child object:
curl -XPOST localhost:port/indexname/_search?pretty -d '{
"query" : {
"has_child" : {
"type" : "object",
"query" : {
"match_all" : {}
}
}
}
}'
The above runs fine. Then to convert it to the JSON query to submit within Kibana, I've followed the general formatting rules: I've dropped the curl line and added the index name (and sometimes a blank filter [], but it doesn't seem to make much difference); no error is thrown but the whole dataset returns.
{
"index" : "indexname",
"query" : {
"has_child" : {
"type" : "object",
"query" : {
"match_all" : {}
}
}
}
}
Am I missing something? Has anybody else got a has_child query to run in Kibana 4?
Many thanks in advance
Toby

Exact (not substring) matching in Elasticsearch

The query
{"query":{
"match" : {
"content" : "2"
}
}}
matches all the documents whose content contains the number 2. However, I would like the content to be exactly 2, no more and no less; think of my requirement in the spirit of Java's String.equals.
Similarly, for the second query I would like to match only when the document's content is exactly '3 3' and nothing more or less:
{"query":{
"match" : {
"content" : "3 3"
}
}}
How could I do exact (String.equals) matching in Elasticsearch?
Without seeing your index type mapping and sample data, it's hard to answer this directly - but I'll try.
Offhand, I'd say this is similar to this answer here (https://stackoverflow.com/a/12867852/382774), where you simply set the content field's index option to not_analyzed in your mapping:
"url" : {
"type" : "string",
"index" : "not_analyzed"
}
Edit: I wasn't clear enough with my original answer, shown above. I did not mean to imply that you should add the example code to your query, I meant that you need to specify in your index type mapping that the url field is of type string and it is indexed but not analyzed (not_analyzed).
This tells Elasticsearch to not bother analyzing (tokenizing or token filtering) the field when you're indexing your documents - just store it in the index as it exists in the document. For more information on mappings, see http://www.elasticsearch.org/guide/reference/mapping/ for an intro and http://www.elasticsearch.org/guide/reference/mapping/core-types/ for specifics on not_analyzed (tip: search for it on that page).
Update:
The official documentation says that newer versions of Elasticsearch no longer let you define a string field as "not_analyzed"; you should use the "keyword" type instead.
For older versions of Elasticsearch:
{
"foo": {
"type": "string",
"index": "not_analyzed"
}
}
For newer versions:
{
"foo": {
"type": "keyword",
"index": true
}
}
Note that the keyword type was introduced in Elasticsearch 5.0, and the backward-compatibility layer was removed in the Elasticsearch 6.0 release.
Official Doc
You should use a term filter instead of match.
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"content" : 2
}
}
}
}
}
This returns docs whose content is exactly 2, rather than 20 or 2.1, provided the content field is not analyzed.
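The difference between matching on an analyzed field and a term lookup on an unanalyzed one can be sketched in a few lines of Python. This is a simplification under stated assumptions: lowercasing plus whitespace splitting stands in for a real analyzer, and the function names are invented for this example.

```python
def analyze(text: str) -> list:
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def match_analyzed(content: str, query: str) -> bool:
    """Roughly a match query on an analyzed field: any query token is present."""
    doc_tokens = set(analyze(content))
    return any(token in doc_tokens for token in analyze(query))


def term_unanalyzed(content: str, value: str) -> bool:
    """Roughly a term query on a not_analyzed/keyword field: whole-value equality."""
    return content == value


docs = ["2", "2 apples", "3 3", "3 3 3"]

print([d for d in docs if match_analyzed(d, "2")])
print([d for d in docs if term_unanalyzed(d, "2")])
print([d for d in docs if match_analyzed(d, "3 3")])
print([d for d in docs if term_unanalyzed(d, "3 3")])
```

The analyzed match finds every document that contains the token, while the unanalyzed comparison behaves like String.equals, which is the behaviour asked for in the question.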
