Trino/Presto with Elasticsearch: how to search nested objects?

I'm new to Trino and I'm trying to use it to query nested objects in Elasticsearch.
This is my mapping in Elasticsearch:
{
  "product_index": {
    "mappings": {
      "properties": {
        "id": { "type": "keyword" },
        "name": { "type": "keyword" },
        "linked_products": {
          "type": "nested",
          "properties": {
            "id": { "type": "keyword" }
          }
        }
      }
    }
  }
}
I need to perform a query on the id field under linked_products.
What is the syntax in Trino for querying that field?
Do I need any special definitions on the target index mapping in Elasticsearch so that Trino can handle the nested section?
=========================================================
Hi,
I will try to add some clarifications to my question.
We are trying to query the data according to the id field.
This is the query in Elasticsearch:
GET product_index/_search
{
  "query": {
    "nested": {
      "path": "linked_products",
      "query": {
        "bool": {
          "should": [
            { "match": { "linked_products.id": 123 } }
          ]
        }
      }
    }
  }
}
We tried to query the id field in two ways:
Trino query -
select count(*)
from es_table aaa
where any_match(aaa.linked_products, x -> x.id = 123)
When we query by the id field, pushdown to Elasticsearch doesn't happen and the connector retrieves all the documents into Trino (this only happens with queries on nested documents).
Sending an ES query from Trino to Elasticsearch:
SELECT * FROM es.default."$query:"
It works, but when we try to retrieve ids that match many documents we get a timeout from the Elasticsearch client.
I can't tell from the documentation whether scrolling can be used with such raw ES queries to avoid the timeout problem.

Trino maps the nested object type to a ROW the same way that it maps a standard object type during a read. The nested designation itself serves no purpose to Trino, since it only determines how the object is stored in Elasticsearch.
Assume we push the following document to your index.
curl -X POST "localhost:9200/product_index/_doc?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "id": "1",
  "name": "foo",
  "linked_products": {
    "id": "123"
  }
}
'
The way you would read this out in Trino would just be to use the standard ROW syntax.
SELECT
id,
name,
linked_products.id
FROM elasticsearch.default.product_index;
Result:
|id |name|id |
|---|----|---|
|1 |foo |123|
This is all well and good, but judging from the fact that the name of your nested object is plural, I'll assume you want to store an array of objects, like so:
curl -X POST "localhost:9200/product_index/_doc?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "id": "2",
  "name": "bar",
  "linked_products": [
    { "id": "123" },
    { "id": "456" }
  ]
}
'
If you run the same query as above, with the second document inserted, you'll get the following error.
SQL Error [58]: Query failed (#20210604_202723_00009_nskc4): Expected object for field 'linked_products' of type ROW: [{id=123}, {id=456}] [ArrayList]
This is because Trino has no way of knowing which fields are arrays from the default Elasticsearch mapping. So to enable querying over this array, you'll need to follow the instructions in the docs to explicitly identify that field as an ARRAY type in Trino using the _meta field. Here is the command that would be used in this example to identify linked_products as an ARRAY.
curl --request PUT \
  --url localhost:9200/product_index/_mapping \
  --header 'content-type: application/json' \
  --data '
{
  "_meta": {
    "presto": {
      "linked_products": {
        "isArray": true
      }
    }
  }
}'
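To sanity-check that the _meta entry was applied (a quick verification step, not part of the original answer), you can read the mapping back and look for the _meta section in the response:
curl -X GET "localhost:9200/product_index/_mapping?pretty"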
Now you will need to account for the fact that linked_products is an ARRAY of type ROW in the SELECT statement. Not all of the array indexes will have values, so you should use the index-safe element_at function to avoid errors.
SELECT
id,
name,
element_at(linked_products, 1).id AS id1,
element_at(linked_products, 2).id AS id2
FROM elasticsearch.default.product_index;
Result:
|id |name|id1|id2 |
|---|----|---|----|
|1 |foo |123|NULL|
|2 |bar |123|456 |
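If you don't know how many elements each array holds, an alternative to positional element_at access is to expand the array into rows. This is a sketch using standard Trino UNNEST syntax (not taken from the original answer); an ARRAY of ROW is expanded into one column per row field:
SELECT p.id, p.name, t.linked_id
FROM elasticsearch.default.product_index AS p
CROSS JOIN UNNEST(p.linked_products) AS t(linked_id);
This produces one output row per linked product, which also plays well with filters such as t.linked_id = '123'.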
=========================================================
Update to answer @gil bob's updated question.
There is currently no support for pushdown aggregates in the Elasticsearch connector, but this is being added in PR 7131.
You can set the elasticsearch.request-timeout property in your elasticsearch.properties file to increase the request timeout as a workaround until pushdown support lands. If it takes Elasticsearch this long to return the result, this will need to be set whether you run the aggregation in Trino or in Elasticsearch.
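For reference, a minimal sketch of the catalog file (the file path, host, and port here are illustrative assumptions; elasticsearch.request-timeout is the relevant setting):
# etc/catalog/elasticsearch.properties
connector.name=elasticsearch
elasticsearch.host=localhost
elasticsearch.port=9200
elasticsearch.request-timeout=2m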

Related

How to create document in elasticsearch to save data and to search it?

Here is my requirement. This is my 3 levels of data which I am getting from the DB. When I search for Developer I should get all the values for Developer, such as GEO and GRAPH from data2, in a list; and for Support the values should contain SERVER and Data in a list. Then, based on the selection from data1, it should be possible to search on data3; for example, when we select Developer, then GeoPos and GraphPos...
The logic which I need to use here is that of Elasticsearch.
|data1    |data2 |data3    |
|---------|------|---------|
|Developer|GEO   |GeoPos   |
|Developer|GRAPH |GraphPos |
|Support  |SERVER|ServerPos|
|Support  |Data  |DataPos  |
This is what I have done to create the index and to get the values:
curl -X PUT "http://localhost:9200/mapping_log" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "data1": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "data2": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "data3": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
    }
  }
}'
For searching values, I am not sure what I am going to get. Can you please help with the search DSL query too?
curl -X GET "localhost:9200/mapping_log/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "data1.data2": "product"
    }
  }
}'
How do we create a document for such data? Can we create JSON and post it through Postman or curl?
If your documents are not yet indexed in Elasticsearch, you first need to ingest them into an existing index with the aid of Logstash; you can find many configuration files related to your input database.
Before ingesting your documents, create an index in Elasticsearch with a multi-fields mapping. You can also use dynamic mapping (Elasticsearch's default mapping) and change your DSL query accordingly, but I recommend using a multi-fields mapping as follows:
PUT /mapping
{
  "mappings": {
    "properties": {
      "rating": { "type": "float" },
      "content": { "type": "text" },
      "author": {
        "properties": {
          "name": { "type": "text" },
          "email": { "type": "keyword" }
        }
      }
    }
  }
}
The result will be the above mapping applied to the index.
Then you can query the fields in the Kibana Dev Tools console with a DSL query like the one below:
GET /mapping/_search
{
  "query": {
    "match": { "author.email": "SOMEMAIL" }
  }
}
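To answer the document-creation part of the question: yes, you can create the JSON and POST it through curl or Postman. A minimal sketch against the mapping_log index from the question (field values taken from the table above):
curl -X POST "localhost:9200/mapping_log/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "data1": "Developer",
  "data2": "GEO",
  "data3": "GeoPos"
}'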

Elastic search query not returning results

I have an Elasticsearch query that is not returning data. Here are two examples of the query; the first one works and returns a few records, but the second one returns nothing. What am I missing?
Example 1 works:
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "data.case.field1": "ABC123"
    }
  }
}
'
Example 2 does not work:
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": {
        "term": { "data.case.field1": "ABC123" }
      }
    }
  }
}
'
This is happening due to the difference between match and term queries. Match queries are analyzed, meaning the same analyzer that was used on the field at index time is applied to the search term. Term queries are not analyzed and are used for exact searches; the search term in a term query doesn't go through the analysis process.
Official doc of term query
Returns documents that contain an exact term in a provided field.
Official doc of match query
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
If you are using a text field for data.case.field1 without any explicit analyzer, then the default analyzer (standard) for the text field is applied, which lowercases the text and stores the resulting token.
For your text, the standard analyzer would produce the token below; please refer to the Analyze API for more details.
{
  "text": "ABC123",
  "analyzer": "standard"
}
And the generated token:
{
  "tokens": [
    {
      "token": "abc123",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Now, when you use a term query, the search term is not analyzed and is used as-is. Since it is in capital characters (ABC123), it doesn't match the lowercased token in the index, hence returns no result.
PS: refer to this SO answer of mine for more details on term and match queries.
What is your mapping for data.case.field1? If it is of type text, you should use a match query instead of term.
See the warning at the top of this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html#query-dsl-term-query
Unless we know the mapping type (text or keyword), we are answering in the dark without knowing all the variables involved. Maybe you can try the following, which uses a filter clause (try this if you have the datatype as keyword):
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": {
        "term": { "data.case.field1": "ABC123" }
      }
    }
  }
}
'
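If data.case.field1 was dynamically mapped as text, it usually also gets a not-analyzed .keyword sub-field by default (see the next question below). A hedged alternative, assuming that default multi-field exists, is to run the term query against it:
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "data.case.field1.keyword": "ABC123" }
  }
}
'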

Elasticsearch query with wildcards

Using the data from the Elasticsearch tutorials as an example, the following URI search hits 9 records,
curl -XGET 'remotehost:9200/bank/_search?q=city:R*d&_source_include=city&pretty&pretty'
while the following request body search hits 0 records:
curl -XGET 'remotehost:9200/bank/_search?pretty' -H 'Content-Type: application/json' \
  -d'{
  "query": { "wildcard": { "city": "R*d" } },
  "_source": ["city"]
}'
But the two methods should be equivalent to each other. Any idea why this is happening? I use Elasticsearch 5.5.1 in Docker.
You can get your expected result with the command below. This command adds an extra .keyword to the city field in your query.
curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'{"query": {"wildcard": {"city.keyword": "R*d"} }, "_source": ["city"]}'
Reason for adding .keyword
When you insert data into Elasticsearch, you will notice a .keyword field, and that field is not_analyzed. By default, the field you have inserted data into is analyzed with the standard analyzer, and a .keyword multi-field is added. If you create a field city with data, Elasticsearch creates the field city with the standard analyzer and adds a multi-field city.keyword which is not_analyzed.
In your case you need a not_analyzed field to query (as a wildcard query). So your query should be on the city.keyword field, which is not_analyzed by default.
In the first case, you sent a GET request to Elasticsearch with a query parameter; Elasticsearch automatically converts that query to something like the second format.
For a reliable source, you can follow the official docs:
The string field has split into two new types: text, which should be
used for full-text search, and keyword, which should be used for
keyword search.
To make things better, Elasticsearch decided to borrow an idea that
initially stemmed from Logstash: strings will now be mapped both as
text and keyword by default. For instance, if you index the
following simple document:
{
  "foo": "bar"
}
Then the following dynamic mappings will be created:
{
  "foo": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, it will both be possible to perform full-text search
on foo, and keyword search and aggregations using the foo.keyword
field.
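As a concrete illustration of that last point, here is a sketch of a terms aggregation targeting the keyword variant, reusing the bank index and city field from this question (not taken from the quoted docs):
curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": { "field": "city.keyword" }
    }
  }
}
'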

Creating a mapping for an existing index with a new type in elasticsearch

I found an article on Elasticsearch's site describing how to 'reindex without downtime' (http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/), but that's not really acceptable every time a new element that needs a custom mapping is introduced.
Does anyone know why I can't create a mapping for an existing index but a new type in elasticsearch? The type doesn't exist yet, so why not? Maybe I'm missing something and it IS possible? If so, how can that be achieved?
Thanks,
Vladimir
Here is a simple example of creating two type mappings in an index, one after another.
I've used i1 as the index and t1 and t2 as the types.
Create index
curl -XPUT "http://localhost:9200/i1"
Create type 1
curl -XPUT "http://localhost:9200/i1/t1/_mapping" -d
{
"t1": {
"properties": {
"field1": {
"type": "string"
},
"field2": {
"type": "string"
}
}
}
}'
Create type 2
curl -XPUT "localhost:9200/i1/t2/_mapping" -d'
{
"t2": {
"properties": {
"field3": {
"type": "string"
},
"field4": {
"type": "string"
}
}
}
}'
Now, looking at the mapping (curl -XGET "http://localhost:9200/i1/_mapping"), it seems to be working.
Hope this helps!! Thanks
If you're using Elasticsearch 6.0 or above, an index can have only one type.
So you have to create a separate index for your second type, or use a single index with a custom field that distinguishes the two types.
For more details : Removal of multiple types in index
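A minimal sketch of that single-index workaround (ES 7-style syntax; the doc_type field name and the text field types are assumptions for illustration):
curl -X PUT "localhost:9200/i1" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "doc_type": { "type": "keyword" },
      "field1": { "type": "text" },
      "field2": { "type": "text" },
      "field3": { "type": "text" },
      "field4": { "type": "text" }
    }
  }
}'
Each document then sets doc_type to "t1" or "t2", and queries filter on it with a term clause.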

Elasticsearch deeper level Parent-child relationship (grandchild)

I need to index 3 levels (or more) of child-parent.
For example, the levels might be an author, a book, and characters from that book.
However, when indexing more than two levels there is a problem with has_child and has_parent queries and filters.
If I have 5 shards, I get about one fifth of the results when running a has_parent query on the lowest level (characters) or a has_child query on the second level (books).
My guess is that a book gets indexed to a shard by its parent id and so will reside together with its parent (author), but a character gets indexed to a shard based on the hash of the book id, which does not necessarily match the actual shard the book was indexed on.
This means that all characters of books by the same author do not necessarily reside in the same shard (kind of crippling the whole child-parent advantage, really).
Am I doing something wrong? How can I resolve this? I am in real need of complex queries such as "what authors wrote books with female characters", for example.
I made a gist showing the problem, at:
https://gist.github.com/eranid/5299628
Bottom line is that if I have the mapping:
"author" : {
"properties" : {
"name" : {
"type" : "string"
}
}
},
"book" : {
"_parent" : {
"type" : "author"
},
"properties" : {
"title" : {
"type" : "string"
}
}
},
"character" : {
"_parent" : {
"type" : "book"
},
"properties" : {
"name" : {
"type" : "string"
}
}
}
and a 5-shard index, I cannot make queries with has_child and has_parent.
The query:
curl -XPOST 'http://localhost:9200/index1/character/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "has_parent": {
            "parent_type": "book",
            "query": {
              "match_all": {}
            }
          }
        }
      ]
    }
  }
}'
returns only a fifth (approximately) of the characters.
You are correct: a parent/child relationship can only work when all children of a given parent reside in the same shard as the parent. Elasticsearch achieves this by using the parent id as a routing value. It works great on one level. However, it breaks on the second and subsequent levels. When you have a parent/child/grandchild relationship, parents are routed based on their id, children are routed based on the parent ids (this works), but then grandchildren are routed based on the child ids and they end up in the wrong shards. To demonstrate with an example, let's assume that we are indexing 3 documents:
curl -XPUT localhost:9200/test-idx/author/Douglas-Adams -d '{...}'
curl -XPUT localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams -d '{...}'
curl -XPUT localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless -d '{...}'
Elasticsearch uses the value Douglas-Adams to calculate routing for the document Douglas-Adams - no surprise here. For the document Mostly-Harmless, Elasticsearch sees that it has parent Douglas-Adams, so it again uses Douglas-Adams to calculate routing and everything is good - same routing value means same shard. But for the document Arthur-Dent, Elasticsearch sees that it has parent Mostly-Harmless, so it uses the value Mostly-Harmless as the routing, and as a result the document Arthur-Dent ends up in the wrong shard.
The solution for this is to explicitly specify a routing value for the grandchildren equal to the id of the grandparent:
curl -XPUT 'localhost:9200/test-idx/author/Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/book/Mostly-Harmless?parent=Douglas-Adams' -d '{...}'
curl -XPUT 'localhost:9200/test-idx/character/Arthur-Dent?parent=Mostly-Harmless&routing=Douglas-Adams' -d '{...}'
For the grandparent docs, the _routing is simply their own _id.
For the parent docs, use the _parent (the grandparent's _id) as the _routing.
For the grandchild docs, again use the grandparent's _id as the _routing.
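With the routing fixed, the kind of cross-level query the question asks for becomes feasible. A sketch of "what authors wrote books with female characters" using nested has_child queries (the gender field on character documents is an assumption; the mapping above only defines name):
curl -XPOST 'localhost:9200/test-idx/author/_search?pretty' -d '{
  "query": {
    "has_child": {
      "type": "book",
      "query": {
        "has_child": {
          "type": "character",
          "query": {
            "match": { "gender": "female" }
          }
        }
      }
    }
  }
}'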
