Elasticsearch: querying on both nested object properties and parent properties - elasticsearch

I have some documents that contain nested objects inside nested objects:
{
  "started_at": 1455088063966,
  "ended_at": 1455088131966,
  "tags": [{
    "type": "transfer",
    "at": 1455088064462,
    "events": [{
      "type": "transfer_processed",
      "at": 1455088131981
    }]
  }, {
    "at": 1455088138232,
    "item": "tag",
    "type": "info"
  }]
}
Here, the main document has several nested objects (the tags), and for each tag there are several nested objects (the events).
I would like to get all the documents where the events of type transfer_processed occurred 60000 milliseconds after the tags of type transfer.
For this, I would need to query on tags.at, tags.type, tags.events.at and tags.events.type together. I can't figure out how: I only manage to query on the tags.events properties, or only on the tags properties, not both.

Nested objects are actually separate Lucene documents under the hood, so you are essentially trying to "join" multiple documents together to do your comparisons. Unfortunately, this is not supported by Elasticsearch.
Have a look at this similar question and answer which explain it well.
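Since the comparison cannot be expressed as a single query, one possible workaround is to do it outside Elasticsearch, either on the hits returned by a broader query or at index time to precompute a flag. The following is only a minimal Python sketch of that comparison. It assumes that "60000 milliseconds after" means "at least 60000 ms later", and the function name has_delayed_event is hypothetical.

```python
def has_delayed_event(doc, tag_type, event_type, min_delay_ms):
    """Return True if any tag of tag_type contains an event of event_type
    whose "at" is at least min_delay_ms later than the tag's own "at"."""
    for tag in doc.get("tags", []):
        if tag.get("type") != tag_type:
            continue
        # Tags without an "events" array simply contribute no matches.
        for event in tag.get("events", []):
            if (event.get("type") == event_type
                    and event["at"] - tag["at"] >= min_delay_ms):
                return True
    return False


# The sample document from the question.
doc = {
    "started_at": 1455088063966,
    "ended_at": 1455088131966,
    "tags": [
        {
            "type": "transfer",
            "at": 1455088064462,
            "events": [{"type": "transfer_processed", "at": 1455088131981}],
        },
        {"at": 1455088138232, "item": "tag", "type": "info"},
    ],
}

# The event is 67519 ms after its tag, so this prints True.
print(has_delayed_event(doc, "transfer", "transfer_processed", 60000))
```

If the result is computed at index time and stored as a plain field on the document, it can then be filtered with an ordinary query.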

Related

Elasticsearch searching a subset of nested objects with simple_query_string

Let's say I have the following data format:
{
  "name": "John Smith",
  "publications": [
    {
      "category": "cat-aaa",
      "text": "car airplane boat"
    },
    {
      "category": "cat-bbb",
      "text": "pen chain headphones"
    },
    {
      "category": "cat-ccc",
      "text": "mouse screen computer"
    }
  ]
}
Right now, I am using simple_query_string to search across the publications.text field. A query such as car AND mouse would return the document above.
What I want is to search for people by publications' texts, but only for some categories. For instance, a query car AND mouse searching through categories cat-aaa and cat-ccc should return the above document. However, searching for car AND mouse through category cat-aaa should not match the above document.
The two approaches that I have tried so far did not work:
Without using nested queries it's not possible to filter publications as ES is flattening the data.
When using nested queries, I cannot search across all the publications of a person, so the query example from above wouldn't work. Nested queries seem to process individual nested documents, but I need to search across all nested documents matching the category I choose.
Can such a query be done using ES?
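To pin down the behaviour being asked for, the requirement can be restated in plain Python. This is not an Elasticsearch query, only a sketch of the intended semantics. It assumes a query is a simple AND of whitespace-separated terms, and the function name matches is hypothetical.

```python
def matches(person, terms, categories):
    """Return True if every term appears somewhere in the texts of the
    person's publications that belong to one of the given categories."""
    words = set()
    for pub in person.get("publications", []):
        if pub["category"] in categories:
            # Pool the words of all selected publications together,
            # so terms may come from different publications.
            words.update(pub["text"].lower().split())
    return all(term.lower() in words for term in terms)


person = {
    "name": "John Smith",
    "publications": [
        {"category": "cat-aaa", "text": "car airplane boat"},
        {"category": "cat-bbb", "text": "pen chain headphones"},
        {"category": "cat-ccc", "text": "mouse screen computer"},
    ],
}

print(matches(person, ["car", "mouse"], {"cat-aaa", "cat-ccc"}))  # True
print(matches(person, ["car", "mouse"], {"cat-aaa"}))             # False
```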

Kafka Connect JDBC sink - write Avro field into PG JSONB

I'm trying to build a pipeline where Avro data is written into a Postgres DB. Everything works fine with simple schemas and the AvroConverter for the values. However, I would like to have a nested field written into a JSONB column. There are a couple of problems with this. First, it seems that the Connect plugin does not support STRUCT data. Second, the plugin cannot write directly into the JSONB column.
The second problem can be avoided by adding a cast in PG, as described in this issue. The first problem is proving more difficult. I have tried different transformations but have not been able to get the Connect plugin to interpret one complex field as a string. The schema in question looks something like this (in practice there would be more fields on the first level besides the timestamp):
{
  "namespace": "test.schema",
  "name": "nested_message",
  "type": "record",
  "fields": [
    {
      "name": "timestamp",
      "type": "long"
    },
    {
      "name": "nested_field",
      "type": {
        "name": "nested_field_record",
        "type": "record",
        "fields": [
          {
            "name": "name",
            "type": "string"
          },
          {
            "name": "prop",
            "type": "float",
            "doc": "Some property"
          }
        ]
      }
    }
  ]
}
The message is written in Kafka as
{"timestamp":1599493668741396400,"nested_field":{"name":"myname","prop":377.93887}}
In order to write the contents of nested_field into a single DB column, I would like to interpret this entire field as a string. Is this possible? I have tried the Cast transformation, but this only supports primitive Avro types. Something along the lines of HoistField could work, but I don't see a way to limit this to a single field. Any ideas or advice would be greatly appreciated.
A completely different approach would be to use two connect plugins and UPSERT into the table. One plugin would use the AvroConverter for all fields save the nested one, while the second plugin uses the StringConverter for the nested field. This feels wrong in all kinds of ways though.
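For reference, the record shape that the single-column idea is aiming for can be sketched in Python. This is not Kafka Connect code and does not use any Connect API. It only illustrates serializing one nested field to a JSON string while leaving the other fields untouched, and the function name stringify_field is hypothetical.

```python
import json


def stringify_field(record, field):
    """Return a copy of record in which record[field] is replaced by its
    compact JSON string form. The input record is not modified."""
    out = dict(record)
    out[field] = json.dumps(record[field], separators=(",", ":"))
    return out


# The message from the question, as a Python dict.
message = {
    "timestamp": 1599493668741396400,
    "nested_field": {"name": "myname", "prop": 377.93887},
}

flat = stringify_field(message, "nested_field")
print(type(flat["nested_field"]).__name__)  # str
print(flat["nested_field"])
```

A record in this form has only primitive top-level values, so the string could then be cast to JSONB on the Postgres side.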

Script fields in nested objects, specifically geo_shapes

Part of my document mapping consists of the mapping below:
"locations": {
  "type": "nested",
  "properties": {
    "point": {
      "type": "geo_shape",
      "tree": "quadtree",
      "precision": "100m"
    }
  }
}
When I attempt to issue a script_field as part of a query, Elasticsearch returns an error:
failed to run inline script [doc['locations.point'].distanceInMiles(53.4791,-2.2441)] using lang [groovy]
With a reason of:
failed to find field data builder for field locations.point, and type geo_shape
I'm assuming this is because the field is nested: it has a few (geo) points inside the field, and the search matches on any one of them. Because it's nested, the context of the path locations.point is obviously wrong; it needs to be something like locations.point[10] (for the 11th one, perhaps; this depends on the context of the matched item in the query).
So, does anyone know a way to perform this properly? Is there a special operator I can tell the script so that it knows it needs to look at the matched point from the field?
Thanks in advance.
Turns out it's actually not possible to do this with geo_shape fields.

Query performance when applying the "Great mapping refactoring"

Our applications' entities are dynamic: we don't know how many properties they'll have or what their types will be.
Up until now, we've indexed our data in the following way:
{
  "message": "some string",
  "count": 1,
  "date": "2015-06-01"
}
After reading the following blog:
We've understood that it's better to index the data like this:
{
  "data": [
    {
      "key": "message",
      "str_val": "some_string"
    },
    {
      "key": "count",
      "int_val": 1
    },
    {
      "key": "date",
      "date_val": "2015-06-01"
    }
  ]
}
We were wondering how the index would work in terms of nested aggregations.
Will the mapping refactoring above hurt the indexing time (and/or the query/aggregation time), given that every entity will now be nested one level deeper?
We have thousands of different object types, hence our mapping file is huge. That slows down the indexing time, so a mapping refactoring is highly necessary.
Are you aware of any disadvantages when it comes to refactoring our mapping as explained in the blog above?
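The conversion from the flat document to the key/value form shown above could be sketched as follows. This is only an illustration under stated assumptions: it handles just the three value types from the example, it treats any ISO-formatted date string as a date, and the function names typed_field and to_key_value are hypothetical.

```python
from datetime import date


def typed_field(value):
    """Pick the typed value field name (int_val, date_val or str_val)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "int_val"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
            return "date_val"
        except ValueError:
            return "str_val"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def to_key_value(doc):
    """Turn {"name": value, ...} into {"data": [{"key": ..., "<type>_val": ...}]}."""
    return {
        "data": [{"key": key, typed_field(value): value}
                 for key, value in doc.items()]
    }


flat = {"message": "some string", "count": 1, "date": "2015-06-01"}
print(to_key_value(flat))
```

With this layout the mapping only needs the fixed fields key, str_val, int_val and date_val, regardless of how many distinct property names the entities have.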

Index main-object, sub-objects, and do a search on sub-objects (that returns sub-objects)

I have an object like this (simplified here). Each strain has many chromosomes, each of which has many loci, each of which has many features, each of which has many products, and so on. Here I show just one of each.
The structure in json is:
{
  "name": "my strain",
  "public": false,
  "authorized_users": [1, 23, 51],
  "chromosomes": [
    {
      "name": "C1",
      "locus": [
        {
          "name": "locus1",
          "features": [
            {
              "name": "feature1",
              "products": [
                {
                  "name": "product1"
                  //...
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
I want to add this object to Elasticsearch. For the moment I've added the sub-objects separately: locus, features and products. That works for searching (I want to type a keyword and look in the names of locus, features and products), but I need to duplicate data like public and authorized_users in each sub-object.
Can I index the whole object in Elasticsearch and search only at the locus, features and products levels? And get the results individually (without returning the Strain object)?
Yes, you can search at any level (i.e., with a query on a field like "chromosomes.locus.name").
But as you have arrays at each level, you will have to use nested objects (and nested query) to get exactly what you want, which is a bit more complex:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.3/query-dsl-nested-query.html
For your last question: no, you cannot get sub-objects individually; Elasticsearch returns the whole JSON source object.
If you want only data from subobjects, you will have to use nested aggregations.
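As a rough illustration of the nested query mentioned above, the request body for a search on locus names might be built like this in Python. It is a sketch under assumptions: both chromosomes and chromosomes.locus are assumed to be mapped with type nested, and the function name locus_name_query is hypothetical. The inner_hits entry is an optional setting of the nested query that asks Elasticsearch to include the matching nested objects with each hit, so check that your version supports it before relying on it.

```python
import json


def locus_name_query(keyword):
    """Build a search body matching strains that have a locus whose name
    matches keyword. Assumes chromosomes and chromosomes.locus are both
    mapped as nested (an assumption, not taken from the question)."""
    return {
        "query": {
            "nested": {
                "path": "chromosomes",
                "query": {
                    "nested": {
                        "path": "chromosomes.locus",
                        "query": {
                            "match": {"chromosomes.locus.name": keyword}
                        },
                        # Optional: also return the matching locus objects.
                        "inner_hits": {},
                    }
                },
            }
        }
    }


body = locus_name_query("locus1")
print(json.dumps(body, indent=2))
```

The same pattern extends one level deeper for features and products by adding another nested clause with the longer path.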
