can I filter an array in elastic? - elasticsearch

I had to insert a huge amount of data into Elasticsearch, and I did it in the following manner.
I need to query this object, but I am unable to filter the "logData" array. Can someone help me out here? Is it even possible to filter an array in Elasticsearch?
"_source":{
"FileName": "fileName.log",
"logData": [
{
"LineNumber": 1,
"Data": "data1"
},
{
"LineNumber": 2,
"Data": "Data2"
},
{
"LineNumber": 3,
"Data": "Data3"
},
{
"LineNumber": 4,
"Data": "Data4"
},
{
"LineNumber": 5,
"Data": "Data5"
},
{
"LineNumber": 6,
"Data": "Data6"
}
]}
Is there a way to query so that I get only a few items from this array, like this:
"_source":{
"FileName": "fileName.log",
"logData": [
{
"LineNumber": 1,
"Data": "data1"
},
{
"LineNumber": 2,
"Data": "Data2"
},
{
"LineNumber": 3,
"Data": "Data3"
}
]
}

There's no dedicated array mapping type in ES.
With that being said, when you have an array of objects with shared keys, it's recommended that you use the nested field type to preserve the connections of the individual sub-objects' attributes. If you don't use nested, the objects will be flattened which may lead to seemingly wrong query results.
As to the actual query -- assuming your mapping looks something like this:
PUT logs_index
{
"mappings": {
"properties": {
"logData": {
"type": "nested"
}
}
}
}
you'll need to filter the logData sub-documents of interest, perhaps with a terms query. Only then can you extract just those array objects that matched the query (LineNumber 1, 2, or 3).
The technique for that is called inner_hits:
POST logs_index/_search
{
"_source": ["FileName", "inner_hits.logData"],
"query": {
"nested": {
"path": "logData",
"query": {
"terms": {
"logData.LineNumber": [
1,
2,
3
]
}
},
"inner_hits": {}
}
}
}
Check this thread for more info.
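Once the search returns, the matched array items sit under each hit's inner_hits section rather than in _source. Here is a minimal Python sketch of pulling them out, assuming the usual response layout (hits.hits[].inner_hits.logData.hits.hits[]._source). The function name matched_log_lines and the sample response are made up for illustration and are not real server output.

```python
from typing import Any


def matched_log_lines(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one entry per hit, keeping only the logData items that matched."""
    results = []
    for hit in response["hits"]["hits"]:
        # inner_hits is keyed by the nested path unless a custom name was given.
        inner = hit.get("inner_hits", {}).get("logData", {})
        lines = [h["_source"] for h in inner.get("hits", {}).get("hits", [])]
        results.append({
            "FileName": hit["_source"].get("FileName"),
            "logData": lines,
        })
    return results


# Hand-written response fragment, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {"FileName": "fileName.log"},
                "inner_hits": {
                    "logData": {
                        "hits": {
                            "hits": [
                                {"_source": {"LineNumber": 1, "Data": "data1"}},
                                {"_source": {"LineNumber": 2, "Data": "Data2"}},
                                {"_source": {"LineNumber": 3, "Data": "Data3"}},
                            ]
                        }
                    }
                },
            }
        ]
    }
}

filtered = matched_log_lines(sample_response)
```

Note that inner_hits returns only the top three matching sub-documents by default, so you would likely need to set "size" inside "inner_hits" when more lines can match.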

Related

Filter documents out of the facet count in enterprise search

We use enterprise search indexes to store items that can be tagged by multiple tenants.
e.g
[
{
"id": 1,
"name": "document 1",
"tags": [
{ "company_id": 1, "tag_id": 1, "tag_name": "bla" },
{ "company_id": 2, "tag_id": 1, "tag_name": "bla" }
]
}
]
I'm looking for a way to retrieve all documents with only the tags of company 1.
This request:
{
"query": "",
"facets": {
"tags": {
"type": "value"
}
},
"sort": {
"created": "desc"
},
"page": {
"size": 20,
"current": 1
}
}
comes back with:
...
"facets": {
"tags": [
{
"type": "value",
"data": [
{
"value": "{\"company_id\":1,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
},
{
"value": "{\"company_id\":2,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
}
]
}
],
}
...
Can I modify the request so that I get no tags with "company_id" = 2?
I have a solution that strips the extra data from the results after they are retrieved, but I'm looking for a better one.

Sorting by a nested field in elasticsearch

If I had a data structure that looked like this:
[{"_id": 1,
"scores": [{"student_id": 1, "score": 100}, {"student_id": 2, "score": 80}
]},
{"_id": 2,
"scores": [{"student_id": 1, "score": 20}, {"student_id": 2, "score": 90}
]}]
Would it be possible to sort this dataset by student_1's score or by student_2's score?
For example, if I sorted descending by student 1's score, I would get documents 1, 2, but if I sorted descending by student 2's score, I would get 2, 1.
I could re-arrange the data, but I don't want to use another index because there's a bunch of metadata not included above for brevity. Thanks!
Yes, it is possible. You must use the "nested" field type for your scores; that way you keep the relation between each student_id and its score.
You can read an article I wrote about that subject:
https://opster.com/guides/elasticsearch/data-architecture/elasticsearch-nested-field-object-field/
Now the example:
Mappings
PUT test_students
{
"mappings": {
"properties": {
"scores": {
"type": "nested",
"properties": {
"student_id": {
"type": "keyword"
},
"score": {
"type": "long"
}
}
}
}
}
}
Documents
PUT test_students/_doc/1
{
"scores": [{"student_id": 1, "score": 100}, {"student_id": 2, "score": 80}]
}
PUT test_students/_doc/2
{
"scores": [{"student_id": 1, "score": 20}, {"student_id": 2, "score": 90}]
}
Query
POST test_students/_search
{
"sort" : [
{
"scores.score" : {
"mode" : "max",
"order" : "desc",
"nested": {
"path": "scores",
"filter": {
"term" : { "scores.student_id" : "2" }
}
}
}
}
]
}
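To make the semantics of that sort clause concrete, here is a small in-memory Python imitation: keep only the nested objects that pass the filter, reduce them with "max", then order the documents. It is a sketch of the behaviour, not how Elasticsearch implements it, and the function name sort_by_student_score is invented for this example.

```python
def sort_by_student_score(docs, student_id, descending=True):
    """Imitate a nested sort: filter scores by student_id, take the max, order docs."""
    # Documents with no matching nested object are pushed to the end.
    missing = float("-inf") if descending else float("inf")

    def sort_key(doc):
        matching = [
            s["score"]
            for s in doc["scores"]
            # student_id is mapped as keyword, so compare as strings.
            if str(s["student_id"]) == str(student_id)
        ]
        return max(matching) if matching else missing

    return [d["_id"] for d in sorted(docs, key=sort_key, reverse=descending)]


docs = [
    {"_id": 1, "scores": [{"student_id": 1, "score": 100}, {"student_id": 2, "score": 80}]},
    {"_id": 2, "scores": [{"student_id": 1, "score": 20}, {"student_id": 2, "score": 90}]},
]

by_student_1 = sort_by_student_score(docs, 1)  # [1, 2]
by_student_2 = sort_by_student_score(docs, 2)  # [2, 1]
```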

Group by terms and get count of nested array property?

I would like to get the count from a document series where an array item matches some value.
I have documents like these:
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 10
},{
"State": "PENDING",
"Timer": 5
}]
}
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 5
},{
"State": "PENDING",
"Timer": 2
}]
}
{
"Name": "martin",
"Todos": [{
"State": "COMPLETED",
"Timer": 15
},{
"State": "PENDING",
"Timer": 10
}]
}
I would like to count how many documents have at least one Todo in the COMPLETED state, grouped by Name.
So from the above I would need to get:
jason: 2
martin: 1
Usually I do this with a terms aggregation on Name and another sub-aggregation for the other items:
"aggs": {
"statistics": {
"terms": {
"field": "Name"
},
"aggs": {
"test": {
"filter": {
"bool": {
"must": [{
"match_phrase": {
"SomeProperty.keyword": {
"query": "THEVALUE"
}
}
}
]
}
},
But not sure how to do this here as I have items in an array.
Elasticsearch has no problem with arrays because in fact it flattens them by default:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So a query like the one you posted will do. I would use a term query for the keyword datatype, though:
POST mytodos/_search
{
"size": 0,
"aggs": {
"by name": {
"terms": {
"field": "Name"
},
"aggs": {
"how many completed": {
"filter": {
"term": {
"Todos.State": "COMPLETED"
}
}
}
}
}
}
}
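What this aggregation computes can be checked against the example documents with a few lines of plain Python. This is only an in-memory sketch of the counting logic (the function name count_completed_by_name is made up). Unlike the real terms aggregation, it omits names whose count would be zero.

```python
from collections import Counter


def count_completed_by_name(docs):
    """Count documents per Name that have at least one COMPLETED todo."""
    counts = Counter()
    for doc in docs:
        # A document counts once, however many COMPLETED todos it holds.
        if any(todo["State"] == "COMPLETED" for todo in doc["Todos"]):
            counts[doc["Name"]] += 1
    return dict(counts)


todo_docs = [
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 10}, {"State": "PENDING", "Timer": 5}]},
    {"Name": "jason", "Todos": [{"State": "COMPLETED", "Timer": 5}, {"State": "PENDING", "Timer": 2}]},
    {"Name": "martin", "Todos": [{"State": "COMPLETED", "Timer": 15}, {"State": "PENDING", "Timer": 10}]},
]

completed_counts = count_completed_by_name(todo_docs)  # {"jason": 2, "martin": 1}
```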
I am assuming your mapping looks something like this:
PUT mytodos/_mappings
{
"properties": {
"Name": {
"type": "keyword"
},
"Todos": {
"properties": {
"State": {
"type": "keyword"
},
"Timer": {
"type": "integer"
}
}
}
}
}
The example documents that you posted will be transformed internally into something like this:
{
"Name": "jason",
"Todos.State": ["COMPLETED", "PENDING"],
"Todos.Timer": [10, 5]
}
However, if you need to query on Todos.State and Todos.Timer together (for example, filter for todos that are "COMPLETED" and also have Timer > 10), that is not possible with such a mapping, because Elasticsearch forgets the link between the fields of each object in the array.
In this case you would need to use the nested datatype for such arrays and query them with the dedicated nested query.
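The lost link between fields is easy to reproduce outside Elasticsearch. The Python sketch below compares a "flattened" match (each field checked against its own list of values) with a "nested" match (both conditions checked on the same array item). Both function names are invented for illustration, and the sketch mimics the matching semantics only.

```python
def flattened_match(doc, state, min_timer):
    """Object mapping: each field becomes an independent list of values."""
    states = [t["State"] for t in doc["Todos"]]
    timers = [t["Timer"] for t in doc["Todos"]]
    # The two conditions may be satisfied by different array items.
    return state in states and any(t >= min_timer for t in timers)


def nested_match(doc, state, min_timer):
    """Nested mapping: both conditions must hold on the same array item."""
    return any(
        t["State"] == state and t["Timer"] >= min_timer for t in doc["Todos"]
    )


doc = {
    "Name": "jason",
    "Todos": [
        {"State": "COMPLETED", "Timer": 10},
        {"State": "PENDING", "Timer": 5},
    ],
}

# No single todo is PENDING with Timer >= 10, yet the flattened form matches.
flat_result = flattened_match(doc, "PENDING", 10)    # True
nested_result = nested_match(doc, "PENDING", 10)     # False
```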
Hope that helps!

Elasticsearch query fails to return results when querying a nested object

I have an object which looks something like this:
{
"id": 123,
"language_id": 1,
"label": "Pablo de la Pena",
"office": {
"count": 2,
"data": [
{
"id": 1234,
"is_office_lead": false,
"office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
},
{
"id": 5678,
"is_office_lead": false,
"office": {
"id": 2,
"address_line_1": "77 High Road",
"address_line_2": "Edinburgh",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "EH1 2DE",
"city_id": 2
}
}
]
},
"primary_office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
}
My Elasticsearch mapping looks like this:
"mappings": {
"item": {
"properties": {
"office": {
"properties": {
"data": {
"type": "nested"
}
}
}
}
}
}
My Elasticsearch query looks something like this:
GET consultant/item/_search
{
"from": 0,
"size": 24,
"query": {
"bool": {
"must": [
{
"term": {
"language_id": 1
}
},
{
"term": {
"office.data.office.city_id": 1
}
}
]
}
}
}
This returns zero results. However, if I remove the second term and leave only the language_id clause, it works as expected.
I'm sure this is down to a misunderstanding on my part of how the nested object is flattened, but I'm out of ideas. I've tried all kinds of permutations of the query and mappings.
Any guidance is hugely appreciated. I am using Elasticsearch 6.1.1.
I'm not sure whether you need the entire record or not. This solution returns every record that has language_id: 1 and an office.data.office.city_id: 1 value.
GET consultant/item/_search
{
"from": 0,
"size": 100,
"query": {
"bool":{
"must": [
{
"term": {
"language_id": {
"value": 1
}
}
},
{
"nested": {
"path": "office.data",
"query": {
"match": {
"office.data.office.city_id": 1
}
}
}
}
]
}
}
}
I put 3 different records in my test index to guard against false hits: one with a different language_id, one with different office ids, and one that matches. Only the matching one was returned.
If you only need the office data, then that's a bit different but still solvable.
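Because office.data is mapped as nested, any condition on a field below it has to be wrapped in a nested clause with the matching path, while top-level fields such as language_id stay outside. Here is a minimal Python sketch that builds the request body from the answer above as a plain dict. The helper name nested_term_search is hypothetical, and sending the body to a cluster is left out.

```python
import json


def nested_term_search(language_id, city_id, size=24):
    """Build a search body: top-level term plus a nested clause for office.data."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": [
                    # Top-level field: a plain term query works.
                    {"term": {"language_id": language_id}},
                    # Field under the nested path: must be wrapped in "nested".
                    {
                        "nested": {
                            "path": "office.data",
                            "query": {
                                "term": {"office.data.office.city_id": city_id}
                            },
                        }
                    },
                ]
            }
        },
    }


body = nested_term_search(language_id=1, city_id=1)
print(json.dumps(body, indent=2))
```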

elasticsearch sort nested_filter not matching nested_path

I have an index with items of this form
{
"_index": "identity-index",
"_source": {
"names": [
"test"
],
"private": {
"lists": [
{
"listId": "56b8a0197f3c56654f8751b5",
"ratings": [
{
"rating": 4,
"authorId": "56499b7a97e3aa857cdc4f1d"
},
{
"rating": 4,
"authorId": "56b36646a24d50866de77928"
},
{
"rating": 4,
"authorId": "56cb16005082871b33ab1a60"
},
{
"rating": 4,
"authorId": "56b216a4c28edca956fe96d4"
},
{
"rating": 4,
"authorId": "56b34e8d8e324180259252f7"
}
]
},
{
"listId": "56c1c508da49cdd9662b102c"
}
]
}
},
"sort": [
"-Infinity"
]
}
I want to sort them by average rating for a given listId.
I've tried a lot of ways, and the closest I got was this:
"sort": {
"private.lists.ratings.rating": {
"missing": "_last",
"order": "desc",
"mode": "avg",
"nested_path": "private.lists.ratings",
"nested_filter": {
"term": {
"private.lists.listId": "56c1c508da49cdd9662b102c"
}
}
}
},
The problem is that this scores everything as -Inf. I can't find any way to sort the nested elements in private.lists.ratings but taking into account the filter by private.lists.listId. The nested_path and nested_filter fields are different and I don't think they are supposed to be.
If the ratings field is mapped with type nested, you can get what you want by copying the listId into each of the nested objects.
Unfortunately, nested objects are not part of the main document, and nested_filter (and nested_sort) can only disambiguate based on properties contained in each subdocument.
One solution could be to flatten your structure into a simple list of objects that look like the following:
{
"listId": "56b8a0197f3c56654f8751b5",
"rating": 4,
"authorId": "56499b7a97e3aa857cdc4f1d"
}
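The flattening could be done at index time with a small transformation like the Python sketch below, which copies each list's listId into its ratings so that a filter and the averaged field live in the same sub-document. The function names flatten_ratings and average_rating are illustrative only, and the sample source is a shortened copy of the document from the question.

```python
def flatten_ratings(source):
    """Copy each list's listId into its ratings, giving one flat list of objects."""
    rows = []
    for lst in source.get("private", {}).get("lists", []):
        # Lists without a ratings array simply contribute no rows.
        for rating in lst.get("ratings", []):
            rows.append({"listId": lst["listId"], **rating})
    return rows


def average_rating(rows, list_id):
    """Average rating for one listId, or None when the list has no ratings."""
    values = [r["rating"] for r in rows if r["listId"] == list_id]
    return sum(values) / len(values) if values else None


source = {
    "names": ["test"],
    "private": {
        "lists": [
            {
                "listId": "56b8a0197f3c56654f8751b5",
                "ratings": [
                    {"rating": 4, "authorId": "56499b7a97e3aa857cdc4f1d"},
                    {"rating": 4, "authorId": "56b36646a24d50866de77928"},
                ],
            },
            {"listId": "56c1c508da49cdd9662b102c"},
        ]
    },
}

rows = flatten_ratings(source)
rated = average_rating(rows, "56b8a0197f3c56654f8751b5")    # 4.0
unrated = average_rating(rows, "56c1c508da49cdd9662b102c")  # None
```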
