Elasticsearch - Unable to filter on an array of strings

I have the following model class:
public class NewsItem
{
public String Language { get; set; }
public DateTime DateUpdated { get; set; }
public List<String> Tags { get; set; }
}
I index it with NEST using the automapping, resulting in the mapping below:
{
"search": {
"mappings": {
"news": {
"properties": {
"dateUpdated": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"language": {
"type": "string"
},
"tags": {
"type": "string"
}
}
}
}
}
}
I then run a query on language which works fine:
{
"query": {
"constant_score": {
"filter": [
{
"terms": {
"language": [
"en"
]
}
}
]
}
},
"sort": {
"dateUpdated": {
"order": "desc"
}
}
}
But running the same query on the tags property doesn't work. Are there any special tricks for querying an array field? I have read the docs again and again and I don't understand why this query gives no results:
{
"query": {
"constant_score": {
"filter": [
{
"terms": {
"tags": [
"Hillary"
]
}
}
]
}
},
"sort": {
"dateUpdated": {
"order": "desc"
}
}
}
The document returned from another query:
{
"_index": "search",
"_type": "news",
"_score": 0.12265198,
"_source": {
"tags": [
"Hillary"
],
"language": "en",
"dateUpdated": "2016-11-07T15:41:00Z"
}
}

Your tags field is analyzed, hence Hillary has been indexed as hillary. So you have two ways out:
A. Use a match query instead (since the terms query does not analyze the token):
{
"query": {
"bool": {
"filter": [
{
"match": { <--- use match here
"tags": "Hillary"
}
}
]
}
},
"sort": {
"dateUpdated": {
"order": "desc"
}
}
}
B. Keep the terms query but lowercase the token:
{
"query": {
"bool": {
"filter": [
{
"terms": {
"tags": [
"hillary" <--- lowercase here
]
}
}
]
}
},
"sort": {
"dateUpdated": {
"order": "desc"
}
}
}

By default, Elasticsearch runs an analyzer on all string fields, whereas the terms filter computes an exact match. This means ES stores 'Hillary' as 'hillary' while you are querying for 'Hillary'. There are two ways to fix this: either use a match query instead of a terms query, or don't automap and instead create the index yourself and analyze the tags field as you want. You can also query for 'hillary', but that only solves this one case: if a tag were something like 'us elections', 'us' and 'elections' would be stored as separate tokens.
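To make the difference concrete, here is a minimal Python sketch of the behaviour described above. It is not Elasticsearch code: `standard_analyze` is a hypothetical stand-in that only splits on non-word characters and lowercases, which roughly approximates what the default analyzer does to simple tags like these.

```python
import re

def standard_analyze(text):
    """Rough stand-in for the default analyzer: split on non-word characters, lowercase."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]

def index_field(values):
    """Collect the tokens that would end up in the index for an array of strings."""
    tokens = set()
    for value in values:
        tokens.update(standard_analyze(value))
    return tokens

def terms_query(indexed_tokens, terms):
    """terms query: the input is compared to the indexed tokens as-is, without analysis."""
    return any(term in indexed_tokens for term in terms)

def match_query(indexed_tokens, text):
    """match query: the input is analyzed first, then compared (OR between tokens)."""
    return any(tok in indexed_tokens for tok in standard_analyze(text))

indexed = index_field(["Hillary", "us elections"])
print(sorted(indexed))                         # ['elections', 'hillary', 'us']
print(terms_query(indexed, ["Hillary"]))       # False: no token 'Hillary' exists
print(terms_query(indexed, ["hillary"]))       # True
print(match_query(indexed, "Hillary"))         # True: input is lowercased first
print(terms_query(indexed, ["us elections"]))  # False: the tag was split in two
```

The last line is the reason lowercasing the query term is only a partial fix: a multi-word tag never exists as a single token in an analyzed field.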

Related

elasticsearch need to add a must to a bool should query

I have the following query that works as expected:
GET <index_name>/_search
{
"sort": [
{
"irFileCreateTime": {
"order": "desc"
}
}
],
"query": {
"bool": {
"should": [
{
"match": {
"fileId": 46704
}
},
{
"match": {
"fileId": 46706
}
},
{
"match": {
"fileId": 46719
}
}
]
}
}
}
The problem is that I need to further filter the data, but the field I need to filter on is a text field. I have tried many different ways of putting a must match into my query but everything is either malformed or filters out all hits when I know it should only filter out half. How can I add a must match "irStatus":"COMPLETE" to this query? Thanks in advance.
What you're after is a term query on, preferably, the keyword of irStatus. That is to say:
GET index/_search
{
"sort": [
{
"irFileCreateTime": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"term": {
"irStatus.keyword": {
"value": "COMPLETE"
}
}
}
],
"should": [
{
"match": {
"fileId": 46704
}
},
{
"match": {
"fileId": 46706
}
},
{
"match": {
"fileId": 46719
}
}
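],
"minimum_should_match": 1 <--- without this, the should clauses become optional once a must clause is present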
}
}
}
Assuming your mapping looks something like this:
{
"mappings": {
"properties": {
"irFileCreateTime": {
"type": "date"
},
"fileId": {
"type": "integer"
},
"irStatus": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
The reason it's apparently failing on your end is that "COMPLETE" has been lowercased by the standard analyzer.
Alternatively, you could do:
{
"must":[
{
"query_string":{
"query":"irStatus:COMPLETE AND (fileId:(46704 OR 46706 OR 46719))"
}
}
]
}
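One thing worth checking when combining must and should is how many should clauses have to match. The following Python sketch is a simplified model of the bool query, not Elasticsearch itself: `bool_matches` and the three sample documents are made up for illustration. It encodes the documented default that should clauses are optional when a must (or filter) clause is present, and that at least one is required otherwise.

```python
def bool_matches(doc, must=(), should=(), minimum_should_match=None):
    """Evaluate a simplified bool query; each clause is a predicate on the document."""
    if minimum_should_match is None:
        # Default: 0 when a must clause is present, 1 when there are only should clauses.
        minimum_should_match = 0 if must else (1 if should else 0)
    if not all(clause(doc) for clause in must):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match

# Hypothetical documents, only for this illustration.
docs = [
    {"fileId": 46704, "irStatus": "COMPLETE"},
    {"fileId": 46706, "irStatus": "FAILED"},
    {"fileId": 99999, "irStatus": "COMPLETE"},
]

must = [lambda d: d["irStatus"] == "COMPLETE"]
should = [lambda d, fid=fid: d["fileId"] == fid for fid in (46704, 46706, 46719)]

default_hits = [d["fileId"] for d in docs if bool_matches(d, must, should)]
strict_hits = [d["fileId"] for d in docs
               if bool_matches(d, must, should, minimum_should_match=1)]

print(default_hits)  # [46704, 99999]: the should clauses only affect scoring
print(strict_hits)   # [46704]
```

In the query_string variant the same requirement is expressed by the explicit AND between the status and the fileId group.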

Querying multiple indices but limit the results from a single index - Elasticsearch 6.x

All, I am using ES 6.7 and trying to return results from a single index while querying two indices (customer, payment) and doing a terms lookup against the user-customers index. The index I want data from (customer) has more fields than the second index, but for some reason I only see results from the payment index. The fields customerName, customerNumber, state and address only exist on the customer index. However, I only want customers that have totalCredits > 0 (this field exists only on the payment index), ordered by the logic in the sort array. I tried adding an _index filter (set to customer), but it didn't help. Adding source filtering doesn't help either. Is this doable in ES 6.7? Am I left with the option of adding the fields in the sort array to the payment index, or are there other options?
ES query
GET customer,payment/_search
{
"sort": [
{
"customerName": {
"order": "asc",
"unmapped_type": "keyword"
}
},
{
"customerNumber": {
"order": "asc"
}
},
{
"state": {
"order": "asc",
"unmapped_type": "keyword"
}
},
{
"address": {
"order": "asc",
"unmapped_type": "keyword"
}
}
],
"query": {
"bool": {
"filter": [
{
"bool": {
"must_not": [
{
"terms": {
"status": [
"pending"
]
}
}
]
}
}
],
"must": [
{
"terms": {
"customerNumber": {
"index": "user-customers",
"type": "_doc",
"id": "rennish#emial.com",
"path": "users"
}
}
},
{
"range": {
"totalCredits": {
"gt": 0
}
}
}
]
}
}
}

Filter and sort based on attributes in Terms lookup document in Elastic Search

I have some documents in my index:
POST "/index/thing/_bulk" -s -d'
{ "index":{ "_id": 1 } }
{ "title":"One thing"}
{ "index":{ "_id": 2 } }
{ "title":"Second thing"}
{ "index":{ "_id": 3 } }
{ "title":"Three things"}
{ "index":{ "_id": 4 } }
{ "title":"And so fourth"}
{ "index":{ "_id": 5 } }
{ "title":"Five things"}
'
I also have documents that contain a user's collection, which is linked to the other documents (things) through the document's id attribute, like so:
PUT /index/collection/1
{
"items": [
{"id": 1, "time_added": "2017-08-07T09:07:15.000Z", "condition": "fair"},
{"id": 3, "time_added": "2019-08-07T09:07:15.000Z", "condition": "good"},
{"id": 4, "time_added": "2016-08-07T09:07:15.000Z", "condition": "poor"}
]
}
I then use a terms lookup to get all the things in a user's collection, like so:
GET /documents/_search
{
"query" : {
"terms" : {
"_id" : {
"index" : "index",
"type" : "collection",
"id" : 1,
"path" : "items.id"
}
}
}
}
This works fine. I get the three documents in the collection and can search, sort and use aggregations like I want.
But is there a way to aggregate, filter and sort those documents based on the attributes (time_added or condition in this case) in the collection document? Say I wanted to sort based on time_added or filter for condition=="good" from the collection?
Maybe a script that can be applied to the collection to sort or filter the items in there? It feels like this is getting pretty close to a SQL-like left join, so maybe Elasticsearch is the wrong tool?
It looks like you need the nested data type
Taking your data as an example:
Without nested type:
POST collection/_bulk?filter_path=_
{"index":{}}
{"items":[{"id":11,"time_added":"2017-08-07T09:07:15.000Z","condition":"fair"},{"id":13,"time_added":"2019-08-07T09:07:15.000Z","condition":"good"},{"id":14,"time_added":"2016-08-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":21,"time_added":"2017-09-07T09:07:15.000Z","condition":"fair"},{"id":23,"time_added":"2019-09-07T09:07:15.000Z","condition":"good"},{"id":24,"time_added":"2016-09-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":31,"time_added":"2017-10-07T09:07:15.000Z","condition":"fair"},{"id":33,"time_added":"2019-10-07T09:07:15.000Z","condition":"good"},{"id":34,"time_added":"2016-10-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":41,"time_added":"2017-11-07T09:07:15.000Z","condition":"fair"},{"id":43,"time_added":"2019-11-07T09:07:15.000Z","condition":"good"},{"id":44,"time_added":"2016-11-07T09:07:15.000Z","condition":"poor"}]}
{"index":{}}
{"items":[{"id":51,"time_added":"2017-12-07T09:07:15.000Z","condition":"fair"},{"id":53,"time_added":"2019-12-07T09:07:15.000Z","condition":"good"},{"id":54,"time_added":"2016-12-07T09:07:15.000Z","condition":"poor"}]}
Query (you'd get incorrect results - expected one, got five):
GET collection/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"items.condition": {
"value": "good"
}
}
},
{
"range": {
"items.time_added": {
"lte": "2019-09-01"
}
}
}
]
}
}
}
Aggregation (incorrect results: look at the first bucket, "2016-08-01T00:00:00.000Z"; it contains 3 CONDITION sub-buckets, one for every condition type):
GET collection/_search
{
"size": 0,
"aggs": {
"DATE": {
"date_histogram": {
"field": "items.time_added",
"calendar_interval": "month"
},
"aggs": {
"CONDITION": {
"terms": {
"field": "items.condition.keyword",
"size": 10
}
}
}
}
}
}
With nested type
DELETE collection
PUT collection
{
"mappings": {
"properties": {
"items": {
"type": "nested"
}
}
}
}
# and POST the same data from above
Query (returns just one result)
GET collection/_search
{
"query": {
"nested": {
"path": "items",
"query": {
"bool": {
"must": [
{
"term": {
"items.condition": {
"value": "good"
}
}
},
{
"range": {
"items.time_added": {
"lte": "2019-09-01"
}
}
}
]
}
}
}
}
}
Aggregation (the first date bucket contains just one CONDITION sub-bucket)
GET collection/_search
{
"size": 0,
"aggs": {
"ITEMS": {
"nested": {
"path": "items"
},
"aggs": {
"DATE": {
"date_histogram": {
"field": "items.time_added",
"calendar_interval": "month"
},
"aggs": {
"CONDITION": {
"terms": {
"field": "items.condition.keyword",
"size": 10
}
}
}
}
}
}
}
}
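If it is unclear why the non-nested query matches every document, this small Python sketch may help. It is only a model of the behaviour, not Elasticsearch code: `matches_flattened` mimics how an array of plain objects is indexed as independent per-field value lists, while `matches_nested` checks both conditions against the same item. The two sample documents are the first two from the bulk request above, with the dates shortened.

```python
CUTOFF = "2019-09-01"  # ISO dates compare correctly as strings

doc_a = {"items": [
    {"id": 11, "time_added": "2017-08-07", "condition": "fair"},
    {"id": 13, "time_added": "2019-08-07", "condition": "good"},
    {"id": 14, "time_added": "2016-08-07", "condition": "poor"},
]}
doc_b = {"items": [
    {"id": 21, "time_added": "2017-09-07", "condition": "fair"},
    {"id": 23, "time_added": "2019-09-07", "condition": "good"},
    {"id": 24, "time_added": "2016-09-07", "condition": "poor"},
]}

def matches_flattened(doc):
    """Object arrays are flattened: which condition belongs to which date is lost."""
    conditions = [item["condition"] for item in doc["items"]]
    dates = [item["time_added"] for item in doc["items"]]
    return "good" in conditions and any(d <= CUTOFF for d in dates)

def matches_nested(doc):
    """Nested type: both conditions must hold for the same item."""
    return any(
        item["condition"] == "good" and item["time_added"] <= CUTOFF
        for item in doc["items"]
    )

print(matches_flattened(doc_a), matches_nested(doc_a))  # True True
print(matches_flattened(doc_b), matches_nested(doc_b))  # True False
```

doc_b matches the flattened query only because its "good" item and its older "fair" and "poor" items each satisfy one half of the condition.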
Hope that helps :)

Distinct values from array-field matching filter in Elasticsearch 2.4

In short: I want to look up distinct values in some field of the document, BUT only those matching some filter. The problem is with array fields.
Imagine there are the following documents in ES 2.4:
[
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
I'd like my users to be able to look up all possible states via typeahead, so I have the following query for the user input "wa":
{
"query": {
"wildcard": {
"states.raw": "*wa*"
}
},
"aggregations": {
"typed": {
"terms": {
"field": "states.raw"
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
states.raw is a sub-field with the not_analyzed option.
This query works pretty well unless I have an array of values like in the example, in which case it returns both Washington and California. I do understand why it happens (the query and aggregations work on whole documents, and the document contains both values even though only one of them matched the filter), but I really want to see only Washington and don't want to add another layer of filtering on the application side for the ES results.
Is there a way to do so via a single ES 2.4 request?
You could use the "Filtering Values" feature (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2).
So, your request could look like:
POST /index/collection/_search?size=0
{
"aggregations": {
"typed": {
"terms": {
"field": "states.raw",
"include": ".*wa.*" // You need to carefully escape the "wa" string because it will be used as part of a regular expression
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
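Since the user's input ends up inside a regular expression, it has to be escaped before it is placed in "include". Below is a minimal Python sketch of that step. The helper names are hypothetical, and the set of reserved characters is taken from my reading of the Elasticsearch regexp syntax documentation, so please verify it against the version you run.

```python
# Characters treated as reserved by the Lucene regexp syntax (assumption: check your version).
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')

def escape_lucene_regex(text):
    """Backslash-escape every reserved character so the input is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)

def typeahead_aggregation(user_input):
    """Build the request body with the escaped input wrapped in .* on both sides."""
    pattern = ".*" + escape_lucene_regex(user_input) + ".*"
    return {
        "size": 0,
        "aggregations": {
            "typed": {
                "terms": {"field": "states.raw", "include": pattern}
            }
        },
    }

print(escape_lucene_regex("wa"))       # wa
print(escape_lucene_regex("(US-WA)"))  # \(US-WA\)
print(typeahead_aggregation("wa")["aggregations"]["typed"]["terms"]["include"])  # .*wa.*
```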
That said, I can't resist pointing out that a wildcard query with a leading wildcard is not the best solution. Please do consider using ngrams for this:
PUT states
{
"settings": {
"analysis": {
"filter": {
"ngrams": {
"type": "nGram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"ngrams"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"states": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"ngrams": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
}
}
POST states/doc/1
{
"text":"bla1",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
}
POST states/doc/2
{
"text":"bla2",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
}
]
}
POST states/doc/3
{
"text":"bla3",
"location": [
{
"states": [
"California (US-CA)"
]
},
{
"states": [
"Illinois (US-IL)"
]
}
]
}
And the final query:
GET states/_search
{
"query": {
"term": {
"location.states.ngrams": {
"value": "sh"
}
}
},
"aggregations": {
"filtering_states": {
"terms": {
"field": "location.states.raw",
"include": ".*sh.*"
},
"aggs": {
"typed_hits": {
"top_hits": {
"_source": {
"includes": [
"location.states"
]
}
}
}
}
}
}
}
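To see why the term query on the ngrams sub-field finds "sh", here is a rough Python sketch of what the analyzer configured above produces. It is an approximation, not the real analysis chain: `ngram_tokens` is a hypothetical helper that splits on non-word characters, lowercases, and then emits every substring between min_gram and max_gram characters per token.

```python
import re

def ngram_tokens(text, min_gram=2, max_gram=20):
    """Approximate tokenizer + lowercase + nGram filter from the settings above."""
    grams = set()
    for token in re.findall(r"\w+", text.lower()):
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            for start in range(len(token) - size + 1):
                grams.add(token[start:start + size])
    return grams

washington = ngram_tokens("Washington (US-WA)")
california = ngram_tokens("California (US-CA)")

print("sh" in washington)  # True: wa-SH-ington
print("sh" in california)  # False
print("wa" in washington)  # True
```

Because the grams are computed at index time, the search becomes an exact term lookup rather than a scan over every term, which is the point of preferring ngrams to a leading wildcard.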

ElasticSearch - Get only matching nested objects with All Top level fields in search response

Let's say I have the following document:
{
id: 1,
name: "xyz",
users: [
{
name: 'abc',
surname: 'def'
},
{
name: 'xyz',
surname: 'wef'
},
{
name: 'defg',
surname: 'pqr'
}
]
}
I want to get only the matching nested objects, together with all top-level fields, in the search response.
That is, if I search/filter for users with the name 'abc', I want the response below:
{
id: 1,
name: "xyz",
users: [
{
name: 'abc',
surname: 'def'
}
]
}
How can I do that?
Reference : select matching objects from array in elasticsearch
If you're ok with having all root fields except the nested one and then only the matching inner hits in the nested field, then we can re-use the previous answer like this by specifying a slightly more involved source filtering parameter:
{
"_source": {
"includes": [ "*" ],
"excludes": [ "users" ]
},
"query": {
"nested": {
"path": "users",
"inner_hits": { <---- this is where the magic happens
"_source": [
"name", "surname"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"users.name": "abc"
}
}
]
}
}
}
}
}
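With this request the matching users come back under inner_hits rather than in _source, so the shape the question asks for has to be assembled on the client. Here is a minimal Python sketch of that step. The `merge_inner_hits` helper is hypothetical, and `sample_hit` is a hand-written, trimmed example of a single search hit, not real output.

```python
def merge_inner_hits(hit, nested_field):
    """Combine the root _source with the _source of each matching inner hit."""
    merged = dict(hit["_source"])
    inner = (
        hit.get("inner_hits", {})
        .get(nested_field, {})
        .get("hits", {})
        .get("hits", [])
    )
    merged[nested_field] = [inner_hit["_source"] for inner_hit in inner]
    return merged

# Hand-written example of one hit, trimmed to the relevant keys.
sample_hit = {
    "_id": "1",
    "_source": {"id": 1, "name": "xyz"},
    "inner_hits": {
        "users": {
            "hits": {
                "hits": [
                    {
                        "_nested": {"field": "users", "offset": 0},
                        "_source": {"name": "abc", "surname": "def"},
                    }
                ]
            }
        }
    },
}

print(merge_inner_hits(sample_hit, "users"))
# {'id': 1, 'name': 'xyz', 'users': [{'name': 'abc', 'surname': 'def'}]}
```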
Maybe late, but I use nested sorting to limit the elements of my nested relation. Here is an example:
"sort": {
"ouverture.periodesOuvertures.dateDebut": {
"order": "asc",
"mode": "min",
"nested_filter": {
"range": {
"ouverture.periodesOuvertures.dateFin": {
"gte": "2017-08-29",
"format": "yyyy-MM-dd"
}
}
},
"nested_path": "ouverture.periodesOuvertures"
}
},
Since ES 5.5 (I think) you can use a filter in a nested query.
Here is an example of the nested query filter I use:
{
"nested": {
"path": "ouverture.periodesOuvertures",
"query": {
"bool": {
"must": [
{
"range": {
"ouverture.periodesOuvertures.dateFin": {
"gte": "2017-08-29",
"format": "yyyy-MM-dd"
}
}
},
{
"range": {
"ouverture.periodesOuvertures.dateFin": {
"lte": "2017-09-30",
"format": "yyyy-MM-dd"
}
}
}
],
"filter": [
{
"range": {
"ouverture.periodesOuvertures.dateFin": {
"gte": "2017-08-29",
"format": "yyyy-MM-dd"
}
}
},
{
"range": {
"ouverture.periodesOuvertures.dateFin": {
"lte": "2017-09-30",
"format": "yyyy-MM-dd"
}
}
}
]
}
}
}
}
Hope this can help ;)
Plus, if your ES is not on the latest version (5.5), inner_hits could slow your query down; see "Including inner hits drastically slows down query results".
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/search-request-inner-hits.html#nested-inner-hits-source
"inner_hits": {
"_source" : false,
"stored_fields" : ["name", "surname"]
}
but you may need to change the mapping so that those fields are stored; otherwise you can use
"inner_hits": {}
to get a result that is not quite as tidy.
You can make such a request, but the response will have internal fields starting with _
{
"_source": {
"includes": [ "*" ],
"excludes": [ "users" ]
},
"query": {
"nested": {
"path": "users",
"inner_hits": {},
"query": {
"bool": {
"must": [
{ "match": { "users.name": "abc" }}
]
}
}
}
}
}
In one of my projects, I needed to retrieve the unique message texts of a conversation (inner fields like messages.text) having specific tags. So instead of using inner_hits, I used an aggregation like the one below:
final NestedAggregationBuilder aggregation = AggregationBuilders.nested("parentPath", "messages").subAggregation(AggregationBuilders.terms("innerPath").field("messages.tag"));
final NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
.addAggregation(aggregation).build();
final Aggregations aggregations = elasticsearchOperations.search(searchQuery, Conversation.class).getAggregations();
final ParsedNested parentAgg = (ParsedNested) aggregations.asMap().get("parentPath");
final Aggregations childAgg = parentAgg.getAggregations();
final ParsedStringTerms childParsedNested = (ParsedStringTerms) childAgg.asMap().get("innerPath");
// Here you will get unique expected inner fields in key part.
Map<String, Long> agg = childParsedNested.getBuckets().stream().collect(Collectors.toMap(Bucket::getKeyAsString, Bucket::getDocCount));
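For readers not using the Java client, the same key-to-count map can be read from the raw JSON response. This is only a sketch: `bucket_counts` is a hypothetical helper, and `sample_response` is a hand-written example (the tag values are invented) that follows the usual shape of a terms aggregation nested under a nested aggregation.

```python
def bucket_counts(response, nested_name, terms_name):
    """Return {bucket key: doc_count} for a terms aggregation under a nested aggregation."""
    buckets = response["aggregations"][nested_name][terms_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

# Hand-written example response; the tag values are invented for illustration.
sample_response = {
    "aggregations": {
        "parentPath": {
            "doc_count": 5,
            "innerPath": {
                "buckets": [
                    {"key": "billing", "doc_count": 3},
                    {"key": "support", "doc_count": 2},
                ]
            },
        }
    }
}

print(bucket_counts(sample_response, "parentPath", "innerPath"))
# {'billing': 3, 'support': 2}
```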
I use the following body to get that result (I have set the full path to the values):
{
"_source": {
"includes": [ "*" ],
"excludes": [ "users" ]
},
"query": {
"nested": {
"path": "users",
"inner_hits": {
"_source": [
"users.name", "users.surname"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"users.name": "abc"
}
}
]
}
}
}
}
}
Another way also exists:
{
"_source": {
"includes": [ "*" ],
"excludes": [ "users" ]
},
"query": {
"nested": {
"path": "users",
"inner_hits": {
"_source": false,
"docvalue_fields": [
"users.name", "users.surname"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"users.name": "abc"
}
}
]
}
}
}
}
}
See results in inner_hits of the result hits.
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/inner-hits.html#nested-inner-hits-source
