Elasticsearch - Show index-wide count for each returned result based on a given term

Firstly, I apologise if the terminology I use is incorrect; I am learning Elasticsearch day by day and may use the wrong phrases.
After spending several days trying to figure this out, I keep hitting brick walls.
I am trying to get Elasticsearch to provide a document count for each returned result. An example request is below:
{
"suggest": {
"text": "aberdeen",
"city": {
"completion": {
"field": "city_suggest",
"size": "2"
}
},
"street": {
"completion": {
"field": "street_suggest",
"size": "2"
}
}
},
"size": 0,
"aggs": {
"meta": {
"filter": {
"term": {
"city.raw": "aberdeen"
}
},
"aggs": {
"name": {
"terms": {
"field": "city.raw"
}
}
}
}
}
}
The above query returns the following results:
{
"took": 37,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1870535,
"max_score": 0,
"hits": []
},
"aggregations": {
"meta": {
"doc_count": 119196,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Aberdeen",
"doc_count": 119196
}
]
}
}
},
"suggest": {
"city": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Aberdeen",
"score": 100
}
]
}
],
"street": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Davidson House, Aberdeen, AB15",
"score": 80
},
{
"text": "Bruce House, Aberdeen, AB15",
"score": 80
}
]
}
]
}
}
The result I am trying to achieve is an overall document count for each returned result. For example, the returned street address "Davidson House, Aberdeen, AB15" would say how many documents in the index match that address, and this would be repeated for each result. The same would apply to the city, in a similar way to how the aggregated city currently shows the overall count:
{
"key": "Aberdeen",
"doc_count": 119196
}
Here is an example of something similar in production
The problem I believe I have with aggregations is that I do not know in advance which values will be returned. If I did, I could predefine them in aggregations as I did with the city, and request the overall count of each result that way.
To illustrate, this is how I pictured a working result:
"suggest": {
"city": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Aberdeen",
"score": 100,
"total_addresses": 196152
}
]
}
],
"street": [
{
"text": "Aberdeen",
"offset": 0,
"length": 8,
"options": [
{
"text": "Davidson House, Aberdeen, AB15",
"score": 80,
"total_addresses": 158
},
{
"text": "Bruce House, Aberdeen, AB15",
"score": 80,
"total_addresses": 30
}
]
}
]
}
In terms of the Elasticsearch version, I have two dev servers running Elasticsearch 2.3 and 5.5 to see whether the newer version would make a difference. Unfortunately it did not, so I have been using 2.3 in favour of 5.5.
Any help or advice would be greatly appreciated. Thanks, all.

You need to divide your query in two. First use the suggest API to gather suggestions, then run the aggregation on the result. The drawback of this solution is that you pair a very fast suggestion (less than a millisecond, if you're lucky) with a longer-running aggregation. If that's OK for you, this might be a good approach.
Another idea is to maintain a separate suggestion index with preaggregated data that contains such a count; this index gets recreated regularly in the background.
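The two-step approach could be sketched roughly as follows. This is only an illustration in Python: it takes a suggest response shaped like the one in the question and builds a second, size-0 request that uses a filters aggregation with one named bucket per suggested text. The function name, the bucket naming scheme, and the street.raw field are assumptions (the question only shows city.raw), and whether a term filter matches depends on how those fields are actually mapped.

```python
def build_count_request(suggest_response, field_by_suggester):
    """Build a size-0 search body with one filter bucket per suggested text.

    field_by_suggester maps a suggester name (e.g. "city") to the
    not-analyzed field to count on (e.g. "city.raw"). These names are
    assumptions for illustration.
    """
    filters = {}
    for suggester, field in field_by_suggester.items():
        for entry in suggest_response.get("suggest", {}).get(suggester, []):
            for option in entry.get("options", []):
                text = option["text"]
                # Key each bucket as "<suggester>|<text>" so the counts in the
                # second response can be mapped back to the suggestions.
                filters[f"{suggester}|{text}"] = {"term": {field: text}}
    return {
        "size": 0,
        "aggs": {"suggestion_counts": {"filters": {"filters": filters}}},
    }


# Trimmed-down version of the suggest response from the question.
suggest_response = {
    "suggest": {
        "city": [{"text": "Aberdeen", "options": [{"text": "Aberdeen", "score": 100}]}],
        "street": [
            {
                "text": "Aberdeen",
                "options": [
                    {"text": "Davidson House, Aberdeen, AB15", "score": 80},
                    {"text": "Bruce House, Aberdeen, AB15", "score": 80},
                ],
            }
        ],
    }
}

count_request = build_count_request(
    suggest_response, {"city": "city.raw", "street": "street.raw"}
)
```

The doc_count of each named bucket in the second response would then be merged into the matching suggestion option on the client side. If no suggestions come back, the second request should simply be skipped.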

Related

Source to destination Key Field mapping in Elastic Search

I have an Elasticsearch index with source data arriving in the following form:
"_source": {
"email": "smithamber@example.com",
"time": "2022-09-08T13:52:50.347861",
"message": "Pattern thank talk mention. Manage nearly tell beat. Difficult husband feel talk radio however.",
"sIp": "192.168.11.156",
"dIp": "80.254.211.60",
"ts": "2022-09-08T13:52:50"
}
Now I want a way to dynamically map the @timestamp field [destination key] of the ES doc to the time field [source key]. For this I am using:
"runtime_mappings": {
"@timestamp": {
"type": "date",
"format": "yyyyMMdd'T'HHmmss.SSSZ",
"script": {
"source": "if (doc[\"time\"].size() == 0) {return} else {return doc[\"time\"].value;}",
"lang": "painless"
}
}
}
However, this does not work. Is there a better way to map a source key field to a destination key field in Elasticsearch? I am open to a static mapping as well, set once before creating the index for one kind of source data.
I am looking for the correct syntax for mapping my field.
Edited:
When I add the query:
{ "query": {
"range": {
"@timestamp": {
"gte": "now-5d",
"lte": "now"
}
}
}
}
I see no hits.
{
"took": 20,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
However, the same query on the time field returns all the filtered docs.
{
"took": 27,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": 1.0,
"hits": [
{
"_index": "topic-indexer-xxx",
"_id": "c28sIYMB0xJUJru8c47O",
"_score": 1.0,
"_source": {
"email": "albertthompson@example.com",
"time": "2022-09-07T15:25:33.672016",
"message": "Candidate future staff ever former run. Like quality personal specific trouble cell money move. Available majority memory model thing TV wrong. Summer anyone light key.",
"sIp": "192.168.103.75",
"dIp": "191.27.68.163"
}
},
....
}
For the mapping I have also tried dynamic templates, but still get no results when querying the @timestamp field:
{
"dynamic_templates": [
{
"@timestamp": {
"match": "time",
"mapping": {
"type": "date",
"format": "strict_date_optional_time",
"copy_to": "@timestamp"
}
}
}
]
}
With @paulo's response, I just did a little fine-tuning to resolve the issue. The mapping below works once set, and I can then run range queries on the @timestamp field:
{
"runtime": {
"@timestamp": {
"type": "date",
"script": {
"source": "if (doc['time'].size() != 0){ emit(doc['time'].value.toEpochMilli());}",
"lang": "painless"
}
}
},
"properties": {
"@timestamp": {
"type": "date"
}
}
}
TL;DR
I think you got mixed up in your Painless script.
Please find below an example you should be able to reproduce on your side.
The time field is already a date on my side; Elasticsearch detected it automatically.
On another note, runtime fields, while very flexible, may lead to performance issues in the long run.
Maybe you should look into an ingest pipeline.
Solution
POST /73684302/_doc
{
"email": "smithamber@example.com",
"time": "2022-09-08T13:52:50.347861",
"message": "Pattern thank talk mention. Manage nearly tell beat. Difficult husband feel talk radio however.",
"sIp": "192.168.11.156",
"dIp": "80.254.211.60",
"ts": "2022-09-08T13:52:50"
}
POST /73684302/_doc
{
"email": "smithamber@example.com",
"message": "Pattern thank talk mention. Manage nearly tell beat. Difficult husband feel talk radio however.",
"sIp": "192.168.11.156",
"dIp": "80.254.211.60",
"ts": "2022-09-08T13:52:50"
}
GET /73684302/_search
{
"runtime_mappings": {
"@timestamp": {
"type": "date",
"script": {
"source": """
if (doc["time"].size() != 0){
emit(doc["time"].value.toEpochMilli());
}
""",
"lang": "painless"
}
}
},
"_source": false,
"fields": ["@timestamp"]
}
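The ingest pipeline alternative mentioned in the answer could look roughly like the sketch below. It is written in Python and only builds the pipeline body; registering it (for example via PUT _ingest/pipeline/<some-name>) and attaching it to the index is left out. It assumes the date processor with its field, target_field and formats options and a per-processor if condition; the function name and description text are made up for illustration, and whether "ISO8601" parses the microsecond timestamps shown in the question has not been verified here.

```python
def build_timestamp_pipeline(source_field="time", target_field="@timestamp"):
    """Return an ingest pipeline body that parses source_field into target_field.

    Sketch only: the processor options used here are assumptions about the
    date processor, not something confirmed in the thread.
    """
    return {
        "description": f"Copy {source_field} into {target_field} as a date",
        "processors": [
            {
                "date": {
                    "field": source_field,
                    "target_field": target_field,
                    "formats": ["ISO8601"],
                    # Skip documents that have no source field at all,
                    # like the second sample document in the answer.
                    "if": f"ctx.{source_field} != null",
                }
            }
        ],
    }


pipeline_body = build_timestamp_pipeline()
```

Unlike a runtime field, this writes @timestamp at index time, so range queries on it do not pay a per-query script cost; the trade-off is that existing documents would need to be reindexed through the pipeline.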

Elasticsearch Term suggester is not returning correct suggestions when one character is missing (instead of misspelling)

I'm using the Elasticsearch term suggester for spell correction. My index contains a huge list of ads. Each ad has subject and body fields. I've found a problematic example for which the suggester does not return correct suggestions.
I have lots of ads whose subject contains the word "soffa" and also 5 ads whose subject contains the word "sofa". Ideally, when I send "sofa" (wrong spelling) as text to the suggester, it should return "soffa" (correct spelling) as a suggestion, since most ads contain "soffa" and only a few contain "sofa".
Here is my suggester query body:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject",
"suggest_mode": "popular",
"min_word_length": 1
}
}
}
}
When I send the above query, I get the response below:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
}
As you can see in the above response, it returned "soff" but not "soffa", although I have lots of docs whose subject contains "soffa".
I even played with parameters like suggest_mode and string_distance, but still no luck.
I also tried the phrase suggester instead of the term suggester, with the same result. Here is my phrase suggester query:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"phrase": {
"field": "subject",
"size": 10,
"gram_size": 3,
"direct_generator": [
{
"field": "subject.trigram",
"suggest_mode": "always",
"min_word_length":1
}
]
}
}
}
}
I suspect it doesn't work when a character is missing rather than misspelled. In the "soffa" example, one "f" is missing,
while it works fine for misspellings such as "vovlo".
When I send "vovlo", it gives me "volvo".
Any help would be hugely appreciated.
Try changing the "string_distance".
{
"suggest": {
"text": "sof",
"subjectSuggester": {
"term": {
"field": "title",
"min_word_length":2,
"string_distance":"ngram"
}
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#term-suggester
I've found a workaround myself.
I added a shingle filter with max_shingle_size 3 (that is, up to trigrams) and a custom analyzer named trigram that uses it, then added a subfield with that analyzer and ran the suggester query on that subfield (instead of the actual field), and it worked.
Here are the mapping changes:
{
"settings": {
"analysis": {
"filter": {
"shingle": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"trigram": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle"
],
"char_filter": [
"diacritical_marks_filter"
]
}
}
}
},
"mappings": {
"properties": {
"subject": {
"type": "text",
"fields": {
"trigram": {
"type": "text",
"analyzer": "trigram"
}
}
}
}
}
}
And here is my corrected query :
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject.trigram",
"suggest_mode": "popular",
"min_word_length": 1,
"string_distance": "ngram"
}
}
}
}
Note that I'm running the suggester against subject.trigram instead of subject itself.
Here is the result:
{
"suggest": {
"subjectSuggester": [
{
"text": "sofa",
"offset": 0,
"length": 4,
"options": [
{
"text": "soffa",
"score": 0.8,
"freq": 282
},
{
"text": "soffan",
"score": 0.6666666,
"freq": 5
},
{
"text": "som",
"score": 0.625,
"freq": 102
},
{
"text": "sol",
"score": 0.625,
"freq": 82
},
{
"text": "sony",
"score": 0.625,
"freq": 50
}
]
}
]
}
}
As you can see above, soffa appears as the first suggestion.
There is something odd in your term suggester result for the word sofa. Take a look at the text that is being corrected:
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
As you can see, it's sof and not sofa, which means the correction is for sof rather than sofa. So I suspect this issue is related to the analyzer used on this field, especially given that the results contain soff instead of soffa: it is removing the last a.
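One way to sanity-check the "missing character" theory is to compute the edit distance directly. The term suggester's default string distance is documented as being based on Damerau-Levenshtein, under which a missing character and a swapped pair both cost one edit. The sketch below is a plain-Python optimal string alignment distance (a restricted Damerau-Levenshtein variant); it is an illustration only, not Elasticsearch's internal implementation, and the function name is made up.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


missing_char = osa_distance("sofa", "soffa")   # one inserted "f"
swapped_pair = osa_distance("vovlo", "volvo")  # one transposition
stemmed_pair = osa_distance("sof", "soff")     # what the first response compared
```

All three pairs come out at a distance of 1, so a missing character is no harder to correct than the "vovlo" case. That is consistent with the point above that the analyzer, which turned "sofa" into "sof" before the suggester ran, is the more likely culprit.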

Aggregations and filters in Elastic - find the last hits and filter them afterwards

I'm trying to work with Elastic (5.6) and to find a way to retrieve the top documents per some category.
I have an index with the following kind of documents :
{
"@timestamp": "2018-03-22T00:31:00.004+01:00",
"statusInfo": {
"status": "OFFLINE",
"timestamp": 1521675034892
},
"name": "myServiceName",
"id": "xxxx",
"type": "Http",
"key": "key1",
"httpStatusCode": 200
}
What I'm trying to do is retrieve the last document (based on @timestamp) per name (my categories), check whether its statusInfo.status is OFFLINE or UP, and get these results into the hits part of a response so I can put them in a Kibana count dashboard or somewhere else (a REST-based tool I do not control and can't modify myself).
Basically, I want to know how many of my services (name) are OFFLINE (statusInfo.status) in their last update (@timestamp), for monitoring purposes.
I'm stuck at the "get how many of my services" part.
My query so far:
GET actuator/_search
{
"size": 0,
"aggs": {
"name_agg": {
"terms": {
"field": "name.raw",
"size": 1000
},
"aggs": {
"last_document": {
"top_hits": {
"_source": ["@timestamp", "name", "statusInfo.status"],
"size": 1,
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
]
}
}
}
}
},
"post_filter": {
"bool": {
"must_not": {
"term": {
"statusInfo.status.raw": "UP"
}
}
}
}
}
This provides the following response:
{
"all_the_meta":{...},
"hits": {
"total": 1234,
"max_score": 0,
"hits": []
},
"aggregations": {
"name_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "myCategory1",
"doc_count": 225,
"last_document": {
"hits": {
"total": 225,
"max_score": null,
"hits": [
{
"_index": "myIndex",
"_type": "Http",
"_id": "dummy id",
"_score": null,
"_source": {
"@timestamp": "2018-04-06T00:06:00.005+02:00",
"statusInfo": {
"status": "UP"
},
"name": "myCategory1"
},
"sort": [
1522965960005
]
}
]
}
}
},
{other_buckets...}
]
}
}
}
Removing the size makes the result contain ALL of the documents, which is not what I need; I only need the content of each bucket (each bucket contains one document).
Removing the post filter does not appear to change much.
I think this would be feasible in Oracle SQL with an OVER (PARTITION BY ...) clause, followed by a condition.
Does somebody know how this could be achieved?
If I understand you correctly, you are looking for the latest doc that has a status of OFFLINE in each group (grouped by name)? In that case you can try the query below; the number of items in the buckets should give you the "how many are down" figure (for UP, change the term in the filter).
NOTE: this was done on the latest version, so it uses the keyword subfield instead of raw.
POST /index/_search
{
"size": 0,
"query":{
"bool":{
"filter":{
"term": {"statusInfo.status.keyword": "OFFLINE"}
}
}
},
"aggs":{
"services_agg":{
"terms":{
"field": "name.keyword"
},
"aggs":{
"latest_doc":{
"top_hits": {
"sort": [
{
"@timestamp":{
"order": "desc"
}
}
],
"size": 1,
"_source": ["@timestamp", "name", "statusInfo.status"]
}
}
}
}
}
}
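One caveat worth keeping in mind: filtering on OFFLINE before aggregating returns the latest OFFLINE document per service, which is not necessarily that service's latest document overall. Also, post_filter only affects the returned hits, not the aggregations, which would explain why removing it in the question's query changed little. If the count really has to be based on each service's last update, one possible route is to keep the question's unfiltered terms plus top_hits query and count the statuses on the client side. The Python sketch below assumes a response shaped like the one in the question; the function name is hypothetical, and this does not help with tools that can only read the hits section.

```python
def count_by_latest_status(response, agg_name="name_agg", hit_name="last_document"):
    """Count services per status, using only the latest document of each service.

    Expects the terms + top_hits (size 1, sorted by timestamp desc) response
    shape shown in the question.
    """
    counts = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hit_name]["hits"]["hits"]
        if not hits:
            continue  # empty bucket, nothing to count
        status = hits[0]["_source"]["statusInfo"]["status"]
        counts[status] = counts.get(status, 0) + 1
    return counts


# Minimal response with two services, trimmed to the fields that matter.
sample_response = {
    "aggregations": {
        "name_agg": {
            "buckets": [
                {
                    "key": "myCategory1",
                    "last_document": {
                        "hits": {"hits": [{"_source": {"statusInfo": {"status": "UP"}}}]}
                    },
                },
                {
                    "key": "myCategory2",
                    "last_document": {
                        "hits": {"hits": [{"_source": {"statusInfo": {"status": "OFFLINE"}}}]}
                    },
                },
            ]
        }
    }
}

status_counts = count_by_latest_status(sample_response)
```

The second bucket (myCategory2) is invented for the example. The number of services currently down would be status_counts.get("OFFLINE", 0).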

Elasticsearch term suggester does not return results on exact match

When I request the suggester with
{
"my-title-suggestions-1": {
"text": "tücher ",
"term": {
"field": "name"
}
},
"my-title-suggestions-2": {
"text": "tüchers ",
"term": {
"field": "name"
}
}
}
it returns
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-title-suggestions-1": [
{
"text": "tücher",
"offset": 0,
"length": 6,
"options": []
}
],
"my-title-suggestions-2": [
{
"text": "tüchers",
"offset": 0,
"length": 7,
"options": [
{
"text": "tücher",
"score": 0.8333333,
"freq": 6
}
]
}
]
}
I wonder why it does not return the exact match with the first suggester?
The second suggester obviously has that result.
Can I add other options that will change this behavior?
Edit:
The minimal mapping is just this:
{
"name" : {
"analyzer" : "standard",
"type" : "string"
}
}
To add to what @ChintanShah25 said: according to https://www.elastic.co/guide/en/elasticsearch/reference/2.0/search-suggesters-term.html (see suggest_mode), the term suggester will by default:
Only provide suggestions for suggest text terms that are not in the index.
I don't think you can do that, and I am not sure why you want an exact match in suggestions; after all, they are "suggestions".
Normally they are used to catch misspellings. The suggester gives you candidate suggestions that are similar to, and within an edit distance of 2 of, the word you entered.
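The default quoted above corresponds to suggest_mode "missing"; the other documented values are "popular" and "always". As a rough sketch, the request from the question could be built with the mode set explicitly, as in the Python below. The helper name is made up, and it is an assumption that a different mode is what the asker needs: "always" asks for suggestions even when the term is already in the index, but nothing in this thread confirms that the exact term itself is echoed back as an option.

```python
ALLOWED_SUGGEST_MODES = {"missing", "popular", "always"}


def build_term_suggest(name_to_text, field="name", suggest_mode="always"):
    """Build a term suggester body in the same shape as the question's request.

    name_to_text maps each suggestion name to the text to check.
    """
    if suggest_mode not in ALLOWED_SUGGEST_MODES:
        raise ValueError(f"unknown suggest_mode: {suggest_mode!r}")
    return {
        name: {
            "text": text,
            "term": {"field": field, "suggest_mode": suggest_mode},
        }
        for name, text in name_to_text.items()
    }


suggest_body = build_term_suggest(
    {"my-title-suggestions-1": "tücher", "my-title-suggestions-2": "tüchers"}
)
```

If an exact hit has to be shown alongside the suggestions, a separate match or term query on the same field is probably the more direct tool.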

ElasticSearch: Attempting to get spelling suggestion on proper names

Before I begin, let me just say that I'm no ElasticSearch expert, but I am currently tasked with tweaking some analyzers to get spelling suggestions working better in a couple of different situations. I've seen examples of people who are doing spelling suggestions on proper names, so I know it must be possible, but I've been at this for a couple days now, and I must be missing something, because ElasticSearch doesn't seem to recognize the name I'm looking for. Can you please help me figure this out? Thanks in advance!
Here's the analyzer I'm using for indexing as well as search:
"full_text": {
"filter": [
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
This should demonstrate that the field is tokenizing into one long keyword, which I want.
{
"query": {
"match": {
"_all": "combine 5"
}
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "my_field"
}
}
}
}
...and it outputs something like this, which shows how the field is being tokenized. Looks good:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 75,
"max_score": 0.58574116,
"hits": [
{
"_index": "my_index",
"_type": "thing",
"_id": "1",
"_score": 0.58574116,
"fields": {
"terms": [
[
"combine 5"
]
]
}
}
]
}
}
... but when I do a suggest query, it doesn't suggest the field, even though it's just off by a space.
{
"query": {
"match": {
"_all": "combine 5"
}
},
"suggest": {
"suggest-0": {
"term": {
"field": "_all",
"size": 5
},
"text": "combine5"
}
}
}
Which returns a bunch of documents and this suggestion:
"suggest": {
"suggest-0": [
{
"text": "combine5",
"offset": 0,
"length": 8,
"options": [
{
"text": "combined",
"score": 0.875,
"freq": 15
},
{
"text": "combine",
"score": 0.85714287,
"freq": 17
}
]
}
]
}
Note that if I change the spelling suggestion to work just on the field that contains the text, it does suggest it, but not when I'm using _all. Is there a way to get the words in a specific field to be suggested when suggesting against _all?
I'm not sure this qualifies as exactly the answer I was looking for, but I ended up solving this by adding a field on the document containing the keyword value I was looking for ("combine5"). It is now registered as a word, so if I suggest on that field, or on _all, the word is suggested. It's also found in queries against _all.
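The workaround could be applied at indexing time with something like the Python sketch below, which adds a whitespace-free copy of the name to each document before it is sent to Elasticsearch. The field name my_field_compact and the helper name are hypothetical; the thread does not say what the extra field was called or how it was populated.

```python
def with_compact_variant(doc, source_field="my_field", target_field="my_field_compact"):
    """Return a copy of doc with an extra field holding the value without whitespace.

    For example "combine 5" also gets indexed as "combine5", so the compact
    spelling exists as a term that the suggester can return.
    """
    enriched = dict(doc)
    value = doc.get(source_field)
    if isinstance(value, str):
        # str.split() with no arguments splits on any run of whitespace.
        enriched[target_field] = "".join(value.split())
    return enriched


indexed_doc = with_compact_variant({"my_field": "combine 5"})
```

Documents without the source field are returned unchanged, so the helper can be applied to every document in a bulk load.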
