Elasticsearch query response influenced by _id

I created an index with the following mappings and settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_index": {
          "type": "custom",
          "tokenizer": "filename",
          "filter": ["icu_folding", "edge_ngram"]
        },
        "default_search": {
          "type": "standard",
          "tokenizer": "filename",
          "filter": ["icu_folding"]
        }
      },
      "tokenizer": {
        "filename": {
          "pattern": "[^\\p{L}\\d]+",
          "type": "pattern"
        }
      },
      "filter": {
        "edge_ngram": {
          "side": "front",
          "max_gram": 20,
          "min_gram": 3,
          "type": "edgeNGram"
        }
      }
    }
  },
  "mappings": {
    "metadata": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "case_insensitive_index"
        }
      }
    }
  }
}
I have the following documents:
{"title":"P-20150531-27332_News.jpg"}
{"title":"P-20150531-27341_News.jpg"}
{"title":"P-20150531-27512_News.jpg"}
{"title":"P-20150531-27343_News.jpg"}
Creating the documents with simple numerical IDs
111
112
113
114
and querying with the following query
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "title": {
        "query": "P-20150531-27332_News.jpg",
        "type": "boolean",
        "fuzziness": "AUTO"
      }
    }
  }
}
results in the correct scoring and ordering of the documents returned:
P-20150531-27332_News.jpg -> 2.780985
P-20150531-27341_News.jpg -> 0.8262239
P-20150531-27512_News.jpg -> 0.8120311
P-20150531-27343_News.jpg -> 0.7687101
Strangely, creating the same documents with UUIDs
557eec2e3b00002c03de96bd
557eec0f3b00001b03de96b8
557eec0c3b00001b03de96b7
557eec123b00003a03de96ba
as IDs results in different scores for the documents:
P-20150531-27341_News.jpg -> 2.646321
P-20150531-27332_News.jpg -> 2.1998127
P-20150531-27512_News.jpg -> 1.7725387
P-20150531-27343_News.jpg -> 1.2718291
Is this an intentional behaviour of Elasticsearch? If yes - how can I preserve the correct ordering regardless of the IDs used?

In the query, it looks like you should be using 'default_search' as the analyzer for the match query, unless you actually intended to use edge-ngram on the search query too.
Example:
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "title": {
        "query": "P-20150531-27332_News.jpg",
        "type": "boolean",
        "fuzziness": "AUTO",
        "analyzer": "default_search"
      }
    }
  }
}
default_search would be applied as the default search analyzer only if there were no explicit search_analyzer or analyzer specified in the mapping of the field.
The article here gives a good explanation of the rules by which analyzers are applied.
Also, to ensure IDF takes documents across all shards into account, you can use search_type=dfs_query_then_fetch. This is also why the IDs influence the scores: documents are routed to shards based on _id, so different IDs distribute the documents across shards differently, and term statistics are computed per shard by default.
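For example, a minimal sketch combining both suggestions (assuming the index from the question is named myindex):
POST /myindex/_search?search_type=dfs_query_then_fetch
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "title": {
        "query": "P-20150531-27332_News.jpg",
        "type": "boolean",
        "fuzziness": "AUTO",
        "analyzer": "default_search"
      }
    }
  }
}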

Related

ELK bool query with match and prefix

I'm new to ELK. I have a problem with the following search query:
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X GET "https://localhost:9200/commsrch/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"should" : [
{"match" : {"cn" : "franc"}},
{"prefix" : {"srt" : "99889300200"}}
]
}
}
}
'
I need to find all documents that satisfy the condition: field "cn" contains "franc" OR field "srt" starts with "99889300200".
Index mapping:
{
  "commsrch": {
    "mappings": {
      "properties": {
        "addr": {
          "type": "text",
          "index": false
        },
        "cn": {
          "type": "text",
          "analyzer": "compname"
        },
        "srn": {
          "type": "text",
          "analyzer": "srnsrt"
        },
        "srt": {
          "type": "text",
          "analyzer": "srnsrt"
        }
      }
    }
  }
}
Index settings:
{
  "commsrch": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "commsrch",
        "creation_date": "1675079141160",
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "4"
            }
          },
          "analyzer": {
            "compname": {
              "filter": ["lowercase", "stop", "ngram_filter"],
              "type": "custom",
              "tokenizer": "whitespace"
            },
            "srnsrt": {
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "C15EXHnaTIq88JSYNt7GvA",
        "version": {
          "created": "8060099"
        }
      }
    }
  }
}
The query works properly with just one condition: if the query has only the "match" clause, the results have the proper document count, and the same holds if it has only the "prefix" clause.
With the two conditions "match" and "prefix" together, I see in the results only documents that correspond to the "prefix" condition.
I can't find any limitation in the Elasticsearch docs about mixing "prefix" and "match", but as far as I can see some problem exists. Please help me find where the problem is.
Continuing the experiment, I have one more problem.
Example:
Source data:
1st document cn field: "put stone is done"
2nd document cn field: "job one or two"
Mapping and index settings are the same as described in my first post.
Request:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "cn": "one" } },
        { "prefix": { "cn": "one" } }
      ]
    }
  }
}
As I understand it, the first document got the higher score because it has more repeats of "one". But I need high scores for documents that have at least one word in field "cn" starting with the string "one". I have experimented with this query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "cn": "one" } },
        {
          "constant_score": {
            "filter": {
              "prefix": { "cn": "one" }
            },
            "boost": 100
          }
        }
      ]
    }
  }
}
But it doesn't work properly. What's wrong with my query?

Elasticsearch: How to calculate the yield (percentage of success)?

My purpose is to calculate the yield of each benchId, that is: for each bench, the percentage of teams that have isPassed=true the first time they run the test. I would like a visualization of the yield for each bench.
My Elasticsearch mapping is:
"test-logs" : {
"mappings" : {
"log" : {
"properties" : {
"benchGroup" : {
"type" : "keyword"
},
"benchId" : {
"type" : "keyword"
},
"date" : {
"type" : "date",
"format" : "yyyy/MM/dd HH:mm:ss"
},
"duration" : {
"type" : "float"
},
"finalStatus" : {
"type" : "keyword"
},
"isCss" : {
"type" : "boolean"
},
"isPassed" : {
"type" : "boolean"
},
"machine" : {
"type" : "keyword"
},
"sha1" : {
"type" : "keyword"
},
"uuid" : {
"type" : "keyword"
},
"team" : {
"type" : "keyword"
}
I tried to divide this issue into several sub-issues. I think I need to aggregate the documents by benchId, then sub-aggregate them by team, ordering by date and taking the first document. Then I think I need a script to calculate isPassed=true divided by all first attempts.
No idea how to visualize the result on Kibana though.
I managed to create aggregations with this search:
GET _search
{
  "size": 0,
  "aggs": {
    "benchId": {
      "terms": {
        "field": "benchId"
      },
      "aggs": {
        "teams": {
          "terms": {
            "script": "doc['uut'].join(' & ')",
            "size": 10
          }
        }
      }
    }
  }
}
I get the result I want, but I have difficulty including ordering by date ascending, limited to one document per uut.
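Something like the following top_hits sub-aggregation might be what's needed (a sketch; it assumes the team value lives in the team field from the mapping rather than the uut script above, and it sorts each bucket by date ascending and keeps only the first hit):
GET test-logs/_search
{
  "size": 0,
  "aggs": {
    "benchId": {
      "terms": { "field": "benchId" },
      "aggs": {
        "teams": {
          "terms": { "field": "team", "size": 10 },
          "aggs": {
            "first_attempt": {
              "top_hits": {
                "size": 1,
                "sort": [{ "date": { "order": "asc" } }],
                "_source": ["isPassed", "date"]
              }
            }
          }
        }
      }
    }
  }
}
Computing the percentage of isPassed=true among those first attempts would still need a script or a client-side step, as described above.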

Elasticsearch on object nested under objects array

Assuming I have the following index structure:
{
  "title": "Early snow this year",
  "body": "After a year with hardly any snow, this is going to be a serious winter",
  "source": [
    {
      "name": "CNN",
      "details": {
        "site": "cnn.com"
      }
    },
    {
      "name": "BBC",
      "details": {
        "site": "bbc.com"
      }
    }
  ]
}
and I have a bool query to try to retrieve this document:
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "snow",
          "fields": ["title", "body"]
        }
      },
      "filter": {
        "bool": {
          "must": [
            { "term": { "source.name": "bbc" } },
            { "term": { "source.details.site": "BBC.COM" } }
          ]
        }
      }
    }
  }
}
But it is not working; I get zero hits. How should I modify my query? It only works if I remove the { "term" : {"source.details.site" : "BBC.COM"}} clause.
Here is the mapping:
{
  "news": {
    "mappings": {
      "article": {
        "properties": {
          "body": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "source": {
            "properties": {
              "details": {
                "properties": {
                  "site": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
You are doing a term query on "source.details.site". A term query means the value you provide will not be analysed at query time. If you are using the default mapping, then source.details.site is lowercased at index time. So when you query it with term and "BBC.COM", the value "BBC.COM" is not analysed, and ES tries to match "BBC.COM" against "bbc.com" (lowercased at index time), which fails.
You can use match instead of term to get the value analysed. But it's better to use a term query on your keyword field if you know in advance the exact value that was indexed. Term queries have the advantage of caching on the ES side and are faster than match queries.
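For example, against the mapping above, either of these variants of the failing clause should match (a sketch; the keyword subfield stores the original, already-lowercase value from the document):
{ "match": { "source.details.site": "BBC.COM" } }
{ "term": { "source.details.site.keyword": "bbc.com" } }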
You should clean your data at index time, as you write once and read always. So anything like "/" or "http" should be removed if you don't lose the semantics. You can achieve this in your code while indexing, or you can create custom analysers in your mapping. But remember that custom analysers don't work on keyword fields, so if you try to do this on the ES side you won't be able to aggregate on that field without enabling fielddata, which should be avoided. There is experimental support for normalisers in the latest update, but as it is experimental, don't use it in production. So in my opinion you should clean the data in your code.
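If you do want to try the ES side anyway, here is a sketch of a custom analyzer (all names illustrative) that strips a URL scheme and trailing slashes with a pattern_replace char filter, then lowercases the whole value as a single token; as noted above, this works on a text field, not a keyword field:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_scheme": {
          "type": "pattern_replace",
          "pattern": "^https?://|/+$",
          "replacement": ""
        }
      },
      "analyzer": {
        "site_clean": {
          "type": "custom",
          "char_filter": ["strip_scheme"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}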

Elasticsearch terms aggregate duplicates

I have a field using an ngram analyzer, and I am trying to use a terms aggregation on that field to return unique documents by it. The keys returned in the aggregation don't match the document fields being returned, and I'm getting duplicates.
"analysis" : {
"filter" : {
"autocomplete_filter" : {
"type" : "edge_ngram",
"min_gram" : "1",
"max_gram" : "20"
}
},
"analyzer" : {
"autocomplete" : {
"type" : "custom",
"filter" : [ "lowercase", "autocomplete_filter" ],
"tokenizer" : "standard"
}
}
}
}
"name" : {
"type" : "string",
"analyzer" : "autocomplete",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
{
  "query": {
    "query_string": {
      "query": "bra",
      "fields": ["name"],
      "use_dis_max": true
    }
  },
  "aggs": {
    "group_by_name": {
      "terms": { "field": "name.raw" }
    }
  }
}
I'm getting back the following names and keys.
Braingeyser, Brainstorm, Braingeyser, Brainstorm, Brainstorm, Brainstorm, Bramblecrush, Brainwash, Brainwash, Braingeyser
{"key":"Bog Wraith","doc_count":18}
{"key":"Birds of Paradise","doc_count":15}
{"key":"Circle of Protection: Black","doc_count":15}
{"key":"Lightning Bolt","doc_count":15}
{"key":"Grizzly Bears","doc_count":14}
{"key":"Black Knight","doc_count":13}
{"key":"Bad Moon","doc_count":12}
{"key":"Boomerang","doc_count":12}
{"key":"Wall of Bone","doc_count":12}
{"key":"Balance","doc_count":11}
How can I get elasticsearch to only return unique fields from the aggregate?
To remove duplicates being returned in your aggregate you can try:
"aggs": {
"group_by_name": {
"terms": { "field":"name.raw" },
"aggs": {
"remove_dups": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
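For example, combined with the original query and with "size": 0 so that only the unique aggregation buckets come back rather than the individual (ngram-matched) hits, a full request might look like this sketch:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "bra",
      "fields": ["name"],
      "use_dis_max": true
    }
  },
  "aggs": {
    "group_by_name": {
      "terms": { "field": "name.raw" },
      "aggs": {
        "remove_dups": {
          "top_hits": {
            "size": 1,
            "_source": false
          }
        }
      }
    }
  }
}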

Elasticsearch postings highlighter failing for some search strings

I have a search that works well with most search strings but fails spectacularly on others. From experimenting, it appears to fail when at least one word in the query doesn't match (like this made-up search phrase), with the error:
{
  "error": "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed; shardFailures {[w3zfoix_Qi-xwpVGbCbQWw][ia_test][0]: ElasticsearchIllegalArgumentException[the field [content] should be indexed with positions and offsets in the postings list to be used with postings highlighter]}]",
  "status": 400
}
The simplest search which gives this error is the one below:
POST /myindex/_search
{
  "from": 0,
  "size": 25,
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "type": "most_fields",
          "fields": ["title", "content", "content.english"],
          "query": "Box Fexye"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "postings"
      }
    }
  }
}
My query is more complicated than this, and I need to use the "postings" highlighter to pull out the best matching sentence from a document.
Indexing of the relevant fields looks like:
"properties" : {
"title" : {
"type" : "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
},
"content" : {
"type" : "string",
"analyzer" : "standard",
"fields": {
"english": {
"type": "string",
"analyzer": "my_english"
},
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
},
"index_options" : "offsets",
"term_vector" : "with_positions_offsets"
}
}
