Elasticsearch highlighted terms vary

I'm running an Elasticsearch wildcard query with highlighting and wondering why there are extra words highlighted in the results.
A search for *exampleweb* shows that the highlighted terms vary (exampleweb.com, beta.exampleweb.com, etc.) when I want only exampleweb to be highlighted.
The names field is defined as text in the mapping, if that matters.
URL
http://localhost:9200/wm/_search?filter_path=hits.hits.highlight
Request Body
{
"query":{
"wildcard":{
"names":{
"value":"*exampleweb*"
}
}
},
"highlight":{
"fields":{
"names":{}
}
}
}
Response
{
"hits": {
"hits": [
{
"highlight": {
"names": [
"325-<em>beta.exampleweb.com</em>"
]
}
},
{
"highlight": {
"names": [
"325.<em>exampleweb.com</em>"
]
}
},
{
"highlight": {
"names": [
"a2-gt-api-<em>preprod.fr.aws.exampleweb.com</em>"
]
}
}
]
}
}

By default, the standard analyzer is used on text fields. The token generated for beta.exampleweb.com will be beta.exampleweb.com, since the standard tokenizer does not split on the dots inside the hostname. When you run a wildcard query on names, the terms matching the wildcard pattern (*exampleweb*), i.e. beta.exampleweb.com, are returned, and the highlighter wraps the entire matching term.
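You can see the tokens with the _analyze API (a quick check using one of the values from the response above):
POST /_analyze
{
  "analyzer": "standard",
  "text": "325-beta.exampleweb.com"
}
This should return the tokens 325 and beta.exampleweb.com, which is why the whole hostname ends up inside the <em> tags.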
To highlight just exampleweb in the names field, you need to use a pattern tokenizer, which will split the text into tokens whenever a . (or a space, in the example below) is encountered.
Adding a working example
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "\\.| "
}
}
}
},
"mappings": {
"properties": {
"names": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
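Before indexing, you can confirm that my_analyzer splits on dots and spaces (wm is the index name from the question; any index created with the settings above works):
POST /wm/_analyze
{
  "analyzer": "my_analyzer",
  "text": "beta.exampleweb.com"
}
Expected tokens: beta, exampleweb, and com.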
Index Data:
{
"names" : "a2-gt-api-preprod.fr.aws.exampleweb.com"
}
{
"names" : "beta.exampleweb.com"
}
{
"names" : "325.exampleweb.com"
}
Search Query:
{
"query": {
"match": {
"names": "exampleweb"
}
},
"highlight": {
"fields": {
"names": {}
}
}
}
Search Result:
{
"hits": {
"hits": [
{
"highlight": {
"names": [
"beta.<em>exampleweb</em>.com"
]
}
},
{
"highlight": {
"names": [
"325.<em>exampleweb</em>.com"
]
}
},
{
"highlight": {
"names": [
"a2-gt-api-preprod.fr.aws.<em>exampleweb</em>.com"
]
}
}
]
}
}
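If you still need the substring semantics of the original wildcard query, it works against the new analyzer as well, and the highlighter now marks only the matching token (a sketch against the same index as above):
{
  "query": {
    "wildcard": {
      "names": {
        "value": "*exampleweb*"
      }
    }
  },
  "highlight": {
    "fields": {
      "names": {}
    }
  }
}
The fragments should come back as, e.g., beta.<em>exampleweb</em>.com.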

Related

How to query by number and disregard special characters

Currently I have a document in my OpenSearch database with the value 1301-003.023.
If I run the following query, the document is returned:
GET test/_search
{
"query": {
"match": {
"my_number": "1301-003.023"
}
}
}
The main problem is if the user runs this query:
GET test/_search
{
"query": {
"match": {
"my_number": "1301003.023"
}
}
}
In the query above the symbol - is missing, and the search returns nothing. I need a search that can deal with this, but without returning documents that don't have exactly the same number. So if I search for 1301003023, I want to find the document with 1301-003.023, but not documents with 1301-003.032 (note that the last two digits are swapped).
I created a new analyzer using a char filter that maps the symbols "." and "-" to empty strings, so the number "1301-003.023" becomes the token "1301003023".
Full example:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_filter"
]
}
},
"char_filter": {
"my_filter": {
"type": "mapping",
"mappings": [
". => ",
"- => "
]
}
}
}
},
"mappings": {
"properties": {
"my_number": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Document
POST test/_bulk
{"index":{}}
{"my_number": "1301-003.023"}
Query
GET test/_search
{
"query": {
"match": {
"my_number": {
"query": "1301003023"
}
}
}
}
Results
"hits": [
{
"_index": "test",
"_id": "MC7v0IUBKJKciEqCrBP-",
"_score": 0.2876821,
"_source": {
"my_number": "1301-003.023"
}
}
]
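To double-check the analyzer, _analyze shows the symbols being stripped before tokenization:
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "1301-003.023"
}
This should return the single token 1301003023, which is exactly what the match query sees at search time.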

Search an array of strings by partial match in Elasticsearch

I have a field like this:
names: ["Red:123", "Blue:45", "Green:56"]
Its mapping is:
"names": {
"type": "keyword"
},
How can I search like this:
{
"query": {
"match": {
"names": "red"
}
}
}
to get all the documents where red appears in an element of the names array?
Right now it only works with:
{
"query": {
"match": {
"names": "red:123"
}
}
}
You can add multi fields, or just change the type to text, to achieve the required result.
Index Mapping using multi fields
{
"mappings": {
"properties": {
"names": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
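With the multi-field mapping, analyzed (partial) matches run against names, while the names.raw sub-field still supports exact matches, e.g. with a term query (a sketch):
{
  "query": {
    "term": {
      "names.raw": "Red:123"
    }
  }
}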
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"mappings":{
"properties":{
"names":{
"type":"text"
}
}
}
}
Index Data:
{
"names": [
"Red:123",
"Blue:45",
"Green:56"
]
}
Search Query:
{
"query": {
"match": {
"names": "red"
}
}
}
Search Result:
"hits": [
{
"_index": "64665127",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"names": [
"Red:123",
"Blue:45",
"Green:56"
]
}
}
]

Distinct values from array-field matching filter in Elasticsearch 2.4

In short: I want to look up distinct values in some field of the document, BUT only values matching some filter. The problem is with array fields.
Imagine there are following documents in ES 2.4:
[
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
I'd like my users to be able to lookup all possible states via typeahead, so I have the following query for the "wa" user request:
{
"query": {
"wildcard": {
"states.raw": "*wa*"
}
},
"aggregations": {
"typed": {
"terms": {
"field": "states.raw"
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
states.raw is a sub-field with not_analyzed option
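(The question doesn't show the mapping, but something along these lines is implied, in ES 2.4 syntax:)
"states": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}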
This query works pretty well unless a document has an array of values, as in the example: it returns both Washington and California. I understand why this happens (the query and the aggregation both operate on the whole document, and the document contains both values even though only one matched the filter), but I really only want to see Washington, and I don't want to add another layer of filtering on the application side.
Is there a way to do so via single ES 2.4 request?
You could use the "Filtering Values" feature of the terms aggregation (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2).
So your request could look like the following (note that the include value is used as part of a regular expression, so the "wa" string coming from the user needs careful escaping):
POST /index/collection/_search?size=0
{
"aggregations": {
"typed": {
"terms": {
"field": "states.raw",
"include": ".*wa.*" // You need to carefully quote the "wa" string because it'll be used as part of RegExp
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
I can't hold myself back, though, from telling you that a wildcard query with a leading wildcard is not the best solution. Please, please consider using ngrams for this:
PUT states
{
"settings": {
"analysis": {
"filter": {
"ngrams": {
"type": "nGram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"ngrams"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"states": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"ngrams": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
}
}
POST states/doc/1
{
"text":"bla1",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
}
POST states/doc/2
{
"text":"bla2",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
}
]
}
POST states/doc/3
{
"text":"bla3",
"location": [
{
"states": [
"California (US-CA)"
]
},
{
"states": [
"Illinois (US-IL)"
]
}
]
}
And the final query:
GET states/_search
{
"query": {
"term": {
"location.states.ngrams": {
"value": "sh"
}
}
},
"aggregations": {
"filtering_states": {
"terms": {
"field": "location.states.raw",
"include": ".*sh.*"
},
"aggs": {
"typed_hits": {
"top_hits": {
"_source": {
"includes": [
"location.states"
]
}
}
}
}
}
}
}
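To see which ngrams are generated (and therefore which term values will match), the _analyze API helps, shown here in its 2.x query-string form:
GET states/_analyze?analyzer=ngram_analyzer&text=Washington
Among the returned tokens you should find sh, the value the term query above matches on.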

Searching in all fields, case insensitive, and not analyzed

In Elasticsearch, how can I define a dynamic default mapping for any field (the fields are not predefined) that is searchable with spaces and case-insensitive values?
For example, if I have two documents:
PUT myindex/mytype/1
{
"transaction": "test"
}
and
PUT myindex/mytype/2
{
"transaction": "test SPACE"
}
I'd like to perform the following queries:
Querying: "test", Expected result: "test"
Querying: "test space", Expected result "test SPACE"
I've tried to use:
PUT myindex
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings":{
"test":{
"properties":{
"title":{
"analyzer":"analyzer_keyword",
"type":"string"
}
}
}
}
}
But it gives me both documents as results when searching for "test".
Apparently there was a mistake in how I ran my query.
Here's a solution I found to this problem, using a multi-field query:
#any field mapping - not analyzed and case insensitive
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_keyword": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
},
"mappings": {
"doc": {
"dynamic_templates": [
{ "notanalyzed": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer":"analyzer_keyword"
}
}
}
]
}
}
}
#index test data
POST /test_index/doc/_bulk
{"index":{"_id":3}}
{"name":"Company Solutions", "a" : "a1"}
{"index":{"_id":4}}
{"name":"Company", "a" : "a2"}
#search for documents with name "company" and a "a2"
POST /test_index/doc/_search
{
"query" : {
"filtered" : {
"filter": {
"and": {
"filters": [
{
"query": {
"match": {
"name": "company"
}
}
},
{
"query": {
"match": {
"a": "a2"
}
}
}
]
}
}
}
}
}
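As a side note, the filtered query and the and filter used above were removed in Elasticsearch 5.x; on newer versions the equivalent search would use a bool filter (a sketch against the same index and data; drop the doc type segment on 7.x+):
POST /test_index/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "name": "company" } },
        { "match": { "a": "a2" } }
      ]
    }
  }
}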

Elasticsearch index search for currency $ and £ signs

In some of my documents I have $ or £ symbols. I want to search for £ and retrieve documents containing that symbol. I've gone through the documentation but I'm getting some cognitive dissonance.
# Delete the `my_index` index
DELETE /my_index
# Create a custom analyzer
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [
"&=> and ",
"$=> dollar "
]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"html_strip",
"&_to_and"
],
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
}
}
This returns "the", "quick", "and", "brown", "fox" just as the documentation states:
# Test out the new analyzer
GET /my_index/_analyze?analyzer=my_analyzer&text=The%20quick%20%26%20brown%20fox
This returns "the", "quick", "dollar", "brown", "fox"
GET /my_index/_analyze?analyzer=my_analyzer&text=The%20quick%20%24%20brown%20fox
Adding some records:
PUT /my_index/test/1
{
"title": "The quick & fast fox"
}
PUT /my_index/test/2
{
"title": "The daft fox owes me $100"
}
I would have thought that if I searched for "dollar", I would get a result. Instead I get no results:
GET /my_index/test/_search
{ "query": {
"simple_query_string": {
"query": "dollar"
}
}
}
Or even using '$' with an analyzer:
GET /my_index/test/_search
{ "query": {
"query_string": {
"query": "dollar10",
"analyzer": "my_analyzer"
}
}
}
Your problem is that you define a custom analyzer but never use it. You can verify this with term vectors. Follow these steps:
When creating the index, set the custom analyzer on the title field:
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [
"&=> and ",
"$=> dollar "
]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"html_strip",
"&_to_and"
],
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
}, "mappings" :{
"test" : {
"properties" : {
"title" : {
"type":"string",
"analyzer":"my_analyzer"
}
}
}
}
}
Insert data:
PUT my_index/test/1
{
"title": "The daft fox owes me $100"
}
Check for term vectors:
GET /my_index/test/1/_termvectors?fields=title
Response:
{
"_index":"my_index",
"_type":"test",
"_id":"1",
"_version":1,
"found":true,
"took":3,
"term_vectors":{
"title":{
"field_statistics":{
"sum_doc_freq":6,
"doc_count":1,
"sum_ttf":6
},
"terms":{
"daft":{
"term_freq":1,
"tokens":[
{
"position":1,
"start_offset":4,
"end_offset":8
}
]
},
"dollar100":{ <-- You can see it here
"term_freq":1,
"tokens":[
{
"position":5,
"start_offset":21,
"end_offset":25
}
]
},
"fox":{
"term_freq":1,
"tokens":[
{
"position":2,
"start_offset":9,
"end_offset":12
}
]
},
"me":{
"term_freq":1,
"tokens":[
{
"position":4,
"start_offset":18,
"end_offset":20
}
]
},
"owes":{
"term_freq":1,
"tokens":[
{
"position":3,
"start_offset":13,
"end_offset":17
}
]
},
"the":{
"term_freq":1,
"tokens":[
{
"position":0,
"start_offset":0,
"end_offset":3
}
]
}
}
}
}
}
Now search:
GET /my_index/test/_search
{
"query": {
"match": {
"title": "dollar100"
}
}
}
That will find the match. But searching with a query string, as in:
GET /my_index/test/_search
{ "query": {
"simple_query_string": {
"query": "dollar100"
}
}
}
won't find anything, because it searches the special _all field. _all copies in the original field values and analyzes them with its own default analyzer (the standard analyzer), not your custom one:
GET /my_index/test/_search
{
"query": {
"match": {
"_all": "dollar100"
}
}
}
does not find a result. But:
GET /my_index/test/_search
{
"query": {
"match": {
"_all": "$100"
}
}
}
finds it. I am not sure, but the reason may be that the default analyzer (which _all uses) is not the custom analyzer. To set a custom analyzer as the default, check:
Changing the default analyzer in ElasticSearch or LogStash
http://elasticsearch-users.115913.n3.nabble.com/How-we-can-change-Elasticsearch-default-analyzer-td4040411.html
http://grokbase.com/t/gg/elasticsearch/148kwsxzee/overriding-built-in-analyzer-and-set-it-as-default
http://elasticsearch-users.115913.n3.nabble.com/How-to-set-the-default-analyzer-td3935275.html
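For reference, the usual way to make a custom analyzer the index-wide default is to register it under the name default in the index settings when the index is created (a sketch reusing the char filter from above; the index must be recreated for this to take effect):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and ",
            "$=> dollar "
          ]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}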
