How to extract matching groups when searching regex in Elasticsearch

I am using Elasticsearch to index some data (a text article), which I mapped as
"article": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
Then I use regexp queries to search for matches. The query returns the correct documents, but is there a way to also return the matching groups/text that triggered the regex hit?

You can use the highlighting functionality of Elasticsearch.
Let's say this is your sample document:
{
"article":"Elasticsearch Documentation"
}
Query:
{
"query": {
"regexp": {
"article": "el.*ch"
}
},
"highlight": {
"fields": {
"article": {}
}
}
}
Response:
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "cHzAH4IBgPd6xUeLm9QF",
"_score" : 1.0,
"_source" : {
"article" : "Elasticsearch Documentation"
},
"highlight" : {
"article" : [
"<em>Elasticsearch</em> Documentation"
]
}
}
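The regexp query itself will not return capture groups, but the highlighted fragments can be post-processed client-side to pull out the matched text. A minimal Python sketch (the function name is my own; the fragment is taken from the response above):

```python
import re

def extract_highlighted(fragments):
    """Pull out the text wrapped in <em>...</em> tags from highlight fragments."""
    matches = []
    for fragment in fragments:
        matches.extend(re.findall(r"<em>(.*?)</em>", fragment))
    return matches

# Fragments as found under hits.hits[n].highlight.article in the response above
fragments = ["<em>Elasticsearch</em> Documentation"]
print(extract_highlighted(fragments))  # ['Elasticsearch']
```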

Related

How to Order Completion Suggester with Fuzziness

When using a completion suggester with fuzziness defined, the ordering of suggestion results is alphabetical instead of most relevant. It seems that however many edits the fuzziness allows are simply dropped from the end of the search/query term. This is not what I expected from reading the Completion Suggester Fuzziness docs, which state:
Suggestions that share the longest prefix to the query prefix will be scored higher.
But that is not true. Here is a use case that proves this:
PUT test/
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"title":{
"type":"keyword",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
POST test/_bulk
{ "index" : {"_id": "1"}}
{ "title": "HOLARAT" }
{ "index" : {"_id": "2"}}
{ "title": "HOLBROOK" }
{ "index" : {"_id": "3"}}
{ "title": "HOLCONNEN" }
{ "index" : {"_id": "4"}}
{ "title": "HOLDEN" }
{ "index" : {"_id": "5"}}
{ "title": "HOLLAND" }
The above creates an index and adds some data.
If a suggestion query is done on said data:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"title-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
It returns the first 3 results in alphabetical order, instead of ranking the longest prefix match (which would be HOLLAND) first:
{
...
"suggest" : {
"title-suggestion" : [
{
"text" : "HOLL",
"offset" : 0,
"length" : 4,
"options" : [
{
"text" : "HOLARAT",
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"title" : "HOLARAT"
}
},
{
"text" : "HOLBROOK",
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"title" : "HOLBROOK"
}
},
{
"text" : "HOLCONNEN",
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.0,
"_source" : {
"title" : "HOLCONNEN"
}
}
]
}
]
}
}
If the size param is removed, we can see that the score is the same for all entries, instead of the longest prefix scoring higher as stated.
With this being the case, how can results from Completion Suggesters with Fuzziness defined be ordered with the longest prefix at the top?
This has been reported in the past and this behavior is actually by design.
What I usually do in this case is send two suggest queries (similar to what has been suggested here): one for exact matches and another for fuzzy matches. If the exact query returns a suggestion, I use it; otherwise I fall back to the fuzzy ones.
With the suggest query below, you'll get HOLLAND as exact-suggestion and then the fuzzy matches in fuzzy-suggestion:
POST test/_search
{
"_source": {
"includes": [
"title"
]
},
"suggest": {
"fuzzy-suggestion": {
"completion": {
"fuzzy": {
"fuzziness": "1"
},
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
},
"exact-suggestion": {
"completion": {
"field": "title.suggest",
"size": 3
},
"prefix": "HOLL"
}
}
}
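Picking the exact suggestions first and falling back to the fuzzy ones is then a small client-side step. A hedged Python sketch (the function name is my own; the input dict mirrors the suggest section of the search response):

```python
def pick_suggestions(suggest, exact_key="exact-suggestion", fuzzy_key="fuzzy-suggestion"):
    """Prefer options from the exact suggester; fall back to the fuzzy ones."""
    def options(key):
        return [opt["text"] for entry in suggest.get(key, []) for opt in entry["options"]]
    exact = options(exact_key)
    return exact if exact else options(fuzzy_key)

# Shapes mirror the "suggest" section of an Elasticsearch search response
suggest = {
    "exact-suggestion": [{"text": "HOLL", "options": [{"text": "HOLLAND"}]}],
    "fuzzy-suggestion": [{"text": "HOLL", "options": [{"text": "HOLARAT"}, {"text": "HOLBROOK"}]}],
}
print(pick_suggestions(suggest))  # ['HOLLAND']
```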

Synonyms relevance issue in Elasticsearch

I am trying to configure synonyms in Elasticsearch and have done the sample configuration below, but I am not getting the expected relevance when searching.
Below is the index mapping configuration:
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Below is the sample data I have indexed:
POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }
Below is the query I am trying:
GET test_index/_search
{
"query": {
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
}
}
Current Result:
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.4100728,
"_source" : {
"my_field" : "I had a storm in my brain"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.90928507,
"_source" : {
"my_field" : "This is a brainstorm"
}
}
]
I expect the document that exactly matches the query to be on top, and documents that match via synonyms to come in with a lower score.
So my expectation here is that the document with the value "This is a brainstorm" should come at position one.
Could you please suggest how I can achieve this?
I have tried applying boosting and weights as well, but no luck.
Thanks in advance!
Elasticsearch "replaces" every instance of a synonym with all the other synonyms, and does so at both index and search time (unless you provide a separate search_analyzer), so you lose the exact token. To keep this information, use a subfield with the standard analyzer, then use a multi_match query to match either the synonyms or the exact value, and boost the exact field.
I got an answer from the Elastic forum here. I have copied it below for quick reference.
Hello there,
Since you are indexing synonyms into your inverted index, brain storm and brainstorm all end up as different tokens once the analyzer does its thing. So at query time Elasticsearch uses your analyzer to create the tokens brain, storm and brainstorm from your query and matches multiple tokens in documents 2 and 4; document 2 has fewer words, so tf/idf scores it higher of the two, while document 1 only matches brainstorm.
You can also see what your analyzer does to your input with this:
POST test_index/_analyze
{
"analyzer": "my_search_analyzer",
"text": "I had a storm in my brain"
}
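To see why documents 2 and 4 match at all, it helps to simulate roughly what the synonym filter does to the query tokens. A simplified Python approximation (this is not the real Lucene implementation; the expansion table is hand-built from the my_synonyms filter above, and multi-word synonyms are kept as single strings for brevity):

```python
# Equivalence classes hand-built from the my_synonyms filter above
SYNONYMS = {
    "mind": ["mind", "brain"],
    "brain": ["mind", "brain"],
    "brainstorm": ["brainstorm", "brain storm"],
}

def expand(text):
    """Lowercase, split on whitespace, and expand each token via the synonym table."""
    expanded = []
    for token in text.lower().split():
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded

print(expand("brainstorm"))  # ['brainstorm', 'brain storm']
```

Because the expansion runs at search time too, a query for brainstorm effectively also searches for brain and storm, which is how documents 2 and 4 end up matching.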
I did some experimenting; you should change your index analyzer to my_analyzer:
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"my_synonyms": {
"type": "synonym",
"synonyms": [
"mind, brain",
"brainstorm,brain storm"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonyms"
]
}
}
}
}
},
"mappings": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Then you want to boost your exact matches, but you also want hits from the my_search_analyzer tokens, so I have changed your query a bit:
GET test_index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"my_field": {
"query": "brainstorm",
"analyzer": "my_search_analyzer"
}
}
},
{
"match_phrase": {
"my_field": {
"query": "brainstorm"
}
}
}
]
}
}
}
Result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.3491273,
"hits" : [
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.3491273,
"_source" : {
"my_field" : "This is a brainstorm"
}
},
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8185701,
"_source" : {
"my_field" : "A different brain storm"
}
}
]
}
}

Elasticsearch | Mapping exclude field with bulk API

I am using the bulk API to create an index and store data fields. I also want to set a mapping that excludes a field "field1" from the source. I know this can be done using the create index API (reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html), but I am using the bulk API. Below is a sample API call:
POST _bulk
{ "index" : { "_index" : "test", "_type" : "testType", "_id" : "1" } }
{ "field1" : "value1" }
Is there a way to add mapping settings while bulk indexing similar to below code:
{ "index" : { "_index" : "test", "_type" : "testType", "_id" : "1" },
"mappings": {
"_source": {
"excludes": [
"field1"
]
}
}
}
{ "field1" : "value1" }
How can I define the mapping with the bulk API?
It is not possible to define the mapping for a new index while using the bulk API. You have to create your index beforehand and define the mapping then, or define an index template and use an index name in your bulk request that matches that template's pattern.
The following example code can be executed via the Dev Tools windows in Kibana:
PUT /_index_template/mytemplate
{
"index_patterns": [
"te*"
],
"priority": 1,
"template": {
"mappings": {
"_source": {
"excludes": [
"testexclude"
]
},
"properties": {
"testfield": {
"type": "keyword"
}
}
}
}
}
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "testfield" : "value1", "defaultField" : "asdf", "testexclude": "this shouldn't be in source" }
GET /test/_mapping
You can see from the response that the template was applied to the new test index: testfield has only the keyword type, and the _source excludes setting comes from the template.
{
"test" : {
"mappings" : {
"_source" : {
"excludes" : [
"testexclude"
]
},
"properties" : {
"defaultField" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"testexclude" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"testfield" : {
"type" : "keyword"
}
}
}
}
}
Also, the document is returned without the excluded field:
GET /test/_doc/1
Response:
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"defaultField" : "asdf",
"testfield" : "value1"
}
}
Hope this answers your question and solves your use-case.
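Because the mapping has to exist before the documents arrive, the client-side order is: create the template (or index) first, then build and send the bulk body. A sketch of assembling the NDJSON body in Python (the helper name is my own; actually sending the body, e.g. via curl or a client library, is left out):

```python
import json

def build_bulk_body(index, docs):
    """Serialize (id, source) pairs into the NDJSON format the _bulk endpoint expects."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = build_bulk_body("test", [("1", {"testfield": "value1"})])
print(body)
```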

How to get results weighted by references from ElasticSearch

I have a dataset consisting of Notes referencing other Notes.
{id:"aaa", note: "lorem ipsum", references: ["ccc"]},
{id:"bbb", note: "lorem ipsum", references: ["aaa","ccc"]},
{id:"ccc", note: "lorem ipsum", references: ["bbb"]},
I want elastic search to use the references to weight the results, so in this case if I search for lorem I should get id "ccc" back since it has the most references. According to their docs, their graph solution does exactly this, but I also see examples where they are doing similar things.
But no explanation of how this is mapped to the Index. So my question is: how does one set up an ES index that uses references (indices)?
Other answers gave some clues, but then #7379490 provided the answer in another channel:
There is no way of doing this directly in ES. There are two possible solutions:
pre-calculate the references and pass them into ES by mapping a new value onto each document,
or aggregate, and use the aggregation to sort the response:
{
"query": {
"match": {
"note": "lorem"
}
},
"aggs": {
"references": {
"terms": {
"field": "references.keyword",
"order": { "_count": "desc" }
}
}
}
}
function_score can be used to give a higher score to documents with more values in the references field.
Mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"note": {
"type": "text"
},
"references": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Query (the script multiplies _score by the number of entries in references):
{
"query": {
"function_score": {
"query": {
"match": {
"note": "lorem"
}
},
"functions": [
{
"script_score": {
"script": "_score * doc['references.keyword'].size()"
}
}
]
}
}
}
Result:
"hits" : [
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iJObRHIBXTOhZcHPaYks",
"_score" : 0.035661265,
"_source" : {
"id" : 2,
"note" : "lorem ipsum",
"references" : [
1,
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "h5ObRHIBXTOhZcHPYok2",
"_score" : 0.017830633,
"_source" : {
"id" : 1,
"note" : "lorem ipsum",
"references" : [
3
]
}
},
{
"_index" : "index71",
"_type" : "_doc",
"_id" : "iZObRHIBXTOhZcHPb4k4",
"_score" : 0.017830633,
"_source" : {
"id" : 3,
"note" : "lorem ipsum",
"references" : [
2
]
}
}
]
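A quick client-side simulation shows the effect of the script: the base relevance score is multiplied by the number of entries in the references array. A Python sketch (the base scores are illustrative, not real BM25 output):

```python
def script_score(base_score, references):
    # Mirrors the script_score above: base score times the number of references
    return base_score * len(references)

# Illustrative documents with made-up identical base scores
docs = [
    {"id": 1, "references": [3], "base": 0.018},
    {"id": 2, "references": [1, 3], "base": 0.018},
    {"id": 3, "references": [2], "base": 0.018},
]
ranked = sorted(docs, key=lambda d: script_score(d["base"], d["references"]), reverse=True)
print([d["id"] for d in ranked])  # doc 2, with two references, ranks first
```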
If your data contains referencesGiven and you need to search by referencesReceived, I would recommend a two-pass approach:
insert all documents with an empty referencesReceived: [] field
(or referencesReceivedCount: 0 if that is enough)
then, for each document and each item in its referencesGiven, update the document receiving the reference
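The counting step of that approach (deriving referencesReceived from the referencesGiven lists) can be sketched in Python before any documents are updated (field names follow the suggestion above):

```python
from collections import Counter

def count_references_received(docs):
    """Count how often each document id appears in other documents' references."""
    received = Counter()
    for doc in docs:
        for target in doc.get("references", []):
            received[target] += 1
    return received

docs = [
    {"id": "aaa", "references": ["ccc"]},
    {"id": "bbb", "references": ["aaa", "ccc"]},
    {"id": "ccc", "references": ["bbb"]},
]
counts = count_references_received(docs)
print(counts["ccc"])  # 2 - "ccc" is referenced the most
```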

Is there any way to add a field in the document but hide it from _source, while keeping it analysed and searchable

I want to add one field to the document which should be searchable, but when we do a get/search it should not appear under _source.
I have tried the index and store options, but it is not achievable through them.
It is more like _all or copy_to, but in my case the value is provided by me (not collected from other fields of the document).
I am looking for a mapping through which I can achieve the cases below.
When I put document :
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
and then do a search
GET my_index/_search
the output should be
{
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
}
Also, when I do the below search
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
it should return
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"title" : "Some short title",
"date" : "2015-01-01"
}
}
]
}
Simply use source filtering to exclude the content field:
GET my_index/_search
{
"_source": {
"excludes": [ "content" ]
},
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
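If you ever need the same effect client-side (for instance when you cannot change the query), stripping excluded fields from the returned hits is a few lines. A hedged Python sketch (the function name is my own):

```python
def strip_fields(hits, excludes):
    """Remove the named fields from each hit's _source, mimicking source filtering."""
    for hit in hits:
        for field in excludes:
            hit["_source"].pop(field, None)
    return hits

hits = [{"_id": "1", "_source": {"title": "Some short title",
                                 "date": "2015-01-01",
                                 "content": "A very long content field..."}}]
print(strip_fields(hits, ["content"])[0]["_source"])
```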
We can achieve this using the below mapping:
PUT my_index
{
"mappings": {
"_doc": {
"_source": {
"excludes": [
"content"
]
},
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
}
Add the document:
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
When you run the query to search for content on the field 'content':
GET my_index/_search
{
"query": {
"query_string": {
"default_field": "content",
"query": "long content"
}
}
}
You will get the result with hits as below:
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"date" : "2015-01-01",
"title" : "Some short title"
}
}
]
}
It hides the field 'content'. :)
Hence we achieved it with the help of the mapping: you don't need to exclude the field from the query every time you make a get/search call.
More reading on the _source field:
https://www.elastic.co/guide/en/elasticsearch/reference/6.6/mapping-source-field.html
