Building an effective Elasticsearch query for cross_fields with fuzziness - elasticsearch

I know that Elasticsearch does not support fuzziness with the cross_fields type in a multi_match query. I have a very difficult time with the Elasticsearch API and so I'm finding it challenging to build an analogous query that searches across multiple document fields with fuzzy string matching.
I have an index called papers with various fields such as Title, Author.FirstName, Author.LastName, PublicationDate, Journal etc... I want to be able to query with a string like "John Doe paper title 2015 journal name". cross_fields is the perfect multi_match type but it doesn't support fuzziness which is critical for my application.
Can anyone suggest a reasonable way to approach this? I've spent hours going through solutions on SO and the Elasticsearch forums with little success.

You can make use of the copy_to feature for this scenario. Basically, you copy the values from the different fields into one new field (my_search_field in the details below), and on this field you can run a fuzzy query via the fuzziness parameter using a simple match query.
Below is a sample mapping, document, and query:
Mapping:
PUT my_fuzzy_index
{
"mappings": {
"properties": {
"my_search_field":{ <---- Note this field
"type": "text"
},
"Title":{
"type": "text",
"copy_to": "my_search_field" <---- Note this
},
"Author":{
"type": "nested",
"properties": {
"FirstName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
},
"LastName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
},
"PublicationDate":{
"type": "date",
"copy_to": "my_search_field" <---- Note this
},
"Journal":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
}
}
Sample Document:
POST my_fuzzy_index/_doc/1
{
"Title": "Fountainhead",
"Author":[
{
"FirstName": "Ayn",
"LastName": "Rand"
}
],
"PublicationDate": "2015",
"Journal": "journal"
}
Query Request:
POST my_fuzzy_index/_search
{
"query": {
"match": {
"my_search_field": { <---- Note this field
"query": "Aynnn Ranaad Fountainhead 2015 journal",
"fuzziness": 3 <---- Fuzzy parameter
}
}
}
}
Response:
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.1027813,
"hits" : [
{
"_index" : "my_fuzzy_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.1027813,
"_source" : {
"Title" : "Fountainhead",
"Author" : [
{
"FirstName" : "Ayn",
"LastName" : "Rand"
}
],
"PublicationDate" : "2015",
"Journal" : "journal"
}
}
]
}
}
So instead of trying to apply a fuzzy query across multiple fields, you can go with this approach. That way your query stays much simpler.
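Note that match queries only accept edit distances of 0, 1, 2 or AUTO. If you'd rather not hard-code an edit distance, the same query can be written with AUTO (a sketch against the same index), where the allowed distance scales with the length of each query term:
POST my_fuzzy_index/_search
{
  "query": {
    "match": {
      "my_search_field": {
        "query": "Aynnn Ranaad Fountainhead 2015 journal",
        "fuzziness": "AUTO"
      }
    }
  }
}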
Let me know if this helps!

Related

get shingle result from elasticsearch

I'm already familiar with the shingle analyzer and I am able to create one as follows:
"index": {
"number_of_shards": 10,
"number_of_replicas": 1
},
"analysis": {
"analyzer": {
"shingle_analyzer": {
"filter": [
"standard",
"lowercase"
"filter_shingle"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 2,
"min_shingle_size": 2,
"output_unigrams": false
}
}
}
}
I then use the defined analyzer in the mapping for a field in my document named content. The problem is that the content field is a very long text and I want to use it as data for an autocomplete suggester, so I just need the one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. Since the shingle analyzer makes Elasticsearch index the text as shingles, is there a way to access those shingles?
For instance, the query I pass is:
GET the_index/_search
{
"_source": ["content"],
"explain": true,
"query" : {
"match" : { "content.shngled_field": "news" }
}
}
the result is:
{
"took" : 395,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 7.8647532,
"hits" : [
{
"_shard" : "[v3_kavan_telegram_201911][0]",
"_node" : "L6vHYla-TN6CHo2I6g4M_A",
"_index" : "v3_kavan_telegram_201911",
"_type" : "_doc",
"_id" : "g1music/70733",
"_score" : 7.8647532,
"_source" : {
"content" : "Find the latest breaking news and information on the top stories, weather, business, entertainment, politics, and more."
....
}
As you can see, the result contains the whole content field, which is a very long text. The result I expect is
"content" : "news and information on"
which is the matched shingle itself.
After you've created an index & ingested a doc
PUT sh
{
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"shingled": {
"type": "text",
"analyzer": "shingle_analyzer"
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"shingle_analyzer": {
"type": "standard",
"filter": [
"standard",
"lowercase",
"filter_shingle"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 2,
"min_shingle_size": 2,
"output_unigrams": false
}
}
}
}
}
POST sh/_doc/1
{
"content": "and then I use the defined analyzer in mapping for a field in my document named content.The problem is the content field is a very long text and I want to use it as data for a autocomplete suggester, so I just need one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. By using shingle analyzer the elastic itself indexes the text as shingles, is there a way to access those shingles?"
}
You can either call _analyze with the corresponding analyzer to see how a given text would be tokenized:
GET sh/_analyze
{
"text": "and then I use the defined analyzer in mapping for a field in my document named content.The problem is the content field is a very long text and I want to use it as data for a autocomplete suggester, so I just need one or two words that follow the matched phrase. I wonder if there is a way to get the search (or suggest or analyze) API result as shingles too. By using shingle analyzer the elastic itself indexes the text as shingles, is there a way to access those shingles?",
"analyzer": "shingle_analyzer"
}
Or check out the term vectors information:
GET sh/_termvectors/1
{
"fields" : ["content.shingled"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
Will you be highlighting too?
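If highlighting is what you are after, something along these lines (a minimal sketch against the sh index above) would return fragments of the matched shingles instead of the whole field:
POST sh/_search
{
  "query": {
    "match": { "content.shingled": "access those shingles" }
  },
  "highlight": {
    "fields": {
      "content.shingled": {}
    }
  }
}
The highlight section of the response should then have em tags around the matched shingles, which gets you much closer to the "news and information on" style of output you expect.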

How to combine completion, suggestion and match phrase across multiple text fields?

I've been reading about Elasticsearch suggesters, match phrase prefix, and highlighting, and I'm a bit confused as to which one suits my problem.
Requirement: I have a bunch of different text fields and need to be able to autocomplete and autosuggest across all of them, as well as handle misspellings. Basically the way Google works.
See the following Google example: when we start typing "Can", it lists words like Canadian, Canada, etc. This is autocomplete. However, it also lists additional words like tire, post, post tracking, coronavirus, etc. This is autosuggest. It searches for the most relevant words in all fields. If we type "canxad" it should suggest the same results despite the misspelling.
Could someone please give me some hints on how I can implement the above functionality across a bunch of text fields?
At first I tried this:
GET /myindex/_search
{
"query": {
"match_phrase_prefix": {
"myFieldThatIsCombinedViaCopyTo": "revis"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match" : false
}
}
but it returns highlights like this:
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
So that's not a "prefix" anymore...
Also tried this:
GET /myindex/_search
{
"query": {
"multi_match": {
"query": "revis",
"fields": ["myFieldThatIsCombinedViaCopyTo"],
"type": "phrase_prefix",
"operator": "and"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
But it still returns
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
Note: I have about 5 "text" fields that I need to search across. One of those fields is quite long (thousands of words). If I break things up into keywords, I lose the phrase. So it's like I need match_phrase_prefix across a combined text field, with fuzziness?
EDIT
Here's an example of a document (some fields taken out, content snipped):
{
"id" : 1,
"respondent" : "Union of India",
"caseContent" : "<snip>..against the Union of India, through the ...<snip>"
}
As @Vlad suggested, I tried this:
POST /cases/_search
{
"suggest": {
"respondent-suggest": {
"prefix": "uni",
"completion": {
"field": "respondent.suggest",
"skip_duplicates": true
}
},
"caseContent-suggest": {
"prefix": "uni",
"completion": {
"field": "caseContent.suggest",
"skip_duplicates": true
}
}
}
}
Which returns this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"caseContent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [ ]
}
],
"respondent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "Union of India",
"_index" : "cases",
"_type" : "_doc",
"_id" : "dI5hh3IBEqNFLVH6-aB9",
"_score" : 1.0,
"_ignored" : [
"headNote.suggest"
],
"_source" : {
<snip>
}
}
]
}
]
}
}
So it looks like it matches on the respondent field, which is great! But it didn't match on the caseContent field, even though the text (see above) includes the phrase "against the Union of India". Shouldn't it match there? Or is it because of how the text is broken up?
Since you need autocomplete/suggest on each field, you need to run a suggest query on each field and not on the copy_to field. That way you're guaranteed to have the proper prefixes.
copy_to fields are great for searching in multiple fields, but not so good for auto-suggest/-complete type of queries.
The idea is that for each of your fields, you should have a completion sub-field so that you can get auto-complete results for each of them.
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text2": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text3": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
Your suggest queries would then run on all the sub-fields directly:
POST index/_search?pretty
{
"suggest": {
"text1-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text1.suggest"
}
},
"text2-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text2.suggest"
}
},
"text3-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text3.suggest"
}
}
}
}
That takes care of the auto-complete/-suggest part. For misspellings, the suggest queries allow you to specify a fuzzy parameter as well.
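For example (a sketch using the text1.suggest sub-field from the mapping above), a prefix with a typo can still return suggestions thanks to the fuzzy block:
POST index/_search
{
  "suggest": {
    "text1-suggest": {
      "prefix": "revsi",
      "completion": {
        "field": "text1.suggest",
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}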
UPDATE
If you need to do prefix search on all sentences within a body of text, the approach needs to change a bit.
The new mapping below creates a new completion field next to the text one. The idea is to apply a small transformation (i.e. split sentences) to what you're going to store in the completion field. So first create the index mapping like this:
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
},
"text1Suggest": {
"type": "completion"
}
}
}
}
Then create an ingest pipeline that will populate the text1Suggest field with sentences from the text1 field:
PUT _ingest/pipeline/sentence
{
"processors": [
{
"split": {
"field": "text1",
"target_field": "text1Suggest.input",
"separator": "\\.\\s+"
}
}
]
}
Then we can index a document such as this one (with only the text1 field, as the completion field will be built dynamically):
PUT test/_doc/1?pipeline=sentence
{
"text1": "The crazy fox. The quick snail. John goes to the beach"
}
What gets indexed looks like this (your text1 field + another completion field optimized for sentence prefix completion):
{
"text1": "The crazy fox. The cat drinks milk. John goes to the beach",
"text1Suggest": {
"input": [
"The crazy fox",
"The cat drinks milk",
"John goes to the beach"
]
}
}
And finally, you can search for prefixes of any sentence. Below we search for John and you should get a suggestion:
POST test/_search?pretty
{
"suggest": {
"text1-suggest": {
"prefix": "John",
"completion": {
"field": "text1Suggest"
}
}
}
}

How to find similar tags from text using elastic search

I am trying to use Elasticsearch to find the most similar tags from a text.
For example, I create test_index and insert two documents:
POST test_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST test_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
So, I expect to find the "software" tag (text or id) from the text "I'm using some softwares and applications".
I was hoping someone can provide an example on how to do this or at least point me in the right direction.
Thanks.
What you are looking for is a concept called stemming. You would need to create a custom analyzer that makes use of the Stemmer Token Filter.
Please find the below mapping, sample documents, query and response:
Mapping:
PUT my_stem_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"tags":{
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
From the comments, it appears that you are using a version below 7. In that case, you have to include the mapping type (_doc) in the mapping:
PUT my_stem_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_stemmer"
]
}
},
"filter":{
"my_stemmer":{
"type":"stemmer",
"name":"english"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"id":{
"type":"keyword"
},
"tags":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
Sample Documents:
POST my_stem_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST my_stem_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
POST my_stem_index/_doc/21
{
"id": 21,
"tags": ["softwares and applications", "hardwares and storage devices"]
}
Request Query:
POST my_stem_index/_search
{
"query": {
"match": {
"tags": "software"
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.5908618,
"hits" : [
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.5908618,
"_source" : {
"id" : 20,
"tags" : [
"software",
"hardware"
]
}
},
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.35965496,
"_source" : {
"id" : 21,
"tags" : [
"softwares and applications", <--- Note this has how `softwares` also was searchable.
"hardwares and storage devices"
]
}
}
]
}
}
Notice how both documents, i.e. _id 20 and 21, appear in the response.
Additional Note:
If you are new to Elasticsearch, I'd suggest spending some time understanding the concept of analysis and how Elasticsearch implements it using analyzers.
That would help you understand why the document with softwares and applications is also returned when you only query for software, and vice versa.
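If you want to see this in action, the Analyze API shows how the my_analyzer defined above reduces words to their stems; software and softwares both end up as the same token (softwar), which is why both documents match:
POST my_stem_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "softwares and applications"
}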
Hope this helps!
If you are searching for text that shares a base or root word, stemming is a good fit.
If you need to find the most similar word(s) in a text, an ngram approach is more suitable.
If you are searching for exact words from the text within the tags, shingles work better.
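For the ngram route, a minimal sketch (index and analyzer names here are just examples) could look like this; software and softwares then share most of their trigrams, so they match each other:
PUT my_ngram_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "trigram_filter"]
        }
      },
      "filter": {
        "trigram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "trigram_analyzer"
      }
    }
  }
}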

Elasticsearch mapping for the UK postcodes, able to deal with spacing and capitalization

I am looking for a mapping/analyzer setup for Elasticsearch 7 for UK postcodes. We do not require any fuzzy matching, but it should be able to deal with variation in capitalization and spacing.
Some examples:
Query string: "SN13 9ED" should return:
sn139ed
SN13 9ED
Sn13 9ed
but should not return:
SN13 1EP
SN131EP
The keyword analyzer is used by default and this seems to be sensitive to spacing issues, but not to capital letters. It also will return a match for SN13 1EP unless we specify a query as SN13 AND 9ED, which we do not want.
Additionally, with the keyword analyzer, a query of SN13 9ED returns a result of SN13 1EP with a higher relevance than SN13 9ED even though this should be the exact match. Why are 2 matches in the same string a lower relevance than just 1 match?
Mapping for postal code
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
Query
"query" => array:1 [▼
"query_string" => array:1 [▼
"query" => "KT2 7AJ"
]
]
Based on my comments, I believe you were already able to filter out SN13 1EP when your search string is SN13 9ED.
I hope you are aware of what analysis is, how analyzers work on text fields, and how, by default, the standard analyzer is applied to tokens before they are eventually stored in the inverted index. Note that this only applies to text fields.
Looking at your mapping, if you had searched on post_code rather than post_code.keyword, I believe the capitalization issue would already be resolved: for text fields ES uses the standard analyzer by default, so your tokens are saved in the index in lowercase, and at query time the same analysis is applied again before ES looks in the inverted index.
Note that by default the same analyzer as configured in the mapping is applied at index time as well as at search time on that field.
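For instance, you can verify with the Analyze API that the standard analyzer lowercases and splits the postcode, so a search on the post_code text field is not case sensitive:
POST _analyze
{
  "analyzer": "standard",
  "text": "SN13 9ED"
}
This produces the tokens sn13 and 9ed.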
For scenarios like sn131ep, what I've done is make use of the Pattern Capture Token Filter, with a regex that breaks the token into two parts of lengths 4 and 3, which are then stored in the inverted index, in this case as sn13 and 1ep. I'm also lowercasing them before they are stored.
Note that I'm assuming your postcodes have a fixed size of 7 characters. You can add more patterns if that is not the case.
Please see below for more details:
Mapping:
PUT my_postcode_index
{
"settings" : {
"analysis" : {
"filter" : {
"mypattern" : {
"type" : "pattern_capture",
"preserve_original" : true,
"patterns" : [
"(\\w{4}+)|(\\w{3}+)", <--- Note this and feel free to add more patterns
"\\s" <--- Filter based on whitespace
]
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "pattern",
"filter" : [ "mypattern", "lowercase" ] <--- Note the lowercase here
}
}
}
},
"mappings": {
"properties": {
"postcode":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this
"fields":{
"keyword":{
"type": "keyword"
}
}
}
}
}
}
Sample Documents:
POST my_postcode_index/_doc/1
{
"postcode": "SN131EP"
}
POST my_postcode_index/_doc/2
{
"postcode": "sn13 1EP"
}
POST my_postcode_index/_doc/3
{
"postcode": "sn131ep"
}
Note that these documents are semantically the same.
Request Query:
POST my_postcode_index/_search
{
"query": {
"query_string": {
"default_field": "postcode",
"query": "SN13 1EP",
"default_operator": "AND"
}
}
}
Response:
{
"took" : 24,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.6246513,
"hits" : [
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6246513,
"_source" : {
"postcode" : "SN131EP"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.6246513,
"_source" : {
"postcode" : "sn131ep"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5200585,
"_source" : {
"postcode" : "sn13 1EP"
}
}
]
}
}
Notice that all three documents are returned, whether the query is sn131ep or sn13 1ep.
Additional Note:
You can make use of the Analyze API to figure out what tokens are created for a particular text:
POST my_postcode_index/_analyze
{
"analyzer": "my_analyzer",
"text": "sn139ed"
}
And below you can see what tokens are stored in the inverted index:
{
"tokens" : [
{
"token" : "sn139ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "sn13",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "9ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
}
]
}
Also:
You may also want to read about the Ngram Tokenizer. I'd advise you to play around with both solutions and see what best suits your inputs.
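For reference, a minimal ngram-based alternative (index and analyzer names here are just examples) tokenizes each postcode into short overlapping chunks, so partial inputs like sn13 also find sn139ed:
PUT my_ngram_postcode_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "postcode_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "postcode_ngram_analyzer": {
          "tokenizer": "postcode_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "postcode": {
        "type": "text",
        "analyzer": "postcode_ngram_analyzer"
      }
    }
  }
}
Keep in mind that ngrams are recall-oriented: SN13 9ED would also partially match SN13 1EP unless you raise minimum_should_match or use the AND operator.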
Please test it and let me know if you have any queries.
In addition to Opster's answer, the following can also be used to tackle the issue from the opposite angle. Opster's answer suggests splitting the value by a known postcode pattern, which is great.
If we do not know the pattern, the following can be used:
{
"analysis": {
"filter": {
"whitespace_remove": {
"pattern": " ",
"type": "pattern_replace",
"replacement": ""
}
},
"analyzer": {
"no_space_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"tokenizer": "keyword"
}
}
}
}
{
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"analyzer": "no_space_analyzer"
}
}
This allows us to search with any kind of spacing, and with any case due to the lowercase filter.
sn13 1ep, s n 1 3 1 e p, sn131ep will all match against SN13 1EP
I think the main drawback to this option, however, is that we will no longer get any results for sn13, as we are not producing any partial tokens. sn13* would bring back results, however.
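A quick _analyze call with the same chain defined inline shows why: the whole postcode ends up as a single token, so there is nothing that the bare term sn13 can match:
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "pattern_replace",
      "pattern": " ",
      "replacement": ""
    }
  ],
  "text": "SN13 1EP"
}
This returns the single token sn131ep.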
Is it possible to mix both of these methods together so we can have the best of both worlds?

How to achieve searching on unstructured data in Spring boot Elastic Search integration with MongoDB

I am a newbie in Elasticsearch and want to know if the following case works for me.
I want to achieve search functionality on unstructured data. What I mean by that is that I don't know what kind of fields a model has; as you can see in the image below, I have a data property inside a model which can hold any kind of data.
I know how to connect MongoDB and Elasticsearch using mongo-connect, but I don't know whether that requirement can be achieved or not.
This answer is based on your last comment.
Let's say for example that your data field mappings look like:
PUT my_index
{
"mappings": {
"properties": {
"data": {
"type": "nested"
}
}
}
}
As you can see, we didn't add any specific fields to our schema; Elasticsearch will do that for us when we index the first document.
Insert a new document:
POST my_index/_doc/1
{
"data" : {
"adType" : "SELL",
"price" : "2000",
"numberOfRooms" : 20,
"isNegotiable" : "true",
"area" : 200
}
}
If we want to search for a value (say 2000) but we don't know which field it belongs to, then we could use the following query:
GET my_index/_search
{
"query": {
"nested": {
"path": "data",
"query": {
"multi_match": {
"query": "2000",
"fields": [],
"type": "best_fields"
}
}
}
}
}
We set fields=[] meaning:
If no fields are provided, the multi_match query defaults to the index.query.default_field index settings, which in turn defaults to *. * extracts all fields in the mapping that are eligible to term queries and filters the metadata fields. All extracted fields are then combined to build a query.
We used a multi_match query inside the nested query.
The results we get:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"data" : {
"adType" : "SELL",
"price" : "2000",
"numberOfRooms" : 20,
"isNegotiable" : "true",
"area" : 200
}
}
}
]
}
}
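If you'd rather target the nested object's fields explicitly instead of relying on the default, a wildcard field pattern works too (a sketch against the same index; lenient avoids format errors when the text value is tested against numeric fields):
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "data",
      "query": {
        "multi_match": {
          "query": "2000",
          "fields": ["data.*"],
          "lenient": true
        }
      }
    }
  }
}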
UPDATE
If your data field is just a plain text string rather than a nested object, insert a document like this:
POST my_index/_doc/1
{
"data" : "SELL 2000 20 true 200"
}
Then your query:
GET my_index/_search
{
"query": {
"match":
{
"data":"SELL 2000"
}
}
}
In Spring, using QueryBuilder:
QueryBuilder qb = QueryBuilders.matchQuery("data", "SELL 2000");
I hope this is what you were looking for.
