How to find similar tags from text using Elasticsearch

I am trying to use Elasticsearch to find the most similar tags for a given text.
For example, I create test_index and insert two documents:
POST test_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST test_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
So, I expect to find the "software" tag (its text or id) from the text "I'm using some softwares and applications".
I was hoping someone could provide an example of how to do this, or at least point me in the right direction.
Thanks.

What you are looking for is a concept called stemming. You would need to create a custom analyzer and make use of the stemmer token filter.
Below are the mapping, sample documents, query and response:
Mapping:
PUT my_stem_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"tags":{
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
From the comments, it appears that you are using a version earlier than 7. In that case you have to include the mapping type in the mapping:
PUT my_stem_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_stemmer"
]
}
},
"filter":{
"my_stemmer":{
"type":"stemmer",
"name":"english"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"id":{
"type":"keyword"
},
"tags":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
Sample Documents:
POST my_stem_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST my_stem_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
POST my_stem_index/_doc/21
{
"id": 21,
"tags": ["softwares and applications", "hardwares and storage devices"]
}
Request Query:
POST my_stem_index/_search
{
"query": {
"match": {
"tags": "software"
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.5908618,
"hits" : [
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.5908618,
"_source" : {
"id" : 20,
"tags" : [
"software",
"hardware"
]
}
},
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.35965496,
"_source" : {
"id" : 21,
"tags" : [
"softwares and applications", <--- Note this has how `softwares` also was searchable.
"hardwares and storage devices"
]
}
}
]
}
}
Notice how both documents, i.e. those with _id 20 and 21, appear in the response.
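Coming back to the original question, you can also pass the whole sentence as the query; the match query runs it through the same analyzer, so the stemmed form of softwares still matches documents 20 and 21. A quick sketch:
POST my_stem_index/_search
{
  "query": {
    "match": {
      "tags": "I'm using some softwares and applications"
    }
  }
}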
Additional Note:
If you are new to Elasticsearch, I'd suggest spending some time understanding the concept of analysis and how Elasticsearch implements it using analyzers.
That will help you understand why the document with softwares and applications is also returned when you only query for software, and vice versa.
Hope this helps!

If you are searching text that contains a base or root word, stemming is a good approach.
If you need to find the most similar word(s) in a text, n-grams are more suitable (see the sketch below).
If you are searching for exact word sequences from the text within the tags, shingles are a better fit.
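For reference, here is a rough sketch of an n-gram based setup (the index, analyzer and filter names are made up for illustration, and min_gram/max_gram would need tuning for your data):
PUT my_ngram_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}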

Related

Elasticsearch English stemming not working correctly

I've added an english stemmer analyzer and filter to our query but it doesn't seem to be working correctly with plurals stemming from 'y' => 'ies'.
For example, when I search 'raspberry' the results never include 'raspberries' and so on.
I've tried both english and minimal_english but I still get the same result.
Here's the analyzer and settings:
analysis: {
analyzer: {
custom_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["lowercase", "english_stemmer"],
},
},
filter: {
english_stemmer: {
type: "stemmer",
language: "english",
},
},
},
}
What am I doing wrong?
Though english should work for the example you mentioned, you can also go for porter_stem instead. This is equivalent to the stemmer filter with language english.
porter_stem in action:
POST /_analyze
{
"tokenizer": "standard",
"filter": ["porter_stem"],
"text": ["raspberry", "raspberries"]
}
Response of above request:
{
"tokens" : [
{
"token" : "raspberri",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "raspberri",
"start_offset" : 10,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 101
}
]
}
You can see that both raspberry and raspberries get tokenised to raspberri. Therefore searching for raspberry will also match raspberries, and vice versa.
Make sure that the field you are indexing and searching against has its analyzer defined as custom_analyzer (per the settings you stated in your question).
Working example:
Mapping:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stemmer"
]
}
},
"filter": {
"english_stemmer": {
"type": "stemmer",
"language": "english"
}
}
}
},
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Indexing:
PUT test/_doc/1
{
"field1": "raspberries"
}
PUT test/_doc/2
{
"field1": "raspberry"
}
Search:
GET test/_search
{
"query": {
"match": {
"field1": {
"query": "raspberry"
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberries"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.18232156,
"_source" : {
"field1" : "raspberry"
}
}
]
}
}
You can also have a look at another stemmer, kstem.
Unfortunately, porter_stem doesn't always work, e.g. virus and viruses. Someone suggested snowball - but I haven't tried it yet...
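If you want to check how snowball behaves for that case, one quick way is to run both forms through _analyze and compare the emitted tokens (a sketch; whether the two stems end up identical depends on the stemmer):
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["snowball"],
  "text": ["virus", "viruses"]
}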

How to do an exact match query in ElasticSearch?

I want to do an exact match query against an Elasticsearch index.
I have the following data:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.21110919,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.21110919,
"_source" : {
"id" : 1,
"name" : "test"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.160443,
"_source" : {
"id" : 2,
"name" : "test two"
}
}
]
}
}
I want to query the field name.
I am trying to search for the name test,
but it returns both documents.
The expected result is only document 1.
Mapping is as follows -
{
"test" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
I tried the following -
GET /test/_search
{
"query": {
"bool": {
"must": {
"term" : {
"name": "test"
}
}
}
}
}
GET /test/_search
{
"query": {
"match": {
"name": "test"
}
}
}
In addition to the answer I linked in a comment, I would suggest defining the name field as:
{
"name":{
"type": "text",
"fields":{
"keyword":{
"type": "keyword"
}
}
}
}
and then query the field name.keyword whenever you require an exact (case-sensitive) match, and name when you want a partial match, such as searching on the first name only.
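For the exact match, a term query against the keyword sub-field would look roughly like this (assuming the name.keyword mapping above), and should return only document 1:
GET /test/_search
{
  "query": {
    "term": {
      "name.keyword": "test"
    }
  }
}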
It looks like you are using the text datatype on your name field, which splits test two into two tokens, test and two; hence it matches your search query test, since the match query is analyzed with the same analyzer and the resulting tokens are matched against the document's tokens in the inverted index.
Solution using your example
Index definition:
{
"mappings": {
"properties": {
"name": {
"type": "keyword" --> note use of `keyword` type
}
}
}
}
Index your sample docs:
{
"name" : "test two"
}
{
"name" : "test"
}
Search query (same as yours):
{
"query": {
"match": {
"name": "test"
}
}
}
Search results (as you want):
"hits": [
{
"_index": "so_key",
"_type": "_doc",
"_id": "1",
"_score": 0.6931471,
"_source": {
"name": "test"
}
}
]
Important note: you can use the _analyze API to see how your data is indexed. For example:
Using standard (the default analyzer) on the text field:
POST _analyze
{
"text": "test two",
"analyzer" : "standard" --> Change analyzer to keyword and see diff
}
Tokens
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "two",
"start_offset": 5,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
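For comparison, switching the analyzer to keyword (as the comment in the request above suggests) should emit the whole string as a single token, which is why a keyword-typed field only matches the exact value test two and not test:
POST _analyze
{
  "text": "test two",
  "analyzer": "keyword"
}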

Building an effective Elasticsearch query for cross_fields with fuzziness

I know that Elasticsearch does not support fuzziness with the cross_fields type in a multi_match query. I have a very difficult time with the Elasticsearch API and so I'm finding it challenging to build an analogous query that searches across multiple document fields with fuzzy string matching.
I have an index called papers with various fields such as Title, Author.FirstName, Author.LastName, PublicationDate, Journal etc... I want to be able to query with a string like "John Doe paper title 2015 journal name". cross_fields is the perfect multi_match type but it doesn't support fuzziness which is critical for my application.
Can anyone suggest a reasonable way to approach this? I've spent hours going through solutions on SO and the Elasticsearch forums with little success.
You can make use of the copy_to feature for this scenario. Basically, you copy the values from the different fields into one new field (my_search_field in the details below), and on that field you can perform a fuzzy query via the fuzziness parameter using a simple match query.
Below are a sample mapping, document, query and response:
Mapping:
PUT my_fuzzy_index
{
"mappings": {
"properties": {
"my_search_field":{ <---- Note this field
"type": "text"
},
"Title":{
"type": "text",
"copy_to": "my_search_field" <---- Note this
},
"Author":{
"type": "nested",
"properties": {
"FirstName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
},
"LastName":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
},
"PublicationDate":{
"type": "date",
"copy_to": "my_search_field" <---- Note this
},
"Journal":{
"type":"text",
"copy_to": "my_search_field" <---- Note this
}
}
}
}
Sample Document:
POST my_fuzzy_index/_doc/1
{
"Title": "Fountainhead",
"Author":[
{
"FirstName": "Ayn",
"LastName": "Rand"
}
],
"PublicationDate": "2015",
"Journal": "journal"
}
Query Request:
POST my_fuzzy_index/_search
{
"query": {
"match": {
"my_search_field": { <---- Note this field
"query": "Aynnn Ranaad Fountainhead 2015 journal",
"fuzziness": 3 <---- Fuzzy parameter
}
}
}
}
Response:
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.1027813,
"hits" : [
{
"_index" : "my_fuzzy_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.1027813,
"_source" : {
"Title" : "Fountainhead",
"Author" : [
{
"FirstName" : "Ayn",
"LastName" : "Rand"
}
],
"PublicationDate" : "2015",
"Journal" : "journal"
}
}
]
}
}
So instead of trying to apply a fuzzy query across multiple fields, you can go for this approach. That way your query is simplified.
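As a side note, if changing the mapping is not an option, fuzziness is accepted by the best_fields and most_fields multi_match types (it is only cross_fields and the phrase types that reject it), so something roughly like the following might also work, assuming the Author fields are not mapped as nested (nested fields would need a nested query):
POST papers/_search
{
  "query": {
    "multi_match": {
      "query": "John Doe paper title 2015 journal name",
      "type": "most_fields",
      "fields": ["Title", "Author.FirstName", "Author.LastName", "Journal"],
      "fuzziness": "AUTO"
    }
  }
}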
Let me know if this helps!

Why isn't my Elasticsearch query returning the text analyzed by the english analyzer?

I have an index named test_blocks
{
"test_blocks" : {
"aliases" : { },
"mappings" : {
"block" : {
"dynamic" : "false",
"properties" : {
"content" : {
"type" : "string",
"fields" : {
"content_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"id" : {
"type" : "long"
},
"title" : {
"type" : "string",
"fields" : {
"title_en" : {
"type" : "string",
"analyzer" : "english"
}
}
},
"user_id" : {
"type" : "long"
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1438642440687",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"version" : {
"created" : "1070099"
},
"uuid" : "45vkIigXSCyvHN6g-w5kkg"
}
},
"warmers" : { }
}
}
When I do a search for killing, a word in the content, the search results return as expected.
http://localhost:9200/test_blocks/_search?q=killing&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.07431685,
"hits" : [ {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "218",
"_score" : 0.07431685,
"_source":{"block":{"id":218,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":82}}
}, {
"_index" : "test_blocks",
"_type" : "block",
"_id" : "219",
"_score" : 0.07431685,
"_source":{"block":{"id":219,"title":"The \u003ci\u003eparticle\u003c/i\u003e streak","content":"Barry Allen is a Central City police forensic scientist\n with a reasonably happy life, despite the childhood\n trauma of a mysterious red and yellow being killing his\n mother and framing his father. All that changes when a\n massive \u003cb\u003eparticle\u003c/b\u003e accelerator accident leads to Barry\n being struck by lightning in his lab.","user_id":83}}
} ]
}
}
However given that I have an english analyzer for the content field (content_en), I would have expected it to return me the same document for the query kill. But it doesn't. I get 0 hits.
http://localhost:9200/test_blocks/_search?q=kill&pretty=1
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
My understanding, based on this analyze query, is that "killing" would have been broken down into "kill":
http://localhost:9200/_analyze?analyzer=english&text=killing
{
"tokens" : [ {
"token" : "kill",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
So why doesn't the query "kill" match that document? Are my mappings incorrect, or is it my search that is incorrect?
I am using elasticsearch v1.7.0
You need to use fuzzy search (some introduction is available here):
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"match": {
"title": {
"query": "kill",
"fuzziness": 2,
"prefix_length": 1
}
}
}
}'
UPD: Since the content_en field contains the stemmed content, it makes sense to actually query that field:
curl -XPOST 'http://localhost:9200/test_blocks/_search' -d '
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "kill",
"fields": ["block.title", "block.title.title_en"]
}
}
}'
Queries like http://localhost:9200/_search?q=kill end up searching across the _all field.
The _all field uses the default analyzer, which, unless overridden, is the standard analyzer and not the english analyzer.
To make the above query work you would need to add the english analyzer to the _all field and re-index.
Example:
{
"mappings": {
"block": {
"_all" : {"analyzer" : "english"}
}
}
}
I would also point out that the mapping in the OP doesn't seem consistent with the document structure. As @EugZol pointed out, the content is within the block object, so the mapping should be something along these lines:
{
"mappings": {
"block": {
"properties": {
"block": {
"properties": {
"content": {
"type": "string",
"analyzer": "standard",
"fields": {
"content_en": {
"type": "string",
"analyzer": "english"
}
}
},
"id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"title_en": {
"type": "string",
"analyzer": "english"
}
}
},
"user_id": {
"type": "long"
}
}
}
}
}
}
}
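With a corrected mapping along those lines (and the documents re-indexed), querying the stemmed sub-field directly should then match, for example:
POST test_blocks/_search
{
  "query": {
    "match": {
      "block.content.content_en": "kill"
    }
  }
}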

Elasticsearch data model

I'm currently parsing text from internal résumés in my company. The goal is to index everything in Elasticsearch to perform searches on them.
For the moment I have the following JSON document, with no mapping defined.
Each coworker has a list of projects with the client name:
{
name: "Jean Wisser"
position: "Junior Developer"
"projects": [
{
"client": "SutrixMedia",
"missions": [
"Responsible for the quality on time and within budget",
"Writing specs, testing,..."
],
"technologies": "JIRA/Mantis/Adobe CQ5 (AEM)"
},
{
"client": "Société Générale",
"missions": [
" Writing test cases and scenarios",
" UAT"
],
"technologies": "HP QTP/QC"
}
]
}
The two main questions we would like to answer are:
Which coworker has already worked for this company?
Which client uses this technology?
The first question is really easy to answer, for example:
Projects.client="SutrixMedia" returns me the right resume.
But how can I answer the second one?
I would like to make a query like this: Projects.technologies="HP QTP/QC", and the answer would be only the client name ("Société Générale" in this case) and NOT the entire document.
Is it possible to get this answer by defining a mapping with a nested type?
Or should I go for a parent/child mapping?
Yes, indeed, that's possible with ES 1.5.x if you map projects as a nested type and then retrieve the nested inner_hits.
Here is the mapping for your sample document above:
curl -XPUT localhost:9200/resumes -d '
{
"mappings": {
"resume": {
"properties": {
"name": {
"type": "string"
},
"position": {
"type": "string"
},
"projects": {
"type": "nested", <--- declare "projects" as nested type
"properties": {
"client": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"missions": {
"type": "string"
},
"technologies": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
}'
Then, you can index your sample document from above:
curl -XPUT localhost:9200/resumes/resume/1 -d '{...}'
Finally, with the following query, which only retrieves the nested inner_hits, you can get back only the nested object that matches Projects.technologies="HP QTP/QC":
curl -XPOST localhost:9200/resumes/resume/_search -d '
{
"_source": false,
"query": {
"nested": {
"path": "projects",
"query": {
"term": {
"projects.technologies.raw": "HP QTP/QC"
}
},
"inner_hits": { <----- only retrieve the matching nested document
"_source": "client" <----- and only the "client" field
}
}
}
}'
which yields only the client name instead of the whole matching document:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.4054651,
"hits" : [ {
"_index" : "resumes",
"_type" : "resume",
"_id" : "1",
"_score" : 1.4054651,
"inner_hits" : {
"projects" : {
"hits" : {
"total" : 1,
"max_score" : 1.4054651,
"hits" : [ {
"_index" : "resumes",
"_type" : "resume",
"_id" : "1",
"_nested" : {
"field" : "projects",
"offset" : 1
},
"_score" : 1.4054651,
"_source":{"client":"Société Générale"} <--- here is the client name
} ]
}
}
}
} ]
}
}
