Elasticsearch on multiple fields with partial and full matches

Our Account model has a first_name, a last_name and an ssn (social security number).
I want to do partial matches on first_name and last_name but an exact match on ssn. I have this so far:
settings analysis: {
filter: {
substring: {
type: "nGram",
min_gram: 3,
max_gram: 50
},
ssn_string: {
type: "nGram",
min_gram: 9,
max_gram: 9
},
},
analyzer: {
index_ngram_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["lowercase", "substring"]
},
search_ngram_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["lowercase", "substring"]
},
ssn_ngram_analyzer: {
type: "custom",
tokenizer: "standard",
filter: ["ssn_string"]
},
}
}
mapping do
[:first_name, :last_name].each do |attribute|
indexes attribute, type: 'string',
index_analyzer: 'index_ngram_analyzer',
search_analyzer: 'search_ngram_analyzer'
end
indexes :ssn, type: 'string', index: 'not_analyzed'
end
My search is as follows:
query: {
multi_match: {
fields: ["first_name", "last_name", "ssn"],
query: query,
type: "cross_fields",
operator: "and"
}
}
So this works:
Account.search("erik").records.to_a
and even (for Erik Smith):
Account.search("erik smi").records.to_a
and the ssn:
Account.search("111112222").records.to_a
but not:
Account.search("erik 111112222").records.to_a
Any idea if I am indexing or querying wrong?
Thank you for any help!

Does it have to be done with a single query string? If not, I would do something like this:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"_all": {
"enabled": true,
"index_analyzer": "ngram_analyzer",
"search_analyzer": "standard"
},
"properties": {
"first_name": {
"type": "string",
"include_in_all": true
},
"last_name": {
"type": "string",
"include_in_all": true
},
"ssn": {
"type": "string",
"index": "not_analyzed",
"include_in_all": false
}
}
}
}
}
Notice the use of the _all field. I included first_name and last_name in _all, but not ssn, and ssn is not analyzed at all since I want to do exact matches against it.
I indexed a couple of documents for illustration:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"first_name":"Erik","last_name":"Smith","ssn":"111112222"}
{"index":{"_id":2}}
{"first_name":"Bob","last_name":"Jones","ssn":"123456789"}
Then I can query for the partial names, and filter by the exact ssn:
POST /test_index/doc/_search
{
"query": {
"filtered": {
"query": {
"match": {
"_all": {
"query": "eri smi",
"operator": "and"
}
}
},
"filter": {
"term": {
"ssn": "111112222"
}
}
}
}
}
And I get back what I'm expecting:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.8838835,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.8838835,
"_source": {
"first_name": "Erik",
"last_name": "Smith",
"ssn": "111112222"
}
}
]
}
}
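To make the mechanics of this setup easier to follow, here is a rough Python sketch of what the index and the query above do. It is only an approximation: the regex stands in for the standard tokenizer, and the helper names (standard_tokens, ngrams, index_all_field, search) are made up for illustration and are not part of Elasticsearch.

```python
import re

def standard_tokens(text):
    """Rough stand-in for the standard tokenizer plus lowercase filter."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of token whose length is between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    }

def index_all_field(doc):
    """Build the n-gram term set for _all from first_name and last_name only."""
    terms = set()
    for field in ("first_name", "last_name"):
        for token in standard_tokens(doc[field]):
            terms |= ngrams(token)
    return terms

def search(docs, text, ssn):
    """AND-match every query token against _all, then filter on the exact ssn."""
    wanted = standard_tokens(text)  # search side: tokens are not n-grammed
    return [
        doc for doc in docs
        if doc["ssn"] == ssn and all(t in index_all_field(doc) for t in wanted)
    ]

docs = [
    {"first_name": "Erik", "last_name": "Smith", "ssn": "111112222"},
    {"first_name": "Bob", "last_name": "Jones", "ssn": "123456789"},
]

print(search(docs, "eri smi", "111112222"))  # the Erik Smith document
print(search(docs, "eri smi", "11111"))      # []: a partial ssn never matches
```

The point of the sketch is the asymmetry: names are expanded into n-grams at index time only, while the ssn is compared as one whole value.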
If you need to be able to do the search with a single query string (no filter), you could include ssn in the _all field as well, but with this setup it will also match on partial strings (like 111112), so that may not be what you want.
If you only want to match prefixes (i.e., search terms that start at the beginning of the words), you should use edge ngrams.
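As a small illustration of the difference, the following Python sketch contrasts the two kinds of grams. The function names are hypothetical and the defaults simply reuse the min_gram/max_gram values from the settings above.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """Every substring of token within the length bounds (ngram filter)."""
    return {
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    }

def edge_ngrams(token, min_gram=2, max_gram=20):
    """Only the prefixes of token within the length bounds (edge ngram filter)."""
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

print("mit" in ngrams("smith"))       # True: matches in the middle of the word
print("mit" in edge_ngrams("smith"))  # False: not a prefix
print("smi" in edge_ngrams("smith"))  # True: a prefix
```

Edge ngrams also produce far fewer terms per word, which keeps the index smaller.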
I wrote a blog post about using ngrams which might help you out a little: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Here is the code I used for this answer. I tried a few different things, including the setup I posted here, and another including ssn in _all, but with edge ngrams. Hope this helps:
http://sense.qbox.io/gist/b6a31c929945ef96779c72c468303ea3bc87320f

Related

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find a word like 'food2u' by searching for the keyword 'food'. But I only get results like 'Food Repo', 'Give Food', etc. The field's mapping type is 'text', and this is my query:
GET api/_search
{"query": {
"match": {
"Name": {
"query": "food"
}
}
},
"_source":{
"includes":["Name"]
}
}
You are getting the results like 'Food Repo','Give Food', as the text field uses a standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo. Similarly Give Food gets tokenized into give and food.
But food2u gets tokenized into food2u. Since there is no matching token ("food"), you will not get the food2u document.
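The tokenization described above can be mimicked in a few lines of Python. This is a simplification, not the real analyzer: the regex only approximates how the standard analyzer splits text, and standard_analyze and match are names invented for this sketch.

```python
import re

def standard_analyze(text):
    """Approximate the standard analyzer: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match(field_value, query):
    """A match query hits when any analyzed query token equals an indexed token."""
    indexed = set(standard_analyze(field_value))
    return any(t in indexed for t in standard_analyze(query))

for name in ["Food Repo", "Give Food", "food2u"]:
    print(name, standard_analyze(name), match(name, "food"))
```

Because letters and digits stay together, food2u remains a single token and never equals food.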
You need to use an edge_ngram tokenizer to do a partial text match.
Here is a working example with index data, mapping, search query and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"food2u"
}
Search Query:
{
"query": {
"match": {
"name": "food"
}
}
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
If you don't want to change the mapping, you can instead use a wildcard query to return the matching documents:
{
"query": {
"wildcard": {
"Name": {
"value": "food*"
}
}
}
}
Or you can use query_string with a wildcard:
{
"query": {
"query_string": {
"query": "food*",
"fields": [
"Name"
]
}
}
}
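For intuition, a wildcard pattern is compared against the terms already stored in the index, and the pattern itself is not analyzed. The Python sketch below imitates that with fnmatch, which is only an analogy (fnmatch also understands [seq] patterns). The term list is an assumed picture of what the standard analyzer would have stored for the three names in the question.

```python
from fnmatch import fnmatchcase

def wildcard_query(indexed_terms, pattern):
    """Return indexed terms matching pattern ('*' = any run of characters, '?' = one)."""
    return [t for t in indexed_terms if fnmatchcase(t, pattern)]

# Assumed lowercase terms for 'Food Repo', 'Give Food' and 'food2u'
terms = ["food", "repo", "give", "food2u"]

print(wildcard_query(terms, "food*"))  # ['food', 'food2u']
print(wildcard_query(terms, "Food*"))  # []: stored terms are lowercase
```

Since the pattern has to be checked against many terms, wildcard queries tend to be slower than a match on pre-built edge ngrams.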

Return only exact matches (substrings) in full text search (elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed:
{title: "Joe Dirt"}
{title: "Meet Joe Black"}
{title: "Tomorrow Never Dies"}
and the search query is "I want to watch the movie Joe Dirt tomorrow"
I want to find results where the full title matches as a substring of the search query. If I use a straight match query, all of these documents will be returned because they all match one of the words. I really just want to return "Joe Dirt" because the title is an exact match substring of the search query.
Is that possible in elasticsearch?
Thanks!
One way to achieve this is as follows:
1) While indexing, index the title using the keyword tokenizer.
2) While searching, use a shingle token filter to extract substrings from the query string and match them against the title.
Example:
Index Settings
put test
{
"settings": {
"analysis": {
"analyzer": {
"substring": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"substring"
]
},
"exact": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"filter": {
"substring": {
"type":"shingle",
"output_unigrams" : true
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "exact"
}
}
}
}
}
}
}
Index Documents
put test/movie/1
{"title": "Joe Dirt"}
put test/movie/2
{"title": "Meet Joe Black"}
put test/movie/3
{"title": "Tomorrow Never Dies"}
Query
post test/_search
{
"query": {
"match": {
"title.raw" : {
"analyzer": "substring",
"query": "Joe Dirt tomorrow"
}
}
}
}
Result :
"hits": {
"total": 1,
"max_score": 0.015511602,
"hits": [
{
"_index": "test",
"_type": "movie",
"_id": "1",
"_score": 0.015511602,
"_source": {
"title": "Joe Dirt"
}
}
]
}
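The idea can be sketched in Python as follows. This is an approximation with invented helper names (shingles, exact_term). The shingle sizes of 2 are the defaults I believe the shingle filter uses when min_shingle_size and max_shingle_size are not set, as in the settings above.

```python
import re

def shingles(text, min_size=2, max_size=2, output_unigrams=True):
    """Lowercased word n-grams, mimicking standard tokenizer + lowercase + shingle."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    out = set(words) if output_unigrams else set()
    for size in range(min_size, max_size + 1):
        for i in range(len(words) - size + 1):
            out.add(" ".join(words[i:i + size]))
    return out

def exact_term(title):
    """Keyword tokenizer + lowercase: the whole title becomes a single term."""
    return title.lower()

titles = ["Joe Dirt", "Meet Joe Black", "Tomorrow Never Dies"]
query_terms = shingles("Joe Dirt tomorrow")
matches = [t for t in titles if exact_term(t) in query_terms]
print(matches)  # ['Joe Dirt']
```

One consequence worth noting: with a maximum shingle size of 2, a three-word title such as "Meet Joe Black" can never match, so the maximum would have to be raised to cover the longest title you expect.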

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas". On searching, both documents have the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram Analyzer. Also I am using aggregations so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
I'm assuming the values you gave are in the name field, and that the field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
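A toy Python model of that must/should behavior is shown below. The scores are flat constants chosen for illustration, not what Lucene would compute. The must clause is simulated as "some word starts with the term", values are lowercased for simplicity, and "Main Street" is a made-up extra document.

```python
def bool_score(name, term):
    """Return None when the must clause fails, else must score plus optional should score."""
    value = name.lower()
    # Stand-in for the match clause: some word begins with the term.
    must = any(word.startswith(term) for word in value.split())
    if not must:
        return None  # the document is excluded from the results
    # Stand-in for the prefix clause: the whole value begins with the term.
    should = value.startswith(term)
    return 1.0 + (1.0 if should else 0.0)

names = ["Jalan Casa", "Casa Road", "Main Street"]
scored = [(n, bool_score(n, "cas")) for n in names]
ranked = sorted((p for p in scored if p[1] is not None), key=lambda p: p[1], reverse=True)
print(ranked)  # [('Casa Road', 2.0), ('Jalan Casa', 1.0)]
```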
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text and the number of terms in the text. The actual scoring is computed by a script, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping with term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
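The arithmetic behind those two scores can be checked with a short Python sketch of the script's formula, (wordCount - position) / wordCount. The function name is hypothetical, returning 0.0 when the term is absent is my assumption, and the base value is inferred by subtracting the script value from the reported _score (boost_mode is sum).

```python
def position_score(text, term):
    """(word_count - position) / word_count for the first word starting with term."""
    words = text.lower().split()
    for position, word in enumerate(words):
        if word.startswith(term):
            return (len(words) - position) / len(words)
    return 0.0  # assumption: no contribution when the term is not found

base = 0.3715843  # inferred: reported _score minus the script value

for text in ["Casa Road", "Jalan Casa"]:
    print(text, round(base + position_score(text, "cas"), 7))
```

"Casa Road" gets (2 - 0) / 2 = 1.0 and "Jalan Casa" gets (2 - 1) / 2 = 0.5, which accounts for the 0.5 gap between the two results.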

elasticsearch context suggester stopwords

Is there a way to analyze a field that is passed to the context suggester?
If, say, I have this in my mapping:
mappings: {
myitem: {
title: {type: 'string'},
content: {type: 'string'},
user: {type: 'string', index: 'not_analyzed'},
suggest_field: {
type: 'completion',
payloads: false,
context: {
user: {
type: 'category',
path: 'user'
},
}
}
}
}
and I index this doc:
POST /myindex/myitem/1
{
title: "The Post Title",
content: ...,
user: 123,
suggest_field: {
input: "The Post Title",
context: {
user: 123
}
}
}
I would like to analyze the input first, split it into separate words, run it through lowercase and stop words filters so that the context suggester actually gets
suggest_field: {
input: ["post", "title"],
context: {
user: 123
}
}
I know I can pass an array into the suggest field but I would like to avoid lowercasing the text, splitting it, running the stop words filter in my application, before passing to ES. If possible, I would rather ES do this for me. I did try adding an index_analyzer to the field mapping but that didn't seem to achieve anything.
OR, is there another way to get autocomplete suggestions for words?
Okay, so this is pretty involved, but I think it does what you want, more or less. I'm not going to explain the whole thing, because that would take quite a bit of time. However, I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (what used to be called a multi_field) that use different analyzers, or none. The query contains a couple of terms aggregations. Also notice that the aggregations results are filtered by the match query to only return results relevant to the text query.
Here is the index setup (spend some time looking through this; if you have specific questions I will try to answer them but I encourage you to go through the blog post first):
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"stop_filter": {
"type": "stop"
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter"
]
},
"stopword_only_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"asciifolding",
"stop_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"stopword_only": {
"type": "string",
"analyzer": "stopword_only_analyzer"
}
}
}
}
}
}
}
Then I added a few docs:
PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}
Now I can search the documents with word prefixes if I want (or the full words, capitalized or not), and use aggregations to return both the intact titles of the matching documents, as well as intact (non-lowercased) words, minus the stopwords:
POST /test_index/_search?search_type=count
{
"query": {
"match": {
"title": {
"query": "mer king",
"operator": "or"
}
}
},
"aggs": {
"word_tokens": {
"terms": { "field": "title.stopword_only" }
},
"intact_titles": {
"terms": { "field": "title.raw" }
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"intact_titles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The Lion King",
"doc_count": 1
},
{
"key": "The Little Mermaid",
"doc_count": 1
}
]
},
"word_tokens": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The",
"doc_count": 2
},
{
"key": "King",
"doc_count": 1
},
{
"key": "Lion",
"doc_count": 1
},
{
"key": "Little",
"doc_count": 1
},
{
"key": "Mermaid",
"doc_count": 1
}
]
}
}
}
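Notice that "The" gets returned. This seems to be because the default _english_ stopwords contain only the lowercase "the", and stopword_only_analyzer has no lowercase filter, so the capitalized token passes through. I didn't immediately find a way around this.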
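Here is a small Python sketch of that case-sensitivity effect. The stopword set is only a subset chosen for illustration, asciifolding is left out because the titles are plain ASCII, and the ignore_case flag mimics the option of the same name documented for the stop token filter. Whether that option fixes the aggregation on the Elasticsearch version used here is untested.

```python
STOPWORDS = {"a", "an", "and", "the", "of", "in"}  # lowercase subset, for illustration

def stopword_only(text, ignore_case=False):
    """Whitespace tokenizer followed by a stop filter, with no lowercase step."""
    kept = []
    for token in text.split():
        key = token.lower() if ignore_case else token
        if key not in STOPWORDS:
            kept.append(token)  # the original casing is preserved
    return kept

print(stopword_only("The Lion King"))                    # ['The', 'Lion', 'King']
print(stopword_only("Beauty and the Beast"))             # ['Beauty', 'Beast']
print(stopword_only("The Lion King", ignore_case=True))  # ['Lion', 'King']
```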
Here is the code I used:
http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79
Let me know if that helps you solve your problem.
You can set up an analyzer which does this for you.
The tutorial called "You Complete Me" has a section about stopwords.
Elasticsearch has changed since that article was written. The standard analyzer no longer removes stopwords, so you need to use the stop analyzer instead.
The mapping
curl -X DELETE localhost:9200/hotels
curl -X PUT localhost:9200/hotels -d '
{
"mappings": {
"hotel" : {
"properties" : {
"name" : { "type" : "string" },
"city" : { "type" : "string" },
"name_suggest" : {
"type" : "completion",
"index_analyzer" : "stop", // NOTE: this is the difference
"search_analyzer" : "stop", // from the article
"preserve_position_increments": false,
"preserve_separators": false
}
}
}
}
}'
Getting suggestion
curl -X POST localhost:9200/hotels/_suggest -d '
{
"hotels" : {
"text" : "m",
"completion" : {
"field" : "name_suggest"
}
}
}'
Hope this helps. I have spent a long time looking for this answer myself.

Elasticsearch multi_field type search and sort issue

I'm having an issue with the multi_field mapping type in one of my indexes and I am not sure what the issue is. I use a very similar mapping in another index and I don't have these issues. The ES version is 0.90.12.
I have a mapping that looks like this:
{
"settings": {
"index": {
"number_of_shards": 10,
"number_of_replicas": 1
}
},
"mappings": {
"production": {
"properties": {
"production_title": {
"type": "multi_field",
"fields": {
"production_title_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"production_title": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The .yml looks like this:
index:
  mapper:
    dynamic: true
  analysis:
    analyzer:
      autocomplete_index:
        tokenizer: keyword
        filter: ["lowercase", "autocomplete_ngram"]
      autocomplete_search:
        tokenizer: keyword
        filter: lowercase
      ngram_index:
        tokenizer: keyword
        filter: ["ngram_filter"]
      ngram_search:
        tokenizer: keyword
        filter: lowercase
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        side: front
      ngram_filter:
        type: nGram
        min_gram: 2
        max_gram: 8
So doing this:
curl -XGET 'http://localhost:9200/productionindex/production/_search' -d '{
"sort": [
{
"production_title": "asc"
}
],
"size": 1
}'
and
curl -XGET 'http://localhost:9200/productionindex/production/_search' -d '{
"sort": [
{
"production_title": "desc"
}
],
"size": 1
}'
I end up with the exact same result somewhere in the middle of the alphabet:
"production_title": "IL, 'Hoodoo Love'"
However, if I do this:
{
"query": {
"term": {
"production_title": "IL, 'Hoodoo Love'"
}
}
}
I get zero results.
Furthermore, if I do this:
{
"query": {
"match": {
"production_title_edgengram": "Il"
}
}
}
I also get zero results.
If I don't use multi_field and I separate them out, I can then search on them fine (term and autocomplete), but I still can't sort.
When indexing, I only send production_title for the multi_field.
Does anyone have any idea what is going on here?
Below is the explain output (last result only, for brevity):
{
"_shard": 6,
"_node": "j-D2SYPCT0qZt1lD1RcKOg",
"_index": "productionindex",
"_type": "production",
"_id": "casting_call.productiondatetime.689",
"_score": null,
"_source": {
"venue_state": "WA",
"updated_date": "2014-03-10T12:08:13.927273",
"django_id": 689,
"production_types": [
69,
87,
89
],
"production_title": "WA, 'Footloose'"
},
"sort": [
null
],
"_explanation": {
"value": 1.0,
"description": "ConstantScore(cache(_type:audition)), product of:",
"details": [
{
"value": 1.0,
"description": "boost"
},
{
"value": 1.0,
"description": "queryNorm"
}
]
}
}
from this curl:
curl -XPOST 'http://localhost:9200/productionindex/production/_search?pretty=true&explain=true' -d '{
"query": {
"match_all": {}
},
"sort": [
{
"production_title": {
"order": "desc"
}
}
]
}'
