Elasticsearch completion suggester produces duplicate results

I am using the Elasticsearch completion suggester these days, and I have a problem: it always produces near-duplicate results.
Say I search with the following statement:
"my_suggestion": {
> "text": "ni",
> "completion": {
> "field": "my_name_for_sug"
> }
> }
And get the following results:
"my_suggestion" : [ {
"text" : "ni",
"offset" : 0,
"length" : 2,
"options" : [ {
"text" : "Nine West",
"score" : 329.0
}, {
"text" : "Nine West ",
"score" : 329.0
}, {
"text" : "Nike",
"score" : 295.0
}, {
"text" : "NINE WEST",
"score" : 168.0
}, {
"text" : "NINE WEST ",
"score" : 168.0
} ]
} ],
So the question is: how can I merge or aggregate results that are effectively the same, like "NINE WEST" and "NINE WEST "?
The mapping is:
"my_name_for_sug": {
"type": "completion"
,"analyzer": "ik_max_word"
,"search_analyzer": "ik_max_word"
,"payloads": true
,"preserve_separators": false
}
where ik_max_word is a Chinese-specific analyzer that can also do the standard analyzer's job.
Thanks

Elasticsearch suggesters automatically de-duplicate identical outputs (at least up to 2.x). I haven't tried 5.x yet, and there are some changes to suggesters there.
The problem seems to be your index analyzer, which is indexing your documents so that:
"text" : "Nine West",
"text" : "Nine West ",
"text" : "NINE WEST",
"text" : "NINE WEST ",
aren't exactly the same. You need to index them with an analyzer that lowercases tokens and strips extra whitespace.
Once you do that, you should get the de-duplicated suggestions you want.
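For illustration, a normalizing analyzer along these lines might do it. This is only an untested sketch: it assumes the IK plugin also exposes an ik_max_word tokenizer (so your Chinese segmentation is kept), and the index, type, and analyzer names are made up:

PUT my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "sug_normalized": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["lowercase", "trim"]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "my_name_for_sug": {
                    "type": "completion",
                    "analyzer": "sug_normalized",
                    "search_analyzer": "sug_normalized",
                    "payloads": true,
                    "preserve_separators": false
                }
            }
        }
    }
}

With this, "Nine West", "Nine West " and "NINE WEST" should all analyze to the same tokens (nine, west), giving the suggester identical entries to collapse.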

Related

prefix autocomplete suggestion elasticsearch

I am trying to implement a prefix auto-complete feature using Elasticsearch. Here is my mapping for the suggest field:
PUT vdpinfo
{
    "mappings": {
        "details" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion"
                },
                "title": {
                    "type": "keyword"
                }
            }
        }
    }
}
And I indexed some data with both single words and two-word phrases (bigrams), such as:
{"suggest": "leather"}
And also:
{"suggest": "leather seats"}
{"suggest": "2 leather"}
And my search query is like this:
GET /vdpinfo/details/_search
{
    "suggest": {
        "feature-suggest": {
            "prefix": "leather",
            "completion": {
                "field": "suggest"
            }
        }
    }
}
But the result returns both {"suggest": "leather"} and {"suggest": "2 leather"}, and more importantly, {"suggest": "2 leather"} is ranked higher than leather.
My question is: why does 2 leather get returned at all? Why doesn't the suggester just do prefix autocomplete on the query prefix, leather?
This is because the default analyzer used to analyze your data is the simple analyzer, which breaks text into terms whenever it encounters a character that is not a letter. So 2 leather is actually indexed as leather, which is why that result shows up (and also why it is ranked first).
The reason the simple analyzer is used by default instead of the standard one is to avoid providing suggestions based on stop words.
So if you use the standard analyzer instead, you won't get any suggestion for 2 leather:
PUT vdpinfo
{
    "mappings": {
        "details" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion",
                    "analyzer" : "standard"
                },
                "title": {
                    "type": "keyword"
                }
            }
        }
    }
}
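You can see the difference directly with the _analyze API (on 5.x it accepts a JSON body like this):

GET _analyze
{
    "analyzer": "simple",
    "text": "2 leather"
}

The simple analyzer returns the single token leather, so the completion entry effectively starts with the prefix leather. Running the same request with "analyzer": "standard" returns the tokens 2 and leather, so 2 leather no longer matches the prefix leather.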

Exact phrase match in ElasticSearch

I'm trying to achieve an exact phrase search in Elasticsearch using my existing (full-text) index. When a user searches for, say, "Sanity Testing", the result should bring back all the docs containing "Sanity Testing" (case-insensitive), but not "Sanity tested".
My mapping:
{
    "doc": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "file": {
                        "type": "string",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "o3analyzer",
                        "store": true
                    },
                    "title" : { "store" : "yes" },
                    "date" : { "store" : "yes" },
                    "keywords" : { "store" : "yes" },
                    "content_type" : { "store" : "yes" },
                    "content_length" : { "store" : "yes" },
                    "language" : { "store" : "yes" }
                }
            }
        }
    }
}
As I understand it, there's a way to add another index (or sub-field) with a "raw" analyzer, but I'm not sure that will work, since the search still needs to be case-insensitive. I also don't want to rebuild the indexes: there are hundreds of machines with tons of documents already indexed, so it could take ages.
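For reference, what I have in mind is something like the following: a sub-field analyzed without stemming (an untested sketch; the names no_stem and file.exact are made up), so a phrase query on it would match the exact words case-insensitively:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "no_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}

with the file field given an extra sub-field "exact" using that analyzer, and the phrase query run against file.exact instead of file.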
Is there a way to run such a query? I'm now trying to search using the following query:
{
    "query": {
        "match_phrase": {
            "file": "Sanity Testing"
        }
    }
}
and it brings me both "Sanity Testing" and "Sanity Tested".
Any help appreciated!

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all these texts.
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors, but they appear to apply to single documents only. So I feel the solution will be some combination of term vectors and aggregations with n-gram analysis of sorts, but I have no idea how to go about implementing it. Any pointers would be helpful.
My current mapping and settings:
{
    "mappings": {
        "items": {
            "properties": {
                "body": {
                    "type": "string",
                    "term_vector": "with_positions_offsets_payloads",
                    "store" : true,
                    "analyzer" : "fulltext_analyzer"
                }
            }
        }
    },
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "type_as_payload"
                    ]
                }
            }
        }
    }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer that produces solely 3-term shingles. Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
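Put together, the analysis settings would look roughly like this (a sketch following the blog post's layout; the analyzer name analyzer_shingle is the blog's choice, not a requirement):

{
    "settings": {
        "analysis": {
            "filter": {
                "filter_shingle": {
                    "type": "shingle",
                    "max_shingle_size": 3,
                    "min_shingle_size": 3,
                    "output_unigrams": "false"
                }
            },
            "analyzer": {
                "analyzer_shingle": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "filter_shingle"]
                }
            }
        }
    }
}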
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
    "size" : 0,
    "query" : {
        "match_all" : {}
    },
    "aggs" : {
        "three-word-phrases" : {
            "terms" : {
                "field" : "body",
                "size" : 100
            }
        }
    }
}

Elasticsearch Completion in middle of the sentence

Is it possible to perform completion in Elasticsearch and get results even if the search text comes from the middle of the indexed input?
For instance:
"TitleSuggest" : {
"type" : "completion",
"index_analyzer" : "simple",
"search_analyzer" : "simple",
"payloads" : true,
"preserve_position_increments" : false,
"preserve_separators" : false
}
That's my current mapping, and my query is:
{
    "passport": {
        "text": "Industry Overview",
        "completion": {
            "field": "TitleSuggest",
            "fuzzy": {
                "edit_distance": 2
            }
        }
    }
}
But nothing is returned, even though I have documents that contain Industry Overview in their input. For instance, if I look only for Industry:
{
    "text" : "Industry",
    "offset" : 0,
    "length" : 8,
    "options" : [{
        "text" : "Airline Industry Sees Recovery in 2014",
        "score" : 16
    }, {
        "text" : "Alcoholic Drinks Industry Overview",
        "score" : 16
    }, {
        "text" : "Challenges in the Pet Care Industry For 2014",
        "score" : 16
    }]
}
I can achieve this with nGrams, but I'd like to get it done using completion suggesters.
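(The nGram approach I mean would be something like the following untested sketch, with made-up names: an edge_ngram token filter on the title field, queried with a normal match query rather than a suggester.)

{
    "settings": {
        "analysis": {
            "filter": {
                "title_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "partial_title": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "title_edge_ngram"]
                }
            }
        }
    }
}

Because each word is expanded into its prefixes at index time, a match query for "industry overview" would find titles containing those words anywhere, not just at the start.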
So my initial goal would be getting this back if I type in Industry Overview:
{
    "text" : "Industry Overview",
    "offset" : 0,
    "length" : 17,
    "options" : [{
        "text" : "Alcoholic Drinks Industry Overview",
        "score" : 16
    }]
}
I've tried using a shingle analyzer; that didn't solve the problem, and I didn't turn up anything useful on Google.
ES version: 1.5.1

How to find most used phrases in elasticsearch?

I know that you can find the most-used terms in an index using facets.
For example, on the following inputs:
"A B C"
"AA BB CC"
"A AA B BB"
"AA B"
a terms facet returns this:
B:3
AA:3
A:2
BB:2
CC:1
C:1
But I'm wondering: is it possible to list the following?
AA B:2
A B:1
BB CC:1
....etc...
Is there such a feature in ElasticSearch?
As mentioned in ramseykhalaf's comment, a shingle filter would produce tokens of length "n" words.
"settings" : {
"analysis" : {
"filter" : {
"shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"filter_stop":{
"type":"stop",
"enable_position_increments":"false"
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["standard," "lowercase", "shingle", "filter_stop"]
}
}
}
},
"mappings" : {
"type" : {
"properties" : {
"letters" : {
"type" : "string",
"analyzer" : "shingle_analyzer"
}
}
}
}
See this blog post for full details.
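With the shingled field in place, the counts themselves can then come from a terms facet, or a terms aggregation on more recent versions. A rough, untested sketch of the aggregation form:

{
    "size": 0,
    "aggs": {
        "phrases": {
            "terms": {
                "field": "letters",
                "size": 10
            }
        }
    }
}

Note that single terms will show up alongside multi-word shingles like "AA B" because output_unigrams is "true"; set it to "false" to count only the phrases.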
I'm not sure Elasticsearch will let you do this natively the way you want. But you might be interested in checking out Carrot2 - http://search.carrot2.org - to accomplish what you want (and probably more).
