Stemming and highlighting for phrase search - elasticsearch

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase, and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field I only get results matching parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...

Figured it out!
Store two fields: one for the raw text content (text) and a sub-field with the English-stemmed tokens (text.english).
Create a custom analyzer based on the default English analyzer that doesn't strip stop words.
Highlight both fields and check each one when displaying results to the user.
Here's my index configuration:
{
mappings: {
documents: {
properties: {
title: { type: 'string' },
text: {
type: 'string',
term_vector: 'with_positions_offsets_payloads',
fields: {
english: {
type: 'string',
analyzer: 'english_nostop',
term_vector: 'with_positions_offsets_payloads',
store: true
}
}
}
}
}
},
settings: {
analysis: {
filter: {
english_stemmer: {
type: 'stemmer',
language: 'english'
},
english_possessive_stemmer: {
type: 'stemmer',
language: 'possessive_english'
}
},
analyzer: {
english_nostop: {
tokenizer: 'standard',
filter: [
'english_possessive_stemmer',
'lowercase',
'english_stemmer'
]
}
}
}
}
}
And here's what a query looks like:
{
query: {
query_string: {
query: <query>,
fields: ['text.english'],
analyzer: 'english_nostop'
}
},
highlight: {
fields: {
'text.english': {},
'text': {}
}
},
}
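To sanity-check the analyzer, the _analyze API is handy. This is just a sketch against a hypothetical index name (myindex); on older Elasticsearch versions the analyzer and text are passed as query-string parameters rather than a JSON body:
GET /myindex/_analyze
{
  "analyzer": "english_nostop",
  "text": "it is rarely"
}
With the english_nostop analyzer above, this should come back with tokens along the lines of it, is, rare, which is why "it is rare" and "it is rarely" now land on the same stemmed phrase while the stop words stay in place for phrase matching.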

Related

Partial search query in kuromoji

I have an issue when trying to do partial search using the kuromoji plugin.
When I index a full sentence like ホワイトソックス with an analyzer like:
{
"tokenizer": {
"type": "kuromoji_tokenizer",
"mode": "search"
},
"filter": ["lowercase"],
"text" : "ホワイトソックス"
}
then the word is properly split into ホワイト and ソックス, as it should be. I can search for both words separately, and that's correct.
But when the user hasn't typed the full sentence yet and the last letter is missing (ホワイトソック), any kuromoji analyzer treats it as one word.
Because of that, the result is empty.
My question is: is there something I can do about it, either by indexing or by searching this query in a different fashion? I'm sure Japanese partial search is possible, but I can't find the right settings.
Example index settings:
{
analyzer: {
ngram_analyzer: {
tokenizer: 'search_tokenizer',
filter: ['lowercase', 'cjk_width', 'ngram_filter'],
},
search_analyzer: {
tokenizer: 'search_tokenizer',
filter: ['asciifolding'],
}
},
filter: {
ngram_filter: {
type: 'edge_ngram',
min_gram: '1',
max_gram: '20',
preserve_original: true,
token_chars: ['letter', 'digit']
}
},
tokenizer: {
search_tokenizer: {
type: 'kuromoji_tokenizer',
mode: 'search'
}
}
}
Search query:
query_string: {
fields: [
"..."
],
query: "ホワイトソック",
fuzziness: "0",
default_operator: "AND",
analyzer: "search_analyzer"
}
Any help appreciated!
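For reference, the failing case described above can be reproduced with the _analyze API (this assumes the analysis-kuromoji plugin is installed and an ES version that accepts inline tokenizer definitions; the exact tokens depend on the kuromoji dictionary):
GET /_analyze
{
  "tokenizer": { "type": "kuromoji_tokenizer", "mode": "search" },
  "filter": ["lowercase"],
  "text": "ホワイトソック"
}
With a typical dictionary the partial input comes back as a single token, while the full ホワイトソックス splits into ホワイト and ソックス, which is why the query_string search above finds nothing.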

Elasticsearch edgeNGram analyzer/tokenizer fuzzy query matching

We have an Accounts table in which we search for similar records using a fuzzy query with an edgeNGram analyzer on multiple fields. Our setup:
Settings
{
settings: {
analysis: {
analyzer: {
edge_n_gram_analyzer: {
tokenizer: "whitespace",
filter: ["lowercase", "ednge_gram_filter"]
}
},
filter: {
edge_ngram_filter: {
type: "edgeNGram",
min_gram: 2,
max_gram: 10
}
}
}
}
}
Mapping
{
mappings: {
document_type: {
properties: {
uid: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
shop_name: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
seller_name: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
...
...
...
locale_id: {
type: "integer"
}
}
}
}
}
Query
{
body: {
query: {
bool: {
must: [
{
bool: {
should: [
{
fuzzy: {
uid: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
},
{
fuzzy: {
seller_name: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
},
{
fuzzy: {
shop_name: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
}
]
}
}
],
must_not: [
{
term: {
locale_id: {
value: 7
}
}
}
]
}
}
}
}
The above example finds different variations of the string 'antonline', such as "antonline", "sanjonline", "tanonline", "kotonline", "htonline", "awmonline". However, it doesn't match strings with punctuation, like antonline.com, or even antonlinecom without the dot. We tried different types of tokenizers but nothing helped.
How can we achieve the search results we expect?
I resolved the issue by removing everything that matches this regex:
[.,'\"\-+:~\^!?*\\]
Do the removal both while building the index and while searching.
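A sketch of that approach done inside Elasticsearch itself (the char_filter name strip_punctuation is made up here, and the JSON escaping of the regex may need adjusting for your client): a pattern_replace character filter attached to the same analyzer used at index and search time strips the punctuation on both sides.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_punctuation": {
          "type": "pattern_replace",
          "pattern": "[.,'\"\\-+:~\\^!?*\\\\]",
          "replacement": ""
        }
      },
      "analyzer": {
        "edge_n_gram_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": ["strip_punctuation"],
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      }
    }
  }
}
Since the fields declare only this analyzer and no separate search_analyzer, the same stripping is applied to queries as well.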

JS - client.bulk() results in unmapped documents

I have a template with a mapping like this:
"template" : "database*",
"mappings": {
"_default_" : {
"dynamic_templates" : [
{
// default string mapping
"string" : {
"match" : "*",
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fields": {
"raw" : {
"type": "keyword"
}
}
}
}
}
]
}
}
The idea behind this template is to generate, for every new string field, a "type": "keyword" sub-field for exact searches.
When adding documents (to an empty index using this template) with the JS API client.index(), everything works fine, and I am able to query like:
{
"query": {
"match": {
"fooBar": "bla"
}
}
}
Or for exact searches like:
{
"query": {
"match": {
"fooBar.raw": "bla"
}
}
}
But when adding documents (again to an empty index) in bulk with client.bulk() like this:
client.bulk({
body: [
// FIRST DOCUMENT
{ index: { _index: 'myindex', _type: 'mytype', _id: 1 } },
// the first document to index
{ fooBar: 'foo' },
// SECOND DOCUMENT
{ index: { _index: 'myindex', _type: 'mytype', _id: 2 } },
// the second document to index
{ fooBar: 'bar' },
(...)
]
}, function (err, resp) {
// ...
});
The same query on the "raw" field results in an error:
"No mapping found for [fooBar.raw] (...)"
while:
{
"query": {
"match": {
"fooBar": "bla"
}
}
}
delivers results. This brings me to the conclusion that the document is indeed indexed but the "raw" sub-field has not been created.
Is that right? Does bulk indexing in fact not apply the mapping?
How can I use bulk import and still have the documents mapped?
Thanks! :)
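One way to check whether the raw sub-field really is missing, rather than the query being at fault, is to look at the mapping the index actually ended up with, for example:
GET /myindex/_mapping
If fooBar appears there as a plain text field without the raw keyword sub-field, the dynamic template was not applied when the bulk request created the index.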

ElasticSearch match multiple fields with different values

I can actually perform a simple match search with this:
query: {match: {searchable: {query:search}}}
This works well; my searchable field is analyzed in my mapping.
Now I want to perform a search on multiple fields: one string field, and all the others numeric.
My mapping:
mappings dynamic: 'false' do
indexes :searchable, analyzer: "custom_index_analyzer", search_analyzer: "custom_search_analyzer"
indexes :year, type: "integer"
indexes :country_id, type: "integer"
indexes :region_id, type: "integer"
indexes :appellation_id, type: "integer"
indexes :category_id, type: "integer"
end
def as_indexed_json(options={})
as_json(
only: [:searchable, :year, :country_id, :region_id, :appellation_id, :category_id]
)
end
I have tried this:
query: {
filtered: {
query: {
match: {
searchable: search
}
},
filter: {
term: {
country_id: "1"
},
term: {
region_id: "2"
},
term: {
appellation_id: "34"
}
}
}
},
sort: {
_score: {
order: :desc
},
year: {
order: :desc,
ignore_unmapped: true
}
},
size:100
It works, but it gives me 100 results in all cases for the appellation_id sent (34), even if the searchable field is very far from the search text.
I have also tried a BOOL query:
self.search(query: {
bool: {
must: [
{
match: {
country_id: "1"
},
match: {
region_id: "2"
},
match: {
appellation_id: "34"
},
match: {
searchable: search
}
}
]
}
},
sort: {
_score: {
order: :desc
},
year: {
order: :desc,
ignore_unmapped: true
}
},
size:100
)
But it gives me all results matching the searchable field and doesn't take the wanted appellation_id into account.
My goal is to get the best results and performance: ask ES to give me all data with country_id=X, region_id=Y and appellation_id=Z, and then perform a match on this set of results with the searchable field, so I don't get results too far from reality for the searchable text.
Thanks.
As you may know, the Elasticsearch match query returns results based on a relevance score. You can try a term query instead of match for an exact term match. Also, I think your bool query structure should look like this:
bool: {
must: [
{ match: {
country_id: "1"
}},
{match: {
region_id: "2"
}},
{match: {
appellation_id: "34"
}},
{match: {
searchable: search
}}
]
}
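As a further sketch of the asker's stated goal (same field names as above, not a drop-in answer): on ES 2.x and later the numeric IDs can go into the bool query's filter context, where they restrict the result set without affecting the score, while the analyzed match on searchable stays in must and drives relevance. On older versions the filtered query already shown is the equivalent, with the three term clauses wrapped in a bool filter.
bool: {
  must: [
    { match: { searchable: search } }
  ],
  filter: [
    { term: { country_id: 1 } },
    { term: { region_id: 2 } },
    { term: { appellation_id: 34 } }
  ]
}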

Ignoring Apostrophes (Possessive) In ElasticSearch

I'm trying to get user-submitted queries for "Joe Frankles", "Joe Frankle", and "Joe Frankle's" to match the original text "Joe Frankle's". Right now we're indexing the field this text is in with (Tire / Ruby format):
{ :type => 'string', :analyzer => 'snowball' }
and searching with:
query { string downcased_query, :default_operator => 'AND' }
I tried this unsuccessfully:
create :settings => {
:analysis => {
:char_filter => {
:remove_accents => {
:type => "mapping",
:mappings => ["`=>", "'=>"]
}
},
:analyzer => {
:myanalyzer => {
:type => 'custom',
:tokenizer => 'standard',
:char_filter => ['remove_accents'],
:filter => ['standard', 'lowercase', 'stop', 'snowball', 'ngram']
}
},
:default => {
:type => 'myanalyzer'
}
}
},
There are two official ways of handling possessive apostrophes:
1) Use the "possessive_english" stemmer as described in the ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
Example:
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "possessive_english"
}
}
}
}
}
Use other stemmers or snowball in addition to the "possessive_english" filter if you like. It should work, but it's untested code.
2) Use the "word_delimiter" filter:
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_word_delimiter"]
}
},
"filter" : {
"my_word_delimiter" : {
"type" : "word_delimiter",
"preserve_original": "true"
}
}
}
}
}
Works for me :-) ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
Both will cut off "'s".
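To check what either analyzer actually emits, the _analyze API can be pointed at it (the index name my_index is made up here, and older ES versions take the parameters in the query string rather than a JSON body):
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Joe Frankle's"
}
With the possessive_english setup this should return the tokens joe and frankle, the same tokens produced for "Joe Frankle", which is why those queries line up.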
I ran into a similar problem; the snowball analyzer alone didn't work for me, and I don't know whether it's supposed to or not. Here's what I use:
properties: {
name: {
boost: 10,
type: 'multi_field',
fields: {
name: { type: 'string', index: 'analyzed', analyzer: 'title_analyzer' },
untouched: { type: 'string', index: 'not_analyzed' }
}
}
}
analysis: {
char_filter: {
remove_accents: {
type: "mapping",
mappings: ["`=>", "'=>"]
}
},
filter: {},
analyzer: {
title_analyzer: {
type: 'custom',
tokenizer: 'standard',
char_filter: ['remove_accents'],
}
}
}
The Admin indices analyze tool is also great when working with analyzers.
It looks like your query is searching the _all field, but your analyzer is applied only to the individual field. To enable this functionality for the _all field, simply make snowball your default analyzer.
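A minimal sketch of that last suggestion, assuming the index is created (or recreated) with these settings, since analysis settings generally cannot be changed on an open index:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "snowball"
        }
      }
    }
  }
}
Fields without an explicit analyzer, including _all, are then analyzed with snowball at both index and search time.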
