How to recognize a brand in a text with Elasticsearch

I have been stuck for 2 days on this and I am sure it can be done with Elasticsearch. Any help would be really appreciated!
I receive products from various sources and I want to integrate them to my current inventory.
Products reach me in the form of text. They generally have a brand and a name:
1000 Stories Zinfandel Bourbon Barrel Aged
1000 Stories Gold Rush Red Blend Bourbon Barrel
1000 Stories Cabernet Bourbon Barrel Aged
^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   Brand                Product Name
But either part can be missing, or they can be mixed together. That's why I want to recognize what is there and what is missing.
I generally already know the brands and products from my inventory. How can I get Elasticsearch to tell me which is what?
Ideally, I would get something like:
<brand>1000 Stories</brand> <name>Zinfandel Bourbon Barrel Aged</name>
<brand>1000 Stories</brand> <name>Gold Rush Red Blend Bourbon Barrel</name>
<brand>1000 Stories</brand> <name>Cabernet Bourbon Barrel Aged</name>
Though just recognizing the brand would already be a big step.
I was hoping to make it work with the phrase suggester, because it already includes the matches in its results. I have tried mapping my brands and names with all of these analyzers:
settings: {
  analysis: {
    analyzer: {
      trigram: {
        type: "custom",
        tokenizer: "standard",
        filter: ["lowercase", "shingle"]
      },
      reverse: {
        type: "custom",
        tokenizer: "standard",
        filter: ["lowercase", "reverse"]
      },
      raw_analyzer: {
        tokenizer: "keyword",
        filter: ["lowercase", "asciifolding"]
      }
    },
    filter: {
      shingle: {
        type: "shingle",
        min_shingle_size: 2,
        max_shingle_size: 3
      }
    }
  }
},
mappings: {
  product: {
    properties: {
      brand: {
        type: "text",
        fields: {
          trigram: {
            type: "text",
            analyzer: "trigram"
          },
          reverse: {
            type: "text",
            analyzer: "reverse"
          },
          raw: {
            type: "text",
            analyzer: "raw_analyzer"
          }
        }
      }
      // name is mapped the same way (hence name.trigram, name.reverse and name.raw below)
    }
  }
}
And I searched them with each analyzer:
phrase_name_suggestion: {
  phrase: {
    field: 'name.trigram',
    max_errors: 0.99,
    size: 5,
    gram_size: 3,
    direct_generator: [
      {
        field: "name.trigram",
        suggest_mode: "always"
      },
      {
        field: "name.reverse",
        suggest_mode: "always",
        pre_filter: "reverse",
        post_filter: "reverse"
      }
    ],
    highlight: {
      pre_tag: "<name>",
      post_tag: "</name>"
    }
  }
},
phrase_name_raw_suggestion: {
  phrase: {
    field: 'name.raw',
    max_errors: 0.99,
    size: 5,
    gram_size: 3,
    highlight: {
      pre_tag: "<name>",
      post_tag: "</name>"
    }
  }
}
I am only getting random suggestions on the wrong terms, or no results at all. For example, a brand term gets suggested as a correction for a name term, instead of the brand simply being recognized.
Note, in case it narrows the options: names are manually input, so I get all varieties of text: a missing name, typos ("Sinfandel" for "Zinfandel"), abbreviations ("cab sauv" for "cabernet sauvignon"), and so on. That's a separate issue, but if it can be included in this solution I'll happily take it.
I am running Elasticsearch 6.4.2. I can work with a more recent version if needed.
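For illustration, a minimal sketch of the simpler sub-goal, recognizing a known brand: query the indexed brands with the whole incoming line and take the best hit. The index name products is an assumption, brand is the text field mapped above, and fuzziness is only there to tolerate the typos mentioned:
GET /products/_search
{
  "size": 1,
  "query": {
    "match": {
      "brand": {
        "query": "1000 Stories Zinfandel Bourbon Barrel Aged",
        "fuzziness": "AUTO"
      }
    }
  }
}
The brand value of the top-scoring hit identifies the brand; whatever input tokens it does not account for are candidates for the product name.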

Related

Elasticsearch structure in the correct and effective way for search engine

I'm building a search engine for my audio store.
I use a single index for the audio documents, and here is the structure:
{
  id: { type: 'integer' },
  title: { type: 'search_as_you_type' },
  description: { type: 'text' },
  createdAt: { type: 'date' },
  updatedAt: { type: 'date' },
  datePublished: { type: 'date' },
  duration: { type: 'float' },
  categories: {
    type: 'nested',
    properties: {
      id: { type: 'integer' },
      name: { type: 'text' }
    }
  }
}
Searching the audio documents by text, ordered by date published, is simple.
But I want to make it more powerful: a text search ordered by trending, based on listen times and purchase histories within a specific range, e.g. a text search for trending audios over the last 3 months or the last 30 days. So I tweaked the structure as below:
{
  ...previousProperties,
  listenTimes: {
    type: 'nested',
    properties: {
      timestamp: { type: 'date' },
      progress: { type: 'float' } // value 0-1.
    }
  },
  purchaseHistories: {
    type: 'nested',
    properties: {
      timestamp: { type: 'date' }
    }
  }
}
And here is my query to get trending audios for the last 3 months and it worked:
{
  bool: {
    should: [
      {
        nested: {
          path: 'listenTimes',
          query: {
            function_score: {
              query: {
                range: {
                  'listenTimes.timestamp': {
                    gte: $range,
                  },
                },
              },
              functions: [
                {
                  field_value_factor: {
                    field: 'listenTimes.progress',
                    missing: 0,
                  },
                },
              ],
              boost_mode: 'replace',
            },
          },
          score_mode: 'sum',
        },
      },
      {
        nested: {
          path: 'purchaseHistories',
          query: {
            function_score: {
              query: {
                range: {
                  'purchaseHistories.timestamp': {
                    gte: 'now+1d-3M/d',
                  },
                },
              },
              boost: 1.5,
            },
          },
          score_mode: 'sum',
        },
      },
    ],
  },
}
I have some uncertainty about my approach, such as:
The number of listen-time and purchase-history records for each audio is quite big; is it effective to structure the data like this? I have only tested with sample data, and it seems to work fine.
Will Elasticsearch re-index the whole document every time I push new listen-time and purchase-history records into the audio docs?
I'm very new to Elasticsearch, so could someone please give me some advice on this case? Thank you so much!
The first question is a good one; it depends on how you implement it. You will have to watch out for atomicity, since (I'm guessing) you're planning to fetch the number of listen times and then save the incremented value. If you're doing this from one application in one thread and it manages to keep up, then you're fine, but you won't be able to scale. I would say that Elasticsearch is not really made for this kind of transaction. The first idea that popped into my head is saving the numbers into an SQL database and updating Elasticsearch on a schedule. I suppose those results don't have to be updated in real time?
As for the second question, I'll just quote the Elasticsearch documentation: "The document must still be reindexed, but using update removes some network roundtrips and reduces chances of version conflicts between the GET and the index operation." You can find more at this link.
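For the record, a minimal sketch of appending such a record through the _update API (the index name audios and document id 1 are assumptions; the document is still reindexed internally, but only the new record travels over the wire):
POST /audios/_update/1
{
  "script": {
    "source": "ctx._source.listenTimes.add(params.record)",
    "params": {
      "record": { "timestamp": "2021-06-01T00:00:00Z", "progress": 0.8 }
    }
  }
}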

Partial search query in kuromoji

I have an issue when trying to do partial search using the kuromoji plugin.
When I index a full sentence, like ホワイトソックス, with an analyzer like:
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "filter": ["lowercase"],
  "text": "ホワイトソックス"
}
then the word is properly split into ホワイト and ソックス as it should be, I can search for both words separately, and that's correct.
But when the user hasn't typed the full sentence yet and the last letter is missing (ホワイトソック), every kuromoji analyzer treats it as one word.
Because of that, the result is empty.
My question is: is there something I can do about it, either by indexing or by searching this query in a different fashion? I'm sure partial search in Japanese is possible, but I can't find the right settings.
Example index settings:
{
  analyzer: {
    ngram_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['lowercase', 'cjk_width', 'ngram_filter'],
    },
    search_analyzer: {
      tokenizer: 'search_tokenizer',
      filter: ['asciifolding'],
    }
  },
  filter: {
    ngram_filter: {
      type: 'edge_ngram',
      min_gram: '1',
      max_gram: '20',
      preserve_original: true,
      token_chars: ['letter', 'digit']
    }
  },
  tokenizer: {
    search_tokenizer: {
      type: 'kuromoji_tokenizer',
      mode: 'search'
    }
  }
}
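For these analyzers to take effect at the right time, the field mapping has to name the ngram analyzer for indexing and the plain one for searching; a minimal sketch, assuming a title field:
{
  mappings: {
    properties: {
      title: {
        type: 'text',
        analyzer: 'ngram_analyzer',
        search_analyzer: 'search_analyzer'
      }
    }
  }
}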
Search query:
query_string: {
  fields: [
    "..."
  ],
  query: "ホワイトソック",
  fuzziness: "0",
  default_operator: "AND",
  analyzer: "search_analyzer"
}
Any help appreciated!

Elastic search query using match_phrase_prefix and fuzziness at the same time?

I am new to elastic search, so I am struggling a bit to find the optimal query for our data.
Imagine I want to match the phrase "Handelsstandens Boldklub".
Currently, I'm using the following query:
{
  query: {
    bool: {
      should: [
        {
          match: {
            name: {
              query: query,
              slop: 5,
              type: "phrase_prefix"
            }
          }
        },
        {
          match: {
            name: {
              query: query,
              fuzziness: "AUTO",
              operator: "and"
            }
          }
        }
      ]
    }
  }
}
It currently lists the word if I search for "Hand", but if I search for "Handle" the word is no longer listed, because of the typo. However, if I get to the end with "Handlesstandens", it is listed again, as the fuzziness catches the typo, but only once I have typed the whole word.
Is it somehow possible to use phrase_prefix and fuzziness at the same time? So in the above case, if I make a typo along the way, the word will still be listed?
So in this case, if I search for "Handle", it will still match "Handelsstandens Boldklub".
Or what other workarounds are there to achieve the above experience? I like the phrase_prefix matching as it also supports sloppy matching (hence I can search for "Boldklub han" and it will list the result).
Or can the above be achieved by using the completion suggester?
Okay, so after investigating Elasticsearch even further, I came to the conclusion that I should use ngrams.
Here is a really good explanation of what they are and how they work:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
Here is the settings and mapping I used: (This is elasticsearch-rails syntax)
settings analysis: {
  filter: {
    ngram_filter: {
      type: "ngram",
      min_gram: "2",
      max_gram: "20"
    }
  },
  analyzer: {
    ngram_analyzer: {
      type: "custom",
      tokenizer: "standard",
      filter: ["lowercase", "ngram_filter"]
    }
  }
} do
  mappings do
    indexes :name, type: "string", analyzer: "ngram_analyzer"
    indexes :country_id, type: "integer"
  end
end
And the query: (This query actually search in two different indexes at the same time)
{
  query: {
    bool: {
      should: [
        {
          bool: {
            must: [
              { match: { "club.country_id": country.id } },
              { match: { name: query } }
            ]
          }
        },
        {
          bool: {
            must: [
              { match: { country_id: country.id } },
              { match: { name: query } }
            ]
          }
        }
      ],
      minimum_should_match: 1
    }
  }
}
But basically you should just do a match or multi_match query, depending on how many fields you want to search in.
I hope someone finds this helpful. I was personally thinking too much in terms of fuzziness instead of ngrams (which I didn't know about before), and that led me in the wrong direction.
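For illustration, a minimal sketch of such a plain match query against the ngram-analyzed field; since name is indexed with ngram_analyzer, even the partial, typo-containing "Handle" shares enough grams with "Handelsstandens" to match:
{
  query: {
    match: {
      name: "Handle"
    }
  }
}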

Implementing Tags with ElasticSearch (v1.5.2)

Currently I am working with Elasticsearch for indexing and querying several million documents. Now I want to incorporate tags into these documents as well, and I am not quite sure what the best way is to meet my requirements, which are:
I want to query ES for the most used tags. Paginating and filtering should be possible as well.
Is this even possible? I've tried using aggregations before and it kind of worked, but I was not able to paginate or filter the results.
{
  size: 0,
  aggs: {
    group_by_tags: {
      terms: {
        field: "tags"
      }
    }
  }
}
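For what it's worth, the terms aggregation does accept a size and include/exclude patterns for filtering, though not true pagination; a sketch under that assumption (the summer.* pattern is a made-up example):
{
  size: 0,
  aggs: {
    group_by_tags: {
      terms: {
        field: "tags",
        size: 100,
        include: "summer.*"
      }
    }
  }
}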
So I thought using nested objects would be the way to go, and I've changed the mapping, which now looks like this:
mappings: {
  shop_outfits: {
    _all: {
      enabled: false
    },
    properties: {
      id: {
        type: "string",
        index: "not_analyzed"
      },
      userId: {
        type: "string",
        index: "not_analyzed"
      },
      title: {
        type: "string"
      },
      description: {
        type: "string"
      },
      tags: {
        type: "nested",
        properties: {
          tag: {
            type: "string",
            index: "not_analyzed"
          }
        }
      },
      articles: {
        type: "string",
        index: "not_analyzed"
      },
      uniqueId: {
        type: "string",
        index: "not_analyzed"
      },
      createdAt: {
        type: "date",
        format: "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
Is it possible to return all tags ordered by usage in a list that can be paginated? Or is this simply not possible with nested objects?
I am glad for any hint in the right direction!
EDIT:
Or is using a parent/child relationship the right way?

ElasticSearch "H & R Block" with partial word search

The requirements are to be able to search for the following terms:
"H & R" to find "H & R Block".
I have managed to implement this requirement alone using word_delimiter, as mentioned in this answer: elasticsearch tokenize "H&R Blocks" as "H", "R", "H&R", "Blocks"
Using Ruby code:
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "whitespace",
      filter: %w[lowercase asciifolding my_splitter]
    }
  }
}
But also, in the same query, we want autocomplete functionality or partial word matching, so
"Ser", "Serv", "Servi", "Servic" and "Service" all find "Service" and "Services".
I have managed to implement this requirement alone, using ngram.
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      tokenizer: "my_ngram",
      filter: %w[lowercase asciifolding]
    }
  },
  tokenizer: {
    my_ngram: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
I just can't manage to implement them together. When I use ngram, short words are ignored, so "H & R" is left out. When I use word_delimiter, partial word searches stop working. Below is my latest attempt at merging both requirements; it results in supporting partial word searches, but not "H & R".
{
  char_filter: {
    strip_punctuation: { type: "mapping", mappings: [".=>", ",=>", "!=>", "?=>"] }
  },
  filter: {
    my_splitter: {
      type: "word_delimiter",
      preserve_original: true
    }
  },
  analyzer: {
    my_analyzer: {
      char_filter: %w[strip_punctuation],
      type: "custom",
      tokenizer: "my_tokenizer",
      filter: %w[lowercase asciifolding my_splitter]
    }
  },
  tokenizer: {
    my_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "10",
      token_chars: %w[letter digit]
    }
  }
}
You can use multi_field in your mapping to index the same field in multiple ways. You can use your full-text search with the custom tokenizer on the default field, and create a special indexing for your autocompletion needs.
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Your query will need to be slightly different when performing the autocomplete as the field will be title.raw instead of just title.
Once the field is indexed in all the ways that make sense for your query, you can query the index using a boolean "should" query, combining the tokenized version with the word-start query. It is likely that a larger boost should be given to the first query, matching complete words, to get the direct hits on top.
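For illustration, a minimal sketch of such a bool query, assuming the word_delimiter analyzer on title and a hypothetical ngram-analyzed subfield title.partial:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "h & r", "boost": 2 } } },
        { "match": { "title.partial": "h & r" } }
      ]
    }
  }
}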
