Real-word spell-checker with Elasticsearch

I'm already familiar with Elasticsearch's spell-checking and I can build a simple spell-checker using the suggest API. The problem is a class of misspellings called "real-word" misspells. A real-word misspell happens when a spelling mistake produces another word that is present in the indexed data, so a lexical spell-checker fails to correct it because, lexically, the word IS correct.
For instance, consider the query "How to bell my laptop?". By "bell" the user meant "sell", but "bell" is present in the indexed vocabulary, so the spell-checker leaves it as is.
The idea for finding and correcting real-word misspells is to use the frequency of n-grams in the indexed data: if the frequency of the current n-gram is very low while a very similar n-gram has a high frequency in the indexed data, chances are we have a real-word misspell.
I wonder if there is a way to implement such a spell-checker using the Elasticsearch API?
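The n-gram frequency idea can be sketched outside Elasticsearch in a few lines of plain Python. The toy corpus, the edit-distance-1 candidate generator, and the `min_gain` threshold below are all illustrative assumptions, not part of any ES API:

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

# Toy stand-in for "indexed data"; in practice these counts come from the index.
corpus = [
    "how to sell my laptop",
    "how to sell my car",
    "where to sell my phone",
    "the bell rang twice",
]
freq = Counter(bg for line in corpus for bg in bigrams(line.split()))
vocab = {w for line in corpus for w in line.split()}

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(query, min_gain=2):
    """Flag a word as a real-word misspell if swapping it for a nearby
    vocabulary word makes the surrounding bigrams much more frequent."""
    out = query.split()
    for i, word in enumerate(list(out)):
        score = lambda toks: sum(freq[bg] for bg in bigrams(toks))
        best, best_score = word, score(out)
        for cand in edits1(word) & vocab:
            trial = out[:i] + [cand] + out[i + 1:]
            if score(trial) >= best_score + min_gain:
                best, best_score = cand, score(trial)
        out[i] = best
    return " ".join(out)
```

With this corpus, `suggest("how to bell my laptop")` replaces "bell" with "sell" because the bigrams "to sell" and "sell my" are far more frequent than "to bell" and "bell my", while "the bell rang twice" is left untouched.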

After searching for a while, I found out that such a thing can be implemented using the phrase suggester.
POST v2_201911/_search
{
  "suggest": {
    "text": "how to bell my laptop",
    "simple_phrase": {
      "phrase": {
        "field": "content",
        "gram_size": 2,
        "real_word_error_likelihood": 0.95,
        "direct_generator": [
          {
            "field": "content",
            "suggest_mode": "always",
            "prefix_length": 0,
            "min_word_length": 1
          }
        ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
According to the documentation:
real_word_error_likelihood:
The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.

Related

Ingesting / enriching / transforming data in one elasticsearch index with dynamic information from a second one

I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.
Imagine I have one index with one field I want to analyze (field_of_interest):
POST test/_doc/1
{
  "field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
  "field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
  "field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
  "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}
and a second one (scores) with pairs of keywords and weights:
POST scores/_doc/1
{
  "term": "quick",
  "weight": 1
}
POST scores/_doc/2
{
  "term": "brown",
  "weight": 2
}
POST scores/_doc/3
{
  "term": "lazy",
  "weight": 3
}
POST scores/_doc/4
{
  "term": "green",
  "weight": 4
}
I would like to define and perform some kind of analysis, ingestion, transform, enrichment, or re-indexing operation that dynamically adds a new field points to the first index, holding the aggregation (sum) of the weighted numbers of occurrences, in field_of_interest, of each of the search terms from the second index. After performing this operation, I would want a new index to look something like this (some fields omitted):
{
  "_id": "1",
  "_source": {
    "field_of_interest": "The quick brown fox jumps over the lazy dog.",
    "points": 6
  }
},
{
  "_id": "2",
  "_source": {
    "field_of_interest": "The quick and the dead.",
    "points": 1
  }
},
{
  "_id": "3",
  "_source": {
    "field_of_interest": "The lazy quack was quick to quip.",
    "points": 4
  }
},
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "points": 9
  }
}
If possible, it may even be interesting to get individual fields for each of the terms, listing the weighted sum of the occurrences, e.g.
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "quick": 3,
    "brown": 0,
    "lazy": 6,
    "green": 0,
    "points": 9
  }
}
The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.
I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.
I first thought of an _ingest pipeline with an _enrich policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.
I also looked at _transform, _update_by_query, custom scoring, _term_vector but to be honest, I am a bit lost.
I would appreciate any pointers on whether what I want to do can be done with Elasticsearch at all (I assumed it would kind of be the perfect tool) and, if so, which of the many different Elasticsearch concepts would be most suitable for my use case.
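As a sanity check on the desired points values above, the weighted sum itself can be pinned down in a few lines of plain Python. The naive lowercase/letters-only tokenization here is an assumption; a real Elasticsearch analyzer would behave differently for edge cases:

```python
import re

def weighted_points(text, scores):
    """Sum weight * occurrences for every scored term in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    per_term = {term: tokens.count(term) * weight for term, weight in scores.items()}
    return per_term, sum(per_term.values())

# Term weights from the scores index above.
scores = {"quick": 1, "brown": 2, "lazy": 3, "green": 4}

per_term, points = weighted_points(
    "Quick, quick, quick, you lazy, lazy guys! ", scores
)
# per_term == {"quick": 3, "brown": 0, "lazy": 6, "green": 0}, points == 9
```

Running this over all four example documents reproduces the expected points values 6, 1, 4, and 9.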
Follow this sequence of steps:
1. Scroll (_scroll) over every document in the second index.
2. Search for its term in the first index (a simple match query).
3. Increment points with a scripted update operation on every matching document.
Having individual words as fields in the first index is not a good idea. We do not know which words will be found inside the sentences, so your index mapping would explode with a lot of dynamic fields, which is not desirable. A better way is to add a nested field to the first index, with the following mapping:
{
  "words": {
    "type": "nested",
    "properties": {
      "name": { "type": "keyword" },
      "weight": { "type": "float" }
    }
  }
}
Then you simply append to this array for every word that is found. "points" can be a separate field.
What you want to do has to be done client side. There is no inbuilt way to handle such an operation.
HTH.

Elasticsearch - best query and index for partial and fuzzy search

I thought this scenario must be quite common, but I was unable to find the best way to do it.
I have a big dataset of products. All the products have this kind of schema:
{
"productID": 1,
"productName": "Whatever",
"productBoost": 1234
}
I have trouble combining a partial (query-string) query with a fuzzy query.
What I have is about 1.5M records in an index, listing the names of the products and a boost value, i.e. the popularity of the product (common products have a higher popularity, less popular ones a lower one).
For this I would like to use function score.
What I was trying to achieve is search-as-you-type, with function score and fuzziness.
I’m not sure if this is the best approach.
The query I'm currently using is this:
"query": {
  "function_score": {
    "query": {
      "match": {
        "productName": {
          "query": "word",
          "fuzziness": "AUTO",
          "operator": "AND"
        }
      }
    },
    "field_value_factor": {
      "field": "productBoost",
      "factor": 1,
      "modifier": "square"
    }
  }
}
This is working kinda OK, but the problem is that I want products like "Cabbage raw" to come up before "Cabernet red wine" when I search for the string "cab", because the boost is way higher on "Cabbage raw".
Another problem is when I search for the word "cabage" (a typo of "cabbage"): only one product comes back, while there are a lot of products containing "cabbage".
If query_string had fuzziness together with wildcards, that would be ideal for this solution, I think.
Also, this is a match query, so the partial part is not working either.
I tried using query_string with wildcards, but the downside is that I cannot use fuzziness with that kind of query.
I've also tried nGrams and edge nGrams, but I'm not sure how to implement them in this scenario, or how to combine the resulting search score with the existing boost I have.
The only thing I didn't try that might fix this issue is suggesters.
I couldn't make them work with function_score.
If anyone has any ideas on implementing this, it would be really helpful.
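One way to see why edge nGrams give search-as-you-type behavior: at index time each product name is expanded into its leading prefixes, so a literal prefix like "cab" becomes an exact term match, and the popularity boost can then be applied separately. A rough sketch in plain Python (not the actual analyzer config; the ranking here simply sorts matches by the squared boost, loosely imitating field_value_factor with modifier square):

```python
def edge_ngrams(word, min_gram=2, max_gram=10):
    """Index-time expansion of a word into its leading prefixes."""
    word = word.lower()
    return {word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)}

# Toy product catalog; productBoost plays the popularity role.
products = [
    {"productName": "Cabbage raw", "productBoost": 5000},
    {"productName": "Cabernet red wine", "productBoost": 100},
]

def search(prefix):
    """Match products where some token has this prefix among its edge
    n-grams, then order matches by squared boost (highest first)."""
    prefix = prefix.lower()
    hits = [
        p for p in products
        if any(prefix in edge_ngrams(tok) for tok in p["productName"].split())
    ]
    return sorted(hits, key=lambda p: p["productBoost"] ** 2, reverse=True)

# search("cab") matches both products but ranks "Cabbage raw" first,
# because its boost is higher.
```

In Elasticsearch terms this corresponds to an edge_ngram token filter on the index-time analyzer (with a plain analyzer at search time), wrapped in the same function_score query as above.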

Which is better simple_query_string or query_string?

What is the difference between simple_query_string and query_string in Elasticsearch?
Which is better for searching?
In the Elasticsearch simple_query_string documentation, it is written:
Unlike the regular query_string query, the simple_query_string query will never throw an exception and discards invalid parts of the query.
but that's not clear to me. Which one is better?
There is no simple answer. It depends :)
In general, query_string is intended for more advanced uses. It has more options but, as you quoted, it throws an exception when the sent query cannot be parsed as a whole. By contrast, simple_query_string has fewer options but does not throw an exception on invalid parts.
As an example, take a look at the two queries below:
GET _search
{
  "query": {
    "query_string": {
      "query": "hyperspace AND crops",
      "fields": ["description"]
    }
  }
}
GET _search
{
  "query": {
    "simple_query_string": {
      "query": "hyperspace + crops",
      "fields": ["description"]
    }
  }
}
Both are equivalent and return the same results from your index. But if you break the query and send:
GET _search
{
  "query": {
    "query_string": {
      "query": "hyperspace AND crops AND",
      "fields": ["description"]
    }
  }
}
GET _search
{
  "query": {
    "simple_query_string": {
      "query": "hyperspace + crops +",
      "fields": ["description"]
    }
  }
}
Then you will get results only from the second one (simple_query_string). The first one (query_string) will throw something like this:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Failed to parse query [hyperspace AND crops AND]",
        "index_uuid": "FWz0DXnmQhyW5SPU3yj2Tg",
        "index": "your_index_name"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      ...
    ]
  },
  "status": 400
}
I hope you now understand the difference between throwing and not throwing an exception.
Which is better? If you want to expose the search to plain end users, I would rather recommend simple_query_string. Thanks to that, the end user will get some result for every query, even if they made a mistake in it. query_string is recommended for more advanced users who are trained in the correct query syntax, so they will know why they get no results in a particular situation.
Adding to what @Piotr has mentioned:
What I understand is that when external users or consumers make use of the search solution, simple_query_string offers a better experience in terms of error handling and in limiting what kind of queries users can construct.
In other words, if the search solution is publicly available for any consumer, then simple_query_string makes sense; however, if I know who my end-users are and can guide them in what they are looking for, there is no reason why I cannot expose query_string to them.
Also, QueryStringQueryBuilder.java makes use of QueryStringQueryParser.java, while SimpleQueryStringBuilder.java makes use of SimpleQueryStringQueryParser.java, which makes me think there are certain limitations in the simpler parser, and that the creators didn't want many features to be managed by end-users, e.g. dis_max, which is available in query_string.
Perhaps the main purpose of simple_query_string is to limit end-users to simple querying, keeping all forms of complex querying and advanced features away from them, so that we have more control over our search engine (which I'm not really sure about, just a thought).
Plus, the potential to misuse query_string is greater: only advanced users are capable of constructing certain complex queries correctly, which may be a bit too much for simple users looking for a basic search solution.
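If you are stuck with query_string but want to avoid the parse exception for sloppy user input, one pragmatic option is to strip dangling operators client-side before sending the query. The sanitizer below is an illustrative sketch, not Elasticsearch's actual parser, and it deliberately handles only the simplest failure mode (trailing/leading boolean operators):

```python
import re

def sanitize_query_string(query):
    """Drop dangling AND/OR/NOT or +/-/| operators at either end of a
    query string, which would make query_string throw a parse exception."""
    query = query.strip()
    # Leading boolean operators: "AND foo" -> "foo"
    query = re.sub(r"^(?:(?:AND|OR|NOT)\s+)+", "", query)
    # Trailing boolean operators: "foo AND" -> "foo"
    query = re.sub(r"(?:\s+(?:AND|OR|NOT))+$", "", query)
    # Trailing/leading symbolic operators: "foo +" -> "foo"
    return query.strip("+-| \t")
```

For example, `sanitize_query_string("hyperspace AND crops AND")` returns "hyperspace AND crops", which query_string can parse. Note the last line would also strip a meaningful leading "+" or "-", so a production version needs more care.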

Query performance when applying the "Great mapping refactoring"

Our applications' entities are dynamic, we don't know how many properties they'll have or what their type will be.
Up until now, we've indexed our data in the following way:
{
  "message": "some string",
  "count": 1,
  "date": "2015-06-01"
}
After reading the following blog:
We've understood that it's better to index the data like this:
{
  "data": [
    { "key": "message", "str_val": "some_string" },
    { "key": "count", "int_val": 1 },
    { "key": "date", "date_val": "2015-06-01" }
  ]
}
We were wondering how the index would work in terms of nested aggregations.
Will the mapping refactoring above hurt indexing time (and/or query/aggregation time), given that every entity will now be nested one level deeper?
We have thousands of different object types, hence our mapping file is huge. That slows down the indexing time, so a mapping refactoring is highly necessary.
Are you aware of any disadvantages when it comes to refactoring our mapping as explained in the blog above?
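The document reshaping itself is straightforward to do client-side before indexing. A sketch of that transformation (the str_val/int_val/date_val field names follow the example above; the date-detection regex and the bool handling are naive assumptions):

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def to_generic(doc):
    """Rewrite a flat document into the key + typed-value shape, so one
    static mapping covers entities with arbitrary property sets."""
    data = []
    for key, value in doc.items():
        if isinstance(value, bool):           # bool first: bool is a subclass of int
            data.append({"key": key, "bool_val": value})
        elif isinstance(value, int):
            data.append({"key": key, "int_val": value})
        elif isinstance(value, str) and DATE_RE.match(value):
            data.append({"key": key, "date_val": value})
        else:
            data.append({"key": key, "str_val": value})
    return {"data": data}

doc = {"message": "some string", "count": 1, "date": "2015-06-01"}
# to_generic(doc)["data"] holds one typed entry per original field
```

The data field would typically be mapped as nested so that key and the typed value stay paired in queries and aggregations, which is exactly where the extra nesting level has its cost.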

How to get ElasticSearch to return scores independent of case?

I would like ElasticSearch to return result scores that are independent of case. As an example, suppose I query for the string "HOUSE" (or "house") and obtain the following results:
"House" => score: 0.6868894,
"House on the hill" => score: 0.52345484
"HOUSE" => score: 0.52200186
In an ideal world, both "House" and "HOUSE" would have a score of 1.0 and "House on the hill" a score of 0.5.
So far I've tried adding a custom analyser and am now looking at the omit_norms option. I'm also considering patterns since they have a CASE_INSENSITIVE flag. Unfortunately I'm finding the official documentation lacks examples and code snippets...
Can anyone provide code snippets/examples of a query including the parameters required to achieve scores independent of case? Extra recognition to anyone who can provide a solution using Tire for Rails.
MAPPING
mapping _source: {} do
  indexes :id, type: 'integer'
  indexes :value, :analyzer => 'string_lowercase'
end
** analyser is custom analyser mentioned above
QUERY
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "house"
        }
      }
    }
  },
  "fields": ["value"],
  "from": 0,
  "size": 50,
  "sort": {
    "_score": {
      "order": "desc"
    }
  },
  "explain": true
}
ElasticSearch 0.90.5;
Rails 4.0.0;
Tire (gem) 0.6.0
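To see why a lowercase filter in the custom analyser equalizes "House" and "HOUSE": both produce the same indexed terms, so they contribute identical term statistics to scoring. A minimal sketch of that analysis step in plain Python (not the Tire DSL; the letters-only tokenization is an assumption):

```python
import re

def analyze(text):
    """A toy 'string_lowercase' analyzer: split on non-letters, then
    lowercase every token, as a lowercase token filter would."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z]+", text)]

# "House" and "HOUSE" index to the same single term, so a query for
# "house" can no longer score them differently on term grounds.
```

Any remaining score difference between identically-analyzed documents then has to come from elsewhere, e.g. per-shard term statistics, as the answer below explains.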
It turns out the problem is caused by ES using multiple shards (5 by default) to score the documents, where each shard computes its scores using only the documents allocated to it. Since I'm using test data and my DB is practically empty, the scores were completely off. The answer is to use the dfs_query_then_fetch search type (at least while developing). Still searching for how to set this in Rails/Tire, or as the default in ES.
Cheers,
nic
