Ingesting / enriching / transforming data in one Elasticsearch index with dynamic information from a second one - elasticsearch

I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.
Imagine I have one index with one field I want to analyze (field_of_interest):
POST test/_doc/1
{
"field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
"field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
"field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}
and a second one (scores) with pairs of keywords and weights:
POST scores/_doc/1
{
"term": "quick",
"weight": 1
}
POST scores/_doc/2
{
"term": "brown",
"weight": 2
}
POST scores/_doc/3
{
"term": "lazy",
"weight": 3
}
POST scores/_doc/4
{
"term": "green",
"weight": 4
}
I would like to define and perform some kind of analysis, ingest pipeline, transform, enrichment, or re-indexing step that dynamically adds a new field, points, to the first index: the sum of the weighted numbers of occurrences of each of the search terms from the second index in the field_of_interest of the first index. After performing this operation, I would want a new index to look something like this (some fields omitted):
{
"_id":"1",
"_source":{
"field_of_interest": "The quick brown fox jumps over the lazy dog.",
"points": 6
}
},
{
"_id":"2",
"_source":{
"field_of_interest": "The quick and the dead.",
"points": 1
}
},
{
"_id":"3",
"_source":{
"field_of_interest": "The lazy quack was quick to quip.",
"points": 4
}
},
{
"_id":"4",
"_source":{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
"points": 9
}
}
If possible, it may even be interesting to get individual fields for each of the terms, listing the weighted sum of the occurrences, e.g.
{
"_id":"4",
"_source":{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
"quick": 3,
"brown": 0,
"lazy": 6,
"green": 0,
"points": 9
}
}
The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.
I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.
I first thought of an _ingest pipeline with an _enrich policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.
I also looked at _transform, _update_by_query, custom scoring, _term_vector but to be honest, I am a bit lost.
I would appreciate any pointers on whether what I want to do can be done with Elasticsearch (I assumed it would be kind of the perfect tool) and, if so, which of the many different Elasticsearch concepts would be most suitable for my use case.

Follow this sequence of steps:
Scroll (via the _scroll API) over every document in the second index.
Search for each term in the first index (simple match query).
Increment points with a scripted update on every matching document.
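The steps above might look like this as a single request per term, combining the match and the scripted increment (index and field names are taken from the question; the Painless script is only a sketch, and note that _update_by_query fires once per matching document, so this counts each term once per document; per-occurrence weighting as in the expected output would need _termvectors or client-side counting):

```json
POST test/_update_by_query
{
  "query": { "match": { "field_of_interest": "quick" } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.points = (ctx._source.points == null ? 0 : ctx._source.points) + params.weight",
    "params": { "weight": 1 }
  }
}
```

Running one such request for each document scrolled from the scores index implements the loop.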
Having individual words as fields in the first index is not a good idea. We do not know which words are going to be found inside the sentences, so your index mapping would explode with a lot of dynamic fields, which is not desirable. A better way is to add a nested field to the first index, with the following mapping:
{
"words" : {
"type" : "nested",
"properties" : {
"name" : {"type" : "keyword"},
"weight" : {"type" : "float"}
}
}
}
Then you simply append to this array for every word that is found. "points" can be a separate field.
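With that mapping in place, the append could be sketched like this (the query term and script params are illustrative, not from the question):

```json
POST test/_update_by_query
{
  "query": { "match": { "field_of_interest": "lazy" } },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.words == null) { ctx._source.words = []; } ctx._source.words.add(params.word);",
    "params": { "word": { "name": "lazy", "weight": 3 } }
  }
}
```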
What you want to do has to be done client side. There is no inbuilt way to handle such an operation.
HTH.

Related

Get most searched terms from indexes that are referenced by an alias

The search is pointing to an alias.
The indexing process creates a new index every 5 minutes.
Then the alias is updated, pointing to the new index.
The index is recreated to avoid sync problems that can occur if we update item by item when a change is made.
However, I need to keep track of the searched terms to produce a dashboard listing the most searched terms in a period, or even use Kibana to show/extract them.
The searched terms can be multi-word, such as "white", "white summer night", etc. We are looking to rank the whole term, not the individual words.
I don't have experience with Elasticsearch and the searches that I have tried did not bring relevant solutions.
Thanks for the help!
{
"actions" : [
{ "remove" : { "index" : "catalog*", "alias" : "catalog-index" } },
{ "add" : { "index" : "catalog1234566", "alias" : "catalog-index" } }
]
}
Mappings:
{
"mappings":{
"properties":{
"created_at":{
"type":"integer"
},
"search_terms_key":{
"type":"keyword"
}
}
}
}
Query:
{
"query":{
"match_all":{
}
},
"aggs":{
"search_terms_key":{
"terms":{
"field":"search_terms_key",
"value_type":"string"
}
}
}
}
Log the search terms (or entire queries, if necessary), ingest those into Elasticsearch, then analyze them with Kibana. The index alias configuration is not relevant.
You should get the logs either directly from whatever connects to Elasticsearch, or from a proxy between it and Elasticsearch.
You could get Elasticsearch itself to log queries, but that's usually a bad idea in terms of performance.
Since it's the entire term you're after, be sure to use a keyword mapping on the search term field.
Once you have search terms ingested, use a terms aggregation to show the most popular searches.
[edit: make explicit that search terms need to be logged, not full DSL queries]
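As a sketch of that flow, assuming a dedicated search-logs index (the index name and document shape are assumptions, not something from the question):

```json
POST search-logs/_doc
{
  "search_terms_key": "white summer night",
  "created_at": 1672531200
}

GET search-logs/_search
{
  "size": 0,
  "aggs": {
    "popular_searches": {
      "terms": { "field": "search_terms_key", "size": 10 }
    }
  }
}
```

The keyword mapping keeps "white summer night" as a single bucket instead of three per-word buckets.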

real-word spell-checker with elasticsearch

I'm already familiar with Elasticsearch's spell-checker and I can build a simple spell-checker using the suggest API. The thing is, there is a kind of misspelled word called a "real-word" misspell. A real-word misspell happens when a mistake in spelling a word produces another word that is present in the indexed data, so a lexical spell-checker misses it, because lexically the word IS correct.
For instance, consider the query "How to bell my laptop?". The user meant "sell" by "bell", but "bell" is present in the indexed vocabulary, so the spell-checker leaves it as is.
The idea for finding and correcting real-word spelling mistakes is to use the frequencies of n-grams in the indexed data. If the frequency of the current n-gram is very low while a very similar n-gram has a high frequency in the indexed data, chances are we have a real-word misspell.
I wonder if there is a way to implement such spell-checker using elasticsearch API?
After searching for a while, I found out that implementing such a thing is possible using the phrase_suggester.
POST v2_201911/_search
{
"suggest": {
"text": "how to bell my laptop",
"simple_phrase": {
"phrase": {
"field": "content",
"gram_size": 2,
"real_word_error_likelihood": 0.95,
"direct_generator": [
{
"field": "content",
"suggest_mode": "always",
"prefix_length": 0,
"min_word_length": 1
}
],
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}
According to the documentation:
real_word_error_likelihood:
The likelihood of a term being misspelled even if the term exists in
the dictionary. The default is 0.95, meaning 5% of the real words are
misspelled.

ElasticSearch: return some text surrounding a match on a full-text query?

I have some full text search query on an article index:
"query": {
"multi_match": {
"query": article,
"fields": [ "text" ],
"minimum_should_match": "75%"
}
}
I want to know if I can change it to return only part of the text rather than the entire matched text. For example, let's say I search for "brown fox". Instead of returning the entire article, I just want to return a few words surrounding any match of "brown fox", so that a result might be ".. is said that any brown fox could jump over fences..", disregarding newlines.
Is this possible in ES?
As @Adam-t mentioned, highlighting in ES is the key to this answer. For future reference, I have added the search query with which I was able to get the requested result. I'm posting this answer because I faced the same issue and it took me a while to find a proper solution.
{
"query":{
"match_phrase":{
"text":"investors"
}
},
"highlight":{
"fragment_size":100,
"fields":{
"text":{}
}
}
}
The above search query searches for the term "investors" in a large text field and returns a response like the one below:
"highlight" : {
"content" : [
"*stocks closed at a near three-week high on Wednesday, led by blue-chips, but foreign <em>investors</em>",
"The dollar currency ended weaker. ** Local <em>investors</em> picked up select shares, with one of the two presidential"
]
}
fragment_size controls the length (in characters) of each highlighted fragment; the default is 100.

Phrase suggester returns unexpected result when first letter is misspelled

I'm using the Elasticsearch phrase suggester to correct users' misspellings. Everything works as I expect unless the user enters a query whose first letter is misspelled. In that situation, the phrase suggester returns nothing or returns unexpected results.
My query for suggestion:
{
"suggest": {
"text": "user_query",
"simple_phrase": {
"phrase": {
"field": "title.phrase",
"collate": {
"query": {
"inline" : {
"bool": {
"should": [
{ "match": {"title": "{{suggestion}}"}},
{ "match": {"participants": "{{suggestion}}"}}
]
}
}
}
}
}
}
}
}
Example when first letter is misspelled:
"simple_phrase" : [
{
"text" : "گاشانچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "گارانتی",
"score" : 0.00253151
}]
}
]
Example when fifth letter is misspelled:
"simple_phrase" : [
{
"text" : "کاشاوچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "کاشانچی",
"score" : 0.1121
},
{
"text" : "کاشانجی",
"score" : 0.0021
},
{
"text" : "کاشنچی",
"score" : 0.0020
}]
}
]
I expected these two misspelled queries to have the same suggestions (the second example's suggestions are what I expect). What is wrong?
P.S: I'm using this feature for Persian language.
I have a solution for your problem; you only need to add some fields to your schema.
P.S: I don't have that much expertise in Elasticsearch, but I have solved the same problem using Solr, and you can implement it the same way in Elasticsearch too.
Create a new ngram field and copy all your title names into the ngram field.
When you fire a query for a misspelled word and get an empty result, split
the word and fire the same query again; you will get the results you expect.
Example: suppose the user is searching for the word Akshay but types it as Skshay. Then
create the query as shown below and you will hopefully get the expected results.
I am giving you a Solr example here; you can achieve the same
with Elasticsearch.
**(ngram:"skshay" OR ngram:"sk" OR ngram:"ks" OR ngram:"sh" OR ngram:"ha" OR ngram:"ay")**
We have split the word sequence-wise and fired the query on the ngram field.
Hope it will help you.
From the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html
prefix_length
The number of minimal prefix characters that must match in order to be a
candidate suggestion. Defaults to 1. Increasing this number improves
spellcheck performance. Usually misspellings don't occur at the
beginning of terms. (The old name "prefix_len" is deprecated.)
So by default the phrase suggester assumes that the first character is correct, because the default value of prefix_length is 1.
Note: setting this value to 0 is not a good idea, because it has performance implications.
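For reference, prefix_length is set per direct generator. A minimal sketch against the field from the question, with prefix_length set to 0 purely to illustrate where the parameter goes, given the performance caveat above:

```json
POST _search
{
  "suggest": {
    "text": "گاشانچی",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "direct_generator": [
          {
            "field": "title.phrase",
            "suggest_mode": "always",
            "prefix_length": 0
          }
        ]
      }
    }
  }
}
```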
You need to use the reverse analyzer
I explained it in this post so please go and check my answer
Elasticsearch spell check suggestions even if first letter missed
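The idea there, roughly, is to index the field a second time with a reversed analyzer, so that a wrong first letter becomes a wrong last letter and the default prefix matching still works. A minimal mapping sketch (index, analyzer, and subfield names are illustrative):

```json
PUT suggest-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "reverse": { "type": "text", "analyzer": "reverse_analyzer" }
        }
      }
    }
  }
}
```

A second direct_generator on title.reverse, with "pre_filter" and "post_filter" set to the reverse analyzer, then generates candidates on the reversed terms.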
And regarding the duplicates, you can use
skip_duplicates
Whether duplicate suggestions should be filtered out (defaults to
false).

Understanding boosting in ElasticSearch

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -H 'Content-Type: application/json' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example
"query":
{
"bool":{
"should":[
{
"match":{
"title":{
"query":"cat",
"boost":2
}
}
},
{
"match":{
"content":{
"query":"cat",
"boost":2
}
}
}
]
}
}
Does that mean that the IDFs of the term "cat" is going to be boosted / multiplied by a factor of 2?
Also, what happens if instead of search for one word I have a sentence? Would that mean that the IDFs of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but that seems a little confusing.
The boost is used when querying with multiple query clauses, for example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the above query, it means clause1 is three times as important as clause3, and clause2 is twice as important as clause3. It's not a simple multiplication by 3 or 2, because scores are normalized when they are calculated.
Also, if you query with just one query clause, a boost is not useful.
A usage scenario for boost:
Consider a set of page documents with title and content fields.
You want to search title and content for some terms, and you consider the title more important than the content when searching these documents, so you set the boost of the title query clause higher than that of the content clause. If your query hits one document via the title field and another via the content field, and you want the title hit ranked ahead of the content hit, boost lets you do that.
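That title-over-content scenario is often written with the field-boost shorthand instead of two boosted clauses, e.g. (field names and boost value illustrative):

```json
GET _search
{
  "query": {
    "multi_match": {
      "query": "cat",
      "fields": ["title^3", "content"]
    }
  }
}
```

Here a title match contributes three times as much to the score as a content match.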