Elasticsearch nested phrase search within a distance

Sample ES Document
{
  // other properties
  "transcript" : [
    {
      "id" : 0,
      "user_type" : "A",
      "phrase" : "hi good afternoon"
    },
    {
      "id" : 1,
      "user_type" : "B",
      "phrase" : "hey"
    },
    {
      "id" : 2,
      "user_type" : "A",
      "phrase" : "hi"
    },
    {
      "id" : 3,
      "user_type" : "B",
      "phrase" : "my name is john"
    }
  ]
}
transcript is a nested field whose mapping looks like
{
  "type": "nested",
  "properties": {
    "id": {
      "type": "integer"
    },
    "phrase": {
      "type": "text",
      "analyzer": "standard"
    },
    "user_type": {
      "type": "keyword"
    }
  }
}
I need to search for two phrases inside transcript that are at most a given distance d apart.
For example:
If the phrases are hi and name and d is 1, the above document matches, because hi is present in the third nested object and name is present in the fourth nested object. (Note: hi in the first nested object and name in the fourth nested object is NOT a valid match, as they are more than d=1 apart.)
If the phrases are good and name and d is 1, the above document does not match, because good and name are a distance of 3 apart.
If both phrases are present in the same sentence, the distance is considered 0.
Possible Solution:
I can fetch all documents where both phrases are present and, on the application side, discard the documents where the phrases are more than the given threshold (d) apart. The problem is that I then cannot get the count of matching documents beforehand in order to show it in the UI as "found in 100 documents out of 1900": without the application-side processing we can't be sure whether a document is indeed a match, and it's not feasible to do that processing for every document in the index.
A second possible solution:
{
  "query": {
    "bool": {
      // suppose d = 2
      // if the first phrase occurs at the 0th offset, the second phrase can occur at
      // ... the 0th, 1st or 2nd offset
      // if the first phrase occurs at the 1st offset, the second phrase can occur at
      // ... the 1st, 2nd or 3rd offset
      // any one of the above permutations should exist
      "should": [
        {
          // search for 1st permutation
        },
        {
          // search for 2nd permutation
        },
        ...
      ]
    }
  }
}
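For concreteness, a single permutation clause in that should array might look like the sketch below (using the sample mapping above, with hi fixed at offset 2 and name at offset 3):

{
  "bool": {
    "must": [
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "match_phrase": { "transcript.phrase": "hi" } },
                { "term": { "transcript.id": 2 } }
              ]
            }
          }
        }
      },
      {
        "nested": {
          "path": "transcript",
          "query": {
            "bool": {
              "must": [
                { "match_phrase": { "transcript.phrase": "name" } },
                { "term": { "transcript.id": 3 } }
              ]
            }
          }
        }
      }
    ]
  }
}

One such clause is needed for every offset pair (i, j) with |i - j| <= d.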
This is clearly not scalable: if d is large and the transcript is long, the query becomes enormous.
Kindly suggest any approach.

Related

Ingesting / enriching / transforming data in one elasticsearch index with dynamic information from a second one

I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.
Imagine I have one index with one field I want to analyze (field_of_interest):
POST test/_doc/1
{
  "field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
  "field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
  "field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
  "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}
and a second one (scores) with pairs of keywords and weights:
POST scores/_doc/1
{
  "term": "quick",
  "weight": 1
}
POST scores/_doc/2
{
  "term": "brown",
  "weight": 2
}
POST scores/_doc/3
{
  "term": "lazy",
  "weight": 3
}
POST scores/_doc/4
{
  "term": "green",
  "weight": 4
}
I would like to define some kind of analysis, ingest, transform, enrichment or re-indexing step that dynamically adds a new field points to the first index: the sum of the weighted numbers of occurrences of each search term from the second index within field_of_interest. After performing this operation, I would want the new index to look something like this (some fields omitted):
{
  "_id": "1",
  "_source": {
    "field_of_interest": "The quick brown fox jumps over the lazy dog.",
    "points": 6
  }
},
{
  "_id": "2",
  "_source": {
    "field_of_interest": "The quick and the dead.",
    "points": 1
  }
},
{
  "_id": "3",
  "_source": {
    "field_of_interest": "The lazy quack was quick to quip.",
    "points": 4
  }
},
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "points": 9
  }
}
If possible, it would even be interesting to get individual fields for each of the terms, listing the weighted sum of their occurrences, e.g.
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "quick": 3,
    "brown": 0,
    "lazy": 6,
    "green": 0,
    "points": 9
  }
}
The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.
I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.
I first thought of an _ingest pipeline with an _enrich policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.
I also looked at _transform, _update_by_query, custom scoring, _term_vector but to be honest, I am a bit lost.
I would appreciate any pointers on whether what I want to do can be done with Elasticsearch (I assumed it would kind of be the perfect tool), and if so, which of the many different Elasticsearch concepts would be most suitable for my use case.
Follow this sequence of steps:
_scroll over every document in the second index.
Search for its term in the first index (a simple match query).
Increment points with a scripted update operation on every matching document (see the sketch below).
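For illustration, one iteration of steps 2 and 3 for the term quick with weight 1 could be a single _update_by_query call (a sketch; the script initializes points when missing, and counts each matching document once rather than once per occurrence):

POST test/_update_by_query
{
  "query": {
    "match": {
      "field_of_interest": "quick"
    }
  },
  "script": {
    "source": "ctx._source.points = (ctx._source.points == null ? 0 : ctx._source.points) + params.weight",
    "params": {
      "weight": 1
    }
  }
}

Weighting by the number of occurrences (quick, quick, quick = 3) would additionally require per-document term frequencies, e.g. from the _termvectors API, which is part of why this ends up being client-side work.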
Having individual words as fields in the first index is not a good idea. We do not know which words are going to be found inside the sentences, so your index mapping would explode with a lot of dynamic fields, which is not desirable. A better way is to add a nested field to the first index, with the following mapping:
{
  "words": {
    "type": "nested",
    "properties": {
      "name": { "type": "keyword" },
      "weight": { "type": "float" }
    }
  }
}
Then you simply append to this array for every word that is found; points can be a separate field.
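For example, appending one matched word to that array might look like the following sketch (document id, word and weight are illustrative; the script creates the array on first use):

POST test/_update/1
{
  "script": {
    "source": "if (ctx._source.words == null) { ctx._source.words = []; } ctx._source.words.add(params.word);",
    "params": {
      "word": {
        "name": "quick",
        "weight": 1
      }
    }
  }
}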
What you want to do has to be done client side. There is no inbuilt way to handle such an operation.
HTH.

How can I script a field in Kibana that matches first 4 digits of a field?

I'm facing a steep learning curve with the syntax, and my data contains PII, so I don't know how to describe it in more detail.
I need a new field in Kibana in the already-indexed documents. This field "C" would be a combination of the first 4 digits of a field "A" (of type keyword, containing numbers up to the millions) and a field "B" (of type keyword, some large number).
Later I will use this field "C", a unique combination, to compare against a list/array of items (I will insert the list into a query DSL in Kibana, as I need to build some visualizations and reports from the returned documents).
I saw that I could use Painless to create this new field, but I don't know whether I need to use regex, nor exactly how.
EDIT:
As requested, here is more info about the mapping, with a concrete example.
"fieldA" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"fieldB" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Example of values:
FieldA = "9876443320134",
FieldB = "000000001".
I would like to sum the first 4 digits of FieldA and the full content of FieldB. FieldC would result in a value of "9877".
The raw query could look like this:
GET combination_index/_search
{
  "script_fields": {
    "a+b": {
      "script": {
        "source": """
          // doc values are never null; missing fields show up as empty, so check size()
          def aField = doc['fieldA.keyword'];
          def bField = doc['fieldB.keyword'];
          if (aField.size() == 0 || bField.size() == 0) {
            return null;
          }
          def parsed_a = new BigInteger(aField.value);
          def parsed_b = new BigInteger(bField.value);
          return new BigInteger(parsed_a.toString().substring(0, 4)) + parsed_b;
        """
      }
    }
  }
}
Note 1: we're parsing the strings into BigInteger because the values can exceed Integer.MAX_VALUE.
Note 2: we're first parsing fieldA and only then calling .toString on it again in order to handle the edge case of fieldA starting with 0s, like 009876443320134. It's assumed that you're looking for 9876, not 98, which would be the result of calling .substring first and then parsing.
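To make that ordering concrete for the example value (a small Painless sketch, separate from the query above):

def raw = "009876443320134";
def parseFirst = new BigInteger(raw).toString().substring(0, 4); // "9876" (intended)
def substringFirst = new BigInteger(raw.substring(0, 4));        // 98, parsed from "0098" (wrong)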
If you intend to use it in Kibana visualizations, you'll need an index pattern first. Once you have one, add a scripted field to the index pattern, paste the script in, and click save; the new scripted field becomes available in numeric aggregations and queries.

ElasticSearch append non matched docs at the end of the search result

Is there any way to append non matched docs at the end of the search result?
I have been working on a project where we need to search docs by geolocation data, but some docs don't have geolocation data available, and as a result those docs never appear in the search results.
Example mapping:
PUT /my_locations
{
  "mappings": {
    "_doc": {
      "properties": {
        "address": {
          "properties": {
            "city": {
              "type": "text"
            },
            "location": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
Data with geo location:
PUT /my_locations/_doc/1
{
  "address": {
    "city": "XYZ",
    "location": {
      "lat": 40.12,
      "lon": -71.34
    }
  }
}
Data without geo location:
PUT /my_locations/_doc/2
{
  "address": {
    "city": "ABC"
  }
}
Is there any way to perform a geo distance query which selects the docs with geolocation data and appends the non-geo docs at the end of the result?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html#query-dsl-geo-distance-query
You have two separate queries:
Get documents within the area
Get other documents
Getting both of these in one search would mean that all of the documents appear in one result and share a ranking. It would be difficult to create a relevancy model which returns, say, the first 9 documents with a location and one without.
But you can simply run the two queries at once with _msearch: one for the first 9 documents with a location, and one for documents without any.
Example:
GET my_locations/_msearch
{}
{"size":9,"query":{"geo_distance":{"distance":"200km","address.location":{"lat":40,"lon":-70}}}}
{}
{"size":1,"query":{"bool":{"must_not":[{"exists":{"field":"address.location"}}]}}}

Phrase suggester returns unexpected result when first letter is misspelled

I'm using the Elasticsearch phrase suggester to correct users' misspellings. Everything works as I expect, unless the user enters a query whose first letter is misspelled. In that situation the phrase suggester returns nothing, or returns unexpected results.
My query for suggestion:
{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "collate": {
          "query": {
            "inline": {
              "bool": {
                "should": [
                  { "match": { "title": "{{suggestion}}" } },
                  { "match": { "participants": "{{suggestion}}" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}
Example when first letter is misspelled:
"simple_phrase" : [
{
"text" : "گاشانچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "گارانتی",
"score" : 0.00253151
}]
}
]
Example when fifth letter is misspelled:
"simple_phrase" : [
{
"text" : "کاشاوچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "کاشانچی",
"score" : 0.1121
},
{
"text" : "کاشانجی",
"score" : 0.0021
},
{
"text" : "کاشنچی",
"score" : 0.0020
}]
}
]
I expect these two misspelled queries to produce the same suggestions (my expected suggestions are the second set). What is wrong?
P.S.: I'm using this feature for the Persian language.
I have a solution for your problem; you only need to add some fields to your schema.
P.S.: I don't have that much expertise in Elasticsearch, but I have solved the same problem using Solr, and you can implement it the same way in Elasticsearch too.
Create a new ngram field and copy all your title values into it.
When you fire a query for a misspelled word and get an empty result, split the word and fire the same query again; you will get the results you expect.
Example: suppose a user is searching for the word Akshay but types it as Skshay. Create the query as below and you will hopefully get the expected results.
I am giving a Solr example here; you can achieve the same with Elasticsearch.
(ngram:"skshay" OR ngram:"sk" OR ngram:"ks" OR ngram:"sh" OR ngram:"ha" OR ngram:"ay")
We have split the word into sequential bigrams and fired the query on the ngram field.
Hope it will help you.
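A rough Elasticsearch translation of that Solr query could look like the sketch below (assuming an ngram field that stores both the original token and its bigrams; the field and index names are illustrative):

GET titles/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "ngram": "skshay" } },
        { "term": { "ngram": "sk" } },
        { "term": { "ngram": "ks" } },
        { "term": { "ngram": "sh" } },
        { "term": { "ngram": "ha" } },
        { "term": { "ngram": "ay" } }
      ]
    }
  }
}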
From the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html
prefix_length
The number of minimal prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur in the beginning of terms. (The old name "prefix_len" is deprecated.)
So by default the phrase suggester assumes that the first character is correct, because the default value of prefix_length is 1.
Note: setting this value to 0 is not a good idea, because it has performance implications.
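If you accept that trade-off, prefix_length can be lowered on the suggester's candidate generator; a minimal sketch based on the query above:

{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "direct_generator": [
          {
            "field": "title.phrase",
            "prefix_length": 0
          }
        ]
      }
    }
  }
}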
You need to use a reverse analyzer. I explained it in this post, so please go and check my answer:
Elasticsearch spell check suggestions even if first letter missed
And regarding duplicates, you can use skip_duplicates:
Whether duplicate suggestions should be filtered out (defaults to false).

Terms query not returning results for list of strings

I have this Elastic query which fails to return the desired results for terms.letter_score. I'm certain there are available matches in the index. The query returns the expected filtered results when letter_score is excluded, but nothing once it is included. The only difference I can tell is that the cat_id values are a list of integers whereas the letter_score values are strings. Any ideas what the issue could be here? I'm basically trying to get it to match ANY value from the letter_score list.
Thanks
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "cat_id": [1, 2, 4]
          }
        },
        {
          "terms": {
            "letter_score": ["A", "B", "E"]
          }
        }
      ]
    }
  }
}
It sounds like your letter_score field is of type text and has therefore been analyzed, so the tokens A, B and E have been stored as a, b and e, and the terms query won't match them.
Also, if the field's analyzer removes stop words (the default standard analyzer does not, but analyzers such as english do), there's a fair chance the token a was dropped entirely at indexing time.
A first approach is to use a match query instead of terms, like this:
{
  "match": {
    "letter_score": "A B E"
  }
}
If that still doesn't work, I suggest changing the mapping of your letter_score field to keyword (which requires reindexing your data); then your query will work as it is now.
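That could look like the following sketch (index names are illustrative; the new index maps letter_score as keyword, and _reindex copies the data across):

PUT new_index
{
  "mappings": {
    "properties": {
      "cat_id": { "type": "integer" },
      "letter_score": { "type": "keyword" }
    }
  }
}
POST _reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}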
