How to highlight regexp matches in Elasticsearch with patterns that include spaces - elasticsearch

I am trying to use a regexp query in Elasticsearch to find some patterns and highlight them. The pattern I am trying to find contains spaces.
I also used a keyword sub-field so that the text is not analyzed:
{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}
The exact pattern is ".*( [a-zA-Z]( |,|.)){3,5}.*" and the query looks like this:
{
  "_source": false,
  "query": {
    "bool": {
      "should": [
        {
          "regexp": {
            "transcript_data.transcript.keyword": {
              "value": ".*( [a-zA-Z]){3,5}.*"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "transcript_data.transcript.keyword": {}
    }
  }
}
The highlight seems to highlight the whole document (start to end) even though the pattern lies in the middle of the text.
For example, given the text
It's it's it's a steal. A hot eight mining and b k k t. I think these have went too
the output should be <em>b k k t</em>, but I get <em>It's it's it's a steal. A hot eight mining and b k k t. I think these have went too</em>. I believe this is because of the .*, but this also seems to be the way regex works in ES. What am I doing wrong?

As far as I know, you cannot search a text type field when the regex contains whitespace, because text fields are analyzed and split into multiple tokens. A search on a text field with whitespace in the regex will therefore not return any results.
Currently you are searching the keyword field, which is not analyzed, and that is why your search matches. It is also why the entire field is highlighted: a keyword field stores the whole value as a single token, so the highlighter marks the whole token.
You can use "require_field_match": "false" in highlighting if you want to search on one field and highlight a different field, but this alone will also not work in your case.
You can try a shingle analyzer on another sub-field, then search on the keyword field and highlight on the shingle field, but I am not sure if this will fit your use case completely.
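For illustration, here is a minimal sketch of what such a mapping and request could look like. The index name, the shingle_analyzer analyzer, and the shingles sub-field are hypothetical names, not from the original question:
PUT transcripts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "transcript_data": {
        "properties": {
          "transcript": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword" },
              "shingles": { "type": "text", "analyzer": "shingle_analyzer" }
            }
          }
        }
      }
    }
  }
}

GET transcripts/_search
{
  "query": {
    "regexp": {
      "transcript_data.transcript.keyword": ".*( [a-zA-Z]){3,5}.*"
    }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "transcript_data.transcript.shingles": {}
    }
  }
}
Because the shingles sub-field is analyzed into overlapping word groups, the highlighter can mark smaller fragments instead of the single keyword token, although the highlighted fragments will not line up exactly with the regex match.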

Related

Can't configure elasticsearch

I enter the word "Booking" into the search, and already starting from the 5th result it returns "Cooking", even though there are still results that contain the word "Booking" in the line.
$this->aQuery['query']['bool']['must'][]['multi_match'] = [
    'type' => 'cross_fields',
    'query' => 'Booking',
    'fields' => ['prod_name', 'prod_prefix'],
    'operator' => 'or'
];
I need results containing only the word "Booking". But if you enter "Travel Booking", then the results may contain both "Travel" and "Booking".
To get an exact match, use the term query.
Match uses so-called analyzers to transform your search and find all matching words. Term checks for an exact match on the search string, so a term search for Booking will not return documents whose field value is "Travel Booking".
As explained on the page:
By default, Elasticsearch changes the values of text fields during analysis. For example, the default standard analyzer changes text field values as follows:
Removes most punctuation
Divides the remaining content into individual words, called tokens
Lowercases the tokens
The term query does not analyze the search term. The term query only searches for the exact term you provide. This means the term query may return poor or no results when searching text fields.
Also note that you need to include a keyword type sub-field in your index mappings:
"prod_name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
Then, when you search for your word, you need to use the keyword sub-field's name:
"term": {
"prod_name.raw": {
"value": "Booking",
"boost": 1.0
}
}
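If you also want multi-word inputs like "Travel Booking" to keep matching documents that contain either word, one option is to combine both approaches: keep the analyzed multi_match for recall and add the exact term as a boosting should clause. A rough sketch, where the surrounding bool structure is illustrative rather than taken from the question:
"query": {
  "bool": {
    "must": [
      {
        "multi_match": {
          "type": "cross_fields",
          "query": "Travel Booking",
          "fields": ["prod_name", "prod_prefix"],
          "operator": "or"
        }
      }
    ],
    "should": [
      {
        "term": {
          "prod_name.raw": {
            "value": "Travel Booking",
            "boost": 2.0
          }
        }
      }
    ]
  }
}
Documents whose prod_name.raw equals the full input exactly are then ranked above documents that match only one of the words.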

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "firstName": {
              "query": "sVxGBCkPYZ",
              "boost": 30
            }
          }
        }
      ],
      "should": []
    }
  },
  "fields": [
    "id",
    "firstName"
  ]
}
However, replacing the match element with the below fails to return any records:
{
  "fuzzy": {
    "firstName": {
      "value": "sVxGBCkPYZ",
      "fuzziness": 2,
      "boost": 30,
      "min_similarity": 0.3
    }
  }
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The fuzzy query fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the data that got indexed (assuming the field is of type text with the standard analyzer) would be lowercased to svxgbckpyz in the inverted index.
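You can verify what actually got indexed with the _analyze API (a minimal sketch using the request format of recent Elasticsearch versions; on 1.x the analyzer and text were passed as query-string parameters instead):
POST _analyze
{
  "analyzer": "standard",
  "text": "sVxGBCkPYZ"
}
This returns the single token svxgbckpyz, which is what sits in the inverted index, while the fuzzy query compares terms against the original mixed-case input.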
You can instead implement fuzziness with a match query as below:
POST testindex/_search
{
  "query": {
    "match": {
      "firstname": {
        "query": "sVxGBCkPYZ",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can change the value from AUTO to an explicit edit distance of 1 or 2, depending on your use case (Elasticsearch only accepts edit distances between 0 and 2).
The exact match you mentioned also works because the query string gets analysed, which converts the input string to lower case, and the lowercased term is available in the inverted index.
As for how the fuzzy query you've mentioned works behind the scenes, as per this LINK:
The fuzzy query works by taking the original term and building a Levenshtein automaton—like a big graph representing all the strings that are within the specified edit distance of the original string. The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly.
Note this statement in particular: "representing all the strings that are within the specified edit distance of the original string".
For example, some of the words within a distance of 1 of life would be aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy search's automaton would not be able to produce the term svxgbckpyz from the input string sVxGBCkPYZ, firstly because the edit distance between them is 7 (remember, the distance between A and a is 1), which I don't think the AUTO option can cover, and even if you configured it to 7, it might not produce the string, as there would be a huge list of words at distance 7.
Adding one more LINK for more info. Hope it helps!

Elasticsearch simple query string: removing documents containing words

I created a foo example to express what I mean. Suppose we have an index whose documents contain the words Text and Texture.
Then I'd like to select all documents containing the word Text (I'm using the simple query string).
When I use the query "query": "Text", I get areas 1, 2 and 3 from the picture below.
When I use the query "query": "Text -Texture", I get only area 3 from the picture below.
How could I get both areas 2 and 3?
Thanks.
To understand your problem fully, you would need to post your query. Try using a term query:
{
  "query": {
    "term": {
      "myField": "Text"
    }
  }
}
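One caveat, not part of the original answer: if myField is an analyzed text field, the term query has to match the exact token stored in the index, and the standard analyzer lowercases tokens at index time. In that case the search term would need to be lowercased as well:
{
  "query": {
    "term": {
      "myField": "text"
    }
  }
}
This matches every document containing the token text (areas 2 and 3), without excluding documents that also contain texture.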

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently my query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use the "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However, a document containing all three words should score higher than documents containing two words or a single word, and so on.
I found a possible solution here: https://github.com/elastic/elasticsearch/issues/13806, which uses bool queries. However, I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?
Based on the discussions above I came up with the following solution, which is a bit different from what I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem: a query for "apple banana juice" had the same score for a doc with tags ["apple juice", "apple"] and another doc with tags ["banana", "apple juice"], although I would like to score the second one higher.
From this other discussion I found out that the issue was caused by my field being nested, and I created a plain text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with the tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first should clause is performed on the text field and uses the "or" operator; the second should clause is performed on the nested field and uses the "and" operator, as sketched below.
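Roughly, the final query could look like the sketch below. The field names tags.eng and tags_nested.eng are hypothetical stand-ins for the text and nested variants of the tags field described above:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "tags.eng": {
              "query": "apples banana juice",
              "operator": "or"
            }
          }
        },
        {
          "nested": {
            "path": "tags_nested",
            "query": {
              "match": {
                "tags_nested.eng": {
                  "query": "apples banana juice",
                  "operator": "and"
                }
              }
            }
          }
        }
      ]
    }
  }
}
The "or" clause rewards docs that contain more of the query words anywhere in the tags, while the nested "and" clause rewards docs where all the words occur within a single tag.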
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well when searching full texts, but does someone know where I can find something more specific to tagged documents?

Ngram Tokenizer on field, not on query

I'm having trouble finding the solution for a use case here.
Basically, it's pretty simple: I need to perform a "contains" query, like a SQL LIKE '%...%'.
I've seen there is a regexp query, which I actually managed to get working perfectly, but as it seems to scale badly, I'm trying out ngrams. Now, I've played around with them before and know "how they work", but the behaviour isn't the one I expect.
Basically, I've configured my analyzer with min_gram = 2 and max_gram = 20. Say I index a user called "Christophe". I want the query "Chris" to match, which it does, since "Chris" is a 5-gram of "Christophe". The problem is that "Risotto" matches as well, because it gets broken down into ngrams too, and ultimately "is" is a 2-gram of both "Risotto" and "Christophe", so it matches.
What I need is for the analyzer to break down the indexed field into ngrams at indexing time, and compare those to the FULL text query. "Risotto" should match "Risotto", "XXXRisottoXXX" and so on, but not "Risolo" or something where only the ngrams match.
Is there any solution?
You need to use the search_analyzer setting to have distinct index-time and search-time analyzers.
Sample from the docs:
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
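For completeness, here is a minimal sketch of index settings that could define such an autocomplete analyzer, using the 2..20 gram range from the question. The index, tokenizer, and field names are illustrative, and on Elasticsearch 7+ the index.max_ngram_diff setting must be raised to allow that spread:
PUT my-index
{
  "settings": {
    "index.max_ngram_diff": 18,
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
With this setup "Christophe" is indexed as ngrams, but the query "Risotto" is analyzed by the standard analyzer into the single token risotto, so it only matches documents whose indexed ngrams contain the whole word.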
