How can you match long strings? - elasticsearch

One of the main challenges I am facing at the moment is how to match long strings while applying fuzziness to them.
For example, let's say that we have the following document:
PUT my_index/type/2
{
"name":"longnameyesverylong"
}
If I apply a fuzzy search on that name, like the following:
"match": {
  "name": {
    "query": "longnameyesverylong",
    "fuzziness": 2
  }
}
I can find it, but my goal is to widen the net and allow more than two mistakes for this type of string.
Let's say, for example, that I index something like:
PUT my_index/type/2
{
"name":"l1ngnam2yesver3long"
}
The previous match query won't be able to find this document, as the required fuzziness is greater than 2, and that is not supported in ES.
I tried to use ngrams, but the tokens did not meet the requirement either, and the index would grow too much.
The only option on top of my head is to split the string manually at index time, creating my "own tokenizer", and index a document that looks like:
PUT my_index/type/2
{
"name":"longnamey esverylong"
}
Then, at search time, split the string again and apply a boolean query with fuzziness on each token, as sketched below. This could probably do what I need, but I feel there is a better solution for this problem.
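For reference, a sketch of what that search-time bool query might look like; the two-token split and the token boundaries are purely illustrative:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "longnamey", "fuzziness": 2 } } },
        { "match": { "name": { "query": "esverylong", "fuzziness": 2 } } }
      ]
    }
  }
}
Each token gets its own fuzziness budget of 2, so the string as a whole can tolerate up to four edits.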
Is there any other approach that you think might be appropriate?
Thank you.

Problem solved. The key to this problem is the pattern_capture filter.
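For readers landing here, a minimal sketch of what a pattern_capture-based analyzer could look like; the index name, filter name, and the 4-character chunk size are assumptions for illustration, not details from the original solution:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "chunker": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": ["(....)"]
        }
      },
      "analyzer": {
        "chunk_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "chunker"]
        }
      }
    }
  }
}
The filter emits one token per successive 4-character capture ("long", "name", "yesv", ...), and a bool query of fuzzy matches over those chunks can then tolerate a couple of edits per chunk instead of two for the entire string.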

Related

Type of field for prefix search in Elasticsearch

I'm confused about what index type I should apply to my field for prefix search. Many examples show search_as_you_type, but I think autocomplete is not what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Using 3739 should not return it, as the ID doesn't start with 3739. Initially partial search was what I was going for, but the wildcard field type is not supported by Amazon AWS, so I compromised on prefix search instead of partial search.
I tried the search_as_you_type field, but it doesn't return the result when I use the whole ID. Actually, my use case is that the results are shown when the user presses Enter, rather than live as they type, so it's OK if speed is compromised; I just hope for something that will work well for many rows of data.
Thanks
If you have not explicitly defined any index mapping, then you need to use the id.keyword field instead of the id field for the prefix query to show the appropriate results. This uses the keyword analyzer instead of the standard analyzer:
{
  "query": {
    "prefix": {
      "id.keyword": {
        "value": "34y72ca1"
      }
    }
  }
}
Otherwise, you can modify your index mapping by adding a keyword multi-field for the id field.
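A minimal sketch of such a mapping; the index name is assumed, and the sub-field name keyword mirrors what dynamic mapping would create by default:
PUT my_index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
With this in place, prefix queries against id.keyword operate on the whole UUID as a single token.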

Elasticsearch: get exact match term results

I have an Elasticsearch index with documents having a field "backend_name" with values like google, goolge_staging, google_stg1, etc.
I want only those documents that have "backend_name" = google.
I am trying with the term query like this:
{ "query": { "term": { "backend_name": "google" } } }
But it returns me document having "backend_name" as goolge_staging, google_stg1 too. I want just document with "backend_name" = google.
One way to resolve it is to put goolge_staging, google_stg1, etc. in a must_not list, but I want a better way. Suggestions?
It is probably because of the mapping you are using.
Take a look at the Elasticsearch documentation for the term query.
Try changing the mapping type to keyword so it matches only on an exact match.
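For example, a mapping along these lines (the index name is assumed) stores backend_name as a single untouched token, so the term query above matches google and nothing else:
PUT my_index
{
  "mappings": {
    "properties": {
      "backend_name": {
        "type": "keyword"
      }
    }
  }
}
If the index already uses dynamic mapping, querying the backend_name.keyword sub-field achieves the same result without reindexing.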

Nested count queries

I'm looking to add a feature to an existing query. Basically, I run a query that returns, say, 1000 documents. Those documents all have the same structure; only the values of certain fields vary. What I'd like is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many have the same field X with the value Z, etc.
Basically, get all the results plus 4 or 5 "counts" that would act like the SQL GROUP BY, in a way.
The point of this is to allow full-text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects, etc.
Any way to do this without running additional/separate queries?
EDIT WITH ANSWER:
Aggregations are the way to go. Here's how I did it; it's so straightforward that I expected much harder work!
{
  "query": {
    "term": {
      "_type": "client"
    }
  },
  "aggregations": {
    "agg1": {
      "terms": {
        "field": "listType.typeRef.keyword"
      }
    }
  }
}
Note that the field even lives inside a list, not a single top-level field; that's just how easy it was!
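The response then carries the usual hits plus one bucket per distinct value of the aggregated field; the keys and counts below are invented purely for illustration:
"aggregations": {
  "agg1": {
    "buckets": [
      { "key": "activeClient", "doc_count": 620 },
      { "key": "pastClient", "doc_count": 280 },
      { "key": "activeProspect", "doc_count": 100 }
    ]
  }
}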
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.

Is it possible to chain fquery filters in Elasticsearch with exact matches?

I have been having trouble writing a method that will take in various search parameters in Elasticsearch. I was working with queries that looked like this:
body:
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "term": { "some_term": "foo" } },
          { "term": { "is_visible": true } },
          { "term": { "term_two": "something" } }
        ]
      }
    }
  }
}
Using this syntax, I thought I could chain these terms together and programmatically generate these queries. I was using simple strings, and if there was a term like "person_name" I could split the query in two and say "where person_name matches 'JOHN'" and "where person_name matches 'SMITH'", getting accurate results.
However, I just came across "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter when searching a value with slashes in it, so I learned that I can use fquery to search for the full value, like this:
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? It seems that when I combine fquery and my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested/chained queries using Elasticsearch?
As in the comment below, yes, I can just add fquery to the "and" block like so:
{:filtered =>
  {:filter =>
    {:and => [
      {:term => {:is_visible => true}},
      {:term => {:is_private => false}},
      {:fquery =>
        {:query => {:match => {:sub_location => "New JErsey"}}}}]}}}
Why would Elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
  "query": {
    "match": {
      "sub_location": {
        "query": "New JErsey",
        "operator": "and"
      }
    }
  }
}
This will not match documents where the value of field "sub_location" is "New York". But, it will match documents where the value of field "sub_location" is say "York New" because the query finally translates into a boolean query like "York" AND "New". If you are fine with this behaviour, well and good, else read further.
All these issues arise because you are using the default analyzer for the field "sub_location", which breaks tokens at word boundaries and indexes them. If you really do not care about partial matches and always want to match the entire string, you can make use of a custom analyzer that uses the Keyword Tokenizer and the Lowercase Token Filter. Mind you, going ahead with this approach will require you to re-index all your documents.
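A sketch of such an analyzer, written in the modern mapping format (on the 1.x-era cluster implied by the filtered/fquery syntax above, the field mapping would sit under the type name instead); the analyzer name is illustrative:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sub_location": {
        "type": "text",
        "analyzer": "lowercase_keyword"
      }
    }
  }
}
With this, "New JErsey" is indexed as the single token "new jersey", so only whole-string, case-insensitive matches succeed.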

Unexpected case sensitivity

I am a noob running Elasticsearch 1.5.9. I want to pull out all of the documents that have the field "PERSON" set to "Johnson" (note the mixed casing). If I manually look in elasticsearch-head, I can see a document with exactly those attributes.
The docs explain that I should construct a filter query to pull out this document. But when I do so, I get some unexpected behavior.
This works. It returns exactly one document with PERSON = "Johnson", as expected:
query = {"filter": {"term" : { "PERSON" : "johnson" }}}
But this does not work:
query = {"filter": {"term" : { "PERSON" : "Johnson" }}}
If you look closely, you'll see that the good query is lowercase but the bad query is mixed case -- even though the PERSON field is set to "Johnson".
Adding to the weirdness, I am lowercasing everything that goes into the full_text field: "_source": { "full_text": "all lower case" }. So the full text includes johnson, which I would think would be totally independent of the PERSON field.
What's going on? How do I do a mixed case search on the PERSON field?
A term query won't analyze your search text.
This means you need to analyze the text yourself and provide the query in token form for the term query to actually work.
Use a match query instead, and things will work like magic.
When a string like the one below goes into Elasticsearch, it is tokenized (or rather, analyzed) and stored:
"Green Apple" -> ( "green" , "apple")
This is the default behavior of analysis.
Now, when you search using a term query, no analysis happens.
This means that for the word Apple, it searches for the token "Apple" with case preserved, and hence fails.
A match query, on the other hand, does perform analysis. If you search for Apple, it is converted to apple before the search runs, which gives good matches.
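Applied to the question above, a match query such as this sketch analyzes "Johnson" down to "johnson" and finds the document regardless of the casing typed:
{
  "query": {
    "match": {
      "PERSON": "Johnson"
    }
  }
}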
You can learn more about analysis in the Elasticsearch documentation.
