Unexpected case sensitivity - Elasticsearch

I am a noob running Elasticsearch 1.5.9. I want to pull out all of the documents that have the field "PERSON" set to "Johnson" (note the mixed casing). If I look manually in elasticsearch-head, I can see a document with exactly those attributes.
The docs explain that I should construct a filter query to pull out this document. But when I do so, I get some unexpected behavior.
This works. It returns exactly one document with PERSON = "Johnson", as expected:
query = {"filter": {"term" : { "PERSON" : "johnson" }}}
But this does not work:
query = {"filter": {"term" : { "PERSON" : "Johnson" }}}
If you look closely, you'll see that the working query is lowercase while the failing query is mixed case, even though the PERSON field is set to "Johnson".
Adding to the weirdness, I am lowercasing everything that goes into the full_text field: "_source": { "full_text": "all lower case" }. So the full text includes johnson, which I would think would be totally independent from the PERSON field.
What's going on? How do I do a mixed case search on the PERSON field?

A term query won't analyze your search text.
This means you would have to analyze the text yourself and supply the query already in token form for a term query to work.
Use a match query instead, and things will work like magic.
When a string like the one below goes into Elasticsearch, it is tokenized (or rather, analyzed) and stored:
"Green Apple" -> ("green", "apple")
This is the default behavior of analysis.
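You can see this for yourself with the _analyze API (the query-string form below is the 1.x-era syntax):

GET /_analyze?analyzer=standard&text=Green+Apple

This returns the two tokens green and apple, both lowercased.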
Now when you search using a term query, no analysis happens. This means that for the word Apple, it searches for the token Apple with the case preserved, and hence fails.
A match query, on the other hand, does perform the analysis. This means that if you search for Apple, it is converted to apple and then searched, which gives good matches.
You can learn more about analysis here.
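Applied to the original question, a match query along these lines should find the document regardless of the casing in the search text (a sketch using the field name from the question):

{
  "query": {
    "match": {
      "PERSON": "Johnson"
    }
  }
}

At search time "Johnson" is analyzed to johnson, which matches the johnson token stored in the index.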

Related

Multiple elasticsearch match queries

Say I have a document with 3 text fields: field_a, field_b and field_c.
Is it possible to do a single query so that we have results in this order:
'match' in field_a
'match' in field_b
'match' in field_c
'multi_match' results can have hits from different fields mixed together in result order; what I want is any and all results from field_a, then any and all results from field_b, and so on.
Even though I find this approach strange in general (I think the problem you have should be solved in a different way, e.g. multiple stages of search, as sketched at the end of this answer), I think you could solve it for now in the following manner.
The multi_match query gives you the ability to boost individual fields. E.g.:
"query": {
"multi_match" : {
"query" : "this is a test",
"fields" : [ "field_a^1000", "field_b^10", "field_c" ]
}
}
The ^ sign is the boost operator, which multiplies the score of a match in that field by the given value (1000 in the case of field_a).
However, I would recommend avoiding this sort of approach in general, since:
It's hard to control those boosting values
In some cases it may not behave as expected (imagine a match in field_b scoring 1000 on its own)
If you have many hits, it makes the whole idea of matching on field_c somewhat moot, since no user will scroll that far down the search results
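If you do go with the multiple stages of search mentioned above, one simple way is to run one match query per field and concatenate the result lists yourself, e.g. with the _msearch API (a sketch; the index name my_index is an assumption):

GET /my_index/_msearch
{}
{"query": {"match": {"field_a": "this is a test"}}}
{}
{"query": {"match": {"field_b": "this is a test"}}}
{}
{"query": {"match": {"field_c": "this is a test"}}}

Each request is executed independently, so all field_a hits can be listed in full before any field_b hits.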

How can you match long strings?

One of the main challenges I am facing at the moment is how to match long strings while applying fuzziness to them.
For example, let's say that we have the following document:
PUT my_index/type/2
{
  "name": "longnameyesverylong"
}
If I apply a fuzzy search on that name, like the following:
"match": {
  "name": {
    "query": "longnameyesverylong",
    "fuzziness": 2
  }
}
I can find it, but my goal is to widen the net and allow more than two mistakes for this type of string.
Let's say, for example, that I index something like:
PUT my_index/type/2
{
  "name": "l1ngnam2yesver3long"
}
The previous match query won't be able to find this document, as the edit distance is greater than 2, and a fuzziness above 2 is not supported in ES.
I tried to use ngrams, but the tokens did not meet the requirement either, and the index would grow too much.
The only option on top of my head is to split the string manually at index time, creating my "own tokenizer", and index a document that looks like:
PUT my_index/type/2
{
  "name": "longnamey esverylong"
}
Then, at search time, split the string again and apply a boolean query with fuzziness on each token. This would probably do what I need, but I feel that there is probably a better solution for this problem.
Is there any other approach that you think might be appropriate?
Thank you.
Problem solved. The key for this problem is the pattern_capture filter.
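For reference, here is a minimal sketch of the idea (the chunk size of 8 and all names are assumptions, not the asker's actual settings): a pattern_capture token filter can split the long name into fixed-size chunks at index time, so that fuzziness 2 only has to cover the typos within each chunk.

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "chunk_8": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [ "(.{1,8})" ]
        }
      },
      "analyzer": {
        "chunked_name": {
          "tokenizer": "keyword",
          "filter": [ "lowercase", "chunk_8" ]
        }
      }
    }
  }
}

With "name" mapped to this analyzer, "l1ngnam2yesver3long" is indexed as the chunks l1ngnam2, yesver3l and ong (plus the original), and a fuzzy match query, analyzed the same way, can tolerate up to two edits per chunk.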

How to find all documents with a specific string in a field? (Elasticsearch)

I have a document with fields:
"provider": "AppStore",
"device_model": "iPad3,6[graphicsDeviceName: PowerVR SGX 554]",
"days_in_game": 34,
And I need to get all documents with the string iPad in device_model!
Is it possible?
There are two types of search queries in Elasticsearch, i.e. term queries and match queries. A match query first analyzes the query string, then looks for documents containing the words in the query, and returns results depending upon how closely they match.
What the term query does is basically a yes-or-no check, and it will return only the documents that contain an exact match.
I think for your case a term-level query is a better fit. And since the field does not contain the exact word iPad but something like iPad3, you should use a prefix, wildcard or possibly a regexp query, depending upon what your documents actually contain (take a look at this).
You could use the following query (note the lowercase prefix: term-level queries are not analyzed, and the standard analyzer will have indexed the tokens in lowercase):
{
  "query": {
    "prefix": {
      "device_model": "ipad"
    }
  }
}
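If the string you are looking for can also appear in the middle of a token rather than only at its start, a wildcard query is one alternative (a sketch; note that leading-wildcard queries can be slow on large indices):

{
  "query": {
    "wildcard": {
      "device_model": "*ipad*"
    }
  }
}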

Is it possible to chain fquery filters in Elasticsearch with exact matches?

I have been having trouble writing a method that will take in various search parameters in elasticsearch. I was working with queries that looked like this:
body:
  {query:
    {filtered:
      {filter:
        {and: [
          {term: {some_term: "foo"}},
          {term: {is_visible: true}},
          {term: {"term_two": "something"}}
        ]}
      }
    }
  }
Using this syntax I thought I could chain these terms together and programmatically generate these queries. I was using simple strings, and if there was a term like "person_name" I could split the query in two and say "where person_name matches 'JOHN'" and "where person_name matches 'SMITH'", getting accurate results.
However, I just came across "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter to search a value with slashes in it, so I learned that I can use fquery to search for the full value, like this:
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? It seems that when I combine fquery with my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested/chained queries using Elasticsearch?
As in the comment below, yes, I can just add fquery to the "and" block like so:
{:filtered=>
{:filter=>
{:and=>[
{:term=>{:is_visible=>true}},
{:term=>{:is_private=>false}},
{:fquery=>
{:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
  "query": {
    "match": {
      "sub_location": {
        "query": "New JErsey",
        "operator": "and"
      }
    }
  }
}
This will not match documents where the value of field "sub_location" is "New York". But it will match documents where the value of field "sub_location" is, say, "York New", because the query finally translates into a boolean query like "York" AND "New". If you are fine with this behaviour, well and good; else read further.
All these issues arise because you are using the default analyzer for the field "sub_location", which breaks the value into tokens at word boundaries and indexes them. If you really do not care about partial matches and always want to match the entire string, you can use a custom analyzer with the Keyword Tokenizer and the Lowercase Token Filter. Mind you, going ahead with this approach will require you to re-index all your documents.
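A minimal sketch of such a custom analyzer, using the 1.x-era mapping syntax that matches the filtered/fquery queries above (the analyzer name and type name are assumptions):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "sub_location": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}

With this mapping, "New Jersey" is indexed as the single token new jersey, so a match query for "New JErsey" matches it exactly and matches nothing else.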

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested in the search query. In return I get back the closest-matched group that the user should be in.
The field I am searching on is a nested field as follows:
`{"interests": [
{"topics":["python", "stackoverflow", "elasticsearch"]},
{"topics":["arts", "textiles"]}
]}`
But if you want an understanding of what a match was based on, how do you get that?
Elasticsearch does have an explain function, which says what the scoring is made up of using tf-idf, but not specifically which terms were matched.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or some other way.
The only way I can see to get this is to store the search and the retrieved document and then process both to discover the words ES has most likely matched on.
EDIT - for some more clarity on the question
An example in my index is a group which has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
Now in my search, I am looking for Arts among many other things. The term I am searching for comes up in this list many times, and thus should always be a contributor.
What I want in the response is to say that these words were matched: ['arts', 'fine arts', 'art painting', 'arts and crafts'], along with the degree to which each matches, i.e. 'arts' should be higher than the others, but all the others are also relevant.
Elasticsearch allows you to specify the _name field for all queries and filters. This means that you can separate your query into different parts with separate names, which will allow you to determine which parts matched.
For example:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "interests.topics": { "query": "python", "_name": "py-topic" } } },
        { "match": { "interests.topics": { "query": "arts", "_name": "arts-topic" } } }
      ]
    }
  }
}
Then, in your response, you will get back an array of which queries (or filters) matched, and you can determine whether the py-topic query and/or the arts-topic query matched above.
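Concretely, each hit in the response carries a matched_queries array listing the names of the named queries it matched, along the lines of this illustrative fragment (the _id and _score values here are made up):

"hits": [
  {
    "_id": "1",
    "_score": 0.58,
    "matched_queries": [ "arts-topic" ]
  }
]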
