Determining which words were matched in a fuzzy search - elasticsearch

I'm running a fuzzy search, and need to see which words were matched. For example, if I am searching for the query testing, and it matches a field with the sentence The boy was resting, I need to be able to know that the match was due to the word resting.
I tried setting the parameter explain = true, but it doesn't seem to contain the information I need. Any thoughts?

Alright, this is what I was looking for:
After a bit of research, I found the highlighting feature of Elasticsearch.
By default it returns a snippet of context surrounding the match, but you can set the fragment size to the query length so that only the matched term is returned. For example:
{
    query: query,
    highlight: {
        "fields": {
            "text": {
                "fragment_size": query.length
            }
        }
    }
}
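As a concrete, runnable sketch of the same idea (assuming an index where the matched field is called text, as in the script in the answer below; the fragment size 7 is just the length of the query "testing"):
curl -XPOST "http://localhost:9200/play/_search?pretty" -d '
{
    "query": {
        "match": {
            "text": {
                "query": "testing",
                "fuzziness": 1
            }
        }
    },
    "highlight": {
        "fields": {
            "text": {
                "fragment_size": 7
            }
        }
    }
}
'
The highlight section of each hit should then contain only the matched term, e.g. <em>resting</em>.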

Using explain should give you some clues, although the information is not very easily accessible.
If you run the following (also available at https://www.found.no/play/gist/daa46f0e14273198691a), you should see entries like description: "weight(text:nesting^0.85714287 in 1) […]", description: "weight(text:testing in 1) [PerFieldSimilarity] […]" and so on in the hit's _explanation.
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create index
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{}'
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"text":"The boy was resting"}
{"index":{"_index":"play","_type":"type"}}
{"text":"The bird was testing while nesting"}
'
# Do searches
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "match": {
            "text": {
                "query": "testing",
                "fuzziness": 1
            }
        }
    },
    "explain": true
}
'

Related

Using a string to build Query DSL for Elasticsearch

I'm using Meteor (so JavaScript, Node, NPM, etc.) and would like to provide a simple text input for users to search via Elasticsearch. I would like to be able to use modifiers on the text like + and "" and to search within a specific field. I'm looking for something that can convert a plain text input into Elasticsearch Query DSL.
These would be some example queries:
This query would mean that the keyword "tatooine" must exist:
stormtrooper +tatooine
This would mean that "death star" should be one keyword:
stormtrooper "death star"
This would search for the keyword "bloopers" only in the category field:
stormtrooper category=bloopers
Is there a library that can do this? Can a generic solution exist or is this why I can't find any existing answers to this?
simple_query_string supports your query syntax out of the box, except for category=bloopers, which should be category:bloopers instead; otherwise it should work:
curl -XPOST localhost:9200/your_index/_search -d '{
    "query": {
        "simple_query_string": {
            "query": "stormtrooper category:bloopers"
        }
    }
}'
curl -XPOST localhost:9200/your_index/_search -d '{
    "query": {
        "simple_query_string": {
            "query": "stormtrooper +tatooine"
        }
    }
}'
You can also send the query in the query string directly like this:
curl -XPOST "localhost:9200/your_index/_search?q=stormtrooper%20%22death%20star%22"

Substring and similarity matching in elasticsearch

I am learning to use Elasticsearch as an alternative to database queries, and I am not able to perform substring matches on the built index.
The mapping I have used to create index is
"mappings" : {
"user" : {
"properties" : {
"name" : {"type": "string"},
"specialty" : {"type": "string" ,"analyzer":"snowball"},
"address : {"type": "string" ,"analyzer":"snowball"}
}
}
}
The document I am indexing is
{
    "name" : "John Doe",
    "speciality": ["pediatrician", "Child Doctor"],
    "address": ["#123 park road Abbeyville", "#423 park road AbbeyTown"]
}
When I perform a search like
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatrician
I get the correct document
However when I search strings like
curl -XGET localhost:9200/test/user/_search?q=speciality:pedia
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatricians
No results are retrieved
P.S. I know that wildcards can be used for matching, but I need to be able to search for both whole words and parts of words based on user input, so as to return the most relevant documents.
Did you try reindexing after changing the mapping? Also try setting the search analyzer to snowball in the settings.
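For illustration, a minimal sketch of what that mapping change might look like (index, type and field names are taken from the question; you would need to delete and recreate the index, then reindex, for the change to take effect):
curl -XPUT "localhost:9200/test" -d '
{
    "mappings": {
        "user": {
            "properties": {
                "speciality": {
                    "type": "string",
                    "analyzer": "snowball",
                    "search_analyzer": "snowball"
                }
            }
        }
    }
}
'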
UPDATE:
You can go for a wildcard search. Prefer a trailing wildcard alone rather than combining leading and trailing wildcards, since leading wildcards are much more expensive.
curl -XGET localhost:9200/test/user/_search?q=speciality:pedia*
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatricians*

Keyword search in ElasticSearch with no regards to the schema

Is it possible to use ElasticSearch to do keyword searches, exactly like in a search engine?
Let me rephrase:
As far as I understand, an Elasticsearch term query requires you to specify which field(s) to search for keywords in.
Given the fact that Elasticsearch can be "schemaless", I wish I could declare a query that can search for keywords in any field.
Is there a syntax for that?
You're looking for the behavior provided by the _all field, which is enabled by default:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html
Here's a runnable example: https://www.found.no/play/gist/14688f48c75b9931272b
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"foo":"bar"}
{"index":{"_index":"play","_type":"type"}}
{"something_else":"foo bar"}
'
# Do searches
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "match": {
            "_all": {
                "query": "bar"
            }
        }
    }
}
'

Is there a way to "escape" ElasticSearch stop words?

I am fairly new to Elasticsearch and have a question on stop words. I have an index that contains state names for the USA, e.g. New York/NY, California/CA, Oregon/OR. I believe Oregon's abbreviation, 'OR', is a stop word, so when I insert the state data into the index, I cannot search on 'OR'. Is there a way I can set up custom stop words for this, or am I doing something wrong?
Here is how I am building the index:
curl -XPUT http://localhost:9200/test/state/1 -d '{"stateName": ["California","CA"]}'
curl -XPUT http://localhost:9200/test/state/2 -d '{"stateName": ["New York","NY"]}'
curl -XPUT http://localhost:9200/test/state/3 -d '{"stateName": ["Oregon","OR"]}'
A search for 'NY', works fine. Ex:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "NY"
        }
    }
}'
But a search for 'OR' returns zero hits:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "OR"
        }
    }
}'
I believe this search returns no results because OR is a stop word, but I don't know how to work around this. Thanks for your help.
You can (and definitely should) control the way you index data by modifying your mapping according to your data and the way you want to search against it.
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to. The point is that you're using the default mapping which is great to start with, but as you can see you need to tweak it depending on your needs.
For each field, you can specify what analyzer to use. An analyzer defines the way you split your text into tokens (tokenizer) that will be indexed and also additional changes you can make to each token (even remove or add new ones) using token filters.
You can specify your mapping either while creating your index or update it afterwards using the put mapping api (as long as the changes you make are backwards compatible).
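As a rough sketch of the first option (index, type and field names are taken from the question; the analyzer name no_stopwords is made up), you could create the index with an analyzer whose stopword list is explicitly empty and assign it to the stateName field:
curl -XPUT "http://localhost:9200/test" -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "no_stopwords": {
                    "type": "standard",
                    "stopwords": "_none_"
                }
            }
        }
    },
    "mappings": {
        "state": {
            "properties": {
                "stateName": {
                    "type": "string",
                    "analyzer": "no_stopwords"
                }
            }
        }
    }
}
'
After recreating the index like this and reindexing the three state documents, the match query for 'OR' above should return the Oregon document.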

Queries vs. Filters

I can't see any description of when I should use a query or a filter or some combination of the two. What is the difference between them? Can anyone please explain?
The difference is simple: filters are cached and don't influence the score, and are therefore faster than queries. Have a look here too. Let's say a query is usually something that the users type and is pretty much unpredictable, while filters help users narrow down the search results, for example using facets.
This is what the official documentation says:
As a general rule, filters should be used instead of queries:
for binary yes/no searches
for queries on exact values
As a general rule, queries should be used instead of filters:
for full text search
where the result depends on a relevance score
An example (try it yourself)
Say index myindex contains three documents:
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hello world!" }'
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hello world! I am Sam." }'
curl -XPOST localhost:9200/myindex/mytype -d '{ "msg": "Hi Stack Overflow!" }'
Query: How well a document matches the query
Query hello sam (using keyword must)
curl localhost:9200/myindex/_search?pretty -d '
{
"query": { "bool": { "must": { "match": { "msg": "hello sam" }}}}
}'
Document "Hello world! I am Sam." is assigned a higher score than "Hello world!", because the former matches both words in the query. Documents are scored.
"hits" : [
...
"_score" : 0.74487394,
"_source" : {
"name" : "Hello world! I am Sam."
}
...
"_score" : 0.22108285,
"_source" : {
"name" : "Hello world!"
}
...
Filter: Whether a document matches the query
Filter hello sam (using keyword filter)
curl localhost:9200/myindex/_search?pretty -d '
{
"query": { "bool": { "filter": { "match": { "msg": "hello sam" }}}}
}'
Documents that contain either hello or sam are returned. Documents are NOT scored.
"hits" : [
...
"_score" : 0.0,
"_source" : {
"name" : "Hello world!"
}
...
"_score" : 0.0,
"_source" : {
"name" : "Hello world! I am Sam."
}
...
Unless you need full text search or scoring, filters are preferred because frequently used filters will be cached automatically by Elasticsearch, to speed up performance. See Elasticsearch: Query and filter context.
Filters -> Does this document match? a binary yes or no answer
Queries -> Does this document match? How well does it match? uses scoring
A few more additions to the same.
A filter is applied first, and then the query is processed over its results. To store the binary true/false match per document, a data structure called a BitSet is used.
This BitSet lives in memory and is reused the second time the same filter is run; this is how the cached result is utilized.
One more point to note here: the filter cache is only built when the request is executed, so the advantage of caching only kicks in from the second hit.
You can use the warmers API to get around this. When you register a query with a filter as a warmer, it is executed against every new segment as it comes live, so you get consistent speed from the first execution.
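A sketch of registering such a warmer against the myindex example above (the warmer name warmer_hello_sam is made up; note this assumes the older warmers API, which was removed in later Elasticsearch versions):
curl -XPUT "localhost:9200/myindex/_warmer/warmer_hello_sam" -d '
{
    "query": { "bool": { "filter": { "match": { "msg": "hello sam" }}}}
}
'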
Basically, a query is used when you want to perform a search on your documents with scoring.
And filters are used to narrow down the set of results obtained by using a query. Filters are boolean.
For example, say you have an index of restaurants, something like Zomato.
Now you want to search for restaurants that serve 'pizza', which is basically your search keyword.
So you will use a query to find all the documents containing "pizza", and some results will be obtained.
Say now you want a list of restaurants that serve pizza and have a rating of at least 4.0.
So you will use the keyword "pizza" in your query and apply a filter for the rating of 4.0.
What happens is that the filter is applied on the results obtained by querying your index.
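A hedged sketch of that combination (the index name restaurants and the field names name and rating are made up for illustration):
curl -XPOST "localhost:9200/restaurants/_search?pretty" -d '
{
    "query": {
        "bool": {
            "must":   { "match": { "name": "pizza" } },
            "filter": { "range": { "rating": { "gte": 4.0 } } }
        }
    }
}
'
Only the match clause contributes to the score; the range clause just excludes restaurants rated below 4.0 and can be cached.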
Since version 2 of Elasticsearch, filters and queries have been merged and any query clause can be used as either a filter or a query (depending on the context). As with version 1, filters are cached and should be used if scoring does not matter.
Source: https://logz.io/blog/elasticsearch-queries/
Queries : calculate score; thus they’re able to return results sorted by relevance.
Filters : don’t calculate score, making them faster and easier to cache.
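To illustrate the merge described above: before version 2 the same search would typically be written with the dedicated filtered query, while from 2.x onwards the filter simply becomes a clause of bool (field and index names reused from the myindex example above):
# Elasticsearch 1.x style
curl -XPOST "localhost:9200/myindex/_search?pretty" -d '
{
    "query": {
        "filtered": {
            "query":  { "match": { "msg": "hello sam" } },
            "filter": { "term":  { "msg": "hello" } }
        }
    }
}
'
# Elasticsearch 2.x+ style
curl -XPOST "localhost:9200/myindex/_search?pretty" -d '
{
    "query": {
        "bool": {
            "must":   { "match": { "msg": "hello sam" } },
            "filter": { "term":  { "msg": "hello" } }
        }
    }
}
'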
