Elasticsearch 'bulk' fuzzy match - elasticsearch

I want to leverage ES fuzzy matching to normalize city names. For example, for inputs such as "Chicago", "Chicago, IL", "Chicago -IL", "Chicago, USA", "Chicago land", "greater chicago area", etc., I want to return one standard city name ("Chicago,IL").
Simple fuzzy match works fine.
PUT /fuzzy_items/city/_mapping
{
  "city": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "simple"
      }
    }
  }
}
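For reference, a single fuzzy lookup against this mapping looks something like the following (a sketch; the fuzziness value is illustrative):
GET /fuzzy_items/city/_search
{
  "query": {
    "match": {
      "name": {
        "query": "chicago,USA",
        "fuzziness": "AUTO"
      }
    }
  }
}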
However, I have several documents (e.g. 10,000) which need to be normalized. Other operations have a bulk API; i.e. if I need to index 10,000 docs, the bulk API is handy for achieving that in fewer calls.
Is there a similar feature for fuzzy match? Meaning, can I send an array of inputs ["chicago,USA", "Minneapolis, MN"] and expect a fuzzy-matched response array ["Chicago,IL", "Minneapolis,MN"]?
The problem I'm trying to overcome is how not to make thousands of calls to fuzzy match, and how to achieve the same result with fewer calls.
Or is there a different way of achieving this, maybe using Elasticsearch scripting?
Appreciate any suggestions.
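One avenue that may fit (an assumption on my part, not something confirmed in this thread): the Multi Search API (_msearch) batches independent searches into a single request. Its NDJSON body pairs a header line (empty here, since the index is in the URL) with each search body. A sketch reusing the index and field above:
GET /fuzzy_items/_msearch
{}
{"query": {"match": {"name": {"query": "chicago,USA", "fuzziness": "AUTO"}}}, "size": 1}
{}
{"query": {"match": {"name": {"query": "Minneapolis, MN", "fuzziness": "AUTO"}}}, "size": 1}
The responses array in the result lines up positionally with the submitted searches, so the top hit of each entry can serve as the normalized city name.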

Related

Prevent Elasticsearch from matching target phrase multiple times in document

I am an Elasticsearch newbie.
How can one make Elasticsearch give a higher rank to documents that more precisely match the input string?
For example, suppose we have the query
{
  "query": {
    "match": {
      "name": "jones"
    }
  }
}
Suppose we have two documents:
Doc1: "name" : "jones"
Doc2: "name" : "jones jones jones jones jones"
I want Doc1 to be ranked more highly, since it is a more precise match. How can I do this?
(Hopefully, in the most general possible way -- e.g. what if everywhere above 'jones' were replaced with 'fred jones')
Perhaps there are two approaches:
Maybe you can tell ES, "hey, for this query a high term frequency should not be rewarded" (which seems to go against the core of ES: TF-IDF very strongly wants to reward a high TF, or term frequency).
Maybe you can tell ES, "prefer shorter matches over longer ones" (maybe using script_score?).
Surprised that I can't find answers to this question elsewhere. I must be missing something very fundamental.
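A sketch of the second idea, assuming reindexing is an option: a token_count sub-field (here named name.length, my own naming) records how many tokens the name field contains, and a function_score query divides the relevance score by that length, so shorter fields win. The index name people is made up, and the syntax assumes a reasonably recent Elasticsearch version.
PUT /people
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

GET /people/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "name": "jones" } },
      "script_score": {
        "script": {
          "source": "_score / doc['name.length'].value"
        }
      }
    }
  }
}
With this, Doc1 (one token) ends up above Doc2 (five tokens), and the same logic carries over unchanged if "jones" is replaced with "fred jones" everywhere.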

Elasticsearch: search word forms only

I have a collection of docs, and they have a field tags which is an array of strings. Each string is a word.
Example:
[{
  "id": 1,
  "tags": [ "man", "boy", "people" ]
}, {
  "id": 2,
  "tags": [ "health", "boys", "people" ]
}, {
  "id": 3,
  "tags": [ "people", "box", "boxer" ]
}]
Now I need to query only the docs which contain the word "boy" and its forms ("boys" in my example). I do not want Elasticsearch to return doc number 3, because "box" and "boxer" are not forms of "boy".
If I use a fuzzy query I will get all three docs, including doc number 3, which I do not need. As far as I understand, Elasticsearch uses the Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I will get doc number 1 only, but not both (1 and 2).
I wonder whether there is any way to query docs by word-form matching. Is there a way to make Elasticsearch match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? The "duke" case is more complicated, but I need to support it as well.
Probably it could be solved using some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independent of a specific language, solely based on the number if edits.
You would need to change the processing of your strings at indexing and at search time to get the language-specific variations addressed via stemming. This can be achieved by configuring a suitable an analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
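To verify what the stemmer produces for a given input, the _analyze API can be used; "boys" should reduce to the same token as "boy" (a quick sketch):
GET /_analyze
{
  "analyzer": "english",
  "text": "boys"
}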
As you cannot change the type or analyzer of an existing index, you would need to fix your mapping and then re-index everything.
I'm not sure whether "duke" and "duchess" are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
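Such a customised analyzer might look like the following (a sketch: the index name docs, the filter name tag_synonyms, and the analyzer name english_with_synonyms are all made up, and the synonym list would have to cover each word pair you care about):
PUT /docs
{
  "settings": {
    "analysis": {
      "filter": {
        "tag_synonyms": {
          "type": "synonym",
          "synonyms": [ "duke, duchess" ]
        }
      },
      "analyzer": {
        "english_with_synonyms": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "tag_synonyms", "porter_stem" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "english_with_synonyms"
      }
    }
  }
}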
See also Elasticsearch Reference: Language Analyzers

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the previous link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
  "query": {
    "match": {
      "name": "George's system"
    }
  }
}
GET /telemetry/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "operator": "and",
          "fields": ["systemId"]
        }
      }
    }
  }
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console), is there a way to visualize it in Kibana?
With Elasticsearch there is no way to execute two nested queries, as in a relational database where one query uses the response of the other. The example in the application-side join link means that you actually make two queries (two different requests to Elasticsearch) on the application side:
With the first query you get the list of ids you need to filter on.
With the second query you pass the list of ids you got into a terms filter.
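In Dev Tools console syntax, the two round trips might look like this (a sketch; _source is disabled in the first request because the document ids come back in the hits metadata anyway, and the ids in the second request are placeholders for whatever the first response returns):
GET /system/_search
{
  "_source": false,
  "query": {
    "match": {
      "name": "George's system"
    }
  }
}

GET /telemetry/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "systemId": ["system-id-1", "system-id-2"]
        }
      }
    }
  }
}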
This works as long as you have no more than 1024 values for systemId, because the terms query has a limit on the number of terms.
Because this kind of query is not feasible in a single request, you can't visualize it in Kibana.
In that case you have to sacrifice a little space and add the systemId to your mapping.
Good Luck!

Add custom comparatorClass in Solr

I am a newbie in Solr. I want to add a custom comparatorClass in Solr. I also need to use the fields term and count, which I have defined in my schema.xml, in my custom class.
Structure of the indexed documents:
"docs": [
{
"count": 98,
"term": "age",
},
{
"count": 6,
"term": "age assan",
},
{
"count": 5,
"term": "age but",
},
{
"count": 10,
"term": "age salman",
}]
I have stored n-grams with a term and its count, but Solr computes its own frequency, which I don't need; I want my own count, which I have defined for each term. I need results sorted by that frequency (count) first and then by edit distance. Do I need to implement this by creating my own comparator class, or is there something else that would help? Please share.
How can I do this? Any help please.
Thanks.
You should be able to do this without implementing a custom similarity class. The first requirement is (from your description) a straightforward sort on the count value, while the latter can be implemented by sorting on the value from the strdist() function. You can also multiply or weight these values against each other in a single sort statement by using several functions.
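As a sketch, such a combined sort could be expressed directly in the request parameters (shown unencoded for readability; the core name mycore and the query term "age" are placeholders, and edit selects the Levenshtein measure in strdist):
http://localhost:8983/solr/mycore/select?q=*:*
    &sort=count desc, strdist("age", term, edit) desc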
If you really, really need to build your own scorer (which I don't think you need to do, based on your description): these are usually written to explore ranking algorithms other than tf/idf, bm25, etc. for larger corpora. A search on Google gives you many resources with pre-made, easy-to-adopt solutions. I particularly want to point out "This is the Nuclear Option" in Build Your Own Custom Lucene Query and Scorer:
Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths [...]

Django-Haystack elasticsearch queries

Haystack generates Elasticsearch queries to get results from Elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
"query_string": {
"query": "django_ct:(customers.customer)"
}
}
What is the meaning of the django_ct:(..) query? Is it a function that Haystack installs in Elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an elasticsearch multi_field. In order to change the queries I want to understand first how haystack generates its own queries.
Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
from django.contrib.contenttypes.models import ContentType

# django_ct carries a value like "customers.customer"
app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multi-fields in the past, so I'm not sure you need to write your own backend. The key is understanding how Haystack creates searches. As I said in one of the other posts, everything goes into query_string, and from there it creates a Lucene-based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
"type": "multi_field",
"fields": {
"some_field_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"some_field": {
"type": "string",
"index": "not_analyzed"
}
}
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example, SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that are exactly cat, while the second will match cat, cats, catlin, catch, etc., at least using my edgengram analyzers.
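For reference, the second of those filters ends up as a query_string search against the edgengram sub-field, roughly like this (a sketch; the exact string Haystack builds varies by version, and the index name haystack is a placeholder):
GET /haystack/_search
{
  "query": {
    "query_string": {
      "query": "some_field_edgengram:(cat)"
    }
  }
}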
However, just because you use Haystack for indexing and search doesn't mean you have to use it for 100% of your search solution. In the past, I've used PYES in some areas of the app and Haystack in others, because Haystack lacked support for more advanced features and the query_string parsing was losing some of the finer-grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticsearch directly for the more advanced searches, and use Haystack for the other, more routine searches.
