Search-as-you-type on IP datatype in Elasticsearch

We are currently adding search-as-you-type in the UI for some fields in our index.
For string fields, Elasticsearch offers a number of ways of doing this, e.g. a match_phrase_prefix query on the analyzed tokens, or ngrams at index time.
However, as IPv4 addresses are stored internally as long values, wildcard or prefix searching on them is not easily possible, as far as I can see.
One can use range queries to search for IP ranges, but I would rather let the user enter "118" and display matches for "168.1.118.32" as well as "118.43.119.4" and "1.1.1.118".
Is there a built-in way to perform such queries? Or do we need to store the field separately as an analyzed string?

After some more investigation we used a multi-field to store the IP address twice: once as the normal IP type, and a second time as an analyzed value where we split the IP into its four octets so that we can search on these parts separately.
In the index template we use the following pattern tokenizer to split up the value when writing to the index:
"analyzer": {
  "ipv4analyzer": {
    "tokenizer": "ipv4tokenizer"
  }
},
"tokenizer": {
  "ipv4tokenizer": {
    "pattern": "([0-9]{1,3})",
    "type": "pattern",
    "group": "1"
  }
}
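To see what this tokenizer produces, here is a minimal Python sketch that applies the same regular expression outside Elasticsearch. The helper names (ipv4_tokens, matches_octet) and the sample address list are made up for illustration; only the pattern itself comes from the template above.

```python
import re

# Same regular expression as the "ipv4tokenizer" pattern above.
OCTET_PATTERN = re.compile(r"([0-9]{1,3})")


def ipv4_tokens(ip):
    """Return capture group 1 of every match, i.e. one token per octet."""
    return OCTET_PATTERN.findall(ip)


def matches_octet(ip, user_input):
    """True if the user's input equals one of the octet tokens (hypothetical helper)."""
    return user_input in ipv4_tokens(ip)


# Illustrative sample data, not taken from a real index.
addresses = ["168.1.118.32", "118.43.119.4", "1.1.1.118", "10.0.0.1"]
hits = [ip for ip in addresses if matches_octet(ip, "118")]
print(ipv4_tokens("168.1.118.32"))
print(hits)
```

This sketch only matches complete octets. Matching a partially typed octet such as "11" would additionally need a prefix-style query on the analyzed sub-field.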

Related

Unexpected result using Elasticsearch when dash character is involved

I'm querying Elasticsearch 2.3 using django-haystack, and the query that is executed seems to be the following:
'imaging_telescopes:(*\\"FS\\-60\\"*)'
An object in my Elasticsearch data has the following value for its property imaging_telescopes: "Takahashi FSQ-106N".
This object matches the query, and to me this result is unexpected; I don't want it to match.
My assumption is that it matches because it contains the letters FS, but in my frontend I'm just searching for "FS-60".
How can I modify the query so that it's stricter and only finds objects whose imaging_telescopes property contains exactly the given text?
Thanks!
EDIT: this is the mapping of the field:
"imaging_telescopes": {
  "type": "string",
  "analyzer": "snowball"
}

Elasticsearch: search word forms only

I have a collection of docs, each with a field tags that is an array of strings. Each string is a word.
Example:
[{
  "id": 1,
  "tags": [ "man", "boy", "people" ]
}, {
  "id": 2,
  "tags": [ "health", "boys", "people" ]
}, {
  "id": 3,
  "tags": [ "people", "box", "boxer" ]
}]
Now I need to query only the docs that contain the word "boy" and its forms ("boys" in my example). I do not want Elasticsearch to return doc number 3, because it does not contain a form of "boy".
If I use a fuzzy query I get all three docs, including doc number 3, which I do not need. As far as I understand, Elasticsearch uses Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I get only number 1, not both (1, 2).
I wonder whether there is any way to query docs by word-form matching. Is there a way to make Elasticsearch match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? The "duke" case is more complicated, but I need to support it as well.
Could it perhaps be solved using some specific analyzer settings?

With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language-specific.
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independently of any specific language, based solely on the number of edits.
You would need to change the processing of your strings at index time and at search time so that the language-specific variations are handled via stemming. This can be achieved by configuring a suitable analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
  "type": "text",
  "analyzer": "english"
}
As you cannot change the type or analyzer of an existing field, you would need to fix your mapping and then re-index everything.
I'm not sure whether "duke" and "duchess" are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers
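The difference between edit-distance matching and stemming can be illustrated with a small Python sketch on the three example docs. Note the assumptions: toy_stem is a deliberately naive, hypothetical stand-in that only strips a plural "s" and is not the algorithm used by the english analyzer, and edit_distance is plain Levenshtein without transpositions.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def toy_stem(word):
    """Hypothetical stand-in for a stemmer: lowercase and strip one plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


docs = {
    1: ["man", "boy", "people"],
    2: ["health", "boys", "people"],
    3: ["people", "box", "boxer"],
}

# Fuzzy-style matching: any tag within one edit of "boy" counts.
fuzzy_hits = sorted(d for d, tags in docs.items()
                    if any(edit_distance("boy", t) <= 1 for t in tags))

# Stemming-style matching: tag and query must reduce to the same stem.
stem_hits = sorted(d for d, tags in docs.items()
                   if any(toy_stem(t) == toy_stem("boy") for t in tags))

print(fuzzy_hits)
print(stem_hits)
```

"box" is one edit away from "boy", so the fuzzy approach pulls in doc 3, while the stem comparison only keeps docs 1 and 2.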

ElasticSearch Search query is not case sensitive

I am running a search query and it works fine for an exact match, but if the user enters the value in lowercase or uppercase it does not work, as ElasticSearch is case insensitive.
Example:
{
  "query" : {
    "bool" : {
      "should" : {
        "match_all" : {}
      },
      "filter" : {
        "term" : {
          "city" : "pune"
        }
      }
    }
  }
}
It works fine when city is exactly "pune"; if we change the text to "PUNE" it does not work.

ElasticSearch is case insensitive.
"Elasticsearch" is not case-sensitive. A JSON string property will be mapped as a text datatype by default (with a keyword datatype sub-field, or multi-field, which I'll explain shortly).
A text datatype has the notion of analysis associated with it. At index time, the string input is fed through an analysis chain, and the resulting terms are stored in an inverted index data structure for fast full-text search. With a text datatype where you haven't specified an analyzer, the default analyzer will be used, which is the Standard Analyzer. One of the components of the Standard Analyzer is the Lowercase token filter, which lowercases tokens (terms).
When it comes to querying Elasticsearch through the search API, there are a lot of different types of query to use, to fit pretty much any use case. One family of queries, such as the match and multi_match queries, are full-text queries. These types of queries perform analysis on the query input at search time, with the resulting terms compared to the terms stored in the inverted index. The analyzer used by default will be the Standard Analyzer as well.
Another family of queries, such as the term, terms and prefix queries, are term-level queries. These types of queries do not analyze the query input, so the query input as-is will be compared to the terms stored in the inverted index.
In your example, your term query on the "city" field does not find any matches when capitalized because it's searching against a text field whose input underwent analysis at index time, so only the lowercased term "pune" is in the inverted index. With the default mapping, this is where the keyword sub-field could help. A keyword datatype does not undergo analysis (well, it has a type of analysis with normalizers), so it can be used for exact matching, as well as sorting and aggregations. To use it, you would just need to target the "city.keyword" field. An alternative approach could also be to change the analyzer used by the "city" field to one that does not use the Lowercase token filter; taking this approach would require you to reindex all documents in the index.
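The difference between the two query families can be mimicked in a few lines of Python. This is only a toy model under stated assumptions: standard_like_analyze is a rough, hypothetical stand-in for the Standard Analyzer (split on non-word characters, then lowercase), and the "index" is just a set of terms.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the Standard Analyzer: tokenize, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Index time: the text field's value is analyzed before its terms are stored.
indexed_terms = set(standard_like_analyze("Pune"))


def term_query(value):
    """Term-level query: the input is compared to the stored terms as-is."""
    return value in indexed_terms


def match_query(value):
    """Full-text query: the input is analyzed first, then compared."""
    return any(term in indexed_terms for term in standard_like_analyze(value))


print(term_query("pune"), term_query("PUNE"), match_query("PUNE"))
```

In this model the term lookup for "PUNE" fails because only "pune" was stored, while the match lookup lowercases the input first and succeeds.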

Elasticsearch will lowercase the terms of a text field during analysis unless you define a custom mapping.
Exact values (like numbers, dates, and keywords) have the exact value
specified in the field added to the inverted index in order to make
them searchable.
However, text fields are analyzed. This means that their values are
first passed through an analyzer to produce a list of terms, which are
then added to the inverted index. There are many ways to analyze text:
the default standard analyzer drops most punctuation, breaks up text
into individual words, and lower cases them.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So if you want to use a term query, analyze the term yourself before querying. Or, in this case, just lowercase the term.

To solve this issue I created a custom normalizer and updated the mapping to use it. Before that, we have to delete the index and create it again.
First, delete the index:
DELETE http://localhost:9200/users
Now create the index again:
PUT http://localhost:9200/users
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "city": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}
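As a rough illustration of what this normalizer does to a keyword value, here is a Python approximation. It is an assumption-laden sketch: the function name simply mirrors the normalizer name above, and stripping combining marks after Unicode decomposition only approximates the asciifolding filter, which covers more characters than this.

```python
import unicodedata


def lowercase_normalizer(value):
    """Approximate the "lowercase" + "asciifolding" filters on one keyword value."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Dropping combining marks folds accented Latin letters to their base letter.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for city in ["pune", "PUNE", "Pune", "Créteil"]:
    print(city, "->", lowercase_normalizer(city))
```

Because the normalizer is applied both when the keyword is indexed and when a term query is run against the field, "PUNE", "Pune" and "pune" all end up as the same term.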

Ngram Tokenizer on field, not on query

I'm having trouble finding the solution for a use case here.
Basically, it's pretty simple: I need to perform a "contains" query, like SQL's LIKE '%...%'.
I've seen there is a regexp query, which I actually managed to get working perfectly, but as it seems to scale badly, I'm trying out nGrams. Now, I've played around with them before and know "how they work", but the behaviour isn't what I expect.
Basically, I've configured my analyzer with min_gram = 2 and max_gram = 20. Say I index a user called "Christophe". I want the query "Chris" to match, which it does, since "Chris" is a 5-gram of "Christophe". The problem is that "Risotto" matches as well, because it gets broken down into nGrams too, and ultimately "is" is a 2-gram of "Christophe".
What I need is for the analyzer to break down the indexed field into nGrams at indexing time, and compare those to the FULL text query. "Risotto" should match "Risotto", "XXXRisottoXXX" and so on, but not "Risolo" or anything else where only some nGrams match.
Is there any solution?

You need to use the search_analyzer setting to have distinct index-time and search-time analyzers.
Sample from the docs:
"mappings": {
  "my_type": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
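The effect of separating the two analyzers can be sketched in plain Python with the "Christophe" / "Risotto" example from the question. The function names are hypothetical, and the whole-term lookup only imitates a search analyzer that keeps a single-word query as one lowercased term.

```python
def ngrams(text, min_gram=2, max_gram=20):
    """Return every substring of text with a length from min_gram to max_gram."""
    text = text.lower()
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


# Index time: the field value is always broken into n-grams.
indexed = ngrams("Christophe")


def match_with_ngram_search_analyzer(query):
    """Same analyzer on both sides: any shared n-gram counts as a match."""
    return bool(indexed & ngrams(query))


def match_with_whole_term_search_analyzer(query):
    """Distinct search analyzer: the query stays one lowercased term."""
    return query.lower() in indexed


print(match_with_ngram_search_analyzer("Risotto"))
print(match_with_whole_term_search_analyzer("Risotto"))
print(match_with_whole_term_search_analyzer("Chris"))
```

One consequence of this setup is that a query term longer than max_gram can never match, since no indexed n-gram is that long.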

Django-Haystack elasticsearch queries

Haystack generates Elasticsearch queries to get results from Elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
  "query_string": {
    "query": "django_ct:(customers.customer)"
  }
}
What is the meaning of the django_ct(..) query? Is this a function that Haystack installs in Elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an Elasticsearch multi_field. In order to change the queries I first want to understand how Haystack generates its own queries.

Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
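from django.contrib.contenttypes.models import ContentType

# django_ct holds "<app_label>.<model_name>", e.g. "customers.customer"
app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model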
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multi-fields in the past, so I'm not sure you need to write your own backend. The key is understanding how Haystack creates searches. As I said in one of the other posts, everything goes into query_string and from there it creates a Lucene-based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
  "type": "multi_field",
  "fields": {
    "some_field_edgengram": {
      "type": "string",
      "index": "analyzed",
      "index_analyzer": "autocomplete_index",
      "search_analyzer": "autocomplete_search"
    },
    "some_field": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that have cat exactly and the second will match cat, cats, catlin, catch, etc, at least using my edgengram analyzers.
However, just because you use haystack for indexing and search doesn't mean you have to use it for 100% of your search solutions. In the past, I've used PYES in some areas of the app and haystack in others, because haystack lacked the support for more advanced features and the query_string parsing was losing some of the finer grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticsearch directly for some of the more advanced searches, and use Haystack for the other, more routine searches.
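If you do bypass Haystack for some searches, a query body that keeps the django_ct restriction could be assembled roughly as below. This is a sketch under assumptions: build_search_body is a hypothetical helper, not part of Haystack or any Elasticsearch client, and it does not escape Lucene special characters in the search text.

```python
def build_search_body(content_type, field, text):
    """Build a query_string body that keeps the django_ct restriction.

    Hypothetical helper; assumes `text` contains no Lucene special characters.
    """
    query = "django_ct:({}) AND {}:({})".format(content_type, field, text)
    return {"query": {"query_string": {"query": query}}}


body = build_search_body("customers.customer", "some_field_edgengram", "cat")
print(body)
```

The resulting dict can be passed as the request body of a search call in whichever client library you use.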
