Elasticsearch count terms ignoring spaces - elasticsearch

Using ES 1.2.1
My aggregation
{
"size": 0,
"aggs": {
"cities": {
"terms": {
"field": "city","size": 300000
}
}
}
}
The issue is that some city names have spaces in them and aggregate separately.
For instance Los Angeles
{
"key": "Los",
"doc_count": 2230
},
{
"key": "Angeles",
"doc_count": 2230
},
I assume it has to do with the analyzer? Which one would I use to not split on spaces?

For fields that you want to perform aggregations on I would recommend either the keyword analyzer or do not analyze the field at all. From the keyword analyzer documentation:
An analyzer of type keyword that "tokenizes" an entire stream as a single token. This is useful for data like zip codes, ids and so on. Note, when using mapping definitions, it might make more sense to simply mark the field as not_analyzed.
However, if you still want to analyze the field for other searches, consider using the fields setting of ES 1.x, as described in the field/multi_field documentation. This lets you index one version of the field for searching and another for aggregations.
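As a rough sketch of the multi-field idea, the Python snippet below builds an ES 1.x style mapping and an aggregation body as plain dictionaries. The sub-field name "raw" is an assumption made up for this example; only "city" comes from the question. Nothing here talks to a cluster, it only shows the shape of the request bodies.

```python
import json

# Assumption: "raw" is a hypothetical sub-field name chosen for this sketch.
CITY_MAPPING = {
    "properties": {
        "city": {
            "type": "string",  # analyzed version, usable for full-text search
            "fields": {
                # un-analyzed copy of the same value, usable for aggregations
                "raw": {"type": "string", "index": "not_analyzed"}
            },
        }
    }
}


def cities_aggregation(field="city.raw", size=300000):
    """Build a terms aggregation body that targets the un-analyzed sub-field."""
    return {"size": 0, "aggs": {"cities": {"terms": {"field": field, "size": size}}}}


if __name__ == "__main__":
    print(json.dumps(cities_aggregation(), indent=2))
```

With a mapping like this, the aggregation would run against city.raw while ordinary searches keep using city.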

There are two approaches to solve this.
The not_analyzed way - but this treats the same name written in different letter case as different terms.
The keyword tokenizer way - here terms that differ only in case can be mapped to one term (for example by adding a lowercase token filter).
These two concepts with working code examples are illustrated in this blog.
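To make the difference between the two approaches concrete, here is a small Python simulation. It does not use Elasticsearch; the three functions are simplified stand-ins for what each kind of analysis would produce, and the analyzer name "keyword_lowercase" is made up for this sketch.

```python
from collections import Counter

# Index settings for the keyword-tokenizer approach.
# Assumption: "keyword_lowercase" is a hypothetical analyzer name.
ANALYSIS_SETTINGS = {
    "analysis": {
        "analyzer": {
            "keyword_lowercase": {
                "type": "custom",
                "tokenizer": "keyword",   # the whole value becomes one token
                "filter": ["lowercase"],  # the token is then lowercased
            }
        }
    }
}


def not_analyzed_terms(value):
    """not_analyzed: the value is indexed exactly as it was sent."""
    return [value]


def keyword_lowercase_terms(value):
    """keyword tokenizer + lowercase filter: one token, lowercased."""
    return [value.lower()]


def whitespace_terms(value):
    """Simplified picture of an analyzer that splits on spaces."""
    return value.split()


def terms_aggregation(values, analyze):
    """Count documents per term, like a terms aggregation would."""
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))  # a document counts once per term
    return dict(counts)


cities = ["Los Angeles", "los angeles", "Boston"]
```

Calling terms_aggregation(cities, not_analyzed_terms) keeps "Los Angeles" and "los angeles" apart, while keyword_lowercase_terms merges them into one bucket.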

Related

Elasticsearch: search word forms only

I have a collection of docs, and each doc has a field tags which is an array of strings. Each string is a word.
Example:
[{
"id": 1,
"tags": [ "man", "boy", "people" ]
}, {
"id": 2,
"tags":[ "health", "boys", "people" ]
}, {
"id": 3,
"tags":[ "people", "box", "boxer" ]
}]
Now I need to query only docs that contain the word "boy" and its forms ("boys" in my example). I do not want Elasticsearch to return doc number 3, because "box" is not a form of "boy".
If I use a fuzzy query I get all three docs, including doc number 3, which I do not need. As far as I understand, Elasticsearch uses the Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I get doc number 1 only, not both (1, 2).
I wonder whether it is possible to query docs by word-form matching. Is there a way to make Elasticsearch match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? The "duke" case is more complicated, but I need to support it as well.
Could this be solved with some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could mean handling plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific.
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles edits (insertions, deletions, substitutions, transpositions) independently of any specific language, based solely on the number of edits.
You would need to change the processing of your strings at index time and at search time so that language-specific variations are handled via stemming. This can be achieved by configuring a suitable analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
As you cannot change the type or analyzer of an existing field, you would need to fix your mapping and then re-index everything.
I'm not sure whether Duke and Duchess are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers
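To illustrate why edit distance cannot separate "boys" from "box" while stemming can, here is a small self-contained Python sketch. The levenshtein function is the plain textbook distance (no transpositions, which does not matter for these words), and toy_stem is a made-up stand-in that only strips a plural "s"; it is not the algorithm used by the english analyzer.

```python
def levenshtein(a, b):
    """Textbook edit distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips a single plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def matches_by_stem(query, tags):
    """True if any tag reduces to the same stem as the query word."""
    return toy_stem(query) in {toy_stem(tag) for tag in tags}


docs = {
    1: ["man", "boy", "people"],
    2: ["health", "boys", "people"],
    3: ["people", "box", "boxer"],
}
hits = sorted(doc_id for doc_id, tags in docs.items() if matches_by_stem("boy", tags))
```

Both "boys" and "box" are one edit away from "boy", so a fuzzy query treats them alike, whereas the stem comparison keeps docs 1 and 2 and drops doc 3.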

ElasticSearch Search query is not case sensitive

I am running a search query and it works fine for an exact match, but if the user enters the value in a different case (lowercase or uppercase) it does not work, as ElasticSearch is case insensitive.
example
{
"query" : {
"bool" : {
"should" : {
"match_all" : {}
},
"filter" : {
"term" : {
"city" : "pune"
}
}
}
}
}
It works fine when city is exactly "pune"; if we change the text to "PUNE" it does not work.
ElasticSearch is case insensitive.
"Elasticsearch" is not case-sensitive. A JSON string property will be mapped as a text datatype by default (with a keyword datatype sub or multi field, which I'll explain shortly).
A text datatype has the notion of analysis associated with it. At index time, the string input is fed through an analysis chain, and the resulting terms are stored in an inverted index data structure for fast full-text search. With a text datatype where you haven't specified an analyzer, the default analyzer will be used, which is the Standard Analyzer. One of the components of the Standard Analyzer is the Lowercase token filter, which lowercases tokens (terms).
When it comes to querying Elasticsearch through the search API, there are a lot of different types of query to use, to fit pretty much any use case. One family of queries such as match, multi_match queries, are full-text queries. These types of queries perform analysis on the query input at search time, with the resulting terms compared to the terms stored in the inverted index. The analyzer used by default will be the Standard Analyzer as well.
Another family of queries such as term, terms, prefix queries, are term-level queries. These types of queries do not analyze the query input, so the query input as-is will be compared to the terms stored in the inverted index.
In your example, your term query on the "city" field does not find any matches when capitalized because it's searching against a text field whose input underwent analysis at index time. With the default mapping, this is where the keyword sub field could help. A keyword datatype does not undergo analysis (well, it has a type of analysis with normalizers), so can be used for exact matching, as well as sorting and aggregations. To use it, you would just need to target the "city.keyword" field. An alternative approach could also be to change the analyzer used by the "city" field to one that does not use the Lowercase token filter; taking this approach would require you to reindex all documents in the index.
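The behaviour described above can be reproduced with a toy inverted index in Python. This is only a simulation under simplifying assumptions: standard_like_analyze is a rough stand-in for the Standard Analyzer (split into words, lowercase), and the two query functions only mimic the difference between term-level and full-text queries.

```python
import re
from collections import defaultdict


def standard_like_analyze(text):
    """Rough stand-in for the Standard Analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in standard_like_analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """Term-level query: the input is compared as-is, without analysis."""
    return set(index.get(value, set()))


def match_query(index, text):
    """Full-text query: the input is analyzed first, any term may match."""
    hits = set()
    for term in standard_like_analyze(text):
        hits |= index.get(term, set())
    return hits


docs = {1: "Pune", 2: "New Delhi"}
index = build_inverted_index(docs)
```

Here term_query(index, "PUNE") finds nothing because only the lowercased term "pune" was stored, while match_query(index, "PUNE") analyzes the input and finds doc 1.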
By default, Elasticsearch lowercases the values of text fields during analysis unless you define a custom mapping.
Exact values (like numbers, dates, and keywords) have the exact value
specified in the field added to the inverted index in order to make
them searchable.
However, text fields are analyzed. This means that their values are
first passed through an analyzer to produce a list of terms, which are
then added to the inverted index. There are many ways to analyze text:
the default standard analyzer drops most punctuation, breaks up text
into individual words, and lower cases them.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So if you want to use a term query — analyze the term on your own before querying. Or just lowercase the term in this case.
To solve this issue I created a custom normalizer and updated the mapping to use it.
Before that, we have to delete the index and create it again.
First delete the index:
DELETE http://localhost:9200/users
Now create the index again:
PUT http://localhost:9200/users
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"user": {
"properties": {
"city": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
}
}
}
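A normalizer on a keyword field is applied both to the value at index time and to the input of term-level queries at search time, which is why the term query becomes case insensitive with this mapping. The Python sketch below approximates the effect of the lowercase and asciifolding filters; the Unicode decomposition trick is an assumption of this example and covers fewer characters than the real asciifolding filter.

```python
import unicodedata


def lowercase_ascii_normalize(value):
    """Approximate 'lowercase' + 'asciifolding': drop accents, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def term_matches(indexed_value, query_value):
    """Both sides are normalized, so a term query no longer depends on case."""
    return lowercase_ascii_normalize(indexed_value) == lowercase_ascii_normalize(query_value)


if __name__ == "__main__":
    print(term_matches("Pune", "PUNE"))
```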

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?
Based on the discussions above I came up with the following solution, which is a bit different from what I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
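The idea behind a similarity that always returns 1 is that every matching query term contributes the same amount, so the summed score of an "or" match roughly becomes the number of matched terms. The Python function below is only a toy model of that idea; it ignores fuzziness, boosts and nested documents.

```python
def constant_similarity_score(query_terms, doc_terms):
    """Toy model: each distinct query term found in the doc contributes 1."""
    doc_terms = set(doc_terms)
    return sum(1 for term in set(query_terms) if term in doc_terms)


query = "apple banana juice".split()
scores = {
    "all three": constant_similarity_score(query, ["apple", "banana", "juice"]),
    "two": constant_similarity_score(query, ["apple", "juice"]),
    "one": constant_similarity_score(query, ["juice", "milk"]),
}
```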
Then I had the following problem:
A query for "apple banana juice" gave the same score to a doc with tags ["apple juice", "apple"] and to another doc with tags ["banana", "apple juice"], although I would like to score the second one higher.
From this other discussion I found out that the issue was caused by my using a nested field, so I created a regular text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first should clause is performed on the text field and uses an "or" operator. The second should clause is performed on the nested field and uses an "and" operator.
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well when searching full texts. But does someone know where I can find something more specific to tagged documents?
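The final query could look roughly like the body built by this Python sketch. The field names are assumptions: "tags_flat" is a hypothetical name for the plain text field, and the nested field is assumed to live under the path "tags" with the sub-field "tags.eng" from the question.

```python
def build_tag_query(text):
    """Bool query with two should clauses, as described above."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Clause 1: plain text field, any of the words may match.
                    {"match": {"tags_flat": {"query": text, "operator": "or"}}},
                    # Clause 2: nested field, all words must be in the same tag.
                    {
                        "nested": {
                            "path": "tags",
                            "query": {
                                "match": {
                                    "tags.eng": {"query": text, "operator": "and"}
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


body = build_tag_query("apple banana juice")
```

A doc matching both clauses collects score from each of them, so a single tag containing all three words ranks above the same words spread over several tags.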

Terms aggregation on first three octets of IP

I'm doing a faceted search UI, and one of the facets I want to add is for the first three octets of an IP field.
So for example, given documents with IPs "192.168.1.1", "192.168.1.2", "192.168.2.1", I would want to display the facets "192.168.1 (2)" and "192.168.2 (1)".
Is there an aggregation I can use for this? As far as I can tell, range aggregations require me to predefine the ranges, and term aggregations only take a field.
Obviously the alternative is for me to index the first three octets as a separate field, but of course I would prefer to avoid that.
Thanks!
You can add a path hierarchy tokenizer with a delimiter of '.' and a custom analyzer that uses the tokenizer you just made.
See this question for the syntax:
Elasticsearch - using the path hierarchy tokenizer to access different level of categories
Then you can aggregate terms and you will get results grouped by each number group
{
"key": "192",
"doc_count": 10
},
{
"key": "192.168",
"doc_count": 10
},
...
In the linked answer there is a way to exclude certain aggregation levels. The following should keep only the results that have exactly 3 levels of numbers. Note that exclude takes precedence over include, so an exclude of ".*" would remove every bucket; only include is used here.
"aggs": {
"ipaddr": {
"terms": {
"field": "your_ip_addr",
"include": "[0-9]+\\.[0-9]+\\.[0-9]+"
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
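To see what the tokenizer produces and how the facet counts come out, here is a small Python simulation. It does not call Elasticsearch; path_hierarchy_tokens only mimics the path hierarchy tokenizer with '.' as delimiter, and the regular expression plays the role of an include pattern for exactly three groups of digits.

```python
import re
from collections import Counter


def path_hierarchy_tokens(value, delimiter="."):
    """Mimic the path hierarchy tokenizer: emit every leading prefix."""
    parts = value.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


# Matches tokens made of exactly three dot-separated groups of digits.
THREE_OCTETS = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def three_octet_facets(ips):
    """Count documents per three-octet prefix, like the filtered terms aggregation."""
    counts = Counter()
    for ip in ips:
        for token in path_hierarchy_tokens(ip):
            if THREE_OCTETS.fullmatch(token):
                counts[token] += 1
    return dict(counts)


facets = three_octet_facets(["192.168.1.1", "192.168.1.2", "192.168.2.1"])
```

For the three example IPs from the question this yields the facets "192.168.1" with 2 docs and "192.168.2" with 1 doc.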

Django-Haystack elasticsearch queries

Haystack generates elasticsearch queries to get results from elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
"query_string": {
"query": "django_ct:(customers.customer)"
}
}
What is the meaning of the django_ct(..) query? Is this a function that haystack installs in elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an elasticsearch multi_field. In order to change the queries I want to understand first how haystack generates its own queries.
Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
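from django.contrib.contenttypes.models import ContentType

# django_ct holds the "app_label.model_name" string stored with each document
app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model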
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multifields in the past, so I'm not sure you need to write your own backend. The key is understanding how haystack creates searches. As I said in one of the other posts, everything goes into query_string and from there it creates a lucene based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
"type": "multi_field",
"fields": {
"some_field_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"some_field": {
"type": "string",
"index": "not_analyzed"
}
}
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that have cat exactly and the second will match cat, cats, catlin, catch, etc, at least using my edgengram analyzers.
However, just because you use haystack for indexing and search doesn't mean you have to use it for 100% of your search solutions. In the past, I've used PYES in some areas of the app and haystack in others, because haystack lacked the support for more advanced features and the query_string parsing was losing some of the finer grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticsearch directly for some of the more advanced searches and use haystack for the other, more routine searches.
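If you do build some queries by hand, you will probably want to keep the django_ct restriction that haystack adds. The Python sketch below only assembles a query_string body; the function name is made up for this example, the field names are taken from the mapping above, and no escaping of special Lucene characters is done, so treat it as a starting point rather than a finished helper.

```python
def build_restricted_query(django_ct, field, text):
    """Combine the django_ct restriction with a search on one field.

    Assumption: 'text' contains no characters that need Lucene escaping.
    """
    query = "django_ct:({}) AND {}:({})".format(django_ct, field, text)
    return {"query": {"query_string": {"query": query}}}


body = build_restricted_query("customers.customer", "some_field_edgengram", "cat")
```

The resulting dictionary can be sent as the search body by whichever client library you use alongside haystack.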

Resources