Django-Haystack elasticsearch queries - elasticsearch

Haystack generates elasticsearch queries to get results from elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
"query_string": {
"query": "django_ct:(customers.customer)"
}
}
What is the meaning of the django_ct(..) query? Is this a function that haystack installs in elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an elasticsearch multi_field. In order to change the queries I want to understand first how haystack generates its own queries.

Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multifields in the past, so I'm not sure you need to write you own backend. The key is understanding how haystack creates searches. As I said in one of the other posts, everything goes into query_string and from there it creates a lucene based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
"type": "multi_field",
"fields": {
"some_field_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"some_field": {
"type": "string",
"index": "not_analyzed"
}
}
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that have cat exactly and the second will match cat, cats, catlin, catch, etc, at least using my edgengram analyzers.
However, just because you use haystack for indexing and search doesn't mean you have to use it for 100% of your search solutions. In the past, I've used PYES in some areas of the app and haystack in others, because haystack lacked the support for more advanced features and the query_string parsing was losing some of the finer grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticseach directly for some more advanced searches and use haystack for the other more routine searches.

Related

Elasticsearch query multi_match

I'm trying to create an elasticsearch query that looks for multiple fields. This works fine so far. However, I would like to refine this.
Let's say the word was indexed: "test". However, when I search for "tes" he does not find that word for me, but I would like to show it already - but the combination with my query brings me to a challenge.
{
"multi_match" : {
"query": "*" + query + "*",
"type": "cross_fields",
"operator": "and",
"fields": ["article.number^1","article.name_de^1", "article.name_en^5", "article.name_fr^5", "article.description^1"],
"tie_breaker": 0,
}
Depending on your constraints, here are your options.
If you wish to use wildcard before/after your search term, you can use wildcard query. This has high processing cost at query time.
If you are fine with additional storage cost, you can opt to tokenize your input during analysis process. See ngram tokenizer. Beware that if you have long strings, it can quickly explode the storage requirement.

ElasticSearch Search query is not case sensitive

I am trying to search query and it working fine for exact search but if user enter lowercase or uppercase it does not work as ElasticSearch is case insensitive.
example
{
"query" : {
"bool" : {
"should" : {
"match_all" : {}
},
"filter" : {
"term" : {
"city" : "pune"
}
}
}
}
}
it works fine when city is exactly "pune", if we change text to "PUNE" it does not work.
ElasticSearch is case insensitive.
"Elasticsearch" is not case-sensitive. A JSON string property will be mapped as a text datatype by default (with a keyword datatype sub or multi field, which I'll explain shortly).
A text datatype has the notion of analysis associated with it; At index time, the string input is fed through an analysis chain, and the resulting terms are stored in an inverted index data structure for fast full-text search. With a text datatype where you haven't specified an analyzer, the default analyzer will be used, which is the Standard Analyzer. One of the components of the Standard Analyzer is the Lowercase token filter, which lowercases tokens (terms).
When it comes to querying Elasticsearch through the search API, there are a lot of different types of query to use, to fit pretty much any use case. One family of queries such as match, multi_match queries, are full-text queries. These types of queries perform analysis on the query input at search time, with the resulting terms compared to the terms stored in the inverted index. The analyzer used by default will be the Standard Analyzer as well.
Another family of queries such as term, terms, prefix queries, are term-level queries. These types of queries do not analyze the query input, so the query input as-is will be compared to the terms stored in the inverted index.
In your example, your term query on the "city" field does not find any matches when capitalized because it's searching against a text field whose input underwent analysis at index time. With the default mapping, this is where the keyword sub field could help. A keyword datatype does not undergo analysis (well, it has a type of analysis with normalizers), so can be used for exact matching, as well as sorting and aggregations. To use it, you would just need to target the "city.keyword" field. An alternative approach could also be to change the analyzer used by the "city" field to one that does not use the Lowercase token filter; taking this approach would require you to reindex all documents in the index.
Elasticsearch will analyze the text field lowercase unless you define a custom mapping.
Exact values (like numbers, dates, and keywords) have the exact value
specified in the field added to the inverted index in order to make
them searchable.
However, text fields are analyzed. This means that their values are
first passed through an analyzer to produce a list of terms, which are
then added to the inverted index. There are many ways to analyze text:
the default standard analyzer drops most punctuation, breaks up text
into individual words, and lower cases them.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So if you want to use a term query — analyze the term on your own before querying. Or just lowercase the term in this case.
To Solve this issue i create custom normalization and update mapping to add,
before we have to delete index and add it again
First Delete the index
DELETE PUT http://localhost:9200/users
now create again index
PUT http://localhost:9200/users
{
"settings": {
"analysis": {
"normalizer": {
"lowercase_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"user": {
"properties": {
"city": {
"type": "keyword",
"normalizer": "lowercase_normalizer"
}
}
}
}
}

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the pervious link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
"query": {
"match": {
"name": "George's system"
}
}
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?
With elastic there is no way to execute two nested queries like in a relational database where the first query uses the response of the second. The example in the application-side join, means that you are actually making two queries (two different requests to elastic) on the application side.
First query you get the list of ids you need to filter on.
Second query you pass the list of ids that you got to the terms filter.
This works when you have no more than 1024 values for systemId. Because terms query has a limit on the number of terms.
Because this query is not feasible, then you can't visualize it in kibana.
In such case you have to sacrifice a little of space and add the systemId to your mapping.
Good Luck!

Ngram Tokenizer on field, not on query

I'm having trouble finding the solution for a use case here.
Basically, it's pretty simple : I need to perform a "contains" query, like a SQL like '%...%'.
I've seen there is a regexp query, which I actually managed to get working perfectly, but as it seems to scale badly, i'm trying out nGrams. Now, I've played around with them before and know "how they work", but the behaviour isn't the one I expect it to be.
Basically, i've configured my analyzer to be mingram =2, maxgram = 20. Say I index a user called "Christophe". I want the query "Chris" to actually match, which it does, since Chris is a 5-gram of Christophe. The problem is, "Risotto" matches aswell, because it gets broken down into Ngrams and ultimately "is" is a 2-gram of "Christophe" and so it matches aswell.
What I need is the analyzer to actually break down the indexed field in nGrams at indexing time, and compare those to the FULL text query. Risotto should match Risotto, XXXRisottoXXX and so on, but not Risolo or something where the nGrams do match.
Is there any solution ?
You need to use search_analyzer setting to have distinct index time and search time analyzers.
Sample from docs:
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}

Elasticsearch count terms ignoring spaces

Using ES 1.2.1
My aggregation
{
"size": 0,
"aggs": {
"cities": {
"terms": {
"field": "city","size": 300000
}
}
}
}
The issue is that some city names have spaces in them and aggregate separately.
For instance Los Angeles
{
"key": "Los",
"doc_count": 2230
},
{
"key": "Angeles",
"doc_count": 2230
},
I assume it has to do with the analyzer? Which one would I use to not split on spaces?
For fields that you want to perform aggregations on I would recommend either the keyword analyzer or do not analyze the field at all. From the keyword analyzer documentation:
An analyzer of type keyword that "tokenizes" an entire stream as a single token. This is useful for data like zip codes, ids and so on. Note, when using mapping definitions, it might make more sense to simply mark the field as not_analyzed.
However if you want to still perform analysis on the field to include for other searches, then consider using the field setting of ES 1.x As described in the field/multi_field documentation. This will allow you to have a value of the field for searching and one for aggregations.
There are 2 approaches to solve this.
The not_analyzed way - But this wont consider different capital and small cases
The keyword tokenizer way - Here we can map different terms with
different case as one.
These two concepts with working code examples are illustrated in this blog.

Resources