I'm querying Elasticsearch 2.3 using django-haystack, and the query that is executed seems to be the following:
'imaging_telescopes:(*\\"FS\\-60\\"*)'
An object in my Elasticsearch data has the following value for its property imaging_telescopes: "Takahashi FSQ-106N".
This object matches the query, and to me this result is unexpected: I wouldn't want it to match.
My assumption is that it matches because the value contains the letters "FS", but in my frontend I'm just searching for "FS-60".
How can I modify the query so that it's stricter, matching only objects whose imaging_telescopes property contains the exact text?
Thanks!
EDIT: this is the mapping of the field:
"imaging_telescopes": {
"type": "string",
"analyzer": "snowball"
}
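For reference, a quick way to see why the match happens is to inspect what the snowball analyzer does with the stored value; a minimal sketch using the _analyze API of Elasticsearch 2.x:
GET http://localhost:9200/_analyze
{
    "analyzer": "snowball",
    "text": "Takahashi FSQ-106N"
}
This should return terms along the lines of takahashi, fsq and 106n: the hyphenated model number is split into separate lowercased terms, and wildcard queries are matched against these individual terms rather than the original string, which helps explain why a loosely escaped wildcard pattern can match far more than intended.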
Related
I am trying a search query, and it works fine for exact searches, but if the user enters lowercase or uppercase it does not work, as ElasticSearch is case insensitive.
Example:
{
    "query": {
        "bool": {
            "should": {
                "match_all": {}
            },
            "filter": {
                "term": {
                    "city": "pune"
                }
            }
        }
    }
}
It works fine when the city is exactly "pune"; if we change the text to "PUNE" it does not work.
Regarding "ElasticSearch is case insensitive": Elasticsearch is not case-insensitive (or case-sensitive) per se; the behaviour depends on how a field is mapped and analyzed. A JSON string property will be mapped as a text datatype by default (with a keyword datatype sub or multi field, which I'll explain shortly).
A text datatype has the notion of analysis associated with it: at index time, the string input is fed through an analysis chain, and the resulting terms are stored in an inverted index data structure for fast full-text search. With a text datatype where you haven't specified an analyzer, the default analyzer will be used, which is the Standard Analyzer. One of the components of the Standard Analyzer is the Lowercase token filter, which lowercases tokens (terms).
When it comes to querying Elasticsearch through the search API, there are many different query types to fit pretty much any use case. One family, the full-text queries such as match and multi_match, performs analysis on the query input at search time, with the resulting terms compared to the terms stored in the inverted index. The analyzer used by default is again the Standard Analyzer.
Another family, the term-level queries such as term, terms and prefix, does not analyze the query input, so the query input as-is will be compared to the terms stored in the inverted index.
In your example, your term query on the "city" field does not find any matches when capitalized because it's searching against a text field whose input underwent analysis at index time. With the default mapping, this is where the keyword sub field can help. A keyword datatype does not undergo analysis (well, it supports a lightweight form of analysis with normalizers), so it can be used for exact matching, as well as sorting and aggregations. To use it, you would just need to target the "city.keyword" field. An alternative approach would be to change the analyzer used by the "city" field to one that does not use the Lowercase token filter; taking this approach would require you to reindex all documents in the index.
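A minimal sketch of the first approach, assuming the index is called users and the default dynamic mapping created the keyword sub field:
GET http://localhost:9200/users/_search
{
    "query": {
        "bool": {
            "should": {
                "match_all": {}
            },
            "filter": {
                "term": {
                    "city.keyword": "pune"
                }
            }
        }
    }
}
Note that a keyword term query is exact in every respect, including case: "Pune" would not match a document that stored "pune".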
Elasticsearch will analyze text fields and lowercase the resulting terms unless you define a custom mapping.
Exact values (like numbers, dates, and keywords) have the exact value specified in the field added to the inverted index in order to make them searchable.
However, text fields are analyzed. This means that their values are first passed through an analyzer to produce a list of terms, which are then added to the inverted index. There are many ways to analyze text: the default standard analyzer drops most punctuation, breaks up text into individual words, and lower cases them.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So if you want to use a term query, analyze the term on your own before querying, or just lowercase the term in this case.
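"Analyzing the term on your own" can be done with the _analyze API; a minimal sketch using the standard analyzer, whose output token you would then feed into the term query:
GET http://localhost:9200/_analyze
{
    "analyzer": "standard",
    "text": "PUNE"
}
This returns the single token pune, which matches the term stored in the inverted index.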
To solve this issue I created a custom normalizer and updated the mapping to use it. Before that, the index has to be deleted and created again.
First, delete the index:
DELETE http://localhost:9200/users
Now create the index again:
PUT http://localhost:9200/users
{
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "user": {
            "properties": {
                "city": {
                    "type": "keyword",
                    "normalizer": "lowercase_normalizer"
                }
            }
        }
    }
}
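With the lowercase_normalizer in place, the normalizer is applied both at index time and to the input of term-level queries, so a term query should now match regardless of case; a minimal sketch, assuming a document with city "pune" has been indexed:
GET http://localhost:9200/users/_search
{
    "query": {
        "term": {
            "city": "PUNE"
        }
    }
}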
I am indexing a very simple item with a field defined as such:
"comid": {
"type": "keyword",
"store": "false",
"index": "no",
"include_in_all": false,
"doc_values": false
}
I then ingested a single item where comid = "this is an id"
When I query the item with the exact match "this is an id", I get this error:
cannot search on field [comid] since it is not indexed
Maybe I misunderstood the documentation, but I thought we were able to search on keyword fields (using exact match)?
I think I could get around this problem by changing the type from keyword to text and then using the keyword analyzer (which is a no-op, if I understood it correctly), but it seems weird to do this for every keyword type field.
I must be missing something obvious here?
If it's not indexed, the field data is only stored in the document (the _source), not in the index, so you cannot search using that field.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html
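A minimal sketch of a mapping that keeps exact-match search working; keyword fields are indexed by default, so it is enough to drop "index": "no" (note that on a keyword field the index option takes the booleans true/false rather than "no"):
"comid": {
    "type": "keyword",
    "doc_values": false
}
With doc_values disabled you lose sorting and aggregations on the field, but term queries for exact matches work off the inverted index.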
We are currently adding search-as-you-type in the UI for some fields in our index.
For String-fields the functionality of Elasticsearch allows a number of ways of doing this, e.g. via match_phrase_prefix query on the analyzed tokens or via ngrams during indexing.
However, as IPv4 addresses are stored as longs internally, doing wildcard or prefix searching on them is not easily possible as far as I can see.
One can use range queries for searching IP ranges, but I would rather let the user enter "118" and display matches for "168.1.118.32" as well as "118.43.119.4" and "1.1.1.118".
Is there a built in way to perform such queries? Or do we need to store the field as analyzed string separately?
After some more investigation we used a multi field to store the IP address twice: once as the normal IP type and a second time as an analyzed value where we split the IP into its 4 octets so we can search on these parts separately.
In the template we use the following pattern to split up the value when writing to the index:
"analyzer": {
"ipv4analyzer": {
"tokenizer": "ipv4tokenizer"
}
},
"tokenizer": {
"ipv4tokenizer": {
"pattern": "([0-9]{1,3})",
"type": "pattern",
"group": "1"
}
}
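The corresponding multi field mapping could look roughly like this; a sketch in which the field name client_ip and the sub field name octets are placeholders rather than our actual template:
"client_ip": {
    "type": "ip",
    "fields": {
        "octets": {
            "type": "string",
            "analyzer": "ipv4analyzer"
        }
    }
}
With that in place, a match query on client_ip.octets for the input 118 finds any address containing 118 as an octet, while range queries still work against client_ip itself.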
I have an index based on Products, and one of the fields declared in the mapping is Attributes. This field is a nested type, as it will contain two values: key and value. The problem I have is that, depending on the context of the attribute, the datatype of value can vary between an integer and a string.
For example:
{"attributes":[{"key":"StrEx","value":"Red"},{"key":"IntEx","value":2}]}
It seems the datatype for every instance of 'value' within all future nested documents within Attributes is decided based on the first data entered. I need to be able to store it as an integer/long datatype so I can perform range queries.
Any help or alternative ideas would be greatly appreciated.
You need a mapping like this one, for the value field:
"value": {
"type": "string",
"fields": {
"as_number": {
"type": "integer",
"ignore_malformed": true
}
}
}
Basically, your field is a string, but using fields you can additionally index it as a numeric field; ignore_malformed tells Elasticsearch to skip values that don't parse as integers instead of rejecting the whole document.
When you want to run range queries, use value.as_number; for anything else use value.
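A sketch of such a range query; since attributes is a nested field, the range query has to be wrapped in a nested query (the index name products and the bounds are made up for illustration):
GET http://localhost:9200/products/_search
{
    "query": {
        "nested": {
            "path": "attributes",
            "query": {
                "range": {
                    "attributes.value.as_number": {
                        "gte": 1,
                        "lte": 5
                    }
                }
            }
        }
    }
}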
Haystack generates elasticsearch queries to get results from elasticsearch. The queries get prepended with a filter containing the following query:
"query": {
"query_string": {
"query": "django_ct:(customers.customer)"
}
}
What is the meaning of the django_ct(..) query? Is this a function that haystack installs in elasticsearch? Is it some caching magic? Can I get rid of this part altogether?
The reason why I'm asking is that I have to build a custom query to use an elasticsearch multi_field. In order to change the queries I want to understand first how haystack generates its own queries.
Haystack uses Django's content types to determine which model attributes to search against in Elasticsearch. This is not really best practice, but it's how it's done in HS.
Basically, the code in HS looks something like this:
from django.contrib.contenttypes.models import ContentType

# django_ct holds "<app label>.<model name>", e.g. "customers.customer"
app_name, model_name = django_ct.split('.')
ct = ContentType.objects.get_by_natural_key(app_name, model_name)
model = ct.model_class()
# do stuff with model
So, you really don't want to ignore it when using haystack, if you are indexing more than one model in your index.
I have a couple other answers based on elasticsearch here: index analyzer vs query analyzer in haystack - elasticsearch? and here: Django Haystack Distinct Value for Field
EDIT regarding multi-fields:
I've used Haystack and multi-fields in the past, so I'm not sure you need to write your own backend. The key is understanding how haystack creates searches. As I said in one of the other posts, everything goes into query_string, and from there it creates a Lucene-based search string. Again, not really best practice.
So let's say you have a multi-field that looks like this:
"some_field": {
"type": "multi_field",
"fields": {
"some_field_edgengram": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"some_field": {
"type": "string",
"index": "not_analyzed"
}
}
},
In haystack, you can just search against some_field and some_field_edgengram directly.
For example, SearchQuerySet().filter(some_field="cat") and SearchQuerySet().filter(some_field_edgengram="cat") will both work, but the first will only match tokens that are exactly cat, and the second will match cat, cats, catlin, catch, etc., at least using my edgengram analyzers.
However, just because you use haystack for indexing and search doesn't mean you have to use it for 100% of your search solution. In the past, I've used PYES in some areas of the app and haystack in others, because haystack lacked support for more advanced features and the query_string parsing was losing some of the finer-grained accuracy we were looking for.
In your case, you could get results from the search engine via elasticutils or python-elasticseach directly for some more advanced searches and use haystack for the other more routine searches.
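For instance, a direct search against the edgengram sub field, bypassing haystack's query_string generation entirely; a sketch in which the index name haystack is a placeholder:
GET http://localhost:9200/haystack/_search
{
    "query": {
        "match": {
            "some_field_edgengram": "cat"
        }
    }
}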