ElasticSearch 5 won't find documents with keyword including space

I'm indexing documents with the following format:
{
"title": "this is the title",
"brand": "brand here",
"filters": ["filter1", "filter2", "Sin filters", "Camera IP"],
"active": true
}
Then a query looks like:
'query': {
'function_score': {
'query': {
'bool': {
'filter': [
{
'term': {
'active': True
}
}
],
'must': [
{
'terms': {
'filters': ['camera ip']
}
}
]
}
}
}
}
I can't get back any document with the "Camera IP" filter (I tried every variation of the string, lowercase and so on), yet ES does return the documents whose filters include "Sin filters".
The index is created with the following settings. Note that the "filters" field falls under the default template and is of type keyword.
"settings":{
"index":{
"analysis":{
"analyzer":{
"keylower":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"string_as_keywords": {
"mapping": {
"index": "not_analyzed",
"type" : "keyword",
"analyzer": "keylower" # I also tried with and without changing this analyzer
},
"match": "*",
"match_mapping_type": "string"
}
},
{
"integers": {
"mapping": {
"type": "integer"
},
"match": "*",
"match_mapping_type": "long"
}
},
{
"floats": {
"mapping": {
"type": "float"
},
"match": "*",
"match_mapping_type": "double"
}
}
]
}
}
What am I missing? It's strange that it returns the documents with the "Sin filters" filter but not those with "Camera IP".
Thanks.

It seems you want the filters to be lowercased but not tokenized. I think the problem with your query is that you set the type of the strings to "keyword", and ES will not analyze these fields, not even to change their case:
Keyword fields are only searchable by their exact value.
That is why, with your settings, you can still retrieve the document with a query that uses the exact value: {"query": {"term": {"filters": "Camera IP"}}}.
Since you want the analyzer to change the casing of your text before indexing, you should set the type to text by changing your mapping to something like this:
{"settings":{
"index": {
"analysis":{
"analyzer":{
"test_analyzer":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"string_as_keywords": {
"mapping": {
"type": "text",
"index": "not_analyzed",
"analyzer": "test_analyzer"
},
"match": "*",
"match_mapping_type": "string"
}
}
]
}
}}
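To make the difference concrete, here is a minimal Python sketch of the matching logic under simplifying assumptions: a keyword field is modeled as storing each value verbatim, and the keyword tokenizer plus lowercase filter is modeled as lowercasing the whole string. The helper names are made up for illustration; this is a toy model, not how Elasticsearch works internally.

```python
# Simplified model of term-level matching; all helper names are hypothetical.

def index_as_keyword(values):
    """A keyword field stores each value verbatim as a single term."""
    return set(values)

def index_with_keyword_lowercase(values):
    """Keyword tokenizer + lowercase filter: one token per value, lowercased."""
    return {value.lower() for value in values}

def terms_query(indexed_terms, wanted):
    """A terms query matches if any wanted term equals an indexed term exactly."""
    return any(term in indexed_terms for term in wanted)

filters = ["filter1", "filter2", "Sin filters", "Camera IP"]

verbatim = index_as_keyword(filters)
lowered = index_with_keyword_lowercase(filters)

print(terms_query(verbatim, ["camera ip"]))  # False: case differs from the stored term
print(terms_query(verbatim, ["Camera IP"]))  # True: exact stored value
print(terms_query(lowered, ["camera ip"]))   # True: the stored term was lowercased
```

In this model the space in "Camera IP" plays no role; only the case mismatch between the query term and the stored term does.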

Your filter 'filters': ['camera ip'] looks for camera ip, whereas in the mapping the field filters has type keyword, for which Elasticsearch requires an exact match. So, to find that field, you need to query with exactly the string you indexed. If your use case doesn't require an exact match, change the type to text, which Elasticsearch analyzes before indexing. See the Elasticsearch documentation on the text and keyword datatypes for more.

Related

Dynamic templates support default types?

Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:
the datatype detected by Elasticsearch, with match_mapping_type.
the name of the field, with match and unmatch or match_pattern.
the full dotted path to the field, with path_match and path_unmatch.
I was trying to make keyword the default type for all fields, while some special fields with a specific suffix or prefix get other types, as follows. Unexpectedly, all fields ended up as keyword.
{
"order": 99,
"index_patterns": [
"xxxx_stats_*"
],
"settings": {
"index": {
"number_of_shards": "6",
"number_of_replicas": "1"
}
},
"mappings": {
"_doc": {
"dynamic": true,
"_source": {
"enabled": true
},
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "*",
"unmatch": [
"*Time",
"*At",
"is*"
],
"mapping": {
"ignore_above": 256,
"null_value": "NULL",
"type": "keyword"
}
}
},
{
"timeSuffix": {
"match_mapping_type": "*",
"match": [
"*Time",
"*At"
],
"mapping": {
"type": "long"
}
}
},
{
"isPrefix": {
"match_mapping_type": "*",
"match": "is*",
"mapping": {
"type": "boolean"
}
}
}
],
"date_detection": false,
"numeric_detection": true
}
},
"aliases": {
"{index}-alias": {
}
}
}
AFAIK match and unmatch cannot be arrays, only strings or regexes. So try this:
{
"dynamic_templates":[
{
"timeSuffix":{
"match_mapping_type":"*",
"match_pattern":"regex",
"match":"^(.*Time)|(.*At)$",
"mapping":{
"type":"long"
}
}
},
{
"isPrefix":{
"match_mapping_type":"*",
"match":"is*",
"mapping":{
"type":"boolean"
}
}
},
{
"strings":{
"match_mapping_type":"*",
"mapping":{
"ignore_above":256,
"null_value":"NULL",
"type":"keyword"
}
}
}
]
}
I also found that when you move strings to the bottom, the two templates above it are resolved first. Otherwise, since every template includes "match_mapping_type":"*", the first matching template applies. This issue may be related.
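The ordering effect can be sketched in Python. This is a simplified model that assumes templates are tried top to bottom and the first match wins, as described above. It ignores match_mapping_type (which is "*" everywhere here), treats the strings template as matching "*", and uses a hypothetical helper name resolve_type.

```python
import fnmatch
import re

def resolve_type(field_name, templates):
    """Return the mapping type of the first template whose pattern matches."""
    for template in templates:
        pattern = template["match"]
        if template.get("match_pattern") == "regex":
            matched = re.fullmatch(pattern, field_name) is not None
        else:
            # Simple wildcard patterns such as "is*"
            matched = fnmatch.fnmatchcase(field_name, pattern)
        if matched:
            return template["type"]
    return None

# Specific templates first, catch-all last (the order suggested above).
specific_first = [
    {"match_pattern": "regex", "match": "^(.*Time)|(.*At)$", "type": "long"},
    {"match": "is*", "type": "boolean"},
    {"match": "*", "type": "keyword"},
]

# Catch-all first: it shadows every template below it.
catch_all_first = [specific_first[2], specific_first[0], specific_first[1]]

for name in ["createTime", "updatedAt", "isActive", "name"]:
    print(name, resolve_type(name, specific_first), resolve_type(name, catch_all_first))
```

With the catch-all first, every field name in the loop resolves to keyword, which mirrors the behaviour reported in the question.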

Elastic Search, lowercase search doesn't work

I am trying to search against content using prefix, and if I search for diode I get results that differ from Diode. How do I get ES to return the same results for both diode and Diode? These are the mappings and settings I am using in ES.
"settings":{
"analysis": {
"analyzer": {
"lowercasespaceanalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"articles": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword",
"index": "true"
},
"imageurl": {
"type": "keyword",
"index": "true"
},
"content": {
"type": "text",
"analyzer" : "lowercasespaceanalyzer",
"search_analyzer":"whitespace"
},
"description": {
"type": "text"
},
"relatedcontentwords": {
"type": "text"
},
"cmskeywords": {
"type": "text"
},
"partnumbers": {
"type": "keyword",
"index": "true"
},
"pubdate": {
"type": "date"
}
}
}
}
Here is an example of the query I use:
POST _search
{
"query": {
"bool" : {
"must" : {
"prefix" : { "content" : "capacitance" }
}
}
}
}
It happens because you use two different analyzers at search time and at index time.
When you enter the query "Diod" at search time, the "whitespace" analyzer leaves it as "Diod".
However, because you use "lowercasespaceanalyzer" at index time, "Diod" is indexed as "diod". Use the same analyzer at both search and index time, or at least a search analyzer that lowercases your strings, because the default "whitespace" analyzer doesn't: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html
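A small Python sketch of the mismatch, assuming a toy model in which the index holds lowercased whitespace tokens and a prefix lookup compares the given prefix with those tokens as-is. The sample sentence and the function names are invented for illustration.

```python
def whitespace_analyze(text):
    """Split on whitespace only; case is preserved."""
    return text.split()

def lowercase_whitespace_analyze(text):
    """Split on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]

def prefix_hits(indexed_terms, prefix):
    """Return the indexed terms that start with the given prefix."""
    return sorted(term for term in indexed_terms if term.startswith(prefix))

# Index time: tokens are lowercased before they are stored.
indexed = set(lowercase_whitespace_analyze("Diode arrays and Capacitance of a diode"))

# Search time without lowercasing: the capitalized prefix matches nothing.
print(prefix_hits(indexed, whitespace_analyze("Diode")[0]))            # []

# Search time with the same lowercasing as at index time.
print(prefix_hits(indexed, lowercase_whitespace_analyze("Diode")[0]))  # ['diode']
```

Once both sides go through the same lowercasing step, "Diode" and "diode" produce identical hits in this model.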
There will be no term Diode in your index. So if you want the same results, you should have your query text analyzed by the same analyzer.
You can use a query string query like:
"query_string" : {
"default_field" : "content",
"query" : "Diode",
"analyzer" : "lowercasespaceanalyzer"
}
UPDATE
You can also analyze your query text before querying.
AnalyzeResponse resp = client.admin().indices()
.prepareAnalyze(index, text)
.setAnalyzer("lowercasespaceanalyzer")
.get();
// getTokens() returns AnalyzeToken objects; getTerm() gives the analyzed text
String analyzedContext = resp.getTokens().get(0).getTerm();
...
Then use analyzedContext as the new query text.

Update and search in multi field properties in ElasticSearch

I'm trying to use multi field properties for multi language support. I created following mapping for this:
{
"mappings": {
"product": {
"properties": {
"prod-id": {
"type": "string"
},
"prod-name": {
"type": "string",
"fields": {
"en": {
"type": "string",
"analyzer": "english"
},
"fr": {
"type": "string",
"analyzer": "french"
}
}
}
}
}
}
}
I created test record:
{
"prod-id": "1234567",
"prod-name": [
"Test product",
"Produit d'essai"
]
}
and tried to query using some language:
{
"query": {
"bool": {
"must": [
{"match": {
"prod-name.en": "Produit"
}}
]
}
}
}
As a result I got my document back. But I expected an empty result when I search for French text in the English field. It seems Elasticsearch ignores which field I specify in the query: there is no difference in the search results whether I use "prod-name.en", "prod-name.fr" or just "prod-name". Is this behaviour expected? Do I need to do something special to search in just one language?
The other problem is updating a multi field property: I can't update just one field.
{
"doc" : {
"prod-name.en": "Test"
}
}
I got the following error:
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "Field name [prod-name.en] cannot contain '.'"
}
],
"type": "mapper_parsing_exception",
"reason": "Field name [prod-name.en] cannot contain '.'"
},
"status": 400
}
Is there any way to update just one field in multi field property?
In your mapping, the prod-name.en field is simply analyzed using the english analyzer, and likewise prod-name.fr using the french analyzer. However, ES will not choose for you which value to put in which field.
Instead, you need to modify your mapping like this:
{
"mappings": {
"product": {
"properties": {
"prod-id": {
"type": "string"
},
"prod-name": {
"type": "object",
"properties": {
"en": {
"type": "string",
"analyzer": "english"
},
"fr": {
"type": "string",
"analyzer": "french"
}
}
}
}
}
}
}
and change the input document to look like this; then you'll get the results you expect.
{
"prod-id": "1234567",
"prod-name": {
"en": "Test product",
"fr": "Produit d'essai"
}
}
As for the updating part, your partial document should be like this instead.
{
"doc" : {
"prod-name": {
"en": "Test"
}
}
}
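The difference between the two mappings can be illustrated with a rough Python model. It assumes that a multi-field copies every value of prod-name into each sub-field, while an object field keeps one value per property. Analysis is reduced to lowercasing and whitespace splitting, so the stemming done by the english and french analyzers is ignored; all function names are hypothetical.

```python
def tokens(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return {word.lower() for word in text.split()}

def index_multi_field(values, sub_fields):
    """Multi-field: every value feeds the main field and every sub-field."""
    all_tokens = set()
    for value in values:
        all_tokens |= tokens(value)
    index = {"prod-name": all_tokens}
    for sub in sub_fields:
        index["prod-name." + sub] = set(all_tokens)
    return index

def index_object(values_by_language):
    """Object field: each property holds only its own value."""
    return {"prod-name." + lang: tokens(text)
            for lang, text in values_by_language.items()}

def match(index, field, word):
    """True if the word occurs among the tokens indexed for that field."""
    return word.lower() in index.get(field, set())

multi = index_multi_field(["Test product", "Produit d'essai"], ["en", "fr"])
obj = index_object({"en": "Test product", "fr": "Produit d'essai"})

print(match(multi, "prod-name.en", "Produit"))  # True: the French value is in .en too
print(match(obj, "prod-name.en", "Produit"))    # False: .en holds only the English text
print(match(obj, "prod-name.fr", "Produit"))    # True
```

This is why the original query returned the document no matter which sub-field it named.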

How to force a terms filter to ignore stopwords?

I have an Elasticsearch index with a bunch of fields, some of which I want to use along with the default stopword list. On the other hand, I have a username field which should return results for users called the, be etc.
Of course, when I run the following query:
{
"query": {
"constant_score": {
"filter": {
"terms": {
"username": [
"be"
]
}
}
}
}
}
nothing is returned. I have seen various solutions for changing the standard analyzer to remove stopwords, but am struggling to find how I would do so for this one field only. Thanks for any pointers.
You can do it like this: add a custom analyzer that doesn't use stopwords, then explicitly specify this analyzer just for the fields where you want stopwords to be searchable (like your username field).
PUT /stopwords
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english",
"stopwords": "_none_"
}
}
}
},
"mappings": {
"text": {
"properties": {
"title": {
"type": "string"
},
"content": {
"type": "string"
},
"username": {
"type": "string",
"analyzer": "my_english"
}
}
}
}
}
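As a rough illustration of why the per-field analyzer helps, the following Python sketch models analysis as lowercasing, whitespace splitting and stopword removal. The stop list is a small sample chosen for the example, the stemming performed by the english analyzer is left out, and the function names are hypothetical.

```python
# Small sample of English stopwords, for illustration only.
SAMPLE_STOPWORDS = {"a", "an", "and", "be", "is", "of", "the", "to"}

def analyze(text, stopwords):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]

def terms_filter(indexed_tokens, wanted):
    """A terms filter matches if any wanted term was actually indexed."""
    return any(term in indexed_tokens for term in wanted)

# Username "be" indexed with stopword removal: nothing is left to match.
with_stopwords = analyze("be", SAMPLE_STOPWORDS)

# Username "be" indexed with stopwords disabled (the "_none_" setting above).
without_stopwords = analyze("be", set())

print(with_stopwords, terms_filter(with_stopwords, ["be"]))        # [] False
print(without_stopwords, terms_filter(without_stopwords, ["be"]))  # ['be'] True
```

The title and content fields keep their default analysis, so only username is affected by the change.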

Dynamic Mapping for an object field that unwraps the parent path

I am evaluating whether ElasticSearch can meet the needs of a new system I'm building. It looks amazing, so I'm really hopeful I can figure out a mapping strategy that works.
In this system, administrators can define fields to be associated with documents dynamically. So a given type (in the elasticsearch sense of the word) can have any number of fields, which I do not know the name of ahead of time. And each field can be of any type: int, date, string, etc.
An example document may look like:
{
"name": "bob",
"age": 22,
"title": "Vice Intern",
"tagline": "Ask not what your company can do for you, but..."
}
Notice that there are 2 string fields. Awesome. My problem though is that I want the "tagline" to be analyzed, but I do not want "title" to be analyzed.
Remember I don't know the names of these fields ahead of time. And there could be multiple fields of each type. So there could be 10 string fields of various names, 3 of which should be analyzed and 7 of which should not.
Another requirement I have is that the name the administrator gives the field should also be what they can search by. So, for example, if they want to find all the Vice Interns who have something to say, the lucene query may be:
+title:"Vice Intern" +tagline:"company"
So my thought was that I could define a dynamic mapping. Since I don't know the names of the fields ahead of time, it seems like a great approach. The key though is coming up with a way of differentiating string fields that should be analyzed and ones that shouldn't be!
I thought, hey, I'll just put all the fields that need analyzing into a nested object, like this:
{
"name": "bob",
"age": 22,
"title": "Vice Intern",
"textfields": {
"tagline": "Ask not what your company can do for you, but...",
"somethingelse": "lorem ipsum"
}
}
Then, in my dynamic mapping, I have a way of mapping those fields differently:
{
"mytype": {
"dynamic_templates": {
"nested_textfields": {
"match": "textfields",
"match_mapping_type": "string",
"mapping": {
"index": "analyzed",
"analyzer": "default"
}
}
}
}
}
I know that isn't right, I actually need some kind of nested mapping, but no matter, because if I understand it correctly, even if I got that working, it would mean those fields are searched for (via lucene syntax) like this:
+title:"Vice Intern" +textfields.tagline:"company"
And I don't want the "textfields" prefix. Since I'm the one providing the textfields object that wraps the text fields, I know that the fields within it are still uniquely named across the entire document.
I thought of using a pattern match instead. So instead of wrapping them in a "textfields" object, I could prefix them, like "textfield_tagline". But when doing that, the {name} token in the dynamic mapping includes the prefix, I don't see a way to just pull out the "*" portion.
Any solution which gets me the necessary behavior is a correct answer. Even if that involves nested mapping information into the documents themselves (can you do that? I've seen something like that, I think...).
EDIT:
I've attempted the following dynamic template. I'm trying to use index_name to remove the 'textfields.' in the index. This dynamic template just doesn't seem to match though, because after putting a document and looking at the mapping I see no analyzer specified.
{
"mytype" : {
"dynamic_templates":
[
{
"textfields": {
"path_match": "textfields.*",
"match_mapping_type" : "string",
"mapping": {
"type": "string",
"index": "analyzed",
"analyzer": "default",
"index_name": "{name}",
"fields": {
"sort": {
"type": "string",
"index": "not_analyzed",
"index_name": "{name}_sort"
}
}
}
}
}
]
}
}
I was able to duplicate the results that you asked for specifically with the following index creation (with mappings), document, and search query. The type does vary a bit, but it serves the purpose of the example.
Index Settings
PUT http://localhost:9200/sandbox
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0
}
},
"mappings": {
"mytype": {
"dynamic_templates": [
{
"indexedfields": {
"path_match": "indexedfields.*",
"match_mapping_type" : "string",
"mapping": {
"type": "string",
"index": "analyzed",
"analyzer": "default",
"index_name": "{name}",
"fields": {
"sort": {
"type": "string",
"index": "not_analyzed",
"index_name": "{name}_sort"
}
}
}
}
},
{
"textfields": {
"path_match": "textfields.*",
"match_mapping_type" : "string",
"mapping": {
"type": "string",
"index": "not_analyzed",
"index_name": "{name}"
}
}
},
{
"strings": {
"path_match": "*",
"match_mapping_type" : "string",
"mapping": {
"type": "string",
"index": "not_analyzed"
}
}
}
]
}
}
}
Document
PUT http://localhost:9200/sandbox/mytype/1
{
"indexedfields":{
"hello":"Hello world",
"message":"The great balls of the world are on fire"
},
"textfields":{
"username":"User Name",
"projectname":"Project Name"
}
}
Search
POST http://localhost:9200/sandbox/mytype/_search
{
"query": {
"query_string": {
"query": "message:\"great balls\""
}
},
"filter":{
"query":{
"query_string":{
"query":"username:\"User Name\""
}
}
},
"from":0,
"size":10,
"sort":[
]
}
The search returns the following response:
{
"took":2,
"timed_out":false,
"_shards":{
"total":1,
"successful":1,
"failed":0
},
"hits":{
"total":1,
"max_score":0.19178301,
"hits":[
{
"_index":"sandbox",
"_type":"mytype",
"_id":"1",
"_score":0.19178301,
"_source":{
"indexedfields":{
"hello":"Hello world",
"message":"The great balls of the world are on fire"
},
"textfields":{
"username":"User Name",
"projectname":"Project Name"
}
}
}
]
}
}
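The part that removes the wrapper prefix is the "index_name": "{name}" setting in the templates above. The Python sketch below is only a model of that substitution, assuming {name} expands to the last segment of the field path when path_match matches; expand_index_name is a hypothetical helper, not an Elasticsearch API.

```python
import fnmatch

def expand_index_name(full_path, path_match, index_name_pattern):
    """Model the {name} placeholder: replace it with the leaf field name.

    Fields whose full path does not match path_match keep their full path.
    """
    if not fnmatch.fnmatchcase(full_path, path_match):
        return full_path
    leaf = full_path.rsplit(".", 1)[-1]
    return index_name_pattern.replace("{name}", leaf)

print(expand_index_name("indexedfields.message", "indexedfields.*", "{name}"))
print(expand_index_name("indexedfields.message", "indexedfields.*", "{name}_sort"))
print(expand_index_name("textfields.username", "textfields.*", "{name}"))
```

Under this model the three calls yield message, message_sort and username, which matches the way the search above queries message and username without the indexedfields. or textfields. prefix.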
