Elasticsearch not analyzed and lowercase

I'm trying to make a field lowercase and not analyzed in Elasticsearch 5+ so that I can search, in lowercase, for strings that contain spaces (they are indexed in mixed case).
Before Elasticsearch v5 we could use an analyzer like this one to accomplish it:
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
}
This no longer works, however, and I believe the problem is that "string" is deprecated and automatically converted to either keyword or text.
Does anyone know how to accomplish this? I thought about adding a "fields" section to my mapping, along the lines of:
"fields": {
"lowercase": {
"type": "string"
**somehow convert to lowercase**
}
}
This would make working with the field slightly more awkward, though, and I don't know how to convert it to lowercase either.
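If the multi-field route did work, I assume it would look something like this sketch, reusing the analyzer_keyword analyzer from above with the ES 5 text type in place of string (an assumption on my part, untested):
"fields": {
  "lowercase": {
    "type": "text",
    "analyzer": "analyzer_keyword"
  }
}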
Below you'll find a test setup which reproduces my exact problem.
Create the index:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "analyzer": "analyzer_keyword",
          "type": "string"
        }
      }
    }
  }
}
Add a test record:
{
  "name": "city test"
}
Query that should match:
{
  "size": 20,
  "from": 0,
  "query": {
    "bool": {
      "must": [{
        "bool": {
          "should": [{
            "wildcard": {
              "name": "*city t*"
            }
          }]
        }
      }]
    }
  }
}

When creating your index, you need to make sure that the analysis section sits directly under the settings section and not inside settings > index, otherwise it won't work.
You also need to use the text data type for your field instead of string. Wipe your index, apply both changes, and it will work.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzer_keyword": {
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "analyzer": "analyzer_keyword",
          "type": "text"
        }
      }
    }
  }
}
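Once the index is recreated this way, the wildcard query from the question matches, because the keyword tokenizer emits the whole value as a single token and the lowercase filter normalizes it to city test. A quick way to check what actually gets indexed is the _analyze API (a sketch; your_index is a placeholder for whatever name you created):
GET /your_index/_analyze
{
  "analyzer": "analyzer_keyword",
  "text": "City Test"
}
#expected: one token, "city test"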

Related

ElasticSearch 5 won't find documents with keyword including space

I'm indexing documents with the following format:
{
  "title": "this is the title",
  "brand": "brand here",
  "filters": ["filter1", "filter2", "Sin filters", "Camera IP"],
  "active": true
}
Then a query looks like:
'query': {
    'function_score': {
        'query': {
            'bool': {
                'filter': [
                    {
                        'term': {
                            'active': True
                        }
                    }
                ],
                'must': [
                    {
                        'terms': {
                            'filters': ['camera ip']
                        }
                    }
                ]
            }
        }
    }
}
I can't retrieve any document by the "Camera IP" filter (or any variation of this string, lowercase and so on), but ES returns the ones with the "Sin filters" filter.
The index is created with the following settings. Note that the "filters" field falls under the default template and is of type keyword:
"settings":{
"index":{
"analysis":{
"analyzer":{
"keylower":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"string_as_keywords": {
"mapping": {
"index": "not_analyzed",
"type" : "keyword",
**"analyzer": "keylower"** # I also tried with and without changing this analyzer
},
"match": "*",
"match_mapping_type": "string"
}
},
{
"integers": {
"mapping": {
"type": "integer"
},
"match": "*",
"match_mapping_type": "long"
}
},
{
"floats": {
"mapping": {
"type": "float"
},
"match": "*",
"match_mapping_type": "double"
}
}
]
}
}
What am I missing? It's strange that it returns the documents with the "Sin filters" filter but not the ones with "Camera IP".
Thanks.
It seems like you want the filters to be lowercase and not tokenized. I think the problem is that you set the type of the strings to "keyword", and ES will not analyze these fields, not even to change their case:
Keyword fields are only searchable by their exact value.
That is why, with your setting, you can still retrieve the document with a query like this: {"query": {"term": {"filters": "Camera IP"}}}.
Since you want the analyzer to change the casing of your text before indexing, you should set the type to text by changing your mapping to something like this:
{"settings":{
"index": {
"analysis":{
"analyzer":{
"test_analyzer":{
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"string_as_keywords": {
"mapping": {
"type": "text",
"index": "not_analyzed",
"analyzer": "test_analyzer"
},
"match": "*",
"match_mapping_type": "string"
}
}
]
}
}}
Your filter 'filters': ['camera ip'] looks for camera ip, whereas in the mapping you have the field filters as type keyword, for which Elasticsearch looks for an exact match. So, in order to find that field, you will need to search for the exact string that was indexed. If your use case doesn't require an exact match, change the type to text, which Elasticsearch analyzes before indexing. More on the text datatype here and the keyword datatype here.
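To illustrate the difference with a sketch (assuming one index where filters is the default keyword and another where it is text with the keylower analyzer):
#keyword: only the exact original value matches
{"query": {"term": {"filters": "Camera IP"}}}
#text + keylower: the whole value is indexed as a single lowercased token
{"query": {"terms": {"filters": ["camera ip"]}}}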

Searching in all fields, case insensitive, and not analyzed

In Elasticsearch, how can I define a dynamic default mapping for any field (the fields are not predefined) so that values with spaces are searchable case-insensitively?
For example, if I have two documents:
PUT myindex/mytype/1
{
  "transaction": "test"
}
and
PUT myindex/mytype/2
{
  "transaction": "test SPACE"
}
I'd like to perform the following queries:
Querying: "test", Expected result: "test"
Querying: "test space", Expected result "test SPACE"
I've tried to use:
PUT myindex
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "analyzer": "analyzer_keyword",
          "type": "string"
        }
      }
    }
  }
}
But it gives me both documents as results when searching for "test".
Apparently there was a mistake in how I was running my query.
Here's a solution I found to this problem, using a multi-field query:
#any field mapping - not analyzed and case insensitive
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "dynamic_templates": [
        {
          "notanalyzed": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "analyzer": "analyzer_keyword"
            }
          }
        }
      ]
    }
  }
}
#index test data
POST /test_index/doc/_bulk
{"index":{"_id":3}}
{"name":"Company Solutions", "a" : "a1"}
{"index":{"_id":4}}
{"name":"Company", "a" : "a2"}
#search for documents with name "company" and a "a2"
POST /test_index/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "and": {
          "filters": [
            {
              "query": {
                "match": {
                  "name": "company"
                }
              }
            },
            {
              "query": {
                "match": {
                  "a": "a2"
                }
              }
            }
          ]
        }
      }
    }
  }
}
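Note that the filtered query and the and filter used above are Elasticsearch 1.x constructs; they were deprecated in 2.0 and removed in 5.0. On later versions, an equivalent query might look like this sketch (same index and fields as above):
POST /test_index/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "name": "company" } },
        { "match": { "a": "a2" } }
      ]
    }
  }
}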

Multi-level nesting in elastic search

I have the structure below (a small part of a very large Elasticsearch document):
sample:
{
  "md5sum": "4002cbda13066720513d1c9d55dba809",
  "id": 1,
  "sha256sum": "1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
  "sha1sum": "ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
  "created_at": "2016-02-02T14:25:19+00:00",
  "updated_at": "2016-02-11T20:43:22+00:00",
  "file_size": 188416,
  "type": {
    "name": "EXE"
  },
  "tags": [],
  "sampleSources": [
    {
      "filename": "4002cbda13066720513d1c9d55dba809",
      "source": {
        "name": "default"
      }
    },
    {
      "filename": "4002cbda13066720332513d1c9d55dba809",
      "source": {
        "name": "default"
      }
    }
  ]
}
The filter I would like to apply is to find documents by the 'name' contained within sample.sampleSources.source.
I tried the query below:
curl -XGET "http://localhost:9200/app/sample/_search?pretty" -d {query}
where {query} is:
{
  "query": {
    "nested": {
      "path": "sample.sampleSources",
      "query": {
        "nested": {
          "path": "sample.sampleSources.source",
          "query": {
            "match": {
              "sample.sampleSources.source.name": "default"
            }
          }
        }
      }
    }
  }
}
However, it is not returning any results. I have certain cases in my document where the nesting is even deeper than this. Can someone please guide me on how to formulate this query so that it works in all cases?
EDIT 1
Mappings:
{
  "app": {
    "mappings": {
      "sample": {
        "sampleSources": {
          "type": "nested",
          "properties": {
            "filename": {
              "type": "string"
            },
            "source": {
              "type": "nested",
              "properties": {
                "name": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}
EDIT 2
The solution posted by Waldemar Neto below works well for a match query, but not for a wildcard or a regexp query. Can you please guide me? I need the wildcard and regexp queries to work as well.
I tried your examples here and they work fine. Take a look at my data.
Mapping:
PUT /app
{
  "mappings": {
    "sample": {
      "properties": {
        "sampleSources": {
          "type": "nested",
          "properties": {
            "source": {
              "type": "nested"
            }
          }
        }
      }
    }
  }
}
Indexed data:
POST /app/sample
{
  "md5sum": "4002cbda13066720513d1c9d55dba809",
  "id": 1,
  "sha256sum": "1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
  "sha1sum": "ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
  "created_at": "2016-02-02T14:25:19+00:00",
  "updated_at": "2016-02-11T20:43:22+00:00",
  "file_size": 188416,
  "type": {
    "name": "EXE"
  },
  "tags": [],
  "sampleSources": [
    {
      "filename": "4002cbda13066720513d1c9d55dba809",
      "source": {
        "name": "default"
      }
    },
    {
      "filename": "4002cbda13066720332513d1c9d55dba809",
      "source": {
        "name": "default"
      }
    }
  ]
}
Search query:
GET /app/sample/_search
{
  "query": {
    "nested": {
      "path": "sampleSources.source",
      "query": {
        "match": {
          "sampleSources.source.name": "default"
        }
      }
    }
  }
}
Example using wildcard:
GET /app/sample/_search
{
  "query": {
    "nested": {
      "path": "sampleSources.source",
      "query": {
        "wildcard": {
          "sampleSources.source.name": {
            "value": "*aul*"
          }
        }
      }
    }
  }
}
The only difference I saw was in the path: you don't need to include the sample type in the nested path, only the inner objects.
Test it and give me feedback.
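For the regexp case from EDIT 2, the same path rule applies; bear in mind that wildcard and regexp queries run against the indexed tokens, so on a standard-analyzed string field the pattern must be lowercase and match a single token. A sketch:
GET /app/sample/_search
{
  "query": {
    "nested": {
      "path": "sampleSources.source",
      "query": {
        "regexp": {
          "sampleSources.source.name": "defa.*"
        }
      }
    }
  }
}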

Multi field analyzer not working as expected

I'm confused. I have the following document indexed:
POST test/topic
{
  "title": "antiemetics"
}
With the following query:
{
  "query": {
    "query_string": {
      "fields": ["title*"],
      "default_operator": "AND",
      "query": "anti emetics",
      "use_dis_max": true
    }
  },
  "highlight": {
    "fields": {
      "*": {
        "fragment_size": 200,
        "pre_tags": ["<mark>"],
        "post_tags": ["</mark>"]
      }
    }
  }
}
and the following settings and mappings:
POST test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "merge": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "hyphen",
              "space",
              "html_strip"
            ]
          }
        },
        "char_filter": {
          "hyphen": {
            "type": "pattern_replace",
            "pattern": "[-]",
            "replacement": ""
          },
          "space": {
            "type": "pattern_replace",
            "pattern": " ",
            "replacement": ""
          }
        }
      }
    }
  },
  "mappings": {
    "topic": {
      "properties": {
        "title": {
          "analyzer": "standard",
          "search_analyzer": "standard",
          "type": "string",
          "fields": {
            "specialised": {
              "type": "string",
              "index": "analyzed",
              "analyzer": "standard",
              "search_analyzer": "merge"
            }
          }
        }
      }
    }
  }
}
I know my use of a multi-field doesn't make sense here, since I'm using the same index analyzer as for title, so please ignore that; I'm more interested in checking my understanding of analyzers. I was expecting the merge analyzer to change the query "anti emetics" to "antiemetics", and I was hoping the multi-field with that search analyzer would match the token "antiemetics", but I don't get any results back, even though I have verified, by running the analyze API, that the analyzer removes the whitespace from the query. Any idea why?
This seems to work with your setup:
POST /test_index/_search
{
  "query": {
    "match": {
      "title.specialised": "anti emetics"
    }
  }
}
Here's some code I set up to play with it:
http://sense.qbox.io/gist/3ef6926644213cf7db568557a801fec6cb15eaf9
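To see what the merge analyzer emits at search time, the _analyze API is useful (a sketch, assuming the index is named test as in the mapping above):
POST test/_analyze
{
  "analyzer": "merge",
  "text": "anti emetics"
}
#expected: a single token "antiemetics", which matches the token the standard analyzer indexed for "antiemetics"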

Partial word search - ElasticSearch 1.7.2

I've been trying to build a search module for an application, using Elasticsearch. Below is the index structure I've constructed from sample code I read in other Stack Overflow posts.
{
  "megacorp4": {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_ngram_tokenizer",
            "filter": [
              "my_ngram_filter"
            ]
          }
        },
        "filter": {
          "my_ngram_filter": {
            "type": "edgeNGram",
            "min_gram": 3,
            "max_gram": 15
          }
        },
        "tokenizer": {
          "my_ngram_tokenizer": {
            "type": "edgeNGram",
            "min_gram": 3,
            "max_gram": 15
          }
        }
      },
      "mappings": {
        "employee": {
          "properties": {
            "about": {
              "type": "string",
              "analyzer": "my_analyzer"
            },
            "age": {
              "type": "long"
            },
            "first_name": {
              "type": "string"
            },
            "interests": {
              "type": "string",
              "analyzer": "my_analyzer"
            },
            "last_name": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
Below are the records I inserted to test the search functionality
[
  {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": [
      "sports",
      "music"
    ]
  },
  {
    "first_name": "Douglas",
    "last_name": "Fir",
    "age": 35,
    "about": "I like to build album climb cabinets",
    "interests": [
      "forestry",
      "music"
    ]
  },
  {
    "first_name": "Jane",
    "last_name": "Smith",
    "age": 32,
    "about": "I like to collect rock albums",
    "interests": [
      "music"
    ]
  }
]
I ran a search on the 'about' field, both via the API (through Postman) and with the Python client, as follows:
API query:
localhost:9200/megacorp4/_search?q=climb
Python query:
from elasticsearch import Elasticsearch
from pprint import pprint
es = Elasticsearch()
res = es.search(index="megacorp4", body={"query": {"match": {'about':"climb"}}})
pprint(res)
I'm only able to obtain exact matches, and I don't get the result containing 'climbing' in the output. However, when I replace 'climb' with 'climb*' in the query, I get both records, 'climb' and 'climbing'. I don't want to use the '*' wildcard approach.
I've also tried the built-in 'english', 'standard' and 'ngram' analyzers, but nothing seemed to work.
I need help implementing partial-word search over full text.
Thanks in advance.
Use this mapping instead:
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "employee": {
      "properties": {
        "about": {
          "type": "string",
          "analyzer": "my_analyzer"
        },
        "age": {
          "type": "long"
        },
        "first_name": {
          "type": "string"
        },
        "interests": {
          "type": "string",
          "analyzer": "my_analyzer"
        },
        "last_name": {
          "type": "string"
        }
      }
    }
  }
}
POST /test/employee/_bulk
{"index":{}}
{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}
{"index":{}}
{"first_name":"Douglas","last_name":"Fir","age":35,"about":"I like to build album climb cabinets","interests":["forestry","music"]}
{"index":{}}
{"first_name":"Jane","last_name":"Smith","age":32,"about":"I like to collect rock albums","interests":["music"]}
GET /test/_search?q=about:climb
GET /test/_search
{
  "query": {
    "query_string": {
      "query": "about:climb"
    }
  }
}
GET /test/_search
{
  "query": {
    "match": {
      "about": "climb"
    }
  }
}
Two changes:
you need another closing curly bracket for the settings part
replace your custom tokenizer (which will not help you, since you already have the edgeNGram filter) with another one; my suggestion is the standard tokenizer
As for the ?q=climb part: by default this searches the _all field, which is analyzed with the standard analyzer and not with your custom one.
So the correct query is localhost:9200/megacorp4/_search?q=about:climb.
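To see why the match query now finds 'climbing' without wildcards, you can inspect the tokens the analyzer produces: with an edgeNGram filter (min_gram 3, max_gram 15) the word 'climbing' is indexed as cli, clim, climb, climbi, and so on, so the query term climb matches a stored token directly. A sketch using the analyze API in its 1.7 query-parameter form:
curl "localhost:9200/test/_analyze?analyzer=my_analyzer&text=climbing&pretty"
#expected tokens include: cli, clim, climb, climbi, climbin, climbing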
