Elasticsearch: match string with spaces, colons, dashes exactly

I'm using Elasticsearch 6.8 and trying to write a query in a Python notebook. Here is the mapping used for the index I'm working with:
{ "mapping": { "news": { "properties": { "dateCreated": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis" }, "itemId": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "market": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "timeWindow": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } } }
I'm trying to search for an exact string like "[2020-08-16 10:00:00.0,2020-08-16 11:00:00.0]" in the "timeWindow" field (which is a "text" type, not a "date" field), and also to select by market="en-us" (market is a "text" field too). This string has spaces, colons, commas, and other whitespace characters, and I don't know how to build the right query.
At the moment I have this query:
res = es.search(index='my_index',
                doc_type='news',
                body={
                    'size': size,
                    'query': {
                        "bool": {
                            "must": [
                                {
                                    "simple_query_string": {
                                        "query": "[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]",
                                        "default_operator": "and",
                                        "minimum_should_match": "100%"
                                    }
                                },
                                {"match": {"market": "en-us"}}
                            ]
                        }
                    }
                })
The problem is that it doesn't match my "simple_query_string" for the timeWindow string exactly (I understand that this string gets tokenized and split into parts like "2020", "08", "17", "00", "01", etc., and each token is analyzed separately), so I'm getting other timeWindow values that I want to exclude, like
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 00:05:00.0,2020-08-17 01:05:00.0]'
...
'[2020-08-17 00:50:00.0,2020-08-17 01:50:00.0]'
'[2020-08-17 00:55:00.0,2020-08-17 01:55:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]']
Is there a way to do what I want?
UPD (and answer):
My current query uses "term" and "timeWindow.keyword"; this combination lets me do an exact search for a string with spaces and other whitespace characters:
res = es.search(index='msn_click_events', doc_type='news', body={
    'size': size,
    'query': {
        "bool": {
            "must": [
                {"term": {"timeWindow.keyword": tw}},
                {"match": {"market": "en-us"}}
            ]
        }
    }
})
And this query selects only the right timeWindow values (strings):
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]'
'[2020-08-17 02:00:00.0,2020-08-17 03:00:00.0]'
...
'[2020-08-17 22:00:00.0,2020-08-17 23:00:00.0]'
'[2020-08-17 23:00:00.0,2020-08-18 00:00:00.0]']

On your timeWindow field you need a keyword, i.e. exact, search, but you are running a full-text query. As you defined this field as a text field (and as you already guessed correctly), it gets analyzed at index time, hence you are not getting the correct results.
If you are using dynamic mapping, a .keyword sub-field is generated for every text field in the mapping, so you can simply use timeWindow.keyword in your query and it will work.
If you have defined your own mapping, then you need to add a keyword field to store the timeWindow, reindex the data, and use that keyword field in the query to get the expected results.
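For that last case, here is a minimal sketch (assuming the ES 6.x Python client and the index/type names from the question). Adding a sub-field to an existing text field is an additive change, so the index does not have to be re-created; _update_by_query then rewrites the existing documents in place so the new sub-field gets populated:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Add a keyword sub-field to the existing text field (additive, allowed on a live index)
es.indices.put_mapping(index='my_index', doc_type='news', body={
    "properties": {
        "timeWindow": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            }
        }
    }
})

# Re-index the existing documents in place so timeWindow.keyword gets values
es.update_by_query(index='my_index', conflicts='proceed')

After that, the term query on timeWindow.keyword from the update above works as expected.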

Related

Elasticsearch - Using copy_to on the fields of a nested type

Elastic version 7.17
Below I've pasted a simplified version of my mappings, which represent a nested object structure. One top-level object will have one or more second-level objects. A second-level object will have one or more third-level objects. Fields field_a, field_b, and field_c on third-level objects are all related to each other, so I'd like to copy them into a single field that can be partially matched against. I've done this on a lot of attributes at the top-level-object level, so I know it works.
{
  "mappings": {
    "_doc": { // one top-level object
      "dynamic": "false",
      "properties": {
        "second-level-objects": { // one or more second-level objects
          "type": "nested",
          "dynamic": "false",
          "properties": {
            "third-level-objects": { // one or more third-level objects
              "type": "nested",
              "dynamic": "false",
              "properties": {
                "my_copy_to_field": { // should hold the values from field_a, field_b, and field_c
                  "type": "text",
                  "index": true
                },
                "field_a": {
                  "type": "keyword",
                  "index": false,
                  "copy_to": "my_copy_to_field"
                },
                "field_b": {
                  "type": "long",
                  "index": false,
                  "copy_to": "my_copy_to_field"
                },
                "field_c": {
                  "type": "keyword",
                  "index": false,
                  "copy_to": "my_copy_to_field"
                },
                "field_d": {
                  "type": "keyword",
                  "index": true
                }
              }
            }
          }
        }
      }
    }
  }
}
However, when I run a nested query against that my_copy_to_field I get no results because the field is never populated, even though I know my documents have data in the 3 fields with copy_to. If I perform a nested query against field_d which is not part of the copied info I get the expected results, so it seems like there's something going on with nested (or double-nested in my case) usage of copy_to that I'm overlooking. Here is my query which returns nothing:
GET /my_index/_search
{
  "query": {
    "nested": {
      "inner_hits": {},
      "path": "second-level-objects",
      "query": {
        "nested": {
          "inner_hits": {},
          "path": "second-level-objects.third-level-objects",
          "query": {
            "bool": {
              "should": [
                { "match": { "second-level-objects.third-level-objects.my_copy_to_field": "my search value" } }
              ]
            }
          }
        }
      }
    }
  }
}
I've tried adding include_in_root:true to the third-level-objects, but that didn't make any difference. If I could just get the field to populate with the copied data then I'm sure I can work through getting the query working. Is there something I'm missing about using copy_to with nested fields?
Additionally, when I view my data in Kibana -> Discover, I see second-level-objects as an available "Nested" field, but I don't see anything for third-level-objects, even though KQL recognizes it as a field. Is that symptomatic of an issue?
You must use the complete nested path in copy_to, like this:
"field_a": {
"type": "keyword",
"copy_to": "second-level-objects.third-level-objects.my_copy_to_field"
},
"field_b": {
"type": "long",
"copy_to": "second-level-objects.third-level-objects.my_copy_to_field"
},
"field_c": {
"type": "keyword",
"copy_to": "second-level-objects.third-level-objects.my_copy_to_field"
}
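To sanity-check the fix, a minimal sketch (assuming an index named my_index that has been re-created with the full-path copy_to and re-indexed, using the ES 7.x Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# The copied field lives inside the doubly-nested documents,
# so the query still has to walk both nested paths
res = es.search(index='my_index', body={
    "query": {
        "nested": {
            "path": "second-level-objects",
            "query": {
                "nested": {
                    "path": "second-level-objects.third-level-objects",
                    "query": {
                        "match": {
                            "second-level-objects.third-level-objects.my_copy_to_field": "my search value"
                        }
                    }
                }
            }
        }
    }
})
print(res["hits"]["total"]["value"])  # should now be > 0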

How to declare mapping for nested fields in Elasticsearch to allow for storing different types?

In essence, I want my mapping to be as schemaless as possible, but to allow for nested types and to store data whose fields may have different types.
When I try to add a document where some fields have different types of values, I get an error like this:
"type": "illegal_argument_exception",
"reason": "mapper [data.customData.value] of different type, current_type [long], merged_type [text]"
This can easily be solved by mapping the field value to text (or creating it dynamically by first inserting a document containing only text values). However, I would like to avoid having a schema. Perhaps having all of the fields nested in customData be set to text? How do I do that?
I had this problem earlier, but it started working after I accidentally ended up with a dynamic mapping that worked (since everything was treated as text). I was made aware of the problem again when I needed to change the mapping to allow for nested types.
Documents with this kind of data are troublesome to store successfully:
"customData": [
{
"value": "some_text",
"key": "some_text"
},
{
"value": 0,
"key": "some_text"
}
]
A part of the mapping that works:
{
  "my_index": {
    "aliases": {},
    "mappings": {
      "_doc": {
        "properties": {
          "data": {
            "properties": {
              "customData": {
                "properties": {
                  "key": {
                    "type": "text",
                    "fields": {
                      "keyword": { "type": "keyword", "ignore_above": 256 }
                    }
                  },
                  "value": {
                    "type": "text",
                    "fields": {
                      "keyword": { "type": "keyword", "ignore_above": 256 }
                    }
                  }
                }
              }
            }
          },
          "some_list": {
            "type": "nested",
            "properties": {
              "some_field": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              }
            }
          }
        }
      }
    }
  }
}
Ideally, the mapping would be as minimal as this, while still accepting that kind of mixed-type data:
{
  "mappings": {
    "_doc": {
      "properties": {
        "data": {
          "type": "object"
        },
        "some_list": {
          "type": "nested"
        }
      }
    }
  }
}
So what would be the best approach to go about this problem?
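The thread has no accepted answer, but one possible approach, sketched here under the assumption that forcing everything under data.customData to text is acceptable (the question's own suggestion), is a dynamic template with path_match; the template name custom_data_as_text is made up. Numeric JSON values like 0 are coerced to strings when indexed into a text field, so the mixed-type documents above should then index cleanly:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index='my_index', body={
    "mappings": {
        "_doc": {
            "dynamic_templates": [
                {
                    "custom_data_as_text": {  # hypothetical template name
                        "path_match": "data.customData.*",
                        "mapping": {
                            "type": "text",
                            "fields": {
                                "keyword": {"type": "keyword", "ignore_above": 256}
                            }
                        }
                    }
                }
            ],
            "properties": {
                "some_list": {"type": "nested"}
            }
        }
    }
})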

How to set elasticsearch index mapping as not_analysed for all the fields

I want my Elasticsearch index to match the exact value for all the fields. How do I map my index as "not_analysed" for all the fields?
I'd suggest making use of multi-fields in your mapping (which is the default behavior if you don't create the mapping yourself, i.e. dynamic mapping).
That way you can switch between traditional full-text search and exact-match search when required.
Note that for exact matches you need the keyword datatype plus a Term Query, as sketched below.
Hope it helps!
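For illustration, a minimal sketch of such an exact match against a .keyword multi-field (the field name color and the value are made up, borrowed from the products2 example in the next answer; assumes dynamic mapping created color.keyword):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Term query against the keyword sub-field: no analysis, exact value match
res = es.search(index='products2', body={
    "query": {
        "term": {"color.keyword": "dark red"}
    }
})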
You can use a dynamic_templates mapping for this. By default, Elasticsearch maps string fields as text with index: true, like below:
{
  "products2": {
    "mappings": {
      "product": {
        "properties": {
          "color": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "type": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    }
  }
}
As you can see, it also creates a keyword field as a multi-field. This keyword field is indexed but not analyzed the way text is. If you want to drop this default behaviour, you can use the configuration below when creating the index:
PUT products
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "product": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "index": false
            }
          }
        }
      ]
    }
  }
}
After doing this, the index mapping will look like this:
{
  "products": {
    "mappings": {
      "product": {
        "dynamic_templates": [
          {
            "strings": {
              "match_mapping_type": "string",
              "mapping": {
                "type": "keyword",
                "index": false
              }
            }
          }
        ],
        "properties": {
          "color": {
            "type": "keyword",
            "index": false
          },
          "type": {
            "type": "keyword",
            "index": false
          }
        }
      }
    }
  }
}
Note: I don't know your exact use case, but you can use the multi-field feature as mentioned by @Kamal. Otherwise, you cannot search on those non-indexed fields at all. You can also use a dynamic_templates mapping that keeps some fields analyzed; see the sketch after the links below.
Please check the documentation for more information :
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html
I also explained this behaviour in an article; sorry, but it is in Turkish. You can check its example code samples with Google Translate if you want.
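If you do need exact-match search on every string field, here is a variant of the template above that keeps the fields indexed (a sketch; the index name products_v2 is made up). keyword fields are indexed but never analyzed, so term queries match the exact stored value:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index='products_v2', body={
    "mappings": {
        "product": {
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"}  # indexed by default, so term queries work
                    }
                }
            ]
        }
    }
})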

Cannot search Elastic Search with string containing "/0"

I'm using Elasticsearch to search for my IDs. An ID looks like 21.11101/0000-0000-9B71-2. However, when I use it in ES, it throws an error saying:
Failed to parse query [21.11101/0000-0000-9B71-2]"
Encountered: EOF after : \"/0000-0000-9B71-2\"
I think this happens because ES treats /0 as an EOF character. How can I tell ES to treat my search query as normal text? Here is my query:
GET _search
{
  "query": {
    "query_string": {
      "query": "21.11101/0000-0000-9B71-2"
    }
  }
}
And here is my _mapping:
"PID": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I want to use a Prefix Query, so basically my query will look like this:
GET _search
{
  "query": {
    "prefix": {
      "PID": "21.11101/00"
    }
  }
}
But that query returns 0 hits, I guess because of the error above. So, how can I solve this problem?
Thanks for your help.
Let's break it down a little bit. What your mapping says is:
"PID": { "type": "text", ... — the content of the PID field will be text with the default analyzer, which is standard. It will tokenize your input text on whitespace and some word delimiters, like - (hyphen), / (forward slash), and some others.
... "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } — this prompts ES to keep an additional, untokenized version of your PID field as a keyword, accessible as PID.keyword. Since it isn't analyzed, it is suited to exact matching, sorting, and aggregation rather than full-text search.
Taking that into account, my suggestion is as simple as specifying a non-standard analyzer that won't tokenize the input text (the keyword analyzer):
curl -XPUT http://localhost:9200/my_pids -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "pid": {
      "properties": {
        "PID": {
          "type": "text",
          "analyzer": "keyword",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}'
If you won't use sorting / aggregation on your PID.keyword field, feel free to drop the fields section as well.
Hope it helps :)
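Alternatively, if reindexing is not an option, here is a sketch of the prefix query pointed at the existing PID.keyword sub-field from the question's mapping. The keyword version stores the whole untokenized string, so the forward slash survives and the prefix can match:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Prefix query against the untokenized keyword sub-field
res = es.search(body={
    "query": {
        "prefix": {"PID.keyword": "21.11101/00"}
    }
})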

Aggregating over _field_names in elasticsearch 5

I'm trying to aggregate over field names in ES 5 as described in "Elasticsearch aggregation on distinct keys", but the solution described there no longer works.
My goal is to get the keys across all the documents. The mapping is the default one.
Data:
PUT products/product/1
{
  "param": {
    "field1": "data",
    "field2": "data2"
  }
}
Query:
GET _search
{
  "aggs": {
    "params": {
      "terms": {
        "field": "_field_names",
        "include": "param.*",
        "size": 0
      }
    }
  }
}
I get the following error: Fielddata is not supported on field [_field_names] of type [_field_names]
After looking around, it seems the only way in ES >= 5.x to get the unique field names is through the mappings endpoint. Since you cannot aggregate on _field_names, you may need to slightly change your data format, because the mapping endpoint returns every field regardless of nesting.
My personal problem was getting unique keys for various child/parent documents.
I found that if you prefix your field names in the format prefix.field, the mapping endpoint will automatically nest the information for you:
PUT products/product/1
{
  "param.field1": "data",
  "param.field2": "data2",
  "other.field3": "data3"
}
GET products/product/_mapping
{
  "products": {
    "mappings": {
      "product": {
        "properties": {
          "other": {
            "properties": {
              "field3": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              }
            }
          },
          "param": {
            "properties": {
              "field1": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              },
              "field2": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              }
            }
          }
        }
      }
    }
  }
}
Then you can grab the unique fields based on the prefix.
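For example, a sketch of pulling the unique keys under the param prefix out of the mapping with the Python client (assuming the ES 5.x index/type names from the example above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

mapping = es.indices.get_mapping(index='products')

# Walk the mapping down to the 'param' object and collect its sub-field names
props = mapping['products']['mappings']['product']['properties']
param_fields = sorted(props.get('param', {}).get('properties', {}).keys())
print(param_fields)  # ['field1', 'field2']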
This is probably because setting size: 0 is not allowed anymore in ES 5. You have to set a specific size now.
POST _search
{
  "aggs": {
    "params": {
      "terms": {
        "field": "_field_names",
        "include": "param.*",
        "size": 100   <--- change this
      }
    }
  }
}
