How can I force a float casting on Elasticsearch?

I have an Elasticsearch index with this mapping:
{
  "book": {
    "mappings": {
      "educational": {
        "properties": {
          "price": {
            "type": "float"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
Now I can index a document with a string instead of a float:
{
  "title": "Test",
  "price": "120.99"
}
The value of price is returned as a string when I retrieve this document later, even though the mapping says it should be a float.
I know that the price will still be indexed as a float despite being returned as a string, but is there a way to force a cast of the field to a float, for better coherence in the data?

Internally the field will be stored as a float when coercion is used. However, the original document will not be changed, which means the returned JSON (the _source) will still contain the field as a string.
You could use a convert processor in an ingest pipeline to change the string to a float before the document is indexed.
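For example, here is a minimal sketch with the Python client (the index and type names come from this question; the pipeline id price-to-float is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# define an ingest pipeline with a convert processor that casts price to float
es.ingest.put_pipeline(id='price-to-float', body={
    'description': 'cast price from string to float',
    'processors': [
        {'convert': {'field': 'price', 'type': 'float'}}
    ]
})

# index the document through the pipeline; the stored _source
# now contains price as the number 120.99, not the string "120.99"
es.index(index='book', doc_type='educational',
         body={'title': 'Test', 'price': '120.99'},
         pipeline='price-to-float')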

Related

Keyword field created automatically without any mapping in Entity class

My Elasticsearch version is 7.6.2 and my spring-boot-starter-data-elasticsearch is version 2.2.0.
Due to a dependency I am not upgrading ES to the latest version.
The problem I am facing is that the ES index is sometimes created with .keyword sub-fields and sometimes with just plain text fields.
Below is my entity class. I am not able to find out why this is happening. I read that every text field will also have a keyword sub-field, but why is it not created always?
My entity class:
import java.util.UUID;

import lombok.Getter;
import lombok.Setter;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

@Setter
@Getter
@Document(indexName = "myindex", createIndex = true, shards = 4)
public class MyIndex {

    @Field(type = FieldType.Keyword)
    private String place;

    @Field(type = FieldType.Text)
    private String name;

    @Id
    private String dynamicId = UUID.randomUUID().toString();

    public MyIndex() {}
}
Mapping in ES:
{
  "mappings": {
    "myindex": {
      "properties": {
        "place": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "dynamicId": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Sometimes, for the same entity class, it is created as below:
{
  "mappings": {
    "myindex": {
      "properties": {
        "place": {
          "type": "keyword"
        },
        "name": {
          "type": "text"
        },
        "dynamicId": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
With the entity definition shown, when Spring Data Elasticsearch creates the index and writes the mapping, you will get the mapping shown in your second example, with these values for the properties:
{
  "properties": {
    "place": {
      "type": "keyword"
    },
    "name": {
      "type": "text"
    }
  }
}
If you want to have a nested keyword property in Spring Data Elasticsearch, you have to define it on the entity with the corresponding annotation (e.g. a @MultiField annotation with an @InnerField for the keyword sub-field).
Please note: the @Id property is not mapped explicitly but will be dynamically mapped on first indexing of a document.
The mapping in the first case, and the part in the second where a String is mapped as
  "type": "text",
  "fields": {
    "keyword": {
      "ignore_above": 256,
      "type": "keyword"
    }
  }
is the default value that Elasticsearch uses when a document is indexed with a string field that was not mapped before - see the docs about dynamic mapping.
So your second example shows the mapping of an index that was created by Spring Data Elasticsearch and into which some documents have been indexed afterwards.
The first one would be created by Elasticsearch itself if some other application creates the index and writes data into it. It could also be that the index was created outside your application; on application startup no mapping would then be written, because the index already exists. So you should review the way your indices are created.
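This default is easy to reproduce with any client, since the behaviour lives in Elasticsearch itself; a minimal sketch with the Python client (the index name dynamic-demo is made up):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# index a document with an unmapped string field into a fresh index
es.index(index='dynamic-demo', body={'name': 'some text'})

# the dynamically created mapping maps "name" as "text"
# with a "keyword" sub-field (ignore_above: 256)
print(es.indices.get_mapping(index='dynamic-demo'))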

Elasticsearch match string with spaces, colons, dashes exactly

I'm using Elasticsearch 6.8 and trying to write a query in a Python notebook. Here is the mapping used for the index I'm working with:
{ "mapping": { "news": { "properties": { "dateCreated": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis" }, "itemId": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "market": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "timeWindow": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } } }
I'm trying to search for an exact string like "[2020-08-16 10:00:00.0,2020-08-16 11:00:00.0]" in the "timeWindow" field (which is a "text" type, not a "date" field), and also to select by market="en-us" (market is a "text" field too). This string has spaces, colons, commas and a lot of other whitespace characters, and I don't know how to write the right query.
At the moment I have this query:
res = es.search(index='my_index',
                doc_type='news',
                body={
                    'size': size,
                    'query': {
                        "bool": {
                            "must": [
                                {
                                    "simple_query_string": {
                                        "query": "[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]",
                                        "default_operator": "and",
                                        "minimum_should_match": "100%"
                                    }
                                },
                                {"match": {"market": "en-us"}}
                            ]
                        }
                    }
                })
The problem is that it doesn't match my "simple_query_string" for the timeWindow string exactly (I understand that this string gets tokenized and split into parts like "2020", "08", "17", "00", "01", etc., and each token is analyzed separately), and I'm getting different values for timeWindow that I want to exclude, like
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 00:05:00.0,2020-08-17 01:05:00.0]'
...
'[2020-08-17 00:50:00.0,2020-08-17 01:50:00.0]'
'[2020-08-17 00:55:00.0,2020-08-17 01:55:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]']
Is there a way to do what I want?
UPD (and answer):
My current query uses "term" and "timeWindow.keyword"; this combination allows me to do an exact search for a string with spaces and other whitespace characters:
res = es.search(index='msn_click_events', doc_type='news', body={
    'size': size,
    'query': {
        "bool": {
            "must": [
                {
                    "term": {
                        "timeWindow.keyword": tw
                    }
                },
                {"match": {"market": "en-us"}}
            ]
        }
    }
})
And this query selects only the right timeWindow values (strings):
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]'
'[2020-08-17 02:00:00.0,2020-08-17 03:00:00.0]'
...
'[2020-08-17 22:00:00.0,2020-08-17 23:00:00.0]'
'[2020-08-17 23:00:00.0,2020-08-18 00:00:00.0]']
On your timeWindow field you need a keyword, i.e. exact, search, but you are using a full-text query. As you defined this field as a text field, it gets analyzed at index time - you already guessed that correctly - hence you are not getting the correct results.
If you are using dynamic mapping, then a .keyword sub-field is generated for each text field in the mapping, so you can simply use timeWindow.keyword in your query and it will work.
If you have defined the mapping yourself, then you need to add a keyword sub-field to store the timeWindow, reindex the data, and use that keyword field in the query to get the expected results, as sketched below.
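A sketch of that last option with the Python client (index and type names taken from the question, and ES 6.x still requires the doc type; the rest is an assumption). A keyword sub-field can be added to an existing text field without recreating the index, but documents indexed before the change only get the sub-field once they are reindexed, e.g. with update_by_query:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# add a keyword sub-field to the existing timeWindow text field
es.indices.put_mapping(index='my_index', doc_type='news', body={
    'properties': {
        'timeWindow': {
            'type': 'text',
            'fields': {
                'keyword': {'type': 'keyword', 'ignore_above': 256}
            }
        }
    }
})

# reindex in place so the new sub-field gets populated for old documents
es.update_by_query(index='my_index', conflicts='proceed')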

How to declare mapping for nested fields in Elasticsearch to allow for storing different types?

In essence, I want my mapping to be as schemaless as possible, but to allow for nested types and to be able to store data that may have different types.
When I try to add a document where some fields have different types of values, I get an error like this:
"type": "illegal_argument_exception",
"reason": "mapper [data.customData.value] of different type, current_type [long], merged_type [text]"
This can easily be solved by mapping the field value to text (or by creating the mapping dynamically, by first inserting a document with only text values). However, I would like to avoid having a schema. Perhaps all of the fields nested in customData could be set to text? How do I do that?
I had the problem earlier, but then it started working after I accidentally managed to get a dynamic mapping that worked (since everything was regarded as text). I was later made aware of this problem when I needed to change the mapping to allow for nested types.
Documents with this kind of data are troublesome to store successfully:
"customData": [
{
"value": "some_text",
"key": "some_text"
},
{
"value": 0,
"key": "some_text"
}
]
A part of the mapping that works:
{
  "my_index": {
    "aliases": {},
    "mappings": {
      "_doc": {
        "properties": {
          "data": {
            "properties": {
              "customData": {
                "properties": {
                  "key": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "value": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          },
          "some_list": {
            "type": "nested",
            "properties": {
              "some_field": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Ideally, the mapping would be as schemaless as possible while still allowing nested types, something like this:
{
  "mappings": {
    "_doc": {
      "properties": {
        "data": {
          "type": "object"
        },
        "some_list": {
          "type": "nested"
        }
      }
    }
  }
}
So what would be the best approach to go about this problem?

Avoid creating dual mappings from logstash

I notice that Logstash creates an extra "keyword" field in the index mapping for every string field that it extracts from the log files and sends to Elasticsearch.
There are many fields that I've removed completely with the prune plugin, but there are other fields that I don't want to remove entirely, yet I also don't need a *.keyword for them.
Is there a way to have Logstash only create *.keyword fields for some fields and not others? Specifically, is there a way for Logstash to have a whitelist of fields that it is OK to create *.keyword fields for, and not do it for anything else?
(using Elasticsearch 6.x)
I think you need to change the mapping of the desired fields. The mapping page shows the default text type mapping:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/breaking_50_mapping_changes.html
I tried to set a field without a keyword sub-field and it worked, except that you couldn't aggregate on that field (I tried a terms aggregation) even if you set index: true in the mapping. I might have missed something, but I think this is where you should start.
The solution I'm working with for now is dynamic templates.
I can map some fields to just text and others to text plus a keyword. For example:
{
  "mappings": {
    "doc": {
      "dynamic_templates": [
        {
          "match_my_custom_fields": {
            "match_mapping_type": "string",
            "match": "custom_prefix_*",
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      ],
      "properties": {
        "geoip": {
          "dynamic": true,
          "properties": {
            "ip": {
              "type": "ip"
            },
            "location": {
              "type": "geo_point"
            },
            "latitude": {
              "type": "half_float"
            },
            "longitude": {
              "type": "half_float"
            }
          }
        }
      }
    }
  }
}
This way, everything beginning with custom_prefix_ will have a text and keyword field, and everything else will just have a keyword.
Of course, I somehow broke the geoip.geo_point that was being emitted by the geoip Logstash plugin, and now my map visualizations won't work, so I need to figure out how to restore that.
EDIT: Got geo_point working again, see the "geoip" property in the mapping above.
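For the indices that Logstash keeps creating to pick these mappings up, the dynamic templates have to be installed as an index template on the Elasticsearch side; a sketch with the Python client (the template name and index pattern are assumptions, and the body abbreviates the mapping shown above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# install the mapping as an index template so every new logstash-* index
# gets the dynamic templates (name and pattern are examples)
es.indices.put_template(name='logstash-custom', body={
    'index_patterns': ['logstash-*'],
    'mappings': {
        'doc': {
            'dynamic_templates': [
                {'match_my_custom_fields': {
                    'match_mapping_type': 'string',
                    'match': 'custom_prefix_*',
                    'mapping': {
                        'type': 'text',
                        'fields': {'raw': {'type': 'keyword', 'ignore_above': 256}}
                    }
                }},
                {'strings_as_keywords': {
                    'match_mapping_type': 'string',
                    'mapping': {'type': 'keyword', 'ignore_above': 256}
                }}
            ]
        }
    }
})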

Elasticsearch common mapping type and running aggregations based on type of data

We have an Elasticsearch index with the following mapping (showing only the part relevant to this question):
"instFields": {
"properties": {
"_index": {
"type": "object"
},
"fieldValue": {
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
},
"type": "string"
},
"sourceFieldId": {
"type": "integer"
}
},
"type": "nested"
}
As you can see, the type of fieldValue is string. In the original data, the fieldValue column is stored in a JSON type column in the database (PostgreSQL). The use case is such that, when this data is stored, fieldValue can be a valid JsNumber, JsString or JsBoolean (any valid JsValue). Now, when storing this fieldValue in ES it has to have a definite type, so we convert fieldValue to a string while pushing the data into Elasticsearch.
Following is a sample of the data from Elasticsearch:
"instFields": [
{
"sourceFieldId": 1233,
"fieldValue": "Demo Logistics LLC"
},
{
"sourceFieldId": 1236,
"fieldValue": "169451"
}
]
This is where it gets interesting: we now want to run various metrics aggregations on fieldValue - e.g. if sourceFieldId = 1236 then run avg on fieldValue. The problem is that fieldValue had to be stored as a string in ES, since fieldValue is originally a JsValue-typed field in the application. What's the best way to create the mapping in Elasticsearch such that fieldValue can be stored with an appropriate type instead of a string type, so that various metrics aggregations can be run on fieldValues which are really of type long (though encoded as strings in ES)?
One of the ways to achieve this is to create different fields in Elasticsearch for all the possible types of JsValue (e.g. JsNumber, JsBoolean, JsString etc.). While indexing, the application can derive the proper type of the JsValue field to find out whether it's a JsString, JsNumber, JsBoolean etc.
On the application side I can decode the proper type of the fieldValue being indexed:
value match {
  case JsString(s)  => // ...
  case JsNumber(n)  => // ...
  case JsBoolean(b) => // ...
}
Now modify the mapping in Elasticsearch and add more fields, each with the proper type, as shown below:
"instFields": {
"properties": {
"_index": {
"type": "object"
},
"fieldBoolean": {
"type": "boolean"
},
"fieldDate": {
"fields": {
"raw": {
"format": "dateOptionalTime",
"type": "date"
}
},
"format": "dateOptionalTime",
"type": "date"
},
"fieldDouble": {
"fields": {
"raw": {
"type": "double"
}
},
"type": "double"
},
"fieldLong": {
"fields": {
"raw": {
"type": "long"
}
},
"type": "long"
},
"fieldString": {
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
},
"type": "string"
},
"fieldValue": {
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
},
"type": "string"
}
Now, at the time of indexing:
value match {
  case JsString(s)  => // populate fieldString
  case JsNumber(n)  => // populate fieldDouble (there is also fieldLong)
  case JsBoolean(b) => // populate fieldBoolean
}
This way a boolean value is stored in fieldBoolean, a number is stored in fieldLong (or fieldDouble), etc. Running a metrics aggregation then becomes normal business by going against the fieldLong or fieldDouble field (depending on the query use case). Notice that the fieldValue field is still there in the ES mapping and index as before; the application will continue to convert the value to a string and store it in fieldValue, so queries which don't care about types can still query the fieldValue field alone.
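For example, once numbers live in fieldLong, the avg metric can be computed with a plain nested aggregation; a sketch with the Python client (the index name my_index is an assumption, instFields is the nested field from the mapping above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# average fieldLong over the nested instFields entries with sourceFieldId 1236
res = es.search(index='my_index', body={
    'size': 0,
    'aggs': {
        'inst': {
            'nested': {'path': 'instFields'},
            'aggs': {
                'only_1236': {
                    'filter': {'term': {'instFields.sourceFieldId': 1236}},
                    'aggs': {
                        'avg_value': {'avg': {'field': 'instFields.fieldLong'}}
                    }
                }
            }
        }
    }
})
print(res['aggregations']['inst']['only_1236']['avg_value']['value'])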
It sounds like you should have two separate fields, one for the case when the value is a string and one for when it is an instance of a number.
Depending on how you're indexing this data, it can be easy or hard. However, it's a bit strange that you have a field that could be a string or a number.
Regardless, Elasticsearch is not going to be able to do both in a single field.
