Excluding field from _source causes aggregation to not work - elasticsearch

We're using Elasticsearch 1.7.2 and trying to use the "include/exclude from _source" feature as it's described here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
We have a field types that's 'pretty' and that we would like to return to the client, but it's not well suited to aggregations; and a field types_int (and also a types_string, but that's not relevant now) that's 'ugly' but optimized for search/aggregations, which we don't want to return to the client but do want to aggregate/filter on.
The field types_int doesn't need to be stored anywhere, it just needs to be indexed. We don't want to waste bandwidth in returning it to the client either, so we don't want to include it in _source.
The mapping for it looks like this:
"types_int": {
  "type": "nested",
  "properties": {
    "name": {
      "type": "string",
      "index": "not_analyzed"
    },
    "value_int": {
      "type": "integer"
    }
  }
}
However, after we add the exclude, our filters/aggregations on it stop working.
The excludes looks like this:
"_source": {
  "excludes": [
    "types_int"
  ]
}
Without that in the mapping, everything works fine.
An example of a filter:
POST my_index/my_type/_search
{
  "filter": {
    "nested": {
      "path": "types_int",
      "filter": {
        "term": {
          "types_int.name": "<something>"
        }
      }
    }
  }
}
Again, removing the excludes and everything works fine.
Thinking it might have something to do with nested types, since they're separate documents internally and perhaps handled differently from normal fields, I added an exclude mapping for a 'normal' value-type field, and then my filter also stopped working.
"publication": {
  "type": "string",
  "index": "not_analyzed"
}

"_source": {
  "excludes": [
    "publication"
  ]
}
So my conclusion is that after you exclude something from _source, you can no longer filter on it? That doesn't make sense to me, so I'm thinking there's something we're doing wrong here. The _source include/exclude is just a post-processing step that manipulates the string data inside that field, right?
I understand that we can also use source filtering to request specific fields to not be included at query time, but it's simply unnecessary to store it. If anything, I would just like to understand why this doesn't work :)
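For reference, the query-time source filtering mentioned above goes in the search request itself rather than in the mapping; a sketch using the field and index names from the question:

```json
POST my_index/my_type/_search
{
  "_source": {
    "excludes": ["types_int"]
  },
  "filter": {
    "nested": {
      "path": "types_int",
      "filter": {
        "term": {
          "types_int.name": "<something>"
        }
      }
    }
  }
}
```

This keeps types_int out of the response without touching the mapping-level _source settings.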

Related

Elasticsearch custom mapping definition

I have to upload data to ELK in the following format:
{
  "location": {
    "timestamp": 1522751098000,
    "resources": [
      {
        "resource": {
          "name": "Node1"
        },
        "probability": 0.1
      },
      {
        "resource": {
          "name": "Node2"
        },
        "probability": 0.01
      }
    ]
  }
}
I'm trying to define a mapping for this kind of data, and I produced the following mapping:
{
  "mappings": {
    "doc": {
      "properties": {
        "location": {
          "properties": {
            "timestamp": {"type": "date"},
            "resources": []
          }
        }
      }
    }
  }
}
I have 2 questions:
how can I define the "resources" array in my mapping?
is it possible to define a custom type (e.g. resource) and use this type in my mapping (e.g "resources": [{type:resource}]) ?
There are a lot of things to know about the Elasticsearch mapping. I really highly suggest reading through at least some of their documentation.
Short answers first, in case you don't care:
Elasticsearch automatically allows storing one or multiple values of defined objects; there is no need to specify an array. See Marker 1 or refer to their documentation on array types.
I don't think there is. Since Elasticsearch 6, only one type per index is allowed. Nested objects are probably the closest thing, but you define them in the same mapping. Nested objects are stored as separate documents internally.
Long answer and some thoughts
Take a look at the following mapping:
"mappings": {
  "doc": {
    "properties": {
      "location": {
        "properties": {
          "timestamp": {
            "type": "date"
          },
          "resources": {          [1]
            "type": "nested",     [2]
            "properties": {
              "resource": {
                "properties": {
                  "name": {       [3]
                    "type": "text"
                  }
                }
              },
              "probability": {
                "type": "float"
              }
            }
          }
        }
      }
    }
  }
}
This is how your mapping could look. It can be done differently, but I think it makes sense this way - maybe except Marker 3. I'll come to these right now:
Marker 1: When you define a field, you usually give it a type. I defined resources as a nested type, while your timestamp is of type date. Elasticsearch automatically allows storing one or multiple values of these objects; timestamp could actually also contain an array of dates, and there is no need to declare an array.
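To illustrate Marker 1, both of these documents are valid under the mapping above, even though no array type is declared anywhere (the index name and document IDs are made up):

```json
PUT my_index/doc/1
{ "location": { "timestamp": "2018-04-03" } }

PUT my_index/doc/2
{ "location": { "timestamp": ["2018-04-03", "2018-04-04"] } }
```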
Marker 2: I defined resources as a nested type, but it could also be an object like resource a little below (where no type is given). Read about nested objects here. In the end I don't know what your queries would look like, so not sure if you really need the nested type.
Marker 3: I want to address two things here. First, I want to mention again that resource is defined as a normal object with property name. You could do that for resources as well.
Second thing is more a thought-provoking impulse: Don't take it too seriously if something absolutely doesn't fit your case. Just take it as an opinion.
This mapping structure looks very much inspired by a relational database approach. For Elasticsearch you usually want to define the document structure around the expected searches. Redundancy is not a problem, but nested objects can make your queries complicated. I think I would omit the whole resources part and do something like this:
"mappings": {
  "doc": {
    "properties": {
      "location": {
        "properties": {
          "timestamp": {
            "type": "date"
          },
          "resource": {
            "properties": {
              "resourceName": {
                "type": "text"
              },
              "resourceProbability": {
                "type": "float"
              }
            }
          }
        }
      }
    }
  }
}
Because, as I said, in this case resource can contain an array of objects, each having a resourceName and a resourceProbability.
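With that flattened mapping, the example document from the question would be indexed roughly like this (same values, restructured):

```json
{
  "location": {
    "timestamp": 1522751098000,
    "resource": [
      { "resourceName": "Node1", "resourceProbability": 0.1 },
      { "resourceName": "Node2", "resourceProbability": 0.01 }
    ]
  }
}
```

One caveat of the flat form: with multiple entries, the association between a name and its probability is lost for queries, which is exactly what the nested type preserves.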

Elasticsearch - template matching based on field value

Imagine this document:
{
  "_index": "project.datasync.20180101",
  "_type": "com.redhat.viaq.common",
  "service": "data-sync-server",
  "data": {
    "foo": "bar"
  }
  ...
}
I would like to have a mapping for the "data.foo" field (imagine I need some changes to how it is indexed etc.).
I know I can match indices like this:
{
  "template": "project.datasync.*",
  "order": 100,
  "mappings": {
    "data": {
      "enabled": true,
      "properties": {
        "foo": {"type": "string", "index": "not_analyzed", ...}
      }
    }
  }
}
However, the datasync part in the index name comes from somewhere else and there's no guarantee that it will be datasync or something similar that matches a pattern.
So, my index template wouldn't match if the index is project.thedatasync.20180101.
I know I can use project.* in my index template for matching, but in that case it is too generic where it matches irrelevant things.
So, I would like to have this mapping active only when service is data-sync-server which is always true for the documents that I am interested in.
Any ideas? This may be something fundamentally against how Elasticsearch works, and in that case I would like to have that clarified.
Please note that documents are sent to Elasticsearch with Fluentd, and I don't have access to the Fluentd config to change the index name there.

Elasticsearch Terms aggregation with unknown datatype

I'm indexing data of unknown schema in Elasticsearch using dynamic mapping, i.e. we don't know the shape, datatypes, etc. of much of the data ahead of time. In queries, I want to be able to aggregate on any field. Strings are (by default) mapped as both text and keyword types, and only the latter can be aggregated on. So for strings my terms aggregations must look like this:
"aggs": {
  "something": {
    "terms": {
      "field": "something.keyword"
    }
  }
}
But other types like numbers and bools do not have this .keyword sub-field, so aggregations for those must look like this (which would fail for text fields):
"aggs": {
  "something": {
    "terms": {
      "field": "something"
    }
  }
}
Is there any way to specify a terms aggregation that basically says "if something.keyword exists, use that, otherwise just use something", and without taking a significant performance hit?
Requiring datatype information to be provided at query time might be an option for me, but ideally I want to avoid it if possible.
If the primary use case is aggregations, it may be worth changing the dynamic mapping for string properties to index them as a keyword datatype, with a multi-field sub-field indexed as a text datatype, i.e. in dynamic_templates:
{
  "strings": {
    "match_mapping_type": "string",
    "mapping": {
      "type": "keyword",
      "ignore_above": 256,
      "fields": {
        "text": {
          "type": "text"
        }
      }
    }
  }
}
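With a dynamic template like this in place, new string fields are keyword at the top level (and numbers/booleans map as detected), so the same terms aggregation shape works for every datatype, while full-text queries can target the something.text sub-field instead:

```json
"aggs": {
  "something": {
    "terms": {
      "field": "something"
    }
  }
}
```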

Boost field on index in Elastic

I'm using Elastic 1.7.3 and I would like to have a boost on some fields in an index with documents like this fictional example:
{
  "title": "Mickey Mouse",
  "content": "Mickey Mouse is a fictional ...",
  "related_articles": [
    {"title": "Donald Duck"},
    {"title": "Goofy"}
  ]
}
Here, e.g., title is really important, content too, and related_articles a bit less important. My real documents have lots of fields and nested objects.
I would like to give more weight to the title field than content, and more to content than related_articles.
I have seen the title^5 way, but I'd have to use it in each query, and I'd (I guess) have to list all my fields instead of using a "_all" query.
I have searched a lot, but I mostly found deprecated solutions (e.g. _boost).
Coming from Sphinx, I'm looking for something that works like its field weight option, where you can give more weight to the fields in your index that are more important than others.
You're right that the _boost meta-field that you could use at the type level has been deprecated.
But you can still use the boost property when defining each field in your mapping, which will boost your field at indexing time.
Your mapping would look like this:
{
  "my_type": {
    "properties": {
      "title": {
        "type": "string", "boost": 5
      },
      "content": {
        "type": "string", "boost": 4
      },
      "related_articles": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "string", "boost": 3
          }
        }
      }
    }
  }
}
You have to be aware, though, that it's not necessarily a good idea to boost your field at index time, because once set, you cannot change it unless you are willing to re-index all of your documents, whereas using query-time boosting achieves the same effect and can be changed more easily.
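For comparison, a query-time equivalent using the title^5 syntax mentioned in the question could look like this sketch (the weights are the same illustrative values as above; note that the nested related_articles.title would need a separate nested query rather than a multi_match):

```json
{
  "query": {
    "multi_match": {
      "query": "mickey mouse",
      "fields": ["title^5", "content^4"]
    }
  }
}
```

Changing these weights later only requires editing the query, not re-indexing.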

ElasticSearch what analyzer should be used for searching for both url fragment and exact url path

I want to store uri in a mapping and I want to make it searchable the following way:
Exact match (i.e. if I stored http://stackoverflow.com/questions, then looking for the term http://stackoverflow.com/questions retrieves the item).
A bit like the letter tokenizer, all "words" should be searchable. So searching for questions, stackoverflow or maybe com will bring back http://stackoverflow.com/questions as a hit.
Looking for '.' or '/' separated url fragments should still work. So searching for stackoverflow.com will bring back http://stackoverflow.com/questions as a hit.
Should be case insensitive (like lowercase).
The http://, https://, www. etc. should be optional for searching. So searching for either http://stackoverflow.com or stackoverflow.com will bring back http://stackoverflow.com/questions as a hit.
Maybe a solution should be something like chaining tokenizers or something like that. I'm quite new to ES so this is maybe a trivial question.
So what kind of analyzer should I use/build to achieve this functionality?
Any help would be greatly appreciated.
You are absolutely correct. You will want to set your field type as multi_field and then create analyzers for each scenario. At the core, you can then do a multi_match query:
=============type properties===============
{
  "fun_documents": {
    "properties": {
      "url": {
        "type": "multi_field",
        "fields": {
          "keyword": {
            "type": "string",
            "analyzer": "keyword"
          },
          "alphanum_only": {
            "type": "string",
            "analyzer": "my_custom_alpha_num_analyzer"
          },
          "etc": "etc"
        }
      }
    }
  }
}
==================query=====================
{
  "query": {
    "multi_match": {
      "query": "stackoverflow",
      "fields": [
        "url.keyword",
        "url.alphanum_only",
        "url.optional_fun"
      ]
    }
  }
}
Note that you can get fancy with multi_field aliases and reusing the same name, but this is the simple demonstration.
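For completeness, my_custom_alpha_num_analyzer above is just a placeholder name; a minimal definition matching its intent (split on non-letters, lowercase everything) could look like this in the index settings:

```json
"settings": {
  "analysis": {
    "analyzer": {
      "my_custom_alpha_num_analyzer": {
        "type": "custom",
        "tokenizer": "letter",
        "filter": ["lowercase"]
      }
    }
  }
}
```

That covers the "all words searchable" and case-insensitivity requirements; the exact-match and protocol-stripping scenarios would each get their own analyzer in the same analysis block.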
