Elasticsearch Terms aggregation with unknown datatype

I'm indexing data of unknown schema in Elasticsearch using dynamic mapping, i.e. we don't know the shape, datatypes, etc. of much of the data ahead of time. In queries, I want to be able to aggregate on any field. Strings are (by default) mapped as both text and keyword types, and only the latter can be aggregated on. So for strings my terms aggregations must look like this:
"aggs": {
"something": {
"terms": {
"field": "something.keyword"
}
}
}
But other types like numbers and bools do not have this .keyword sub-field, so aggregations for those must look like this (which would fail for text fields):
"aggs": {
"something": {
"terms": {
"field": "something"
}
}
}
Is there any way to specify a terms aggregation that basically says "if something.keyword exists, use that, otherwise just use something", and without taking a significant performance hit?
Requiring datatype information to be provided at query time might be an option for me, but ideally I want to avoid it if possible.

If the primary use case is aggregations, it may be worth changing the dynamic mapping so that string properties are indexed as a keyword datatype, with a multi-field sub-field indexed as a text datatype, i.e. using dynamic_templates:
"dynamic_templates": [
  {
    "strings": {
      "match_mapping_type": "string",
      "mapping": {
        "type": "keyword",
        "ignore_above": 256,
        "fields": {
          "text": {
            "type": "text"
          }
        }
      }
    }
  }
]
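With a template like this in place, the aggregation shape becomes uniform across strings, numbers and booleans, and full-text queries on strings simply target the .text sub-field instead. A minimal sketch (the field name something and the query text are placeholders):
{
  "query": {
    "match": {
      "something.text": "some search text"
    }
  },
  "aggs": {
    "something": {
      "terms": {
        "field": "something"
      }
    }
  }
}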

Related

Elasticsearch - Given a query_string with a wildcard, can I aggregate on the matched term?

I'm about to describe the use case for a terms aggregation and the reason why mappings should be properly configured, but given the state of our cluster, neither of these is an option.
I'm doing full-text searching on a terabyte of raw log data and trying to do some counts on the specific terms being matched.
Given a query string like 192.168.0.* I'm finding documents that reference terms like 192.168.0.12 somewhere in the body as expected. The specific field is not consistent.
What I'd like to do is an aggregation on the term that was found. If ES returns 100 documents in which 192.168.0.12 was found, there should be a counter that reflects this (192.168.0.12: 100). Similarly, if 50 documents were found for 192.168.0.254 I'd expect to see 192.168.0.254: 50.
Given the scale and timing this has to be done in Elasticsearch, not sideloaded and iterated application-side. Is this doable?
For this, you will need to define your mapping something like this:
"IP_ADDRESS": {
  "type": "keyword",
  "fields": {
    "raw": {
      "type": "text"
    }
  }
}
So the search will be on IP_ADDRESS.raw and the terms aggregation will be on IP_ADDRESS:
{
  "query": {
    "query_string": {
      "default_field": "IP_ADDRESS.raw",
      "query": "192.168.0.*"
    }
  },
  "aggs": {
    "count_term": {
      "terms": {
        "field": "IP_ADDRESS",
        "size": 1000
      }
    }
  }
}
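The terms aggregation in the response then carries exactly the counts asked for, one bucket per distinct value, roughly like this (illustrative numbers taken from the question):
"aggregations": {
  "count_term": {
    "buckets": [
      { "key": "192.168.0.12", "doc_count": 100 },
      { "key": "192.168.0.254", "doc_count": 50 }
    ]
  }
}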

Split a field for aggregates

I have a field with a variety of (multi-word) category tags that I'm trying to figure out how to get aggregates for. For any given document there may be one or more tags separated by | characters.
I have the following mapping for my field:
"category": {
"type": "keyword",
"fields": {
"raw" : {
"type": "keyword",
"index": "not_analyzed"
}
}
}
This works for storing the data, but when I try to get the aggregates, for example:
{"aggregates": {
"categories": {
"terms": {
"field": "category"
}
}
}}
It returns whatever the field contains. For example, if I have two documents with categories of
Facilities|Information Technology
Human Resources|Information Technology
I'd like to get back something like:
Information Technology: 2
Facilities: 1
Human Resources: 1
Any suggestions on what I need to do to either split the data as part of my mapping or aggregates query?
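One common way to handle this (not from the thread, just a sketch with a hypothetical pipeline name) is to split the tags at index time, for example with an ingest pipeline, so that category is stored as an array of keyword values and the terms aggregation then counts each tag separately:
PUT _ingest/pipeline/split_categories
{
  "processors": [
    {
      "split": {
        "field": "category",
        "separator": "\\|"
      }
    }
  ]
}
Documents indexed with ?pipeline=split_categories then hold category as an array like ["Facilities", "Information Technology"], and the existing terms aggregation returns one bucket per tag.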

Full Text Search as well as Terms Search on same field of Elasticsearch

I'm from a MySQL background, so I don't know much about Elasticsearch and how it works.
Here are my requirements:
There will be a table of result records with a sorting option on every column. There will be a filter option where the user selects multiple values for multiple columns (e.g. City should be from City1, City2, City3 and Category should be from Cat2, Cat22, Cat6). There will also be a search bar where the user enters some text and full-text search is applied to some fields (e.g. City, Area, etc.).
Where I'm facing a problem is full-text search. I have tried several mappings, but every time I have to compromise on either full-text search or terms search, so I suspect there is no way to apply both kinds of search to the same field. But as I said, I don't know much about Elasticsearch, so if anyone has a solution it will be appreciated.
Here is the mapping I'm currently using, which enables sorting and terms search, but full-text search is not working:
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "index": "not_analyzed"
        },
        "category": {
          "type": "string",
          "index": "not_analyzed"
        },
        "area": {
          "type": "string",
          "index": "not_analyzed"
        },
        "zip": {
          "type": "string",
          "index": "not_analyzed"
        },
        "state": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
You can update the mapping to use multi-fields, with two mappings: one for full-text search and another for terms search. Here's a sample mapping for city:
{
  "city": {
    "type": "string",
    "index": "not_analyzed",
    "fields": {
      "fulltext": {
        "type": "string"
      }
    }
  }
}
The default mapping is for terms search, so when a terms search is required you can simply query the "city" field. When you need full-text search, the query must be performed on "city.fulltext". Hope this helps.
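For example, a request that combines the multi-value filters with the free-text input might look like this (a sketch only; the values are made up, and it assumes category keeps its original not_analyzed mapping):
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "city.fulltext": "new york"
        }
      },
      "filter": {
        "terms": {
          "category": ["Cat2", "Cat22", "Cat6"]
        }
      }
    }
  }
}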
Full-text search won't work on not_analyzed fields and sorting won't work on analyzed fields.
You need to use multi-fields.
It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:
For example:
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        } ...
      }
    }
  }
}
Use the dot notation to sort by city.raw:
{
  "query": {
    "match": {
      "city": "york"
    }
  },
  "sort": {
    "city.raw": "asc"
  }
}

Elasticsearch autocomplete integer field

I am trying to implement an autocomplete feature on a numeric field (its actual type in ES is long).
I am using a jQuery UI Autocomplete widget on the client side, having its source function send a query to Elasticsearch with the prefix term to get a small number (say, 5) of autocomplete options.
The query I am using is something like the following:
{
  "size": 0,
  "query": {
    "prefix": {
      "myField": "<term>"
    }
  },
  "aggs": {
    "myAggregation": {
      "terms": {
        "field": "myField",
        "size": 5
      }
    }
  }
}
Such that if myField has the distinct values: [1, 15, 151, 21, 22], and term is 1, then I'd expect to get from ES the buckets with keys [1, 15, 151].
The problem is this does not seem to work with numeric fields. For the above example, I am getting a single bucket with the key 1, and if term is 15 I am getting a single bucket with key 15, i.e. it only returns exact matches. In contrast, it works perfectly for string typed fields.
I am guessing I need some special mapping for myField, but I'd prefer to have the mapping as general as possible, while having the autocomplete working with minimal changes to the mapping (just to note - the index I am querying might be a general one, external to my application, so I will be able to change the type/field mappings in it only if the new mapping is something general and standard).
What are my options here?
What I would do is create a string sub-field inside your integer field, like this:
{
  "myField": {
    "type": "integer",
    "fields": {
      "to_string": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
Then your query would need to be changed to the one below, i.e. run the prefix query on the string field but retrieve the terms aggregation from the integer field:
{
  "size": 0,
  "query": {
    "prefix": {
      "myField.to_string": "1"
    }
  },
  "aggs": {
    "myAggregation": {
      "terms": {
        "field": "myField",
        "size": 5
      }
    }
  }
}
Note that you can also create a completely independent field, not necessarily a sub-field; the key point is that one field needs the integer value to run the terms aggregation on, and the other field needs the string value to run the prefix query on.
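For what it's worth, on Elasticsearch 5.x and later the same idea would use a keyword sub-field rather than a not_analyzed string (a sketch, with the same hypothetical field name); the query stays the same, i.e. prefix on myField.to_string and the terms aggregation on myField:
{
  "myField": {
    "type": "integer",
    "fields": {
      "to_string": {
        "type": "keyword"
      }
    }
  }
}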

Excluding field from _source causes aggregation to not work

We're using Elasticsearch 1.7.2 and trying to use the "include/exclude from _source" feature as described here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
We have a field types that's 'pretty' and that we would like to return to the client, but it's not well suited to aggregations; and a field types_int (and also a types_string, but that's not relevant now) that's 'ugly' but optimized for search/aggregations, which we don't want to return to the client but do want to aggregate/filter on.
The field types_int doesn't need to be stored anywhere, it just needs to be indexed. We don't want to waste bandwidth in returning it to the client either, so we don't want to include it in _source.
The mapping for it looks like this:
"types_int": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"value_int": {
"type": "integer"
}
}
}
However, after we add the exclude, our filters/aggregations on it stop working.
The excludes section looks like this:
"_source": {
"excludes": [
"types_int"
]
}
Without that in the mapping, everything works fine.
An example of a filter:
POST my_index/my_type/_search
{
  "filter": {
    "nested": {
      "path": "types_int",
      "filter": {
        "term": {
          "types_int.name": "<something>"
        }
      }
    }
  }
}
Again, removing the excludes and everything works fine.
Thinking it might have something to do with nested types, since they're separate documents and perhaps handled differently from normal fields, I added an exclude for a 'normal' value field, and then that filter also stopped working.
"publication": {
"type": "string",
"index": "not_analyzed"
}
"_source": {
"excludes": [
"publication"
]
}
So my conclusion is that after you exclude something from _source, you can no longer filter on it? That doesn't make sense to me, so I'm thinking there's something we're doing wrong here. The _source include/exclude is just a post-processing step that manipulates the string data inside that field, right?
I understand that we can also use source filtering to exclude specific fields at query time, but we simply don't need to store the field at all. If anything, I would just like to understand why this doesn't work :)
