Split a field for a aggregates - elasticsearch

I have an field with a variety of (multi word) category tags that I'm trying to figure out how to get aggregates for. For any given document there may be one or more tags separated by | characters.
I have the following mapping for my field:
"category": {
"type": "keyword",
"fields": {
"raw" : {
"type": "keyword",
"index": "not_analyzed"
}
}
}
This works for storing the data, but when I try to get the aggregates, for example:
{"aggregates": {
"categories": {
"terms": {
"field": "category"
}
}
}}
It returns whatever the field contains. For example, if I have two documents with categories of
Facilities|Information Technology
Human Resources|Information Technology
I'd like to get back something like:
Information Technology: 2
Facilities: 1
Human Resources: 1
Any suggestions on what I need to do to either split the data as part of my mapping or aggregates query?

Related

How to index a field for both numeric and text search

I want to enable both numberic and full-text search for a field.
I need the two ways to search the field for different scenarios.
How can I index the field?
You can always make use of fields for such use cases. Lets say the field name is field1. Below is how you can define it for indexing it in different ways:
"field1": {
"type": "integer",
"fields": {
"textval": {
"type": "text"
},
"keyword": {
"type": "keyword"
}
}
}
Refer this for understanding more on fields.

Multiple mappings for one index in Elasticsearch 6

I have an index that contains documents of different types (not talking about _type here) and each document has a field document_type that states their type. Is it possible to define mappings for each type of document within this index?
Is it possible to define mappings for each type of document within this index?
No, if you think of using the same field name with different types. For instance, field name id of type string and integer won't work.
Having different document_type basically indicates different domains. What you could do is to group information under each respective domain or type. For instance, an employee and project, both have an id and name, but different types in this example. Some call that nesting.
An example index mapping:
PUT example
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"employee": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 64
}
}
}
}
},
"project": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "keyword",
"ignore_above": 32
}
}
}
}
}
}
}
If you write the information, with different types.
PUT example/doc/1
{
"employee": {
"id": 4711,
"name": "John Doe"
},
"project": {
"id": "Project X",
"name": "Firebrand"
}
}
Others would argue to store employee and project in separate indices. This approach depends on your scenario and is also desirable. You allow both domains to evolve separately from each other.
Having a separate employee and project index gives you an advantage regarding maintenance. For querying some would argue, that you can group than with an alias. In the above example, it doesn't make sense since the field types are different. A search for the name over an analysed text field is different than over a keyword. Querying makes sense if you have the same field type.
No, if you want to use a single index, you would need to define a single mapping that combines the fields of each document type.
A better way might be to define separate indices on the same cluster for each document type. You can then create a single index alias that aliases to both of those indices if you want to be able to query across document types. Be sure that all fields that exist in both documents have the same data type in both mappings.
Having a single field name with more than one mapping type in the same index is not possible. Two options I can think of:
1. Separate the different doc types to separate indices.
2. Use different fields names for different doc types, so that each name can have different mapping. You can also use nesting, like: type_a.my_field and type_b.my_field, both in the same index.

Elasticsearch Terms aggregation with unknown datatype

I'm indexing data of unknown schema in Elasticsearch using dynamic mapping, i.e. we don't know the shape, datatypes, etc. of much of the data ahead of time. In queries, I want to be able to aggregate on any field. Strings are (by default) mapped as both text and keyword types, and only the latter can be aggregated on. So for strings my terms aggregations must look like this:
"aggs": {
"something": {
"terms": {
"field": "something.keyword"
}
}
}
But other types like numbers and bools do not have this .keyword sub-field, so aggregations for those must look like this (which would fail for text fields):
"aggs": {
"something": {
"terms": {
"field": "something"
}
}
}
Is there any way to specify a terms aggregation that basically says "if something.keyword exists, use that, otherwise just use something", and without taking a significant performance hit?
Requiring datatype information to be provided at query time might be an option for me, but ideally I want to avoid it if possible.
If the primary use case is aggregations, it may be worth changing the dynamic mapping for string properties to index as a keyword datatype, with a multi-field sub-field indexed as a text datatype i.e. in dynamic_templates
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword",
"ignore_above": 256,
"fields": {
"text": {
"type": "text"
}
}
}
}
},

Request Body search in Elasticsearch

I am using Elasticsearch 5.4.1. Here is mapping:
{
"testi": {
"mappings": {
"testt": {
"properties": {
"last": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
When I use URI search I receive results. On the other hand during using Request Body search there is empty result in any case.
GET testi/testt/_search
{
"query" : {
"term" : { "name" : "John" }
}
}
Couple things going on here:
For both last and name, you are indexing the field itself as text and then a subfield as a keyword. Is that your intention? You want to be able to do analyzed/tokenized search on the raw field and then keyword search on the subfield?
If that is your intention, you now have two ways to query each of these fields. For example, name gives you the analyzed version of the field (you designed type text meaning Elasticsearch applied a standard analyzer on it and applied lowercase filter, some basic tokenizing and stemming, etc.) and name.keyword gives you the unaltered keyword version of this field
Therefore, your terms query expects your input string John to match on the field you're querying against. Since you used capitalization in your query input, you likely want to use the keyword subfield of name so try "term" : { "name.keyword" : "John" } instead.
As a light demonstration of what is happening to the original field, "term" : { "name.keyword" : "john" } should work as well
You are seeing results in _search because it is just executing a match_all. If you did pass a basic text parameter, it is executing against _all which is a concatenation of all the fields in each document, so both the keyword and text versions are available

Multiple Paths in Nested Queries

I'm cross-posting this from the elasticsearch forums (https://discuss.elastic.co/t/multiple-paths-in-nested-query/96851/1)
Below is an example, but first I’ll tell you about my use case, because I’m not sure if this is a good approach. I’m trying to automatically index a large collection of typed data. What this means is I’m trying to generate mappings and queries on those mappings all automatically based on information about my data. A lot of my data is relational, and I’m interested in being able to search accross the relations, thus I’m also interested in using Nested data types.
However, the issue is that many of these types have on the order of 10 relations, and I’ve got a feeling its not a good idea to pass 10 identical copies of a nested query to elasticsearch just to query 10 different nested paths the same way. Thus, I’m wondering if its possible to instead pass multiple paths into a single query? Better yet, if its possible to search over all fields in the current document and in all its nested documents and their fields in a single query. I’m aware of object fields, and they’re not a good fit because I want to retrive some data of matched nested documents.
In this example, I create an index with multiple nested types and some of its own types, upload a document, and attempt to query the document and all its nested documents, but fail. Is there some way to do this without duplicating the query for each nested document, or is that actually a performant way to do this? Thanks
PUT /my_index
{
"mappings": {
"type1" : {
"properties" : {
"obj1" : {
"type" : "nested",
"properties": {
"name": {
"type":"text"
},
"number": {
"type":"text"
}
}
},
"obj2" : {
"type" : "nested",
"properties": {
"color": {
"type":"text"
},
"food": {
"type":"text"
}
}
},
"lul":{
"type": "text"
},
"pucci":{
"type": "text"
}
}
}
}
}
PUT /my_index/type1/1
{
"obj1": [
{ "name":"liar", "number":"deer dog"},
{ "name":"one two three", "number":"you can call on me"},
{ "name":"ricky gervais", "number":"user 123"}
],
"obj2": [
{ "color":"red green blue", "food":"meatball and spaghetti"},
{ "color":"orange", "food":"pineapple, fish, goat"},
{ "color":"none", "food":"none"}
],
"lul": "lul its me user123",
"field": "one dog"
}
POST /my_index/_search
{
"query": {
"nested": {
"path": ["obj1", "obj2"],
"query": {
"query_string": {
"query": "ricky",
"all_fields": true
}
}
}
}
}

Resources