Create keyword string type with custom analyzer in 5.3.0 - elasticsearch

I have a string I'd like to index as keyword type but with a special comma analyzer:
For example:
"San Francisco, Boston, New York" -> "San Francisco", "Boston", "New York"
The field should be both indexed and aggregatable at the same time so that I can split it up into buckets. Before 5.0.0 the following worked:
Index settings:
{
'settings': {
'analysis': {
'tokenizer': {
'comma': {
'type': 'pattern',
'pattern': ','
}
},
'analyzer': {
'comma': {
'type': 'custom',
'tokenizer': 'comma'
}
}
},
},
}
with the following mapping:
{
'city': {
'type': 'string',
'analyzer': 'comma'
},
}
Now in 5.3.0 and above the analyzer is no longer a valid property for the keyword type, and my understanding is that I want a keyword type here. How do I specify an aggregatable, indexed, searchable text type with custom analyzer?

You can use multi-fields to index the same field in two different ways, one for searching and the other for aggregations.
I also suggest adding the trim and lowercase filters to the tokens the analyzer produces, which makes searching more forgiving.
Mappings
PUT commaindex2
{
"settings": {
"analysis": {
"tokenizer": {
"comma": {
"type": "pattern",
"pattern": ","
}
},
"analyzer": {
"comma": {
"type": "custom",
"tokenizer": "comma",
"filter": ["lowercase", "trim"]
}
}
}
},
"mappings": {
"city_document": {
"properties": {
"city": {
"type": "keyword",
"fields": {
"city_custom_analyzed": {
"type": "text",
"analyzer": "comma",
"fielddata": true
}
}
}
}
}
}
}
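To make the effect of this mapping concrete, here is a minimal Python sketch of what the two fields would hold for one value. It only approximates the analysis chain (split on the "," pattern, then lowercase and trim); the function names are made up for illustration and are not part of Elasticsearch.

```python
import re

def comma_analyzer(text):
    """Approximate the 'comma' analyzer: pattern tokenizer, then lowercase + trim."""
    tokens = re.split(",", text)                  # pattern tokenizer splits on ','
    return [t.lower().strip() for t in tokens]    # lowercase and trim filters

def keyword_field(text):
    """A keyword field indexes the whole value as a single, unmodified term."""
    return [text]

value = "San Francisco, Boston, New York"
print(comma_analyzer(value))   # terms seen by city.city_custom_analyzed
print(keyword_field(value))    # the single term seen by city
```

This is why a terms aggregation on city buckets by the whole string, while one on city.city_custom_analyzed buckets by individual city.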
Index Document
POST commaindex2/city_document
{
"city" : "san fransisco, new york, london"
}
Search Query
POST commaindex2/city_document/_search
{
"query": {
"bool": {
"must": [{
"term": {
"city.city_custom_analyzed": {
"value": "new york"
}
}
}]
}
},
"aggs": {
"terms_agg": {
"terms": {
"field": "city",
"size": 10
}
}
}
}
Note
If you want to aggregate on the individual tokens instead, for example to count documents per city in buckets, run the terms aggregation on the city.city_custom_analyzed field:
POST commaindex2/city_document/_search
{
"query": {
"bool": {
"must": [{
"term": {
"city.city_custom_analyzed": {
"value": "new york"
}
}
}]
}
},
"aggs": {
"terms_agg": {
"terms": {
"field": "city.city_custom_analyzed",
"size": 10
}
}
}
}
Hope this helps

Since you're using ES 5.3, I suggest a different approach, using an ingest pipeline to split your field at indexing time.
PUT _ingest/pipeline/city-splitter
{
"description": "City splitter",
"processors": [
{
"split": {
"field": "city",
"separator": ","
}
},
{
"foreach": {
"field": "city",
"processor": {
"trim": {
"field": "_ingest._value"
}
}
}
}
]
}
Then you can index a new document:
PUT cities/city/1?pipeline=city-splitter
{ "city" : "San Francisco, Boston, New York" }
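As a rough illustration of what the pipeline does to the document before it is indexed, the two processors can be mimicked in plain Python. This is only a sketch of the transformation; city_splitter is a hypothetical helper, not an Elasticsearch API.

```python
def city_splitter(doc):
    """Mimic the pipeline: split 'city' on ',' and trim every element."""
    out = dict(doc)
    parts = out["city"].split(",")              # split processor
    out["city"] = [p.strip() for p in parts]    # foreach + trim processors
    return out

print(city_splitter({"city": "San Francisco, Boston, New York"}))
```

The stored document then carries city as an array of three strings, which is the same shape you would get by splitting the value in the client.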
And finally you can search/sort on city and run an aggregation on the field city.keyword as if the cities had been split in your client application:
POST cities/_search
{
"query": {
"match": {
"city": "boston"
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
}
}
}
}

Related

Elasticsearch - Applying multi level filter on nested aggregation bucket?

I'm trying to get distinct nested objects by applying multiple filters.
Basically, in Elasticsearch I have cities as top-level documents, each containing nested citizens documents, which in turn contain nested pets documents.
I am trying to get all citizens that satisfy conditions on all three of these levels (cities, citizens and pets):
Give me all distinct citizens
that have age:"40",
that have pets "name":"Casper",
from cities with office_type="secondary"
I know that to filter 1st level I can use query condition, and then if I need to filter the nested citizens I can add a filter in the aggregation level.
I am using this article as an example: https://iridakos.com/tutorials/2018/10/22/elasticsearch-bucket-aggregations.html
Query working so far:
GET city_offices/_search
{
"size" : 10,
"query": {
"term" : { "office_type" : "secondary" }
},
"aggs": {
"citizens": {
"nested": {
"path": "citizens"
},
"aggs": {
"inner_agg": {
"filter": {
"term": { "citizens.age": "40" }
} ,
"aggs": {
"occupations": {
"terms": {
"field": "citizens.occupation"
}
}
}
}
}
}
}
}
BUT: How can I add the "pets" nested filter condition?
Mapping:
PUT city_offices
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"city": {
"type": "keyword"
},
"office_type": {
"type": "keyword"
},
"citizens": {
"type": "nested",
"properties": {
"occupation": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"pets": {
"type": "nested",
"properties": {
"kind": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"age": {
"type": "integer"
}
}
}
}
}
}
}
}
}
Index data:
PUT /city_offices/doc/1
{
"city":"Athens",
"office_type":"secondary",
"citizens":[
{
"occupation":"Statistician",
"age":30,
"pets":[
{
"kind":"Cat",
"name":"Phoebe",
"age":14
}
]
},
{
"occupation":"Librarian",
"age":30,
"pets":[
{
"kind":"Rabbit",
"name":"Nino",
"age":13
}
]
},
{
"occupation":"Librarian",
"age":40,
"pets":[
{
"kind":"Rabbit",
"name":"Nino",
"age":13
}
]
},
{
"occupation":"Statistician",
"age":40,
"pets":[
{
"kind":"Rabbit",
"name":"Casper",
"age":2
},
{
"kind":"Rabbit",
"name":"Nino",
"age":13
},
{
"kind":"Dog",
"name":"Nino",
"age":15
}
]
}
]
}
So I found a solution for this.
Basically I apply the top-level filters in the query section and then apply the rest of the conditions in the aggregations.
First I apply the citizens-level filter aggregation. Then I go into the nested pets and apply the pet filter. Finally I go back up to the citizens level (using reverse_nested with path citizens) and add the terms aggregation that generates the final buckets.
Query looks like this:
GET city_offices/_search
{
"size" : 10,
"query": {
"term" : { "office_type" : "secondary" }
},
"aggs": {
"citizens": {
"nested": {
"path": "citizens"
},
"aggs": {
"inner": {
"filter": {
"term": { "citizens.age": "40" }
} ,
"aggs": {
"occupations": {
"nested": {
"path": "citizens.pets"
},
"aggs": {
"inner_pets": {
"filter": {
"term": { "citizens.pets.name": "Casper" }
} ,
"aggs": {
"lll": {
"reverse_nested": {
"path": "citizens"
},
"aggs": {
"xxx": {
"terms": {
"field": "citizens.occupation",
"size": 10
}
}
}
}
}
}
}
}
}
}
}
}
}
}
The response bucket looks like this:
"xxx": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Librarian",
"doc_count": 1
},
{
"key": "Statistician",
"doc_count": 1
}
]
}
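For readers who find the nesting hard to follow, the same sequence of steps can be written as ordinary loops. This is a sketch of the intended logic only, run on a small hypothetical data set (not the document indexed above), and it does not model how Elasticsearch executes the aggregation.

```python
from collections import Counter

def occupations_with_pet(city_docs, office_type, age, pet_name):
    """Count occupations of citizens that pass all three levels of filtering."""
    counts = Counter()
    for city in city_docs:
        if city["office_type"] != office_type:       # top-level query
            continue
        for citizen in city["citizens"]:             # nested: citizens
            if citizen["age"] != age:                # filter on citizens.age
                continue
            # nested: citizens.pets, then filter on citizens.pets.name
            matching_pets = [p for p in citizen["pets"] if p["name"] == pet_name]
            if matching_pets:                        # reverse_nested: back to the citizen
                counts[citizen["occupation"]] += 1   # terms on citizens.occupation
    return counts

# Hypothetical sample data, for illustration only.
docs = [{
    "office_type": "secondary",
    "citizens": [
        {"occupation": "Librarian", "age": 40, "pets": [{"name": "Casper"}]},
        {"occupation": "Statistician", "age": 40,
         "pets": [{"name": "Casper"}, {"name": "Nino"}]},
        {"occupation": "Statistician", "age": 30, "pets": [{"name": "Casper"}]},
        {"occupation": "Librarian", "age": 40, "pets": [{"name": "Nino"}]},
    ],
}]

print(occupations_with_pet(docs, "secondary", 40, "Casper"))
```

Each citizen is counted once no matter how many of their pets match, which is the point of stepping back up with reverse_nested before the terms aggregation.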
Any other suggestions?

Distinct values from array-field matching filter in Elasticsearch 2.4

In short: I want to look up the distinct values of a document field, BUT only those values that match some filter. The problem arises with array fields.
Imagine there are following documents in ES 2.4:
[
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
I'd like my users to be able to lookup all possible states via typeahead, so I have the following query for the "wa" user request:
{
"query": {
"wildcard": {
"states.raw": "*wa*"
}
},
"aggregations": {
"typed": {
"terms": {
"field": "states.raw"
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
states.raw is a sub-field with not_analyzed option
This query works pretty well unless I have an array of values like in the example - it returns both Washington and California. I do understand why it happens (query and aggregations are working on top of the document and the document contains both, even though only one option matched the filter), but I really want to only see Washington and don't want to add another layer of filtering on the application side for the ES results.
Is there a way to do this in a single ES 2.4 request?
You could use the "Filtering Values" feature (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2).
So, your request could look like:
POST /index/collection/_search?size=0
{
"aggregations": {
"typed": {
"terms": {
"field": "states.raw",
"include": ".*wa.*" // You need to carefully quote the "wa" string because it'll be used as part of RegExp
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
I can't resist pointing out, though, that a wildcard query with a leading wildcard is not the best solution, since it tends to be slow. Please consider using ngrams for this instead:
PUT states
{
"settings": {
"analysis": {
"filter": {
"ngrams": {
"type": "nGram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"ngrams"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"states": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"ngrams": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
}
}
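To see why a plain term query can replace the wildcard once this analyzer is in place, here is a small Python approximation of the terms it would index for one value. The word splitting below is a simplification of the standard tokenizer (an assumption for illustration), and ngram_analyzer is a hypothetical helper name.

```python
import re

def ngram_analyzer(text, min_gram=2, max_gram=20):
    """Approximate: split into words, lowercase, emit all 2..20 character ngrams."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    grams = set()
    for w in words:
        for n in range(min_gram, min(max_gram, len(w)) + 1):
            for i in range(len(w) - n + 1):
                grams.add(w[i:i + n])
    return grams

grams = ngram_analyzer("Washington (US-WA)")
print("sh" in grams, "wa" in grams, "ca" in grams)
```

Every substring the user might type is already a term in the index, so the lookup is an exact term match rather than a scan.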
POST states/doc/1
{
"text":"bla1",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
}
POST states/doc/2
{
"text":"bla2",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
}
]
}
POST states/doc/3
{
"text":"bla3",
"location": [
{
"states": [
"California (US-CA)"
]
},
{
"states": [
"Illinois (US-IL)"
]
}
]
}
And the final query:
GET states/_search
{
"query": {
"term": {
"location.states.ngrams": {
"value": "sh"
}
}
},
"aggregations": {
"filtering_states": {
"terms": {
"field": "location.states.raw",
"include": ".*sh.*"
},
"aggs": {
"typed_hits": {
"top_hits": {
"_source": {
"includes": [
"location.states"
]
}
}
}
}
}
}
}
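The effect of the include option on the buckets can be sketched in Python as follows. The sketch assumes the query has already selected documents 1 and 2 from the sample data (the ones containing Washington), and it uses re.fullmatch because Lucene regular expressions always have to match the whole term. terms_agg is a hypothetical helper, not an Elasticsearch call.

```python
import re
from collections import Counter

def terms_agg(docs, include=None):
    """Count documents per term, optionally keeping only terms matching 'include'."""
    pattern = re.compile(include) if include else None
    counts = Counter()
    for values in docs:
        for term in set(values):                 # a term counts once per document
            if pattern is None or pattern.fullmatch(term):
                counts[term] += 1
    return counts

# location.states.raw values of the documents matched by the query (docs 1 and 2)
matching_docs = [
    ["Washington (US-WA)", "California (US-CA)", "Washington (US-WA)"],
    ["Washington (US-WA)", "California (US-CA)"],
]

print(terms_agg(matching_docs))                    # California leaks into the buckets
print(terms_agg(matching_docs, include=".*sh.*"))  # only Washington remains
```

The query decides which documents are aggregated, while include decides which of their terms become buckets, so both are needed for array fields.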

Facet by objects(tags) in an array

I am running into a query problem with Elasticsearch.
We have objects that look like this:
{
"id":"1234",
"tags":[
{ "tagName": "T1", "tagValue":"V1"},
{ "tagName": "T2", "tagValue":"V2"},
{ "tagName": "T3", "tagValue":"V3"}
]
}
{
"id":"5678",
"tags":[
{ "tagName": "T1", "tagValue":"X1"},
{ "tagName": "T2", "tagValue":"X2"}
]
}
And I would like to get a list of tagValues for tagName=T1, which is "V1" and "X1".
I tried
{
"filter": {
"bool": {
"must": [
{
"term":{
"tags.tagName": "T1"
}
}
]
}
},
"facets": {
"TagValues":{
"filter": {
"term": {
"tags.tagName": "T1"
}
},
"terms": {
"field": "tags.tagValue",
"size": 30
}
}
}
}
It seems like it's returning all tagValues from all tags "T1", "T2", and "T3".
Can someone please help me with this query? How can I get a faceted list for objects that are in an array?
Any help would be appreciated.
Thank you,
The main idea is to use the nested type for your tags field. Here is the mapping you should use:
curl -XPUT localhost:9200/mytags -d '{
"mappings": {
"mytag": {
"properties": {
"id": {
"type": "string"
},
"tags": {
"type": "nested",
"properties": {
"tagName": {
"type": "string",
"index": "not_analyzed"
},
"tagValue": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
Then reindex your data and run a query like the one below. It first keeps only the documents that contain a tag whose tagName is T1. Then, using aggregations (don't use facets anymore, as they are deprecated), it again selects only the tags whose tagName is T1 and retrieves the associated tagValue fields. This gives you the expected V1 and X1 values.
curl -XPOST localhost:9200/mytags/mytag/_search -d '{
"size": 0,
"query": {
"filtered": {
"filter": {
"nested": {
"path": "tags",
"query": {
"term": {
"tags.tagName": "T1"
}
}
}
}
}
},
"aggs": {
"tags": {
"nested": {
"path": "tags"
},
"aggs": {
"values": {
"filter": {
"term": {
"tags.tagName": "T1"
}
},
"aggs": {
"values": {
"terms": {
"field": "tags.tagValue"
}
}
}
}
}
}
}
}'
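The difference between the original behaviour and the nested one can be illustrated with a short Python sketch over the two sample objects from the question. It models the default object mapping as flattened, independent lists of tagName and tagValue per document, and the nested mapping as pairs that stay together; both function names are made up for this example.

```python
docs = [
    {"id": "1234", "tags": [
        {"tagName": "T1", "tagValue": "V1"},
        {"tagName": "T2", "tagValue": "V2"},
        {"tagName": "T3", "tagValue": "V3"},
    ]},
    {"id": "5678", "tags": [
        {"tagName": "T1", "tagValue": "X1"},
        {"tagName": "T2", "tagValue": "X2"},
    ]},
]

def flattened_values(docs, tag_name):
    """Default object arrays: name and value lists lose their pairing."""
    out = []
    for d in docs:
        names = [t["tagName"] for t in d["tags"]]
        values = [t["tagValue"] for t in d["tags"]]
        if tag_name in names:        # the document matches as a whole...
            out.extend(values)       # ...so every one of its values is counted
    return sorted(out)

def nested_values(docs, tag_name):
    """Nested type: each tag is matched on its own, so the pairing survives."""
    return sorted(t["tagValue"] for d in docs for t in d["tags"]
                  if t["tagName"] == tag_name)

print(flattened_values(docs, "T1"))
print(nested_values(docs, "T1"))
```

The first call reproduces the problem described in the question, and the second returns only the values paired with T1.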

Elastic Search Sum aggregation with group by and where condition

I am a newbie to Elasticsearch.
We are currently moving our code from a relational DB to Elasticsearch, so we are converting our queries to the Elasticsearch query format.
I am looking for the Elasticsearch equivalent of the query below:
SELECT Color, SUM(ListPrice), SUM(StandardCost)
FROM Production.Product
WHERE Color IS NOT NULL
AND ListPrice != 0.00
AND Name LIKE 'Mountain%'
GROUP BY Color
Can someone provide an example of an Elasticsearch query for the above?
You'd have a products index containing documents of type product, whose mapping could look like this, based on your query above:
curl -XPUT localhost:9200/products -d '
{
"mappings": {
"product": {
"properties": {
"Color": {
"type": "string"
},
"Name": {
"type": "string"
},
"ListPrice": {
"type": "double"
},
"StandardCost": {
"type": "double"
}
}
}
}
}'
Then the ES query equivalent to the SQL one you gave above would look like this:
{
"query": {
"filtered": {
"query": {
"query_string": {
"default_field": "Name",
"query": "Mountain*"
}
},
"filter": {
"bool": {
"must_not": [
{
"missing": {
"field": "Color"
}
},
{
"term": {
"ListPrice": 0
}
}
]
}
}
}
},
"aggs": {
"by_color": {
"terms": {
"field": "Color"
},
"aggs": {
"total_price": {
"sum": {
"field": "ListPrice"
}
},
"total_cost": {
"sum": {
"field": "StandardCost"
}
}
}
}
}
}
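As a way to check results against expectations, the SQL semantics can be reproduced in a few lines of Python. This follows the SQL statement (Name starts with "Mountain", Color not null, ListPrice not zero), not the exact matching behaviour of query_string on an analyzed field, and the product rows are hypothetical.

```python
from collections import defaultdict

def sum_by_color(products):
    """GROUP BY Color with SUM(ListPrice) and SUM(StandardCost), after the WHERE filters."""
    totals = defaultdict(lambda: {"total_price": 0.0, "total_cost": 0.0})
    for p in products:
        if p.get("Color") is None:                  # WHERE Color IS NOT NULL
            continue
        if p["ListPrice"] == 0:                     # AND ListPrice != 0.00
            continue
        if not p["Name"].startswith("Mountain"):    # AND Name LIKE 'Mountain%'
            continue
        bucket = totals[p["Color"]]
        bucket["total_price"] += p["ListPrice"]
        bucket["total_cost"] += p["StandardCost"]
    return dict(totals)

# Hypothetical rows, for illustration only.
products = [
    {"Name": "Mountain-100", "Color": "Black", "ListPrice": 100.0, "StandardCost": 60.0},
    {"Name": "Mountain-200", "Color": "Black", "ListPrice": 200.0, "StandardCost": 120.0},
    {"Name": "Mountain-300", "Color": "Silver", "ListPrice": 50.0, "StandardCost": 30.0},
    {"Name": "Road-150", "Color": "Black", "ListPrice": 75.0, "StandardCost": 40.0},
    {"Name": "Mountain Tire", "Color": None, "ListPrice": 25.0, "StandardCost": 10.0},
    {"Name": "Mountain Pump", "Color": "Silver", "ListPrice": 0.0, "StandardCost": 5.0},
]

print(sum_by_color(products))
```

The by_color terms aggregation plays the role of GROUP BY, and the two sum sub-aggregations correspond to total_price and total_cost.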

ElasticSearch and NEST - How do I construct a simple OR query?

I'm developing a building repository query.
Here is the query that I am trying to write.
(Exact match on zipCode) AND ((Case-insensitive exact
match on address1) OR (Case-insensitive exact match on siteName))
In my repository, I have a document that looks like the following:
address1: 4 Myrtle Street
siteName: Myrtle Street
zipCode: 90210
And I keep getting matches on:
address1: 45 Myrtle Street
siteName: Myrtle
zipCode: 90210
Here are some attempts that have not worked:
{
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"term": {
"address1": {
"value": "45 myrtle street"
}
}
},
{
"term": {
"siteName": {
"value": "myrtle"
}
}
}
]
}
},
{
"term": {
"zipCode": {
"value": "90210"
}
}
}
]
}
}
}
{
"query": {
"filtered": {
"query": {
"term": {
"zipCode": {
"value": "90210"
}
}
},
"filter": {
"or": {
"filters": [
{
"term": {
"address1": "45 myrtle street"
}
},
{
"term": {
"siteName": "myrtle"
}
}
]
}
}
}
}
}
{
"filter": {
"bool": {
"must": [
{
"or": {
"filters": [
{
"term": {
"address1": "45 myrtle street"
}
},
{
"term": {
"siteName": "myrtle"
}
}
]
}
},
{
"term": {
"zipCode": "90210"
}
}
]
}
}
}
{
"query": {
"bool": {
"must": [
{
"span_or": {
"clauses": [
{
"span_term": {
"siteName": {
"value": "myrtle"
}
}
}
]
}
},
{
"term": {
"zipCode": {
"value": "90210"
}
}
}
]
}
}
}
Does anyone know what I am doing wrong?
I'm writing this with NEST, so I would prefer NEST syntax, but ElasticSearch syntax would certainly suffice as well.
EDIT: Per @Greg Marzouka's comment, here are the mappings:
{
[indexname]: {
"mappings": {
"[indexname]elasticsearchresponse": {
"properties": {
"address": {
"type": "string"
},
"address1": {
"type": "string"
},
"address2": {
"type": "string"
},
"address3": {
"type": "string"
},
"city": {
"type": "string"
},
"country": {
"type": "string"
},
"id": {
"type": "string"
},
"originalSourceId": {
"type": "string"
},
"placeId": {
"type": "string"
},
"siteName": {
"type": "string"
},
"siteType": {
"type": "string"
},
"state": {
"type": "string"
},
"systemId": {
"type": "long"
},
"zipCode": {
"type": "string"
}
}
}
}
}
}
Based on your mapping, you won't be able to search for exact matches on siteName because it's being analyzed with the standard analyzer, which is more tuned for full text search. This is the default analyzer that is applied by Elasticsearch when one isn't explicitly defined on a field.
The standard analyzer is breaking up the value of siteName into multiple tokens. For example, Myrtle Street is tokenized and stored as two separate terms in your index, myrtle and street, which is why your query is matching that document. For a case-insensitive exact match, instead you want Myrtle Street stored as a single, lower-cased token in your index: myrtle street.
You could set siteName to not_analyzed, which means the field is not run through the analysis chain at all and its values are not modified. This will produce a single Myrtle Street token, which works for exact matches but is case-sensitive.
What you need to do is create a custom analyzer using the keyword tokenizer and lowercase token filter, then apply it to your field.
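The following Python sketch shows why the term queries in the question behave the way they do. It uses a crude stand-in for the standard analyzer (split into words, lowercase) and models a term query as an exact lookup among the indexed terms; all function names are hypothetical.

```python
import re

def standard_like(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [w.lower() for w in re.findall(r"\w+", text)]

def keyword_lowercase(text):
    """keyword tokenizer + lowercase filter: the whole value as one lowercased term."""
    return [text.lower()]

def term_matches(indexed_terms, query_term):
    """A term query looks for the exact query term among the indexed terms."""
    return query_term in indexed_terms

site = "Myrtle Street"
print(term_matches(standard_like(site), "myrtle"))             # unwanted partial match
print(term_matches(keyword_lowercase(site), "myrtle"))         # no longer matches
print(term_matches(keyword_lowercase(site), "myrtle street"))  # exact, case-insensitive
```

With the custom analyzer, the query value has to be lowercased on the client side when using a term query, because term queries are not analyzed.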
Here's how you can accomplish this with NEST's fluent API:
// Create the custom analyzer using the keyword tokenizer and lowercase token filter
var myAnalyzer = new CustomAnalyzer
{
Tokenizer = "keyword",
Filter = new [] { "lowercase" }
};
var response = this.Client.CreateIndex("your-index-name", c => c
// Add the custom analyzer to your index settings
.Analysis(an => an
.Analyzers(az => az
.Add("my_analyzer", myAnalyzer)
)
)
// Create the mapping for your type and apply "my_analyzer" to the siteName field
.AddMapping<YourType>(m => m
.MapFromAttributes()
.Properties(ps => ps
.String(s => s.Name(t => t.SiteName).Analyzer("my_analyzer"))
)
)
);
