OpenSearch / ElasticSearch index mappings

I have a system that ingests multiple scores for events, and we use OpenSearch (previously Elasticsearch) to compute the averages.
For example, an input would be similar to:
// event 1
{
  id: "foo1",
  timestamp: "some-iso8601-timestamp",
  scores: [
    { name: "arbitrary-name-1", value: 80 },
    { name: "arbitrary-name-2", value: 55 },
    { name: "arbitrary-name-3", value: 30 },
  ]
}
// event 2
{
  id: "foo2",
  timestamp: "some-iso8601-timestamp",
  scores: [
    { name: "arbitrary-name-1", value: 90 },
    { name: "arbitrary-name-2", value: 65 },
    { name: "arbitrary-name-3", value: 40 },
  ]
}
The score names are arbitrary and subject to change from time to time.
We would ultimately like to query the data to get the average score values:
[
  { name: "arbitrary-name-1", value: 85 },
  { name: "arbitrary-name-2", value: 60 },
  { name: "arbitrary-name-3", value: 35 },
]
However, the only way we have been able to achieve this so far has been to insert multiple documents, one for each score name/value pair in each event. This seems wasteful. The current search groups the documents by score name and timestamp interval, then performs a weighted average of the scores in each bucket.
Is there a way the data can be inserted to allow this query pattern to take place by only adding one document into opensearch per event/record (rather than one document per score per event/record)? How might that look?
Thanks!

Is this what you were trying to do? I got a bit confused. ^^
DELETE /71397606
PUT /71397606
{
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "scores": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "value": {
            "type": "long"
          }
        }
      },
      "timestamp": {
        "type": "text"
      }
    }
  }
}
POST /_bulk
{"index":{"_index":"71397606"}}
{"id":"foo1","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":80},{"name":"arbitrary-name-2","value":55},{"name":"arbitrary-name-3","value":30}]}
{"index":{"_index":"71397606"}}
{"id":"foo2","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":90},{"name":"arbitrary-name-2","value":65},{"name":"arbitrary-name-3","value":40}]}
{"index":{"_index":"71397606"}}
{"id":"foo2","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":85},{"name":"arbitrary-name-x","value":65},{"name":"arbitrary-name-y","value":40}]}
GET /71397606/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "nested": {
      "nested": {
        "path": "scores"
      },
      "aggs": {
        "pername": {
          "terms": {
            "field": "scores.name",
            "size": 10
          },
          "aggs": {
            "avg": {
              "avg": {
                "field": "scores.value"
              }
            }
          }
        }
      }
    }
  }
}
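For the "timestamp intervals" part of the question, here is a sketch (not part of the original answer): map timestamp as date rather than text, assuming the real values are valid ISO 8601, and wrap the nested aggregation in a date_histogram. The 1h interval is a placeholder; if each score also carried a weight field, a weighted_avg aggregation (with value and weight fields) could replace the plain avg.
"timestamp": {
  "type": "date"
}
GET /71397606/_search
{
  "size": 0,
  "aggs": {
    "per_interval": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h"
      },
      "aggs": {
        "scores": {
          "nested": {
            "path": "scores"
          },
          "aggs": {
            "per_name": {
              "terms": {
                "field": "scores.name",
                "size": 10
              },
              "aggs": {
                "avg_value": {
                  "avg": {
                    "field": "scores.value"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}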
PS: if not, could you give an example?

Related

Distinct records with geo_distance sort on aggregation ES

I'm working on a nearby API using Elasticsearch.
I'm trying to run 4 actions in an ES query:
1. a match condition (here running a script to get records within a radius)
2. get distinct records based on the company key (I want to get one record per company)
3. sort records based on geo_distance
4. add a distance field to get the distance between the user and the location
Here is my code:
const query = {
  query: {
    bool: {
      must: [
        customQuery,
        {
          term: {
            "schedule.isShopOpen": true,
          },
        },
        {
          term: {
            isBranchAvailable: true,
          },
        },
        {
          term: {
            branchStatus: "active",
          },
        },
        {
          match: {
            shopStatus: "active",
          },
        },
        {
          script: {
            script: {
              params: {
                lat: parseFloat(req.lat),
                lon: parseFloat(req.lon),
              },
              source:
                "doc['location'].arcDistance(params.lat, params.lon) / 1000 <= doc['searchRadius'].value",
              lang: "painless",
            },
          },
        },
      ],
    },
  },
  aggs: {
    duplicateCount: {
      terms: {
        field: "companyKey",
        size: 10000,
      },
      aggs: {
        duplicateDocuments: {
          top_hits: {
            sort: [
              {
                _geo_distance: {
                  location: {
                    lat: parseFloat(req.lat),
                    lon: parseFloat(req.lon),
                  },
                  order: "asc",
                  unit: "km",
                  mode: "min",
                  distance_type: "arc",
                  ignore_unmapped: true,
                },
              },
            ],
            script_fields: {
              distance: {
                script: {
                  params: {
                    lat: parseFloat(req.lat),
                    lon: parseFloat(req.lon),
                  },
                  inline: `doc['location'].arcDistance(params.lat, params.lon)/1000`,
                },
              },
            },
            stored_fields: ["_source"],
            size: 1,
          },
        },
      },
    },
  },
};
Here's the output:
data: [
  {
    companyKey: "1234",
    companyName: "Floward",
    branchKey: "3425234",
    branch: "Mursilat",
    distance: 1.810064121687324,
  },
  {
    companyKey: "0978",
    companyName: "Dkhoon",
    branchKey: "352345",
    branch: "Wahah blue branch ",
    distance: 0.08931851500047634,
  },
  {
    companyKey: "567675",
    companyName: "Abdulaziz test",
    branchKey: "53425",
    branch: "Jj",
    distance: 0.011447273197846672,
  },
  {
    companyKey: "56756",
    companyName: "Mouj",
    branchKey: "345345",
    branch: "King fahad",
    distance: 5.822936713752124,
  },
];
I have two issues:
1. How do I sort records based on geo_distance?
2. Will the query actions (match, script) apply to the aggregation data?
Can you please help me solve these issues?
This would be a more appropriate query for your use case:
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "200km",
            "distance_type": "arc",
            "location": {
              "lat": 40,
              "lon": -70
            }
          }
        },
        {
          "match": {
            "shopStatus": "active"
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "companyKey"
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat": 40,
          "lon": -70
        },
        "order": "asc",
        "unit": "km",
        "mode": "min",
        "distance_type": "arc",
        "ignore_unmapped": true
      }
    }
  ],
  "_source": ["*"],
  "script_fields": {
    "distance_in_m": {
      "script": "doc['location'].arcDistance(40, -70)" // convert to unit required
    }
  }
}
filter instead of must: since you are just filtering documents, filter will be faster, as it does not score documents the way must does.
collapse: you can use the collapse parameter to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.
geo_distance query instead of a script: to find documents within a distance.
script field to get the distance, as sketched below.
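For the last point, a minimal sketch of a parameterized script field that returns kilometers, dividing by 1000 just as the question's own script does (the field name distance_in_km and the coordinates are placeholders):
"script_fields": {
  "distance_in_km": {
    "script": {
      "lang": "painless",
      "params": {
        "lat": 40,
        "lon": -70
      },
      "source": "doc['location'].arcDistance(params.lat, params.lon) / 1000"
    }
  }
}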

ElasticSearch aggregation query with List in documents

I have the following records of car sales of different brands in different cities.
Document 1:
{
  "city": "Delhi",
  "cars": [{
    "name": "Toyota",
    "purchase": 100,
    "sold": 80
  }, {
    "name": "Honda",
    "purchase": 200,
    "sold": 150
  }]
}
Document 2:
{
  "city": "Delhi",
  "cars": [{
    "name": "Toyota",
    "purchase": 50,
    "sold": 40
  }, {
    "name": "Honda",
    "purchase": 150,
    "sold": 120
  }]
}
I am trying to come up with a query to aggregate car statistics for a given city but am not getting the right query.
Required result:
{
  "city": "Delhi",
  "cars": [{
    "name": "Toyota",
    "purchase": 150,
    "sold": 120
  }, {
    "name": "Honda",
    "purchase": 350,
    "sold": 270
  }]
}
First you need to map your array as a nested field (a script would be complicated and not performant). Nested fields are indexed, so the aggregation will be pretty fast.
Remove your index or create a new one. Please note I use test as the type.
{
  "mappings": {
    "test": {
      "properties": {
        "city": {
          "type": "keyword"
        },
        "cars": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "keyword"
            },
            "purchase": {
              "type": "integer"
            },
            "sold": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}
Index your documents (the same way you did).
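For completeness, a bulk sketch using the two sample documents; the index name my_index is an assumption, and _type matches the test type from the mapping above, as required on pre-7.x clusters:
POST my_index/_bulk
{"index":{"_index":"my_index","_type":"test"}}
{"city":"Delhi","cars":[{"name":"Toyota","purchase":100,"sold":80},{"name":"Honda","purchase":200,"sold":150}]}
{"index":{"_index":"my_index","_type":"test"}}
{"city":"Delhi","cars":[{"name":"Toyota","purchase":50,"sold":40},{"name":"Honda","purchase":150,"sold":120}]}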
For the aggregation:
{
  "size": 0,
  "aggs": {
    "avg_grade": {
      "terms": {
        "field": "city"
      },
      "aggs": {
        "resellers": {
          "nested": {
            "path": "cars"
          },
          "aggs": {
            "agg_name": {
              "terms": {
                "field": "cars.name"
              },
              "aggs": {
                "avg_pur": {
                  "sum": {
                    "field": "cars.purchase"
                  }
                },
                "avg_sold": {
                  "sum": {
                    "field": "cars.sold"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
result:
"buckets": [
  {
    "key": "Honda",
    "doc_count": 2,
    "avg_pur": {
      "value": 350
    },
    "avg_sold": {
      "value": 270
    }
  },
  {
    "key": "Toyota",
    "doc_count": 2,
    "avg_pur": {
      "value": 150
    },
    "avg_sold": {
      "value": 120
    }
  }
]
If you have indexed the name / city fields as text (ask yourself first whether this is necessary), use the .keyword subfield in the terms aggregation ("cars.name.keyword"), as sketched below.
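That variant of the inner terms aggregation would look like this, assuming name was indexed as text with a keyword subfield:
"agg_name": {
  "terms": {
    "field": "cars.name.keyword"
  }
}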

Terms Aggregation excluding last vowels using the Spanish analyzer - Elasticsearch 6.4

I am trying to get keywords from a bunch of tweets in Spanish. The thing is that when I get the results, the last vowel of most words in the response is removed. Any idea why this is happening?
The data are clean tweets extracted from Twitter in Spanish.
Here is the query:
{
  "query": {
    "bool": {
      "must": {
        "terms": {
          "full_text_sentiment": ["positive"]
        }
      },
      "filter": {
        "range": {
          "created_at": {
            "gte": greaterThanTime,
            "lte": lessThanTime
          }
        }
      }
    }
  },
  "aggs": {
    "keywords": {
      "terms": { "field": "full_text_clean", "size": 10 }
    }
  }
}
The mapping is the following for the field:
"full_text_clean": {
"type": "text",
"analyzer": "spanish",
"fielddata": true,
"fielddata_frequency_filter": {
"min": 0.1,
"max": 1.0,
"min_segment_size": 10
},
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 512
}
}
}
And these are the buckets in the response:
[ { key: 'aquí', doc_count: 3 },
  { key: 'deport', doc_count: 3 },
  { key: 'informacion', doc_count: 3 },
  { key: '23', doc_count: 2 },
  { key: 'corazon', doc_count: 2 },
  { key: 'dios', doc_count: 2 },
  { key: 'mexic', doc_count: 2 },
  { key: 'mujer', doc_count: 2 },
  { key: 'quier', doc_count: 2 },
  { key: 'siempr', doc_count: 2 } ]
where "deport", should be "deporte", "mexic" should be "mexico", "quier" should be "quiero" etc.
Any idea what is happening?
Thank you!
Hello, the spanish analyzer (reference here) contains a stemming token filter. It is this stemmer that reduces words to their root, and in doing so it generally removes some characters at the end of words.
More information about stemming here.
To avoid this behavior you will need to create a custom analyzer without stemming.
You can use the example from the documentation and just remove the spanish_stemmer filter, as sketched below.
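A sketch of such an analyzer, based on the documentation's recipe for rebuilding the spanish analyzer but with the spanish_stemmer filter left out; the index and analyzer names are placeholders, and full_text_clean would then be reindexed with this analyzer:
PUT /tweets_no_stem
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        }
      },
      "analyzer": {
        "spanish_no_stem": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop"
          ]
        }
      }
    }
  }
}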

Filter document on items in an array ElasticSearch

I am using ElasticSearch to search through documents. However, I need to make sure the current user is able to see those documents. Each document is tied to a community, to which the user may belong.
Here is the mapping for my Document:
export const mapping = {
  properties: {
    amazonId: { type: 'text' },
    title: { type: 'text' },
    subtitle: { type: 'text' },
    description: { type: 'text' },
    createdAt: { type: 'date' },
    updatedAt: { type: 'date' },
    published: { type: 'boolean' },
    communities: { type: 'nested' }
  }
}
I'm currently saving the ids of the communities the document belongs to in an array of strings. Ex: ["edd05cd0-0a49-4676-86f4-2db913235371", "672916cf-ee32-4bed-a60f-9a7c08dba04b"]
Currently, when I filter a query with {term: { communities: community.id } }, it returns all the documents, regardless of the communities it's tied to.
Here's the full query:
{
  index: 'document',
  filter_path: { filter: { term: { communities: community.id } } },
  body: {
    sort: [{ createdAt: { order: 'asc' } }]
  }
}
The following is the result for the community id "b7d28e7f-7534-406a-981e-ddf147b5015a". NOTE: this is a return from my GraphQL layer, so the communities on the documents are full objects after resolving the hits from the ES query.
"hits": [
{
"title": "The One True Document",
"communities": [
{
"id": "edd05cd0-0a49-4676-86f4-2db913235371"
},
{
"id": "672916cf-ee32-4bed-a60f-9a7c08dba04b"
}
]
},
{
"title": "Boring Document 1",
"communities": []
},
{
"title": "Boring Document 2",
"communities": []
},
{
"title": "Unpublished",
"communities": [
{
"id": "672916cf-ee32-4bed-a60f-9a7c08dba04b"
}
]
}
]
When I attempt to map the communities as {type: 'keyword', index: 'not_analyzed'} I receive an error that states [illegal_argument_exception] Could not convert [communities.index] to boolean.
So do I need to change my mapping, my filter, or both? Searching around the docs for 6.6, I see that terms needs a non-analyzed mapping.
UPDATE --------------------------
I updated the communities mapping to be a keyword as suggested below. However, I still received the same result.
I updated my query to the following (using a community id that has documents):
query: {
  index: 'document',
  body: {
    sort: [{ createdAt: { order: 'asc' } }],
    from: 0,
    size: 5,
    query: {
      bool: {
        filter: {
          term: { communities: '672916cf-ee32-4bed-a60f-9a7c08dba04b' }
        }
      }
    }
  }
}
Which gives me the following results:
{
  "data": {
    "communities": [
      {
        "id": "672916cf-ee32-4bed-a60f-9a7c08dba04b",
        "feed": {
          "documents": {
            "hits": []
          }
        }
      }
    ]
  }
}
Appears that my filter is working too well?
Since you are storing ids of communities, you should make sure the ids don't get analyzed; for this, communities should be of type keyword. Second, you want to store an array of community ids, since a user can belong to multiple communities. To do this you don't need to make it of type nested; nested has an altogether different use case.
To store values as an array, you need to make sure that while indexing you always pass the values against the field as an array, even if there is only a single value.
You need to change the mapping and the way you are indexing values against the field communities.
1. Update mapping as below:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "amazonId": {
          "type": "text"
        },
        "title": {
          "type": "text"
        },
        "subtitle": {
          "type": "text"
        },
        "description": {
          "type": "text"
        },
        "createdAt": {
          "type": "date"
        },
        "updatedAt": {
          "type": "date"
        },
        "published": {
          "type": "boolean"
        },
        "communities": {
          "type": "keyword"
        }
      }
    }
  }
}
2. Adding a document to the index:
PUT my_index/_doc/1
{
  "title": "The One True Document",
  "communities": [
    "edd05cd0-0a49-4676-86f4-2db913235371",
    "672916cf-ee32-4bed-a60f-9a7c08dba04b"
  ]
}
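And, to illustrate the "always pass an array" point above, a sketch of a second document with a single community; the title is taken from the question's sample data, and the document id 2 is a placeholder:
PUT my_index/_doc/2
{
  "title": "Unpublished",
  "communities": [
    "672916cf-ee32-4bed-a60f-9a7c08dba04b"
  ]
}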
3. Filtering by community id:
GET my_index/_doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "communities": "672916cf-ee32-4bed-a60f-9a7c08dba04b"
          }
        }
      ]
    }
  }
}
Nested Field approach
1. Mapping:
PUT my_index_2
{
  "mappings": {
    "_doc": {
      "properties": {
        "amazonId": {
          "type": "text"
        },
        "title": {
          "type": "text"
        },
        "subtitle": {
          "type": "text"
        },
        "description": {
          "type": "text"
        },
        "createdAt": {
          "type": "date"
        },
        "updatedAt": {
          "type": "date"
        },
        "published": {
          "type": "boolean"
        },
        "communities": {
          "type": "nested"
        }
      }
    }
  }
}
2. Indexing a document:
PUT my_index_2/_doc/1
{
  "title": "The One True Document",
  "communities": [
    {
      "id": "edd05cd0-0a49-4676-86f4-2db913235371"
    },
    {
      "id": "672916cf-ee32-4bed-a60f-9a7c08dba04b"
    }
  ]
}
3. Querying (using a nested query):
GET my_index_2/_doc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "communities",
            "query": {
              "term": {
                "communities.id.keyword": "672916cf-ee32-4bed-a60f-9a7c08dba04b"
              }
            }
          }
        }
      ]
    }
  }
}
You might notice I used communities.id.keyword and not communities.id. To understand the reason for this, go through this.
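For context, a sketch based on default dynamic mapping behavior (an assumption, since the nested mapping above declares no properties): the id field inside communities gets dynamically mapped as text with a keyword subfield, roughly:
"communities": {
  "type": "nested",
  "properties": {
    "id": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
The term query therefore has to target the exact-value communities.id.keyword subfield.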

Filtering facets results in nested element with ElasticSearch

I have this mapping:
products: {
  product: {
    properties: {
      id: {
        type: "long"
      },
      name: {
        type: "string"
      },
      tags: {
        dynamic: "true",
        properties: {
          tagId: {
            type: "long"
          },
          tagType: {
            type: "long"
          }
        }
      }
    }
  }
}
I want to create a facet on tag ids, but with tag-type filtering.
I need the filter to apply only to the facet and not the query results.
So here's my request:
{
  "from": 0,
  "size": 10,
  "facets": {
    "tags": {
      "terms": {
        "field": "tags.tagId",
        "size": 10
      },
      "facet_filter": {
        "terms": {
          "tags.tagType": [
            "11",
            "19"
          ]
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
The facet filtering does not seem to affect the faceting.
Any ideas?
The filter is applied to the documents, the parent entity in your example. That means you're filtering the documents on which you build the facet by tags.tagType, so every tag of any document that has at least one matching tags.tagType value is counted, which is not what you want.
This is the use case for nested documents. You can have a look at this nice article too.
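A sketch of how that could look with the legacy facet API this question uses (facets were later replaced by aggregations); this assumes tags is remapped as type nested:
{
  "from": 0,
  "size": 10,
  "facets": {
    "tags": {
      "terms": {
        "field": "tags.tagId",
        "size": 10
      },
      "nested": "tags",
      "facet_filter": {
        "terms": {
          "tags.tagType": ["11", "19"]
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
With the nested scope set, the facet_filter is evaluated against the individual nested tag documents, so only tags of type 11 or 19 are counted, while the match_all query results stay unfiltered.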
