Elasticsearch aggregate by field prefix

I have data entries of the form
{
"id": "ABCxxx",
// Other fields
}
Where ABC is a unique identifier that defines the "type" of this record. (For example a user would be USR1234..., an image would be IMG1234...)
I want to get a list of all the different types of records that I have in my ES index. In essence, I want to group by id, but looking only at the first three characters of the id.
This obviously doesn't work, because it groups by the full id (so USR123 and USR456 end up in different buckets):
{
"fields": ["id"],
"aggs": {
"group_by_id": {
"terms": {
"field": "id"
}
}
}
}
How do I write this query?

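You can use a Painless script in the terms aggregation to accomplish this.
{
"fields": ["id"],
"aggs": {
"group_by_id": {
"terms": {
"script" : {
"inline": "doc['id'].value.substring(0, 3)",
"lang": "painless"
}
}
}
}
}
More info is in the Elasticsearch scripting documentation. Note that the field value has to be read through .value before substring can be called on it, and that an id shorter than three characters would make substring fail. On newer Elasticsearch versions the script key is "source" rather than "inline".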

As paqash already suggested, this can be achieved with a script, but I would suggest the alternative of storing "type" as a separate field in your schema.
For example:
USR1234 : {id:"USR1234", type:"USR"}
IMG1234 : {id:"IMG1234", type:"IMG"}
This would avoid unnecessary complications in scripting and keep your query interface clean.
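A minimal sketch of that approach, assuming the type code is always the first three characters of the id and that "type" is mapped as a keyword field. The helper names (with_type, type_aggregation) are made up for illustration, and the Counter is only a local stand-in for the buckets Elasticsearch would return:

```python
from collections import Counter

PREFIX_LENGTH = 3  # assumption: the type code is always the first three characters


def with_type(doc):
    """Return a copy of doc with a 'type' field derived from the id prefix."""
    enriched = dict(doc)
    enriched["type"] = doc["id"][:PREFIX_LENGTH]
    return enriched


def type_aggregation(size=100):
    """Request body for a terms aggregation on the precomputed 'type' field."""
    return {
        "size": 0,  # no hits needed, only the buckets
        "aggs": {"group_by_type": {"terms": {"field": "type", "size": size}}},
    }


docs = [{"id": "USR1234"}, {"id": "USR4567"}, {"id": "IMG1234"}]
enriched_docs = [with_type(d) for d in docs]

# Local stand-in for the bucket counts the aggregation would produce.
type_counts = Counter(d["type"] for d in enriched_docs)
```

The enrichment would run in whatever code indexes the documents, so the query itself stays a plain terms aggregation with no script.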

Related

How to search by non-tokenized field length in ElasticSearch

Say I create an index people which will take entries that will have two properties: name and friends
PUT /people
{
"mappings": {
"properties": {
"friends": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
and I put two entries, each one of them has two friends.
POST /people/_doc
{
"name": "Jack",
"friends": [
"Jill", "John"
]
}
POST /people/_doc
{
"name": "Max",
"friends": [
"John", "John" # Max will have two friends, but both named John
]
}
Now I want to search for people that have multiple friends
GET /people/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": "doc['friends.keyword'].length > 1"
}
}
}
]
}
}
}
This returns only Jack and ignores Max. I assume this is because we are actually traversing the inverted index, where John and John create only one token ('john'), so the number of tokens is actually 1 here.
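The result described above is also consistent with how doc values work for keyword fields: they are kept as a sorted set of unique values per document, so duplicates collapse regardless of tokenization. A small local model of that behaviour (plain Python, no Elasticsearch involved; the function names are made up for illustration):

```python
def doc_values_view(values):
    """Approximate what a script sees through doc['field']: unique, sorted values."""
    return sorted(set(values))


def source_view(values):
    """What the original _source array contains: every value, duplicates included."""
    return list(values)


jack_friends = ["Jill", "John"]
max_friends = ["John", "John"]

# Number of values a doc-values based script would count for each person.
jack_seen = len(doc_values_view(jack_friends))
max_seen = len(doc_values_view(max_friends))

# Number of values actually present in the source.
max_actual = len(source_view(max_friends))
```

Under this model Max has one distinct value but two entries in the source, which is why a length check on doc values filters him out.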
Since my index is relatively small and performance is not key, I would like to traverse the source instead of the inverted index:
GET /people/_search
{
"query": {
"bool": {
"filter": [
{
"script": {
"script": {
"source": "ctx._source.friends.length > 1"
}
}
}
]
}
}
}
But according to https://github.com/elastic/elasticsearch/issues/20068, the source is accessible only when updating, not when searching, so I cannot.
One obvious solution seems to be to compute the length of the field and store it in the index, something like friends_count: 2, and then filter on that. But that requires reindexing, and this also seems like something that should be solvable in some obvious way that I am missing.
Thanks a lot.
ES 7.11 introduced runtime fields. A runtime field is a field that is evaluated at query time. Runtime fields enable you to:
Add fields to existing documents without reindexing your data
Start working with your data without understanding how it’s structured
Override the value returned from an indexed field at query time
Define fields for a specific use without modifying the underlying schema
You can find more information in the runtime fields documentation. To use a runtime field for this case, you can do something like this:
Index time:
PUT my-index/
{
"mappings": {
"runtime": {
"friends_count": {
"type": "long",
"script": {
"source": "emit(params._source['friends'].size())"
}
}
},
"properties": {
"@timestamp": {"type": "date"}
}
}
}
You can also define runtime fields at search time; see the runtime fields documentation for more information.
Search time:
GET my-index/_search
{
"runtime_mappings": {
"friends_count": {
"type": "long",
"script": {
"source": "emit(params._source['friends'].size())"
}
}
},
"query": {
"range": {
"friends_count": {"gt": 1}
}
}
}
Update:
POST mytest/_update_by_query
{
"query": {
"match_all": {}
},
"script": {
"source": "ctx._source.arrayLength = ctx._source.friends.size()"
}
}
You can update all of your documents with the query above and then adjust your search to filter on the new field.
For everyone wondering about the same issue: I think @Kaveh's answer is the most likely way to go, but I did not manage to make it work in my case. It seems to me that the source is loaded only after the query is performed, and therefore you cannot access the source for the purposes of filtering the query.
This leaves you with two options:
filter the results at the application level (an ugly and slow solution)
actually save the field length in a separate field, such as friends_count
Possibly there is another option I don't know about.
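For the second option, a rough sketch of what the indexing side and the search body could look like. It assumes friends_count is computed by the application before indexing and mapped as a numeric field; the helper names are hypothetical and the last lines only check the logic locally:

```python
def with_friends_count(doc):
    """Return a copy of doc with the array length stored alongside the array."""
    enriched = dict(doc)
    enriched["friends_count"] = len(doc.get("friends", []))
    return enriched


def multiple_friends_query(minimum=2):
    """Search body that filters on the stored count instead of a script."""
    return {
        "query": {
            "bool": {
                "filter": [{"range": {"friends_count": {"gte": minimum}}}]
            }
        }
    }


people = [
    {"name": "Jack", "friends": ["Jill", "John"]},
    {"name": "Max", "friends": ["John", "John"]},
]
indexed = [with_friends_count(p) for p in people]

# Local check of what the range filter would match.
matching_names = [p["name"] for p in indexed if p["friends_count"] >= 2]
```

Because the count is taken from the original array, Max keeps a count of 2 even though both friends are named John.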

Multiple Paths in Nested Queries

I'm cross-posting this from the elasticsearch forums (https://discuss.elastic.co/t/multiple-paths-in-nested-query/96851/1)
Below is an example, but first I’ll tell you about my use case, because I’m not sure if this is a good approach. I’m trying to automatically index a large collection of typed data. This means I’m trying to generate mappings, and queries on those mappings, automatically based on information about my data. A lot of my data is relational, and I’m interested in being able to search across the relations, so I’m also interested in using nested data types.
However, the issue is that many of these types have on the order of 10 relations, and I’ve got a feeling it’s not a good idea to pass 10 identical copies of a nested query to Elasticsearch just to query 10 different nested paths the same way. So I’m wondering if it’s possible to pass multiple paths into a single query instead. Better yet, is it possible to search over all fields in the current document and in all its nested documents and their fields in a single query? I’m aware of object fields, but they’re not a good fit because I want to retrieve some data from the matched nested documents.
In this example, I create an index with multiple nested types and some fields of its own, upload a document, and attempt to query the document and all its nested documents, but fail. Is there some way to do this without duplicating the query for each nested path, or is duplicating it actually a performant way to do this? Thanks
PUT /my_index
{
"mappings": {
"type1" : {
"properties" : {
"obj1" : {
"type" : "nested",
"properties": {
"name": {
"type":"text"
},
"number": {
"type":"text"
}
}
},
"obj2" : {
"type" : "nested",
"properties": {
"color": {
"type":"text"
},
"food": {
"type":"text"
}
}
},
"lul":{
"type": "text"
},
"pucci":{
"type": "text"
}
}
}
}
}
PUT /my_index/type1/1
{
"obj1": [
{ "name":"liar", "number":"deer dog"},
{ "name":"one two three", "number":"you can call on me"},
{ "name":"ricky gervais", "number":"user 123"}
],
"obj2": [
{ "color":"red green blue", "food":"meatball and spaghetti"},
{ "color":"orange", "food":"pineapple, fish, goat"},
{ "color":"none", "food":"none"}
],
"lul": "lul its me user123",
"field": "one dog"
}
POST /my_index/_search
{
"query": {
"nested": {
"path": ["obj1", "obj2"],
"query": {
"query_string": {
"query": "ricky",
"all_fields": true
}
}
}
}
}
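The nested query takes a single path, so the request above is rejected. Since the queries are generated automatically anyway, one workaround (a sketch, not something confirmed in this thread) is to generate one nested clause per path and combine them with the root-level query in a bool/should. The function names are made up, and whether this performs well with around 10 paths is not established here:

```python
def nested_clause(path, text):
    """One nested query for a single path, searching every field under it."""
    return {
        "nested": {
            "path": path,
            "query": {"query_string": {"query": text, "fields": [path + ".*"]}},
            "inner_hits": {},  # return the nested documents that matched
        }
    }


def search_all_paths(paths, text):
    """Match the text in root-level fields or in any of the nested paths."""
    clauses = [{"query_string": {"query": text}}]  # root-level fields
    clauses += [nested_clause(path, text) for path in paths]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


body = search_all_paths(["obj1", "obj2"], "ricky")
```

The list of paths can come from the same metadata that generates the mappings, so the duplication lives in code rather than in hand-written queries.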

Compare IDs between two indices in elasticsearch

I have two indices in an Elasticsearch cluster, containing what ought to be the same data in two slightly different formats. However, the number of records is different. The IDs of each document should be the same. Is there a way to extract a list of the IDs that are present in one index but not the other?
If your two indices have the same type where these documents are stored, you can use something like this:
GET index1,index2/_search
{
"size": 0,
"aggs": {
"group_by_uid": {
"terms": {
"field": "_uid"
},
"aggs": {
"count_indices": {
"cardinality": {
"field": "_index"
}
},
"values_bucket_filter_by_index_count": {
"bucket_selector": {
"buckets_path": {
"count": "count_indices"
},
"script": "params.count < 2"
}
}
}
}
}
}
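The query above works in 5.x. Note that a terms aggregation returns only the top 10 buckets unless size is set. If your ID is also stored as a regular field inside the document, it is even better to aggregate on that field.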
For anyone that comes across this, Scrutineer (https://github.com/Aconex/scrutineer/) provides this sort of ability if you follow convention of ID & Version concepts within Elasticsearch.
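The logic of the aggregation above (group by ID, count the distinct indices, keep the groups with fewer than two) can also be sketched client-side. This assumes you have already pulled (index, id) pairs out of both indices, for example with a scroll; the function name and the sample data are made up:

```python
from collections import defaultdict


def ids_missing_from_an_index(hits, expected_indices=2):
    """Map each ID seen in fewer than expected_indices indices to where it was seen.

    hits is an iterable of (index_name, doc_id) pairs.
    """
    indices_per_id = defaultdict(set)
    for index_name, doc_id in hits:
        indices_per_id[doc_id].add(index_name)
    return {
        doc_id: sorted(names)
        for doc_id, names in indices_per_id.items()
        if len(names) < expected_indices
    }


sample_hits = [
    ("index1", "a"), ("index2", "a"),
    ("index1", "b"),
    ("index2", "c"),
]
missing = ids_missing_from_an_index(sample_hits)
```

The returned mapping tells you not only which IDs are unpaired but also which index each one lives in.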

Hide a single record in Elastic Search on a per user basis

As a logged-in user, I want to be able to hide a single record that I never want to see again if I perform the same search. Is this possible with Elasticsearch?
I've read about multitenancy and filters, but I'm not quite sure what a top-level implementation might look like.
One of my ideas is to store some reference to the unwanted record in an RDB and then add those references to a filter query, but I'm not sure what reference to use, since Elasticsearch generates its own IDs, which may not stay the same when a re-index happens.
It depends. If you don't have many users and your documents are not too big, you can go with a field on the document: add a dismissedBy field and, when a user dismisses a record, write an update to the document:
POST test/type1/1/_update
{
"script" : {
"inline": "ctx._source.dismissedBy.add(params.userId)",
"lang": "painless",
"params" : {
"userId" : "1"
}
}
}
And query:
POST /index/documents/_search
{
"query": {
"bool": {
"must_not": {
"term": {
"dismissedBy": 1
}
}
}
}
}
The problem with this approach is that if you re-index the document, the dismissedBy values will be overwritten, so you must keep a copy somewhere else too.
The other option, if documents are large or you have lots of users, is the parent/child approach.
When a user hits dismiss, index a child document:
PUT /indexname/dissmisses/1?parent=dismissforid
{
"userId": 1
}
Then when you search you do
POST /index/documents/_search
{
"query": {
"bool": {
"must_not": {
"has_child": {
"type": "dissmisses",
"query": {
"term": {
"userId": 1
}
}
}
}
}
}
}
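The idea from the question (keep the dismissed references in an RDB and add them to the query) could look roughly like the sketch below. It assumes documents are indexed with explicit IDs, for example the RDB primary key, so that they survive a re-index; the function name and the sample values are hypothetical:

```python
def search_excluding_dismissed(user_query, dismissed_ids):
    """Wrap a query so documents the user has dismissed are filtered out.

    dismissed_ids is the collection of document IDs loaded from the RDB
    for the current user.
    """
    bool_query = {"must": [user_query]}
    if dismissed_ids:
        bool_query["must_not"] = [{"ids": {"values": sorted(dismissed_ids)}}]
    return {"query": {"bool": bool_query}}


body = search_excluding_dismissed(
    {"match": {"title": "holiday"}},  # whatever the user searched for
    {"doc-42", "doc-7"},              # IDs this user has hidden
)
```

This keeps the documents themselves untouched, at the cost of a lookup in the RDB before each search and a query that grows with the number of dismissed records.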

Sort documents by size of a field

I have documents like below indexed,
1.
{
"name": "Gilly",
"hobbyName" : "coin collection",
"countries": ["US","France","Georgia"]
}
2.
{
"name": "Billy",
"hobbyName":"coin collection",
"countries":["UK","Ghana","China","France"]
}
Now I need to sort these documents based on the array length of the field "countries", such that the result after the sorting would be of the order document2,document1. How can I achieve this using elasticsearch?
You can use script based sorting to achieve this.
{
"query": {
"match_all": {}
},
"sort": {
"_script": {
"type": "number",
"script": "doc['countries'].values.size()",
"order": "desc"
}
}
}
I would suggest using the token_count field type in Elasticsearch.
It can be done with scripts, but the results won't be exact: scripts mostly read from the fielddata cache, where duplicate values are removed.
You can read more about the token_count type in the Elasticsearch mapping documentation.
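To make the caveat about duplicates concrete, here is a sketch that builds the script-sort body for an arbitrary field and checks the expected order locally. It assumes a recent Elasticsearch version (Painless, "source" as the script key) and a field with doc values; sort_by_value_count is a made-up helper name, and the local sort counts distinct values to mirror what the script would see:

```python
def sort_by_value_count(field, order="desc"):
    """Search body that sorts hits by the number of values stored in a field."""
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "painless",
                    "source": "doc[params.field].size()",
                    "params": {"field": field},
                },
                "order": order,
            }
        },
    }


documents = [
    {"name": "Gilly", "countries": ["US", "France", "Georgia"]},
    {"name": "Billy", "countries": ["UK", "Ghana", "China", "France"]},
]

# Local model of the sort: distinct values only, as a doc-values script would count.
expected_order = [
    d["name"]
    for d in sorted(documents, key=lambda d: len(set(d["countries"])), reverse=True)
]
```

If the arrays can contain repeated values and the repeats should count, storing the length in its own numeric field at index time avoids the problem.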
