Convert two repeated values in array into a string - elasticsearch

I have some old documents where a field holds an array of the same value repeated twice, something like this:
"task" : [
"first_task",
"first_task"
],
I'm trying to convert this array into a single string, since both values are the same. I've seen this related question: Convert array with 2 equal values to single value, but in my case the problem can't be fixed through Logstash because it only affects old documents that are already stored.
I was thinking of doing something like this:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "description": "Change task field from array to first element of this one",
          "lang": "painless",
          "source": """
            if (ctx['task'][0] == ctx['task'][1]) {
              ctx['task'] = ctx['task'][0];
            }
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_index" : "tasks",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "@timestamp" : "2022-05-03T07:33:44.652Z",
        "task" : ["first_task", "first_task"]
      }
    }
  ]
}
The result document is the following:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "tasks",
        "_type" : "_doc",
        "_id" : "1",
        "_source" : {
          "@timestamp" : "2022-05-03T07:33:44.652Z",
          "task" : "first_task"
        },
        "_ingest" : {
          "timestamp" : "2022-05-11T09:08:48.150815183Z"
        }
      }
    }
  ]
}
We can see that the task field is reassigned and now holds the first element of the array as its value.
Is there a way to manipulate the actual data in Elasticsearch and convert all documents with this characteristic using DSL queries?
Thanks.

You can achieve this with the _update_by_query endpoint. Here is an example:
POST tasks/_update_by_query
{
  "script": {
    "source": """
      if (ctx._source['task'][0] == ctx._source['task'][1]) {
        ctx._source['task'] = ctx._source['task'][0];
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match_all": {}
  }
}
You can remove the match_all query (omitting the query updates all documents as well), or you can restrict the update to the affected documents by changing the query conditions.
Keep in mind that running a script to update all documents in the index may cause some performance issues while the update process is running.
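If some old documents might not have task as a two-element array, a slightly more defensive variant of the script skips them instead of failing mid-run. This is only a sketch built on the answer above (conflicts=proceed and ctx.op = 'noop' are standard update-by-query features):
POST tasks/_update_by_query?conflicts=proceed
{
  "script": {
    "lang": "painless",
    "source": """
      // Collapse the array only when it is a two-element list of equal values
      if (ctx._source['task'] instanceof List
          && ctx._source['task'].size() == 2
          && ctx._source['task'][0] == ctx._source['task'][1]) {
        ctx._source['task'] = ctx._source['task'][0];
      } else {
        // Leave every other document untouched
        ctx.op = 'noop';
      }
    """
  }
}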

Related

elasticsearch aggregation by array size

I need stats from Elasticsearch and I can't work out the request.
I would like to know the number of people per appointment.
Here is a sample document from the appointment index:
{
  "id" : "383577",
  "persons" : [
    {
      "id" : "1"
    },
    {
      "id" : "2"
    }
  ]
}
What I would like:
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you
The easiest way to do this is to create another integer field containing the length of the persons array and aggregating on that field.
{
  "id" : "383577",
  "personsCount": 2,    <---- add this field
  "persons" : [
    {
      "id" : "1"
    },
    {
      "id" : "2"
    }
  ]
}
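Once personsCount exists, the stats you want become a plain terms aggregation on that field. A minimal sketch (the index name appointment is an assumption):
GET appointment/_search
{
  "size": 0,
  "aggs": {
    "persons_per_appointment": {
      "terms": {
        "field": "personsCount"
      }
    }
  }
}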
A non-optimal way of achieving what you expect is to use a script that returns the length of the persons array dynamically, but be aware that this can potentially harm your cluster depending on the volume of data you have:
GET /_search
{
  "aggs": {
    "persons": {
      "terms": {
        "script": "doc['persons.id'].size()"
      }
    }
  }
}
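On Elasticsearch 7.11 or later, a runtime field is a tidier way to express the same thing; it is still computed per document at search time, so the performance caveat applies. A sketch, assuming persons.id is indexed as a keyword field:
GET /_search
{
  "runtime_mappings": {
    "persons_count": {
      "type": "long",
      "script": {
        "source": "emit(doc['persons.id'].size())"
      }
    }
  },
  "aggs": {
    "persons": {
      "terms": {
        "field": "persons_count"
      }
    }
  }
}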
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
  "script": {
    "source": "ctx._source.personsCount = ctx._source.persons.size()"
  }
}
However, you'll also need to modify the logic of your indexing application to create that new field.
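If changing the indexing application is inconvenient, an ingest pipeline can compute the field at index time instead. A sketch (the pipeline name persons-count is made up here); attach it per indexing request or via the index.default_pipeline index setting:
PUT _ingest/pipeline/persons-count
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.personsCount = ctx.persons == null ? 0 : ctx.persons.size()"
      }
    }
  ]
}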

ElasticSearch Set Processor

I am trying to use the Elasticsearch set processor to add a queue-wise constant field to a given index which contains data from multiple queues. The Elasticsearch documentation is really sparse in this respect.
I am trying to use the code below to create a set processor for the index default-*, but somehow it's not working:
PUT /_ingest/pipeline/set_aht
{
  "description": "sets queue wise AHT constants",
  "processors": [
    {
      "set": {
        "field": "queueAHTVal",
        "value": "10",
        "if": "queueName == 'A'"
      }
    }
  ]
}
Looking for some how-to guidance from anyone who might have previously worked with the Elasticsearch set processor.
I tried to work out a possible suggestion. If I understood your issue correctly, you want to add a new field based on a field value, when queueName equals 'A'?
If yes, I modified your pipeline and did a test locally.
Here is the updated pipeline code:
PUT _ingest/pipeline/set_aht
{
  "processors": [
    {
      "set": {
        "field": "queueAHTVal",
        "value": "10",
        "if": "ctx.queueName.equals('A')"
      }
    }
  ]
}
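Before reindexing, you can sanity-check the pipeline with the simulate endpoint (the two sample documents below are made up):
POST _ingest/pipeline/set_aht/_simulate
{
  "docs": [
    { "_source": { "queueName": "A" } },
    { "_source": { "queueName": "B" } }
  ]
}
The first document should come back with queueAHTVal set to "10" and the second should be unchanged.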
I used the _reindex API to ingest the data into a new index through the pipeline.
POST _reindex
{
  "source": {
    "index": "espro"
  },
  "dest": {
    "index": "espro-v2",
    "pipeline": "set_aht"
  }
}
The response is:
"hits" : [
{
"_index" : "espro-v2",
"_type" : "_doc",
"_id" : "7BErVHQB3IIDvL59miT1",
"_score" : 1.0,
"_source" : {
"queueName" : "A",
"queueAHTVal" : "10"
}
},
{
"_index" : "espro-v2",
"_type" : "_doc",
"_id" : "IBEsVHQB3IIDvL59iien",
"_score" : 1.0,
"_source" : {
"queueName" : "B"
}
}
Let me know if you need more help, or if I misunderstood your issue, and I will try to help. Thank you.
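Note that _reindex only applies the pipeline to existing data. If new documents flowing into the index should also get the field, one option (a sketch using the real index.default_pipeline setting, available since Elasticsearch 6.5) is to make the pipeline the index default:
PUT espro-v2/_settings
{
  "index.default_pipeline": "set_aht"
}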

Elasticsearch Deduplication

I have a collection of documents where each document looks like
{
  "_id": ...,
  "Author": ...,
  "Content": ...,
  "DateTime": ...
}
I would like to issue one query to the collection so that I get in response the oldest document from each author. I am considering a terms aggregation, but that gives me a list of buckets (the unique Author values) which tells me nothing about which of each author's documents is the oldest. Furthermore, that approach requires a subsequent call to ES, which is undesirable.
Any advice you could offer would be greatly appreciated. Thanks.
You can use collapse in Elasticsearch.
It will return the top record per author sorted on DateTime. Note that the example below sorts descending (newest first); to get the oldest document per author, change the order to "asc".
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ]
}
Result
"hits" : [
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "e1QwrnABAWOsYG7tvNrB",
"_score" : null,
"_source" : {
"Author" : "b",
"Content" : "ADSAD",
"DateTime" : "2019-03-11"
},
"fields" : {
"Author.keyword" : [
"b"
]
},
"sort" : [
1552262400000
]
},
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "elQwrnABAWOsYG7to9oS",
"_score" : null,
"_source" : {
"Author" : "a",
"Content" : "ADSAD",
"DateTime" : "2019-03-10"
},
"fields" : {
"Author.keyword" : [
"a"
]
},
"sort" : [
1552176000000
]
}
]
}
EDIT 1:
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "authors": {
      "terms": {
        "field": "Author.keyword",
        "size": 10
      },
      "aggs": {
        "doc_count": {
          "value_count": {
            "field": "Author.keyword"
          }
        }
      }
    }
  }
}
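If you prefer a pure aggregation instead of collapse, a terms aggregation with a top_hits sub-aggregation sorted ascending on DateTime also returns the oldest document per author in a single call. A sketch reusing the field and index names above:
GET index83/_search
{
  "size": 0,
  "aggs": {
    "authors": {
      "terms": {
        "field": "Author.keyword",
        "size": 10
      },
      "aggs": {
        "oldest": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "DateTime": { "order": "asc" } }
            ]
          }
        }
      }
    }
  }
}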
There's no simple way of doing it directly with one call to Elasticsearch. Fortunately, there's a nice article on the Elastic Blog showing some methods of doing it.
One of these methods uses Logstash to remove duplicates. Another uses a Python script that can be found in this GitHub repository:
#!/usr/local/bin/python3
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
dict_of_duplicate_docs = {}

# The following line defines the fields that will be
# used to determine if a document is a duplicate
keys_to_include_in_hash = ["CAC", "FTSE", "SMI"]

# Process documents returned by the current search/scroll
def populate_dict_of_duplicate_docs(hits):
    for item in hits:
        combined_key = ""
        for mykey in keys_to_include_in_hash:
            combined_key += str(item['_source'][mykey])
        _id = item["_id"]
        hashval = hashlib.md5(combined_key.encode('utf-8')).digest()
        # If the hashval is new, then we will create a new key
        # in the dict_of_duplicate_docs, which will be
        # assigned a value of an empty array.
        # We then immediately push the _id onto the array.
        # If hashval already exists, then
        # we will just push the new _id onto the existing array
        dict_of_duplicate_docs.setdefault(hashval, []).append(_id)

# Loop over all documents in the index, and populate the
# dict_of_duplicate_docs data structure.
def scroll_over_all_docs():
    data = es.search(index="stocks", scroll='1m', body={"query": {"match_all": {}}})
    # Get the scroll ID
    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])
    # Before scroll, process current batch of hits
    populate_dict_of_duplicate_docs(data['hits']['hits'])
    while scroll_size > 0:
        data = es.scroll(scroll_id=sid, scroll='2m')
        # Process current batch of hits
        populate_dict_of_duplicate_docs(data['hits']['hits'])
        # Update the scroll ID
        sid = data['_scroll_id']
        # Get the number of results that returned in the last scroll
        scroll_size = len(data['hits']['hits'])

def loop_over_hashes_and_remove_duplicates():
    # Search through the hash of doc values to see if any
    # duplicate hashes have been found
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
        if len(array_of_ids) > 1:
            print("********** Duplicate docs hash=%s **********" % hashval)
            # Get the documents that have mapped to the current hashval
            matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
            for doc in matching_docs['docs']:
                # In this example, we just print the duplicate docs.
                # This code could be easily modified to delete duplicates
                # here instead of printing them
                print("doc=%s\n" % doc)

def main():
    scroll_over_all_docs()
    loop_over_hashes_and_remove_duplicates()

main()
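The script above only prints the duplicates it finds. If you decide to delete them instead, one option (a sketch; the IDs below are placeholders for all but the first _id of each group) is a bulk request:
POST _bulk
{ "delete" : { "_index" : "stocks", "_type" : "doc", "_id" : "<duplicate-id-1>" } }
{ "delete" : { "_index" : "stocks", "_type" : "doc", "_id" : "<duplicate-id-2>" } }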

ElasticSearch NEST - Use UpdateByQuery to create a non existing field or update an existing one

I'm pretty new to Elasticsearch (6.6.0); I'd like to achieve a single query that can both create and update a document based on a custom field.
Here is my document structure
{
  "_index" : "document",
  "_type" : "_doc",
  "_id" : "nvs9gmkB0wioRAGjGGVA",
  "_score" : 1.0,
  "_source" : {
    "customId" : "4a3e7b21-9be9-4378-98ec-aa3e9f40aee7",
    "title" : "test"
  }
}
Using NEST (the latest NuGet package), I built the following method to add or update a field in the document:
var response = await _client.UpdateByQueryAsync<dynamic>(s => s
    .Index("document")
    .Type("_doc")
    .Query(q => q.Match(t => t
        .Field("customId")
        .Query("4a3e7b21-9be9-4378-98ec-aa3e9f40aee7")
    ))
    .Script(sc => sc
        .Source("ctx._source." + PARAMETER_NAME + " = params." + PARAMETER_NAME)
        .Params(p => p
            .Add(PARAMETER_NAME, PARAMETER_VALUE)
        )
        .Lang(ScriptLang.Painless)
    )
    .RequestsPerSecond(-1)
    .WaitForCompletion()
    .Refresh()
).ConfigureAwait(false);

if (!response.IsValid) {
    // error logic goes here
}
return response;
If I try to use the method to UPDATE the title of the document with the given customId, everything works out just fine.
However, if I try to use the same method to ADD a field to the document, nothing happens: the query returns a "valid NEST response built from a successful low level call on POST" and moves on.
I've tried to replicate the same query and execute it in Kibana, and it works just fine, updating the document. Here's the query:
POST document/_doc/_update_by_query?requests_per_second=-1&wait_for_completion=true&refresh=true
{
  "query": {
    "match": {
      "customId": {
        "query": "4a3e7b21-9be9-4378-98ec-aa3e9f40aee7"
      }
    }
  },
  "script": {
    "lang": "painless",
    "params": {
      "datetime": "2019-03-15T11:44:43.555Z"
    },
    "source": "ctx._source.datetime = params.datetime"
  }
}
At this point I'm pretty lost; what am I missing?
Thank you.
P.S.: The document is likely to change over time. To deal with that, I've created a template for all my documents; once created, the document index has the following dynamic template:
"dynamic_templates" : [
{
"string_fields" : {
"match" : "*",
"path_unmatch" : "customId",
"match_mapping_type" : "string",
"mapping" : {
"analyzer" : "autocomplete",
"search_analyzer" : "autocomplete_search",
"type" : "text"
}
}
}
]

Elasticsearch: more_like_this query returns no hit

I am trying to find documents similar to one document (the document with id '4' in this case) in my Elasticsearch sandbox, based on a field (the 'town' field in this case).
So I wrote this query, which returns no hits:
GET _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
In my dataset, document #4 is located in a town named 'Paris'. Thus, when I run the following query, document #4 is in the hit results along with a lot of other results:
GET _search
{
  "query": {
    "match": { "town": "Paris" }
  }
}
I don't understand why the more_like_this query returns no results when there are other documents that have a field with the same value.
Note that I checked the _index, _type and _id parameters using a "match_all": {} query.
My query looks like the second example in this official Elasticsearch resource: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-mlt-query.html
What's wrong with my more_like_this query?
I am assuming you have only a small number of documents. In that case, lower min_doc_freq and try again: it defaults to 5, so terms that appear in fewer than 5 documents are ignored, which on a small dataset can leave the query with no terms at all.
Also use POST for the search:
POST _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12,
      "min_doc_freq" : 1
    }
  }
}
