Elasticsearch Deduplication

I have a collection of documents where each document looks like
{
"_id": ... ,
"Author": ...,
"Content": ....,
"DateTime": ...
}
I would like to issue one query to the collection so that I get in response the oldest document from each author. I am considering using a terms aggregation, but when I do that I get a list of buckets (the unique Author values) that tells me nothing about which of each author's documents is the oldest. Furthermore, that approach requires a subsequent call to ES, which is undesirable.
Any advice you could offer would be greatly appreciated. Thanks.

You can use collapse in Elasticsearch.
It will return the top record per author, sorted on DateTime. The example below sorts descending (newest first); switch the order to "asc" to get the oldest document per author.
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ]
}
Result
"hits" : [
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "e1QwrnABAWOsYG7tvNrB",
"_score" : null,
"_source" : {
"Author" : "b",
"Content" : "ADSAD",
"DateTime" : "2019-03-11"
},
"fields" : {
"Author.keyword" : [
"b"
]
},
"sort" : [
1552262400000
]
},
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "elQwrnABAWOsYG7to9oS",
"_score" : null,
"_source" : {
"Author" : "a",
"Content" : "ADSAD",
"DateTime" : "2019-03-10"
},
"fields" : {
"Author.keyword" : [
"a"
]
},
"sort" : [
1552176000000
]
}
]
}
EDIT 1: to also get the number of documents per author alongside the collapsed hits, you can add a terms aggregation with a value_count sub-aggregation:
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "authors": {
      "terms": {
        "field": "Author.keyword",
        "size": 10
      },
      "aggs": {
        "doc_count": {
          "value_count": {
            "field": "Author.keyword"
          }
        }
      }
    }
  }
}
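If you are querying from code, here is a minimal sketch of the same collapse query using the Python client (the index name "articles", the localhost address, and a 7.x elasticsearch-py client are assumptions); note the "asc" sort, so that the single hit kept per author is that author's oldest document:
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Collapse on Author.keyword and sort ascending by DateTime, so the single
# hit kept per author is that author's oldest document.
query = {
    "size": 10,
    "collapse": {"field": "Author.keyword"},
    "sort": [{"DateTime": {"order": "asc"}}]
}

response = es.search(index="articles", body=query)  # "articles" is a placeholder index name
for hit in response["hits"]["hits"]:
    print(hit["_source"]["Author"], hit["_source"]["DateTime"])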

There's no simple way of doing it directly with one call to Elasticsearch. Fortunately, there's a nice article on the Elastic Blog showing some methods of doing it.
One of these methods is using Logstash to remove duplicates. Another method uses a Python script that can be found in this GitHub repository:
#!/usr/local/bin/python3
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
dict_of_duplicate_docs = {}

# The following line defines the fields that will be
# used to determine if a document is a duplicate
keys_to_include_in_hash = ["CAC", "FTSE", "SMI"]

# Process documents returned by the current search/scroll
def populate_dict_of_duplicate_docs(hits):
    for item in hits:
        combined_key = ""
        for mykey in keys_to_include_in_hash:
            combined_key += str(item['_source'][mykey])
        _id = item["_id"]
        hashval = hashlib.md5(combined_key.encode('utf-8')).digest()
        # If the hashval is new, then we will create a new key
        # in the dict_of_duplicate_docs, which will be
        # assigned a value of an empty array.
        # We then immediately push the _id onto the array.
        # If hashval already exists, then
        # we will just push the new _id onto the existing array
        dict_of_duplicate_docs.setdefault(hashval, []).append(_id)

# Loop over all documents in the index, and populate the
# dict_of_duplicate_docs data structure.
def scroll_over_all_docs():
    data = es.search(index="stocks", scroll='1m', body={"query": {"match_all": {}}})
    # Get the scroll ID
    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])
    # Before scroll, process current batch of hits
    populate_dict_of_duplicate_docs(data['hits']['hits'])
    while scroll_size > 0:
        data = es.scroll(scroll_id=sid, scroll='2m')
        # Process current batch of hits
        populate_dict_of_duplicate_docs(data['hits']['hits'])
        # Update the scroll ID
        sid = data['_scroll_id']
        # Get the number of results that returned in the last scroll
        scroll_size = len(data['hits']['hits'])

def loop_over_hashes_and_remove_duplicates():
    # Search through the hash of doc values to see if any
    # duplicate hashes have been found
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
        if len(array_of_ids) > 1:
            print("********** Duplicate docs hash=%s **********" % hashval)
            # Get the documents that have mapped to the current hashval
            matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
            for doc in matching_docs['docs']:
                # In this example, we just print the duplicate docs.
                # This code could be easily modified to delete duplicates
                # here instead of printing them
                print("doc=%s\n" % doc)

def main():
    scroll_over_all_docs()
    loop_over_hashes_and_remove_duplicates()

main()
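The script only prints the duplicates. If you want it to remove them, one possible variant (a sketch, not part of the original script) is to keep the first _id of each duplicate group and delete the rest:
# Hypothetical variant of loop_over_hashes_and_remove_duplicates() that keeps
# the first document of each duplicate group and deletes every other copy.
def loop_over_hashes_and_delete_duplicates():
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
        if len(array_of_ids) > 1:
            for duplicate_id in array_of_ids[1:]:
                # Same index/type as used in the rest of the script
                es.delete(index="stocks", doc_type="doc", id=duplicate_id)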

Related

Convert two repeated values in array into a string

I have some old documents where a field has an array with the same value repeated twice, something like this:
"task" : [
"first_task",
"first_task"
],
I'm trying to convert this array into a string because it's the same value. I've seen the following script: Convert array with 2 equal values to single value, but in my case this can't be fixed through Logstash because it only affects old documents that are already stored.
I was thinking to do something like this:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"script": {
"description": "Change task field from array to first element of this one",
"lang": "painless",
"source": """
if (ctx['task'][0] == ctx['task'][1]) {
ctx['task'] = ctx['task'][0];
}
"""
}
}
]
},
"docs": [
{
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : ["first_task", "first_task"]
}
}
]
}
The result document is the following:
{
"docs" : [
{
"doc" : {
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : "first_task"
},
"_ingest" : {
"timestamp" : "2022-05-11T09:08:48.150815183Z"
}
}
}
]
}
We can see the task field is reassigned and we have the first element of the array as a value.
Is there a way to manipulate actual data from Elasticsearch and convert all the documents with this characteristic using DSL queries?
Thanks.
You can achieve this with the _update_by_query endpoint. Here is an example:
POST tasks/_update_by_query
{
"script": {
"source": """
if (ctx._source['task'][0] == ctx._source['task'][1]) {
ctx._source['task'] = ctx._source['task'][0];
}
""",
"lang": "painless"
},
"query": {
"match_all": {}
}
}
You can remove the match_all query (updating every document is the default), or you can restrict the update to specific documents by changing the query conditions.
Keep in mind that running a script to update all documents in the index may cause some performance issues while the update process is running.
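If you would rather run the same update from code, here is a minimal sketch with the Python client (localhost address and a 7.x elasticsearch-py client are assumptions; conflicts="proceed" simply keeps the job going if a document is modified while the update runs):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Same script and query as the _update_by_query call above.
body = {
    "script": {
        "source": """
            if (ctx._source['task'][0] == ctx._source['task'][1]) {
              ctx._source['task'] = ctx._source['task'][0];
            }
        """,
        "lang": "painless"
    },
    "query": {"match_all": {}}
}

es.update_by_query(index="tasks", body=body, conflicts="proceed")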

Elasticsearch aggregation by array size

I need some stats from Elasticsearch, but I can't work out the request.
I would like to know the number of people per appointment.
Sample document from the appointment index:
{
"id" : "383577",
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
What I would like:
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you
The easiest way to do this is to create another integer field containing the length of the persons array and to aggregate on that field.
{
"id" : "383577",
"personsCount": 2, <---- add this field
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
The other way of achieving what you expect is to use a script that returns the length of the persons array dynamically, but be aware that this is sub-optimal and can potentially harm your cluster depending on the volume of data you have:
GET /_search
{
"aggs": {
"persons": {
"terms": {
"script": "doc['persons.id'].size()"
}
}
}
}
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
"script": {
"source": "ctx._source.personsCount = ctx._source.persons.length"
}
}
However, you'll also need to modify the logic of your indexing application to create that new field.
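As an illustration of that indexing-side change, here is a minimal sketch with the Python client (the index name "appointments", the localhost address, and a 7.x client are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

appointment = {
    "id": "383577",
    "persons": [{"id": "1"}, {"id": "2"}]
}

# Compute the array length on the client side and store it as its own integer
# field, so later aggregations can use personsCount instead of a script.
appointment["personsCount"] = len(appointment["persons"])

es.index(index="appointments", id=appointment["id"], body=appointment)
A plain terms aggregation on personsCount then returns buckets like the ones shown in the question.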

Skipping indexing of some values in an array, but keeping them in _source

In ElasticSearch, I am trying to index documents like:
{
"tags": [
{
"value": "...",
"quality": 0.7
},
...
]
}
I would like _source to contain the full document, but only those values which have a quality above some threshold to be indexed. I read the documentation and it looks to me that if I modify the input document in any way before indexing (e.g., filter out values), then the modified document will be stored under _source, not the original one.
Is there a way to achieve this?
There is one way to achieve this. In the mapping below, the tags structure is disabled (i.e. not indexed at all). Then, by leveraging ingest processors, you can create a secondary tags structure (which I called indexedTags) that will only contain the tag elements whose quality component is higher than a given threshold.
So the mapping should look like this:
PUT test
{
"mappings": {
"properties": {
"tags": {
"enabled": false, <--- won't be indexed at all, but still present in _source
"properties": {
"value": {
"type": "text"
},
"quality": {
"type": "float"
}
}
},
"indexedTags": { <--- will only contain indexable values above threshold
"properties": {
"value": {
"type": "text"
},
"quality": {
"type": "float"
}
}
}
}
}
}
Then, we need to create an ingest pipeline that allows us to filter the right tag values. The following pipeline uses a script processor to create the indexedTags array out of the tags one; it will only contain elements whose quality field is above a defined threshold (e.g. 0.6 in this case):
PUT _ingest/pipeline/quality-threshold
{
"processors": [
{
"script": {
"source": """
ctx.indexedTags = ctx.tags.stream().filter(t -> t.quality > params.threshold).collect(Collectors.toList());
""",
"params": {
"threshold": 0.6
}
}
}
]
}
Finally, we can leverage that ingest pipeline while indexing documents:
PUT test/_doc/1?pipeline=quality-threshold
{
"tags": [
{
"value": "test",
"quality": 0.5
},
{
"value": "test2",
"quality": 0.8
}
]
}
When running the above command, the whole tags array will still be present in the _source but it won't be indexed. What will be indexed, however, is another array called indexedTags which will only contain the second element (i.e. test2), because its quality value is 0.8 and that's higher than the 0.6 threshold.
The document looks like this:
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"indexedTags" : [
{
"value" : "test2",
"quality" : 0.8
}
],
"tags" : [
{
"value" : "test",
"quality" : 0.5
},
{
"value" : "test2",
"quality" : 0.8
}
]
}
}
You can now see that the first element test is not indexed at all by searching for
GET test/_search?q=test
=> No results
While searching for test2 will retrieve your document:
GET test/_search?q=test2
=> Returns document 1
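The same round trip from the Python client might look like this sketch (the localhost address, a 7.x client, and refresh=True to make the document searchable immediately are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

doc = {
    "tags": [
        {"value": "test", "quality": 0.5},
        {"value": "test2", "quality": 0.8}
    ]
}

# Route the document through the quality-threshold pipeline at index time.
es.index(index="test", id=1, body=doc, pipeline="quality-threshold", refresh=True)

# Only "test2" was copied into indexedTags, so only that term is searchable.
print(es.search(index="test", q="test")["hits"]["total"])   # no hits
print(es.search(index="test", q="test2")["hits"]["total"])  # document 1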

Attempting to use Elasticsearch Bulk API when _id is equal to a specific field

I am attempting to bulk insert documents into an index. I need to have _id equal to a specific field that I am inserting. I'm using ES v6.6
POST productv9/_bulk
{ "index" : { "_index" : "productv9", "_id": "in_stock"}}
{ "description" : "test", "in_stock" : "2001"}
GET productv9/_search
{
"query": {
"match": {
"_id": "2001"
}
}
}
When I run the bulk statement it completes without any error. However, the search statement does not get any hits. Additionally, I have many more documents that I would like to insert in the same manner.
What I suggest is to create an ingest pipeline that sets the _id of your documents based on the value of the in_stock field.
First create the pipeline:
PUT _ingest/pipeline/set_id
{
"description" : "Sets the id of the document based on a field value",
"processors" : [
{
"set" : {
"field": "_id",
"value": "{{in_stock}}"
}
}
]
}
Then you can reference the pipeline in your bulk call:
POST productv9/doc/_bulk?pipeline=set_id
{ "index" : {}}
{ "description" : "test", "in_stock" : "2001"}
By calling GET productv9/_doc/2001 you will get your document.
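For inserting many documents the same way, a bulk call from the Python client might look like the following sketch (the localhost address and the second sample document are assumptions; the pipeline parameter is what routes every action through set_id, and the type "doc" matches the ES 6.6 setup above):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Alternate action and source entries, exactly like the raw _bulk body.
# No explicit _id is given: the set_id pipeline copies it from in_stock.
actions = []
for item in [{"description": "test", "in_stock": "2001"},
             {"description": "test2", "in_stock": "2002"}]:
    actions.append({"index": {}})
    actions.append(item)

es.bulk(body=actions, index="productv9", doc_type="doc", pipeline="set_id")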

How to get document size (in bytes) in Elasticsearch

I am new to Elasticsearch. I need to get the size of the documents in my query results.
Example:
"this is a document." (19 bytes)
"this is also a document." (24 bytes)
content: {"a": "this is a document", "b": "this is also a document"} (53 bytes)
When I query for the document in ES I will get the above documents as the result. So, if the size of both documents is 32 bytes, I need those 32 bytes returned by Elasticsearch as part of the result.
Does your document only contain a single field? I'm not sure this is 100% of what you want, but generally you can calculate the length of fields and either store them with the document or calculate them at query time (but this is a slow operation and I would avoid it if possible).
So here's an example with a test document and the calculation for the field length:
PUT test/_doc/1
{
"content": "this is a document."
}
POST test/_update_by_query
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "content_length"
}
}
]
}
},
"script": {
"source": """
if(ctx._source.containsKey("content")) {
ctx._source.content_length = ctx._source.content.length();
} else {
ctx._source.content_length = 0;
}
"""
}
}
GET test/_search
The query result is then:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"content" : "this is a document.",
"content_length" : 19
}
}
]
}
}
By the way, there are 19 characters in that one (including spaces and the dot). If you want to exclude those, you'll have to add some more logic to the script. Also, be careful with bytes, since UTF-8 might use more than one byte per character (as in höhe), and this script is really only counting characters.
Then you can easily use the length in queries and aggregations.
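For instance, a stats aggregation over the precomputed field might look like this sketch (Python client and localhost address assumed):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Aggregate on the stored content_length field instead of scripting at query time.
body = {
    "size": 0,
    "aggs": {
        "length_stats": {"stats": {"field": "content_length"}}
    }
}

print(es.search(index="test", body=body)["aggregations"]["length_stats"])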
If you want to calculate the size of all the subdocuments combined, use the following:
PUT test/_doc/2
{
"content": {
"a": "this is a document",
"b": "this is also a document"
}
}
POST test/_update_by_query
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "content_length"
}
}
]
}
},
"script": {
"source": """
if(ctx._source.containsKey("content")) {
ctx._source.content_length = 0;
for (item in ctx._source.content.entrySet()) {
ctx._source.content_length += item.getValue().length();
}
}
"""
}
}
GET test/_search
Just note that content can either be of type text or be an object (subdocument), but you can't mix the two in the same field.
There's no way to get the size of Elasticsearch documents through an API. The reason is that a document indexed into Elasticsearch takes up a different amount of space in the index depending on whether you store _all, which fields are indexed, the mapping type of those fields, doc_values, and more. Elasticsearch also uses deduplication and other compaction methods, so the index size has no linear correlation with the original documents it contains.
One way to work around it is to calculate the document size in advance, before indexing it, and add it as another field in the doc, e.g. a doc_size field. You can then query this calculated field and run aggregations on it.
Note however that, as stated above, this does not represent the size the document takes in the index and might be completely off; for example, if all the docs contain a very long text field with the same value, Elasticsearch would only store that long value once and reference it, so the index size would be much smaller.
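A minimal sketch of that approach with the Python client (the doc_size field name, the index name "my-index", and the localhost address are just examples; the size computed here is the UTF-8 byte length of the JSON-serialized source, which is an approximation, not the on-disk cost):
import json
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

doc = {"a": "this is a document", "b": "this is also a document"}

# Approximate the document size as the byte length of its JSON form and
# store it alongside the document so it can be queried and aggregated.
doc["doc_size"] = len(json.dumps(doc).encode("utf-8"))

es.index(index="my-index", body=doc)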
Elasticsearch now has a _size field (provided by the mapper-size plugin), which can be enabled in the mapping.
Once enabled, it gives out the size of the original _source in bytes.
GET <index_name>/_doc/<doc_id>?stored_fields=_size
See the official Elasticsearch documentation on the _size field for details.
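The _size field requires the mapper-size plugin to be installed on the cluster; once it is, enabling it at index creation might look like this sketch (a 7.x cluster and client and the index name "my-index" are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Enable the _size meta field (mapper-size plugin) for the new index.
es.indices.create(
    index="my-index",
    body={"mappings": {"_size": {"enabled": True}}}
)

# After indexing, the stored size in bytes can be fetched with a document:
# GET my-index/_doc/<doc_id>?stored_fields=_size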
