How to get document size (in bytes) in Elasticsearch

I am new to Elasticsearch. I need to get the size (in bytes) of the documents returned by a query.
Example:
this is a document. (19 bytes)
this is also a document. (24 bytes)
content: {"a": "this is a document", "b": "this is also a document"} (53 bytes)
When I query for these documents in ES, I will get the above documents as the result, so the size of both documents is 32 bytes. I need Elasticsearch to return that 32 bytes as part of the result.

Does your document only contain a single field? I'm not sure this is 100% of what you want, but generally you can calculate the length of fields and either store them with the document or calculate them at query time (but this is a slow operation and I would avoid it if possible).
So here's an example with a test document and the calculation for the field length:
PUT test/_doc/1
{
  "content": "this is a document."
}
POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = ctx._source.content.length();
      } else {
        ctx._source.content_length = 0;
      }
    """
  }
}
GET test/_search
The query result is then:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "content" : "this is a document.",
          "content_length" : 19
        }
      }
    ]
  }
}
Note that this counts 19 characters (including spaces and the dot). If you want to exclude those, you'll have to add some more logic to the script. Also be careful with bytes: UTF-8 may use more than one byte per character (as in "höhe"), and this script really only counts characters.
Then you can easily use the length in queries and aggregations.
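For instance, a minimal sketch (reusing the test index and the content_length field created above) that filters on the stored length and computes basic statistics over it:
GET test/_search
{
  "query": {
    "range": {
      "content_length": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "length_stats": {
      "stats": {
        "field": "content_length"
      }
    }
  }
}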
If you want to calculate the size of all the subdocuments combined, use the following:
PUT test/_doc/2
{
  "content": {
    "a": "this is a document",
    "b": "this is also a document"
  }
}
POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = 0;
        for (item in ctx._source.content.entrySet()) {
          ctx._source.content_length += item.getValue().length();
        }
      }
    """
  }
}
GET test/_search
Just note that content must consistently be either a text field or an object (subdocument) across your documents; you can't mix the two for the same field.

There's no API to get the size of a document in Elasticsearch. The reason is that a document indexed into Elasticsearch takes up a different amount of space in the index depending on whether _all is stored, which fields are indexed, the mapping type of those fields, doc_values, and more. Elasticsearch also uses deduplication and other compaction methods, so the index size has no linear correlation with the original documents it contains.
One way to work around this is to calculate the document size in advance, before indexing it, and add it as another field in the doc, e.g. a doc_size field. You can then query this calculated field and run aggregations on it.
Note, however, that as stated above this does not represent the size in the index and might be completely wrong. For example, if all the docs contain a very long text field with the same value, Elasticsearch would only store that long value once and reference it, so the index size would be much smaller.
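As a rough sketch of that idea (assuming you have indexed such a numeric doc_size field; the index name below is just a placeholder), you could then sum it up in an aggregation:
GET <index_name>/_search
{
  "size": 0,
  "aggs": {
    "total_doc_size": {
      "sum": {
        "field": "doc_size"
      }
    }
  }
}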

Elasticsearch also has a _size field, provided by the mapper-size plugin, which can be enabled in the mapping.
Once enabled, it stores the size of the _source field in bytes, which you can retrieve as a stored field.
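A minimal sketch of enabling it (the index name is a placeholder, and on older versions that still use mapping types the _size block goes under the type name):
PUT <index_name>
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}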
GET <index_name>/_doc/<doc_id>?stored_fields=_size
Elasticsearch official doc

Related

Elasticsearch Deduplication

I have a collection of documents where each document looks like
{
  "_id": ...,
  "Author": ...,
  "Content": ...,
  "DateTime": ...
}
I would like to issue one query to the collection so that I get in response the oldest document from each author. I am considering a terms aggregation, but when I do that I get a list of buckets for the unique Author values, which tells me nothing about which of their documents is the oldest. Furthermore, that approach requires a subsequent call to ES, which is undesirable.
Any advice you could offer would be greatly appreciated. Thanks.
You can use field collapsing in Elasticsearch.
The query below returns the top record per author sorted on DateTime (with "order": "desc" you get the newest document per author; use "asc" to get the oldest, as asked).
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ]
}
Result
"hits" : [
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "e1QwrnABAWOsYG7tvNrB",
"_score" : null,
"_source" : {
"Author" : "b",
"Content" : "ADSAD",
"DateTime" : "2019-03-11"
},
"fields" : {
"Author.keyword" : [
"b"
]
},
"sort" : [
1552262400000
]
},
{
"_index" : "index83",
"_type" : "_doc",
"_id" : "elQwrnABAWOsYG7to9oS",
"_score" : null,
"_source" : {
"Author" : "a",
"Content" : "ADSAD",
"DateTime" : "2019-03-10"
},
"fields" : {
"Author.keyword" : [
"a"
]
},
"sort" : [
1552176000000
]
}
]
}
EDIT 1:
{
  "size": 10,
  "collapse": {
    "field": "Author.keyword"
  },
  "sort": [
    {
      "DateTime": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "authors": {
      "terms": {
        "field": "Author.keyword",
        "size": 10
      },
      "aggs": {
        "doc_count": {
          "value_count": {
            "field": "Author.keyword"
          }
        }
      }
    }
  }
}
There's no simple way of doing it directly with one call to Elasticsearch. Fortunately, there's a nice article on the Elastic blog showing some methods of doing it.
One of these methods uses Logstash to remove duplicates. Another uses a Python script that can be found in this GitHub repository:
#!/usr/local/bin/python3
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
dict_of_duplicate_docs = {}

# The following line defines the fields that will be
# used to determine if a document is a duplicate
keys_to_include_in_hash = ["CAC", "FTSE", "SMI"]


# Process documents returned by the current search/scroll
def populate_dict_of_duplicate_docs(hits):
    for item in hits:
        combined_key = ""
        for mykey in keys_to_include_in_hash:
            combined_key += str(item['_source'][mykey])

        _id = item["_id"]
        hashval = hashlib.md5(combined_key.encode('utf-8')).digest()

        # If the hashval is new, then we will create a new key
        # in the dict_of_duplicate_docs, which will be
        # assigned a value of an empty array.
        # We then immediately push the _id onto the array.
        # If hashval already exists, then
        # we will just push the new _id onto the existing array
        dict_of_duplicate_docs.setdefault(hashval, []).append(_id)


# Loop over all documents in the index, and populate the
# dict_of_duplicate_docs data structure.
def scroll_over_all_docs():
    data = es.search(index="stocks", scroll='1m', body={"query": {"match_all": {}}})

    # Get the scroll ID
    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])

    # Before scroll, process current batch of hits
    populate_dict_of_duplicate_docs(data['hits']['hits'])

    while scroll_size > 0:
        data = es.scroll(scroll_id=sid, scroll='2m')

        # Process current batch of hits
        populate_dict_of_duplicate_docs(data['hits']['hits'])

        # Update the scroll ID
        sid = data['_scroll_id']

        # Get the number of results that returned in the last scroll
        scroll_size = len(data['hits']['hits'])


def loop_over_hashes_and_remove_duplicates():
    # Search through the hash of doc values to see if any
    # duplicate hashes have been found
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
        if len(array_of_ids) > 1:
            print("********** Duplicate docs hash=%s **********" % hashval)
            # Get the documents that have mapped to the current hashval
            matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
            for doc in matching_docs['docs']:
                # In this example, we just print the duplicate docs.
                # This code could be easily modified to delete duplicates
                # here instead of printing them
                print("doc=%s\n" % doc)


def main():
    scroll_over_all_docs()
    loop_over_hashes_and_remove_duplicates()

main()

Attempting to use Elasticsearch Bulk API when _id is equal to a specific field

I am attempting to bulk insert documents into an index. I need the _id to be equal to a specific field that I am inserting. I'm using ES v6.6.
POST productv9/_bulk
{ "index" : { "_index" : "productv9", "_id": "in_stock"}}
{ "description" : "test", "in_stock" : "2001"}
GET productv9/_search
{
  "query": {
    "match": {
      "_id": "2001"
    }
  }
}
When I run the bulk statement, it runs without any error. However, when I run the search statement, it does not get any hits. Additionally, I have many more documents that I would like to insert in the same manner.
In your bulk request, "_id": "in_stock" sets the document id to the literal string in_stock, not to the value of the in_stock field, which is why searching for _id 2001 returns nothing. What I suggest is to create an ingest pipeline that sets the _id of your document based on the value of the in_stock field.
First create the pipeline:
PUT _ingest/pipeline/set_id
{
  "description" : "Sets the id of the document based on a field value",
  "processors" : [
    {
      "set" : {
        "field": "_id",
        "value": "{{in_stock}}"
      }
    }
  ]
}
Then you can reference the pipeline in your bulk call:
POST productv9/doc/_bulk?pipeline=set_id
{ "index" : {}}
{ "description" : "test", "in_stock" : "2001"}
By calling GET productv9/_doc/2001 you will get your document.
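Alternatively, if you build the bulk payload yourself, you can simply put the actual value of in_stock into the action metadata as the _id, e.g. for the example above:
POST productv9/doc/_bulk
{ "index" : { "_id" : "2001" } }
{ "description" : "test", "in_stock" : "2001" }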

Preserve wrong messages in Elasticsearch

I have a static mapping in an Elasticsearch index. When a message doesn't match this mapping, it is discarded. Is there a way to route it to a default index for wrong messages?
To give you an example, I have some fields with integer type:
"status_code": {
"type": "integer"
},
When a message contains a number
"status_code": 123,
it's ok. But when it is
"status_code": "abc"
it fails.
You can have ES do this triage pretty easily using ingest nodes/processors.
The main idea is to create an ingest pipeline with a convert processor for the status_code field. If the conversion doesn't work, an on_failure clause redirects the document to another index that you can process later.
So create the failures ingest pipeline:
PUT _ingest/pipeline/failures
{
  "processors": [
    {
      "convert": {
        "field": "status_code",
        "type": "integer"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "_index",
        "value": "failed-{{ _index }}"
      }
    }
  ]
}
Then, when you index a document, you simply specify the pipeline as a parameter. Indexing a document with a correct status code will succeed:
PUT test/doc/1?pipeline=failures
{
  "status_code": 123
}
However, trying to index a document with a bad status code will actually also succeed, but the document will be indexed into the failed-test index and not the test one:
PUT test/doc/2?pipeline=failures
{
  "status_code": "abc"
}
After running these two commands, you'll see this:
GET failed-test/_search
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "failed-test",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "status_code" : "abc"
        }
      }
    ]
  }
}
To sum up, you didn't have to handle that exceptional case in your client code and could fully leverage ES ingest nodes to achieve the same task.
You can set the ignore_malformed mapping parameter to ignore just the field with the type mismatch instead of rejecting the whole document.
You can also combine it with multi-fields, which let you map the same value in different ways.
You will probably need something like this:
"status_code": {
"type": "integer",
"fields": {
"as_string": {
"type": "keyword"
}
}
}
This way you will have a field named status_code as an integer and the same value in a field named status_code.as_string as a keyword, but you should test whether it really does what you want.
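For example, assuming the mapping above (and the test index used earlier in this thread), you could then aggregate on the keyword version:
GET test/_search
{
  "size": 0,
  "aggs": {
    "codes_as_strings": {
      "terms": {
        "field": "status_code.as_string"
      }
    }
  }
}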
Use strict mapping and you will be able to catch the exception raised by Elasticsearch.
Below is an excerpt from the Elasticsearch docs:
By default, when a previously unseen field is found in a document, Elasticsearch will add the new field to the type mapping. This behaviour can be disabled, both at the document and at the object level, by setting the dynamic parameter to false (to ignore new fields) or to strict (to throw an exception if an unknown field is encountered).
As part of exception handling, you can push the message to some other index where dynamic mapping is enabled.
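A rough sketch of what such a strict mapping could look like (the index name is a placeholder, and the doc type level follows the 6.x-style examples above):
PUT <index_name>
{
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "properties": {
        "status_code": {
          "type": "integer"
        }
      }
    }
  }
}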

ElasticSearch return non analyzed version of analyzed aggregate

I am having a problem implementing an autocomplete feature using the data in Elasticsearch. My documents currently have this kind of structure:
PUT mainindex/books/1
{
  "title": "The unread book",
  "author": "Mario smith",
  "tags": [ "Comedy", "Romantic", "Romantic Comedy", "México" ]
}
All the fields are indexed, and the mapping for the tags uses lowercase and asciifolding filters.
Now, the required functionality is that if the user types "mario smith rom...", I need to suggest tags starting with "rom", but only for books by Mario Smith. This required breaking the text into components, and I already got that part. The current query is something like this:
{
  "query": {
    "query_string": {
      "query": "mario smith",
      "default_operator": "AND"
    }
  },
  "size": 0,
  "aggs": {
    "autocomplete": {
      "terms": {
        "field": "suggest",
        "order": {
          "_term": "asc"
        },
        "include": {
          "pattern": "rom.*"
        }
      }
    }
  }
}
This returns the expected result: a list of words that the user could type next, based on the query and on the prefix of the word they are starting to type:
{
  "aggregations" : {
    "autocomplete" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "romantic comedy",
          "doc_count" : 4
        },
        {
          "key" : "romantic",
          "doc_count" : 2
        }
      ]
    }
  }
}
Now the problem is that I can't present these words to the user, because they are lowercased and stripped of accents; words like "México" get indexed as "mexico", which in my language makes some words look weird. If I remove the filters from the tags field, the values are saved into the index correctly, but then the pattern rom.* will not match, because the user may type in a different case and may not use the correct accents.
In general terms, what I need is to take a filtered set of documents, aggregate their tags, return the tags in their original form, but keep only the ones with the given prefix, matching in a case- and accent-insensitive way.
PS: I saw some suggestions about having two versions of the field, one analyzed and one raw, but I can't seem to be able to filter by one and return the other.
Does anyone have an idea how to perform this query or implement this functionality?

Asking for significant terms but returns nothing

I am having an issue with Elasticsearch (version 2.0): I am trying to get the significant terms from a bunch of documents, but it always returns nothing.
Here is the schema of my index :
{
  "documents" : {
    "warmers" : {},
    "mappings" : {
      "document" : {
        "properties" : {
          "text" : {
            "index" : "not_analyzed",
            "type" : "string"
          },
          "entities": {
            "properties": {
              "text": {
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1447410095617",
        "uuid" : "h2m2J9sJQaCpxvGDI591zg",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "2000099"
        },
        "number_of_shards" : "5"
      }
    },
    "aliases" : {}
  }
}
So it's a simple index that contains the field text, which is not analyzed, and an array entities that contains objects with a single field text, which is not analyzed either.
What I want to do is match some of the documents and extract the most significant terms from the associated entities. For that, I use a wildcard query and then an aggregation.
Here is the request I am sending through curl:
curl -XGET 'http://localhost:9200/documents/_search' -d '{
  "query": {
    "bool": {
      "must": { "wildcard": { "text": "*test*" } }
    }
  },
  "aggregations" : {
    "my_significant_terms" : {
      "significant_terms" : { "field" : "entities.text" }
    }
  }
}'
Unfortunately, even though Elasticsearch hits some documents, the buckets of the significant terms aggregation are always empty.
I also tried analyzed instead of not_analyzed, but I got the same empty results.
So first, is it relevant to do it this way?
I am a complete beginner with Elasticsearch, so can you explain to me how the significant terms aggregation works?
And finally, if it is relevant, why isn't my query working?
EDIT: I just saw in the Elasticsearch documentation that the significant terms aggregation needs a certain amount of data to become effective, and I only have 163 documents in my index. Could that be it?
Not sure if it will help, but try to specify "min_doc_count": 1 in the significant_terms aggregation.
the significant terms aggregation need a certain amount of data to
become effective, and I just have 163 documents in my index. Could it
be that ?
Using 1 shard instead of 5 will also help if you have a small number of docs.

Resources