Raw data size in elasticsearch index

How can I calculate the raw data size that the index is based on? In the /${INDEX_NAME}/_stats response I can see the total size of the index: the raw data plus the index structures. How can I check what the raw data size is?
For example, I have a 1 MB file with some documents. I indexed it into ES and the total index size is 1.3 MB. How can I reverse-engineer the raw data size when all I have is the index size?

Elasticsearch does not keep the source data size by default, but you can configure the Mapper Size plugin to add the _size metadata field, which stores the size of the _source field in bytes.
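A minimal sketch of turning it on (this assumes the mapper-size plugin is already installed on every node; my_index is just a placeholder):
PUT my_index
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}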
When enabled, every indexed document gets a _size field containing the length of its _source. For example:
GET my_index/_search?size=1&filter_path=hits.hits
Returns:
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "123456789abcdef0",
        "_score" : 1.0,
        "_size" : 75,
        "_source" : {
          "#timestamp" : "2023-01-31T23:58:58.869Z",
          "content" : {
            "att1" : "val1"
          }
        }
      }
    ]
  }
}
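With _size indexed, you can approximate the raw data size of the whole index by summing the field in an aggregation (a sketch; the aggregation name is arbitrary):
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "total_source_bytes": {
      "sum": { "field": "_size" }
    }
  }
}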
Moreover, if you want to understand why the index size is 1.3 MB, you can use the analyze index disk usage API, which exists for exactly this purpose:
POST my_index/_disk_usage?run_expensive_tasks=true

Related

How do you join elasticsearch indexes to sort?

Is there a preferred way to join two Elasticsearch indices so that I can sort my query?
Index #1
// GET /activities/_doc/1aadea40-e93b-42b4-9c76-05ebed4335fe (simplified output)
{
  "_index" : "activities-1605040906149",
  "_type" : "_doc",
  "_id" : "1aadea40-e93b-42b4-9c76-05ebed4335fe",
  "_source" : {
    "date" : 1614286078420,
    "activityId" : "1aadea40-e93b-42b4-9c76-05ebed4335fe",
    "referralId" : "943f6d94-b2dd-4e89-9383-447fdd1d73d8",
    "duration" : 90
  }
}
Index #2
// GET /referrals/_doc/2c022a6e-2543-4cdd-8595-98aea41e8966 (simplified output)
{
  "_index" : "referrals-1612984843755",
  "_type" : "_doc",
  "_id" : "2c022a6e-2543-4cdd-8595-98aea41e8966",
  "_source" : {
    "displayName" : "JOHN DOE",
    "referralId" : "2c022a6e-2543-4cdd-8595-98aea41e8966"
  }
}
I’d like to be able to join the contents of the referrals index with the contents of my activities index and then sort based on the referral’s displayName. I would need to do this for tens of thousands of records.
Other solutions include denormalizing my data, but I was hoping to see if there was an alternative way.
You can do joins in the following ways:
nested structure: map the related data as an object/nested field inside a single document, so no join is needed at query time.
parent-child relationship: use a field to distinguish parent and child documents within the same index; parent and child are indexed into the same shard, so the join query is limited to a single shard.
application-side joins: denormalize, e.g. copy the referral fields you need (such as displayName) directly into the activities documents, allowing you to search and sort on them directly (see the sketch after this list).
You can then sort on the referral's field.
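As a rough sketch of the application-side approach (the referralDisplayName field is illustrative, not part of the original mappings), each activity is indexed with the referral's display name copied in, and the sort then runs against a single index:
PUT activities/_doc/1aadea40-e93b-42b4-9c76-05ebed4335fe
{
  "date": 1614286078420,
  "activityId": "1aadea40-e93b-42b4-9c76-05ebed4335fe",
  "referralId": "943f6d94-b2dd-4e89-9383-447fdd1d73d8",
  "referralDisplayName": "JOHN DOE",
  "duration": 90
}
GET activities/_search
{
  "sort": [
    { "referralDisplayName.keyword": "asc" }
  ]
}
The sort uses the keyword sub-field that default dynamic mapping creates for text fields; the trade-off is that a change to a referral's displayName requires updating (for example via _update_by_query) all activities that reference it.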

Elasticsearch - force a field to index only, avoid store

How do I force a field to be indexed only and not store the data? This option is available in Solr, and I'm not sure if it's possible in Elasticsearch.
From the documentation:
By default, field values are indexed to make them searchable, but they
are not stored. This means that the field can be queried, but the
original field value cannot be retrieved.
Usually this doesn’t matter. The field value is already part of the
_source field, which is stored by default. If you only want to retrieve the value of a single field or of a few fields, instead of
the whole _source, then this can be achieved with source filtering.
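For example, a source-filtered search that returns only the title field might look like this (a sketch, using the same logs index as the mapping below):
GET logs/_search
{
  "_source": ["title"],
  "query": {
    "match": {
      "description": "b"
    }
  }
}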
If you don't want the field to be stored in _source either, you can exclude it from _source in the mapping.
Mapping:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      }
    },
    "_source": {
      "excludes": [
        "description"
      ]
    }
  }
}
Query:
GET logs/_search
{
  "query": {
    "match": {
      "description": "b" --> field "description" is searchable (indexed)
    }
  }
}
Result:
"hits" : [
{
"_index" : "logs",
"_type" : "_doc",
"_id" : "-aC9V3EBkD38P4LIYrdY",
"_score" : 0.2876821,
"_source" : {
"title" : "a" --> field "description" is not returned
}
}
]
Note:
Removing fields from _source means you lose support for the following:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.

How to query just the document names of all documents in an index in Elasticsearch

PS: I'm new to elasticsearch
http://localhost:9200/indexname/domains/<mydocname>
Let's suppose we have indexname as our index and I'm uploading a lot of documents at <mydoc> with domain names, e.g.:
http://localhost:9200/indexname/domains/google.com
http://localhost:9200/indexname/domains/company.com
Looking at http://localhost:9200/indexname/_count shows that we have "count": 119687 documents.
I just want Elasticsearch to return the document names of all 119687 entries, which are domain names.
How do I achieve that, and is it possible to do it in a single query?
Looking at the example http://localhost:9200/indexname/domains/google.com, I am assuming your doc_type is domains and the doc id / "document name" is google.com.
_id is the document name here, and it is always part of the response. You can use source filtering to disable _source, and the response will then show only something like this:
GET indexname/_search
{
  "_source": false
}
Output
{
...
  "hits" : [
    {
      "_index" : "indexname",
      "_type" : "domains",
      "_id" : "google.com",
      "_score" : 1.0
    }
  ]
...
}
If documentname is a field that is mapped, then you can still use source filtering to include only that field.
GET indexname/_search
{
  "_source": ["documentname"]
}
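Since a single search response is capped (10,000 hits by default, via index.max_result_window), fetching all ~120k document names needs pagination; a scroll-based sketch (the size and scroll timeout are just illustrative):
GET indexname/_search?scroll=1m
{
  "size": 10000,
  "_source": false
}
Each response carries a _scroll_id; keep sending it to GET _search/scroll with scroll=1m until the hits array comes back empty, collecting the _id values along the way.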

Elasticsearch catch-all field slowness after upgrade

We upgraded a 2.4 cluster to a 6.2 cluster using the reindex-from-remote approach. In 2.4, we were using the catch-all _all field to perform searches and were seeing response times under 500 ms for all our queries.
In 6.2, the _all field is no longer available for the new index, so we ended up creating a new text type field called all like "all": {"type": "text"} and set copy_to on all our other fields (about 2000 of them). But now, searches on this new catch-all field all are taking 2 to 10 times longer than the search on the 2.4 _all field. (We flushed the caches on both clusters before performing the queries.)
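A mapping along those lines might look like the following sketch (two illustrative fields stand in for the ~2000 real ones, and doc is the 6.x mapping type):
PUT sherlock
{
  "mappings": {
    "doc": {
      "properties": {
        "all":   { "type": "text" },
        "essay": { "type": "text", "copy_to": "all" },
        "ssn":   { "type": "keyword", "copy_to": "all" }
      }
    }
  }
}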
Both clusters are single data center, single node 8GB memory on the same AWS zone, hosted through elastic cloud. Both indices have the same number of documents (about 6M) and have about 150 Lucene segment files.
Any clues as to why?
UPDATE: Both indices return documents without the catch-all field i.e. they do not store the catch-all field.
Here is an example query and response:
$ curl --user "$user:$password" \
> -H 'Content-Type: application/json' \
> -XGET "$es/$index/$mapping/_search?pretty" -d'
> {
> "size": 1,
> "query" : {
> "match" : { "all": "sherlock" }
> }
> }
> '
{
  "took" : 42,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 28133,
    "max_score" : 2.290815,
    "hits" : [ {
      "_index" : "sherlock",
      "_type" : "doc",
      "_id" : "513763",
      "_score" : 2.290815,
      "_source" : {
        "docid" : 513763,
        "age" : 115,
        "essay" : "Has Mr. Sherlock Holmes?",
        "name" : {
          "last" : "Pezzetti",
          "first" : "Lilli"
        },
        "ssn" : 834632279
      }
    } ]
  }
}
UPDATE 2: Another point I forgot to mention is that the 2.4 cluster is currently being used by a staging app, which sends a few queries to it every few minutes. Could this bring other factors like OS caching into play?
Did you store the _all field and return it in your original setup? Do you return it now? If you didn't before and you do now, then that's response overhead you are seeing, not search overhead. Basically, you should omit that field from your response (from your _source) if you don't need it (and any other field you don't need, for that matter).
Check _source filtering for more details.
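For instance, keeping the same query but trimming the response source (a sketch; all is the copy_to target from the question, and essay just stands in for any large field you don't need back):
GET sherlock/_search
{
  "_source": {
    "excludes": ["all", "essay"]
  },
  "query": {
    "match": { "all": "sherlock" }
  }
}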

Efficient way to retrieve all _ids in ElasticSearch

What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible by using a simple query? One of my indices has around 20,000 documents.
Edit: Please also read the answer from Aleck Landgraf
You just want the elasticsearch-internal _id field? Or an id field from within your documents?
For the former, try
curl http://localhost:9200/index/type/_search?pretty=true -d '
{
  "query" : {
    "match_all" : {}
  },
  "stored_fields": []
}
'
Note (2017 update): the post originally included "fields": [], but the parameter has since been renamed; stored_fields is the new name.
The result will contain only the "metadata" of your documents
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
For the latter, if you want to include a field from your document, simply add it to the fields array
curl http://localhost:9200/index/type/_search?pretty=true -d '
{
  "query" : {
    "match_all" : {}
  },
  "fields": ["document_field_to_be_returned"]
}
'
Better to use scroll and scan to get the result list so elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.
For elasticsearch 5.x, you can use the "_source" field.
GET /_search
{
  "_source": false,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
"fields" has been deprecated.
(Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
Elaborating on answers by Robert Lujo and Aleck Landgraf,
if you want the IDs in a list from the returned generator, here is what I use:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es = Elasticsearch(hosts=[YOUR_ES_HOST])
hits = helpers.scan(
    es,
    query={"query": {"match_all": {}}},
    scroll='1m',
    index=INDEX_NAME
)
ids = [hit['_id'] for hit in hits]
Another option
curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
will return _index, _type, _id and _score.
I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers class can be used with a sliced scroll and thus allows multi-threaded execution. In my case, I also have a high-cardinality field (acquired_at) to slice on. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.
Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.
# note below I have es, index, and cluster_name variables already set
import gzip
from concurrent import futures
from elasticsearch import helpers

max_workers = 14
scroll_slice_ids = list(range(0, max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" must equal the total number of slices so every document is covered
        query = {"sort": ["_doc"],
                 "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)},
                 "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write(doc['_id'] + '\n')
            results_file.flush()
    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
For Python users: the Python Elasticsearch client provides a convenient abstraction for the scroll API:
from elasticsearch import Elasticsearch, helpers
client = Elasticsearch()
query = {
    "query": {
        "match_all": {}
    }
}
scan = helpers.scan(client, index=index, query=query, scroll='1m', size=100)
for doc in scan:
    # do something with each hit, e.g. collect doc['_id']
    pass
You can also do it in Python, which gives you a proper list:
import elasticsearch
es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
Inspired by @Aleck-Landgraf's answer, for me it worked by using the scan function directly from the standard elasticsearch Python API:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name", doc_type="your-doc-type"):
    print dobj["_id"],
This is working!
def select_ids(self, **kwargs):
    """
    :param kwargs: params from modules
    :return: array of incidents
    """
    index = kwargs.get('index')
    if not index:
        return None
    # print("Params", kwargs)
    query = self._build_query(**kwargs)
    # print("Query", query)
    # get results
    results = self._db_client.search(body=query, index=index, stored_fields=[], filter_path="hits.hits._id")
    print(results)
    ids = [_['_id'] for _ in results['hits']['hits']]
    return ids
Url -> http://localhost:9200/<index>/<type>/_search
http method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}
