How to get all the index patterns which never had any documents? - elasticsearch

For Kibana server decommissioning purposes, I want to get a list of index patterns that never had a single document, and a list of those that did.
How to achieve this using Kibana only?
I tried this, but it doesn't list the indices by document count:
GET /_cat/indices
Checking each pattern individually to see whether it has any documents is also time-consuming:
GET index-pattern*/_count

You can try this. v is for verbose and s stands for sort:
GET /_cat/indices?v&s=store.size:desc
From the docs:
These metrics are retrieved directly from Lucene, which Elasticsearch uses internally to power indexing and search. As a result, all document counts include hidden nested documents.
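One way to act on the _cat/indices output, sketched in Python: request it as JSON (format=json, optionally with s=docs.count:asc to sort by document count) and split the rows by their docs.count value. The function name and the sample rows below are hypothetical. Note that a current count of 0 only shows that an index is empty now, not that it never held documents.

```python
def split_by_doc_count(cat_rows):
    """Group rows from `GET /_cat/indices?format=json` by document count.

    Returns a dict with three sorted lists of index names:
    "empty", "non_empty" and "unknown" (no count reported).
    """
    groups = {"empty": [], "non_empty": [], "unknown": []}
    for row in cat_rows:
        raw = row.get("docs.count")
        if raw in (None, ""):
            # Some indices (for example closed ones) may report no count.
            groups["unknown"].append(row["index"])
            continue
        # The cat API returns numeric columns as strings in JSON output.
        key = "non_empty" if int(raw) > 0 else "empty"
        groups[key].append(row["index"])
    return {name: sorted(indices) for name, indices in groups.items()}


# Hypothetical sample rows, shaped like the JSON form of the cat output.
sample_rows = [
    {"index": "logs-2020.01", "docs.count": "1200"},
    {"index": "metrics-old", "docs.count": "0"},
    {"index": "archive-closed", "docs.count": None},
]

for group, names in split_by_doc_count(sample_rows).items():
    print(group, names)
```

Index names can then be matched against the index patterns on the client side.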

Related

Get last document from index in Elasticsearch

I'm playing around with the package github.com/olivere/elastic. Everything works fine, but I have a question: is it possible to get the last N inserted documents?
The From option of the Search action defaults to 0 as the starting point, and I couldn't tell whether it is possible to omit it from a search.
TL;DR
I am not aware of a feature in the Elasticsearch API that retrieves the latest inserted documents, but there is a way to achieve something similar if you store the ingest time of the documents.
Then you can sort on the ingest time, and retrieve the top N documents.
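A minimal sketch of the search body this describes, written in Python rather than with olivere/elastic. The field name ingested_at is an assumption; use whatever field holds the ingest time in your mapping. The from parameter is left out, so the search starts at the default offset of 0.

```python
import json


def latest_docs_query(n, timestamp_field="ingested_at"):
    """Build a search body that returns the n most recently ingested documents.

    `timestamp_field` is a hypothetical field name holding the ingest time.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "size": n,  # top N hits only
        "query": {"match_all": {}},
        # Newest first, based on the stored ingest time.
        "sort": [{timestamp_field: {"order": "desc"}}],
    }


print(json.dumps(latest_docs_query(5), indent=2))
```

The resulting JSON can be sent as the body of a _search request against the index.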

Using stored_fields for retrieving a subset of the fields in Elastic Search

The documentation and recommendations for the stored_fields feature in Elasticsearch have been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
In version 7.4.0, by contrast, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
Is there a reason for this?
What is the guidance on using this feature? Is _source filtering a better option? I ask because another article says that _source filtering kills performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields without killing performance in Elasticsearch?
Source filtering is the recommended way to fetch fields. The blog post is what confused you, but it is talking about a specific use case, and that context is important. Please read the statement below carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits per search, which can be changed with the size parameter. If you want to fetch only a few field values for those hits, source filtering makes perfect sense, because it is applied to the final result set and not to every document matching the query.
If, on the other hand, you use a script that reads field values from _source in order to filter the search results, the query has to scan the whole index. That is the second part of the statement above ("not when processing millions of documents").
Apart from that, all field values are already stored as part of the _source field, which is enabled by default. You therefore do not need to allocate extra space by explicitly marking a few fields as stored (this is disabled by default to keep the index small) just to retrieve their values.
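As an illustration, a search body with source filtering might be built like this in Python. The helper name, the query and the field names are hypothetical; only the size, query and _source keys come from the search API.

```python
import json


def search_with_source_filter(query, fields, size=10):
    """Build a search body that returns only the given fields from _source.

    The filter applies to the hits that are returned (at most `size`),
    not to every document matching the query.
    """
    return {
        "size": size,
        "query": query,
        "_source": list(fields),  # fields to keep in each hit
    }


# Hypothetical query and field names.
body = search_with_source_filter(
    query={"match": {"title": "elasticsearch"}},
    fields=["title", "price"],
)
print(json.dumps(body, indent=2))
```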

Stormcrawler - how does the es.status.filterQuery work?

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses - DISCOVERED, FETCHED, ERROR, etc.
I was wondering if I could tell StormCrawler to just crawl the urls that are https and with the status: DISCOVERED and if that would actually work. I have the es-conf.yaml set as follows:
es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"
Is that correct? How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
See the code of the AggregationSpout.
how does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.
It is a positive filter i.e. the documents must match the query in order to be retrieved; you'd need to remove the - for it to do what you described.
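To make "positive filter" concrete, here is a rough Python sketch of how such a value could be attached to a status query as a query_string filter. This is not StormCrawler's actual code: the function name is hypothetical, and the nextFetchDate field and overall query shape are assumptions made for illustration.

```python
import json


def build_status_query(filter_query=None, now="now"):
    """Combine a "due for fetching" condition with an optional positive filter.

    Every clause under bool.filter must match, so a document is returned
    only if it also satisfies `filter_query`.
    """
    # Assumed base condition: URLs whose next fetch date has passed.
    filters = [{"range": {"nextFetchDate": {"lte": now}}}]
    if filter_query:
        filters.append({"query_string": {"query": filter_query}})
    return {"query": {"bool": {"filter": filters}}}


# The value from the question with the leading "-" removed.
body = build_status_query("url:https* AND status:DISCOVERED")
print(json.dumps(body, indent=2))
```

With the leading "-" kept, the query string is a negation, so the filter would keep everything except the https URLs with status DISCOVERED.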

Can I narrow results from Elastic Search _stats get?

I am using Elasticsearch for the project I'm working on, and I was wondering if there is a way to narrow the results I get from an indices stats request.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
I currently use the docs to narrow the data I get back about the indices, but now I want to get back only the ones with a doc count greater than 0. Does anyone know whether this is possible, and if so, how?
Thanks!
This is for Elasticsearch 1.5.2.
If you're concerned about the size of the response (i.e. if you have many indices with many shards), the best you can do is to use response filtering (available only since ES 1.7) and retrieve only the docs field, which you can then filter further on the client side:
curl 'localhost:9200/_stats/docs?pretty&filter_path=**.docs.count'
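The client-side step could look like the following Python sketch. It assumes the response has already been parsed from JSON and keeps the usual indices -> name -> primaries -> docs -> count nesting; the function name and the sample data are hypothetical.

```python
def indices_with_docs(stats, min_count=1):
    """Return {index_name: doc_count} for indices holding at least min_count docs.

    `stats` is the parsed JSON from `_stats/docs`, optionally reduced with
    filter_path=**.docs.count. Counts are taken from the primary shards.
    """
    result = {}
    for name, body in stats.get("indices", {}).items():
        count = body.get("primaries", {}).get("docs", {}).get("count", 0)
        if count >= min_count:
            result[name] = count
    return result


# Hypothetical sample, shaped like a filtered _stats/docs response.
sample_stats = {
    "_all": {"primaries": {"docs": {"count": 42}}, "total": {"docs": {"count": 84}}},
    "indices": {
        "orders": {"primaries": {"docs": {"count": 42}}, "total": {"docs": {"count": 84}}},
        "scratch": {"primaries": {"docs": {"count": 0}}, "total": {"docs": {"count": 0}}},
    },
}

print(indices_with_docs(sample_stats))
```

Reading from primaries avoids counting replica copies; switch to total if that is the figure you need.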

Unexplainable count results in ElasticSearch

We have an index running with 241.047 items in it. These items can have any number of subitems, which are indexed as nested documents. The total number of subitems is 381.705.
Both include_in_parent and include_in_root are not set in the mapping, which means that each nested document is indexed as an additional document. This should mean that there is a total of 241.047 + 381.705 = 622.752 documents in the index.
When I run the following curl command to look up the number of documents in the index, I get a different number. It's not far off, but I'm wondering why it isn't returning the number I expect.
curl -XGET 'http://localhost:9200/catawiki_development/_status?pretty' returns 622.861
In addition, when I run a curl command to get the number of root documents, I get a different number than when I run a match_all query and ask for the number of documents returned.
curl -XGET 'http://localhost:9200/elasticsearch_development/_count?pretty' returns 241.156
The match_all query returns the correct number of documents, 241.047
How can these differences be explained?
The execution path of a count API request is quite different from that of a normal search request. It is in fact a shortcut that returns only the number of documents matching a query, and that's it. It also differs from a search with search_type=count, which is effectively only the first part of a search: the request is broadcast to all shards, but there is no reduce/fetch, since we only want to return the total number of matching documents. You can also add facets etc. to a search request (including one that uses search_type=count), which is something you cannot do with the count API.
That said, given the above, I'm not that surprised that you see a difference. It would be nice to understand exactly what the problem is, though. The best approach would be to reproduce the problem with a small number of documents and open an issue including a curl recreation so that we can have a look at it.
In the meantime, I would suggest using a search request with search_type=count if you have problems with the count API. It is guaranteed to return the same number of documents as a normal search, because it uses exactly the same logic.
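As a quick check of the numbers reported in the question, the arithmetic can be worked through in Python. Both discrepancies come out to the same value, which may hint that the same extra documents are being counted in both places. That is an observation from the figures, not a diagnosis.

```python
# Figures taken from the question (thousands separators removed).
items = 241_047          # root documents, as returned by match_all
subitems = 381_705       # nested documents
status_total = 622_861   # reported by _status
count_api = 241_156      # reported by _count

expected_total = items + subitems
status_surplus = status_total - expected_total
count_surplus = count_api - items

print(expected_total)   # 622752
print(status_surplus)   # 109
print(count_surplus)    # 109
```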

Resources