ElasticSearch Kibana match_all - elasticsearch

I have the following problem: I wrote a Python program that indexes a lot of domains (about 8,000 per hour), and I now have roughly 16,000 domains. In the Kibana Discover window I can see my data, but when I run a "match_all" query in Dev Tools I only see 10 domains. Where is the problem?
I need to show all the data in one query.
This is my current query:
GET /project/_search
{
  "query": { "match_all": {} }
}
Thanks in advance!

You get 10 results because that is the default size for a search request; you can see that information in the documentation.
As stated there, you can pass the size parameter with a larger value to see more results, but you will be limited by index.max_result_window, which is 10,000 by default.
What is the purpose of retrieving all information in one go?
The Python modules available for interacting with Elasticsearch let you retrieve all the information easily; see the documentation for the elasticsearch.helpers.scan function.

Related

How to design a system for Search query and Csv/Pdf export for 500GB data/day?

Problem Statement
One device is sending 500GB of text data (logs) per day to my central server.
I want to design a system using which user can:
Apply exact-match filters and go through data using pagination
Export PDF/CSV reports for same query as above
Data can be stored for a maximum of 6 months. It's an on-premises solution. Some delay on queries is acceptable. If we can compress the data, that would be awesome. I have 512GB of RAM, an 80-core system, and TBs of storage (all of these are upgradable).
What I have tried/found out:
The tech stack I am planning to use: the MEAN stack for application development, and the ELK stack for the core data part. A single Elasticsearch index has an ideal size recommendation of under 40-50GB.
So my plan is to create 100 indices per day, each of 5GB, for each device. During a query I can sort these indices by name (e.g. 12_dec_2012_part_1 ...) and search each index linearly, continuing until I cover the range the user asked for. (I think this will work for ad-hoc user requests, but for reports, writing to a CSV file by going through the indices sequentially one by one will take a long time.) For reports, I think the best I can do is create a PDF/CSV per index (5GB each), because most file viewers cannot open very large CSV/PDF files.
I am new to big-data problems and not sure which approach is right for this: ELK or the Hadoop ecosystem. (I would like to go with ELK.)
Can someone point me in the right direction, or has someone dealt with this type of problem statement? Unconventional solutions to these problems are also welcome.
Thanks!
exact-match filters
You can use a term query or a match_phrase query.
Returns documents that contain an exact term in a provided field.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
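For example, a term query against a hypothetical status field (the index and field names are just illustrations):

```
GET /my-index/_search
{
  "query": {
    "term": {
      "status": "published"
    }
  }
}
```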
pagination
You can use the from and size parameters for pagination:
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
Export PDF/CSV
You can use Kibana
Kibana provides you with several options to share Discover saved
searches, dashboards, Visualize Library visualizations, and Canvas
workpads.
https://www.elastic.co/guide/en/kibana/current/reporting-getting-started.html
Data can be stored for max 6 months
You can use an ILM policy.
You can configure index lifecycle management (ILM) policies to automatically manage indices according to your performance, resiliency, and retention requirements.
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
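A sketch of such a policy (the policy name is a placeholder) that deletes indices once they are 180 days old, matching the 6-month retention requirement:

```
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "180d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```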
optimal shard size
For log indices you can use data stream indices.
A data stream lets you store append-only time series data across
multiple indices while giving you a single named resource for
requests. Data streams are well-suited for logs, events, metrics, and
other continuously generated data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html
When you use data stream indices you don't need to think about shard size; they will roll over automatically. :)
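A minimal index template that routes matching indices into a data stream on recent Elasticsearch versions (the template name and index pattern are placeholders):

```
PUT _index_template/device-logs-template
{
  "index_patterns": ["device-logs*"],
  "data_stream": {}
}
```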
For compression, you should update the index settings:
index.codec: best_compression
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
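Note that index.codec is a static setting, so it has to be applied when the index is created (or via an index template), not on an open index. A sketch, with a placeholder index name:

```
PUT /my-index
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```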

Does ElasticSearch Keep Count The Number Of Times A Record Is Returned In A Given Period Of Time?

I have an ElasticSearch instance and it does one type of search: it takes a few parameters and returns the companies in its index that match the given parameters.
I'd like to be able to pull some stats that essentially say "This company has been returned by search queries X number of times in the past week."
Does ElasticSearch store metadata that would allow me to pull this kind of info from it? If this kind of data isn't stored in ES out of the box, is there a way to enable it?
Elasticsearch (not ElasticSearch ;) ) does not do this natively, no. You can build something using the slow log, where you set the threshold to 0 so that it logs every query, but that logs everything, which may be too noisy to be useful.
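For example, lowering the query slow log warn threshold to zero so every query is logged (the index name is a placeholder):

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "0ms"
}
```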
Things like https://www.elastic.co/enterprise-search, built on top of Elasticsearch, do provide this sort of insight.

How does Elasticsearch6.8 cache the same request on the second time?

I am using Elasticsearch 6.8 and I'd like to know: if I send the same request multiple times, will ES perform any optimized operation for it? If so, is there any document explaining how this works?
Adding more details to the answer given by fmdaboville.
The caches mentioned there are provided out of the box; below are some options if you want to enable, disable, or fine-tune more caching.
Enabling/fine tuning more cache options
Query type: if you use filters in a search query, they are cached by default by Elasticsearch and do not contribute to the score, since a filter only includes or excludes documents. More info in the official doc:
In a filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.:
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Using the refresh interval: the official doc has a lot more info, but in short it is a good option if you are OK with slightly stale data and are ready to trade freshness for performance. A refresh is what makes new index changes visible to search.
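For example, relaxing the refresh interval (a dynamic setting; the index name and the 30-second value are illustrative):

```
PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}
```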
Disable the cache on a particular request
By default, heavy searches are cached at the shard level as explained in the official doc, but for specific requests you can enable or disable this behavior in your API call.
Simply add the query parameter below to your search request; the linked doc describes several other related settings.
request_cache=true/false
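For example, opting a single request out of the shard request cache (the index name is a placeholder):

```
GET /my-index/_search?request_cache=false
{
  "query": { "match_all": {} }
}
```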
Here is the documentation explaining how optimized search works: Tune for search speed. The part about caches seems to be what you are looking for:
There are multiple caches that can help with search performance, such as the filesystem cache, the request cache or the query cache. Yet all these caches are maintained at the node level, meaning that if you run the same request twice in a row, have 1 replica or more and use round-robin, the default routing algorithm, then those two requests will go to different shard copies, preventing node-level caches from helping.

retrieve sorted search results from elasticsearch

I am facing a problem with Elasticsearch. I am using Elasticsearch 5.6.
When I search an index on some fields, I get more than 40,000 results.
I found 2 problems:
When trying to access page 1001 (results from 10,001) I get an error. I understand I can increase the default limit of 10,000; however, I can accept this limitation and expose only the first 10,000 results to the user.
When I try to sort by a specific field, the sort does not work. This is a huge problem for me, as this search is used by a client UI and I must enable paging through the results. I read about the scroll API, but it does not fit my requirements (user requests from a UI).
Do you have any idea how to solve this problem?
Thank you.

Query regarding Statsd and Collectd

I have a query regarding the usage of statsd and collectd.
Wherever I look on the internet, I only find examples where statsd/collectd is used to collect metric information about an application or system.
My question is: can statsd/collectd be used to collect statistical information on other datasets that are not system-performance-related data, e.g. in e-commerce?
Can we use it to get the top 10 or top 15 users/URLs hitting the website in a time-series analysis (say, the last 15 minutes or the last 15 days)?
Any relevant links or documents in this regard are most welcome.
Also, I wanted to know whether we can store this data in Elasticsearch as well. Any documents on this are also most welcome.
Thanks
