Time out on querying large elastic search index - elasticsearch

I queried a large index with a very large size, since I want to retrieve every matching document, but after a long time the request timed out and no results were returned. Is there another way to get all the data without timing out?
My query:
{
  "size": 90000000,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "isbn": 475869 } }
    }
  }
}

You should use scrolling if you need to retrieve a large amount of data.
First, initiate the scroll with your query:
curl -XGET 'localhost:9200/your_index/your_type/_search?scroll=1m' -d '{
"size": 5000,
"query": {
"term" : {
"isbn" : "475869"
}
}
}'
The response will contain the first 5000 documents as well as a _scroll_id token, which you can use to perform the subsequent requests.
You then repeatedly call the scroll endpoint with the scroll_id from the previous response to get the next batch of 5000 documents, until no results come back anymore.
curl -XGET 'localhost:9200/_search/scroll' -d '{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}'
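If you want to script the whole loop from the shell, here is a minimal sketch (assuming jq is installed; the index, type and field names are the same placeholders as above):
# Initial search: keep the scroll context alive for 1 minute
response=$(curl -s -XGET 'localhost:9200/your_index/your_type/_search?scroll=1m' -d '{
  "size": 5000,
  "query": { "term": { "isbn": "475869" } }
}')
scroll_id=$(echo "$response" | jq -r '._scroll_id')
hits=$(echo "$response" | jq '.hits.hits | length')

# Keep scrolling until a batch comes back empty
while [ "$hits" -gt 0 ]; do
  # ... process the documents in $response here ...
  response=$(curl -s -XGET 'localhost:9200/_search/scroll' -d '{
    "scroll": "1m",
    "scroll_id": "'"$scroll_id"'"
  }')
  scroll_id=$(echo "$response" | jq -r '._scroll_id')
  hits=$(echo "$response" | jq '.hits.hits | length')
done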
Since you're using Jest, there's a SearchScroll class you can use; see its test cases for examples of how that class is used.

Related

Elasticsearch reindex API - Not able to reindex large number of documents

I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.
curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "*",
"size": 10000,
"query": {
"match_all": {}
}
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
This copies only the last 10,000 documents (one batch), and the request completes after that. However, I need to reindex more than a million documents. Is there a way to make the request run for all the matched documents? Can we set the number of batches in the request, or make the request issue batches until all documents are indexed?
One option I can think of is to send the request repeatedly, modifying the query on datetime each time. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?
Remove the query and size params in order to get all the data. If you need a query to filter only the desired documents, just remove the size to fetch all matching logs.
Using wait_for_completion=false as query param will return the task id and you will be able to monitor the reindex progress using GET /_tasks/<task_id>.
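For example (the <task_id> placeholder is whatever task id the reindex call returns):
curl -X GET "new_host:9200/_tasks/<task_id>?pretty"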
If you need or want to break the reindexing into several steps/chunks, consider using the slice feature.
BTW: Reindex one index after another instead of all at once using *, and consider using daily/monthly indices, as that makes it easier to resume the process on errors and to manage log retention compared to one single large index.
In order to improve the speed, you should reduce the replicas to 0 and set refresh_interval=-1 on the destination index before reindexing, and reset the values afterwards (a sketch of this follows the reindex command below).
curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "index_name"
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
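A minimal sketch of the settings tweak mentioned above (the index name logstash matches this example; restore whatever values your destination index actually had afterwards):
# Before reindexing: drop replicas and disable refresh on the destination index
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }'

# ... run the reindex shown above ...

# After reindexing: restore the settings (e.g. 1 replica, default 1s refresh)
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 1, "refresh_interval": "1s" } }'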
UPDATE based on comments:
While reindexing, there is at least one error that causes the reindexing to stop. The error is caused by at least one document (id=xiB9...) having 'OK' as the value of the field 'fields.StatusCode', while the mapping in the destination index has long as the data type, which causes the mentioned exception.
One solution is to change the source documents' StatusCode to 200, for example, but there will probably be more documents causing the very same error.
Another solution is to change the mapping in the destination index to the keyword type - that requires setting the mapping by hand before any data has been inserted, and possibly reindexing the data that is already present.
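A rough sketch of that second option, assuming the destination index logstash does not exist yet and that fields.StatusCode is the only field needing a handmade mapping:
curl -X PUT "new_host:9200/logstash" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "fields": {
        "properties": {
          "StatusCode": { "type": "keyword" }
        }
      }
    }
  }
}'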

Cannot get only number of hits in elastic search

I'm using the _msearch API to send multiple queries to Elasticsearch.
I only need to know how many hits each query generates.
From what I understood, you can set the size parameter to "0" in order to get only the count. However, I still get results with all the found documents. Here is my query:
{"index":"myindex","type":"things","from":0,,"size":0}
{"query":{"bool":{"must":[{"match_all":{}}],"must_not":[],{"match":
{"firstSearch":true}}]}}}, "size" : 0}
{"index":"myindex","type":"things","from":0,,"size":0}
{"query":{"bool":{"must":[{"match_all":{}}],"must_not":[],{"match":
{"secondSearch":true}}]}}}, "size" : 0}
I'm using curl to get the results, this way:
curl -H "Content-Type: application/x-ndjson" -XGET localhost:9200/_msearch?pretty=1 --data-binary "@requests"; echo
Setting "size" to zero tells Elasticsearch not to return the matching documents themselves; the number of hits is still reported in hits.total of each response.
You can additionally let Elasticsearch know that you do not need the document bodies by sending "_source" as false.
Example:
{
  "query": { "match_all": {} },
  "size": 0,
  "_source": false
}
You can use
GET /indexname/type/_count
{
  "query": { "match_all": {} }
}
Please see the documentation for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
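Putting the pieces together for _msearch: if you only care about the numbers, put "size": 0 in each request body of the requests file and read the totals straight out of the response, e.g. with jq (a sketch; note that hits.total is a plain number on older versions like the one implied by the type-based URLs here, and an object with a value field on 7.x):
curl -s -H "Content-Type: application/x-ndjson" -XGET 'localhost:9200/_msearch' \
     --data-binary "@requests" | jq '.responses[].hits.total'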

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have the same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need the Elasticsearch equivalent of the following -
SELECT * FROM session s1, session s2
where s1.device == s2.device
What you are trying to achieve is simply grouping docs on a field, which your SQL does via a self-join.
A similar notion of grouping can be achieved with the terms aggregation in Elasticsearch, although this aggregation returns only group-level metrics like count, sum, etc. It does not return the individual records.
However, there is another aggregation that can be applied as a sub-aggregation to the terms aggregation: the top_hits aggregation.
The top_hits aggregator can effectively be used to group result sets
by certain fields via a bucket aggregator. One or more bucket
aggregators determines by which properties a result set get sliced
into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
"query": {
"match_all": {}
},
"aggs": {
"top-mobiles": {
"terms": {
"field": "device"
},
"aggs": {
"top_device_hits": {
"top_hits": {}
}
}
}
}
}
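Building on the options listed above, here is a sketch of the same query that also skips the outer hits and trims what each bucket returns (the size and _source values are just illustrative; only fields from the indexed sample document are used):
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "top-mobiles": {
      "terms": {
        "field": "device"
      },
      "aggs": {
        "top_device_hits": {
          "top_hits": {
            "size": 10,
            "_source": [ "useragent", "browser", "service-code" ]
          }
        }
      }
    }
  }
}
Each bucket in the response then carries up to 10 matching documents with only those three fields in _source.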

ElasticSearch Date Field Mapping Malformation

In my ElasticHQ mapping:
#timestamp date yyyy-MM-dd HH:mm:ssZZZ
...
date date yyyy-MM-dd HH:mm:ssZZZ
In the above I have two date fields, each mapped to the same format.
In the data:
"#timestamp": "2014-05-21 23:22:47UTC"
....
"date": "2014-05-22 05:08:09-0400",
As shown above, the date format in the data does not match what ES thinks my dates are formatted as. I assume something hinky happened at index time (I wasn't around).
Also interesting: When using a filtered range query like the following, I get a Parsing Exception explaining that my date is too short:
GET _search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"date": {
"from": "2013-11-23 07:00:29",
"to": "2015-11-23 07:00:29",
"time_zone": "+04:00"
}
}
}
}
}
}
Searching with the following, however, passes ES's error check but returns no results, I assume because of the date formatting in the documents.
GET _search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"date": {
"from": "2013-11-23 07:00:29UTC",
"to": "2015-11-23 07:00:29UTC",
"time_zone": "+04:00"
}
}
}
}
}
}
My question is this: given the above, is there any way we can avoid having to reindex and change the mapping, and continue to search the malformed data? We have around 1TB of data in this particular cluster and would like to keep it as is, for obvious reasons.
I also attempted a query that adheres to what is in the data:
"query": {
"range": {
"date": {
"gte": "2014-05-22 05:08:09-0400",
"to": "2015-05-22 05:08:09-0400"
}
}
}
The dates you have in your documents actually do conform to the date format you have in your mapping, i.e. yyyy-MM-dd HH:mm:ssZZZ
In date format patterns, ZZZ stands for an RFC 822 time zone (e.g. -04:00, +04:00, EST, UTC, GMT, ...), so the dates you have in your data do comply; otherwise they wouldn't have been indexed in the first place.
However, the best practice is to always make sure dates are transformed to UTC (or any other time zone common to the whole document base that makes sense in your context) before indexing them so that you have a common basis to query on.
As for your query that triggers errors, 2013-11-23 07:00:29 doesn't comply with the date format since the time zone is missing at the end. As you've rightly discovered, adding UTC at the end fixes the query parsing problem (i.e. the missing ZZZ part), but you might still get no results.
Now to answer your question, you have two main tasks to do:
Fix your indexing process/component to make sure all the dates are in a common timezone (usually UTC)
Fix your existing data to transform the dates in your indexed documents into the same timezone
1TB is a lot of data to reindex for fixing one or two fields. I don't know what your documents look like, but it doesn't really matter. The way I would approach the problem would be to run a partial update on all documents, and for this, I see two different solutions, in both of which the idea is to just fix the #timestamp and date fields:
Depending on your version of ES, you can use the update-by-query plugin, but transforming a date via script is a bit cumbersome.
Or you can write an ad hoc client that scrolls over all your existing documents, partially updates each of them, and sends them back in bulk.
Given the amount of data you have, solution 2 seems more appropriate.
So... your ad hoc script should first issue a scroll query to obtain a scroll id, like this:
curl -XGET 'server:9200/your_index/_search?search_type=scan&scroll=1m' -d '{
"query": { "match_all": {}},
"size": 1000
}'
As a result, you'll get a scroll id that you can now use to iterate over all your data with
curl -XGET 'server:9200/_search/scroll?_source=date,#timestamp&scroll=1m' -d 'your_scroll_id'
You'll get 1000 hits (you can decrease or increase the size parameter in the first query above depending on your mileage) that you can now iterate over.
For each hit, you'll only have the two date fields that you need to fix. You can then transform the dates into the standard timezone of your choosing, for instance along the lines of the sketch below.
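If the ad hoc script is shell-based, GNU date can handle the conversion; a minimal sketch (assuming the offsets in your data are in a form date can parse, which may require minor massaging such as inserting a space before the offset):
# Convert "2014-05-22 05:08:09-0400" to UTC, in the yyyy-MM-dd HH:mm:ssZZZ format used by the mapping
date -u -d "2014-05-22 05:08:09-0400" '+%Y-%m-%d %H:%M:%SUTC'
# => 2014-05-22 09:08:09UTC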
Finally, you can send your 1000 updated partial documents in one bulk like this:
curl -XPOST server:9200/_bulk -d '
{ "update" : {"_id" : "1", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2013-11-23 07:00:29Z", "#timestamp": "2013-11-23 07:00:29Z"} }
{ "update" : {"_id" : "2", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2014-09-12 06:00:29Z", "#timestamp": "2014-09-12 06:00:29Z"} }
...
'
Rinse and repeat with the next iteration...
I hope this should give you some initial pointers to get started. Let us know if you have any questions.

Selecting all the results from a bucket using TopHits aggregation

I am using the top_hits aggregation inside a terms aggregation to fetch records, as shown in the query below.
{
  "aggregations": {
    "group by": {
      "terms": {
        "field": "City"
      },
      "aggregations": {
        "top": {
          "top_hits": {
            "size": 200
          }
        }
      }
    }
  }
}
I want to fetch all the records present in each bucket instead of only the top 200, but as the value of size increases, the query time also increases for the same indexed data (the same number of records).
So I cannot just set size to an arbitrarily large number, as it hampers the query time.
Is there any way to achieve the same efficiently ?
Thanks.
In Elasticsearch, size has limitations: by default it returns 10 documents, and if you want more documents you have to increase the size value.
But consider the example from the documentation:
deep pagination with from and size — e.g. ?size=10&from=10000 — is very inefficient as (in this example) 100,000 sorted results have to be retrieved from each shard and resorted in order to return just 10 results. This process has to be repeated for every page requested.
So in this case you should use the scroll API, because:
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
In your case you should use scan and scroll as below:
curl -s -XGET 'localhost:9200/logs/syslogs/_search?scroll=10m&search_type=scan' -d '{
"aggregations": {
"group by": {
"terms": {
"field": "City"
},
"aggregations": {
"top": {
"top_hits": {
"size": 200
}
}
}
}
}
}'
The above query returns a scroll id; then pass that scroll id as below:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'your_scroll_id'
