Setting Elastic search limit to "unlimited" - ruby

How can i get all the results from elastic search as the results only display limit to 10 only. ihave got a query like:
#data = Athlete.search :load => true do
size 15
query do
boolean do
must { string q, {:fields => ["name", "other_names", "nickname", "short_name"], :phrase_slop => 5} }
unless conditions.blank?
conditions.each do |condition|
must { eval(condition) }
end
end
unless excludes.blank?
excludes.each do |exclude|
must_not { eval(exclude) }
end
end
end
end
sort do
by '_score', "desc"
end
end
i have set the limit to 15 but i wan't to make it unlimited so that i can get all the data
I can't set the limit as my data keeps on changing and i want to get all the data.

You can use the from and size parameters to page through all your data. This could be very slow depending on your data and how much is in the index.
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html

Another approach is to first do a searchType: 'count', then and then do a normal search with size set to results.count.
The advantage here is it avoids depending on a magic number for UPPER_BOUND as suggested in this similar SO question, and avoids the extra overhead of building too large of a priority queue that Shay Banon describes here. It also lets you keep your results sorted, unlike scan.
The biggest disadvantage is that it requires two requests. Depending on your circumstance, this may be acceptable.

From the docs, "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". So my admittedly very ad-hoc solution is to just pass size: 10000 or 10,000 minus from if I use the from argument.
Note that following Matt's comment below, the proper way to do this if you have a larger amount of documents is to use the scroll api. I have used this successfully, but only with the python interface.

use the scan method e.g.
curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
{
"query" : {
"match_all" : {}
}
}
see here

You can use search_after to paginate, and the Point in Time API to avoid having your data change while you paginate. Example with elasticsearch-dsl for Python:
from elasticsearch_dsl.connections import connections
# Set up paginated query with search_after and a fixed point_in_time
elasticsearch = connections.create_connection(hosts=[elastic_host])
pit = elasticsearch.open_point_in_time(index=MY_INDEX, keep_alive="3m")
pit_id = pit["id"]
query_size = 500
search_after = [0]
hits: List[AttrDict[str, Any]] = []
while query_size:
if hits:
search_after = hits[-1].meta.sort
search = (
Search()
.extra(size=query_size)
.extra(pit={"id": pit_id, "keep_alive": "5m"})
.extra(search_after=search_after)
.filter(filter_)
.sort("url.keyword") # Note you need a unique field to sort on or it may never advance
)
response = search.execute()
hits = [hit for hit in response]
pit_id = response.pit_id
query_size = len(hits)
for hit in hits:
# Do work with hits

Related

Discover historical trends in Elasticsearch (not visual)

I have some experience with Elastic as logs storage, but I'm stuck on basic trends recognition (where I need to compare found documents to each other) over time periods.
Easy query would answer following question:
Find all occurrences of document rows (row is specified by growing/continues #timestamp value), where specific field (e.g. threads_count) is growing for fixed count of documents, or time period.
So if I have thread_count of some application, logged every minute over a day including timestamp. And I specify that I'm looking for growing trend in 10 minutes - result should return documents or document sets where thread_count was greater over the one from document minute before at least for 10 documents.
It is very similar task to see line graph, and identify growing parts by eye.
Maybe I just miss proper function name for search. I'm not interested in visualization, I would like to search similar situations over the API and take needed actions.
Any reference to documentation or simple example is welcome!
Well script cannot be used between documents. So you will have to use a payload.
In your query sort the result by date.
https://www.elastic.co/guide/en/elastic-stack-overview/6.3/how-watcher-works.html
A script in the payload could tell you if a field is increasing (something like that, don't have access to a es index right now)
"transform": {
"script": {
"source": "ctx.payload.transform = []; def current_score = -1;
def current = []; for (int j=0;j<ctx.payload.hits.hits;j++){
//check in the loop if current_score increasing using ctx.payload.hits.hits[j]._source.message], if not return "FALSE"
} ; return "TRUE",
"lang": "painless"
}
}
If you use logstash to index your documents, take a look to elapsed, could be nice too: https://www.elastic.co/guide/en/logstash/current/plugins-filters-elapsed.html

How can I find the true score from Elasticsearch query string with a wildcard?

My ElasticSearch 2.x NEST query string search contains a wildcard:
Using NEST in C#:
var results = _client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq.Query("Micro*")))
.From(pageNumber)
.Size(pageSize));
Comes up with something like this:
$ curl -XGET 'http://localhost:9200/_all/_search?q=Micro*'
This code was derived from the ElasticSearch page on using Co-variants. The results are co-variant; they are of mixed type coming from multiple indices. The problem I am having is that all of the hits come back with a score of 1.
This is regardless of type or boosting. Can I boost by type or, alternatively, is there a way to reveal or "explain" the search result so I can order by score?
Multi term queries like wildcard query are given a constant score equal to the boosting by default. You can change this behaviour using .Rewrite().
var results = client.Search<IEntity>(s => s
.Index(Indices.AllIndices)
.AllTypes()
.Query(qs => qs
.QueryString(qsq => qsq
.Query("Micro*")
.Rewrite(RewriteMultiTerm.ScoringBoolean)
)
)
.From(pageNumber)
.Size(pageSize)
);
With RewriteMultiTerm.ScoringBoolean, the rewrite method first translates each term into a should clause in a bool query and keeps the scores as computed by the query.
Note that this can be CPU intensive and there is a default limit of 1024 bool query clauses that can be easily hit for a large document corpus; running your query on the complete StackOverflow data set (questions, answers and users) for example, hits the clause limit for questions. You may want to analyze some text with an analyzer that uses an edgengram token filter.
Wildcard searches will always return a score of 1.
You can boost by a particular type. See this:
How to boost index type in elasticsearch?

Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page thru the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal
to: [10000] but was [10020]. See the scroll api for a more efficient
way to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The elastic docu says that this is because of high memory consumption and to use the scrolling api:
Values higher than can consume significant chunks of heap memory per
search and per shard executing the search. It’s safest to leave this
value as it is an use the scroll api for any deep scrolling https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
The thing is that I do not want to retrieve large data sets. I only want to retrieve a slice from the data set which is very high up in the result set. Also the scrolling docu says:
Scrolling is not intended for real time user requests https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This leaves me with some questions:
1) Would the memory consumption really be lower (any if so why) if I use the scrolling api to scroll up to result 10020 (and disregard everything below 10000) instead of doing a "normal" search request for result 10000-10020?
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
3) Are there any other options to solve my problem?
If you need deep pagination, one possible solution is to increase the value max_result_window. You can use curl to do this from your shell command line:
curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'
I did not notice increased memory usage, for values of ~ 100k.
The right solution would be to use scrolling.
However, if you want to extend the results search returns beyond 10,000 results, you can do it easily with Kibana:
Go to Dev Tools and just post the following to your index (your_index_name), specifing what would be the new max result window
PUT your_index_name/_settings
{
"max_result_window" : 500000
}
If all goes well, you should see the following success response:
{
"acknowledged": true
}
The following pages in the elastic documentation talk about deep paging:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html
Depending on the size of your documents, the number of shards, and the
hardware you are using, paging 10,000 to 50,000 results (1,000 to
5,000 pages) deep should be perfectly doable. But with big-enough from
values, the sorting process can become very heavy indeed, using vast
amounts of CPU, memory, and bandwidth. For this reason, we strongly
advise against deep paging.
Use the Scroll API to get more than 10000 results.
Scroll example in ElasticSearch NEST API
I have used it like this:
private static Customer[] GetCustomers(IElasticClient elasticClient)
{
var customers = new List<Customer>();
var searchResult = elasticClient.Search<Customer>(s => s.Index(IndexAlias.ForCustomers())
.Size(10000).SearchType(SearchType.Scan).Scroll("1m"));
do
{
var result = searchResult;
searchResult = elasticClient.Scroll<Customer>("1m", result.ScrollId);
customers.AddRange(searchResult.Documents);
} while (searchResult.IsValid && searchResult.Documents.Any());
return customers.ToArray();
}
If you want more than 10000 results then in all the data nodes the memory usage will be very high because it has to return more results in each query request. Then if you have more data and more shards then merging those results will be inefficient. Also es cache the filter context, hence again more memory. You have to trial and error how much exactly you are taking. If you are getting many requests in small window you should do multiple query for more than 10k and merge it by urself in the code, which is supposed to take less application memory then if you increase the window size.
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
--> You can define this value in index templates , es template will be applicable for new indexes only ,so you either have to delete old indexes after creating template or wait for new data to be ingested in elasticsearch .
{
"order": 1,
"template": "index_template*",
"settings": {
"index.number_of_replicas": "0",
"index.number_of_shards": "1",
"index.max_result_window": 2147483647
},
In my case it looks like reducing the results via the from & size prefixes to the query will remove the error as we don't need all the results:
GET widgets_development/_search
{
"from" : 0,
"size": 5,
"query": {
"bool": {}
},
"sort": {
"col_one": "asc"
}
}

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in only returning the most relevant docs and came across min_score. They say "Note, most times, this does not make much sense" but doesn't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is only return documents that have a higher than x "score". I have this:
data = {
'min_score': 0.9,
'query': {
'match': {'field': 'michael brown'},
}
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
'query': {
'match': {'keywords': 'michael brown'},
'minimum_should_match': '90%',
}
}
I've used min_score quite a lot for trying to find documents that are a definitive match to a given set of input data - which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example) and decide which document is the first you would rather it didn't return for each, and and make a note of each of their scores. If the scores are similar, this would give you a good guess at the value to use for your min score.
However, you need to bear in mind that score isn't just dependant on the query and the returned document, it considers all the other documents that have data for the fields you are querying. This means that if you test your min_score value with an index of 20 documents, this score will probably change greatly when you try it on a production index with, for example, a few thousands of documents or more. This change could go either way, and is not easily predictable.
I've found for my matching uses of min_score, you need to create quite a complicated query, and set of analysers to tune the scores for various components of your query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
I don't know if it's the best solution, but it works for me (java):
// "tiny" search to discover maxScore
// it is fast, because it returns only 1 item
SearchResponse response = client.prepareSearch(INDEX_NAME)
.setTypes(TYPE_NAME)
.setQuery(queryBuilder)
.setSize(1)
.execute()
.actionGet();
// get the maxScore and
// and set minScore = 70%
float maxScore = response.getHits().maxScore();
float minScore = maxScore * 0.7;
// second round with minimum score
SearchResponse response = client.prepareSearch(INDEX_NAME)
.setTypes(TYPE_NAME)
.setQuery(queryBuilder)
.setMinScore(minScore)
.execute()
.actionGet();
I search twice, but the first time it's fast because it returns only 1 item, then we can get the max_score
NOTE: minimum_should_match work different. If you have 4 queries, and you say minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means that the item should match 70% of the queries, that is minimum 3/4 queries

Extremely slow performance of some MongoDB queries

I have a collection of about 30K item all of which have an element called Program. "Program" is a first part of a compound index, so looking up an item with specific Program value is very fast. It is also fast to run range queries, e.g.:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } },
{ Program: { "$gte" : "X", "$lt" : "Y" } } ] }).count();
The query above does not return any results because I am querying for an overlap of two non-overlaping ranges (K-L) and (X-Y)). The left range (K-L) contains about 7K items.
However if I replace the second "and" clause with "where" expression, the query execution takes ages:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } }, { "$where" : "this.Program == \"Z\"" } ] }).count();
As you can see, the query above should also return an empty result set (range K-L is combined with Program=="Z"). I am aware of slow performance of "where", but should not Mongo first reduce potential result set by evaluating the left clause (that would result in about 7K items) and only then apply "where" check? If it does, should not processing of a few thousand items take seconds and not minutes as it does on my machine with Mongo service consuming about 3GB RAM while peforming this operation? Looks too heavy for relatively small collection.
There are a few things that you can do -
Use explain() to see what is happening on your query. explain() is described here. Use the $explain operator to return a document that describes the process and indexes used to return the query. For example -
db.collection.find(query).explain()
If that doesn't return enough information, you can look at using the Database Profiler. However, please bear in mind that this is not free and adds load itself. Within this page, you can also find some basic notes about optimising the query performance.
However, in your case, it all boils down to the $where operator:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
In general, you should use $where only when you can’t express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a table scan. $where, like Map Reduce, limits your concurrency.
As a FYI: couple of things to note about the output from explain():
ntoreturn Number of objects the client requested for return from a query. For example, findOne(), sets ntoreturn to limit() sets the appropriate limit. Zero indicates no limit.
query Details of the query spec.
nscanned Number of objects scanned in executing the operation.
reslen Query result length in bytes.
nreturned Number of objects returned from query.

Resources