How to turn off sorting in Solr?

I have a number of documents stored in a Solr cluster and want to retrieve a large number of them (about 20 million) matching a particular query. I use the standard approach of reading batches of rows (say, 10,000) and moving to the next batch with the start parameter. However, after about 1,400,000 docs I start getting OutOfMemoryError. I believe this is because of the way Solr sorts docs before sending them to the client. As far as I know, it uses a priority queue to get only the top N results, and thus does not need to load the headers of all documents into memory. However, when I ask it to return results, say, from 1,000,000 to 1,010,000, it has to load the headers for all of the previous 1,000,000 docs too.
I'm looking for a way to avoid this and just get all results satisfying the query without sorting. Is there a way to do that? If not, what is an appropriate way to get a large number of results from Solr?

Your assumptions are correct. When you search for results from 1,000,000 to 1,010,000, Solr instantiates a priority queue of size 1,010,000.
This is really not a natural use-case for Solr which has been designed to return the top-k list of results, rather than an exhaustive list of results.
You could work around this by filtering by ranges of your primary key (q=yourquery&fq=ID:[1 TO 1000]&rows=1000, q=yourquery&fq=ID:[1001 TO 2000]&rows=1000, ...), but this is an ugly hack. :-)
Why do you need to get all results? For example, if you need to compute facets or statistics, Solr has two components that can do that efficiently.
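If you really do need the raw documents, a rough sketch of that range-filter workaround in Python might look like the following (the core URL, the numeric ID field, the upper bound, and the batch size are all hypothetical, and the requests library is assumed):

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # hypothetical core URL
BATCH = 1000           # documents per ID range
MAX_ID = 20_000_000    # hypothetical upper bound of the primary key

def fetch_all(query):
    # Walk the primary key in fixed ranges instead of using start/rows,
    # so Solr never has to build a huge priority queue.
    lower = 1
    while lower <= MAX_ID:
        upper = lower + BATCH - 1
        params = {
            "q": query,
            "fq": "ID:[%d TO %d]" % (lower, upper),
            "rows": BATCH,
            "wt": "json",
        }
        docs = requests.get(SOLR, params=params).json()["response"]["docs"]
        for doc in docs:
            yield doc
        lower = upper + 1

Each request only ever asks for the first rows of its own ID range, so no large offset is involved.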

Related

Elasticsearch Track total hits alternative with approximation

Based on this article (link), there are some serious performance implications to having the track_total_hits property set to true.
We currently use it to get the number of matching documents after a user searches. The user can then use pagination to scroll through the results. The number of documents for such a search usually ranges from 10k to 5M.
Example of a user workflow:
The user performs a search which matches 150,000 documents.
We show them the first 200 results, which they can scroll through, but we also show them the total number of documents found by the search.
Since we always show the number of documents found, and those numbers can often be quite high, we need some way to get that count. I'm not sure, but since we almost always perform paginated searches, I would assume a lot of the data would already be in memory? Maybe this actually affects us less than the provided article suggests?
An approximation rather than an exact count would be OK for us if it improved performance.
Is there an option in Elasticsearch to get an approximate count on search requests?
There is no option to get an approximate count, but you may want to consider setting track_total_hits to a lower bound instead of true, which is a good compromise from a performance standpoint (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits).
That way, you can show users that there are at least k results - but there could be more.
Also, try using search_after (if you are not using it already) for pagination.
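Roughly, combining both suggestions with the requests library (the index name, the sort fields, and the page size are hypothetical; track_total_hits is given an integer so Elasticsearch stops counting at that lower bound, and search_after carries the sort values of the last hit into the next page):

import requests

ES = "http://localhost:9200/myindex/_search"  # hypothetical index name

def paged_search(query):
    search_after = None
    while True:
        body = {
            "query": query,
            "size": 200,
            # Count hits only up to 10,000; beyond that the response reports
            # {"total": {"value": 10000, "relation": "gte"}}, i.e. "at least 10,000".
            "track_total_hits": 10000,
            # search_after needs a deterministic sort; the unique "id" field
            # acts as a tiebreaker so the order is stable across pages.
            "sort": [{"createdAt": "desc"}, {"id": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = requests.post(ES, json=body).json()["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        search_after = hits[-1]["sort"]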

Is there a way to change the Search API facet count to show a total word count instead of the count of matching fragments (documents)?

I'm creating an application using MarkLogic 8 and the Search API. I need to create facets based on MarkLogic-defined collections, but instead of the facet count giving a tally of the number of fragments (documents) which contain some number of occurrences of the keyword searched for, I need the facet count to reflect the total number of times the keyword appears in all documents in the collection.
Right now, I'm using search:search() to process the query and return a response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency(), which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurrences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(
    QName("http://www.tei-c.org/ns/1.0", "TEI"),
    "",
    "item-frequency",
    cts:and-query((
        fn:collection("KirchlicheDogmatik/volume4/part3"),
        cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.
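For the "subset of documents" case only, a rough client-side sketch over MarkLogic's REST API could look like this in Python (host, port, credentials, and the naive regex-based counting over the raw document text are all assumptions; this is not cts:highlight/cts:walk, and running it over a whole database is exactly the exhaustive scan described above):

import re
import requests
from requests.auth import HTTPDigestAuth

BASE = "http://localhost:8000/v1"         # hypothetical REST instance
AUTH = HTTPDigestAuth("admin", "admin")   # hypothetical credentials
COLLECTION = "KirchlicheDogmatik/volume4/part3"

def count_term(term, page_length=100):
    total, start = 0, 1
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    while True:
        # /v1/search returns the URIs of matching documents, one page at a time.
        page = requests.get(BASE + "/search", auth=AUTH, params={
            "q": term,
            "collection": COLLECTION,
            "start": start,
            "pageLength": page_length,
            "format": "json",
        }).json()
        results = page.get("results", [])
        if not results:
            break
        for result in results:
            # Fetch each matching document and count occurrences client-side;
            # every document comes over the wire, which is the expensive part.
            doc = requests.get(BASE + "/documents", auth=AUTH,
                               params={"uri": result["uri"]})
            total += len(pattern.findall(doc.text))
        start += page_length
    return total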

Paging elasticsearch aggregation results

Imagine I have two kinds of records: a bucket and an item, where an item is contained in a bucket, and a bucket may have a relatively small number of items (normally not more than 4, never more than 10). Those records are squashed into one (an item with extra bucket information) and placed inside Elasticsearch.
The task I am trying to solve is to find 500 buckets (at most) with all related items at once, using a filtered query that relies on the items' attributes, and I'm stuck on limiting/offsetting aggregations. How do I perform this kind of task? I see the top_hits aggregation, which allows me to control the number of related items returned, but I can't figure out how to control the number of returned buckets.
Update: okay, I'm terribly stupid. The size parameter of the terms aggregation gives me the limiting. Is there any way to apply an offset? I don't need 100% precision and probably won't ever page those results, but I'd still like to see this functionality.
I don't think we'll be seeing this feature any time soon; see the relevant discussion on GitHub.
Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen like the first term of the 2nd page having a higher count than the last element of the first page, etc.
An interesting approach is mentioned there: request, say, the top 20 terms on the first page, then on the second page run the same aggregation but exclude the 20 terms you already saw on the previous page, and so forth (see the sketch after the quote below). This doesn't give you "random" access to an arbitrary page, though; you have to go through the pages in order.
...if you only have a limited number of unique values compared to the number of matched documents, doing the paging on client-side would be more efficient. On the other hand, on high-cardinality fields, your first approach based on an exclude would probably be better.
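A rough sketch of the exclude-based paging with the requests library (the index name, the bucket_id field, and the page size of 20 are hypothetical; the terms aggregation's exclude parameter takes the exact values already seen, and a top_hits sub-aggregation returns the related items per bucket):

import requests

ES = "http://localhost:9200/items/_search"  # hypothetical index name

def page_of_buckets(query, already_seen):
    body = {
        "size": 0,
        "query": query,
        "aggs": {
            "buckets": {
                "terms": {
                    "field": "bucket_id",
                    "size": 20,
                    # Skip the terms already returned on earlier "pages".
                    "exclude": list(already_seen),
                },
                "aggs": {
                    # Up to 10 related items per bucket.
                    "items": {"top_hits": {"size": 10}},
                },
            }
        },
    }
    resp = requests.post(ES, json=body).json()
    return resp["aggregations"]["buckets"]["buckets"]

# Pages must be walked in order; random access to page N is not possible.
seen = set()
for _ in range(3):
    for bucket in page_of_buckets({"match_all": {}}, seen):
        seen.add(bucket["key"])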

Solr queries on a simple index are getting slower as the start param gets higher

I'm working on a simple index containing one million docs with 30 fields each.
A q=*:* query with a very low start value (0, for instance) takes only a few milliseconds (~1, actually).
The higher the start value is, the slower Solr gets:
start=100000 => 171 ms
start=500000 => 844 ms
start=1000000 => 1274 ms
I'm a bit surprised by this performance degradation, and I'm concerned since the index is supposed to grow to over a hundred million documents within a few months.
Did I do something wrong in the schema? Or is this normal behavior, given that slicing docs beyond the first few hundred should usually not happen? :)
EDIT
Thanks guys for the explanations - I was guessing something like that, but I preferred to be sure that it was not related to the way the schema has been defined. So the question is solved for me.
Every time you make a search query to Solr, it collects all the documents matching the query. It then skips documents until the start value is reached and returns the results from there.
Another point to note: every time you make the same search query but with a higher start value, those documents may also not be present in the cache, so it might refresh the cache as well (depending on the size and type of cache you have configured).
Naive pagination works by retrieving all the documents up to the cut-off point, throwing them away, then fetching enough documents to satisfy the number requested, and returning those.
If you're doing deep paging (going far into the dataset), this becomes expensive, which is why cursorMark support was implemented (see "Fetching A Large Number of Sorted Results: Cursors") to support near-instant pagination into a large set of documents.
Yonik also has a good blog post about deep pagination in Solr.
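For comparison, a minimal cursorMark loop in Python (the core URL, batch size, and uniqueKey field name are hypothetical; the sort must include the uniqueKey, and the requests library is assumed):

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # hypothetical core URL

def deep_page(query, rows=1000):
    cursor = "*"  # the initial cursor
    while True:
        params = {
            "q": query,
            "rows": rows,
            "sort": "id asc",        # must include the uniqueKey field
            "cursorMark": cursor,
            "wt": "json",
        }
        data = requests.get(SOLR, params=params).json()
        for doc in data["response"]["docs"]:
            yield doc
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # cursor did not advance: no more results
            break
        cursor = next_cursor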

index size impact on search speed (to store or not to store)

Right now, we are using Solr as a fulltext index, where all fields of the documents are indexed but not stored.
There are a few million documents, and the index size is 50 GB. Average query time is around 100 ms.
To use features like highlighting, we are thinking about additionally storing the text. But that could double the size of the index files.
I know there is absolutely no (linear) relation between index size and query time. Increasing the number of documents by a factor of 10 results in nearly no difference in query time.
But still, the system (Solr/Lucene/Linux/...) has to handle more information - the index files (for example) use many more inodes, and so on.
So I'm sure there is an impact on query time in relation to the index size. (But: is it noticeable?)
1st:
Do you think I'm right?
Do you have any experience with index size and search speed in relation to with/without stored text?
Is it smart and reasonable to blow up the index by storing the documents?
2nd:
Do you know how Solr/Lucene handles stored text? Maybe in separate files? (So that there is no impact for simple searches where no stored text is needed?)
Thank you.
Yes, it's absolutely true that the index grows if you make big fields stored, but if you want to highlight them, there is no other way. I don't think speed will decrease that much; you might just need to transfer more data when retrieving results, but that's not that significant.
Regarding the Lucene index format and the different files within the index, you can have a look here: the stored fields are stored in a specific file.
