Our Solr index (Solr 3.4) has over 100 million documents.
We frequently fire one type of query on this index to get documents, do some processing, and dump the results into another index.
The query is of the form:
((keyword1 AND keyword2...) OR (keyword3 AND keyword4...) OR ...) AND date:[date1 TO *]
The number of keywords can range from 100 to 1000.
We also add the sort parameter 'date asc'.
The keyword part of the query changes very rarely, but the date part always changes.
Now there are mainly 2 problems:
1) The query takes too much time.
2) Sometimes, when 'numFound' is very large for a query, it gives an OOM error (I guess this is because of the sort).
We are not using any type of caching yet.
Will caching be helpful to solve these problems?
If yes, what type of cache or caching configuration is suitable to start with?
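For illustration, the query shape described above could be assembled as in the sketch below. It places the rarely changing keyword part in a filter query (fq) and the changing date part in q, on the assumption that the filter cache is enabled in solrconfig.xml and can then reuse the keyword result between requests. That split is an assumption to benchmark, not a verified fix, and the function name and example values are hypothetical.

```python
from urllib.parse import urlencode


def build_solr_params(keyword_groups, date_from, start=0, rows=1000):
    """Build request parameters for the query form described above.

    keyword_groups: list of lists; keywords inside a group are ANDed,
    and the groups are ORed together.
    """
    keyword_clause = " OR ".join(
        "(" + " AND ".join(group) + ")" for group in keyword_groups
    )
    return [
        # The date range changes on every request, so it stays in q.
        ("q", "date:[%s TO *]" % date_from),
        # The keyword part rarely changes, so it goes into fq
        # (assumption: this lets the filter cache reuse it).
        ("fq", keyword_clause),
        ("sort", "date asc"),
        # Fetch results in bounded pages rather than all at once.
        ("start", str(start)),
        ("rows", str(rows)),
    ]


params = build_solr_params(
    [["keyword1", "keyword2"], ["keyword3", "keyword4"]],
    "2012-01-01T00:00:00Z",  # hypothetical date1
)
query_string = urlencode(params)
```

The resulting query_string can be appended to the select handler URL of the index.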
I am using Coherence for storing cached data. I am also using Apache Lucene indexing for faster retrieval of records searched by certain attribute values. I am facing search delays, and sometimes timeouts, when searching for records by one specific attribute.
If the same record is searched by another attribute value, it is found and retrieved instantly. E.g. for a record {key=1234, value={a=abc,b=def,c=pqr}}, the Lucene query b=def runs as fast as expected. But when the same record is searched with the Lucene query c=pqr, it either times out in Coherence or takes significant time (more than 100 ms, against the expected 2 to 5 ms). I verified that the Lucene indexes and sort are exactly the same for both the b and c fields. I am not able to figure out the reason for this delay or resolve it.
I tried to debug the code to identify any different paths of execution for searches on different fields. However, I did not find any differences.
Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. Theoretically it seemed possible, since ES has to cache stuff related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result. By default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size parameter and your query matches millions of docs, ES query performance will be very bad, and it might even bring the entire cluster down if you fire such queries frequently (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that gets invalidated very frequently, you will not get the performance benefit you are after, so it is better to benchmark it.
You are already on the right path in using the scroll API to iterate over millions of search results; see the points below to improve further.
First get the count of search results. This is included in the default search response (hits.total, with a relation of eq or gte) and gives you an idea of how many results you have, based on which you can set the size param for subsequent calls to see whether your ID is present.
Check whether you make effective use of filter context in your query; filter results are cached by ES by default.
Benchmark some of your heavy scroll API calls with your data.
Refer to this thread to fine-tune your cluster and index configuration to optimize ES response times further.
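As one way to use filter context for this, and as an alternative to scrolling through every hit, the large query can be combined with an ids query so that ES only returns the candidate IDs that also match. The sketch below only builds the request body and reads the response; the range query on a timestamp field is a hypothetical stand-in for the real large query, and the function names are illustrative.

```python
def build_membership_query(large_query, candidate_ids):
    """Request body that returns only the candidate IDs matching large_query."""
    candidate_ids = list(candidate_ids)
    return {
        # At most len(candidate_ids) documents can match.
        "size": len(candidate_ids),
        # Only the IDs are needed, so skip fetching document bodies.
        "_source": False,
        "query": {
            "bool": {
                # Both clauses run in filter context (no scoring).
                "filter": [
                    large_query,
                    {"ids": {"values": candidate_ids}},
                ]
            }
        },
    }


def matched_ids(response):
    """Extract the set of document IDs from a search response."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


# Hypothetical stand-in for the large (million+ results) query.
large_query = {"range": {"timestamp": {"gte": "now-30d"}}}
body = build_membership_query(large_query, ["a1", "b2", "c3"])
```

The body would be sent to the index's _search endpoint, and matched_ids applied to the parsed JSON response. Whether this performs better than scrolling on a given cluster is something to benchmark.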
I was looking through Elasticsearch and noticed that you can create an index and bulk-add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having everything under one index would be rough to query. Each row has no more than 1-3 properties.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. You can then change the index mappings or the queries you're using to get faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
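The date-based split and the keyword mapping from the points above can be sketched as follows. The helper name, the "index" prefix, and the "status" and "tags" field names are taken from the examples above or are hypothetical, and the mapping body assumes the typeless mapping format of recent Elasticsearch versions.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, end):
    """Return one index name per day from start to end, inclusive."""
    days = (end - start).days
    return [
        "%s-%s" % (prefix, (start + timedelta(days=i)).isoformat())
        for i in range(days + 1)
    ]


names = daily_index_names("index", date(2019, 8, 11), date(2019, 8, 13))
# names == ['index-2019-08-11', 'index-2019-08-12', 'index-2019-08-13']

# A search can target several indices at once with a comma-separated list,
# so a query for recent data only touches the days it needs.
search_target = ",".join(names)

# Mapping body that indexes both fields as exact, unanalyzed strings.
keyword_mapping = {
    "mappings": {
        "properties": {
            "status": {"type": "keyword"},
            "tags": {"type": "keyword"},
        }
    }
}
```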
Good luck!
Is it possible to use Slice via solrTemplate?
Actually, I am struggling to see whether it would even make a difference, because even without using Spring, there doesn't appear to be any way of telling Solr to exclude its "numFound" (total results) from a query.
And when I use a normal Spring Data Page<..> query and look under the hood, I only see one query issued to Solr, i.e. no extra one for the count. Or is the count simply done inside Solr somehow in an extra step?
confused
Total document count is part of the Solr query. No additional query is required. Therefore, there is no advantage to Slice vs. Page.
The only related concept is when somebody wants to export a significant amount of data, in which case built-in paging becomes slower the deeper into the result set you request. For that, Solr has exporting functionality.
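A related way to avoid slow deep paging, in Solr versions that support it, is cursor-based paging with the cursorMark parameter. This is a different feature from the export functionality mentioned above. The sketch below shows only the loop; fetch_page is a hypothetical callable that sends the parameters to Solr and returns the parsed JSON response, and the sort is assumed to include the unique key field (here assumed to be named id).

```python
def iterate_all(fetch_page, query, sort="id asc", rows=500):
    """Yield every matching document using cursorMark paging.

    fetch_page: hypothetical callable taking a dict of request parameters
    and returning the parsed JSON response as a dict.
    """
    cursor = "*"  # "*" asks Solr to start from the beginning
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        # When the cursor no longer advances, all results have been read.
        if next_cursor == cursor:
            break
        cursor = next_cursor


# Stand-in for a real Solr call, used only to show the control flow.
_pages = {"*": (["a", "b"], "c1"), "c1": (["c"], "c2"), "c2": ([], "c2")}


def fake_fetch(params):
    ids, next_cursor = _pages[params["cursorMark"]]
    return {
        "response": {"docs": [{"id": i} for i in ids]},
        "nextCursorMark": next_cursor,
    }


all_ids = [doc["id"] for doc in iterate_all(fake_fetch, "*:*")]
```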
I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each metadata field is individually small (up to 64 characters). Common queries would involve a search term along with multiple metadata fields used to filter the data. So my question is: which would provide better performance with respect to search response time? (Indexing time is not a concern.)
a. Index the text data and also push all metadata fields into Solr as stored fields, then query Solr for all the fields using a single query. (Effectively, Solr does the metadata filtering as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full-text search, and then use the document IDs returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine, it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly. Fetch many IDs from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that list of IDs, then fetch the documents from MySQL or Solr (if you choose to store data in it, not just indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially if indexing time and disk space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
The exception would be if you had a large amount of text to store (and retrieve) in each document. In that case it would be faster to fetch it from the RDB after you get your search results back. Anyway, no one can tell for sure which one would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors
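As a rough sketch of option a), the search term can go into q and each metadata restriction into its own filter query (fq), so Solr does the search and the filtering in one request. The function name and the author and lang fields are hypothetical, and the sketch assumes the filter values contain no double quotes that would need escaping.

```python
from urllib.parse import urlencode


def build_filtered_search(search_term, metadata_filters, rows=10):
    """Build request parameters: full-text term in q, one fq per metadata field."""
    params = [("q", search_term), ("rows", str(rows))]
    # Sorted only to make the parameter order deterministic.
    for field, value in sorted(metadata_filters.items()):
        # Quote the value so it is matched as a single phrase.
        params.append(("fq", '%s:"%s"' % (field, value)))
    return params


search_params = build_filtered_search(
    "solar panels", {"lang": "en", "author": "smith"}
)
search_query_string = urlencode(search_params)
```

Keeping each restriction in a separate fq means each one can be cached and reused independently across different search terms.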