Delay in data search in Coherence with Apache Lucene - caching

I am using Coherence for storing cached data, together with Apache Lucene indexing for faster retrieval of records by attribute value. I am facing a problem of search delays, and sometimes timeouts, when searching for records by one specific attribute.
If the same record is searched by a different attribute value, it is retrieved instantly. For example, given a record {key=1234, value={a=abc, b=def, c=pqr}}, the Lucene query b=def returns quickly, as expected, while the Lucene query c=pqr for the same record either times out in Coherence or takes significant time (more than 100 ms, versus the expected 2 to 5 ms). I verified that the Lucene indexes and sort settings are exactly the same for the b and c fields, so I cannot figure out the reason for this delay or how to resolve it.
I tried debugging the code to identify any difference in execution paths between the two field searches, but found none.
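To rule Coherence in or out, it can help to benchmark the two fields at the raw Lucene level, outside the cache entirely. Below is a minimal, self-contained sketch, assuming Lucene 8.x; the field names and values are the placeholders from the example above, not the actual schema, and in practice you would index a realistic volume of records before comparing timings.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FieldLatencyCheck {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // One document carrying both attributes, mirroring {a=abc, b=def, c=pqr}.
            Document doc = new Document();
            doc.add(new StringField("b", "def", Field.Store.NO));
            doc.add(new StringField("c", "pqr", Field.Store.NO));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Time an identical TermQuery against each field.
            for (String[] fv : new String[][] {{"b", "def"}, {"c", "pqr"}}) {
                long start = System.nanoTime();
                TopDocs hits = searcher.search(new TermQuery(new Term(fv[0], fv[1])), 10);
                System.out.printf("%s=%s -> %d hits in %d us%n",
                        fv[0], fv[1], hits.totalHits.value, (System.nanoTime() - start) / 1_000);
            }
        }
    }
}
```

If both fields perform identically in isolation like this, the delay likely lies in the Coherence integration layer (serialization, filter dispatch, or partition fan-out) rather than in the Lucene index itself.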

Related

Check if document is part of Elasticsearch query?

Curious whether there is some way to check if a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. I'm hoping to do this on the database side. It seemed theoretically possible, since ES has to cache data related to large scrolls.
It's an interesting use case, but note that Elasticsearch (ES) doesn't return all matching document IDs in the search result: by default only 10 documents are returned in the response, which can be changed via the size parameter.
If you increase the size param while your query matches millions of docs, ES query performance will be very poor, and firing such queries frequently can even bring down the entire cluster (in the absence of a circuit breaker), so be cautious.
You are right that ES caches things, but if you cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefit, so benchmark it first.
You are already on the correct path with the scroll API for iterating over millions of search results; the points below can help you improve further.
First get the total count of search results; this is included in the default search response (as an exact count or a "greater than or equal" lower bound), which tells you how many results you have and lets you set the size param for subsequent calls when checking whether your ID is present.
Check whether you are effectively using filter context in your query; filters are cached by ES by default.
Benchmark some of your heavy scroll API calls against your own data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
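If the candidate ID set is small, another option is to invert the check: rather than scrolling through millions of hits, restrict the large query to the candidate IDs with an ids filter, so the hits that come back are exactly the candidates that are part of the larger result set. Below is a minimal sketch, assuming the Java High Level REST Client (7.x); the index name and the term filter standing in for the "larger query" are placeholder assumptions.

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class IdMembershipCheck {
    // Returns the subset of candidateIds that also match the larger query.
    static SearchResponse idsMatchingQuery(RestHighLevelClient client,
                                           String index,
                                           String[] candidateIds) throws Exception {
        BoolQueryBuilder query = QueryBuilders.boolQuery()
            // The "large" query goes in filter context so ES can cache it
            // and skip scoring entirely. This term filter is a placeholder.
            .filter(QueryBuilders.termQuery("status", "published"))
            // Restrict to the candidate IDs; the hits returned are exactly
            // the candidates that are part of the larger result set.
            .filter(QueryBuilders.idsQuery().addIds(candidateIds));

        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(query)
            .size(candidateIds.length)
            .fetchSource(false); // we only need the _id values back

        SearchRequest request = new SearchRequest(index).source(source);
        return client.search(request, RequestOptions.DEFAULT);
    }
}
```

Both clauses sit in filter context, so this avoids scoring and deep pagination entirely, provided the candidate set is small enough to fit in one request.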

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I suspect that keeping it all under 1 index would make it rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and whether their results are relevant enough. If not, change the index mappings or the queries you're using to achieve faster results, and if needed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings (see the mapping sketch below). It disables analysis on the field, which prevents full-text searches inside the field and allows only exact string matches. This is useful for fields like a "tags" field or a "status" field with values such as ["draft", "review", "published"].
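For illustration, here is a hedged sketch of creating one day's index with keyword-mapped fields, assuming the Java High Level REST Client (7.x); the index name, field names, and shard count are placeholder assumptions, not values from the question.

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

public class DailyIndexSetup {
    static void createDailyIndex(RestHighLevelClient client) throws Exception {
        // One index per day (index-2019-08-11, ...) keeps recent data small and fast.
        CreateIndexRequest request = new CreateIndexRequest("index-2019-08-11");
        request.settings(Settings.builder()
            .put("index.number_of_shards", 3)      // placeholder; size to your data
            .put("index.number_of_replicas", 1));
        // "keyword" disables analysis: exact matches only, ideal for tag/status fields.
        request.mapping(
            "{ \"properties\": {"
          + "    \"status\":     { \"type\": \"keyword\" },"
          + "    \"tags\":       { \"type\": \"keyword\" },"
          + "    \"body\":       { \"type\": \"text\" },"
          + "    \"@timestamp\": { \"type\": \"date\" }"
          + "} }",
            XContentType.JSON);
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```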
Good luck!

Issues with ElasticSearch for real-time geo queries

I'm building a service that will allow users to search for other users who are nearby, based on GPS coordinates. I've tried using ElasticSearch's geo spatial indexes. When a user signs in, he submits his GPS location to an ElasticSearch geo index. Other users periodically poll ElasticSearch, querying for new documents that contain GPS coordinates within a few hundred meters.
The problem is that ElasticSearch either doesn't update its index fast enough, or it caches its results, making it unsuitable for retrieving real-time results. I've tried disabling the cache with index.cache.filter.max_size=-1 and passing "_cache=false" with every query. ElasticSearch still returns stale results when polling with the same query, and it can return stale results for up to a few minutes.
Any idea on what could be happening? Maybe it's because I'm keeping the same connection open during polling, and ElasticSearch caches results for each connection? Still, the results can be out of date with subsequent requests.
Newly indexed documents in Elasticsearch don't become immediately available for search. They are accumulated in a buffer and become searchable only after an operation called refresh. In other words, search is not a real-time but a "near real time" operation ("near" because refresh runs every second by default). Note that the get operation is real time: you can get a document immediately after it is indexed.
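If per-write visibility matters more than raw throughput, one option is to make the indexing call wait for the next refresh before returning, so a subsequent poll is guaranteed to see the document. A minimal sketch, assuming the Java High Level REST Client (7.x); the index name and document layout are placeholder assumptions.

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class LocationWriter {
    static void indexLocation(RestHighLevelClient client, String userId,
                              double lat, double lon) throws Exception {
        IndexRequest request = new IndexRequest("user-locations") // placeholder index
            .id(userId)
            .source("{ \"location\": { \"lat\": " + lat + ", \"lon\": " + lon + " } }",
                    XContentType.JSON)
            // Block until the next refresh makes this document searchable.
            // Gentler than forcing an immediate refresh, but still adds
            // latency, so use sparingly on write-heavy workloads.
            .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
        client.index(request, RequestOptions.DEFAULT);
    }
}
```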
While you can force a refresh after each document, or run refresh more frequently, that is not the best solution for your problem, because very frequent refreshing can significantly reduce search and indexing performance. Instead, I would advise you to look at Elasticsearch percolators, which were added exactly for use cases such as yours.

Solr caching & sorting

Our Solr index (Solr 3.4) has over 100 million documents.
We frequently fire one type of query on this index to fetch documents, do some processing, and dump them into another index.
Query is of the form -
((keyword1 AND keyword2...) OR (keyword3 AND keyword4...) OR ...) AND date:[date1 TO *]
The number of keywords can be in the range of 100 to 1000.
We are adding sort parameter 'date asc'.
The keyword part of the query changes very rarely but date part always changes.
Now there are mainly two problems:
1) The query takes too much time.
2) Sometimes, when numFound is very large for a query, it gives an OOM error (I guess this is because of the sort).
We are not using any type of caching yet.
Will caching be helpful to solve these problems?
If yes, what type of cache or caching configuration is suitable to start with?
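Caching can help here precisely because the keyword part rarely changes: split it into its own filter query (fq) so Solr can reuse its cached DocSet from the filterCache across requests, and keep the ever-changing date clause out of the cache. Below is a minimal sketch, written against modern SolrJ for brevity; the URL, core name, keywords, and date are placeholder assumptions (the {!cache=false} local param became available in Solr 3.4).

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CachedKeywordQuery {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycore").build(); // placeholder URL/core

        SolrQuery q = new SolrQuery();
        q.setQuery("*:*");
        // The rarely-changing keyword clause goes in its own fq, so its
        // DocSet is computed once and then reused from the filterCache.
        q.addFilterQuery("(keyword1 AND keyword2) OR (keyword3 AND keyword4)");
        // The always-changing date clause is a separate fq; cache=false
        // keeps it from churning the filterCache on every request.
        q.addFilterQuery("{!cache=false}date:[2012-01-01T00:00:00Z TO *]");
        q.setSort("date", SolrQuery.ORDER.asc);
        q.setRows(100);

        QueryResponse rsp = solr.query(q);
        System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
}
```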

Solr performance with multiple fields

I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each of the metadata fields is individually small (up to 64 characters). Common queries involve a search term along with multiple metadata fields used to filter the data. So my question is: which would provide better performance with respect to search response time (indexing time is not a concern)?
a. Index the text data and also push all metadata fields into Solr as stored fields, and query Solr for all the fields using a single query. (Effectively Solr does the metadata filtering as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full text, and then use the document IDs returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine; it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly. Fetch many IDs from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that ID list, then fetch the documents from MySQL or Solr (if you choose to store data in it, not just indexes): I can't imagine a case where this would be faster.
Why complicate things? Especially since indexing time and disk space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
The exception would be if you had a large amount of text to store (and retrieve) in each document. In that case it would be faster to fetch it from the RDB after you get your search results back. Anyway, no one can tell for sure which approach would be faster in your case, so I suggest you test the performance of both (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
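To make option a) concrete, here is a hedged SolrJ sketch of a single query combining the full-text term with metadata filter queries, each of which Solr caches independently in its filterCache; the URL, core, field names, and values are placeholder assumptions.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class MetadataFilteredSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/documents").build(); // placeholder URL/core

        // The full-text part of the query.
        SolrQuery q = new SolrQuery("content:\"solar panels\"");
        // Each metadata constraint is a separate fq, so each is cached
        // and reused independently across queries.
        q.addFilterQuery("status:published", "lang:en", "author:smith");
        q.setFields("id", "title"); // fetch only the stored fields you need
        q.setRows(20);

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("title"));
        }
    }
}
```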
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors
