What caching strategy for search queries - spring

We are developing a search engine web application that will enable users to search the content of about 200 portals.
Our business partner is taking care of maintaining and feeding a solr/lucene instance that is doing the workhorse job of indexing the data.
Our application queries solr and presents the results in a human-friendly way. However, we are wondering how we could limit the number of queries, perhaps using some form of caching. The results could be cached for a few hours.
What we are wondering is: what could be a good strategy for caching the query results? Obviously we expect the method invocations to vary a lot... Does it make sense at all to do caching?
Is there some caching system that is particularly suitable in this use case? We are using Spring 3 for the development.

I would keep in mind that Solr already has a lot of caching built into it in order to speed up common queries. I'd advise you to look into the inherent capabilities in Solr/Lucene before you go off and reinvent the wheel with your own query cache.
Here is a good place to start.

The simplest solution is to reform your query before it hits Solr.
I created my own QueryBuilder method, which I pass through my query string before hitting Solr.
All this does is explode all of the arguments and then sort them into a predefined group set.
For example, in order to normalize your queries so that they can be cacheable, you can sort alphabetically on each key, then reform the query string, and then use this to query Solr. (The actual query result will be unchanged).
Before you actually run the query, you could then create a hash of the Solr query string and check it against an in-memory set of the keys that have already been cached. If you find yourself approaching millions of query keys, which might be quite likely, you might want to start looking at using a BloomFilter to reduce the keyspace while still maintaining some degree of accuracy on cache hits.
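As a rough illustration of that idea (the class and method names here are just for this example, not from any framework), sorting the parameters alphabetically and hashing the normalized string could look like this in Java:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class QueryNormalizer {

    // Sort the query parameters alphabetically so that logically identical
    // queries (e.g. "q=foo&rows=10" vs "rows=10&q=foo") produce the same key.
    public static String normalizeQuery(String rawQuery) {
        String[] params = rawQuery.split("&");
        Arrays.sort(params);
        return String.join("&", params);
    }

    // Hash the normalized query string to get a compact cache key.
    public static String cacheKey(String rawQuery) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(normalizeQuery(rawQuery).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

The normalized string is what you send to Solr, and the hash is what you look up in your cache (or BloomFilter) before doing so.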
Alternatively, you might want to look at putting a reverse proxy cache between you and Solr. For example, if your requests flowed Spring -> Varnish -> Solr, Varnish could cache the responses, using the query string as the hash. You would then be able to set a 2-hour Expires header so that results are automatically flushed/cleared/invalidated.
Hopefully this helps.

I have found that caching the results or the rendered content outside Lucene works best: have an API search service that fronts a caching tier holding the results from the Lucene index.
If you separate the caching tier out, you can then plug in whatever caching you want... distributed caching (Redis, Azure AppFabric, other cloud caching, etc.). You can also cache partial renderings of the web page (e.g. output caching in ASP.NET) or cache the API calls themselves using RESTful conventions. Things like cache warming or proactive caching (based on usage) are then easy to do with services.
Your application/index cache can then be "re-used" across more tiers of your app instead of caching only at the index level. This all depends on whether your indexing updates are real-time, whether the queries are data-level secured for each client/user id, etc. As mentioned above, Solr already does "some" of this stuff for you.
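To sketch what such a caching tier could look like with Spring's own cache abstraction (available from Spring 3.1; the SolrGateway interface and cache name below are illustrative, and the backing provider, e.g. Ehcache with a few hours of TTL, is configured separately):

import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CachedSearchService {

    // Illustrative abstraction over the actual Solr client call.
    public interface SolrGateway {
        List<String> query(String normalizedQuery);
    }

    private final SolrGateway solr;

    @Autowired
    public CachedSearchService(SolrGateway solr) {
        this.solr = solr;
    }

    // The normalized query string is used as the cache key by default; the
    // backing cache provider decides eviction/expiry (e.g. a few hours).
    @Cacheable("searchResults")
    public List<String> search(String normalizedQuery) {
        return solr.query(normalizedQuery);
    }
}

Only the Solr call is wrapped here; the same pattern works for caching rendered fragments or whole API responses at a different tier.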

Related

How does Elasticsearch 6.8 cache the same request the second time?

I am using Elasticsearch 6.8 and I'd like to know: if I send the same request multiple times, will ES do any optimised operation on that? If yes, is there any document explaining how this works?
Adding more details to the answer given by #fmdaboville.
The caches described in that answer are provided out of the box; below are some options if you want to enable, disable, or tune caching further.
Enabling/fine tuning more cache options
Query type: if you are using filters in a search query, those are cached by default by Elasticsearch and don't contribute to the score, as they are only meant to filter out data. More info in this official doc:
In a filter context, a query clause answers the question "Does this document match this query clause?" The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g. Does this timestamp fall into the range 2015 to 2016? Is the status field set to "published"? Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
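As a small illustration (assuming the Elasticsearch Java high-level REST client; the field names are made up for this example), a query whose clauses run in the cacheable filter context can be built like this:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class FilterContextExample {
    public static void main(String[] args) {
        // Clauses passed to filter() run in filter context: they don't affect
        // scoring, and frequently used filters are cached automatically.
        BoolQueryBuilder query = QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("title", "caching strategies"))
                .filter(QueryBuilders.termQuery("status", "published"))
                .filter(QueryBuilders.rangeQuery("timestamp").gte("2015-01-01").lte("2016-12-31"));

        System.out.println(query); // prints the JSON that would be sent to ES
    }
}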
Using refresh interval: this official doc has a lot more info, but in short, increasing it is a good option if you are OK with getting slightly stale data and are ready to trade freshness for performance. The refresh interval controls when new index changes become visible to search.
Disabling the cache on a particular request
By default, heavy searches are cached at the shard level as explained in this official doc, but if you want to enable or disable this behavior for certain requests, you can override it in the API call.
Simply add the query param below to your search request; the linked API doc describes several other related settings.
request_cache=true/false
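For illustration (assuming the Java high-level REST client; the index and field names are made up), toggling the shard request cache for a single request could look like this:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RequestCacheToggle {
    public static void main(String[] args) {
        // Equivalent to appending ?request_cache=false to the _search call.
        SearchRequest request = new SearchRequest("products");
        request.source(new SearchSourceBuilder()
                .query(QueryBuilders.termQuery("status", "published"))
                .size(0)); // size:0 searches are the ones cached by default
        request.requestCache(false); // opt this request out of the shard request cache

        System.out.println(request);
    }
}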
Here is the documentation explaining how to optimize search: Tune for search speed. The part about caches seems to be what you are looking for:
There are multiple caches that can help with search performance, such as the filesystem cache, the request cache or the query cache. Yet all these caches are maintained at the node level, meaning that if you run the same request twice in a row, have 1 replica or more and use round-robin, the default routing algorithm, then those two requests will go to different shard copies, preventing node-level caches from helping.

Elasticsearch Best Practices Flow

I am using Elasticsearch for product filtering. We have complex product availability logic. I can see two options:
Use Elasticsearch to store only product-specific data and keep the availability logic in the web server: we first filter data from Elasticsearch, then check the availability conditions on that result set.
Or we can flatten the data and store it in Elasticsearch, though in that case there will be duplicated data.
My concern is whether it is good practice to call the Elasticsearch endpoint from the browser, as it has no auth system by default and every query and response will be visible in the network log. I believe the call should be made from the web server to Elasticsearch, with the front end talking only to the web server, unaware of Elasticsearch's existence.
Any best-practice insight would be helpful.
Simply create an authenticated endpoint in your backend and send the queries to that endpoint. Do make sure there are some enforced limits (a minimal sketch follows the list), such as:
size -- you don't want to let anybody download your whole index, and
aggregation depth -- you don't want anyone to perform summaries on your whole index/indices to get a competitive advantage.
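A minimal sketch of such an endpoint, assuming Spring Boot; the service interface, path, and limit value are illustrative only, and authentication (e.g. Spring Security) is left out:

import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ProductSearchController {

    private static final int MAX_PAGE_SIZE = 50; // hard cap on "size"

    // Illustrative abstraction over the actual Elasticsearch query.
    public interface ProductSearchService {
        List<String> search(String query, int size);
    }

    private final ProductSearchService searchService;

    public ProductSearchController(ProductSearchService searchService) {
        this.searchService = searchService;
    }

    // The browser only ever talks to this endpoint; Elasticsearch itself
    // stays on the private network and never sees client traffic directly.
    @GetMapping("/api/products/search")
    public List<String> search(@RequestParam String q,
                               @RequestParam(defaultValue = "20") int size) {
        int cappedSize = Math.min(size, MAX_PAGE_SIZE);
        return searchService.search(q, cappedSize);
    }
}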
Regarding the duplicates: I wouldn't worry too much about the storage aspect (many NoSQL approaches will probably have some duplication to facilitate fast queries) but keep in mind that aggregations might yield "wrong" counts and sums. You'd typically perform those aggregations to get, say, the totals in your product categories and you want to make sure they are representative of your warehouse state.
More cannot really be said right now based on the limited information you've provided.

ElasticSearch vs Relational Database

I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also search if a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls; if the contact doesn't exist, we create it, asking for all their personal information. The second time they call, we will search for matches by name, last name, or email, to detect that the contact already exists in our DB.
What I thought is to use MongoDB as primary storage and use ElasticSearch to perform the queries, but I don't know if there is really a big difference between this and querying a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to search quickly (by name, email, last name) whether that person is in our DB. Wouldn't ElasticSearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better at read-what-was-just-written performance. Search engines are better at really quick search with additional tricks like all kinds of normalization: lowercasing, ä->a or ae, prefix matches, ngram matches (if indexed accordingly). Whether it's 1 million or 10 million entries in the store is not a big deal nowadays, but what is your query load? Well, there are only so many service center workers, so your query load is likely far less than 1 qps. No problem for a relational DB at all. The search engine would start to make sense if you want some normalization, as described above, or if you start indexing free-text comments or descriptions of customers.
If you don't have a problem with performance, then keep it simple and use 1 single datastore (maybe with some caching in your application).
Elasticsearch is not meant to be a primary datastore, so my advice is to use a simple relational database like Postgres and use simple SQL queries / an ORM mapper. If the dataset is not really large, it should be fast enough.
When you have performance issues with searches, you can use a combination of a relational db and Elasticsearch. You can use Elasticsearch feeders to update ES with the data from your relational db.
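A very rough sketch of that combination, keeping the relational database as the primary store and mirroring writes into Elasticsearch (the repository interfaces and Contact class below are placeholders, not a specific feeder product):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ContactService {

    // Illustrative abstractions: one repository for the relational store,
    // one for the Elasticsearch index.
    public interface ContactJpaRepository { Contact save(Contact c); }
    public interface ContactSearchRepository { Contact save(Contact c); }
    public static class Contact { public String name, lastName, email; }

    private final ContactJpaRepository db;
    private final ContactSearchRepository index;

    public ContactService(ContactJpaRepository db, ContactSearchRepository index) {
        this.db = db;
        this.index = index;
    }

    // Write to the primary (relational) store first, then mirror the change
    // into Elasticsearch so searches stay in sync. In practice the second
    // step is often done asynchronously (message queue, Logstash JDBC input, etc.).
    @Transactional
    public Contact createContact(Contact contact) {
        Contact saved = db.save(contact);
        index.save(saved);
        return saved;
    }
}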
Indexed RDBMS works well for search
If your data is structured, i.e. columns are clearly defined, searching 1 million records will not be a problem in an RDBMS either.
When to use Elastic
Text Search: Searching words across multiple properties (e.g. description, name etc.)
JSON Store and search: If data being stored is in json format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions
Elastic as an application data provider
Elastic should not be seen as a data store, even if you are storing data in it; it is about how you perceive Elastic. Elastic should be used to store and set up data for the application, and it is the application which decides how and when to use Elastic (search and suggestions). Elastic is not a NoSQL storage alternative to an RDBMS; if that is what you need, use a NoSQL database instead.
This perception puts Elastic in line with Redis and Kafka. These tools are key components of an application design, serving applications as event stores, search engines, caches, etc.
Database with Elastic
Your design should use both. For storing the contacts, use the database and index the contacts for querying. Also make the data available in Elastic for searching, autocomplete, and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you actually going to use the data?
If it's just something simple like checking if a customer exists and then creating a new customer, then use the RDBMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for big data), but you have transactions and data integrity is important, then an RDBMS will be the right fit. Some examples could be tax, leasing, or financial reporting systems.
However, if you have a large dataset and you need a wide range of query capabilities, such as fuzzy search or searches where the user can select multiple filters on the data, or you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on a web-based "find a doctor" application with a large customer base: 11 million users, with 200+ hits per second at peak time. The customer could check some checkboxes to filter by specialty, spoken languages, ratings, hospitals, etc., all sorted by distance from the user's location with a response time of 2 seconds or less. It would be very difficult for an RDBMS to match that.
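To illustrate the kind of query that is hard to match in plain SQL (assuming the Elasticsearch Java client; the field names and search text are made up), a typo-tolerant lookup of a contact across several fields might look like this:

import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;

public class ContactLookup {
    public static void main(String[] args) {
        // Fuzzy multi-field match: tolerates small typos in the name, last
        // name or email, something an RDBMS cannot do efficiently with LIKE.
        MultiMatchQueryBuilder query = QueryBuilders
                .multiMatchQuery("jon smyth", "name", "lastName", "email")
                .fuzziness(Fuzziness.AUTO)
                .operator(Operator.AND);

        System.out.println(query); // prints the JSON query body
    }
}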

Both ElasticSearch and Redis, overkill usecase?

I'm currently designing the architecture of my project, or at least trying to figure out what will be useful in my case.
Simple use case
I will have several thousand profiles in a backend and I need to implement a fast search engine, so Elasticsearch looks perfect for that. Every time a profile is updated, the index will be updated by an asynchronous task.
My question now is: if I want to implement a cache for the detail of a profile, should I stick with Elasticsearch and put that data in my index? Or use Redis and do something like profil_id => data?
I think both sound good; the problem is that whenever a profile is updated, I will have to flush the cache after reindexing in Elasticsearch if I want to see the change in my backend.
So what can I do? Thank you so much!
You should consider using RediSearch. It can provide a solution for your needs, giving you both Redis performance and full-text search support.
Elasticsearch and Redis are basically meant to solve two different problems: one does indexing while the other does caching.
Redis is meant to return already-requested data as fast as possible, whereas Elasticsearch is a search and analytics engine. It perfectly fits a use case where you have to implement a fast search engine, and it will be more performant than any in-memory data structure store or cache such as Redis (assuming your searches are complex and involve some aggregations/filters).
The problem comes with profile updates. Since your profile updates are not that frequent, you could actually do partial updates to the ES index rather than reindexing. So whenever a person updates their profile, get the changed set (changed data) and do a partial update to the particular document in the ES index. You can see how it's done here: partial update.
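For illustration (assuming the Java high-level REST client; the index name, document id, and field are made up), a partial update that sends only the changed fields might look like this:

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class ProfilePartialUpdate {
    public static void main(String[] args) {
        // Only the changed fields are sent; the rest of the profile document
        // stays as it is in the index, so no full reindex of the profile.
        UpdateRequest update = new UpdateRequest("profiles", "42")
                .doc("{\"headline\":\"Senior Java Developer\"}", XContentType.JSON);

        System.out.println(update); // would be executed via client.update(update, RequestOptions.DEFAULT)
    }
}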
This particular Stack Overflow answer will also help you: cache vs indexing.

Is it safe to expose the Elasticsearch Search API directly through your application's API?

I am developing an AngularJS app with a Java/Spring Boot API. It uses Spring Data Elasticsearch to provide access to Elasticsearch's Search API for searching. Here is an example:
Page<Address> page = addressSearchRepository.search(simpleQueryStringQuery(query), pageable);
The variable query is a user's search string. pageable is an object that specifies page number, page size, and sorting. I can use QueryBuilders to build other Elasticsearch queries and expose them as different API endpoints.
Another option is to use QueryBuilders.wrapperQuery and send Elasticsearch queries directly from JavaScript. Here is an example where jsonQuery is a string containing a full Elasticsearch query:
Page<Address> page = addressSearchRepository.search(wrapperQuery(jsonQuery), pageable);
This would be a secure endpoint that only authenticated users can access. This seems to be equivalent to exposing an Elasticsearch index's Search API directly. Assuming that any data in the index is safe to show the user, would this be a security risk?
In my research so far I've found that it may be possible to crash Elasticsearch using a query, but it isn't that big of a problem in newer versions: https://www.elastic.co/blog/found-crash-elasticsearch#arbitrary-large-size-parameter
Maybe limiting the page size or using the scan and scroll API when the page size is very large would mitigate this.
I know that script fields should be avoided at all costs, but they are disabled by default (as of v1.4.3).
You can still crash Elasticsearch if you know how to. For example, if you start running 10-deep nested aggregations, you might very well go and take a break. It will either take a lot of time or be very expensive: it will use a lot of memory, make the JVM do a lot of garbage collection (which basically freezes all other threads running in the JVM), and reclaim only small amounts of memory back. It can make the cluster unresponsive in this way.
I'm not saying that any 10-deep nested aggregation you create will cripple the cluster, but under normal circumstances a cluster built for a certain SLA and a certain amount of data, given some heavy aggregations (for example, terms aggregations on analyzed string fields), will put the nodes under very heavy computational load.
Maybe the nodes will not run out of memory, but they will barely be responsive.
Elastic's team is trying to implement other circuit breakers and to add default limits to certain types of queries and aggregations (a huge task). But if your aim is for your users not to crash ES, while they have full access to all queries, I think there are ways to crash it. I, personally, wouldn't expose ES and let my users do whatever they want with whatever queries they create.
Depending on how your wrapper is configured, I'd only allow my users certain types of queries/aggregations and for those I'd impose some limits (applicable for those queries/aggs that accept limits).
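One possible (and deliberately simplistic) way to impose such limits before handing a raw user query to wrapperQuery is to inspect the JSON and reject disallowed clause types; the list of forbidden types here is just an example, and the page size would still be capped separately on the Pageable:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class QueryGuard {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Reject the riskiest clause types anywhere in the user-supplied query
    // before it is wrapped and sent to Elasticsearch.
    public static String sanitize(String rawQueryJson) throws Exception {
        JsonNode root = MAPPER.readTree(rawQueryJson);
        for (String forbidden : new String[] {"script", "script_score", "regexp", "wildcard"}) {
            if (root.findValue(forbidden) != null) {
                throw new IllegalArgumentException("Query type not allowed: " + forbidden);
            }
        }
        return rawQueryJson;
    }
}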
