Is it safe to expose the Elasticsearch Search API directly through your application's API?

I am developing an AngularJS app with a Java/Spring Boot API. It uses Spring Data Elasticsearch to provide access to Elasticsearch's Search API for searching. Here is an example:
Page<Address> page = addressSearchRepository.search(simpleQueryStringQuery(query), pageable);
The variable query is a user's search string. pageable is an object that specifies page number, page size, and sorting. I can use QueryBuilders to build other Elasticsearch queries and expose them as different API endpoints.
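For example (a hedged sketch; the "city" field and variable are made up for illustration), another endpoint could expose a match query in the same way:

Page<Address> page = addressSearchRepository.search(QueryBuilders.matchQuery("city", city), pageable);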
Another option is to use QueryBuilders.wrapperQuery and send Elasticsearch queries directly from JavaScript. Here is an example where jsonQuery is a string containing a full Elasticsearch query:
Page<Address> page = addressSearchRepository.search(wrapperQuery(jsonQuery), pageable);
This would be a secure endpoint that only authenticated users can access. This seems to be equivalent to exposing an Elasticsearch index's Search API directly. Assuming that any data in the index is safe to show the user, would this be a security risk?
In my research so far I've found that it may be possible to crash Elasticsearch using a query, but it isn't that big of a problem in newer versions: https://www.elastic.co/blog/found-crash-elasticsearch#arbitrary-large-size-parameter
Maybe limiting the page size or using the scan and scroll API when the page size is very large would mitigate this.
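As a minimal sketch of that mitigation (assuming Spring Data 2.x; the limit value is an arbitrary assumption), the backend could clamp the requested page size before it ever reaches the repository:

// Hypothetical guard: cap the client-requested page size.
private static final int MAX_PAGE_SIZE = 100; // assumed limit, tune to your SLA

private Pageable clamp(Pageable pageable) {
    if (pageable.getPageSize() > MAX_PAGE_SIZE) {
        return PageRequest.of(pageable.getPageNumber(), MAX_PAGE_SIZE, pageable.getSort());
    }
    return pageable;
}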
I know that script fields should be avoided at all costs, but they are disabled by default (as of v1.4.3).

You can still crash Elasticsearch if you know how to do it. For example, if you start building 10-deep nested aggregations, you might as well go and take a break. The query will either take a very long time or be very expensive: it can use a lot of memory and force the JVM into heavy garbage collection (which essentially freezes all other threads running in the JVM) while reclaiming only small amounts of memory each time. It can make the cluster unresponsive this way.
I'm not saying that every 10-deep nested aggregation will cripple the cluster, but under normal circumstances, a cluster built for a certain SLA and a certain amount of data will find some heavy aggregations (for example, terms aggregations on analyzed string fields) very computationally expensive for the nodes.
The nodes may not run out of memory, but they will be barely responsive.
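To make the risk concrete, here is a rough sketch (assuming the 5.x+ Java AggregationBuilders API; the field names are made up) of how cheaply a client can construct such a request:

// Hypothetical abuse case: programmatically nesting terms aggregations ten levels deep.
TermsAggregationBuilder root = AggregationBuilders.terms("level_0").field("field_0");
TermsAggregationBuilder current = root;
for (int i = 1; i < 10; i++) {
    TermsAggregationBuilder child = AggregationBuilders.terms("level_" + i).field("field_" + i);
    current.subAggregation(child);
    current = child;
}
// Each level multiplies the number of buckets, so even modest per-field
// cardinality can exhaust the heap and stall the JVM in garbage collection.

A few lines of client code are enough to generate a request the cluster may never finish.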
The Elastic team is trying to implement more circuit breakers and to add default limits to certain types of queries and aggregations (a huge task). But if your aim is for your users not to be able to crash ES while they have full access to all queries, I think there are ways to crash it. Personally, I wouldn't expose ES and let my users run whatever queries they create.
Depending on how your wrapper is configured, I'd allow my users only certain types of queries/aggregations, and for those I'd impose limits (where the queries/aggs accept them).
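Here is a minimal sketch of such an allowlist (the allowed set is an assumption, and a full implementation would also recurse into bool sub-clauses), checked before the raw JSON is passed to wrapperQuery:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Set;

public class QueryAllowlist {
    // Assumed policy: only these query types may come from the client.
    private static final Set<String> ALLOWED =
            Set.of("match", "term", "range", "bool", "simple_query_string");

    public static void check(String jsonQuery) throws Exception {
        JsonNode query = new ObjectMapper().readTree(jsonQuery);
        // The top-level keys of a query clause name its type(s).
        query.fieldNames().forEachRemaining(type -> {
            if (!ALLOWED.contains(type)) {
                throw new IllegalArgumentException("query type not allowed: " + type);
            }
        });
    }
}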

Related

Implementing popular keywords in ElasticSearch

I'm using ElasticSearch on AWS EC2, and I want to implement a "today's popular keywords" feature in ES.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords in the name index only.
I tried to use the ES slowlog and Logstash, but the slowlog writes one log entry per shard (e.g., with 5 shards, 5 query logs are saved for a single search).
Is there any good and easy way to implement popular keywords in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
The slowlog design you mentioned is not good: as you noted, it works on a per-shard basis, and even if you do extra computation and manage to merge the shard entries back into a single search at the index level, it still has problems.
You would have to change the slowlog configuration, and every index would need a different threshold. You could set it to 0ms to make sure you capture all search queries in the slowlog, but that would take a huge amount of disk space and would be bad for Elasticsearch performance.
You would also have to parse the slowlog in your application, and doing that at runtime would be very costly.
I think you can instead maintain a distributed cache in your application that stores the top searched keywords, like the leaderboard of a multiplayer game, which changes very frequently; in your case, you don't even have to update this cache very often. Without going into much implementation detail, a simple hash map of search term to count would solve the issue, as sketched below.
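A minimal in-process sketch of that idea (names are illustrative; a real deployment would put the counts in a distributed cache such as Redis):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class PopularKeywords {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Call from the search endpoint each time a user searches the name index.
    public void record(String keyword) {
        counts.merge(keyword.toLowerCase(), 1L, Long::sum);
    }

    // Return the top-n keywords by count, highest first.
    public List<Map.Entry<String, Long>> top(int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toList());
    }
}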
Hope this helps. Let me know if you have questions.

Using elastic search for a UI dashboard behind a proxy

I am working on a search dashboard with full-text search capabilities, backed by ES. The search would initially be consumed by a UI dashboard. I am planning to have an application web service (WS) API layer between the UI dashboard and ES, which will route business searches to ES.
There can be multiple clients of the WS going forward, each with its own business use cases and complex data requirements (basically, response fields). There are many entities and a huge number of fields across them. Each client would need to specify which entities it wants returned and with which fields.
To support this dynamically changing requirement, one approach could be to make the WS a pass-through to ES (with pre-validations such as access control, and post-transformations of the response from ES). The WS APIs would look exactly like the ES APIs; the UI would build ES queries through the JS client and send them to the WS, which, after access control, would get the data from ES.
I am new to ES and skeptical of this approach. Are there any particular challenges with it? One of my colleagues has worked with ES before, but always through a backend Java client, so he's not too sure either.
I looked up ES JS clients and there's an official one here.
Some context:
We have around 4 different entities (this can increase in the future) with both full-text and keyword-type fields. A typical search could have multiple filters and search terms and would need to specify the result fields. Also, some searches would go across entities and some against individual ones. We maintain a separate index for each entity.
What I understand from your post is that, at a high level, you want to achieve the following:
There can be multiple clients to WS going forward, each with its own
business use cases, and complex data requirements (basically response
fields)
And as you are not sure how to do this, you are thinking of building the Elasticsearch queries in your front-end JavaScript only. I am not a big fan of this approach, as it exposes how you build your queries, and if a hacker learns crucial information such as the following, they can bring your entire ES cluster to its knees:
The types of wildcard queries you run (these can be very expensive).
Your index names and ES cluster details (you may have access control, but you are still exposing crucial info).
How you build your search queries.
These are just a few examples; the list goes on.
Right approach
As you already have a backend where you check access, build the Elasticsearch queries there; you also have the advantage of teammates who know ES.
For building complex response fields, you can use source filtering, with which you can specify in your search request exactly which fields you want returned in the search results.
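For example, a hedged sketch with the high-level Java client (the field and index names are made up):

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.matchQuery("name", "john"))
        // Include only the fields this client asked for; exclude internal ones.
        .fetchSource(new String[] {"name", "address.city"},
                     new String[] {"internal_notes"});
SearchRequest request = new SearchRequest("addresses").source(source);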

Elasticsearch Best Practices Flow

I am using Elasticsearch for product filtering. We have complex product-availability logic, and I can see two options:
Use Elasticsearch to store only product-specific data, keeping the availability logic in the web server: first filter data with Elasticsearch, then check the availability conditions against that result set.
Or flatten the data and store it in Elasticsearch, though in that case there will be duplicated data.
My concern is whether it is good practice to call an Elasticsearch endpoint from the browser: it has no auth system by default, and every query and response is visible in the network log. I believe the call should be made from the web server to Elasticsearch, with the front end talking only to the web server, unaware of Elasticsearch's existence.
Any best-practice insight would be helpful.
Simply create an authenticated endpoint in your backend and send the queries to that endpoint. Do make sure some limits are enforced, such as the two below (see the sketch after this list):
size -- you don't want to let anybody download your whole index, and
aggregation depth -- you don't want anyone to run summaries over your whole index/indices to gain a competitive advantage.
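A hedged sketch of enforcing both limits on a raw client-supplied search body (the limit values are assumptions):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Iterator;

public class SearchLimits {
    private static final int MAX_SIZE = 100;      // assumed page-size cap
    private static final int MAX_AGG_DEPTH = 2;   // assumed nesting cap

    public static void enforce(String body) throws Exception {
        JsonNode root = new ObjectMapper().readTree(body);
        if (root.path("size").asInt(0) > MAX_SIZE) {
            throw new IllegalArgumentException("size exceeds " + MAX_SIZE);
        }
        if (aggDepth(root) > MAX_AGG_DEPTH) {
            throw new IllegalArgumentException("aggregations nested too deeply");
        }
    }

    // Count how deeply "aggs"/"aggregations" blocks are nested.
    private static int aggDepth(JsonNode node) {
        int depth = 0;
        for (String key : new String[] {"aggs", "aggregations"}) {
            JsonNode aggs = node.get(key);
            if (aggs == null) continue;
            Iterator<JsonNode> named = aggs.elements();
            while (named.hasNext()) {
                depth = Math.max(depth, 1 + aggDepth(named.next()));
            }
        }
        return depth;
    }
}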
Regarding the duplicates: I wouldn't worry too much about the storage aspect (many NoSQL approaches will probably have some duplication to facilitate fast queries) but keep in mind that aggregations might yield "wrong" counts and sums. You'd typically perform those aggregations to get, say, the totals in your product categories and you want to make sure they are representative of your warehouse state.
More cannot really be said right now based on the limited information you've provided.

ElasticSearch - Configuration to Analyse a document on Indexing

In a single request, I want to retrieve documents from a SOR (system of record), store them in Elasticsearch, and then search those documents using the ES search API.
There seems to be some lag between the time a document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES so that a request to index a document does not return until the document has been analyzed and can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While that seems sufficient in the majority of cases, it might not be, as in your case.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
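On newer Elasticsearch versions (5.0+) there is also a middle ground worth knowing about: the refresh=wait_for parameter on the index request blocks that single indexing call until the next scheduled refresh makes the document visible, without forcing extra refreshes (the _doc path below is the recent form):

PUT /my_index/_doc/1?refresh=wait_for

With the Java client, this corresponds to setting the index request's refresh policy to WAIT_UNTIL.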
The refresh API, as suggested in the accepted answer, is heavyweight, and you may not want to call it after every index operation if you are going to perform a significant number of indexing operations.
What happens under the hood is that the translog maintained by Elasticsearch is written to the in-memory segment that Elasticsearch maintains. This operation is best left to the discretion of Elasticsearch; however, there are some configuration parameters you can play around with.
There is an alternative approach you can take; it may or may not suit your specific use case, but here it goes.
Query the index/_stats/refresh API and note the refresh count, index your document, and then keep issuing the same stats query. Once the refresh count has increased since your indexing call, your document is ready to be searched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
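A hedged sketch of that polling idea with the low-level Java REST client and Jackson (the JSON paths follow the indices-stats response documented above):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RefreshWatcher {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Read the total refresh count for an index from GET /<index>/_stats/refresh.
    static long refreshCount(RestClient client, String index) throws Exception {
        Response resp = client.performRequest(new Request("GET", "/" + index + "/_stats/refresh"));
        JsonNode stats = MAPPER.readTree(resp.getEntity().getContent());
        return stats.path("_all").path("primaries").path("refresh").path("total").asLong();
    }
}

Usage: note the count, index the document, then poll until the count has increased.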

What caching strategy for search queries

We are developing a search engine web application that will enable users to search the content of about 200 portals.
Our business partner is taking care of maintaining and feeding a solr/lucene instance that is doing the workhorse job of indexing the data.
Our application queries Solr and presents the results in a human-friendly way. However, we are wondering how we could limit the number of queries, perhaps using some form of caching. The results could be cached for a few hours.
What we are wondering is: what could be a good strategy for caching the queries results? Obviously we expect the method invocations to vary a lot... Does it make sense at all to do caching?
Is there some caching system that is particularly suitable in this use case? We are using Spring 3 for the development.
I would keep in mind that Solr already has a lot of caching built into it in order to speed up common queries. I'd advise you to look into the inherent capabilities in Solr/Lucene before you go off and reinvent the wheel with your own query cache.
Here is a good place to start.
The simplest solution is to normalize your query before it hits Solr.
I created my own QueryBuilder method that I pass my query string through before hitting Solr.
All it does is explode all of the arguments and then sort them into a predefined group set.
For example, to normalize your queries so that they are cacheable, you can sort the parameters alphabetically by key, rebuild the query string, and use that to query Solr (the actual query result will be unchanged).
Before you actually run the query, you can then hash the Solr query string and check it against an in-memory set of all keys saved so far. If you find yourself approaching millions of query keys, which is quite likely, you might want to look at a Bloom filter to reduce the key space while still maintaining a reasonable degree of accuracy on cache hits.
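A hedged sketch of the normalize-then-hash idea (parameter handling is simplified, names are illustrative, and the cache is a plain in-process map rather than a distributed one):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class QueryCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Sort parameters by key so that equivalent queries produce the same string.
    static String normalize(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        new TreeMap<>(params).forEach((k, v) -> sb.append(k).append('=').append(v).append('&'));
        return sb.toString();
    }

    static String hash(String normalized) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    // Return the cached response, or call Solr and store the result on a miss.
    public String lookup(Map<String, String> params, Supplier<String> solrCall) throws Exception {
        return cache.computeIfAbsent(hash(normalize(params)), k -> solrCall.get());
    }
}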
Alternatively, you might want to look at putting a reverse-proxy cache between you and Solr. For example, if you query Solr as Spring -> Varnish -> Solr, Varnish can cache responses, using the query string as the hash key. You can then set a 2-hour Expires header so that results are automatically flushed/invalidated.
Hopefully this helps.
I have found that caching the results or the rendered content outside Lucene works best: an API search service that points to a caching tier holding the results from the Lucene index.
If you separate the caching tier out, you can plug in whatever caching you want: distributed caching (Redis, Azure AppFabric, other cloud caches, etc.). You can also cache partial renderings of the web page (i.e. output caching in ASP.NET) or cache the API calls themselves using RESTful conventions. Things like cache warming or proactive caching (based on usage) are then easy to do with services.
Your application/index cache can then be "re-used" across more tiers of your app instead of caching only at the index level. This all depends on whether your index updates are real-time, whether queries are date-level secure per client/user ID, etc. As mentioned above, Solr already does some of this for you.
