Spring Boot Suggestions Endpoint Optimization

I have a microservice that has a REST endpoint which returns suggested item names. The endpoint is invoked for every character entered in the text field (minimum 3 characters).
The endpoint internally calls Elasticsearch, which fetches the items matching the query string.
When RPM increases, APM monitoring shows a huge spike in heap consumption as well.
Any idea how I could optimize this endpoint?
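One common mitigation, sketched below under assumptions: cache suggestion results per (lowercased) prefix so repeated keystrokes don't each hit Elasticsearch, and bound the cache so it can't grow the heap unchecked. This assumes Spring Boot with the Caffeine cache starter; SuggestionService and ItemSearchClient are hypothetical names, not from the question.

import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class SuggestionService {

    private final ItemSearchClient itemSearchClient; // hypothetical wrapper around the ES query

    public SuggestionService(ItemSearchClient itemSearchClient) {
        this.itemSearchClient = itemSearchClient;
    }

    // Repeated keystrokes resolving to the same prefix are served from the cache
    // instead of issuing a new Elasticsearch query each time.
    @Cacheable(cacheNames = "suggestions", key = "#prefix.toLowerCase()")
    public List<String> suggest(String prefix) {
        return itemSearchClient.fetchSuggestions(prefix);
    }
}

With the Caffeine starter, the cache can be bounded in application.properties, which also caps its heap footprint:

spring.cache.cache-names=suggestions
spring.cache.caffeine.spec=maximumSize=10000,expireAfterWrite=60s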

Related

Pagination using a REST API with Spring Boot and GBQ

I have an Experience API endpoint which returns results per limit and offset. My REST API calls BigQuery and gets a data set of more than 500 records, so I need to store product IDs in a cache so that I can reuse them in every pagination request for a given limit and offset. Is there any alternative, better solution for this kind of scenario?
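A minimal sketch of the caching approach described above, using Spring's cache abstraction, in case it helps frame alternatives: fetch and cache the full ID list once per query, then slice it per limit/offset. All class and method names here are hypothetical.

import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ProductPageService {

    private final BigQueryGateway bigQueryGateway; // hypothetical BigQuery wrapper

    public ProductPageService(BigQueryGateway bigQueryGateway) {
        this.bigQueryGateway = bigQueryGateway;
    }

    // One BigQuery round trip per distinct query; subsequent pages hit the cache.
    @Cacheable(cacheNames = "productIds", key = "#query")
    public List<String> allProductIds(String query) {
        return bigQueryGateway.fetchProductIds(query);
    }

    public List<String> page(String query, int offset, int limit) {
        List<String> ids = allProductIds(query);
        int from = Math.min(offset, ids.size());
        int to = Math.min(offset + limit, ids.size());
        return ids.subList(from, to);
    }
}

An alternative worth weighing is pushing LIMIT/OFFSET into the BigQuery SQL itself and skipping the cache entirely; that trades repeated query cost for zero cache state.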

ElasticSearch | Efficient Pagination with more than 10k documents

I have a microservice with Elasticsearch as the backing store. I have multiple indices with hundreds of thousands of documents inserted into them.
Now I need to expose GET APIs for those indices, e.g. GET /employees/get.
I have gone through ES pagination using scroll and search_after, but both of them require meta information like the scroll_id or the search_after key to do pagination.
The concern is that my microservice shouldn't expose these scroll_ids or search_after keys. With the current approach I can list up to 10k docs, but not beyond that. And I don't want users of the microservice to know about the backing store or anything about it. So how can I achieve this in Elasticsearch?
I have the below approaches in mind:

1. Store the scroll_id in memory and retrieve the results based on it for subsequent queries. The GET query would be: GET /employees/get?page=1, where each page has 10k documents by default.

2. Implement the scroll API internally over the GET API and return all matching documents to users. But this increases latency and memory use, because at times I may end up returning 100k docs to the user in a single call.

3. Expose a GET API with a search string. By default return 10k documents, and refine the results further with the search string: let's say GET /employees/get returns 10k documents and accepts a query_string to narrow those 10k down, like auto-suggestion using n-grams. Then we show the most relevant 10k docs every time. I know this is not actual pagination, but it does solve the problem, in a hacky way. This is my Plan B.
Edited:
This is my use case:
Return the list of employees of a company. There are more than 100k employees, so I have to return the results in pages: GET /employees/get?from=0&size=1000, then GET /employees/get?from=1001&size=1000, and so on.
But once from+size reaches 10k, ES rejects the query (the index.max_result_window setting defaults to 10,000).
Please suggest what would be the ideal way to implement pagination in a microservice with ES as the backing store, without letting users know about the internals of ES.
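A minimal sketch of one way to do this, assuming the Elasticsearch high-level REST client: paginate with search_after (which is not subject to the 10k window), but hand the client only an opaque Base64 cursor, so no ES concept leaks through the API. Index, field, and class names are illustrative.

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

public class EmployeePager {

    private final RestHighLevelClient client;

    public EmployeePager(RestHighLevelClient client) {
        this.client = client;
    }

    // cursor is an opaque token handed out by nextCursor(); null means first page
    public SearchResponse page(String cursor, int size) throws Exception {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(size)
                .sort("employee_id", SortOrder.ASC); // needs a unique tiebreaker field
        if (cursor != null) {
            // decode "value1|value2" back into the sort values for search_after
            String decoded = new String(Base64.getUrlDecoder().decode(cursor), StandardCharsets.UTF_8);
            source.searchAfter(decoded.split("\\|"));
        }
        return client.search(new SearchRequest("employees").source(source), RequestOptions.DEFAULT);
    }

    // build the next page's cursor from the last hit of the current page
    public String nextCursor(SearchHit lastHit) {
        StringBuilder joined = new StringBuilder();
        for (Object v : lastHit.getSortValues()) {
            if (joined.length() > 0) joined.append('|');
            joined.append(v);
        }
        return Base64.getUrlEncoder().encodeToString(joined.toString().getBytes(StandardCharsets.UTF_8));
    }
}

Unlike the in-memory scroll_id idea, this keeps the service stateless: the cursor carries everything needed to resume, and the user never sees a scroll_id or search_after key.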

Is it safe to expose the Elasticsearch Search API directly through your application's API?

I am developing an AngularJS app with a Java/Spring Boot API. It uses Spring Data Elasticsearch to provide access to Elasticsearch's Search API for searching. Here is an example:
Page<Address> page = addressSearchRepository.search(simpleQueryStringQuery(query), pageable);
The variable query is a user's search string. pageable is an object that specifies page number, page size, and sorting. I can use QueryBuilders to build other Elasticsearch queries and expose them as different API endpoints.
Another option is to use QueryBuilders.wrapperQuery and send Elasticsearch queries directly from JavaScript. Here is an example where jsonQuery is a string containing a full Elasticsearch query:
Page<Address> page = addressSearchRepository.search(wrapperQuery(jsonQuery), pageable);
This would be a secure endpoint that only authenticated users can access. This seems to be equivalent to exposing an Elasticsearch index's Search API directly. Assuming that any data in the index is safe to show the user, would this be a security risk?
In my research so far I've found that it may be possible to crash Elasticsearch using a query, but it isn't that big of a problem in newer versions: https://www.elastic.co/blog/found-crash-elasticsearch#arbitrary-large-size-parameter
Maybe limiting the page size or using the scan and scroll API when the page size is very large would mitigate this.
I know that script fields should be avoided at all costs, but they are disabled by default (as of v1.4.3).
You can still crash Elasticsearch if you know how to do it. For example, if you start building 10-level-deep nested aggregations, you might as well go and take a break: the query will either take a very long time or be very expensive. It will use a lot of memory and make the JVM do a lot of garbage collection (which basically freezes all other threads running in the JVM) while reclaiming only small amounts of memory. It can make the cluster unresponsive this way.
I'm not saying that every 10-level-deep nested aggregation will cripple the cluster, but under normal circumstances a cluster built for a certain SLA and a certain amount of data will, given some heavy aggregations (for example, terms aggregations on analyzed string fields), be pushed very hard computationally on its nodes.
Maybe the nodes will not run out of memory, but they will be barely responsive.
Elastic's team is trying to implement more circuit breakers and to add default limits to certain types of queries and aggregations (a huge task). But if your aim is for users with full access to all queries to be unable to crash ES, I think there are ways to crash it anyway. Personally, I wouldn't expose ES and let my users do whatever they want with whatever queries they create.
Depending on how your wrapper is configured, I'd only allow users certain types of queries/aggregations, and for those I'd impose some limits (applicable to the queries/aggs that accept limits), as in the sketch below.
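A minimal sketch of that restriction in the Spring Data Elasticsearch setup the question describes: the server builds the only allowed query type itself and clamps the page size, so clients never submit raw query JSON. The service class and the cap value are illustrative; the repository call is the one from the question.

import static org.elasticsearch.index.query.QueryBuilders.simpleQueryStringQuery;

import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;

public class AddressSearchService {

    private static final int MAX_PAGE_SIZE = 100; // arbitrary cap for illustration

    private final AddressSearchRepository addressSearchRepository; // repository from the question

    public AddressSearchService(AddressSearchRepository addressSearchRepository) {
        this.addressSearchRepository = addressSearchRepository;
    }

    public Page<Address> search(String query, int page, int size) {
        // clamp the page size so a huge "size" parameter can't strain the cluster
        Pageable pageable = PageRequest.of(page, Math.min(size, MAX_PAGE_SIZE));
        // the only query type this endpoint exposes; no raw JSON accepted from clients
        return addressSearchRepository.search(simpleQueryStringQuery(query), pageable);
    }
}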

ElasticSearch - Configuration to Analyse a document on Indexing

In a single request, I want to retrieve documents from a SOR (system of record), store them in Elasticsearch, and then search those documents using the ES search API.
There seems to be some lag between the time a document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES so that the request to index a document does not return until the document has been analyzed and can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While it may seem enough in a majority of cases, it might not, such as in your case.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
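A lighter-weight middle ground, assuming Elasticsearch 5.0 or later: the index call itself can pass refresh=wait_for, which blocks the indexing request until the document is visible to search, without forcing extra refreshes (the URL shape, e.g. the _doc type, varies by version):

PUT /my_index/_doc/1?refresh=wait_for
{
  "title": "visible to search once this call returns"
}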
The refresh API, as suggested in the accepted answer, is heavy in nature, and you may not want to call it after every index operation if you are going to perform a significant number of indexing operations.
What happens under the hood is that documents sitting in Elasticsearch's in-memory indexing buffer are written to a new searchable segment. This operation is best left to the discretion of Elasticsearch; however, there are some configuration parameters you can play around with.
There is an alternative approach you can take. It may or may not suit your specific use case, but here it goes:
Query the index/_stats/refresh API and note the refresh count, index your document, and then keep performing the same stats query. If the count has increased since your indexing call, your document has been made searchable.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
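A minimal sketch of that polling loop, assuming Java 11's HttpClient and Jackson against a local node; the stats response exposes the primaries' total refresh count at _all.primaries.refresh.total:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RefreshWatcher {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final ObjectMapper JSON = new ObjectMapper();

    // total refresh count for the index's primary shards
    static long refreshCount(String index) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/" + index + "/_stats/refresh")).build();
        String body = HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        return JSON.readTree(body).path("_all").path("primaries").path("refresh").path("total").asLong();
    }

    // capture refreshCount() before indexing, then wait until a refresh has happened
    static void waitForRefresh(String index, long countBeforeIndexing) throws Exception {
        while (refreshCount(index) <= countBeforeIndexing) {
            Thread.sleep(100); // poll interval; tune to taste
        }
    }
}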

Distributed Tracing Logging and Integrating with Logstash, Kibana and ElasticSearch

I have implemented a distributed transaction logging library with a tree-like structure, as described in Google's Dapper paper (http://research.google.com/pubs/pub36356.html) and eBay's CAL transaction logging framework (http://devopsdotcom.files.wordpress.com/2012/11/screen-shot-2012-11-11-at-10-06-39-am.png).
Log Format
TIMESTAMP HOSTNAME DATACENTER ENVIRONMENT EVENT_GUID PARENT_GUID TRACE_GUID APPLICATION_ID TREE_LEVEL TRANSACTION_TYPE TRANSACTION_NAME STATUS_CODE DURATION(in ms) PAYLOAD(key1=value2,key2=value2)
GUID HEX NUMBER FORMAT
MURMUR_HASH(HOSTNAME + DATACENTER + ENVIRONMENT)-JVM_THREAD_ID-(TIMESTAMP + atomic counter)
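For illustration, a minimal sketch of generating an ID in that format, assuming Guava 31+'s murmur3_32_fixed(); the exact hex layout is whatever your library already does:

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public final class EventGuids {

    private static final AtomicLong COUNTER = new AtomicLong();

    // MURMUR_HASH(host + dc + env) - JVM thread id - (timestamp + atomic counter), in hex
    public static String newGuid(String hostname, String datacenter, String environment) {
        int hash = Hashing.murmur3_32_fixed()
                .hashString(hostname + datacenter + environment, StandardCharsets.UTF_8)
                .asInt();
        long threadId = Thread.currentThread().getId();
        long stamp = System.currentTimeMillis() + COUNTER.incrementAndGet();
        return Integer.toHexString(hash) + "-"
                + Long.toHexString(threadId) + "-"
                + Long.toHexString(stamp);
    }
}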
What I would like to do is integrate this format with the Kibana UI so that when a user searches and clicks on a TRACE_GUID, it shows something similar to a distributed call graph indicating where the time was spent. Here is such a UI: http://twitter.github.io/zipkin/. This would be great. I am not a UI developer, so if someone could point me at how to do this, that would be great.
Also, I would like to know how I can index the payload data in Elasticsearch so a user can write an expression over the payload, such as (duration > 1000), and Elasticsearch will return all the log lines that satisfy the condition. I would also like to index the payload as name=value pairs so a user can query something like (key3=value2 or key4=exception). Please let me know if this can be achieved. Any help or pointers would be great.
Thanks,
Bhavesh
The first step to good searching in Elasticsearch is to create fields from your data. With logs, logstash is the proper tool: its grok{} filter uses patterns (existing or user-defined regexps) to split the input into fields.
You would need to make sure that duration is mapped to an integer (e.g. %{INT:duration:int} in your pattern). You could then query Elasticsearch for duration:>1000 to get the results.
Elasticsearch uses the Lucene query engine, so you can find sample queries based on that.
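A hedged sketch of such a logstash filter for the log format above; the grok sub-patterns are guesses at each field's shape, so treat them as a starting point. The kv{} filter then splits the key=value payload into individual fields, making queries like key3:value2 possible:

filter {
  grok {
    # one capture per column of the log format described in the question
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:hostname} %{WORD:datacenter} %{WORD:environment} %{NOTSPACE:event_guid} %{NOTSPACE:parent_guid} %{NOTSPACE:trace_guid} %{WORD:application_id} %{INT:tree_level:int} %{WORD:transaction_type} %{NOTSPACE:transaction_name} %{WORD:status_code} %{INT:duration:int} %{GREEDYDATA:payload}" }
  }
  kv {
    # turn "key1=value1,key2=value2" into separate searchable fields
    source      => "payload"
    field_split => ","
    value_split => "="
  }
}

With duration mapped as an integer, a Kibana/Lucene query like duration:>1000 then matches exactly the log lines the question asks for.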
