Elasticsearch slow performance for huge data retrieval with source field - performance

I'm using ElasticSearch to search from more than 10 million records, most records contains 1 to 25 words. I want to retrieve data from it, the method I'm using now is drastically slow for big data retrieval as I'm trying to get data from the source field. I want a method that can make this process faster. I'm free to use other database or anything with ElasticSearch. Can anyone suggest some good Ideas and Example for this?
I've tried searching for solution on google and one solution I found was pagination and I've already applied it wherever it's possible but pagination is not an option when I want to retrieve many(5000+) hits in one query.
Thanks in advance.

Try using scroll
While a search request returns a single “page” of results, the scroll
API can be used to retrieve large numbers of results (or even all
results) from a single search request, in much the same way as you
would use a cursor on a traditional database.

Related

Filter result in memory to search in elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically relationship).
Now as elastic search is not giving filters from multiple indexes, should we store them in memory in variable and filter them in node.js (which basically means that my application itself is working as a database server now).
We previously were using MongoDB which is also a NoSQL DB but we were able to manage it through aggregate queries but seems the elastic search is not providing that.
So even if we use both databases combined, we have to store results of them somewhere to further filter data from them as we are giving users advanced search functionality where they are able to filter data from multiple collections.
So should we store results in memory to filter data further? We are currently giving advanced search in 100 million records to customers but that was not having the advanced text search that elastic search provides, now we are planning to provide elastic search text search to customers.
What do you suggest should we use the approach here to make MongoDB and elastic search together? We are using node.js to serve data.
Or which option to choose from
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all the simplest way to solve this problem is to simply make one combined index. Its very trivial to do this. Just insert all the documents from index 2 into index 1. Prefix all fields coming from index-2 by some prefix like "idx2". That way you won't overwrite any similar fields. You can use an ingestion pipeline to do this, or just do it client side. You only will ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one-index.
If you are using somehting other than ES as your primary data-store you need to reconfigure the indexing operation to redirect everything that was earlier going into index-2 to go into index-1 as well(with the prefixed terms).
100 million records is trivial for something like ELasticsearch. Doing anykind of "joins" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years exp in ES. And I have seen people struggle with "joins" for 99% of the time. :)
The first thing to do when coming from MySQL/PostGres or even Mongodb is to restructure the indices to suit the needs of data-querying. Never try to work with multiple indices, ES is not built for that.
HTH.

Solr Slice v Page

Is it possible to use Slice via solrTemplate ?
actually I am struggling to see if it will even make a difference because even without using spring, there doesnt appear to be any way of telling Solr to exclude its "numFound" (total results) from a query
And when I use a normal spring data Page<..> query , when I look under the hood I only see one query issued to solr, i.e. no extra one for count. Or is the count simply done inside Solr somehow in an extra step ?
confused
Total document count is part of the Solr query. No additional query is required. Therefore, there is no advantage to Slice vs. Page.
The only related concept is when somebody wants to export a significant amount of data, in which case built-in paging becomes slower the further is data requested. For that, Solr has exporting functionality.

How to search for multiple strings in very large database

I want to search for multiple strings in a very large database. These strings are part of different attributes of database table. I have tried string search using LIKE in sql query. But it is taking a lot of time to get results. I have used Oracle database.
Should I use indexing of database? I found that Lucene can be used for it.
I also got some suggestions of using big data concepts. Which approach should I use?
The easiest way is:
1.) adding an index to the columns you like to search trough
2.) using oracle text as #lalitKumarB wrote
The most powerful way is:
3.) use an separate search engine (solr, elaticsearch).
But, probably you have to change you application in order to explicit use the search index for searching trough the data,...
I had the same situation some years before. Trying to search text in an big database. After a wile I found out, that database based search will never reach the performance of an dedicate search engine. And: you will have much more search features working out of the box, if you use solr (for example), like spelling correction, "More like this", ...
One option is to hold the data on orcale, searching in solr and return the ID of the document in order to only load the one row form oracle, the is referenced by the ID.
2nd option is to keep oracle as base datapool for your search engine and search in solr (or elasticsearch) in order to return the whole document/row from solr, not only the ID. So you don't need to load the data from the database any more.
The best option depends on your needs.
You have the choice between elasticsearch, solr or lucene

Does Elasticsearch stream results?

Does Elasticsearch stream the query results as they are "calculated" or does it calculate everything and then return the final response back to the client?
By default elasticsearch will only return a limited set of results for a query. (i.e. searching for * will only return the default count set regardless of the number of matches).
Generally to implement "streaming" , you make an initial search to get total count of matching documents and then ask for documents in ranges ( i.e. first 10, next 10, etc.. )
See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html
for how to request the number of documents returned.
Have you tried scroll query?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html much easier to deal with than pagination.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
Answer to the question in the comments:
So question would this be the right way to export large results for a
"report" type system? I'm not talking about frond end? I'm talking
about a back end application that will execute a custom query and
build a file with 300000 + result
I'm sure there might be a valid reasons for doing this, but to me it sounds like you're using a hammer to drive screws. Much of the point of using elasticsearch is to use it's aggregations features to do more of the computing in the data store.
Aggregations Documentation
If you really need the raw data of 300000 records, then thats what you need. However, if it's a report, that implies you're doing some manipulation of the data into metrics. Much of the point of ES is that it allows you to build "custom reports" on the fly. I suspect it will be much faster to put as much logic as you can into the query, rather simply manipulating the raw data.
Without knowing more about the requirements, I can't come up with any better answer than that.
No, Elastic so far does not support this. The Elastic API uses a traditional request/response model. The query results are paginated, buffered on the server-side, and sent back to the client. A truly read of the response body in a streaming fashion does not seem to be in the Elastic roadmap.
With that said, for big result sets the scroll API has been deprecated and was never intended for real-time user queries. At the moment the best option is the search_after that could be seen as a cursor in traditional RDBMS.

solr query- get results without scanning files

I would like to execute a solr query and get only the uniquKey I've defined.
The documents are very big so defining fl='my_key' is not fast enough - all the matching documents are still scanned and the query can take hours (even though the search itself was fast - numFound takes few seconds to return).
I should mention that all the data is stored, and creating a new index is not an option.
One idea I had was to get the docIds of the results and map them to my_key in the code.
I used fl=[docid], thinking it doesn't need scanning to get this info, but it still takes too long to return.
Is there a better way to get the docIds?
Or a way to unstore certain fields without reindexing?
Or perhapse a compeletly different way to get the results without scanning all the fields?
Thanks,
Dafna
Sorry, but the only way is to break your gigantic documents in more than one. I don't see how it will be possible to only match the fields you specified and let the documents alone. This is not how Lucene works.
One could make a document that used only indexed fields that are needed to query to turn the job easier, or break the document based on the queries that are needed. Or simply adding another documents with the structure needed for these new queries. It's up to you.

Resources