Search/Filter/Sort on constantly changing 1 million documents - elasticsearch

My use-case is I have max 1 Million documents and documents getting updated constantly (once every 5 mins). Each document has almost 40 columns and I have sort/filter/search requirements on almost every column.
Since the documents are changing constantly, the doc value 5 minutes earlier is not valid anymore. I am thinking that an ideal DB component will need to be running in memory. For the other use-cases in the application (where documents do not change constantly), I am using ElasticSearch cluster. So to be consistent with the search elsewhere in the application, I want to explore if I can run a separate ES node/cluster purely in memory for my use-case above. I could not find any examples or precursors for running ElasticSearch in production in a pure in-memory configuration.
If not ES, can I run Apache Solr in memory? I can try out any technology which allows me to run in a pure in-memory mode, and provide functionality similar to ES (free text search at a per-column level).
What would you recommend for this use-case?

Related

Implements popular keyword in ElasticSearch

I'm using ElasticSearch on AWS EC2.
And i want to implement today's popular keyword function in ES.
there is 3 indexes(place, genre, name), and i want see today's popular keyword in name index only.
I tried to use ES slowlog and logstash. but slowlog save logs every shard's log.
(ex)number of shards : 5 then 5 query log saved.
Is there any good and easy way to implement popular keyword in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
Design you mentioned using the slowlog is not good as you mentioned its on per shard basis, even if you do some more computing and able to merge and relate them to a single search at index level, it would not be good, as
you have to change the slow log configuration and for every index there needs to be a different threshold, you can change it to 0ms, to make sure you get all the search queries in slow logs, but that would take a huge disk space and would not be good for Elasticsearch performance.
You have to do some parsing of slow log in your application and if you do it runtime it would be very costly.
I think you can maintain a distributed cache in your application where you store the top searched keyword like the leaderboard of a multi-player gaming app, which is changing very frequently but in your case, you don't even have to update this cache very frequently. I would not go into much implementation details, but simple Hashmap of search term as key and count as value would solve the issue.
Hope this helps. let me know if you have questions.

Elastic search API Vs Spring data Vs logstash

I am planing to use elastic search for our dashboard using spring boot based rest services. After research i see top 3 options
Option A:
Use Elastic Search Java API ( from comment looks like going to go away)
Use Elastic Search Java Rest Client
Use spring-data-elasticsearch ( planing to use es 5.6 but challenging for latest es 6 as I don't see it's supports right now)
Option B:
Or shall I use logstash approach to
Sync data between postgressql and elastic search using logstash ?
Which one among them will be long term approach to get near real time data from ES in high load scenario ??
Usecase: I need to save some data from postgresql table to elastic search for my dashboard (near real time )
Update is frequent for both tables and es
to maintain current state
Load is going to increase in couple of week
The options you listed, in essence, are: should you go with a ready to use solution (logstash) or should you implement your own.
Try logstash first to see if it works for you - it'll take less time than implementing your own solution, and you can get working solution in minutes (if it's not hundreds of tables)
If you want near-real time, then you need to figure out if it allows you to:
handle incremental updates, i.e. if its 'tracking_column' configuration will work for your data structure and it will only load updated records in each run, not the whole table.
run it at the desired frequency
and in general, satisfies your latency requirements
If you decide to go with your own solution, keep in mind that spring-data-elasticsearch is a higher level wrapper for underlying elasticsearch client. If there are latency goals, then working on the lower level (elasticsearch clients) may give you better control and more options to tune the pipeline.
Otherwise, the client choice will not matter that much as data feed features (volume/update frequency) and db/es cluster configuration.

Solr Performance Issues (Caching/RAM usage)

We are using Solr 5.2 (on windows 2012 server/jdk 1.8) for document content indexing/querying. We found that querying slows down intermittently under load condition.
In our analysis we found following two issues.
Solr is not effectively using caches
Whenever new document indexed, it opens new searcher and cache will become invalid (as it was associated with old Index Searcher). In our scenario, new documents are indexed very frequently (at least 10 document are indexed per minute). So effectively cache will not be useful as it will open new searcher frequently to make new documents available for searching. How can improve caching usage?
RAM is not utilized
We observed that Solr is using only 1-2 GB of heap even though we have assign 50 GB. Seems like it is not loading index into RAM which leads to high IO. Is it possible to configure Solr to fully load indexes in memory? Don't find any documentation about this.

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

I am learning NoSQL and looking at different options for one of my client's requirements. I have gone through various resources before putting up this question (a person with little knowledge in NoSQL)
I need to store data at faster rate and read data.
Fully fail-safe and easily scalable.
Able to search through data for Analytics.
I ended up with a short list of: Cassandra and Elasticsearch
What I do understand is Cassandra is a perfect NoSQL storage solution for me, as I can write data and read data using indexes. Where it fails or it could fail is on Analytics. In the future, if I want to get data from from_date to to_date, or more ways to get data for analytics, if I don't design the Data model properly or keeping long term sight, which might be quite hard in ever changing world.
While Elastic Search is best at indexing (backed by Lucene), and can search the data randomly by throwing some random text. But does it work the same for even if I want to retrieve data from_date to to_date (I expect it might be). But the real question is, is it a Search Engine, or perfect NoSQL data storage like Cassandra? If yes, why do we still need Cassandra?
If both of these are in different world, please explain that! How do we combine them to get a more effective solution?
One of our applications uses data that is stored into both Cassandra and ElasticSearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to adhere to specific application-side requests. For a more liberal search than our query tables can allow, ElasticSearch performs that functionality nicely.
We have asked that same question (of ourselves)..."Why don't we just get everything from ElastsicSearch?"
The answer is that ElasticSearch was designed to be a search engine, and not a persistent data store. Sometimes ElasticSearch loses writes. Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading. For that purpose, I have written jobs that are designed to keep ElasticSearch in-sync with our Cassandra cluster. There was also a fairly recent discussion on Quora about this topic, that yielded similar points.
That being said, ElasticSearch works great as a search engine. And Cassandra works great as a scalable, high-performance datastore. But querying data is different from searching for data. There are times that we need one or the other, and a combination of the two works well for our application. It may (or it may not) work well for yours.
As for analytics, I have had some success in using the Cassandra Spark connector, to serve more complex OLAP queries.
Edit 20200421
I've written a newer answer to a similar question:
ElasticSearch vs. ElasticSearch+Cassandra
Cassandra + Lucene is a great option. There are different initiatives for this issue, for example:
Stratio’s Cassandra Lucene Index - Derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality. (https://github.com/Stratio/cassandra-lucene-index)
Stratio Cassandra, it's a native integration with Apache Lucene, it is very interesting. (https://github.com/Stratio/stratio-cassandra) - THIS PROJECT HAS BEEN DISCONTINUED IN FAVOUR OF Stratio’s Cassandra Lucene Index
Tuplejump Calliope, it's like Stratio Cassandra, but it's less active. (https://github.com/tuplejump/stargate-core)
DSE Search by Datastax. It allows using Cassandra with Apache Solr, but it's a proprietary option.(http://www.datastax.com/what-we-offer/products-services/datastax-enterprise)
After working on this problem myself I have realized that NoSQL databases like casandra are good when you want to make sure you are preserving your data schema with reliable writing operation, and don't want to take advantage of indexing operations that elasticsearch offers. In case you want to preserve some indexes data then elasticsearch is good in case you are trusting your scheme and only going to do far more reads than writes.
My case was data analytics. So I preserved a lot of my Latices in elastic search since later I wanted to traverse through the data a lot to see what should be my next step. I would have used casandra if I wanted to have a lot of changes in the schema of the data in my analytic pilelines.
Also there are many nice representing tools like kibana that you can use to present your data with some good graphics. Maybe I am lazy but they are very good looking and they helped me.
Storing data in a combination of Cassandra and ElasticSearch gives you most functionality. It allows you to lookup key-value tables, and also allows you to search data in indexes.
The combination gives you a lot of flexibility, ideal for your application.
Elassandra is the combined solution of Cassandra + Elastic search , It uses Elastic search to index the data and Cassandra as the data store , i'm not sure about the performance but as per this article , its performance is good.
If your application needs search feature then , Elassandra is the best open source option. DSE search is available but its expensive.
We had developed an application where we used Elasticsearch and Cassandra.
Similar data was stored into Cassandra and indexed into Elasticsearch.
Our application's UI was having features like searches, aggregations, data export, etc.
The back-end microservices were continuously getting huge data (on Kafka topics) and storing it into Cassandra. Once the data is stored into Cassandra, the services would make sure the data is indexed into Elasticsearch.
Cassandra was acting as "Source of truth" for Elasticsearch. In the cases, where reindexing of the ES index was required, we queried Cassandra and reindexed the data into ES.
This solution helped us, as this was very easy to scale and the searches and aggregations were much faster.
Cassandra is great at retrieving data by ID. I don't know much about secondary index performance, but I doubt it's as fast as Elasticsearch. Certainly Elasticsearch wins when it comes to full text search functionality (text analysis, relevancy scoring, etc).
Cassandra wins on update performance, too. Elasticsearch supports updates, but an update is really a reindex + soft delete in an atomic operation.
Cassandra has a very nice replication model (if you need to be extra-fail-safe). Elasticsearch is OK, too, I'm not in the camp that says ES is particularly unreliable (it has issues sometimes, like all software).
Elasticsearch also has aggregations for real-time analytics. And because searches are so fast, analytics on a subset of data will be fast, too.
If your requirements are satisfied well enough by one of them (like here it seems like ES would work well), I would just use one. If you have requirements from both worlds, then you can either:
use one of them and work around the downsides. For example, you may be able to handle many updates with Elasticsearch, but with more shards and more hardware
use both and make sure they're in sync
As elasticsearch is built on Lucene index and if you want to store indexing in elasticsearch it performs best comparing to indexing in Cassandra itself for retrieving the data.
If your requirements are not related to real-time retrieval then you can use elasticsearch as NoSQL database also, there are thoughts that ElasticSearch loses writes & Schema changes are difficult, but if your volume of data is not too big. You can easily achive elasticsearch as a search engine with best indexing along with elasticsearch as aNoSQL database. There are several way that you can prevent it. I have worked on the schema changes in elasticsearch, if your data structure is consistent then it will create any issues.
Being a supporter of ElasticSearch or SOlr. I have worked on both the search engines and i experienced that both the search engines can be used fluently if you configure them correctly.
Only cons that i can think of it, if you are targetting real time result and can't comprosie milliseconds delay in your response. Then its better to take help of other NoSQL databases like cassandra or couchbase.
Cassandra with solr, work better than Cassandra with elasticSearch.

MongoDB Vs Oracle for Real time search

I am building an application where i am tracking user activity changes and showing the activity logs to the users. Here are a few points :
Insert 100 million records per day.
These records to be indexed and available in search results immediately(within a few seconds).
Users can filter records on any of the 10 fields that are exposed.
I think both Mongo and Oracle will not accomplish what you need. I would recommend offloading the search component from your primary data store, maybe something like ElasticSearch:
http://www.elasticsearch.org/
My recommendation is ElasticSearch as your primary use-case is "filter" (Facets in ElasticSearch) and search. Is it written to scale-up (otherwise Lucene is also good) and keeping big data in mind.
100 million records a day sounds like you would need a rapidly growing server farm to store the data. I am not familiar with how Oracle would distribute these data, but with MongoDB, you would need to shard your data based on the fields that your search queries are using (including the 10 fields for filtering). If you search only by shard key, MongoDB is intelligent enough to only hit the machines that contain the correct shard, so it would be like querying a small database on one machine to get what you need back. In addition, if the shard keys can fit into the memory of each machine in your cluster, and are indexed with MongoDB's btree indexing, then your queries would be pretty instant.

Resources