Is there any issue if I use ElasticSearch instead of a relational database?

As the title says: if I perform CRUD operations directly against Elasticsearch, without a relational database (MySQL/PostgreSQL), are there any issues?
I know Elasticsearch is good at searching, but if I update data frequently, could performance suffer?
And if every update request uses setRefreshPolicy(IMMEDIATE), could that also hurt performance?

ElasticSearch will likely outperform a relational database on similar hardware, though workloads vary. However, ElasticSearch can do this because it has made certain design decisions that are different from the design decisions of a relational database.
ElasticSearch is eventually consistent. This means that queries immediately after your insert might still get old results. There are things that can be done to mitigate this but nothing will eliminate the possibility.
Prior to version 5.x, ElasticSearch was pretty good at losing data when bad things happened. The 5.x release was all about making Elastic more robust in that regard, and data loss is no longer the problem it once was, though the potential for data loss still exists, particularly if you make configuration mistakes.
If you frequently modify documents in ElasticSearch, you will generate large numbers of deleted documents: every update writes a new document and marks the old one as deleted. Over time those old documents are cleaned up by segment merges, or you can force the system to clean them out, but if you are doing rapid modifications this could present a problem for you.
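The tombstone effect described above can be sketched with plain Java (a toy model, not the real Lucene machinery): an update never rewrites a document in place; it appends a new version and only marks the old one deleted, so rapid modifications pile up tombstones until a merge reclaims them.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Lucene's append-only segments. Illustrative only:
// the class and method names here are hypothetical.
public class SegmentSketch {
    private final Map<String, Integer> liveDocs = new HashMap<>(); // id -> slot
    private int appended = 0;   // total documents ever written
    private int deleted = 0;    // tombstoned old versions awaiting a merge

    public void index(String id) {
        if (liveDocs.containsKey(id)) {
            deleted++;          // old version is only *marked* deleted
        }
        liveDocs.put(id, appended++);
    }

    public int deletedDocs() { return deleted; }

    public void forceMerge() { deleted = 0; } // a merge reclaims tombstones

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        index.index("doc1");
        for (int i = 2; i <= 100; i++) {
            index.index("doc1");          // 99 rapid updates to one document
        }
        System.out.println(index.deletedDocs()); // 99 tombstones
        index.forceMerge();
        System.out.println(index.deletedDocs()); // 0 after the merge
    }
}
```

The point of the sketch: the cost of an update-heavy workload is not the update itself but the accumulating tombstones that merges must later clean up.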

The application I am working on uses Elasticsearch as the backend. There are 9 microservices connecting to this backend. Writes are fewer compared to reads. Our write APIs have a performance requirement of at most 3 seconds.
We have configured a 1-second refresh interval and always use WAIT_FOR instead of IMMEDIATE, occasionally using NONE in the case of asynchronous updates.
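The trade-off between the refresh policies above can be sketched with a toy model in plain Java (illustrative only, not the Elasticsearch client API): with NONE, many writes share one periodic refresh, while IMMEDIATE pays for a refresh on every single write, which is what makes it expensive under load.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the refresh cycle: documents go into a buffer and only
// become searchable after a refresh. Hypothetical names throughout.
public class RefreshSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> searchable = new ArrayList<>();
    private int refreshCount = 0;

    public void indexNone(String doc) {      // RefreshPolicy.NONE
        buffer.add(doc);                     // waits for the periodic refresh
    }

    public void indexImmediate(String doc) { // RefreshPolicy.IMMEDIATE
        buffer.add(doc);
        refresh();                           // one refresh per write: costly
    }

    public void refresh() {                  // normally runs once per second
        searchable.addAll(buffer);
        buffer.clear();
        refreshCount++;
    }

    public static void main(String[] args) {
        RefreshSketch none = new RefreshSketch();
        for (int i = 0; i < 1000; i++) none.indexNone("doc" + i);
        none.refresh();                       // the single periodic refresh
        System.out.println(none.refreshCount);    // 1 refresh for 1000 writes

        RefreshSketch immediate = new RefreshSketch();
        for (int i = 0; i < 1000; i++) immediate.indexImmediate("doc" + i);
        System.out.println(immediate.refreshCount); // 1000 refreshes
    }
}
```

WAIT_FOR sits between the two: the write blocks until the next scheduled refresh happens, so it gets read-your-writes visibility without forcing extra refreshes.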

Related

ElasticSearch vs. ElasticSearch+Cassandra

My main question is what is the benefit of integrating Cassandra and Elasticsearch versus using only Elasticsearch?
In fact, there are answers to similar questions on StackOverflow (e.g., here and here). But there are some points:
A lot of the answers are old. Much may have changed in the intervening years.
One point that is mentioned is that "Sometimes ElasticSearch loses writes". However, those alleged losses may have been caused by bugs that have since been fixed. It is plausible that Cassandra also has bugs that cause data loss. Are there any fundamental differences between Cassandra and Elasticsearch that cause Elasticsearch to lose data but do not cause it for Cassandra?
It is mentioned that "Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading." This may not be a major problem for us, assuming our data model is relatively stable or at least backward-compatible. Also, because of dynamic mapping, Elasticsearch may adapt itself to new requirements (e.g., extra fields).
With respect to the indexing delay in Elasticsearch: Cassandra does not guarantee immediate consistency either, so in Cassandra you may also face delays in reading newly written data.
Overall, what extra features does Cassandra offer when used in conjunction with Elasticsearch?
P.S. It would be better if the question were answered in general terms. But if necessary, assume that we only append rows to the database and never delete or update anything, and that we want to be able to do full-text search on the data.
So as the author of one of the linked answers (Elasticsearch vs Cassandra vs Elasticsearch with Cassandra), I suppose that I should weigh in here.
those alleged losses may have been caused by bugs that have since been fixed.
This is an absolutely true statement. The answer I wrote is almost six years old, and ElasticSearch has grown to be a much more reliable product in that time. That being said, there are some things which Cassandra can do that ElasticSearch just wasn't designed to do (and vice-versa).
what extra features does Cassandra offer...
I can think of a few, which I'll summarize here:
Write throughput/performance/latency
ElasticSearch is a search engine based on the Lucene project. Handling large amounts of write throughput at low latencies is just not something that it was designed to do; at least not "out of the box." There are ways to configure ElasticSearch to be better at this, as described here: Techniques to Achieve High Write Throughput With ElasticSearch. But in terms of building a new cluster with minimal config, you'll spend less time engineering Cassandra to accomplish this.
"Sometimes ElasticSearch loses writes"
Yes, I wrote that. Again, ElasticSearch has improved. A lot. But I still see this happen under high write throughput conditions. When a cluster is engineered for a certain level of throughput and an application exceeds those tolerances, causing a node to become overwhelmed by write back-pressure, writes will be lost.
Cassandra is not immune to this problem, either. It just has a higher tolerance for it. If you were to use them both together, architecting something like Kafka to "throttle" the write throughput to each would be a good approach.
Multi-Data-Center High Availability (MDHA)
With the ability to define logical data centers and availability zones (racks), Cassandra has always been good at replicating a data set over multiple regions.
This is problematic for ElasticSearch, as it does not have a concept of a logical data center, and its "master" nodes are not active/active.
Peer nodes vs. role-based nodes
As a follow-up to my MDHA point, ElasticSearch now allows nodes to be designated with a "role" in the cluster. You can designate multiple nodes for the "master" role, in charge of adding and updating indexes. Any node can direct search traffic to the nodes working under the "data" role. In fact, one way to improve write throughput (my first talking point) is to designate a node or two with the "ingest" role, which can prevent read and write traffic from interfering with each other.
This deviates from Cassandra's approach, where every node is a peer and can handle reads and writes. Being able to treat all nodes the same simplifies maintenance and administration. And "no," despite popular misconception, a "seed" node is not anything special.
Query vs. Search
To me, this is the fundamental difference between the two. Querying is not the same as searching. They may seem similar, but they are quite different.
Retrieving data by matching a pattern on one or more columns/properties is searching. With searching, the number of results is more of an unknown beforehand. Sure, Cassandra has added features in the last few years to allow pattern matching via LIKE queries (I don't recommend their use), but when the ability to "search" a data set is required, Cassandra can't compete with ElasticSearch.
Retrieving data by providing a specific value on a specific key (column) is querying. With querying, it is also easier to have accurate expectations about the number of results to be returned. If I were building an app and I knew that I'd only ever have to retrieve data based on a static, pre-defined query with a specific key, I'd choose Cassandra every time.
With Cassandra, I can also tune query consistency, requiring operational acknowledgement from more or fewer replicas. Likewise, I can also direct those operations to a specific geographic region, based on the locality of the application.
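The query-versus-search distinction above can be sketched with plain Java collections (hypothetical data, not a real Cassandra or ElasticSearch API): a query is an exact-key lookup with a predictable result count, while a search is a pattern match over values with an unknown result count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryVsSearch {
    public static void main(String[] args) {
        Map<String, String> byKey = new HashMap<>();
        byKey.put("user-1", "Grace Hopper");
        byKey.put("user-2", "Ada Lovelace");
        byKey.put("user-3", "Alan Turing");

        // Query: one known key, at most one result (Cassandra's sweet spot).
        String result = byKey.get("user-2");
        System.out.println(result);

        // Search: pattern match over all values, result count unknown up front
        // (a full scan stands in for ElasticSearch's inverted index here).
        List<String> hits = byKey.values().stream()
                .filter(name -> name.toLowerCase().contains("ace"))
                .sorted()
                .collect(Collectors.toList());
        System.out.println(hits.size());
    }
}
```

A real search engine avoids the full scan by pre-building an inverted index from terms to documents, but the shape of the two operations, and why one is easy to plan capacity for and the other is not, is the same.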
...when used in conjunction with Elasticsearch?
They complement each other well. Cassandra is good at some things (detailed above) that ElasticSearch is not (and vice-versa... I'm saying that a lot). Requirements for an application may call for both searching and querying. Sometimes you've got an app that needs that high-speed key lookup, "oh, and we also want search."
Summary (tl;dr)
So while I've written quite a bit here, the main point I'll keep coming back to is picking the right tool for the job. When I need to search, I'll pick ElasticSearch. When I need to query in a highly-available, geographically-aware scenario, I'll pick Cassandra. I still see applications using both in tandem, so both have their merits.

CQRS (Lagom) elasticsearch read-side

I've read that ElasticSearch isn't the most reliable in terms of durability, but I would like to use it to store data on the read-side for optimal searching.
If we store events (write-side) in a cassandra database, that means that data is never really lost.
I don't really understand what is meant with 'data durability'.
If we use ES on the read side, does that mean some data may not be properly imported? Does it mean that one day data may randomly be lost, or that all data may one day simply disappear?
The use case is a Twitter-like geolocation based app.
How reliable is it in the end to use ES exclusively on the read-side, without needing a more reliable datastore (write-side) to store the data?
Depending on what is meant with this "durability", I wonder what measures should be taken to replay events and keep ES consistent at all times.
Thanks
I don't have a huge amount of experience running ES in production, but essentially: ensuring that when you persist data it stays persisted, especially in a distributed system, is hard. There are many, many edge cases that are very hard to get right, and it takes time for a database to mature and sort those edge cases out. A less durable database is one that probably hasn't ironed all of these issues out yet.
Of course, ElasticSearch is a popular open-source database with a thriving community maintaining it, so there are likely no well-defined cases where "your data will be lost in this circumstance." Rather, there are likely cases that either haven't been encountered yet, or that were encountered in the wild by users who didn't care enough to debug them, because they were only using ES as a secondary data store and could rebuild it from their primary store. Whenever a case is identified in which ES loses data under well-understood circumstances, the maintainers of ES are quick to fix it.
The most typical use cases for ES are as a secondary database store, and in such a use case, durability isn't as important because the data store can be rebuilt from the primary. Accordingly, you'll find durability isn't as high a priority to the maintainers of ES because their users aren't asking for it - that's not say it's not a high priority, just relative to other databases, it's not as high.
So, if you use ES, you've got a higher chance of encountering bugs where you'll lose data, than with other databases that are either more mature or put more of a focus on durability in their development.
As to whether you should regularly drop your ES database and replay the events, it really depends on your use case and how important it is for your ES database to be consistent. A lot of the edge cases around ES's durability probably result in major corruption with significant data loss, i.e., you'll know if it happens, so there's no need to drop and replay regularly in that case. Another thing to consider is that, because of the way CQRS read sides work, you'll only have a limited number of writers to your ES store, and you can easily control that concurrency. This means a spike in load won't result in a spike in concurrent writers; instead, your ES store might temporarily lag behind your primary store in consistency. Due to this, you're probably less likely to encounter the edge cases that might trigger ES to lose data.
So you're probably fine not dropping and rebuilding unless something catastrophic happens, except when the consequences of silently losing small amounts of data in a way you won't notice are so high that even an incredibly small chance of it is unacceptable.
I know this topic is more than 3 years old, but I am also using Elasticsearch for the read side of CQRS. I think other platforms fit the write side better; it is not just a question of database technology, since today's event-sourced paradigm demands more. I am using Akka's Finite State Machine with Cassandra, which in my opinion handles that sort of extreme write load better than Elasticsearch.
I wrote a blog about it, if anybody likes to see, Write Side for Elasticsearch CQRS

Jedis 'front end' for ES

I just started learning about Redis. I installed it on my laptop and wrote a simple Java client. I have an Elasticsearch instance that handles queries coming in from a web-based application. It's pretty fast, but I'm wondering whether there is a practical case where I could 'front' the Elasticsearch instance with Redis to speed up response time for clients. With my very limited Redis knowledge, I'm wondering whether storing the responses from ES queries in Redis would be practical, or would provide any value. More generally, can someone give me an example of how ES and Redis are used together? Thanks
One use case for having Redis in the picture is to use it as temporary buffer when loading documents into Elasticsearch via Logstash.
Since Redis is basically a cache, its main purpose is to make data available quickly when it would not be promptly available otherwise, because the back-end service you're querying is not fast enough. Since you say your Elasticsearch instance is "pretty fast" (whatever that means), why would you want to cache the response?
Also, when you put a cache into the picture, new concerns arise, most importantly how you expire the cache, when, and at what frequency. So if your data in Elasticsearch is fairly stable, you might benefit from a cache. However, if your data in Elasticsearch changes frequently, you'll often face stale data in your Redis cache, and that's a problem you don't want to have.
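The expiry concern above can be sketched with a minimal TTL cache in plain Java (hypothetical names throughout; a real setup would use Jedis's SETEX/GET against a Redis server rather than an in-process map):

```java
import java.util.HashMap;
import java.util.Map;

// Redis-style cache entry with a time-to-live, using an explicit clock
// parameter so the behavior is deterministic. Illustrative sketch only.
public class TtlCacheSketch {
    private static class Entry {
        final String value;
        final long expiresAtMillis;
        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Analogous to Redis SETEX: store a value with a TTL.
    public void setex(String key, long ttlMillis, String value, long nowMillis) {
        cache.put(key, new Entry(value, nowMillis + ttlMillis));
    }

    // Analogous to GET: a stale entry behaves as a miss.
    public String get(String key, long nowMillis) {
        Entry e = cache.get(key);
        if (e == null || nowMillis >= e.expiresAtMillis) return null;
        return e.value;
    }

    public static void main(String[] args) {
        TtlCacheSketch cache = new TtlCacheSketch();
        long t0 = 0;
        cache.setex("es:query:top10", 5_000, "{\"hits\":[]}", t0);
        System.out.println(cache.get("es:query:top10", t0 + 1_000) != null); // fresh hit
        System.out.println(cache.get("es:query:top10", t0 + 6_000) != null); // expired
    }
}
```

The TTL is exactly the knob the answer warns about: too long and clients see stale ES results, too short and the cache stops paying for itself.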
In my opinion, it's much better to spend time improving your ES queries and mappings to deliver blazing-fast data than to spend time tuning a cache that might be useful 1% of the time.

How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability

I am working on a project that requires a generic dashboard where users can do different kinds of grouping, filtering, and drill-down on different fields. For this we are looking for a search store that allows slicing and dicing of data.
There will be multiple sources of data, which we will store in the search store. Some pre-computation may be required on the source data, which can be done by an intermediate component.
I have looked through several blogs to understand whether ES can be used reliably as a primary datastore. It mostly depends on the use case. Some information about our use case:
Around 300 million records each year, 1-2 KB each.
Assuming we store 1 year of data, we are at 300 GB today, but the use case can go up to 400-500 GB given data growth.
We are not yet sure how we will push data, but roughly it can go up to ~2-3 million records per 5 minutes.
Search requests are low in volume but require complex queries that can search data from the last 6 weeks to 6 months.
Documents will be indexed across almost all of their fields.
Some blogs say that it is reliable enough to use as a primary data store -
http://chrisberkhout.com/blog/elasticsearch-as-a-primary-data-store/
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
https://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/
And some blogs say that ES have few limitations -
https://www.found.no/foundation/elasticsearch-as-nosql/
https://www.found.no/foundation/crash-elasticsearch/
http://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore
Has anyone used Elasticsearch as the sole source of truth, without a primary store like PostgreSQL, DynamoDB or RDS? I have read that ES has had issues like split-brain and index corruption, where data loss is possible. So I am looking to hear whether anyone has used ES this way and run into trouble with their data.
Thanks.
Short answer: it depends on your use case, but you probably don't want to use it as a primary store.
Longer answer: You should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues which you should really understand before using it as a primary data store. In addition Aphyr's post on the topic is a good resource.
If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.
It is generally a good idea to design redundant data storage solutions. For example, it could be fast and reliable to first push everything as flat data to static storage like S3, then have ES pull and index the data from there. If you need more flexibility, leveraging some ORM, you could put an RDS or Redshift layer in between. This way the data can always be rebuilt in ES.
It depends on your needs and requirements how you set the balance between redundancy and flexibility/performance. If there's a lot of data involved, you could store the raw data statically and just index some parts of it by ES.
AWS Lambda offers useful features here:
Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and enable high speed search. AWS Lambda makes it easy to keep everything in sync by running a function to automatically update the index in Amazon DynamoDB every time objects are added or updated in Amazon S3.
Since 2015, when this question was originally posted, a lot of resiliency issues have been found and addressed, and in recent years many features, specifically stability and resiliency features, have been added, so it's definitely something to consider given the right use cases and when leveraging the right features in the right way.
So as of 2022, my answer to this question is: yes you can, as long as you do it correctly and for the right use case.

Hibernate Search Automatic Indexing

I am working on developing an application that caters to about 100,000 searches every day. We can safely assume about the same number of updates/insertions/deletions in the database daily. The current application uses native SQL, and we intend to migrate it to Hibernate and use Hibernate Search.
As there are continuous changes to the database records, we need to enable automatic indexing. Management has concerns about the performance impact automatic indexing could cause.
Scheduled batch indexing is not an option, as the changes to records have to be available for search as soon as they are made.
I have searched for performance statistics of this kind but have found none.
Can anybody who has already worked on Hibernate Search and faced a similar situation share their thoughts?
Thanks for the help.
Regards,
Shardul.
It might work fine, but it's hard to guess without a baseline. I have experience with even more searches per day, and after some fine-tuning it works well, but it's impossible to know whether that will apply to your scenario without trying it out.
If normal tuning fails and NRT doesn't prove fast enough, you can always shard the indexes, use a multi-master configuration, and plug in a distributed second-level cache such as Infinispan: combined, this architecture can achieve linear scalability, provided you have the time to set it up and reasonable hardware.
It's hard to say what kind of hardware you will need, but it's a safe bet that it will be more efficient than native SQL solutions. I would suggest making a POC and seeing how far you can get on a single node; if your kind of queries are a good fit for Lucene, you might not need more than a single server. Beware that Lucene is much faster at queries than at updates, so since you estimate the same number of writes and searches, the problem is unlikely to be in searches/second but rather in writes (updates)/second and total data (index) size. The latest Hibernate Search introduced an NRT index manager, which suits such use cases well.
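For reference, the NRT index manager mentioned above is enabled per index through configuration. A minimal sketch in Hibernate Search 4.x/5.x-style properties follows (check the exact property names against the version you use; `indexBase` is a placeholder path):

```properties
# Use the near-real-time index manager for all indexes
hibernate.search.default.indexmanager = near-real-time
# Keep the index on the local filesystem; NRT serves recent writes from the
# IndexWriter's in-memory buffer before they are committed to disk
hibernate.search.default.directory_provider = filesystem
hibernate.search.default.indexBase = /var/lucene/indexes
```

The trade-off is the usual NRT one: searches see fresh writes with low latency, but uncommitted changes can be lost on a crash, which is acceptable here because the database remains the source of truth and the index can be rebuilt.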
