How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability - elasticsearch

I am working on a project with a requirement of coming up with a generic dashboard where a users can do different kinds of grouping, filtering and drill down on different fields. For this we are looking for a search store that allows slice and dice of data.
There would be multiple sources of data and would be storing it in the Search Store. There may be some pre-computation required on the source data which can be done by an intermediate components.
I have looked through several blogs to understand whether ES can be used reliably as a primary datastore too. It mostly depends on the use-case we are looking for. Some of the information about the use case that we have :
Around 300 million record each year with 1-2 KB.
Assuming storing 1 year data, we are today with 300 GB but use-case can go up to 400-500 GB given growth of data.
As of now not sure, how we will push data, but roughly, it can go up to ~2-3 million records per 5 minutes.
Search request are low, but requires complex queries which can search data for last 6 weeks to 6 months.
document will be indexed across almost all the fields in document.
Some blogs say that it is reliable enough to use as a primary data store -
http://chrisberkhout.com/blog/elasticsearch-as-a-primary-data-store/
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
https://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/
And some blogs say that ES have few limitations -
https://www.found.no/foundation/elasticsearch-as-nosql/
https://www.found.no/foundation/crash-elasticsearch/
http://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore
Has anyone used Elastic Search as the sole truth of data without having a primary storage like PostgreSQL, DynamoDB or RDS? I have looked up that ES has certain issues like split brains and index corruption where there can be a problem with the data loss. So, I am looking to know if anyone has used ES and have got into any troubles with the data
Thanks.

Short answer: it depends on your use case, but you probably don't want to use it as a primary store.
Longer answer: You should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues which you should really understand before using it as a primary data store. In addition Aphyr's post on the topic is a good resource.
If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.

It is generally a good idea to design redundant data storage solutions. For example, it could be a fast and reliable approach to first just push everything as flat data to a static storage like s3 then have ES pull and index data from there. If you need more flexibility leveraging some ORM, you could have an RDS or Redshift layer in between. This way the data can always be rebuilt in ES.
It depends on your needs and requirements how you set the balance between redundancy and flexibility/performance. If there's a lot of data involved, you could store the raw data statically and just index some parts of it by ES.
Amazon Lambda offers great features:
Many developers store objects in Amazon S3 while using Amazon DynamoDB
to store and index the object metadata and enable high speed search.
AWS Lambda makes it easy to keep everything in sync by running a
function to automatically update the index in Amazon DynamoDB every
time objects are added or updated from Amazon S3.

Since 2015 when this question was originally posted a lot of resiliency issues have been found and addressed, and in recent years a lot of features and specifically stability and resiliency features have been added, that it's definitely something to consider given the right use-cases and leveraging the right features in the right way.
So as of 2022, my answer to this question is - yes you can, as long as you do it correctly and for the right use-case.

Related

ElasticSearch vs. ElasticSearch+Cassandra

My main question is what is the benefit of integrating Cassandra and Elasticsearch versus using only Elasticsearch?
In fact, there are answers to similar questions on StackOverflow (e.g., here and here). But there are some points:
A lot of answers are old. Much may have changed in these years.
One point that is mentioned is that "Sometimes ElasticSearch loses writes". However, it can be imagined those alleged loses may had been because of some bugs that have been solved in these years. It is assumable that e.g., Cassandra may also have some bugs that cause data loses. Is there any fundamental differences between Cassandra and Elasticsearch that cause Elasticsearch to lose data but doesn't cause it for Cassandra?
It is mentioned that "Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading." This may not be a major problem for us, assuming that our data model is relatively stable or at-least backward-compatible. Also, because of dynamic mapping in Elasticsearch it may adapt itself with the new requirements (e.g., extra fields).
With respect to the indexing delay in Elasticsearch, Cassandra also does not provide consistency. So, in Cassandra you may also face delays in reading the written data.
Overall, what extra features does Cassandra offer when used in conjunction with Elasticsearch?
P.S. It may be better if the question is answered in general. But, if it is necessary, assume that we only append rows to the database and never delete or update anything. We want to be able to do full-text search in the data.
So as the author of one of the linked answers (Elasticsearch vs Cassandra vs Elasticsearch with Cassandra), I suppose that I should weigh in here.
those alleged loses may had been because of some bugs that have been solved in these years.
This is an absolutely true statement. The answer I wrote is almost six years old, and ElasticSearch has grown to be a much more reliable product in that time. That being said, there are some things which Cassandra can do that ElasticSearch just wasn't designed to do (and vice-versa).
what extra features does Cassandra offer...
I can think of a few, which I'll summarize here:
Write throughput/performance/latency
ElasticSearch is a search engine based on the Lucene project. Handling large amounts of write throughput at low latencies is just not something that it was designed to do; at least not "out of the box." There are ways to configure ElasticSearch to be better at this, as described here: Techniques to Achieve High Write Throughput With ElasticSearch. But in terms of building a new cluster with minimal config, you'll spend less time engineering Cassandra to accomplish this.
"Sometimes ElasticSearch loses writes"
Yes, I wrote that. Again, ElasticSearch has improved. A lot. But I still see this happen under high write throughput conditions. When a cluster is engineered for a certain level of throughput, and an application exceeds those tolerances causing a node to become overwhelmed from the write back-pressure, writes will be lost.
Cassandra is not immune to this problem, either. It just has a higher tolerance for it. If you were to use them both together, architecting something like Kafka to "throttle" the write throughput to each would be a good approach.
Multi Data center High Availability (MDHA)
With the ability to define logical data centers and availability zones (racks), Cassandra has always been good at replicating a data set over multiple regions.
This is problematic for ElasticSearch, as it does not have a concept of a logical data center, and its "master" nodes are not active/active.
Peer nodes vs. role-based nodes
As a follow-up to my MDHA point, ElasticSearch now allows for nodes to be designated with a "role" in the cluster. You can specify multiple nodes to act as the "master" role, in-charge of adding and updating indexes. Any node can direct search traffic to the nodes which work under the "data" role. In fact, one way to improve write throughput (my first talking point), is to designate a node or two with the "ingest" role, which can prevent read and write traffic from interfering with each other.
This deviates from Cassandra's approach where every node is a peer, and can handle reads and writes. Being able to treat all nodes the same, simplifies maintenance and administration. And "no," despite popular misconception, a "seed" node not is not anything special.
Query vs. Search
To me, this is the fundamental difference between the two. Querying is not the same as searching. They may seem similar, but they are quite different.
Retrieving data by matching a pattern on one or multiple columns/properties is searching. Also with searching, the number of results is more of an unknown beforehand. Sure, Cassandra has added some features in the last few years to allow for pattern matching based on LIKE queries (I don't recommend its use). But when the ability to "search" a data set is required, Cassandra can't compete with ElasticSearch.
Retrieving data by providing a specific value on a specific key (column) is querying. With querying, it is also easier to have accurate expectations on the number of results to be returned. If I was building an app and I knew that I'd only ever have to retrieve data based on a static, pre-defined query with a specific key, I'd choose Cassandra every time.
With Cassandra, I can also tune query consistency, requiring operational acknowledgement from more or fewer replicas. Likewise, I can also direct those operations to a specific geographic region, based on the locality of the application.
...when used in conjunction with Elasticsearch?
They compliment each other well. Cassandra is good at some things (detailed above) that ElasicSearch is not (and vice-versa...saying that a lot). Requirements for an application may require both searching and querying. Sometimes you've got an app that needs that high-speed key lookup "oh, and we also want search."
Summary, tl;dr;
So while I've written quite a bit here, the main point that I'll keep coming back to, is picking the right tool for the job. When I need to search I'll pick ElasticSearch. When I need to query in a highly-available, geographically-aware scenario, I'll pick Cassandra. I still see applications use both (in tandem), so both have their merits.

Data Store with multiple shard keys

I've been researching several different data store technologies that can be used for storing huge amounts of semi-structured logs for people to search through later. I've looked at cassandra, riak, and elastic search so far, and it seems like elastic search offers the closest fit for what I'm interested in (largely because it indexes everything transparently). However, there's one feature that I'm interested in that seems to escape them all, and I was wondering if there is a data store with this feature.
What I'm thinking about is the ability to transparent shard on multiple key. To be clear, I'm not talking about using a composite key for sharding. I mean that if you had a table that was sharded by user_id, time_of_creation, and ip_address, and you inserted a row, three copies of that row would be created, each one in a different cluster that's sharded by a different key (or maybe they could all somehow actually be in the same cluster. That important part is that the data would be duplicated). And when you wanted to query this table later, the data store would transparently choose which cluster to use.
In the articles I've read about Cassandra, people often recommend doing something like this, but it's definitely a manual process in at least three ways:
For insertion, you have you insert into each table yourself.
When it comes to querying, you have to figure out which table you want to query (you need to pick the one that uses the right cluster key).
And, if you ever want to add another key to shard on, you have to write a routine the existing data into the new table.
Although I was using cassandra as an example, I believe that the situation with riak and elastic search is similar. I understand that a data store that offered this ability would probably have to make huge trades to do so. Updating/deleting might no longer be possible (or it would have extremely poor performance), and consistency would suffer. But, it's a set of trades that I find acceptable when dealing with logs, so I was wondering if anyone is familiar with a technology that offers this feature.

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like memcached or elastic cache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative, caching infrastructure is often measured in single-digit millisecond range, same for inserts. These search engines are at least measured in 10's of milliseconds for reads, and much higher for writes.
I've heard of setups where ES was used for what is it really good for: full context search and used in parallel with a secondary storage. In these setups data was not stored (but it can be) - "store": "no" - and after searching with ES in its indices, the actual records were retrieved from the second storage level - usually a RDBMS - given that ES was holding a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with whatever secondary storage gives in you in terms of speed and "search" in general I don't see why you couldn't setup an ES cluster to give you the missing piece.
The disadvantage here is the time spent architecting the ES data structure because ES is not as good as a RDBMS at representing relationships. And it really doesn't need to, its main job and purpose is different. And is, actually, happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping in sync the two storage systems which will require some thinking ahead. But, once the initial setup and architecture is in place, it should be easy afterwards.
the only recommended way of using a search engine is to create indices that match your most frequently accessed denormalised data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
Recommended thing to add cache for there - statistics for "aggregated" queries - "Top 100 hotels in Europe", as a good example of it.
May be you can consider in-memory lucene indexes, instead of SOLR or elasticsearch. Here is an example

need a solution for archiving logs and having real-time search functionality

I've been considering following options.
senseidb [http://www.senseidb.com] This needs a fixed schema also data gateways. So there is no simple way to push data but provide data streams. My data is unstuctured and there are very few common attributes across all kinds of logs
riak[http://wiki.basho.com/Riak-Search.html]
vertica - cost factor?
Hbase(+Hadoop ecosystem +lucene) - main cons here are on single machine this wont make much sense and am not sure about free text search capability to be built around this
Main requirements are
1. it has to sustain thousands of incoming request for archival and at the same time build real-time index which will allow end user to do free-text search
storage (log archives + index ) has to be optimal
There are number of specialized log storage and indexing, I don't know that I'd cram logs into a normal data store necessarily.
If you have lots of money, it's tough to beat Splunk.
If you'd prefer an open source option, check out the ServerFault discussion. logstash + ElasticSearch seems to be a really strong choice, and should grow pretty well as your logs do.
Have you given a thought on the line of these implementation. It might be helpful to integrate Lucene and Hadoop for you problem.
http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/
So instead of email, your use case could use the log files and the parameters to index.
For the 2-3 TB of data sounds like a "in the middle" case. If it is all the data I would not suggest going into BigData / NoSQL venture.
I think RDBMS with full text search capability should do on good hardware. I would suggest to do some aggressive partitioning by time to be able to work with 2-3 TB data. Without partitioning it would be too mach. In the same time - if your data will be partitioned by days i think data size will be fine for MySQL.
Taking to the account the comment below that data size is about 10-15TB, and taking into account that need for some replication will multiply this number x2-x3. We also should consider size of indexes which I would estimate as dozens percents from the data size. Probably efficient single node solution might be more expensive then clustering mostly because of licensing costs.
In best of my understanding existing Hadoop/NoSQL solutions can not answer your requirements out of the box, mostly because of number of documents to be indexed. In out case - each log is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think solution will be in aggregating logs for some period of time together, and threating it as one document.
For the storage of these logs packages HDFS or Swift could be a good solutions.

Hadoop as document store database

We have a large document store currently running at 3TB in space and it increments by 1 TB every six months. They are currently stored in a windows file system which has at times caused problems in terms of access and retrieval. We are looking to exploit a Hadoop based document store database. Is it a good idea to go ahead with Hadoop? Anyone has any exposure to the same? What can be the challenges, technology roadblocks in achieving the same?
Hadoop is more for batch processing that high data access. You should have a look at some NoSQL systems, like document oriented databases. Hard to answer without knowing what your data is like.
The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).
You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery. Obviously files in a folder are a bit different.
Also remember than in some systems such as HBase there is no indexing concept (I'm gussing you have file indexing setup on this windows FS document store). All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.
So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. THe work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing Oracle on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.But again depending on your data (are the files word docs or text docs with info about products, invoices or instruments or something)...
Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at;
MongoDB - Document - CP
CouchDB - Document - AP
Cassandra - Column Family - Available & Partition Tolerant (AP)
VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).
Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.
HDFS does not sound to be right solution. It is optimized for massive parralel processing of the data and not to be general purpose file system.
Specifically it has following limitations making it probabbly bad choice:
a) It is sensitive to the number of files. Practical limit should be about dozens of millions of files.
b) The files are read only, and can only be appended, but not edited. It is fine for analytical data processing but might not suite your need.
c) It has single point of failure - namenode. So its reliability is limited.
If you need system with comparable scalability, but not sensitive to number of files I would suggest OpenStack's Swift. It also does not have SPOF.
My suggestion is you can buy a NAS storage. May be EMS isilon kind of product you can consider.
Hadoop HDFS is not for file storage. It is storage to processing the data (for reports, analytics..)
NAS is for file sharing
SAN is more for a database
http://www.slideshare.net/jabramo/emc-sanoverviewpresentation
Declaration: I am not a EMC person, so you can consider any product. I just used EMC for reference.

Resources