We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative. Caching infrastructure is often measured in the single-digit millisecond range, for reads and inserts alike. These search engines are typically measured in tens of milliseconds for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good for, full-text search, in parallel with a secondary storage. In these setups the data itself was not stored in ES (though it can be) - "store": "no" - and after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, because ES held a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to give you the missing piece.
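To make that pattern concrete, here is a minimal sketch in Python. It is not a real Elasticsearch client: `SearchIndex` is a toy in-memory inverted index standing in for ES, the `records` dict stands in for the RDBMS, and all names are made up for illustration.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index standing in for the search engine (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of record IDs

    def index(self, record_id, text):
        # Only the ID is kept in the index; the text itself is not stored.
        for term in text.lower().split():
            self._postings[term].add(record_id)

    def search(self, query):
        # Return the IDs of records that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# Stand-in for the primary store (an RDBMS table keyed by ID).
records = {
    1: {"id": 1, "title": "Blue cotton shirt"},
    2: {"id": 2, "title": "Red cotton scarf"},
    3: {"id": 3, "title": "Blue wool hat"},
}

index = SearchIndex()
for rec in records.values():
    index.index(rec["id"], rec["title"])


def search_records(query):
    # Step 1: the search layer returns only IDs.
    # Step 2: the full rows are fetched from the primary store by ID.
    return [records[i] for i in sorted(index.search(query))]
```

The point of the split is that the index can be thrown away and rebuilt from the primary store at any time, since it never holds the only copy of anything.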
The disadvantage here is the time spent architecting the ES data structure, because ES is not as good as an RDBMS at representing relationships. It doesn't really need to be: its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalised data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
A good candidate for a separate cache there is the statistics for "aggregated" queries, "Top 100 hotels in Europe" being a good example.
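As a rough illustration of caching an expensive aggregated query in front of the search engine, here is a small time-to-live cache in Python. The class name, the `clock` parameter and the key format are my own assumptions, not part of any library; in production you would more likely use memcached or Redis for this.

```python
import time


class TTLCache:
    """Minimal time-to-live cache for expensive aggregated queries (sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the expensive query
        value = compute()            # e.g. run the aggregation on the search engine
        self._entries[key] = (now + self._ttl, value)
        return value
```

Usage would look like `cache.get_or_compute("top100:hotels:europe", run_aggregation)`, where `run_aggregation` is whatever function actually hits the search engine.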
Maybe you can consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example
My main question is what is the benefit of integrating Cassandra and Elasticsearch versus using only Elasticsearch?
In fact, there are answers to similar questions on StackOverflow (e.g., here and here). But there are some points:
A lot of answers are old. Much may have changed in these years.
One point that is mentioned is that "Sometimes ElasticSearch loses writes". However, those alleged losses may have been because of some bugs that have been solved in these years. It is conceivable that Cassandra, too, has bugs that cause data loss. Are there any fundamental differences between Cassandra and Elasticsearch that cause Elasticsearch to lose data but not Cassandra?
It is mentioned that "Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading." This may not be a major problem for us, assuming that our data model is relatively stable or at least backward-compatible. Also, because of dynamic mapping in Elasticsearch, it may adapt itself to new requirements (e.g., extra fields).
With respect to the indexing delay in Elasticsearch, Cassandra also does not guarantee strong consistency by default. So, in Cassandra you may also face delays in reading the written data.
Overall, what extra features does Cassandra offer when used in conjunction with Elasticsearch?
P.S. It may be better if the question is answered in general. But, if it is necessary, assume that we only append rows to the database and never delete or update anything. We want to be able to do full-text search in the data.
So as the author of one of the linked answers (Elasticsearch vs Cassandra vs Elasticsearch with Cassandra), I suppose that I should weigh in here.
those alleged losses may have been because of some bugs that have been solved in these years.
This is an absolutely true statement. The answer I wrote is almost six years old, and ElasticSearch has grown to be a much more reliable product in that time. That being said, there are some things which Cassandra can do that ElasticSearch just wasn't designed to do (and vice-versa).
what extra features does Cassandra offer...
I can think of a few, which I'll summarize here:
Write throughput/performance/latency
ElasticSearch is a search engine based on the Lucene project. Handling large amounts of write throughput at low latencies is just not something that it was designed to do; at least not "out of the box." There are ways to configure ElasticSearch to be better at this, as described here: Techniques to Achieve High Write Throughput With ElasticSearch. But in terms of building a new cluster with minimal config, you'll spend less time engineering Cassandra to accomplish this.
"Sometimes ElasticSearch loses writes"
Yes, I wrote that. Again, ElasticSearch has improved. A lot. But I still see this happen under high write throughput conditions. When a cluster is engineered for a certain level of throughput, and an application exceeds those tolerances causing a node to become overwhelmed from the write back-pressure, writes will be lost.
Cassandra is not immune to this problem, either. It just has a higher tolerance for it. If you were to use them both together, architecting something like Kafka to "throttle" the write throughput to each would be a good approach.
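To show what I mean by "throttling", here is a small Python sketch. It does not use the Kafka API at all: `WriteBuffer` is a toy in-memory stand-in for a durable log such as a Kafka topic, and `drain` plays the role of a consumer that writes to the datastore at a pace it can sustain. All names are hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Toy stand-in for a durable log such as a Kafka topic (hypothetical)."""

    def __init__(self):
        self._queue = deque()

    def publish(self, doc):
        # Producers append as fast as they like; nothing is dropped here.
        self._queue.append(doc)

    def poll(self, max_records):
        # The consumer, not the producer, decides how much to take per poll.
        batch = []
        while self._queue and len(batch) < max_records:
            batch.append(self._queue.popleft())
        return batch

    def __len__(self):
        return len(self._queue)


def drain(buffer, sink, batch_size):
    """Write to the datastore in bounded batches; returns the batch count."""
    batches = 0
    while True:
        batch = buffer.poll(batch_size)
        if not batch:
            return batches
        sink(batch)  # e.g. a bulk write to ElasticSearch or Cassandra
        batches += 1
```

A burst from the application then only makes the buffer longer; the datastore still sees batches of at most `batch_size` documents.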
Multi Data center High Availability (MDHA)
With the ability to define logical data centers and availability zones (racks), Cassandra has always been good at replicating a data set over multiple regions.
This is problematic for ElasticSearch, as it does not have a concept of a logical data center, and its "master" nodes are not active/active.
Peer nodes vs. role-based nodes
As a follow-up to my MDHA point, ElasticSearch now allows for nodes to be designated with a "role" in the cluster. You can specify multiple nodes to act as the "master" role, in charge of adding and updating indexes. Any node can direct search traffic to the nodes which work under the "data" role. In fact, one way to improve write throughput (my first talking point) is to designate a node or two with the "ingest" role, which can prevent read and write traffic from interfering with each other.
This deviates from Cassandra's approach where every node is a peer, and can handle reads and writes. Being able to treat all nodes the same simplifies maintenance and administration. And "no," despite popular misconception, a "seed" node is not anything special.
Query vs. Search
To me, this is the fundamental difference between the two. Querying is not the same as searching. They may seem similar, but they are quite different.
Retrieving data by matching a pattern on one or multiple columns/properties is searching. Also with searching, the number of results is more of an unknown beforehand. Sure, Cassandra has added some features in the last few years to allow for pattern matching based on LIKE queries (I don't recommend its use). But when the ability to "search" a data set is required, Cassandra can't compete with ElasticSearch.
Retrieving data by providing a specific value on a specific key (column) is querying. With querying, it is also easier to have accurate expectations on the number of results to be returned. If I was building an app and I knew that I'd only ever have to retrieve data based on a static, pre-defined query with a specific key, I'd choose Cassandra every time.
With Cassandra, I can also tune query consistency, requiring operational acknowledgement from more or fewer replicas. Likewise, I can also direct those operations to a specific geographic region, based on the locality of the application.
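The arithmetic behind tunable consistency is simple enough to sketch. The helper names below are mine, not part of any driver; the rule itself is the usual one: a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts add up to more than the replication factor.

```python
def quorum(replication_factor):
    """Number of replicas in a QUORUM: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


# With a replication factor of 3, QUORUM means 2 replicas.
rf = 3
q = quorum(rf)

quorum_both = reads_see_latest_write(rf, write_acks=q, read_acks=q)
one_both = reads_see_latest_write(rf, write_acks=1, read_acks=1)
```

So writing and reading at QUORUM gives you read-your-writes behaviour, while ONE/ONE trades that away for lower latency.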
...when used in conjunction with Elasticsearch?
They complement each other well. Cassandra is good at some things (detailed above) that ElasticSearch is not (and vice-versa...saying that a lot). Requirements for an application may require both searching and querying. Sometimes you've got an app that needs that high-speed key lookup "oh, and we also want search."
Summary, tl;dr;
So while I've written quite a bit here, the main point that I'll keep coming back to, is picking the right tool for the job. When I need to search I'll pick ElasticSearch. When I need to query in a highly-available, geographically-aware scenario, I'll pick Cassandra. I still see applications use both (in tandem), so both have their merits.
I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.
For datasets of 1B-10B points, what limitations (if any) would ElasticSearch have, or would it even be possible? For example, could it support a feature set like that of Google Analytics?
Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.
With Elasticsearch there is no black-and-white answer to a question as open-ended as yours. The number of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs, etc. Two things are certain from the data you mentioned: you need dedicated master nodes and, more importantly, good client nodes, and depending on the queries and the number of concurrent searches you will need more or fewer of them.
In Elasticsearch 5 the client node is called a coordinating node, but it has the same role. One limitation I can think of is the heap memory of such a coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB due to the longer garbage collection cycles of the JVM (the more memory there is to clean, the longer it takes and the longer the node is unusable). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.
I said that you most likely will need coordinating nodes because heavy aggregations (what will probably be the most used feature in an analytics platform) will use cpu and memory in the final phase of a query where it gathers the results from all shards involved and performs a final sorting and aggregation. Thus it will need more memory than a normal data node would only for aggregations.
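To give an idea of what that final phase looks like, here is a simplified Python sketch of merging per-shard partial counts into a global top-N. This is only a model of the idea, not how Elasticsearch implements it internally; the function name and data shapes are assumptions.

```python
from collections import Counter


def merge_shard_counts(shard_results, top_n):
    """Combine per-shard {key: count} maps and return the global top N."""
    total = Counter()
    for partial in shard_results:
        # Every distinct key from every shard is held in memory here,
        # which is why this step is heavy on the coordinating node.
        total.update(partial)
    # Sort by count (descending), then by key so ties are deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


shard_results = [
    {"/home": 3, "/pricing": 1},
    {"/home": 1, "/docs": 5},
    {"/pricing": 2},
]
top_pages = merge_shard_counts(shard_results, top_n=2)
```

The memory needed grows with the number of distinct keys across all shards, not with the size of the final result, which is where reckless aggregations hurt.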
I doubt though that a single aggregation will use so many GBs of memory, but it could theoretically do so if the query/aggregation being used is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use, you might need more or fewer coordinating nodes so that the GC cycles are not very frequent.
Bottom line: I think this is possible, but some common sense is needed (see my comment about reckless aggregations), along with load estimates that are as close to reality as possible.
Google Analytics Pros:
Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection
Google Analytics Cons:
Custom reporting is limited
Upgrading to Premium is expensive
Requires continual training
Samples large datasets instead of processing all of the data
ElasticSearch Pros:
Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying
ElasticSearch Cons:
Not a relational database, and therefore does not benefit from things like foreign-key constraints
Data consistency can be affected
No built-in authentication or authorization system
How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDBMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
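To illustrate the early exit, here is a sketch of the control flow. Real Couchbase views are written in JavaScript; the Python below only models the same logic, and the field names (`type`, `email`, `name`) and the `build_index` helper are made up for the example.

```python
def users_by_email(doc, emit):
    """Model of a view map function that only cares about 'user' documents."""
    # Bail out immediately for documents of any other type.
    if doc.get("type") != "user":
        return
    emit(doc["email"], doc["name"])


def build_index(docs, map_fn):
    """Run the map function over every document and collect emitted rows."""
    rows = []
    for doc in docs:
        map_fn(doc, lambda key, value: rows.append((key, value)))
    return sorted(rows)


docs = [
    {"type": "user", "email": "b@example.com", "name": "Bea"},
    {"type": "order", "total": 42},
    {"type": "session", "token": "abc"},
    {"type": "user", "email": "a@example.com", "name": "Al"},
]
rows = build_index(docs, users_by_email)
```

The non-matching documents cost one field lookup and a comparison each, which is the "abort very quickly" behaviour described above.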
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had a fairly large number of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
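Here is a rough Python model of the key-prefixing idea: one sorted index with compound keys, queried with different ranges. It is not Couchbase code (views are JavaScript and are queried with `startkey`/`endkey`); the `HIGH` sentinel, the helper names and the sample documents are all assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

HIGH = "\uffff"  # sorts after any ordinary string; used to close a prefix range


def build_index(docs):
    """One 'view' that emits a compound key (type, date) for every document."""
    rows = [((doc["type"], doc["date"]), doc["id"]) for doc in docs]
    rows.sort()
    return rows


def range_query(rows, start_key, end_key):
    """Return the values whose keys fall within [start_key, end_key]."""
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, start_key)
    hi = bisect_right(keys, end_key)
    return [value for _, value in rows[lo:hi]]


rows = build_index([
    {"id": "o1", "type": "order", "date": "2013-01-05"},
    {"id": "o2", "type": "order", "date": "2013-02-10"},
    {"id": "r1", "type": "refund", "date": "2013-01-20"},
    {"id": "o3", "type": "order", "date": "2013-01-28"},
])

# Three different "queries" served by the same index:
all_orders = range_query(rows, ("order",), ("order", HIGH))
january_orders = range_query(rows, ("order", "2013-01"), ("order", "2013-01" + HIGH))
all_refunds = range_query(rows, ("refund",), ("refund", HIGH))
```

Each extra query here would otherwise have been a separate view with its own index to maintain.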
The fewer design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rule that gives you an exact number.
But based on the best practice document that you have used and some discussion (here), you should be able to design your database/views properly.
Let's start with the last question:
Yes, the reason Couchbase advises having a small number of buckets is performance and, more importantly, resource consumption. I invite you to read these blog posts, which help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
Yes, most of the time you will organize the design documents and views by type of document.
It is NOT a problem to have all the document "types" in a single bucket (or a few); this is in fact the way you work with Couchbase.
The most important things to look at are the size of your documents (to see how long parsing the JSON will take) and how often documents will be created, updated, and deleted, since the JS code of the view is ONLY executed when you create or change a document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how many views will you have in each design document?
In fact, the most expensive part is not the indexing or querying; it is when you have to rebalance the data and indices between nodes (adding, removing, or failure of nodes).
Finally, though it looks like you already know it, this chapter is quite good for understanding how views work (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.
I've been considering following options.
senseidb [http://www.senseidb.com] This needs a fixed schema and also data gateways, so there is no simple way to push data other than to provide data streams. My data is unstructured, and there are very few common attributes across all kinds of logs
riak[http://wiki.basho.com/Riak-Search.html]
vertica - cost factor?
Hbase(+Hadoop ecosystem +lucene) - main cons here are that on a single machine this won't make much sense, and I am not sure about the free-text search capability to be built around this
Main requirements are
1. it has to sustain thousands of incoming requests for archival and at the same time build a real-time index which will allow end users to do free-text search
2. storage (log archives + index) has to be optimal
There are a number of specialized log storage and indexing tools; I don't know that I'd necessarily cram logs into a normal data store.
If you have lots of money, it's tough to beat Splunk.
If you'd prefer an open source option, check out the ServerFault discussion. logstash + ElasticSearch seems to be a really strong choice, and should grow pretty well as your logs do.
Have you given any thought to implementations along these lines? It might be helpful to integrate Lucene and Hadoop for your problem.
http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/
So instead of email, your use case could use the log files and the parameters to index.
2-3 TB of data sounds like an "in the middle" case. If that is all the data, I would not suggest venturing into BigData / NoSQL.
I think an RDBMS with full-text search capability should do on good hardware. I would suggest some aggressive partitioning by time to be able to work with 2-3 TB of data; without partitioning it would be too much. At the same time, if your data is partitioned by day, I think the data size will be fine for MySQL.
Taking into account the comment below that the data size is about 10-15 TB, and that the need for some replication will multiply this number by 2-3, we should also consider the size of the indexes, which I would estimate at tens of percent of the data size. An efficient single-node solution will probably be more expensive than clustering, mostly because of licensing costs.
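As a back-of-the-envelope check, the estimate can be written down as a small Python function. The 30% index overhead is only an illustrative value I picked for "tens of percent", and I am assuming that the indexes are replicated along with the data.

```python
def estimate_storage_tb(raw_tb, replication_factor, index_overhead):
    """Total storage: raw data plus indexes, multiplied by the number of copies.

    index_overhead is a fraction of the raw data size (0.3 means 30%).
    Assumes indexes are replicated together with the data.
    """
    return raw_tb * (1 + index_overhead) * replication_factor


# Lower and upper bounds from the figures discussed above,
# with an assumed (illustrative) index overhead of 30%.
low_tb = estimate_storage_tb(raw_tb=10, replication_factor=2, index_overhead=0.3)
high_tb = estimate_storage_tb(raw_tb=15, replication_factor=3, index_overhead=0.3)
```

With those assumptions the range works out to roughly 26 to 58.5 TB, which is well past what a single node handles comfortably.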
To the best of my understanding, existing Hadoop/NoSQL solutions cannot meet your requirements out of the box, mostly because of the number of documents to be indexed. In our case, each log is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think the solution will be to aggregate the logs for some period of time together and treat them as one document.
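A minimal Python sketch of that aggregation, assuming hourly windows and log entries that arrive as (timestamp, message) pairs. The function name and the shape of the resulting document are my own choices for illustration.

```python
from collections import defaultdict
from datetime import datetime


def bundle_by_hour(entries):
    """Group (timestamp, message) pairs into one document per hour."""
    bundles = defaultdict(list)
    for ts, message in entries:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        bundles[hour].append(message)
    return [
        {
            "window_start": hour.isoformat(),
            "count": len(messages),
            "body": "\n".join(messages),  # this is what gets indexed
        }
        for hour, messages in sorted(bundles.items())
    ]


entries = [
    (datetime(2012, 3, 1, 10, 5, 0), "GET /index.html 200"),
    (datetime(2012, 3, 1, 10, 47, 12), "GET /missing 404"),
    (datetime(2012, 3, 1, 11, 2, 30), "POST /login 302"),
]
documents = bundle_by_hour(entries)
```

The trade-off is that a search hit now points to an hour of logs rather than a single line, so the client has to scan the bundle to find the exact entry.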
For the storage of these log packages, HDFS or Swift could be good solutions.
I am planning on using CouchDB on a project. But as the querying mechanism involves writing views (which are a lot like indexes on regular RDBMSs), I was wondering: if the document database keeps getting updated a lot (a write-heavy database), would CouchDB perform well compared to a regular RDBMS? Or do we have to compact/re-index the system occasionally to make it perform faster?
You might think of the pros/cons of the CouchDB view model this way. (CouchDB hackers may disagree but IMO it's accurate enough for users.)
A view function always performs a full "table scan" when it is first created (just like an RDBMS BTW)
As long as they have no side effects, map and reduce functions can be arbitrarily complex
Every document and map/reduce result is cached and never calculated again
If you add or change a document, it (and only it) will be re-computed (and cached) for that view
Given these, you can draw some conclusions about CouchDB performance:
There is never a re-index phase for the entire data set, just incremental per document update
Changing a view function forces re-building the entire index
Since both CouchDB and RDBMS must update the index for new data, it's reasonable to think performance will be similar for heavy update/insert usage.
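The incremental behaviour is easy to model. The Python class below is a deliberate simplification (real CouchDB keeps view rows in a B-tree on disk and map functions are JavaScript); the class and method names are hypothetical. It only shows the two rules from the lists above: a changed document is re-mapped on its own, and a changed view function forces a full rebuild.

```python
class IncrementalView:
    """Simplified model of an incrementally maintained map view."""

    def __init__(self, map_fn):
        self._map_fn = map_fn
        self._rows_by_doc = {}   # doc_id -> cached list of (key, value) rows
        self.map_calls = 0       # counts how often the map function ran

    def update(self, doc_id, doc):
        # Only this document is re-mapped; other cached rows are untouched.
        self.map_calls += 1
        self._rows_by_doc[doc_id] = list(self._map_fn(doc))

    def change_map(self, map_fn, all_docs):
        # A new view function invalidates every cached row: full rebuild.
        self._map_fn = map_fn
        self._rows_by_doc = {}
        for doc_id, doc in all_docs.items():
            self.update(doc_id, doc)

    def rows(self):
        return sorted(row for rows in self._rows_by_doc.values() for row in rows)


def by_tag(doc):
    yield (doc["tag"], 1)


docs = {"a": {"tag": "x"}, "b": {"tag": "y"}, "c": {"tag": "x"}}
view = IncrementalView(by_tag)
for doc_id, doc in docs.items():
    view.update(doc_id, doc)       # initial build: one map call per document

docs["b"] = {"tag": "z"}
view.update("b", docs["b"])        # one change: exactly one more map call
```

After the single update, `view.map_calls` is 4 rather than 6, which is the whole point of the model.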
Obviously YMMV and the standard cop-out, "you must test your own load" applies. However I will add a few more considerations.
I say an RDBMS is flat out superior for exploratory-style querying of your data. When you don't even know what questions to ask of your data, you really can't beat a structured query language.
However, once you define what you want to know, CouchDB (and perhaps Hadoop) provide the most rich querying system because you are just writing code.
If your data set is large, NoSQL databases will scale more easily. For example, CouchDB-Lounge allows a cluster of couches for parallel processing. Hadoop does the same so then it would come down to secondary considerations: familiarity, maintainability, CouchDB is a web server but requires a bit more DIY; Hadoop internalizes more cluster management at the cost of complexity, foreignness, etc.
I hope that helps shed some light on your decision!