Theoretically speaking, couldn't you just cache the results of a SQL query made to the database, making it similar to Elasticsearch? I understand you would run into invalidation issues, but what are the fundamental differences between Elasticsearch and a cache like Redis?
Elasticsearch is primarily a search engine optimized to store and retrieve structured or semi-structured data. It takes care of processing structured/semi-structured data, builds indexes, and provides a nice DSL to query the data. Oh, and it happens to be super fast :)
A distributed cache like Memcached or Redis (BTW, Redis is not just a cache, but a data structure store) primarily stores key-value pairs for faster lookup. Think of your local hash table distributed across a bunch of machines.
Two different use cases. If it's just for caching, Elasticsearch may not be the right choice.
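To make the difference concrete, here is a minimal cache-aside sketch using the Jedis client (a sketch only; the key scheme, TTL, and the runQueryAgainstDatabase helper are made up). Redis can only hand back a value for a key you already know, whereas Elasticsearch lets you query the documents themselves.

```java
import redis.clients.jedis.Jedis;

public class QueryCacheSketch {
    // Hypothetical helper that actually hits the SQL database.
    static String runQueryAgainstDatabase(String sql) {
        return "...rows serialized as JSON...";
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String sql = "SELECT * FROM products WHERE category = 'books'";
            String cacheKey = "sqlcache:" + Integer.toHexString(sql.hashCode());

            // Cache-aside: only an exact key lookup is possible.
            String cached = jedis.get(cacheKey);
            if (cached == null) {
                cached = runQueryAgainstDatabase(sql);
                jedis.setex(cacheKey, 300, cached); // expire after 5 minutes
            }
            System.out.println(cached);
            // Elasticsearch, by contrast, lets you query the documents themselves
            // (full-text search, filters, aggregations), not just exact keys.
        }
    }
}
```

That exact-key limitation is also why invalidation is hard: you have to know or track every cached key that a given write affects.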
I want to have a memory cache layer in my application. To populate the cache with items, I have to get data from a large Cassandra table. Selecting everything is not recommended, because without partition keys it's a slow read operation. Alternatively, I can "predict" the partition keys from another Cassandra table that I would also have to read in full, but it's a comparatively small table. After reading that user table I'd build a list of potential partition keys (userX, userY) that may or may not be present in the initial table, and then try to populate the cache by executing a select query for each potential key. That also doesn't sound like a really good idea.
So the question is: how do I properly populate the cache layer with data from a Cassandra DB?
The second option is preferred for warming up or pre-loading your cache.
Single-partition asynchronous queries from multiple client/app instances are much better than doing a full table scan. Asynchronous queries from lots of clients distribute the load efficiently across all nodes in the cluster, which is why they perform better.
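A minimal sketch of that approach with the DataStax Java driver (assuming the 3.x API; the keyspace, table, and column names are hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CacheWarmerSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // One prepared statement, bound once per partition key.
            PreparedStatement ps =
                session.prepare("SELECT * FROM user_data WHERE user_id = ?");

            // Candidate partition keys predicted from the smaller table.
            List<String> candidateKeys = Arrays.asList("userX", "userY");

            // Fire single-partition queries asynchronously; each one is routed
            // to the replicas owning that partition, spreading load across nodes.
            // In production you would also cap the number of in-flight queries.
            List<ResultSetFuture> futures = candidateKeys.stream()
                .map(k -> session.executeAsync(ps.bind(k)))
                .collect(Collectors.toList());

            futures.forEach(f -> f.getUninterruptibly()
                .forEach(row -> { /* put the row into the local cache here */ }));
        }
    }
}
```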
It should be said that if you've got your data model right and you've sized your cluster correctly, you can achieve single-digit-millisecond latencies. I work with a lot of large organisations that have a 95% SLA for 6-8ms reads. Cheers!
I am trying to compare the performance of application queries on an H2 database and Ignite against an Oracle baseline.
I created a test including:
A set of tables and indexes.
A data set of randomly generated data with 50k records per table.
A query with 1 INNER and 10 LEFT OUTER joins (the query returns around 188k records).
I noticed significant differences in terms of performance.
Running the query, on my machine (i5 dual core, 16 GB RAM):
Oracle manages to run this query in around 350ms.
H2 takes 4.5s (regardless of the mode: server or in-memory).
Ignite takes 9s.
Iterating over the JDBC result set:
Less than 50ms for H2 in-memory mode
Around 2s for the H2 server mode
Around 5s for Oracle
Around 1s for Ignite
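A plain JDBC loop along these lines shows how such timings split between executing the query and iterating the result set (the URL and query below are stand-ins, not the actual test code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryTimingSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in URL: point this at Oracle, H2 (server or mem), or the Ignite thin driver.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
             Statement stmt = conn.createStatement()) {

            // Stand-in for the real 1-INNER + 10-LEFT-OUTER-JOIN query.
            String sql = "SELECT 1";

            long t0 = System.nanoTime();
            ResultSet rs = stmt.executeQuery(sql);
            long queryMs = (System.nanoTime() - t0) / 1_000_000;

            t0 = System.nanoTime();
            int rows = 0;
            while (rs.next()) {
                rows++; // touch every row so the driver actually fetches it
            }
            long iterateMs = (System.nanoTime() - t0) / 1_000_000;

            System.out.printf("execute=%d ms, iterate=%d ms, rows=%d%n", queryMs, iterateMs, rows);
        }
    }
}
```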
Couple of questions:
Do these figures make sense? Did I just miss the basics of H2 query optimization?
Looking at H2 explain plans, what is the exact meaning of scanCount? Is it constant for a given query and data set, or is it a performance indicator?
Is there a way to improve H2 performance by tuning indexes or hinting queries?
How do you explain the difference between Ignite and H2?
Is the order of joins important? I'm asking because on Oracle, with up-to-date statistics, the CBO changes the join order. I didn't notice such behavior with H2.
The queries and data I used for this test are available on GitHub.
Thanks,
L.
Let me share some basic facts related to Ignite vs. RDBMS performance benchmarking. I'm copy-pasting this from a new GridGain doc that will be released this month; just replace occurrences of GridGain with Ignite. Please double-check that these principles are followed, and let me know if you still don't see a difference.
GridGain and Ignite are frequently compared to relational databases for their SQL capabilities, with an expectation that existing SQL queries created for an RDBMS will work out of the box and perform faster in GridGain without any changes. Usually, such a faulty assumption is based on the fact that GridGain stores and processes data in memory. However, it's not enough just to put data in RAM and expect an order-of-magnitude performance increase. GridGain, as a distributed platform, requires extra changes for the sake of performance; below is a standard checklist of best practices to consider before you benchmark GridGain against an RDBMS or do any other performance testing:
Ignite/GridGain is optimized for multi-node deployments with RAM as the primary storage. Don't try to compare a single-node GridGain cluster to a relational database that was optimized for such single-node configurations. You should deploy a multi-node GridGain cluster with a full copy of the data in RAM.
Be ready to adjust your data model and existing SQL queries, if any.
Use the affinity collocation concept during the data modelling phase for proper data distribution. Remember, it's not enough just to put data in RAM. If your data is properly collocated, you can run SQL queries with JOINs at massive scale and expect significant performance benefits (see the sketch after this list).
Define secondary indexes and use other standard, and GridGain-specific, tuning techniques described below.
Keep in mind that relational databases leverage local caching techniques and, depending on the total data size, an RDBMS can complete some queries even faster than GridGain, even in a multi-node configuration. If your data set is around 10-100 GB and the RDBMS has enough RAM to cache the data locally, then it can, for instance, outperform a multi-node GridGain cluster because the latter will be utilizing the network. Store much more data in GridGain to see the difference.
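For example, collocation and secondary indexes can both be declared in Ignite's SQL DDL. A minimal sketch over the JDBC thin driver (the table and column names are made up, not taken from the benchmark above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IgniteDdlSketch {
    public static void main(String[] args) throws Exception {
        // Ignite JDBC thin driver; assumes a node is listening on localhost.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {

            // Collocate orders with their customer: rows sharing customer_id
            // land on the same node, so a join on customer_id stays node-local.
            stmt.executeUpdate(
                "CREATE TABLE customer (id BIGINT PRIMARY KEY, name VARCHAR)" +
                " WITH \"template=partitioned\"");
            stmt.executeUpdate(
                "CREATE TABLE orders (id BIGINT, customer_id BIGINT, amount DECIMAL," +
                " PRIMARY KEY (id, customer_id))" +
                " WITH \"template=partitioned, affinity_key=customer_id\"");

            // Secondary index to speed up the join/filter column.
            stmt.executeUpdate("CREATE INDEX idx_orders_customer ON orders (customer_id)");
        }
    }
}
```

With the affinity key in place, a join between customer and orders on customer_id can be executed without shuffling rows between nodes.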
In RethinkDB, how efficiently are documents stored on disk? How does this compare to other databases, like MySQL, MongoDB, PostgreSQL, or Cassandra? Does RethinkDB use more, less, or about the same space to store the same or similar data? This is what I mean by 'efficient'.
Obviously, I imagine that it may be difficult to compare in this way, but I am curious nevertheless. It would be helpful to get a sense of the general disk usage profile of each database, and whether some databases may be more or less efficient with their disk usage than others. In particular, it would be great to get a sense of where RethinkDB stands in this regard.
I am looking for a data store that serves the following needs:
Distributed, because we have lots of data to query (terabytes).
A write-intensive data store: data will be generated by services, and we want to store it to perform analytics on it.
We want the analytical queries to be reasonably fast (order of minutes, not hours)
Most of our queries would be of the "Select, Filter, Aggregate, Sort" type.
The schema changes often, as what we store changes with the evolving requirements of the system.
Part of the data that we store may also be used for pure large scale map/reduce jobs for other purposes.
Key-value stores are scalable but do not support our query requirements.
Map/Reduce jobs are scalable and can execute the queries, but I think they will not meet our query latency requirements.
An RDBMS (like MySQL) would satisfy our query needs, but it would force us to have a fixed schema. We could scale it, but then we would have to do sharding, etc.
Commercial solutions like Vertica would solve all of our problems, but I would like to avoid a commercial product if I can.
HBase seems to be as scalable as Hadoop because of the underlying HDFS, and it seems to have the facilities to perform filters and aggregations, but I am not sure about the performance of filter queries in HBase.
Currently HBase does not support secondary indexes. This makes me wonder whether HBase is the right option for filtering on an arbitrary column. As per the documentation, filtering on the row-id and column family is faster than filtering on just the column qualifier. However, I also read that including the row-id and column family in the bloom filter significantly increases its size and makes this option practically infeasible.
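For context, filtering on an arbitrary column with the HBase Java client looks roughly like this (HBase 1.x-style API assumed; the table, family, and qualifier names are made up). The filter is evaluated server-side, but every row in the scanned range is still read, unlike a row-key range scan:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFilterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Filtering on an arbitrary column qualifier: server-side filter,
            // but every row in the scanned range is still examined.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("d"), Bytes.toBytes("status"),
                CompareOp.EQUAL, Bytes.toBytes("FAILED")));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }

            // By contrast, a row-key range scan only touches matching rows
            // (shown for contrast; not executed here).
            Scan keyed = new Scan(Bytes.toBytes("user123#"), Bytes.toBytes("user123$"));
        }
    }
}
```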
I am unable to find much data online about the performance of filter queries in HBase.
Hoping I can find some more information here.
Thanks!
Try Apache Cassandra; it supports secondary indexes very well. As for HBase bloom filters, please go through this link; it describes the different bloom filter options depending on the access pattern: HBase bloom filters.
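A minimal sketch of what the Cassandra suggestion looks like, executed here through the DataStax Java driver (keyspace, table, and column names are hypothetical):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SecondaryIndexSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("analytics")) {

            // Secondary index on an arbitrary (non-key) column...
            session.execute(
                "CREATE INDEX IF NOT EXISTS events_status_idx ON events (status)");

            // ...which then allows filtering on that column directly.
            session.execute("SELECT * FROM events WHERE status = 'FAILED'")
                   .forEach(row -> System.out.println(row.getUUID("id")));
        }
    }
}
```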
You are probably looking for MPP solutions like Postgres-XL or related platforms.
I want to implement a cache system for our application, and we've started integrating with Memcached. Recently I started hearing about Hypertable and saw some great benchmarks done with it.
However, I couldn't find good comparison between the two.
Just to get things straight: I know that Hypertable is considered closer to a DB than to a cache. On the other hand, it's not exactly an RDBMS - in fact, it's exactly not an RDBMS. It has its own benefits, but the question is whether they're worth the performance cost (if any)?
Hypertable is an implementation of the concepts in Google's BigTable: namely, a column-oriented DB whose data is typically highly denormalized, which means it doesn't need joins.
Memcached is an in-memory caching layer which acts like a distributed hashtable, keeping your app from having to hit the actual DB.
Both lend themselves well to being distributed and work well with MapReduce-style topologies, but they serve different purposes. Memcached/DHTs speed up access to data held in memory, while Hypertable/BigTable are actual mechanisms for permanent data storage on disk.
Memcached is used for speeding things up, e.g. caching the results of SQL queries without going to the DB, by storing everything in memory (RAM).
Hypertable (like HBase, Cassandra, MongoDB, etc.) is a permanent-storage NoSQL DB (data is stored on and retrieved from disk). It can't give you the performance of reading/writing from/to RAM (e.g. Memcached), so the two are not really comparable.
A better approach is to use a NoSQL DB for permanent storage and Memcached as a fast front-side cache between the web application and the (NoSQL or any other) DB.
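A minimal sketch of that front-side cache pattern with the spymemcached client (a sketch only; the key, TTL, and loadFromDatabase helper are made up):

```java
import net.spy.memcached.MemcachedClient;
import java.net.InetSocketAddress;

public class CacheAsideSketch {
    // Hypothetical loader that goes to the (NoSQL or relational) DB.
    static String loadFromDatabase(String userId) {
        return "{\"id\":\"" + userId + "\"}";
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
            new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "user:42";
        Object value = cache.get(key);       // 1. try the cache first
        if (value == null) {
            value = loadFromDatabase("42");  // 2. miss: fall back to the DB
            cache.set(key, 300, value);      // 3. populate the cache (5 min TTL)
        }
        System.out.println(value);
        cache.shutdown();
    }
}
```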