HBase read high load - hadoop

I'm researching NoSQL solutions for our company's needs.
For now the search has narrowed to HBase. I've read a lot about its architecture, performance, etc., but one thing is still unclear to me.
Say you have a 100-node cluster and one row gets 100,000 simultaneous requests. Will all 100,000 requests hit the single node where that row is stored? As I understand it, HBase replication is only for data backup (not for read load balancing), and there is no master/slave mechanism (like in MySQL)?

Regarding 100,000 concurrent requests for a single row: I don't think any system handles that well currently. Under normal conditions it is simply not needed: clients are isolated from the DB anyway, so access is limited (and probably cached).
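To make the caching point concrete, here is a minimal sketch of a read-through cache placed between clients and the store. Everything in it is an assumption for illustration: `HotRowCache` and `fetch_row_from_store` are hypothetical names, not part of any HBase client API, and the loader simply stands in for a read that would land on one region server.

```python
import threading
import time


class HotRowCache:
    """Tiny read-through cache with a TTL (illustrative, not an HBase API)."""

    def __init__(self, loader, ttl_seconds=1.0):
        self._loader = loader      # function that performs the real read
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries = {}         # row key -> (expiry time, value)

    def get(self, row_key):
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(row_key)
            if entry is not None and entry[0] > now:
                return entry[1]    # served from cache, no backend read
            # Loading under the lock keeps the sketch simple: concurrent
            # misses for the same row trigger a single backend read.
            value = self._loader(row_key)
            self._entries[row_key] = (now + self._ttl, value)
            return value


backend_reads = []  # records every read that reaches the backend


def fetch_row_from_store(row_key):
    """Hypothetical stand-in for a read served by one region server."""
    backend_reads.append(row_key)
    return {"row": row_key, "value": "payload"}


cache = HotRowCache(fetch_row_from_store, ttl_seconds=60.0)
for _ in range(100_000):
    cache.get("hot-row")
print(len(backend_reads))  # prints 1
```

With a TTL that suits the data's freshness requirements, the hot row is read from the store once per TTL window instead of once per request. A production cache would use per-key locking rather than one global lock.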
Regarding storage and replication: first, there are at least two types of replication, and the storage-level one is not actually done by HBase. HBase relies on HDFS, which is fault tolerant by nature. Read about the roles of the HBase master and the HBase region server if you need the details, but in general everything related to replication goes to HDFS.

I guess 100,000 concurrent requests will not work very well on HBase; however, real-world scenarios seem to work quite well.
yfrog gets 10K requests per second, eBay chose it for the new version of their product search engine, and Facebook chose it for their messaging system.
You can also take a look at the hstack benchmarks on a more modest cluster.

HBase replication is not only for data backup but also for availability. Since that does not seem to be the only point your question covers, I am pointing you to the link below, where you can find more information. If you have specific questions about schema design, start at the home page of the Apache-hosted project. As for the last question about master/slave, the same URL applies (and you can ask the HBase developers if you are unsure): http://hbase.apache.org/replication.html

Related

what should be considered before choosing hbase?

I am very new to the big data space.
Our team suggested that we use HBase instead of an RDBMS for high performance. We have no idea what should or must be considered before switching from an RDBMS to HBase. Any ideas?
One of my favourite books describes..
Coming to #Whitefret's last point: there is something called the CAP theorem, on which the decision can be based.
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
In this context HBase supports CP.
However, to move data from an RDBMS to HBase you can use Sqoop.
It's a difficult question; there are many things to consider.
Can you optimize your RDBMS? Adding indexes, denormalizing joins that cost too much... There are many paths to consider and I am no expert.
Is your data big? This is very vague, and there is a space between RDBMS and Big Data where you can't be sure which one to use. Millions of rows can still be handled efficiently by an RDBMS.
Do you need relations in your data? NoSQL databases don't use relations, which can be hard for people from a SQL background. There are frameworks that provide SQL on HBase, but in general it is a bad idea to carry an RDBMS model over when using Big Data.
If you can answer those questions and you think NoSQL is the drill, ask your team how they feel about it. NoSQL databases come with problems you would never meet in the SQL world. The team should build a prototype first to understand how all this works, and maybe some training should be made available to them.
In Summary:
- Find if you need non relational database
- Choose the right one (is HBase really what you need? Why not consider Cassandra or MongoDB?)
HBase, like all NoSQL DBs, comes with great new features, but sadly nothing is free (not even mentioning the monetary cost).
In HBase, you really should check whether all the queries you might want to run can be fulfilled with the HBase data model. An important thing to consider is the schema design (first and foremost the modeling of the row key).
I advise you to read this really good paper:
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
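As a small illustration of why row key modeling matters: HBase keeps rows sorted by the bytes of the row key, so the key layout decides which scans are cheap. The sketch below shows one common pattern (an entity id followed by a reversed timestamp, so the newest entries for an entity sort first). The function name, the separator byte, and the sample data are assumptions for illustration, not taken from the paper or from any HBase API.

```python
import struct

MAX_TS = 2**63 - 1  # largest signed 64-bit value


def make_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Composite key: user id, a separator byte, then a reversed timestamp."""
    reversed_ts = MAX_TS - timestamp_ms
    # Big-endian packing makes byte order match numeric order.
    return user_id.encode("utf-8") + b"\x00" + struct.pack(">Q", reversed_ts)


keys = [
    make_row_key("alice", 1_000),
    make_row_key("alice", 3_000),
    make_row_key("bob", 2_000),
    make_row_key("alice", 2_000),
]

# Sorting the raw bytes mimics the order in which the store would keep rows.
sorted_keys = sorted(keys)

# A prefix scan for one user touches only that user's contiguous rows.
alice_rows = [k for k in sorted_keys if k.startswith(b"alice\x00")]
```

With this layout, "latest N events for a user" is a short prefix scan, while "all events in a time window across users" is not served well at all, which is exactly the kind of query you need to check against the data model up front.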
I think that a really good answer to your question can be found on the HBase official site.
"HBase isn’t suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only. "
https://hbase.apache.org/book.html

Analytics and Mining of data sitting on Cassandra

We have a lot of user interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large column families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience with the performance differences of one vs. the other? Would you run MapReduce queries on a live Cassandra production instance or on a backup copy, to prevent query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads and just generally handle large traffic loads. However, if you start doing complex analytics, the availability of your Cassandra cluster will probably suffer noticeably. In general from what I've seen it's in your best interest to leave the Cassandra cluster alone, otherwise the availability starts suffering.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data warehouse would allow you to do complex queries with complex joins on multiple tables. These data-warehouse systems are widely used for reporting; here is a list of what are, in my opinion, the key players:
Netezza
Aster/TeraData
Vertica
A recent one that is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it you could give it a try, since it looks like a solid analytics platform with pricing much more attractive than the above solutions.
Alternatives such as Hadoop MapReduce/Hive/Pig are also interesting to look at, but probably not a replacement for data-warehouse technologies. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and you can scale easily. There are also libraries already integrated with Hadoop, like Apache Mahout, which let you do data mining on a Hadoop cluster; you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach I've used that has been working well so far is pre-aggregating the results in Hive and then generating the reports themselves in a data warehouse like Netezza, which computes the complex joins.
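The pre-aggregate-then-join idea can be sketched in a few lines. This is plain Python standing in for both stages, purely for illustration: the event fields, the `ads` dimension table, and the function names are all made up, and in practice the first step would be a Hive job and the second a warehouse query.

```python
from collections import Counter

# Raw interaction events, as they might be exported from the operational store.
events = [
    {"day": "2013-01-01", "ad_id": "a1", "type": "view"},
    {"day": "2013-01-01", "ad_id": "a1", "type": "view"},
    {"day": "2013-01-01", "ad_id": "a1", "type": "click"},
    {"day": "2013-01-01", "ad_id": "a2", "type": "view"},
    {"day": "2013-01-02", "ad_id": "a1", "type": "view"},
]

# Small dimension table that the report needs to join against.
ads = {"a1": "campaign-x", "a2": "campaign-x"}


def pre_aggregate(rows):
    """Batch step: collapse raw events into counts per (day, ad, type)."""
    counts = Counter()
    for row in rows:
        counts[(row["day"], row["ad_id"], row["type"])] += 1
    return counts


def campaign_report(counts, ad_to_campaign):
    """Reporting step: join the much smaller aggregate with the dimension."""
    report = Counter()
    for (day, ad_id, event_type), n in counts.items():
        report[(day, ad_to_campaign[ad_id], event_type)] += n
    return dict(report)


daily_counts = pre_aggregate(events)
report = campaign_report(daily_counts, ads)
```

The point of the split is that the expensive pass over raw events happens once, offline, and the join then runs over the aggregate rather than over millions of raw rows.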
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against Cassandra nodes that are handling your real-time application needs, because the analytics can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkTopologyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data onto nodes that are used for analytics (raising the replication factor above the default of 3 probably makes sense in such a scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to a different rack, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems... even in rack-aware mode, if you have more than a few racks I don't see how you can be guaranteed that your data will be replicated onto those nodes. If you lose one of your "live" nodes, then you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest that instead of having hybrid nodes, you just dedicate some nodes to your Hadoop cluster and some nodes to your HBase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might make sense, since MR is mostly CPU intensive and HBase is a memory-hungry beast. What we do need is to arrange resource management properly. I think it is possible in the following way:
a) CPU. We can define the maximum number of MR mappers/reducers (slots) per node, and assuming that each mapper is single threaded, we can limit the CPU consumption of MR. The rest will go to HBase.
b) Memory. We can limit the memory for mappers and reducers and give the rest to HBase.
c) I think we cannot properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBase, since for it disk operations are not on the critical path.
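Points a) and b) boil down to simple per-node arithmetic, sketched below. The class, its field names, and the example numbers are assumptions chosen for illustration; real sizing depends on your workload and should be measured.

```python
from dataclasses import dataclass


@dataclass
class NodeBudget:
    """Per-node resource split between HBase and MapReduce (illustrative)."""

    cores: int
    memory_gb: float
    hbase_cores: int       # cores set aside for the region server
    hbase_heap_gb: float   # heap set aside for the region server
    os_reserve_gb: float   # headroom for the OS, DataNode, and page cache
    task_heap_gb: float    # heap per map/reduce task JVM

    def max_task_slots(self) -> int:
        """Slots allowed so that MR stays inside what HBase leaves over."""
        by_cpu = self.cores - self.hbase_cores  # one single-threaded task per core
        free_mem = self.memory_gb - self.hbase_heap_gb - self.os_reserve_gb
        by_mem = int(free_mem // self.task_heap_gb)
        return max(0, min(by_cpu, by_mem))


mixed_node = NodeBudget(cores=16, memory_gb=64, hbase_cores=4,
                        hbase_heap_gb=24, os_reserve_gb=8, task_heap_gb=2)
small_node = NodeBudget(cores=8, memory_gb=32, hbase_cores=4,
                        hbase_heap_gb=24, os_reserve_gb=8, task_heap_gb=2)
print(mixed_node.max_task_slots())  # prints 12
print(small_node.max_task_slots())  # prints 0
```

In Hadoop 1.x the resulting slot counts would be applied per TaskTracker through `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`; how you divide the total between map and reduce slots is a separate decision this sketch does not make.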

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps tens of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where, if a Memcached node goes down, we'd suddenly have a large number of cache misses and therefore send a massive number of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
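The retrieval and invalidation flow described above can be sketched as follows. Plain dictionaries stand in for Memcached and for the local persistent store, and `TieredCache` and `load_from_db` are hypothetical names, so this only shows the control flow, not a real client.

```python
class TieredCache:
    """Lookup order: shared cache, then local persistent store, then database."""

    def __init__(self, shared_cache, local_store, load_from_db):
        self.shared = shared_cache        # stand-in for Memcached
        self.local = local_store          # stand-in for local persistent storage
        self.load_from_db = load_from_db  # called only when both tiers miss

    def get(self, key):
        if key in self.shared:
            return self.shared[key]
        if key in self.local:
            value = self.local[key]
            self.shared[key] = value      # repopulate the faster tier
            return value
        value = self.load_from_db(key)
        self.local[key] = value
        self.shared[key] = value
        return value

    def invalidate(self, key):
        """On data change, drop the key from both caching layers."""
        self.shared.pop(key, None)
        self.local.pop(key, None)


db_reads = []                             # records every read that reaches the database
database = {"user:1": "Ada"}


def load_from_db(key):
    db_reads.append(key)
    return database[key]


cache = TieredCache(shared_cache={}, local_store={}, load_from_db=load_from_db)
cache.get("user:1")       # misses both tiers, one database read
cache.shared.clear()      # simulate losing the Memcached node
cache.get("user:1")       # served from the local tier
print(len(db_reads))      # prints 1
```

The simulated Memcached outage shows the intended benefit: the miss is absorbed by the local tier on each web server instead of landing on the database.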
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but small amounts of data. Cassandra looks to be able to support any number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak for this sort of problem, IMO, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster and it takes care of the rest (it's a bit more complicated than that of course, but that's the essence). We went through adding nodes, removing nodes, and having nodes crash, and it has proven surprisingly resilient.
It's a lot like Couch in other matters - document oriented, REST interface, Erlang.
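To give a feel for "define how many replicas you want and it takes care of the rest", here is a toy hash ring that picks N distinct nodes for each key. This is a simplified sketch of the general consistent-hashing idea only; it is not Riak's actual implementation (which works with partitions and virtual nodes), and the `Ring` class and node names are made up.

```python
import hashlib
from bisect import bisect_right


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy hash ring: each key is stored on the next N nodes clockwise."""

    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        self.points = sorted((_hash(node), node) for node in nodes)
        self.hashes = [h for h, _ in self.points]

    def replicas_for(self, key):
        start = bisect_right(self.hashes, _hash(key))
        count = min(self.n_replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1]
                for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes, n_replicas=3)
owners = ring.replicas_for("user:42")

# If the first owner disappears, the remaining owners keep their copies
# and one more node takes over the third replica.
smaller_ring = Ring([n for n in nodes if n != owners[0]], n_replicas=3)
new_owners = smaller_ring.replicas_for("user:42")
```

For the original requirement (every web server holds all the data) the replica count would simply equal the number of nodes.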
You can check out Hazelcast.
It does not persist the data, but it provides a fail-over system. Each node can have a number of other nodes back up its data in case it fails.

getting close to real-time with hadoop

I need some good references for using Hadoop for real-time systems, like search with low response times. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?
You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).
Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.
Take a look at HBase, which provides a structured, queryable datastore on top of HDFS, similar to Google's BigTable. http://hadoop.apache.org/hbase/
It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware to do just that: http://github.com/twitter/gizzard
Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
If you really really need to do serious processing at query time, the way to do that is to run dedicated processes that do the specific kinds of computations you need, and use something like Thrift to send requests for computation and receive results back. Optimize them to have all the needed data in-memory. The process that receives the query itself can then do nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but is not because it's made for computation of specific problems with pre-loaded data rather than a generic computation model for arbitrary computing.
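The scatter/gather shape described above can be sketched like this. Threads in one process stand in for the remote compute nodes, and a direct method call stands in for the Thrift request; `ComputeNode`, `scatter_gather`, and the sample documents are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class ComputeNode:
    """Holds one preloaded in-memory shard and answers one kind of query."""

    def __init__(self, documents):
        self.documents = documents  # doc id -> text, loaded ahead of time

    def search(self, term):
        """Return (doc id, occurrence count) for documents containing term."""
        term = term.lower()
        hits = []
        for doc_id, text in self.documents.items():
            count = text.lower().count(term)
            if count:
                hits.append((doc_id, count))
        return hits


def scatter_gather(nodes, term, top_k=3):
    """Front-end process: fan the query out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.search(term), nodes))
    hits = [hit for partial in partials for hit in partial]
    hits.sort(key=lambda hit: (-hit[1], hit[0]))  # best score first
    return hits[:top_k]


nodes = [
    ComputeNode({"d1": "hadoop hadoop hbase", "d2": "cassandra"}),
    ComputeNode({"d3": "Hadoop and hive", "d4": "hbase only"}),
]
print(scatter_gather(nodes, "hadoop"))  # prints [('d1', 2), ('d3', 1)]
```

The front end does no heavy work itself; all the data is already resident on the compute nodes, which is what separates this from launching a batch job per query.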
Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.
FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, setup a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, etc., etc.
This question is old, but it begs an answer. If there are millions of documents that are not changing in real time, like FAQ docs, then Lucene plus Solr for distribution should pretty much meet the need. Hathi Trust indexes billions of documents using the same combination.
It is a completely different problem if the index is changing in real time. Even Lucene will have problems dealing with updating its index and you have to look at real time search engines. There has been some attempt at reworking Lucene for real time and maybe it should work. You can also look at HSearch, a real time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net

Resources