HBase: how put/get knows which region server to write to? - hadoop

In HBase, how the put/get operations know which region server the row should be written to?
In case of multiple rows to be read how multiple region servers are contacted and the results are retrieved?

I assume your question is simply curiosity, since this behavior is abstracted from the user and you shouldn't care.
In HBase, how the put/get operations know which region server the row should be written to?
From the hbase documentation book:
The HBase client HTable is responsible for finding RegionServers that are serving the particular row range of interest. It does this by querying the .META. and -ROOT- catalog tables (TODO: Explain). After locating the required region(s), the client directly contacts the RegionServer serving that region (i.e., it does not go through the master) and issues the read or write request. This information is cached in the client so that subsequent requests need not go through the lookup process. Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region.
So first step is looking up in meta and root to determine where it is, then it contacts that regionserver to do that work.
In case of multiple rows to be read how multiple region servers are contacted and the results are retrieved?
There are two ways to read from HBase in general: scanners and gets.
If you run multiple gets, those will each individually fetch those records separately. Each one of those is possibly going to a different region server.
The scanner will simply look for the start of the range and then move forward from there. Sometimes it needs to move to a different regionserver when it reaches the end, but the client handles that behind the scenes. If there is some way to design the table such that your multiple gets is a scan and not a series of gets, you should hypothetically have better performance.

Providing the same scenario and explanation from BigTable Paper: "The client library caches tablet locations. If the client
does not know the location of a tablet, or if it discovers
that cached location information is incorrect, then
it recursively moves up the tablet location hierarchy.
If the client's cache is empty, the location algorithm
requires three network round-trips, including one read
from Chubby. If the client's cache is stale, the location
algorithm could take up to six round-trips, because stale
cache entries are only discovered upon misses (assuming
that METADATA tablets do not move very frequently).
Although tablet locations are stored in memory, so no
GFS accesses are required, we further reduce this cost
in the common case by having the client library prefetch
tablet locations: it reads the metadata for more than one
tablet whenever it reads the METADATA table."
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf

Related

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, e.g. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to a) load initial data, and b) get changes that happened since the initial load. There is a time window where changes can be lost: between initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people to do lower network load in that case is they put a proxy node on the same machine as their app server, and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and because it takes a lot of CPU/memory load off of their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.

MongoDB: let different client reads from different replicas

I have a heavy and large mongo table, which has a lot of reads. One of the read clients is an offline process which periodically scans a table aggressively. While other clients read the same table as online service. I'd like to separate them. What I'm thinking is to have a dedicate replica node for this offline client to read from, and then let the other clients read from the remaining replicas. How to do that?
You should consider marking one of the nodes as hidden member of the replica set. It will receive all the replicated writes from primary but won't receive any read traffic (from your online service if you use proper-replicaset-enabled connection string). Then from your offline client you can use connection string which targets the hidden member directly
http://docs.mongodb.org/manual/core/replica-set-hidden-member/

Event based communication between Hybris cluster nodes

In what use-case would there be a need to communicate between cluster nodes? The ClusterAwareEvent interface offers the possibility to specify a source node and a target node, but shouldn't cluster nodes be as independent of each other as possible?
Well there are a few reasons why they would need to communicate or rather why you would want them to communicate.
Firstly there is a concept called the Cache Invalidation Concept, which is where each cluster member holds only valid data, but can communicate with one another by TCP or UDP to mark some cache entries as invalid. For example if a database item has been changed.
A basic overview of the invalidation process:
Product description is changed. Therefore, all cache entries
referring to the product are invalid.
This modification to the description is done on a node, which now has to send a notification to all cluster nodes that the data is invalid.
Nodes that hold the product in their cache discard the cached data of
the product and re-retrieve the product from the database the next
time the product is used.
Other features of clustering within Hybris where you would want to communicate with other nodes would be:
Load Balancing
Semi-Session Failover - This allows sessions(sticky sessions) to transfer to a different cluster. Useful if say a server is going down for maintenance or a hardware defect.
These would be the main reasons I can think of off the top of my head for why you would want clusters to communicate.

Handle huge data imported from facebook

I'm currently create a program that imports all groups and feeds from Facebook which the user wants.
I used to use the Graph API with OAuth and this works very well.
But I came to the point that I realized that one request can't handle the import of 1000 groups plus the feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
Restful
Questions
What's you experience about that?
Would hadoop the right solution?
You can use neo4j.
Neo4j is a graph database, reliable and fast for managing and querying highly connected data
http://www.neo4j.org/
1) Decide structure of nodes, relationships, and there properties and accordingly
You need to create API that will get data from facebook and store it in Neo4j.
I have used neo4j in 3 big projects, and it is best for graph data.
2) Create a cron jon that will get data from facebook and store into the neo4j.
I think implementing mysql for graph database is not a good idea. for large data neo4j is the good option.
Interestingly you designed the appropriate solution yourself already. So in fact you need following components:
a relational database, since you want to request data in a structured, quick way
-> from experiences I would pressure the fact to have a fully normalized data model (in your case with tables users, groups, users2groups), also have 4-Byte surrogate keys over larger keys from facebook (for back referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (eg. crc32(lower(STRING))) - an example select would than be this: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING))
-> never,ever establish unique columns based on strings with length > 8 Byte; unique bulk inserts can be done based on hashes+string checking via insert...select
-> once you got that settled you could also look into sparse matrices (see wikipedia) and bitmaps to get your users2groups optimized (however I have learned that this is an extra that should not hinder you to come up with a first version soon)
a cron job that is run periodically
-> ideally along the caps, facebook is giving you (so if they rule you to not request more often than once per second, stick to that - not more, but also try to come as close as possible to the cap) -> invest some time in getting the management of this settled, if different types of requests need to be fired (request for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done with development - so if I were you I would stick to any high level programming language that does not bother to much with var type juggling and that also comes along with a broad support for associative arrays such as PHP and I would programm that thing myself
-> I made good experiences with setting up the cron job as web page with deactivated output buffering (for php look at ob_end_flush(void)) - easy to test and the cron job can be triggered via curl; if you channel status outputs via an own function (eg with time stamps) this could then also become flexible to either run viw browser or via command line -> which means efficient testing + efficient production running
your user ui, which only requests your database and never, ever, never the external system api
lots of memory, to keep your performance high (optimal: all your data+index data fits into database memory/cache dedicated to the database)
-> if you use mysql as database you should look into innodb_flush_log_at_trx_commit=0, and innodb_buffer_pool_size (just google, if interested)
Hadoop is a file system layer - it could help you with availability. However I would put this into the category of "sparse matrix", which is nothing that stops you from coming up with a solution. From my experience availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like neo4j from the other answer. So I wondered what I can learn for my future projects. My experiences with mysql is that RAM is usually the biggest constraint. So increasing your RAM to be able to load the full database can gain you performance improvements by a factor of 2-1000 - depending on from where you are coming from. Everything else such as index improvements and structure somehow follows. So if I would need to make up a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MYSQL. It stores records efficiently, but needs to run joins for deriving relations (which neo4j does not require to that extend). Join-costs are usually low with the right indexes and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html neo4j does need to add extra management data to the property separation. For big data projects those management data sums up and in full load to memory set ups requires you buy more memory. Performance wise these both options are ultimate. Further, much further down the line you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In worst case MYSQL will even put indexes to disk (at least partly), which can result in massive read delay. In comparison with NEO4J you could perform a ' direct jump from node to node' exercise, which should - at least in theory - be faster.

Neo4j - Using Java plugins to REST api to improve performance?

I am building an application that requires a lot of data constantly being extracted from a local MongoDB to be put into Neo4j. Seeing as I am also having many users access the Neo4j database, from both a Django webserver and other places, I decided on using the REST interface for Neo4j.
The problem I am having is that, even with batch insertion, the Neo4j server is active over 50% of the time with just trying the insert all the data from the mongoDB. As far as I can see there might be some waiting time because of the HTTP requests but I have been trying to tweak but have only gotten so far.
The question is, if I write a Java plugin (http://docs.neo4j.org/chunked/stable/server-plugins.html) that can handle inserting the mongoDB extractions directly, will I then go around the REST API? Or will the java plugin commands just convert to regular REST API requests? Furthermore, will there be a performance boost by using the plugin?
The last question is how do I optimize the speed of the REST API (So far I am performing around 1500 read/write operations which includes many "get_or_create_in_index" operations)? Is there a sweet spot where the number of queries appended to one HTTP requests will keep Neo4j busy until the next HTTP request arrives?
Update:
I am using Neo4j version 2.0
The data that I am extracting consists of bluetooth observations, where the phone that is running the app i created scans all nearby phones. This single observation is then saved as a document in MongoDB and consists of the users id, the time of the scan and a list of the phones/users that he has seen in that scan.
In Neo4j I model all the users as nodes and I also model an observation between two users as a node so that it will look like this:
(user1)-[observed]->(observation_node)-[observed]->(user2)
Furthermore I index all user nodes.
When moving the observation from mongoDB to Neo4j, I do the following for each document:
Check in the index if the user doing the scan already has a node assigned, else create one
Then for each observed user in the scan: A) Check in index if the observed user has a node else create one B) Create an observation node and relationships between the users and the observation node, if this doesn't already exist C) Make a relationship between the observation node and a timeline node (the timeline just consists of a tree of nodes so that I can quickly find observations at a certain time)
As it can be seen I am doing quite a few lookups in the user index (3), some normal read (2-3) and potentially many writes for each observation.
Each bluetooth scan average around 5-30 observations and I batch 100 scans in a single HTTP request. This means that each request usually contains 5000-10000 updates.
What version are you using?
The unmanaged extension would use the underlying Java-API so it much faster, also you can decide on the format & protocol of the data that you push to it.
It is sensible to batch writes, so that you don't incurr tx overhead per each tiny write. E.g. aggregating 10-50k updates in one operation helps a lot.
What is the concrete shape of the updates you do? Can you edit your question to reflect that?
Some resources for this:
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2013/12/31/the-power-of-open-source-software/

Resources