Does HBase use the compute capacity of all nodes in the cluster for query execution?

We have a setup of 1 master and 2 slave nodes. The same dataset (65 million rows) is loaded into both Postgres and HBase. Yet we don't see a measurable performance improvement from HBase for the same query.
My first thought is: does HBase use the compute capacity of all nodes to fan the query out? Perhaps this is why the performance is not measurably better.
Are there any other reasons why the performance of Postgres and HBase would be about the same? Any specific configuration items to look for?
EDIT: Something I found while researching this: http://www.flurry.com/2012/06/12/137492485#.VaQP_5QpBpg

This is kind of a yes-and-no answer. Depending on what you are doing for your 'query' and on your region distribution, you may or may not be using all the nodes. For example, if you are running a scan across the table, it will run against each region (assuming there is more than one) in sequence. However, if you are using a multi-get for keys that are in different regions, the gets will run in parallel.
The real benefit comes as the number of regions increases and you start parallelizing requests (multiple clients). Regions are distributed across region servers by the Master as they split.
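For illustration, here is a minimal sketch of the two access patterns described above, using the standard HBase Java client (the table name and row keys are made up, not taken from the question): a plain scan that walks the regions one after another, and a batched multi-get whose lookups can be served by several region servers at once.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanVsMultiGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"))) {  // hypothetical table

            // Pattern 1: a scan walks the regions one after another (sequential).
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result r : scanner) {
                    // process each row sequentially
                }
            }

            // Pattern 2: a batched multi-get; rows that live in different regions
            // can be fetched from their region servers in parallel by the client.
            List<Get> gets = new ArrayList<>();
            for (String key : new String[] {"row-0001", "row-5000", "row-9999"}) {  // made-up keys
                gets.add(new Get(Bytes.toBytes(key)));
            }
            Result[] results = table.get(gets);
        }
    }
}
```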

Related

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to HBase and the Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and of running MapReduce over it is correct.
Some background here:
1. We have an HBase table that is about 1 TB and exceeds 100 million records.
2. It has 3 region servers, and each region server holds about 80 regions, for 240 regions in total.
3. The records in the table should be fairly uniformly distributed across the regions, from what I know.
What I'm trying to achieve is to filter rows based on some column values and export those rows to the HDFS filesystem, or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experimentation and can get the result I want. But I'm not sure I understand the behavior of HBase's TableMapper and Scan, and that is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I will take the official RowCounter implementation as an example, and I would like to confirm that my understanding is correct.
So my questions about HBase with MapReduce are:
In its simplest form (without any optional arguments), RowCounter is actually a full table scan. HBase iterates over all records in the table and emits each row to the map method of RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions the table has. For example, if we have only 1 region in our HBase table, it will have only 1 map task, which is effectively a single thread and does not utilize any parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks per region, so that when we run RowCounter on a table that has only 1 region, it still gets 10 or 20 tasks and counts the rows in parallel?
Since TableMapper also depends on the Scan operation, I would like to confirm my understanding of Scan and its performance as well.
If I use setStartRow / setStopRow to limit the scope of my dataset, then because the rowkey is indexed, this does not hurt performance: it does not trigger a full table scan. Correct?
In our case, we might need to filter our data based on modification time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over filtering inside our MapReduce job itself?
Finally, we have had some discussion on how we should do this export, and we have the following two approaches, but we are not sure which one is better.
Using the MapReduce approach described above. But we are not sure whether the parallelism will be bound by how many regions the table has, i.e. whether the concurrency can never exceed the region count, so that we could not improve performance without increasing the number of regions.
We maintain a rowkey list in a separate place (perhaps on HDFS), use Spark to read that file, and then fetch each record with a simple Get operation. All the concurrency occurs on the Spark / Hadoop side.
I would appreciate suggestions from this community about which solution is better; it would be really helpful. Thanks.
It seems like you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without increasing the number of region servers won't really speed up the job. I think 80 regions per RS for that table is decent enough.
I am assuming you are going to use TableInputFormat; it works by running 1 mapper per region and performs server-side filtering based on the scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. There are many other factors: the number of RS, the RAM and disk on each RS, and the uniformity of the data distribution are some of them.
In general, I would go with #1, since you just need to prepare a scan object and HBase will take care of the rest; a rough sketch of such a job follows below.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
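To make approach #1 concrete, here is a rough sketch of such a job using the stock org.apache.hadoop.hbase.mapreduce utilities, which drive TableInputFormat and therefore create one map task per region. The table name, column family, qualifier, and output path are placeholders, and the filter logic is only illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExportByType {

    // One map task is created per region; the Scan below is evaluated server-side.
    static class ExportMapper extends TableMapper<NullWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result value, Context context)
                throws IOException, InterruptedException {
            // "cf" and "type" are made-up family/qualifier names.
            byte[] type = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("type"));
            if (type != null) {
                context.write(NullWritable.get(), new Text(Bytes.toString(value.getRow())));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "export-by-type");
        job.setJarByClass(ExportByType.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch larger batches per RPC
        scan.setCacheBlocks(false);  // recommended for MapReduce scans
        // setTimeRange() is applied server-side, but since timestamps are not
        // indexed, every region is still visited.
        scan.setTimeRange(0L, System.currentTimeMillis());

        TableMapReduceUtil.initTableMapperJob(
                "my_table",          // hypothetical source table
                scan,
                ExportMapper.class,
                NullWritable.class,  // map output key
                Text.class,          // map output value
                job);

        job.setNumReduceTasks(0);    // map-only export
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("/tmp/export-by-type"));  // made-up path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```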

HBase Replication - Replicate data in 3 data centers

In our application we have data from 3 different countries, and we persist the data in HBase.
In each country, we want to keep the data of all 3 countries.
To achieve this, is it possible to build our Hadoop cluster using data centers in all 3 countries and keep the replication factor at 3, so that, thanks to the rack-awareness feature, our data gets automatically replicated across all 3 countries?
Any pointers will be of great help.
Thanks
You can't have an HBase cluster spanning countries. This won't work because of latency, failover problems, network issues, etc.
A good option would be to have 3 clusters, one HBase table per country, and sync the tables between clusters as proposed above.
As far as I know, only Google has successfully implemented a multi-country database providing both consistency and availability: Spanner. But the key elements of that solution are a private physical network between the data centers and their own clock-synchronization implementation, which guarantees that all servers across the world agree on the time to within a few milliseconds.
This solution looks theoretically feasible, but writes may become pretty slow, as data needs to be replicated to 3 nodes located in different geographies. It needs to be tried out to check whether the latency is within a tolerable limit.
Another option could be to have three different HBase clusters at the three locations and design the tables in such a way that tables from one HBase cluster can be copied to the others during night hours to keep the data in sync daily. In this case, each HBase cluster will have current data from its own location, but the data from the other two countries will lag by a day.
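If near-real-time sync between the per-country clusters is acceptable instead of nightly copies, HBase's built-in asynchronous replication is another angle to evaluate. Below is a rough sketch using the HBase 2.x admin API; the table name, column family, peer ID, and ZooKeeper quorum address are placeholders, and the same cross-country latency caveats apply.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.util.Bytes;

public class SetupCrossSiteReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // 1. Create the table with the column family marked for replication
            //    (REPLICATION_SCOPE_GLOBAL = 1). Names are hypothetical.
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("country_data"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                            .setScope(HConstants.REPLICATION_SCOPE_GLOBAL)
                            .build())
                    .build());

            // 2. Point replication at the remote cluster's ZooKeeper quorum
            //    (placeholder address); writes are shipped asynchronously.
            ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk-country-b:2181:/hbase")
                    .build();
            admin.addReplicationPeer("country_b", peer);
        }
    }
}
```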

Hive query joining 15 tables is expected to generate 1 billion records, on 3 datanodes with 16 GB RAM each. Is this the right way to do it?

My name is Vitthal.
The Hortonworks HDP 2.4 cluster on Amazon has 3 datanodes, with the masters on separate instances.
7 instances with 16 GB RAM each.
1 TB of HDD space in total.
3 data nodes.
Hadoop version 2.7.
I have pulled data from Postgres into the Hadoop distributed environment.
The data is 15 tables; 4 of them have 15 million records each, and the rest are master tables.
I've pulled them into HDFS, stored as ORC compressed with SnappyCodec, and created Hive external tables with the schema.
Now I'm firing a query which joins all 15 tables and selects the columns I need into a final flat table. More than 1.5 billion records are expected.
I have tuned Hive, YARN, and the MapReduce engine: parallel execution, vectorization, optimized joins, the small-table condition, heap size, etc.
The query has been running on the cluster (Hive on Tez) for 20 hours and has reached 90%, where the last reducer is running. It reached 90% long ago and has been stuck there for about 18 hours.
Am I doing this the right way?
If I understand correctly, you have effectively copied tables in their raw form from your RDBMS into Hadoop in order to create a flattened view in one or more new tables. You're using Hive to do this. All of this sounds fine.
There are many possibilities why this is taking so long, but several come to mind.
First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.
I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers Hive creates, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed separately from the others and then reassembled at the end.
If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.
First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.
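For example, a date-partitioned ORC table could be created and loaded roughly like this (shown here through the Hive JDBC driver; the host, table, and column names are placeholders, not taken from the question):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitionedTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver2-host:10000/default");
             Statement stmt = conn.createStatement()) {

            // A (year, month)-partitioned ORC table: each partition lands in its own
            // HDFS directory, so queries constrained by date read only those partitions.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS sales_flat (" +
                "  order_id BIGINT, customer_id BIGINT, amount DOUBLE" +
                ") PARTITIONED BY (yr INT, mth INT) " +
                "STORED AS ORC");

            // Dynamic partitioning lets the INSERT route rows to partitions by value.
            stmt.execute("SET hive.exec.dynamic.partition=true");
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            stmt.execute(
                "INSERT INTO TABLE sales_flat PARTITION (yr, mth) " +
                "SELECT o.order_id, o.customer_id, o.amount, " +
                "       year(o.order_date), month(o.order_date) " +
                "FROM raw_orders o");  // raw_orders is a placeholder source table
        }
    }
}
```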
The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.
This phenomenon is known as "data skew", so you want to choose partitions carefully to avoid skew. There are some options involving SKEWED BY and CLUSTERED BY (bucketing) that can help you get evenly sized or smaller data files, which you could consider.
Note that the raw import data should also be partitioned, since partitions act like indexes in an RDBMS and are therefore important for performance. In this case, choose partitions that use the keys that your larger query joins on. It is possible and common to have multiple partition levels, so a date-based top-level partition with a sub-partition on the join key could be helpful ... maybe ... it depends on your data.
We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to an RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. The explain output is hard to read, so get comfortable with the Hive documentation :-).
Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages), you can create intermediate tables with an additional Hive query that produces a partially transformed dataset before building the final one. This seems expensive, since you're adding an additional write and read of new tables, but in the case you describe it may be much faster overall. It's also sometimes useful to have intermediate tables just to test or sample data.
Writing Hive is a lot less like writing regular software: you can usually get the Hive query working pretty quickly, but getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

Task scheduling with Spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naïve Bayes training and prediction. I have an HBase region server running on a single machine, separate from the Spark cluster, which runs in fair scheduling mode, although HDFS is running on all machines.
While executing, I am seeing a strange task distribution in terms of the number of active tasks on the cluster. I observe that only one or at most two active tasks are running on one or two machines at any point in time while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count, distinct, etc. Why are all the nodes not being used for the large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions), and then looping through it and printing the number of elements in each of the arrays; a sketch of this check appears at the end of this answer. (This introduces communication, so don't leave it in your production code.)
Many of the API calls on RDDs have optional parameters for setting the number of partitions, and there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (though sometimes they will expose the need to rethink your algorithm).
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the driver.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
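Here is a minimal sketch of the partition check mentioned above, using the Java RDD API (the input path and partition count are placeholders):

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partition-check");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Ask for at least 8 partitions up front (placeholder path and count).
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input", 8);

            // glom() turns each partition into a single list, so the element counts
            // below show how evenly the data is spread. Debugging only: it pulls
            // per-partition sizes back to the driver.
            List<Integer> sizes = lines.glom()
                                       .map(List::size)
                                       .collect();
            System.out.println("partitions: " + sizes.size() + ", sizes: " + sizes);

            // If the distribution is badly skewed, repartition() reshuffles the data.
            JavaRDD<String> balanced = lines.repartition(8);
            System.out.println("after repartition: " + balanced.getNumPartitions());
        }
    }
}
```

In practice you would run this once while debugging, look at the printed per-partition sizes, and then remove the glom() check from the job.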

MapReduce performance speed-up check on a simple MongoDB installation with 2 secondaries and a primary node

I have a simple MongoDB installation with two secondary nodes and one primary. When I run a mapReduce query on a data size of 5 GB, it takes the same time as it did on a standalone MongoDB installation on one node. I am using the command line. Do I have to use any specific command to exploit the extra replica-set members for mapReduce?
Thank you in advance.
You can speed up your job if you can use the aggregation framework instead of mapReduce; the aggregation framework is a lot faster.
You can't really scale your operations using replica sets, since replica sets are for high availability and failover (plus data redundancy), not for scaling. You can run mapReduce or aggregation on a secondary: just connect to the secondary, specify rs.slaveOk(), and then run mapReduce/aggregate. But you cannot output the results to a collection in that case, since you cannot write to a secondary, so it has to return the results inline.
This moves the extra load off the primary, but it won't make the job faster per se. If you want to utilize multiple servers, you need to shard your database: by distributing the data over multiple shards/hosts, you automatically cause your mapReduce and/or aggregation queries to run over multiple servers. Even though there is a small penalty for managing the results (they still have to be merged), parallelizing the longest part of the job will likely more than offset the extra overhead.
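For reference, here is a rough sketch of running such an aggregation from the MongoDB Java driver, with reads directed at a secondary (the driver-side counterpart of rs.slaveOk() is a secondary read preference) and results returned inline. The connection string, database, collection, and field names are placeholders.

```java
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.gte;

import java.util.Arrays;

import org.bson.Document;

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class AggregateOnSecondary {
    public static void main(String[] args) {
        // Replica-set connection string (hosts and set name are placeholders).
        try (MongoClient client = MongoClients.create(
                "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")) {

            MongoCollection<Document> events = client.getDatabase("mydb")
                    .getCollection("events")
                    .withReadPreference(ReadPreference.secondaryPreferred());

            // The equivalent of a simple map-reduce "count per category" job,
            // expressed as an aggregation pipeline (field names are made up).
            for (Document doc : events.aggregate(Arrays.asList(
                    match(gte("value", 0)),
                    group("$category", sum("count", 1))))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```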
