Trying to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to the HBase and Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and the MapReduce operation on it is correct.
Some background:
1. We have an HBase table that is about 1TB and exceeds 100 million records.
2. It has 3 region servers and each region server contains about 80 regions, making 240 regions in total.
3. The records in the table should be pretty uniformly distributed across the regions, from what I know.
And what I'm trying to achieve is that I could filter out rows based on some column values, and export those rows to HDFS filesystem or something like that.
For example, we have a column named "type" and it might contain value 1 or 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that have records of type 1, 2, 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiments, and could get the result I want. But I'm not sure I understand the behavior of HBase TableMapper and Scan, and that is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I would take the official RowCounter implementation as an example, and I would like to confirm my knowledge is correct.
So my questions about HBase with MapReduce are:
1. In the simplest form of RowCounter (without any optional arguments), it is actually a full table scan: HBase iterates over all records in the table and emits each row to the map method in RowCounterMapper. Is this correct?
2. The TableMapper divides the work based on how many regions the table has. For example, if the table has only 1 region, there will be only 1 map task, which is effectively a single thread and does not use any of the parallel processing of our Hadoop cluster. Is this correct?
3. If the above is correct, is it possible to configure HBase to spawn multiple tasks for a region? For example, could a RowCounter on a table with only 1 region still run 10 or 20 tasks and count the rows in parallel?
Since TableMapper also depends on the Scan operation, I would also like to confirm my understanding of Scan and its performance.
1. If I use setStartRow / setStopRow to limit the scope of my dataset, then because the rowkey is indexed it does not hurt performance, since it does not trigger a full table scan.
2. In our case, we might need to filter our data based on modified time, so we might use scan.setTimeRange() to limit the scope of our dataset. Since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over simply filtering in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export. We have the following two approaches, but are not sure which one is better.
1. Use the MapReduce approach described above. But we are not sure whether the parallelism will be bound by how many regions the table has, i.e. the concurrency never exceeds the region count, and we could not improve performance unless we increase the number of regions.
2. Maintain a rowkey list in a separate place (possibly on HDFS), use Spark to read the file, then fetch each record with a simple Get operation. All the concurrency occurs on the Spark / Hadoop side.
I would like some suggestions from this community about which solution is better; it would be really helpful. Thanks.

Seems like you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without adding region servers won't really speed up the job. I think 80 regions per RS for that table is decent enough.
I am assuming you are going to use TableInputFormat. It runs 1 mapper per region and applies server-side filters based on the Scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. Many other factors matter, such as the number of RS, the RAM and disk on each RS, and how uniformly the data is distributed.
In general, I would go with #1, since you just need to prepare a Scan object and HBase will take care of the rest.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
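To make the point about regions versus cluster capacity concrete, here is a rough back-of-the-envelope model in Python. It assumes one map task per region (the TableInputFormat behaviour described above), an evenly split table, and tasks that run in "waves" limited by the number of concurrent task slots. The function name, the slot counts and the 240-minute figure are made up for illustration, and the model ignores region server I/O limits, which in practice often dominate.

```python
import math

def estimate_job_minutes(regions, concurrent_slots, single_thread_minutes):
    """Hypothetical model: one map task per region, tasks run in waves."""
    # Each task scans an equal share of the table (assumes uniform regions).
    minutes_per_task = single_thread_minutes / regions
    # Only `concurrent_slots` tasks can run at once, so the rest must wait.
    waves = math.ceil(regions / concurrent_slots)
    return waves * minutes_per_task

# Same cluster capacity, twice the regions: no improvement in this model.
print(estimate_job_minutes(240, 24, 240))
print(estimate_job_minutes(480, 24, 240))
# Same regions, twice the capacity: the estimate halves.
print(estimate_job_minutes(240, 48, 240))
```

In this model, splitting the table into more regions only makes each task smaller; the total time drops only when more tasks can actually run at the same time.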


How do we optimise the Spark job if the base table has 130 billion records

We are joining multiple tables and doing complex transformations and enrichments.
The base table will have around 130 billion records. How can we optimise the Spark job when Spark filters all the records, keeps them in memory, and does the enrichments via left outer joins with other tables? Currently the Spark job runs for more than 7 hours. Can you suggest some techniques?
Here is what you can try:
Partition the base tables you query on a specific column that you use when joining, such as Department or Date. If the underlying table is a Hive table, you can also try bucketing.
Try an optimised join strategy that suits your requirements, such as sort-merge join or hash join.
File format: use the Parquet file format, as it is often faster than ORC for analytical queries in Spark and it stores data in a columnar format.
If your query has multiple steps and some steps are reused, try caching, as Spark supports memory and disk caching.
Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how the data is distributed. Try various configurations and see what works best for you.
Spark might perform poorly if there is large skew in the data. If that is the case, you might need further optimisation to handle it.
Apart from the techniques mentioned above, you can try the options below to optimize your job.
1. You can partition your data after inspecting its fields. The most common partitioning columns are things like date columns, region ID, or country code. Once the data is partitioned, you can explain your DataFrame with df.explain() and check whether it is using PartitioningAwareFileIndex.
2. Try tuning the Spark settings and cluster configuration to scale with the input data volume.
Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains from changing this.
Use an appropriate number of executors, cores, and executor memory based on your compute needs.
Try analyzing the Spark history to identify the stages that consume significant time. This is a good place to start debugging your job.
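To see why spark.sql.files.maxPartitionBytes changes the task count, here is a simplified Python model of how Spark SQL sizes file splits. It follows the split-size rule as I understand it (the smaller of maxPartitionBytes and the larger of the file open cost and the bytes per core), but it ignores the packing of small leftover splits into shared partitions, so treat the results as rough estimates. The function name and the default_parallelism value are made up for illustration; the 128 MB and 4 MB defaults are Spark's documented defaults for spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.

```python
import math

MB = 1024 * 1024

def estimate_input_partitions(file_sizes, max_partition_bytes=128 * MB,
                              open_cost_bytes=4 * MB, default_parallelism=8):
    """Simplified estimate of the number of read tasks for splittable files."""
    # Spark pads every file with the "open cost" when sizing splits.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    # Each file is cut into chunks of at most max_split bytes.
    return sum(math.ceil(size / max_split) for size in file_sizes)

files = [1024 * MB] * 10  # hypothetical input: ten 1 GB files
for setting_mb in (128, 256, 512):
    print(setting_mb, estimate_input_partitions(files, setting_mb * MB))
```

Fewer, larger partitions mean less scheduling overhead per byte read, which may explain the gains reported above. Partitions that are too large can instead cause memory pressure, so it is worth testing more than one value.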

Hive query joining 15 tables is expected to generate 1 billion records, on 3 datanodes with 16GB RAM each. Is this the right way to do it?

My name is Vitthal.
The Hortonworks HDP 2.4 Cluster on Amazon is 3 Datanodes, Masters on different Instances.
7 Instances 16GB RAM each.
Total 1TB HDD Space
3 Data Nodes
Hadoop version 2.7
I have pulled data from Postgres into Hadoop Distributed Environment.
The data is 15 tables; 4 of them have 15 million records, and the rest are master tables.
I've pulled them into HDFS, stored as ORC with SnappyCodec compression, and created Hive external tables with schema.
Now I'm firing a query which joins all 15 tables and selects the columns I need into a final flat table. More than 1.5 billion records are expected.
I have optimized Hive, YARN, and the MapReduce engine, viz. parallel execution, vectorization, optimized joins, small table condition, heap size, etc.
The query has been running on the cluster (Hive on Tez) for 20 hours and has reached 90%, where the last reducer is running. It reached 90% long ago; it has been stuck at 90% for about 18 hours.
Am I doing it the right way?
If I understand correctly, you have effectively copied tables in their raw form from your RDBMS into Hadoop in order to create a flattened view in one or more new tables. You're using Hive to do this. All of this sounds fine.
There are many possibilities why this is taking so long, but several come to mind.
First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.
I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers are created by Hive, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed distinctly from others, then reassembled at the end.
If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.
First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.
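As a small illustration of the date-based partition values mentioned above, here is a Python sketch that derives a year+month+day or year+month value from a record's date. The function name and the granularity argument are hypothetical; in Hive itself the value would normally be computed in the INSERT query or by the ingest job.

```python
from datetime import date

def partition_value(d, granularity="day"):
    """Build a date-based partition value such as 20160417 or 201604."""
    if granularity == "day":
        return d.strftime("%Y%m%d")
    if granularity == "month":
        return d.strftime("%Y%m")
    raise ValueError("granularity must be 'day' or 'month'")

print(partition_value(date(2016, 4, 17)))           # prints 20160417
print(partition_value(date(2016, 4, 17), "month"))  # prints 201604
```

Coarser values such as year+month give fewer, larger partitions, which tends to suit long histories better.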
The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.
This phenomenon is known as "data skew", so you want to choose partitions carefully to avoid skew. There are some options involving SKEWED BY and CLUSTERED BY that can help produce evenly sized or smaller data files, which you could consider.
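The customer_id story above can be reproduced with a toy scheduling model in Python. It assumes each free container simply picks up the next partition in the given order and that a partition cannot be split across containers. The function name and the minute figures are invented for illustration.

```python
import heapq

def makespan(partition_minutes, containers):
    """Total wall-clock time when each free container takes the next partition."""
    finish_times = [0.0] * containers  # when each container next becomes free
    heapq.heapify(finish_times)
    for minutes in partition_minutes:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + minutes)
    return max(finish_times)

# 1000 minutes of total work on 10 containers in both cases.
skewed = [1] * 100 + [450, 450]  # many small customers, then two huge ones
even = [10] * 100                # the same work in equal partitions
print(makespan(skewed, 10))
print(makespan(even, 10))
```

With the skewed layout, most containers sit idle while the two huge partitions finish, so the job takes several times longer even though the total work is identical.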
Note that the raw import data should also be partitioned, as partitions act like indexes in an RDBMS and so are important for performance. In this case, choose partitions that use the keys your larger query joins on. It is possible and common to have multiple levels of partitioning, so a date-based top-level partition with a sub-partition on the join key could be helpful ... maybe ... it depends on your data.
We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. It's hard to read the explain output, so get comfortable with the Hive documentation :-).
Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages) you can create intermediate tables with an additional Hive query that runs to create a partially transformed dataset before building the final one. This seems expensive since you're adding an additional write, and read of new tables, but in the case you describe it may be much faster overall. Also, it's sometimes useful to have intermediate tables just to test or sample data.
Writing Hive is a lot less like writing regular software -- you can get the Hive query done pretty quickly in most cases. Getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

Task scheduling with spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have the HBase region server running on a single machine, separate from the Spark cluster, which runs in fair scheduling mode; HDFS is running on all machines.
While executing, I am seeing a strange task distribution in terms of the number of active tasks on the cluster. Only one or at most two tasks are active on one or two machines at any point in time, while the other machines sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count, distinct, etc. Why are all nodes not being used for the large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
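The partition check described above can be sketched in plain Python. The function takes a list of lists shaped like the result of rdd.glom().collect() in PySpark (one inner list per partition) and reports how unevenly the elements are spread. The function name and the skew_ratio measure are my own, not part of Spark.

```python
def partition_report(partitions):
    """Summarise partition sizes from glom()-style output (a list of lists)."""
    sizes = [len(p) for p in partitions]
    if not sizes:
        return {"num_partitions": 0, "sizes": [], "skew_ratio": 0.0}
    mean = sum(sizes) / len(sizes)
    # 1.0 means perfectly even; large values mean one partition dominates.
    skew_ratio = max(sizes) / mean if mean else 0.0
    return {"num_partitions": len(sizes), "sizes": sizes, "skew_ratio": skew_ratio}

# Stand-in for rdd.glom().collect() on a small, unbalanced RDD.
print(partition_report([[1, 2, 3, 4], [5, 6], [7], [8]]))
```

On a real cluster, collecting whole partitions to the driver is expensive. A lighter variant is to count inside each partition, for example with mapPartitions, and collect only the counts.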

How to decide on the number of partitions required for input data size and cluster resources?

My use case is as follows.
Read input data from the local file system using sparkContext.textFile(input path).
Partition the input data (80 million records) using RDD.coalesce(numberOfPArtitions) before submitting it to the mapper/reducer function. Without coalesce() or repartition() on the input data, Spark executes really slowly and fails with an out-of-memory exception.
The issue I am facing is deciding the number of partitions to apply to the input data. The input data size varies every time, so hard-coding a particular value is not an option. Spark performs really well only when a certain optimum number of partitions is applied, and finding it takes lots of iterations (trial and error), which is not an option in a production environment.
My question: is there a rule of thumb for deciding the number of partitions based on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated.
I am using spark 1.0 on yarn.
Two notes from Tuning Spark in the Spark official documentation:
1- In general, we recommend 2-3 tasks per CPU core in your cluster.
2- Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
These are two rules of thumb that help you estimate the number and size of partitions. So it's better to have small tasks (ones that complete in a few hundred ms).
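Since the input size varies from run to run, the two rules above can be turned into a small calculation done at job start. This Python sketch takes the larger of a core-based count (2-3 tasks per core) and a size-based count. The function name and the 128 MB target partition size are my assumptions, not values from the Spark documentation.

```python
import math

MB = 1024 * 1024

def suggest_partitions(num_executors, cores_per_executor, input_bytes=0,
                       tasks_per_core=3, target_partition_bytes=128 * MB):
    """Rule-of-thumb partition count from cluster size and input size."""
    by_cores = num_executors * cores_per_executor * tasks_per_core
    # Keep individual partitions from growing too large on big inputs.
    by_size = math.ceil(input_bytes / target_partition_bytes)
    return max(by_cores, by_size)

print(suggest_partitions(10, 4))                   # prints 120
print(suggest_partitions(10, 4, 64 * 1024 * MB))   # prints 512
```

The result can be passed to repartition() or used as minPartitions when reading, and then adjusted after looking at the task durations.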
Determining the number of partitions is a bit tricky. By default Spark will try to infer a sensible number of partitions. Note: if you are using the textFile method with text compressed in a non-splittable codec such as gzip, Spark cannot split the file and you will need to repartition (it sounds like this might be what's happening?). With non-compressed data loaded via sc.textFile you can also specify a minimum number of partitions (e.g. sc.textFile(path, minPartitions)).
By default the coalesce function can only reduce the number of partitions, so you should consider using the repartition() function.
As far as choosing a "good" number, you generally want at least as many partitions as executors, for parallelism. There is already some logic that tries to determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism.
I assume you know the size of the cluster going in; you can then try to partition the data into some multiple of that and use RangePartitioner to partition the data roughly equally. Dynamic partitions are created based on the number of blocks on the filesystem, and the overhead of scheduling so many tasks is mostly what kills the performance.
import org.apache.spark.RangePartitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed before Spark 1.3)
val file = sc.textFile("<my local path>")
// RangePartitioner needs a key-value RDD, so key each line (here by the line itself)
val keyed = file.map(line => (line, 1))
val data = keyed.partitionBy(new RangePartitioner(3, keyed))
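For intuition about what range partitioning does, here is a plain-Python sketch of the idea: pick upper bounds from a sorted sample of keys, then send each key to the first range whose bound is at least that key. This only illustrates the technique and is not Spark's implementation; both function names are hypothetical.

```python
import bisect

def range_bounds(sample_keys, num_partitions):
    """Pick num_partitions - 1 upper bounds from a sample of the keys."""
    ordered = sorted(sample_keys)
    if len(ordered) < num_partitions:
        raise ValueError("need at least one sample key per partition")
    step = len(ordered) / num_partitions
    # The last key of each equal-sized slice becomes that range's upper bound.
    return [ordered[int(step * i) - 1] for i in range(1, num_partitions)]

def partition_for(key, bounds):
    """Index of the first range whose upper bound is >= key."""
    return bisect.bisect_left(bounds, key)

bounds = range_bounds(range(1, 10), 3)
print(bounds)                                          # prints [3, 6]
print([partition_for(k, bounds) for k in range(1, 10)])
```

Because the bounds come from the data itself, each range receives roughly the same number of keys, provided the sample is representative.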

Amazon EMR not utilizing all the nodes

I am using 4 core nodes..
I am using hive to run queries on a table.
Various queries seem to be under utilizing the capacity.
My table consists of 8 integer fields and about 1000 rows.
queries of the form
select avg(col1-col2) from tbl;
select count(*) from tbl;
and every other query I tried
are producing
number of reducers=1,number of mappers=1
I have tried using set mapred.reduce.tasks=4;
but it doesn't work.
The weirdest thing is that when I use mapred.job.tracker=local, which means one map and one reduce on the local node itself, the task finishes twice as fast.
All the reduce/map slots except one are open all the time.
Why isn't adding capacity even slightly improving execution time?
Is my data sample so small that increasing capacity doesn't matter, and localizing the mapping and reduction actually improves the time?
The reason you are getting a single mapper is that your table is so small. I'm assuming your 1000-row table is one file, which is much smaller than your HDFS block size. Try a million-row table or larger and you will start seeing it use multiple mappers. The answers to this question have some more information on how the number of mappers is chosen.
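As a rough illustration of the block-size argument, this Python sketch estimates the mapper count as one mapper per HDFS block, with a minimum of one. The 128 MB block size and the 48 KB file size are assumptions for the example, and Hive's real split logic has more settings than this.

```python
import math

MB = 1024 * 1024

def estimate_mappers(file_bytes, block_bytes=128 * MB):
    """Approximate mapper count: one per block, never fewer than one."""
    return max(1, math.ceil(file_bytes / block_bytes))

# 1000 rows of 8 integers is on the order of tens of KB as text.
print(estimate_mappers(48 * 1024))   # prints 1
print(estimate_mappers(1024 * MB))   # prints 8
```

By this estimate, a table has to grow past one block before a second mapper appears.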
The reason you are getting a single reducer is a combination of two things. First, you are working with a tiny amount of data (for Hive) so you end up with one reducer. Second, some queries (like COUNT(*) FROM some_table) must have one reducer (see the question here)
You nailed it on why running the job locally is faster. 1000 row tables are great for testing the logic of your queries, but not for determining things like runtime. Running Hive on a cluster instead of locally will probably only start being better once you have data on the order of GBs. Hive is definitely not the "right tool for the job" until you get into queries that touch at least 10's of GBs, though 100's of GBs or TBs (or more) is easier to justify.
