Bulk loading key-value pair data into HBase

I am evaluating HBase for dealing with a very wide dataset with a variable number of columns per row. In its raw form, my data has a variable list of parameter names and values for each row. In its transformed form, it is available as key-value pairs.
I want to load this data into HBase. It is very easy to translate my key-value pair processed data into individual "put" statements to get the data in. However, I need to bulk load, as I have thousands of columns and millions of rows, leading to billions of individual key-value pairs and therefore billions of "put" statements. Also, the list of columns (a, b, c, d, ...) is not fully known ahead of time. I have investigated the following options so far:
importtsv: Cannot be used, because it requires the data to be pivoted from rows to columns ahead of time, with a fixed set of known columns to import.
Hive to generate HFiles: This option also requires the column names to be specified ahead of time, and each column in the Hive table to be mapped to a column in HBase.
My only option seems to be to parse a chunk of the data once, pivot it into a set of known columns, and bulk load that. This seems wasteful, as HBase is going to break it down into key-value pairs anyway. Is there a simpler, more efficient way of bulk loading the key-value pairs?
Raw data format:
rowkey1, {a:a1, b:b1}
rowkey2, {a:a2, c:c2}
rowkey3, {a:a3, b:b3, c:c3, d:d3}
Processed data format:
rowkey1, a, a1
rowkey1, b, b1
rowkey2, a, a2
rowkey2, c, c2
rowkey3, a, a3
rowkey3, b, b3
rowkey3, c, c3
rowkey3, d, d3

You almost assuredly want to use a custom M/R job + incremental loading (aka bulk loading).
The general process will be:
Submit an M/R job that has been configured using HFileOutputFormat.configureIncrementalLoad
Map over the raw data and write Puts for HBase (a sketch of this is given at the end of this answer)
Load the output of the job into the table using the following:
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /path/to/job/output
sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /path/to/job/output table-name-here
There are ways to do the load from Java, but it means impersonating HBase. The tricky part here is making sure that the files are owned by HBase and that the user running the incremental load is also HBase. This Cloudera blog post talks a bit more about those details.
In general I recommend taking a peek at this GH Repo which seems to cover the basics of the process.
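As a rough, hedged sketch of the map-and-configure steps above: the table name "my_table", the column family "cf", and the input/output paths are hypothetical, and it assumes the processed "rowkey, column, value" format from the question. Older HBase versions expose the same call as HFileOutputFormat.configureIncrementalLoad(job, hTable).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueBulkLoad {

  // Turns each "rowkey, column, value" line into a Put for a single (hypothetical) column family.
  public static class KvMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] CF = Bytes.toBytes("cf");

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\\s*,\\s*");
      if (parts.length != 3) return; // skip malformed lines
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(CF, Bytes.toBytes(parts[1]), Bytes.toBytes(parts[2]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "kv-bulk-load");
    job.setJarByClass(KeyValueBulkLoad.class);
    job.setMapperClass(KvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // becomes /path/to/job/output

    TableName table = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table htable = conn.getTable(table);
         RegionLocator locator = conn.getRegionLocator(table)) {
      // Sets the output format to HFileOutputFormat2 and wires in the TotalOrderPartitioner
      // and a sorting reducer based on the table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, htable, locator);
    }
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job writes HFiles instead of pushing Puts through the region servers; the chown + LoadIncrementalHFiles commands above then hand those files over to HBase.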

Related

Hive or HBase when we need to pull a larger number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you are accessing is NOT by itself the criterion for deciding between HBase and Hive.
Hive (SQL):
Use Hive when you have warehousing needs, you are good at SQL, and you don't want to write MapReduce jobs. One important point though: Hive queries get converted into corresponding MapReduce jobs under the hood, which run on your cluster and give you the result. Hive does the trick for you. But not every problem can be solved using HiveQL. Sometimes, if you need really fine-grained and complex processing, you might have to fall back to MapReduce.
HBase (NoSQL database):
You can use HBase to serve that purpose. If you have some data which you want to access in real time, you could store it in HBase.
An HBase get 'rowkey' is powerful when you know your access pattern.
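For illustration, a point lookup through the Java client looks roughly like this (table, column family and qualifier names are hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PointLookup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      // Single-row read by key: the access pattern HBase is optimized for.
      Get get = new Get(Bytes.toBytes("rowkey1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
      System.out.println(Bytes.toString(value));
    }
  }
}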
HBase follows the CP side of the CAP theorem:
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of date)
Availability:
Every request to a non-failing node in the system returns a response
Partition tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication is lost) and data is lost (a node is lost)
also have a look at this
It's very difficult to answer the question in one line.
HBase is a NoSQL database: you need to store your data denormalized, because HBase is very bad at joining tables.
Hive: You can store data in a similar (normalized) format in Hive, but you would only see benefits when doing batch processing.

HBase aggregation, Get and Put operations, bulk operations

I would like to know how I can map a value of a key.
I know that it can be done with Get and then Put operations. Is there any other way to do it efficiently? 'checkAndPut' is not very helpful.
Can it be done with something like:
(key, value) => value + g()
I have read the book HBase: The Definitive Guide, and it seems like a MapReduce job is interpreted as Put/Get operations on top of HBase. Does that mean it is not a 'bulk operation' (since it's an operation per key)?
How is Spark relevant here?
HBase has scans (1) to retrieve multiple rows, and MapReduce jobs can and do use this command (2).
For HBase, 'bulk' is mostly [or solely] 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to issuing Puts) (3).
Your task can be implemented as a MapReduce Job as well as a Spark app (4 being one of examples, maybe not the best one), or a Pig script, or a Hive query if you use HBase table from Hive (5); pick your poison.
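As a small illustration of (1), a key-range scan through the Java client might look like this (table name and key range are hypothetical; withStartRow/withStopRow assume an HBase 1.4+ client):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScan {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      // Scan a contiguous slice of the key space instead of one row at a time.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("rowkey1"))
          .withStopRow(Bytes.toBytes("rowkey3"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}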
If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
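A minimal sketch of that, assuming a hypothetical "counters" table with column family "cf" and qualifier "hits":

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterUpdate {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("counters"))) {
      // Atomically adds 5 to the stored value; no separate Get + Put round trip,
      // and no race with concurrent writers.
      Increment inc = new Increment(Bytes.toBytes("rowkey1"));
      inc.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("hits"), 5L);
      Result result = table.increment(inc);
      long newValue = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("hits")));
      System.out.println("new value: " + newValue);
    }
  }
}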
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows that you modify in each batch is relatively small compared to the total number of rows in your table.
IFF you expect to modify your entire table in each batch, then you should look at bulk loads. This will require you to write a job that reads your existing values from HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A bulk load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (memstore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note that you can also generate the new HFiles outside the HBase cluster (so as not to overload it) and then copy them over and issue the swap command.

Big data signal analysis: better way to store and query signal data

I am about to do some signal analysis with Hadoop/Spark and I need help on how to structure the whole process.
Signals are currently stored in a database, which we will read with Sqoop and transform into files on HDFS, with a schema similar to:
<Measure ID> <Source ID> <Measure timestamp> <Signal values>
where the signal values are just strings made of comma-separated floating point numbers.
000123 S001 2015/04/22T10:00:00.000Z 0.0,1.0,200.0,30.0 ... 100.0
000124 S001 2015/04/22T10:05:23.245Z 0.0,4.0,250.0,35.0 ... 10.0
...
000126 S003 2015/04/22T16:00:00.034Z 0.0,0.0,200.0,00.0 ... 600.0
We would like to write interactive/batch queries to:
apply aggregation functions over signal values
SELECT *
FROM SIGNALS
WHERE MAX(VALUES) > 1000.0
To select signals that had a peak over 1000.0.
apply aggregation over aggregation
SELECT SOURCEID, MAX(VALUES)
FROM SIGNALS
GROUP BY SOURCEID
HAVING MAX(MAX(VALUES)) > 1500.0
To select sources having at least a single signal that exceeded 1500.0.
apply user defined functions over samples
SELECT *
FROM SIGNALS
WHERE MAX(LOW_BAND_FILTER("5.0 KHz", VALUES)) > 100.0
to select signals that after being filtered at 5.0 KHz have at least a value over 100.0.
We need some help in order to:
Find the correct file format to write the signals data on HDFS. I thought of Apache Parquet. How would you structure the data?
Understand the proper approach to data analysis: is it better to create different datasets (e.g. processing data with Spark and persisting results on HDFS), or to try to do everything at query time from the original dataset?
Is Hive a good tool for queries such as the ones I wrote? We are running on Cloudera Enterprise Hadoop, so we can also use Impala.
In case we produce different derived datasets from the original one, how can we keep track of the lineage of the data, i.e. know how the data was generated from the original version?
Thank you very much!
1) Parquet, as a columnar format, is good for OLAP. Spark support for Parquet is mature enough for production use. I suggest parsing the string representing the signal values into the following data structure (simplified):
case class Data(id: Long, signals: Array[Double])
val df = sqlContext.createDataFrame(Seq(Data(1L, Array(1.0, 1.0, 2.0)), Data(2L, Array(3.0, 5.0)), Data(2L, Array(1.5, 7.0, 8.0))))
Keeping an array of doubles allows you to define and use UDFs like this:
def maxV(arr: mutable.WrappedArray[Double]) = arr.max
sqlContext.udf.register("maxVal", maxV _)
df.registerTempTable("table")
sqlContext.sql("select * from table where maxVal(signals) > 2.1").show()
+---+---------------+
| id| signals|
+---+---------------+
| 2| [3.0, 5.0]|
| 2|[1.5, 7.0, 8.0]|
+---+---------------+
sqlContext.sql("select id, max(maxVal(signals)) as maxSignal from table group by id having maxSignal > 1.5").show()
+---+---------+
| id|maxSignal|
+---+---------+
| 1| 2.0|
| 2| 8.0|
+---+---------+
Or, if you want some type-safety, using Scala DSL:
import org.apache.spark.sql.functions._
val maxVal = udf(maxV _)
df.select("*").where(maxVal($"signals") > 2.1).show()
df.select($"id", maxVal($"signals") as "maxSignal").groupBy($"id").agg(max($"maxSignal")).where(max($"maxSignal") > 2.1).show()
+---+--------------+
| id|max(maxSignal)|
+---+--------------+
| 2| 8.0|
+---+--------------+
2) It depends: if the size of your data allows you to do all the processing at query time with reasonable latency, go for it. You can start with this approach and build optimized structures for slow/popular queries later.
3) Hive is slow; it has been outpaced by Impala and Spark SQL. The choice is not always easy; we use a rule of thumb: Impala is good for queries without joins if all your data is stored in HDFS/Hive; Spark has higher latency, but its joins are reliable, it supports more data sources, and it has rich non-SQL processing capabilities (like MLlib and GraphX).
4) Keep it simple: store your raw data (master dataset) de-duplicated and partitioned (we use time-based partitions). If new data arrives in a partition and you already have downstream datasets generated, restart your pipeline for that partition.
Hope this helps
First, I believe Vitaliy's approach is very good in every aspect. (and I'm all for Spark)
I'd like to propose another approach, though. The reasons are:
We want to do Interactive queries (+ we have CDH)
Data is already structured
The need is to 'analyze' rather than to 'process' the data. Spark could be overkill because (a) the data being structured, we can form SQL queries faster, and (b) we don't want to write a program every time we want to run a query.
Here are the steps I'd like to go with:
Ingestion using sqoop to HDFS: [optionally] use --as-parquetfile
Create an external Impala table or an internal table as you wish. If you have not transferred the file as a Parquet file, you can do that during this step. Partition it, preferably by Source ID, since our groupings are going to happen on that column.
So, basically, once we've got the data transferred, all we need to do is to create an Impala table, preferably in Parquet format and partitioned by the column that we're going to use for grouping. Remember to run compute statistics after loading to help Impala run queries faster.
Moving data:
- if we need to generate feed out of the results, create a separate file
- if another system is going to update the existing data, then move the data to a different location while creating->loading the table
- if it's only about queries and analysis and getting reports (i.e, external tables suffice), we don't need to move the data unnecessarily
- we can create an external Hive table on top of the same data. If we need to run long-running batch queries, we can use Hive. It's a no-no for interactive queries, though. If we create any derived tables out of queries and want to use them through Impala, remember to run 'invalidate metadata' before running Impala queries on the Hive-generated tables
Lineage - I have not gone deeper into it, here's a link on Impala lineage using Cloudera Navigator

HBase bulk load usage

I am trying to import some HDFS data to an already existing HBase table.
The table I have was created with 2 column families, and with all the default settings that HBase comes with when creating a new table.
The table is already filled up with a large volume of data, and it has 98 online regions.
The row keys it has are of the form (simplified version):
2-CHARS_ID + 6-DIGIT-NUMBER + 3 X 32-CHAR-MD5-HASH.
Example of key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.
The data I want to import is on HDFS, and I am using a Map-Reduce process to read it. I emit Put objects from my mapper, which correspond to each line read from the HDFS files.
The existing data has keys which will all start with "XX181113".
The job is configured with:
HFileOutputFormat.configureIncrementalLoad(job, hTable)
Once I start the process, I see it configured with 98 reducers (equal to the online regions the table has), but the issue is that 4 reducers got 100% of the data split among them, while the rest did nothing.
As a result, I see only 4 folder outputs, which have a very large size.
Do these files correspond to 4 new regions which I can then import into the table? And if so, why only 4, when 98 reducers were created?
Reading HBase docs
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
confused me even more as to why I get this behaviour.
Thanks!
The number of reducers that actually receive data doesn't depend on the number of regions you have in the table, but rather on how the data falls into the regions (each region contains a range of keys). Since you mention that all your new data starts with the same prefix, it is likely that it only fits into a few regions.
You can pre-split your table so that the new data is divided among more regions.
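For example, splitting the existing region(s) that the new key prefix will land in, via the Java Admin API (the table name and split points are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplit {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("my_table");
      // Ask HBase to split at explicit points inside the key range the new
      // "XX181113..." data will occupy, so the bulk-load output (and the reducers
      // producing it) is spread over more than a handful of regions.
      admin.split(table, Bytes.toBytes("XX181113" + "250000"));
      admin.split(table, Bytes.toBytes("XX181113" + "500000"));
      admin.split(table, Bytes.toBytes("XX181113" + "750000"));
    }
  }
}

For a brand-new table you could instead pass an array of split keys to Admin.createTable.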

How do I diff two tables in HBase

I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?
My use case is below:
What I am trying to do is create one table that will be my expected output. This table will contain all of the data that I am expecting to be created by executing the team's code against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.
I don't know of anything out of the box, but you can write a multi-table map/reduce job.
The mappers will just emit keys from each table (with a value being all the HBase key-values plus a table name).
The reducer can make sure it has 2 records for each key and compare the key-values. When there's only one record, it can see which table is out of sync.
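One possible shape for that job, as a sketch only: the table names, the per-row digest, and the output handling are all hypothetical choices, and it assumes the multi-scan TableMapReduceUtil.initTableMapperJob / MultiTableInputFormat support in HBase 1.x/2.x.

import java.io.IOException;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TableDiff {

  // Emits (row key, "tableName:digest-of-row-contents") for every row of both tables.
  public static class DiffMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context ctx)
        throws IOException, InterruptedException {
      String tableName = ((TableSplit) ctx.getInputSplit()).getTable().getNameAsString();
      try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (Cell cell : result.rawCells()) {
          md.update(CellUtil.cloneFamily(cell));
          md.update(CellUtil.cloneQualifier(cell));
          md.update(CellUtil.cloneValue(cell));
        }
        ctx.write(new Text(Bytes.toStringBinary(rowKey.get())),
                  new Text(tableName + ":" + Bytes.toStringBinary(md.digest())));
      } catch (Exception e) {
        throw new IOException(e);
      }
    }
  }

  // Flags rows that appear in only one table or whose digests disagree.
  public static class DiffReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text rowKey, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      Map<String, String> digests = new HashMap<>();
      for (Text v : values) {
        String[] parts = v.toString().split(":", 2);
        digests.put(parts[0], parts[1]);
      }
      boolean missing = digests.size() < 2;
      boolean differs = digests.values().stream().distinct().count() > 1;
      if (missing || differs) {
        ctx.write(rowKey, new Text("MISMATCH " + digests));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-diff");
    job.setJarByClass(TableDiff.class);

    List<Scan> scans = new ArrayList<>();
    for (String tableName : Arrays.asList("expected_output", "actual_output")) {
      Scan scan = new Scan();
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(tableName));
      scans.add(scan);
    }
    TableMapReduceUtil.initTableMapperJob(scans, DiffMapper.class, Text.class, Text.class, job);
    job.setReducerClass(DiffReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0])); // mismatch report
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hashing each row keeps the reducer trivial; for very wide rows you may prefer to emit and compare the individual key-values instead, as described above.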
I know this question is a little old, but how large are the tables? If they will both fit into memory, you could load them into Pig using HBaseStorage, then use Pig's built-in DIFF function to compare the resulting bags.
This will work even with large tables that don't fit into memory, according to the docs, but it will be extremely slow.
dataset1 = LOAD '/path/to/dataset1' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset2 = LOAD '/path/to/dataset2' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset3 = COGROUP dataset1 BY (a, b,c, d), dataset2 BY (a, b, c, d);
dataset4 = FOREACH dataset3 GENERATE DIFF(dataset1,dataset2);
