Hive over HBase for deep analytical queries - hadoop

I am able to connect to and access an existing HBase table with Hive (using the Hive HBase storage handler).
The interface does not seem very powerful, though. Can it be used for large-scale analytical data processing?

No, it can't. A WHERE clause typically ends up as a full scan of the HBase table, and full scans are extremely slow. Please check https://phoenix.apache.org/ as an alternative.
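As a rough illustration (a minimal sketch; the table and column names are placeholders, not taken from the question), Phoenix can map a SQL view onto an existing HBase table, and a predicate on the row key is then executed as a range scan instead of a full scan:
-- Map an existing HBase table into Phoenix (placeholder names)
CREATE VIEW "my_hbase_table" (
    pk VARCHAR PRIMARY KEY,       -- the HBase row key
    "cf"."event_time" VARCHAR,
    "cf"."value" VARCHAR
);

-- The row-key predicate becomes a range scan, not a full table scan
SELECT "value"
FROM "my_hbase_table"
WHERE pk >= 'device-001' AND pk < 'device-002';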

Apache Phoenix is better suited for querying HBase.
You can also query HBase using Hive, but then your query gets converted into a MapReduce job, which takes much longer than Phoenix.
PS: you can still use Hive for big-data analytics even if you are using HBase.

A good way to run analytic queries over HBase faster is to combine HBase with Hive and Impala.
As an example, consider the following scenario:
a Kafka producer receives thousands of signals from IoT devices over a socket in JSON format. A Spark Streaming consumer processes these signals and writes them into an HBase table.
HBase table and data example
$ hbase shell
hbase> create_namespace 'device_iot'
hbase> create 'device_iot:device', 'data'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:deviceID', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23c'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:temperature', '12'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:latitude', '52.14691120000001'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:longitude', '11.658838699999933'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:time', '2019-08-14T23:30:30000'
Hive table on top of HBase table
CREATE EXTERNAL TABLE t_iot_devices (
id string, deviceID string, temperature int, latitude double, longitude double, time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:deviceID,data:temperature,data:latitude,data:longitude,data:time")
TBLPROPERTIES("hbase.table.name" = "device_iot:device");
Querying in Impala
impala> invalidate metadata;
SELECT deviceID, max(temperature) AS maxTemperature
FROM t_iot_devices
GROUP BY deviceID;
+--------------------------------------+----------------+
| deviceid                             | maxtemperature |
+--------------------------------------+----------------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 39             |
+--------------------------------------+----------------+
SELECT deviceID, substr(time,1,10) AS day, max(temperature) AS highest
FROM t_iot_devices
WHERE substr(time,1,10) = '2019-07-07'
GROUP BY deviceID, substr(time,1,10);
+--------------------------------------+------------+---------+
| deviceid                             | day        | highest |
+--------------------------------------+------------+---------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 2019-07-07 | 34      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 2019-07-07 | 35      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 2019-07-07 | 22      |
+--------------------------------------+------------+---------+

Related

Apache Phoenix queries taking too long

I am using Apache Phoenix to run some queries, but their performance looks bad compared to what I was expecting. As an example, consider a table like:
CREATE TABLE MY_SHORT_TABLE (
MPK BIGINT not null,
... 38 other columns ...
CONSTRAINT pk PRIMARY KEY (MPK, 4 other columns))
SALT_BUCKETS = 4;
which has around 460,000 rows, a query like:
select sum(MST.VALUES),
MST.III, MST.BBB, MST.DDD, MST.FFF,
MST.AAA, MST.CCC, MST.EEE, MST.HHH
from
MY_SHORT_TABLE MST
group by
MST.AAA, MST.BBB, MST.CCC, MST.DDD,
MST.EEE, MST.FFF, MST.HHH, MST.III
is taking around 9-11 seconds to complete.
On a table with a similar structure but close to 3,400,000 rows, the same kind of query takes around 45 seconds to complete.
I have 5 hosts in this cluster (1 Master and 4 RegionServer+Phoenix Query Server), each with 6 vCPUs and 32 GB RAM.
The configuration I am using in this example is:
HBase RegionServer Maximum Memory=8192(8GB)
HBase Master Maximum Memory=8192(8GB)
Number of Handlers per RegionServer=30
Memstore Flush Size=128MB
Maximum Record Size=1MB
Maximum Region File Size=10GB
% of RegionServer Allocated to Read Buffers=40%
% of RegionServer Allocated to Write Buffers=40%
HBase RPC Timeout=6min
Zookeeper Session Timeout=6min
Phoenix Query Timeout=6min
Number of Fetched Rows when Scanning from Disk=1000
dfs.client.read.shortcircuit=true
dfs.client.read.shortcircuit.buffer.size=131072
phoenix.coprocessor.maxServerCacheTimeToLiveMs=30000
I am using HDP 2.4.0, so Phoenix 4.4.
The explain plan for the example query is the following:
+------------------------------------------+
| PLAN |
+------------------------------------------+
| CLIENT 8-CHUNK PARALLEL 8-WAY FULL SCAN OVER MY_SHORT_TABLE |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [AAA, BBB, CCC, DDD, EEE, FFF, HHH |
| CLIENT MERGE SORT |
+------------------------------------------+
Also, I have created an index as:
CREATE INDEX i1DENORM2T1 ON MY_SHORT_TABLE (HHH)
INCLUDE ( AAA, BBB, CCC, DDD, EEE, FFF, HHH, VALUES ) ;
This index changes the query execution plan to:
+------------------------------------------+
| PLAN |
+------------------------------------------+
| CLIENT 4-CHUNK PARALLEL 4-WAY FULL SCAN OVER I1DENORM2T1 |
| SERVER AGGREGATE INTO DISTINCT ROWS BY ["AAA", "BBB", "DDD", "EEE", "FFF", "HHH |
| CLIENT MERGE SORT |
+------------------------------------------+
However, the performance still does not match my expectations (around 3-4 seconds).
What is wrong in the above configuration, or what should I change in order to get better performance?
Thanks in advance.

Apply a UDF (for which another Spark job is needed) to each element of an array column in Spark SQL

The structure of a hive table (tbl_a) is as follows:
name | ids
A | [1,7,13,25168,992]
B | [223, 594, 3322, 192928]
C | null
...
Another Hive table (tbl_b) has the corresponding mapping from id to new_id. This table is big, so it cannot be loaded into memory:
id | new_id
1 | 'aiks'
2 | 'ficnw'
...
I intend to create a new Hive table with the same structure as tbl_a, but with the array of id converted into the array of new_id:
name | ids
A | ['aiks','fsijo','fsdix','sssxs','wie']
B | ['cx', 'dds', 'dfsexx', 'zz']
C | null
...
Could anyone give me some idea of how to implement this in Spark SQL or with Spark DataFrames? Thanks!
This is an expensive operation, but you can do it with coalesce, explode and a left outer join, as follows (tbl_a and tbl_b are assumed to be DataFrames, e.g. loaded with spark.table):
import org.apache.spark.sql.functions._   // coalesce, explode, lit, array, collect_list
import spark.implicits._                  // $"..." column syntax

tbl_a
  .withColumn("ids", coalesce($"ids", array(lit(null).cast("int")))) // keep names whose ids array is null
  .select($"name", explode($"ids").alias("id"))                      // one row per array element
  .join(tbl_b, Seq("id"), "leftouter")                               // look up new_id for each id
  .groupBy("name").agg(collect_list($"new_id").alias("ids"))         // rebuild the array per name
  .show

Hive with data that does not have a delimiter

I have some data in HDFS that does not have a delimiter. That is, the individual data fields are identified by their position in the line.
For instance,
CountryXTOWNYCRIMEVALUEZ
So here the country would be positions 0 to 7, the town 8 to 12, and the crime statistic would be 13 to 23.
Is there a way to import data organised like this directly into Hive? I suppose a workable approach would be to write a MapReduce job that delimits the data, but I was wondering if there is a Hive feature that can be used to import the data directly.
Use a RegexSerDe:
create external table mytable
(
country string
,town string
,crime_statistic string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '^(.{8})(.{5})(.*)$'
)
location '/...location of the data...'
;
select * from mytable
;
+----------+-------+-----------------+
| country  | town  | crime_statistic |
+----------+-------+-----------------+
| CountryX | TOWNY | CRIMEVALUEZ     |
+----------+-------+-----------------+
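On more recent Hive versions the same thing can be done with the built-in RegexSerDe class (whether it is available depends on your Hive version; only the SerDe class name in the DDL changes):
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'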

How does data distribution happen in bucketing in Hive?

I have created a table as below with 3 buckets, and loaded some data into it.
create table testBucket (id int,name String)
partitioned by (region String)
clustered by (id) into 3 buckets;
I have set the bucketing property as well: set hive.enforce.bucketing=true;
When I listed the table files in HDFS, I could see that 3 files were created, as I had specified 3 buckets.
But the data got loaded into only one file, and the other 2 files are just empty. So I am confused: why did my data get loaded into only one file?
Could someone please explain to me how data distribution happens in bucketing?
[test#localhost user]$ hadoop fs -ls /user/hive/warehouse/database2.db/buckettab/region=USA
Found 3 items
-rw-r--r-- 1 user supergroup 38 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000000_0
-rw-r--r-- 1 user supergroup 0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000001_0
-rw-r--r-- 1 user supergroup 0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000002_0
Bucketing is a method to evenly distribute the data across many files. You create multiple buckets and then place each record into one of the buckets based on some logic, usually a hashing algorithm.
Hive's bucketing feature can be used to distribute/organize the table/partition data into multiple files such that similar records are present in the same file. While creating a Hive table, the user specifies the columns to be used for bucketing and the number of buckets to store the data in. Which records go to which bucket is decided by the hash value of the bucketing columns:
[Hash(column(s))] MOD [Number of buckets]
The hash value is calculated differently for different column types. For int columns, the hash value is equal to the int value itself. For string columns, the hash value is computed from the characters of the string.
Data for each bucket is stored in a separate file under the table directory on HDFS. Inside each bucket, we can also define the ordering of the data by providing a SORTED BY column while creating the table.
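For instance (a small sketch, not taken from the question; the table and columns are made up), a bucketed and sorted table could be declared as:
CREATE TABLE page_views (user_id BIGINT, url STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;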
Let's see an example
Creating a Hive table using bucketing
To create a bucketed table, we need to use the CLUSTERED BY clause to define the bucketing columns and provide the number of buckets. The following query creates a table Employee bucketed on the ID column into 5 buckets.
CREATE TABLE Employee(
ID BIGINT,
NAME STRING,
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Inserting data into a bucketed table
We have the following data in the Employee_old table.
0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1 | Sudip | 34 | 62000 | HR |
| 2 | Suresh | 45 | 76000 | FINANCE |
| 3 | Aarti | 25 | 37000 | BIGDATA |
| 4 | Neha | 27 | 39000 | FINANCE |
| 5 | Rajesh | 29 | 59000 | BIGDATA |
| 6 | Suman | 37 | 63000 | HR |
| 7 | Paresh | 42 | 71000 | BIGDATA |
| 8 | Rami | 33 | 56000 | HR |
| 9 | Arpit | 41 | 46000 | HR |
| 10 | Sanjeev | 51 | 99000 | FINANCE |
| 11 | Sanjay | 32 | 67000 | FINANCE |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from the table Employee_old and insert it into our bucketed table Employee.
We need to set the property hive.enforce.bucketing to true while inserting data into a bucketed table; this enforces bucketing during the insert.
Set the property
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true;
Insert data into Bucketed table employee
0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;
Verify the Data in Buckets
Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.
Name Type
000000_0 file
000001_0 file
000002_0 file
000003_0 file
000004_0 file
Each file represents a bucket. Let us see the contents of these files.
Content of 000000_0
All records with Hash(ID) mod 5 == 0 go into this file.
5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE
Content of 000001_0
All records with Hash(ID) mod 5 == 1 go into this file.
1,Sudip,34,62000,HR
6,Suman,37,63000,HR
11,Sanjay,32,67000,FINANCE
Content of 000002_0
All records with Hash(ID) mod 5 == 2 go into this file.
2,Suresh,45,76000,FINANCE
7,Paresh,42,71000,BIGDATA
Content of 000003_0
All records with Hash(ID) mod 5 == 3 go into this file.
3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR
Content of 000004_0
All records with Hash(ID) mod 5 == 4 go into this file.
4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR
I feel all the IDs MOD 3 give the same value for the USA partition (region=USA) in your sample data.
Take a look at the Hive Language Manual.
It states:
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.
In your case, because you are clustering by id, which is an int, and you are bucketing into only 3 buckets, it looks like all the values are being hashed into one of those buckets. To confirm that bucketing is working, add some rows with ids different from the ones you already have in the file, increase the number of buckets, and see whether they get hashed into separate files this time around.
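As a quick sanity check (a sketch, assuming Hive's built-in hash() UDF, which for an int column returns the value itself and matches the hash used for bucketing), you can preview which bucket each row should land in:
SELECT id, pmod(hash(id), 3) AS expected_bucket
FROM testBucket
WHERE region = 'USA';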

Drill not showing Hive or HBase tables

I've created both an HBase and a Hive table to store some data-logging information. I can query both HBase and Hive from the command line, no problem.
hbase: scan MVLogger; // comes back with 9k plus records
hive: select * from MVLogger; // comes back with 9k plus records
My HBase table definition is:
'MVLogger', {NAME => 'dbLogData', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
My hive (external) table definition is:
CREATE EXTERNAL TABLE `MVLogger`(
`rowid` int,
`ID` int,
`TableName` string,
`CreatedDate` string,
`RowData` string,
`ClientDB` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'hbase.columns.mapping'=':key,dbLogData:ID,dbLogData:TableName,dbLogData:CreatedDate,dbLogData:RowData,dbLogData:ClientDB')
TBLPROPERTIES (
'hbase.table.name'='MVLogger')
When I use sqlline and look at the Drill schemas, this is what I see:
0: jdbc:drill:zk=ip-*.compu> show schemas;
+-------------+
| SCHEMA_NAME |
+-------------+
| hive.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| cp.default |
| hbase |
| sys |
| INFORMATION_SCHEMA |
+-------------+
and when I do a use [schema] (any of them but sys) and then a show tables, I get nothing. For example:
0: jdbc:drill:zk=ip-*.compu> use hbase;
+------------+------------+
| ok | summary |
+------------+------------+
| true | Default schema changed to 'hbase' |
+------------+------------+
1 row selected (0.071 seconds)
0: jdbc:drill:zk=ip-*.compu> show tables;
+--------------+------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+------------+
+--------------+------------+
No rows selected (0.37 seconds)
In the Drill web UI (Ambari), under the storage options for Drill, I see that both hbase and hive are enabled. The configuration for the hive storage plugin is the following:
{
"type": "hive",
"enabled": true,
"configProps": {
"hive.metastore.uris": "thrift://ip-*.compute.internal:9083",
"hive.metastore.warehouse.dir": "/apps/hive/warehouse/",
"fs.default.name": "hdfs://ip-*.compute.internal:8020/",
"hive.metastore.sasl.enabled": "false"
}
}
Any ideas why I'm not able to query Hive/HBase?
Update: the table is showing up in the hive schema now, but when I try to query it with a simple select * from ... it just hangs, and I can't find anything in any of the log files. The Hive table's actual data store is HBase, BTW.
Found out HBase 0.98 is not yet compatible with the Drill HBase plugin... http://mail-archives.apache.org/mod_mbox/incubator-drill-user/201410.mbox/%3CCAKa9qDmN_fZ8V8W1JKW8HVX%3DNJNae7gR-UMcZC9QwKVNynQJkA%40mail.gmail.com%3E
It may be too late, but for others who see this post and hit the same issue (show tables returning nothing after switching to the hbase schema):
The user that is running Drill has no access permission on HBase. Grant the Drill user access in HBase and you will see the tables.
Try opening the hbase shell as the Drill user and running "list": it will also be empty until you grant permission, and then you will see the tables.
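For example (a minimal sketch; the user name drill is an assumption, substitute whatever account your Drillbits run as), an HBase admin can grant read access from the hbase shell:
hbase> grant 'drill', 'R', 'MVLogger'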
