I've created both an HBase and a Hive table to store some data-logging information. I can query both HBase and Hive from the command line, no problem.
hbase: scan MVLogger; // comes back with 9k plus records
hive: select * from MVLogger; // comes back with 9k plus records
My HBase table definition is:
'MVLogger', {NAME => 'dbLogData', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
My Hive (external) table definition is:
CREATE EXTERNAL TABLE `MVLogger`(
`rowid` int,
`ID` int,
`TableName` string,
`CreatedDate` string,
`RowData` string,
`ClientDB` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'serialization.format'='1',
'hbase.columns.mapping'=':key,dbLogData:ID,dbLogData:TableName,dbLogData:CreatedDate,dbLogData:RowData,dbLogData:ClientDB')
TBLPROPERTIES (
'hbase.table.name'='MVLogger')
When I use sqlline and look at the Drill schemas, this is what I see:
0: jdbc:drill:zk=ip-*.compu> show schemas;
+--------------------+
|    SCHEMA_NAME     |
+--------------------+
| hive.default       |
| dfs.default        |
| dfs.root           |
| dfs.tmp            |
| cp.default         |
| hbase              |
| sys                |
| INFORMATION_SCHEMA |
+--------------------+
and when I do a use [schema] (any of them but sys) and then do a show tables, I get nothing. For example:
0: jdbc:drill:zk=ip-*.compu> use hbase;
+-------+-----------------------------------+
|  ok   |              summary              |
+-------+-----------------------------------+
| true  | Default schema changed to 'hbase' |
+-------+-----------------------------------+
1 row selected (0.071 seconds)
0: jdbc:drill:zk=ip-*.compu> show tables;
+--------------+------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+------------+
+--------------+------------+
No rows selected (0.37 seconds)
In the Drill Web UI (Ambari), under storage options for Drill, I see both hbase and hive enabled. The configuration for the hive storage plugin is the following:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ip-*.compute.internal:9083",
    "hive.metastore.warehouse.dir": "/apps/hive/warehouse/",
    "fs.default.name": "hdfs://ip-*.compute.internal:8020/",
    "hive.metastore.sasl.enabled": "false"
  }
}
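For comparison, the HBase storage plugin configuration typically looks like the following (a sketch; the ZooKeeper quorum host and port here are placeholders, not values taken from this cluster):
{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "ip-*.compute.internal",
    "hbase.zookeeper.property.clientPort": "2181"
  },
  "enabled": true
}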
Any ideas why I'm not able to query Hive/HBase?
Update: The table is showing up in the hive schema now, but when I try to query it with a simple select * from ... it just hangs, and I can't find anything in any of the log files. The Hive table's actual data store is HBase, BTW.
Found out HBase 0.98 is not yet compatible with the Drill HBase plugin... http://mail-archives.apache.org/mod_mbox/incubator-drill-user/201410.mbox/%3CCAKa9qDmN_fZ8V8W1JKW8HVX%3DNJNae7gR-UMcZC9QwKVNynQJkA%40mail.gmail.com%3E
It's maybe too late, but for others who may come across this post while having the same issue:
0: jdbc:drill:zk=ip-*.compu> use hbase;
+-------+-----------------------------------+
|  ok   |              summary              |
+-------+-----------------------------------+
| true  | Default schema changed to 'hbase' |
+-------+-----------------------------------+
1 row selected (0.071 seconds)
0: jdbc:drill:zk=ip-*.compu> show tables;
+--------------+------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+------------+
+--------------+------------+
No rows selected (0.37 seconds)
The user running Drill has no access permissions on HBase. Grant the Drill user access in HBase and you will see the tables.
Try going into the hbase shell as the Drill user and running "list"; it will also be empty until you grant permission, and then you will see the tables.
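A sketch of the fix in the hbase shell (this assumes HBase authorization is enabled and that the Drill service runs as a user named 'drill'; substitute your actual service user):
$ hbase shell
hbase> list                                 # empty until access is granted
hbase> grant 'drill', 'RWXCA', 'MVLogger'   # 'drill' user name is an assumption
hbase> list                                 # MVLogger should now be visible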
Related
I am able to connect to and access an existing HBase table with Hive (using the Hive HBase storage handler).
I suspect the interface is not very powerful, though. Can it be used for large-scale analytical data processing?
No, it can't. Any WHERE clause ends up as a full scan of the HBase table, and scans are extremely slow. Please check https://phoenix.apache.org/ as an alternative.
Apache Phoenix is more suitable for querying HBase.
You can also query HBase using Hive, but then your query is converted into a MapReduce job, which takes more time than Phoenix.
PS: You can still use Hive for big-data analytics even if you are using HBase.
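For illustration, a minimal Phoenix sketch over the HBase table from the earlier question (assuming Phoenix is installed on the cluster; the view maps only two of the columns, the row key value is a placeholder, and the declared types must match how the bytes were actually written):
CREATE VIEW "MVLogger" (
  pk VARCHAR PRIMARY KEY,
  "dbLogData"."TableName" VARCHAR,
  "dbLogData"."ClientDB" VARCHAR
);
-- Phoenix pushes the key predicate down to an HBase point lookup
-- instead of a full scan:
SELECT "TableName", "ClientDB" FROM "MVLogger" WHERE pk = 'some-row-key';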
A good way to run analytic queries over HBase faster is to combine HBase with Hive and Impala.
As an example, consider the following scenario:
I have a Kafka producer receiving thousands of signals from IoT devices through a socket in JSON format. I process these signals with a Spark Streaming consumer and put them into an HBase table.
HBase table and data example
$ hbase shell
hbase> create_namespace 'device_iot'
hbase> create 'device_iot:device', 'data'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:deviceID', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23c'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:temperature', '12'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:latitude', '52.14691120000001'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:longitude', '11.658838699999933'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:time', '2019-08-14T23:30:30000'
Hive table on top of HBase table
CREATE EXTERNAL TABLE t_iot_devices (
id string, deviceID string, temperature int, latitude double, longitude double, time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:deviceID,data:temperature,data:latitude,data:longitude,data:time")
TBLPROPERTIES("hbase.table.name" = "device_iot:device");
Querying in Impala
impala> invalidate metadata;
SELECT deviceID, max(temperature) AS maxTemperature
FROM t_iot_devices
GROUP BY deviceID;
+--------------------------------------+----------------+
| deviceid | maxtemperature |
+--------------------------------------+----------------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 39 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 39 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 39 |
+--------------------------------------+----------------+
SELECT deviceID, substr(time,1,10) AS day, max(temperature) AS highest
FROM t_iot_devices
WHERE substr(time,1,10) = '2019-07-07'
GROUP BY deviceID, substr(time,1,10);
+--------------------------------------+------------+---------+
| deviceid | day | highest |
+--------------------------------------+------------+---------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 2019-07-07 | 34 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 2019-07-07 | 35 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 2019-07-07 | 22 |
+--------------------------------------+------------+---------+
The structure of a hive table (tbl_a) is as follows:
name | ids
A | [1,7,13,25168,992]
B | [223, 594, 3322, 192928]
C | null
...
Another Hive table (tbl_b) has the corresponding mapping from id to new_id. This table is big, so it cannot be loaded into memory:
id | new_id
1 | 'aiks'
2 | 'ficnw'
...
I intend to create a new Hive table with the same structure as tbl_a, but with each array of ids converted to an array of new_ids:
name | ids
A | ['aiks','fsijo','fsdix','sssxs','wie']
B | ['cx', 'dds', 'dfsexx', 'zz']
C | null
...
Could anyone give me an idea of how to implement this in Spark SQL or with the Spark DataFrame API? Thanks!
This is an expensive operation, but you can do it using coalesce, explode, and a left outer join, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"colname" syntax; spark is the active SparkSession

tbl_a
  .withColumn("ids", coalesce($"ids", array(lit(null).cast("int"))))  // keep null rows by turning null into [null]
  .select($"name", explode($"ids").alias("id"))                       // one row per single id
  .join(tbl_b, Seq("id"), "leftouter")                                // look up new_id for each id
  .groupBy("name").agg(collect_list($"new_id").alias("ids"))          // reassemble the array per name
  .show
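The same approach in plain Spark SQL, as a sketch (table and column names as in the question). Note that collect_list skips nulls, so a name with no matches, such as C, comes back with an empty array rather than null:
SELECT e.name, collect_list(b.new_id) AS ids
FROM (
  SELECT name, id
  FROM tbl_a LATERAL VIEW OUTER explode(ids) t AS id
) e
LEFT OUTER JOIN tbl_b b ON e.id = b.id
GROUP BY e.name;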
I am new to Cassandra and I am trying to read a row from the database which contains these values:
siteId | country | someMap
1 | US | {a:b, x:z}
2 | PR | {a:b, x:z}
I have also created an index on the table using create index on columnfamily(keys(someMap));
but still, when I query with select * from table where siteId = 1 and someMap contains key 'a',
it returns the entire map:
1 | US | {a:b, x:z}
Can somebody tell me what I should do to get the value as
1 | US | {a:b}
You cannot: even though internally each entry of a map/list/set is stored as a column, you can only retrieve the whole collection, not part of it. You are not asking Cassandra to give you the entry of the map containing X, but rather the rows whose map contains X.
HTH,
Carlo
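For what it's worth, much later Cassandra releases (4.0+) added the ability to select individual map entries in CQL. A sketch, assuming a 4.0+ cluster and the table from the question:
SELECT siteId, country, someMap['a']   -- returns only the 'a' entry
FROM table
WHERE siteId = 1;
On the versions current at the time of this answer, this was not possible.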
I am trying to extend some working Hive queries, but seem to fall short. I just want to test the GROUP BY function, which is common to a number of queries that I need to complete. Here is the query that I am trying to execute:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp)) AS hourly
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hourly;
This yields the following error:
ERROR: Error while executing Hive script.Query returned non-zero code:
10, cause: FAILED: Error in semantic analysis: Line 1:320 Invalid
table alias or column reference 'hourly': (possible column names are:
messagerowid, payload_sensor, messagetimestamp, payload_temp,
payload_timestamp, payload_timestampmysql, payload_watt,
payload_wattseconds)
The intention is to end up with a time-bound query (say, the last 24 hours) broken down by SUM() on payload_wattseconds etc. To get started on the summary tables, I began building a GROUP BY query that would derive the hourly anchor for the SELECT.
The problem, though, is the error above. I would greatly appreciate any pointers to what is wrong here; I can't seem to find it myself, but then again I am a newbie with Hive.
Thanks in advance ..
UPDATE: I tried to revise the query. Here is what I just ran:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp))
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hour(from_unixtime(payload_timestamp));
.. that, however, gives another error:
ERROR: Error while executing Hive script.Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:7 Expression not in GROUP BY key 'messageRowID'
Thoughts?
UPDATE #2: The following is a quick dump of a few sample rows stored in the EVENT_KS column family in WSO2 BAM. The last column is the number of watt-seconds, calculated in the Perl daemon, which will be used in a query to compute the aggregate sum, totalled into kWh; that will then be dumped into MySQL tables for sync to the application that holds the UI/UX layer.
[12:03:00] [jskogsta@enterprise ../Product Centric Opco Modelling]$ ~/local/apache-cassandra-2.0.8/bin/cqlsh localhost 9160 -u admin -p admin --cqlversion="3.0.5"
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 1.2.13 | CQL spec 3.0.5 | Thrift protocol 19.36.2]
Use HELP for help.
cqlsh> use "EVENT_KS";
cqlsh:EVENT_KS> select * from currentcost_stream limit 5;
key | Description | Name | Nick_Name | StreamId | Timestamp | Version | payload_sensor | payload_temp | payload_timestamp | payload_timestampmysql | payload_watt | payload_wattseconds
-------------------------------------------+---------------------------+--------------------+----------------------+---------------------------+---------------+---------+----------------+--------------+-------------------+------------------------+--------------+---------------------
1403365575174::10.11.205.218::9443::9919 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403365575174 | 1.0.18 | 1 | 13.16 | 1403365575 | 2014-06-21 23:46:15 | 6631 | 19893
1403354553932::10.11.205.218::9443::2663 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354553932 | 1.0.18 | 1 | 14.1 | 1403354553 | 2014-06-21 20:42:33 | 28475 | 0
1403374113341::10.11.205.218::9443::11852 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403374113341 | 1.0.18 | 1 | 10.18 | 1403374113 | 2014-06-22 02:08:33 | 17188 | 154692
1403354501924::10.11.205.218::9443::1894 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354501924 | 1.0.18 | 1 | 10.17 | 1403354501 | 2014-06-21 20:41:41 | 26266 | 0
1403407054092::10.11.205.218::9443::15527 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403407054092 | 1.0.18 | 1 | 17.16 | 1403407054 | 2014-06-22 11:17:34 | 6332 | 6332
(5 rows)
cqlsh:EVENT_KS>
What I will be trying to do is issue queries against this table (actually several, depending on the various presentation aggregations required), and present views based on hourly sums, 10-minute sums, daily sums, monthly sums, etc. Depending on the query, the GROUP BY was intended to provide this 'index', so to speak. Right now I am just testing this, so we will see how it ends up. Hope this makes sense?!
So I am not trying to remove duplicates...
UPDATE 3: I was going about this all wrong. After thinking a bit more about the tip given below, simply simplifying the whole query gave the right results. The following query gives the total kWh on an hourly basis for the WHOLE dataset. With this, I can create the various views of kWh spent over different time periods, such as:
Hourly over the last 24 hours
Daily over the last year
Minute over the last hour
.. etc. etc.
Here is the query:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select hour(from_unixtime(payload_timestamp)) AS hourly, (sum(payload_wattseconds)/(60*60)/1000)
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
GROUP BY hour(from_unixtime(payload_timestamp));
This query yields the following based on the sample data:
hourly _c1
0 16.91570472222222
1 16.363228888888887
2 15.446414166666667
3 11.151388055555556
4 18.10564666666667
5 2.2734924999999997
6 17.370668055555555
7 17.991484444444446
8 38.632728888888884
9 16.001440555555554
10 15.887023888888889
11 12.709341944444445
12 23.052629722222225
13 14.986092222222222
14 16.182284722222224
15 5.881564999999999
18 2.8149172222222223
19 17.484405
20 15.888274166666665
21 15.387210833333333
22 16.088641666666668
23 16.49990916666667
This is the aggregate kWh per hourly timeframe over the entire dataset.
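From here, a time-bound variant such as "hourly over the last 24 hours" should just be the same query with a WHERE clause added (a sketch following the pattern above):
SELECT hour(from_unixtime(payload_timestamp)) AS hourly,
       (sum(payload_wattseconds)/(60*60)/1000) AS kwh
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 24*60*60
GROUP BY hour(from_unixtime(payload_timestamp));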
So, now on to the next problem. ;-)
I created a table with a Unicode character in the name (specifically to test table names with Unicode). The table was created fine, but my method for detecting whether the table exists broke!
Here is the interaction in question:
caribou_test=# select table_name from information_schema.tables where table_schema = 'public';
table_name
-------------
...
pinkpink1
(16 rows)
caribou_test=# select table_name from information_schema.tables where table_schema = 'public' and table_name = 'pinkƒpink1';
table_name
------------
(0 rows)
caribou_test=# select table_name from information_schema.tables where table_schema = 'public' and table_name = 'pinkpink1';
table_name
------------
(0 rows)
caribou_test=# select * from pinkƒpink1;
id | position | env_id | locked | created_at | updated_at | status_id | status_position | i1l0 | f∆ | growth555
----+----------+--------+--------+----------------------------+-------------------------+-----------+-----------------+-------+-------+--------------
1 | 0 | 1 | f | 2013-06-27 14:50:34.228136 | 2013-06-27 14:50:34.227 | 1 | 0 | YELLL | 55555 | 1.3333388822
(1 row)
The table name is pinkƒpink1 (test data). As you can see, when I select the table names from information_schema.tables, it displays without the ƒ, but I can't select the table name either way! Yet I can still issue selects against that table directly. What is going on here?
EDIT: providing requested information for @craig-ringer:
caribou_test=# SELECT current_setting('server_encoding') AS server_encoding, current_setting('client_encoding') AS client_encoding, version();
server_encoding | client_encoding | version
-----------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------
UTF8 | UTF8 | PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn), 64-bit
caribou_test=# SELECT * FROM pg_class WHERE relname = 'pinkƑpink1';
---> (0 rows)
caribou_test=# SELECT upper('ƒ') = 'Ƒ', lower('Ƒ') = 'ƒ';
?column? | ?column?
----------+----------
t | t
(1 row)
caribou_test=# WITH chars(rowid, thechar) AS (VALUES (1,'ƒ'),(2,'Ƒ'),(3,upper('ƒ')),(4,lower('Ƒ'))) SELECT rowid, thechar, convert_to(thechar, 'utf-8') from chars;
rowid | thechar | convert_to
-------+---------+------------
1 | ƒ | \xc692
2 | Ƒ | \xc691
3 | Ƒ | \xc691
4 | ƒ | \xc692
It looks like a bug, perhaps in regclass or something related to it:
# create table pinkƒpink1 (id serial);
NOTICE: CREATE TABLE will create implicit sequence "pink?pink1_id_seq" for serial column "pink?pink1.id"
CREATE TABLE
# select 'pinkƒpink1'::name;
name
------------
pinkƒpink1
(1 row)
# select 'pinkƒpink1'::regclass;
regclass
-------------
"pinkpink1"
(1 row)
# select relname from pg_class where oid = 'pinkƒpink1'::regclass;
relname
-----------
pinkpink1
# select relname from pg_class where relname = 'pinkƒpink1'::name;
relname
---------
(0 rows)
# select relname from pg_class where relname = 'pinkpink1';
relname
---------
(0 rows)
(My system is OSX Lion with everything utf8, in case it matters.)
For the workaround, you can cast it to ::regclass as is done above (the one that found the table). Note that casting to ::regclass will yield an error if the table doesn't exist, though, so code around that accordingly.
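One way to code around it, as a sketch (the helper name table_exists is made up; on PostgreSQL 9.4+ you could instead use to_regclass(), which returns NULL for a missing table):
CREATE OR REPLACE FUNCTION table_exists(tbl text) RETURNS boolean AS $$
BEGIN
  PERFORM tbl::regclass;   -- raises an error if the table doesn't exist
  RETURN true;
EXCEPTION WHEN undefined_table THEN
  RETURN false;
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT table_exists('pinkƒpink1');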
Per Craig's request:
# SELECT current_setting('server_encoding') AS server_encoding, current_setting('client_encoding') AS client_encoding, version();
server_encoding | client_encoding | version
-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------
UTF8 | UTF8 | PostgreSQL 9.2.4 on x86_64-apple-darwin11.4.2, compiled by Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn), 64-bit
(1 row)
And per Erwin's:
# SELECT name, setting FROM pg_settings WHERE name IN ('lc_collate','lc_ctype','client_encoding','server_encoding');
name | setting
-----------------+-------------
client_encoding | UTF8
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
server_encoding | UTF8
(4 rows)
I tested your case locally with Postgres 9.1.9 and it just works.
Same in this SQLfiddle with Postgres 9.2.4. It just works.
It must be something that is not in your question ...
OSX?
Seems to be reproducible on OSX.
To help debug this, you should provide more information: server encoding, client encoding, and locale settings:
SELECT name, setting
FROM pg_settings
WHERE name IN ('lc_collate','lc_ctype','client_encoding','server_encoding');
Which client? How do you connect?
ƒ is the lower-case form of Ƒ. Postgres depends on the underlying OS for locale settings.
When you query the information schema or the catalog tables, you need to supply an exact string (case-sensitive!). But when you use an identifier without double-quoting in an SQL statement, it is cast to lower case first. If your locale for some reason thinks it has to convert ƒ to some lower-case equivalent, this would explain everything we have seen.
To rule this out (or verify), try your test with and without double-quoting:
CREATE TEMP TABLE "pinkƒpink1" (id int);
CREATE TEMP TABLE pinkƒpink1 (id int);
In my test under Debian Linux both result in the same table name, so I cannot execute the second command. I suspect it is different in your case, which would explain the whole matter.