We are trying to connect MicroStrategy (MS) 9.4 to HBase via the Impala connector.
First we created the Hive tables, linking them to the HBase tables with the following CREATE TABLE statement (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, columnName1 string, columnName2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to create two Hive tables and their corresponding HBase tables, in order to perform a join between them later in MS.
For the connection between MS and HBase, we followed the steps by selecting the MicroStrategy ODBC Driver for Impala Wire Protocol and filling in the Data Source Name (the Impala data source previously created with the Impala driver), the host and port (both for the Impala installation in our AWS infrastructure) and impala/impala as credentials.
The thing is that when we complete the wizard and select the default namespace (which is the only one available; no other namespace has been created), we can see the Hive tables that we created before, instead of the HBase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And, since these are the only tables available, we can only build our report with these two tables: a very simple join between them on one field.
Both tables have 200,000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here and that our process of linking the Hive tables to the HBase ones is not completely right.
Is there a way to connect to these two HBase tables instead of the Hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of "join" anyway.
2. Mapping Hive tables onto HBase tables means that every Hive query triggers a full scan on the HBase side; the result is then fed to a MapReduce batch job that does the filtering and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minute results, you might give Apache Phoenix a try: it's an HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers, though.
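For illustration, the first table from the question could be exposed to Phoenix with a view along these lines (the column family and qualifiers are taken from the question's mapping; the VARCHAR types and the key name pk are assumptions on my side):
CREATE VIEW "hbaseTableName1" (
    pk VARCHAR PRIMARY KEY,
    "columnfamily1"."columnName1" VARCHAR,
    "columnfamily1"."columnName2" VARCHAR
);
SELECT "columnName1", "columnName2" FROM "hbaseTableName1" WHERE pk = 'some-key';
Point lookups and filters on indexed columns are where Phoenix pays off; joins will still be expensive.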
Related
Situation:
I'm reading files from HDFS with Spark and creating around 1000 Hive tables in 30 minutes.
Requirement:
As fast as possible I need to have these 1000 tables on Oracle as well.
My thoughts:
1. Load the same DataFrame to Hive and then to Oracle via JDBC within the same Spark application.
2. Load data to Hive. Sqoop tables from Hive to Oracle.
Any other ideas? Basically I need to replicate a whole Hive database with ~1000 tables to Oracle.
Any advice is greatly appreciated.
I have been working with Hive a lot and grasped it pretty easily, since it is very close to SQL and I was a DB developer earlier.
I also know about the Hive metastore, which is a MySQL service storing the metadata of the Hive tables we create on top of HDFS data.
But then the terms HCat and HBase came up, which totally confuse me from a Hive developer's point of view.
How are they related and how can they be used? Is it true that:
HBase: it can be used like Hive to create tables with data stored in HDFS, but the only difference is that it is NoSQL (it can accept unstructured data and is not strict about schema and the number of columns)?
HCat: it is another service which consists of the SerDe and metastore and is used all the time by Hive. Hive can't work without this service since it contains the metastore DB?
I am really confused. Please help.
We are stuck with a problem wherein we are trying to do a near-real-time sync between an RDBMS (source) and Hive (target). Basically, the source pushes its changes (inserts, updates and deletes) into HDFS as Avro files. These are loaded into Hive external tables (with the Avro schema). There is also a base table in ORC, which holds all the records that came in before the source pushed the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the source).
We are currently de-duplicating using rank() over the partitioned keys on the union of the external table and the base table. The result is then pushed into a new table and the names are swapped. This is taking a lot of time.
We tried using merges and ACID transactions, but rank over partition and then filtering out the unwanted rows has given us the best time so far.
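For reference, a simplified sketch of the de-dup query we run (the table, key and column names here are placeholders, not our real schema; op stands in for the delete flag coming from the source):
INSERT OVERWRITE TABLE base_table_new
SELECT pk, col1, col2
FROM (
    SELECT pk, col1, col2, op,
           rank() OVER (PARTITION BY pk ORDER BY last_modified DESC) AS rnk
    FROM (
        SELECT pk, col1, col2, op, last_modified FROM base_table
        UNION ALL
        SELECT pk, col1, col2, op, last_modified FROM incoming_avro_ext
    ) unioned
) ranked
WHERE rnk = 1        -- keep only the latest version of each key
  AND op != 'D';     -- drop rows the source flagged as deletes
After that we rename base_table_new over base_table.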
Is there a better way of doing this? Any suggestions on improving the process altogether? We have quite a few tables, so we do not have any partitions or buckets at this moment.
You can try storing all the transactional data in an HBase table.
Storing data in the HBase table using the primary key of the RDBMS table as the row key:
Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output from those processors will be in Avro format.
You can use the ConvertAvroToJSON processor and then the SplitJson processor to split each record out of the array of JSON records.
Store all the records in the HBase table with the row key set to the primary key of the RDBMS table.
When we get an incremental load based on the last-modified-date field, it will contain both updated records and newly added records from the RDBMS table.
If we get an update for an existing row key, HBase will overwrite the existing data for that record; newly added records will be inserted as new rows in the table.
Then, by using the Hive-HBase integration, you can expose the HBase table data through Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
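A minimal sketch of that mapping, assuming the HBase table is called transactional_data with a single column family cf (table, family and column names here are illustrative only):
CREATE EXTERNAL TABLE transactional_data_hive (
  rowkey string,   -- maps to the HBase row key, i.e. the RDBMS primary key
  col1   string,
  col2   string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:col2")
TBLPROPERTIES ("hbase.table.name" = "transactional_data");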
With this method the HBase table will take care of all the upsert operations. However, you cannot expect the same performance from a Hive-HBase table as from a native Hive table; the native table will be faster, because HBase tables are not meant for SQL-style queries and are most efficient when you access data by row key.
If we are going to have millions of records, then we need to do some tuning of the Hive queries:
Tuning Hive Queries That Uses Underlying HBase Table
I'm using Presto, mainly with the Hive connector, to connect to the Hive metastore.
All of my tables are external tables pointing to data stored in S3.
My main issue with this is that there is no way (at least none I'm aware of) to do partition discovery in Presto, so before I start querying a table in Presto I need to switch to Hive and run msck repair table mytable.
Is there a more reasonable way to do it in Presto?
I'm on version 0.227 and the following helps me:
select * from hive.yourschema."yourtable$partitions"
This select returns all the partitions mapped in your catalog. You can filter, order, etc. just as you would with a normal query.
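For example, with a hypothetical partition column dt you could do:
select * from hive.yourschema."yourtable$partitions"
where dt >= '2019-01-01'    -- dt is just an example partition column
order by dt desc;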
No.
If the Hive metastore doesn't see the partitions, PrestoDB will not see them.
Maybe a cron job can help you.
There is now a way to do this:
CALL system.sync_partition_metadata(schema_name=>'<your-schema>', table_name=>'<your-table>', mode=>'FULL')
Credit to this post and this video
I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the paragraph below regarding external tables in Hive.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody explain briefly what the above phrase says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (like modifying some columns, adding columns, doing aggregations, etc.) and store it in a specific way (partitioned / bucketed / sorted, etc.) so that queries benefit from these optimizations.
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table; Hive simply "looks" into it and can read data from it, but Hive can't delete or change the data in that folder. When you drop an external table, Hive only deletes the metadata from its metastore and the data in HDFS remains unchanged.
A managed table is basically a directory in HDFS that is created and managed by Hive. Moreover, all operations for removing/changing partitions, raw data or the table itself MUST be done by Hive; otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the Hive metastore still contains info that the partition exists).
In Hadoop: The Definitive Guide I think the author meant that it is common practice to write an MR job that produces some raw data and keeps it in some folder. Then you create a Hive external table which looks into that folder, and you can safely run queries without the risk of dropping the table, etc.
In other words, you can run an MR job that produces some generic data and then use a Hive external table as the source of data for inserts into managed tables. It helps you avoid writing lots of boring, similar MR jobs by delegating this task to Hive queries: you create a query that takes data from the external table, aggregates/processes it how you want and puts the result into managed tables.
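A minimal sketch of that pattern (paths, table and column names are made up for illustration):
-- external table over the raw data produced by another process
CREATE EXTERNAL TABLE raw_events (event_time string, user_id string, payload string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';

-- managed table laid out for queries
CREATE TABLE events_managed (user_id string, payload string)
PARTITIONED BY (event_date string)
STORED AS ORC;

-- the "Hive transform" step: move the data with a query instead of another MR job
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events_managed PARTITION (event_date)
SELECT user_id, payload, substr(event_time, 1, 10) AS event_date
FROM raw_events;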
Another use of external tables is as a source for data coming from remote servers, e.g. in CSV format.
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in the future, so it is better to consult the current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing to do with the external/managed table type. If you want to change the structure, you do not necessarily need to change the table's managed/external type.
It is also mentioned in the book.
When your table is an external table:
you can use other technologies like Pig, Cascading or MapReduce to process it,
you can also use multiple schemas for that dataset,
and you can create the data lazily.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.
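If the table already exists as external, it can be switched to managed in place, roughly like this (the table name is made up, and the property value is case-sensitive in some Hive versions, so treat it as a sketch):
ALTER TABLE my_dataset SET TBLPROPERTIES ('EXTERNAL'='FALSE');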