I bit confuse with Hadoop hive which i read from Wiki used for make OLAP.
Now i want to make OLAP on Hive from OLTP database which use Mysql.
How i can solve this? can i use Kettle for make OLAP in Hive?
any guidance how to make OLAP on Hive from OLTP mysql ?
Tks.
I would suggest the following approach:
a) To specify the history part of your OLTP process. Usually it is some kind of logs over operations. Lets call it fact table.
b) To have fact table to be partitioned by time
c) Periodically offload oldest partition from the MySQL by exporting it to CSV and deleting from the MySQL.
e) Load this CSV file to Hive
By implementing this schema you will have all but latest data in the hive, and prevent growing of the MySQL OLTP database.
Related
Situation:
I'm reading files from hdfs with Spark and creating around 1000 Hive tables in 30 minutes.
Requirement:
As fast as possible I need to have these 1000 tables on Oracle as well.
My thoughts:
1. Load same dataframe to Hive and then to Oracle via jdbc within the same Spark application.
2. Load data to Hive. Sqoop tables from Hive to Oracle.
Any other ideas? Basically I need to replicate whole Hive database with ~1000 tables to Oracle.
Extremely appreciate any advice.
I have been working with hive a lot and pretty much easily grasped it since its too close to SQL being me as a DB Dev earlier.
I also know about the Hive metastore which is a MYSQL service storing the metadata of the Hive Tables we are creating on top of HDFS data.
But then terms HCAT and HBASE came which totally confuses me from a Hive developer point of view.
How are they related and can be used. Is it true :
HBASE : It can be used like Hive to create tables with data stored in HDFS but the only difference is it is NOSQL (can accept unstructured data and is not strict with schema and column numbers)?
HCAT : It is another service which consists of SERDE , METASTORE and is used all the time by HIVE . Hive cant work without this service since it contains metastore db ?
I am really confused. Please help.
I have this environment:
Haddop environment (1 master, 4 slaves) with several applications:
ambari, hue, hive, sqoop, hdfs ... Server in production (separate
from hadoop) with mysql database.
My goal is:
Optimize the queries made on this mysql server that are slow to
execute today.
What did I do:
I imported the mysql data to HDFS using Sqoop.
My doubts:
I can not make selects direct in HDFS using Hive?
Do I have to load the data into Hive and make the queries?
If new data is entered into the mysql database, what is the best way
to get this data and insert it into HDFS and then insert it into
Hive again? (Maybe in real time)
Thank you in advance
I can not make selects direct in HDFS using Hive?
You can. Create External Table in hive specifying your hdfs location. Then you can perform any HQL over it.
Do I have to load the data into Hive and make the queries?
In case of external table, you don't need to load data in hive; your data resides in the same HDFS directory.
If new data is entered into the mysql database, what is the best way to get this data.
You can use Sqoop Incremental Import for this. It will fetch only newly added/updated data (depending upon incremental mode). You can create a sqoop job and schedule it as per your need.
You can try Impala which is much faster than Hive in case of SQL queries. You need to define tables most probably specifying some delimiter, storage format and where the data is stored on HDFS (I don't know what kind of data are you storing). Then you can write SQL queries which will take the data from HDFS.
I have no experience with real-time data ingestion from relational databases, however you can try scheduling Sqoop jobs with cron.
There are more than 300 tables in my hive environment.
I want to export all the tables from Hive to Oracle/MySql including metadata.
My Oracle database doesn't have any tables corresponding to these Hive tables.
Sqoop import from Oracle to Hive creates tables in Hive if the table doesn't exists.But Sqoop export from Hive to Oracle doesn't create table if not exists and fails with an exception.
Is there any option in Sqoop to export metadata also? or
Is there any other Hadoop tool through which I can achieve this?
Thanks in advance
The feature you're asking for isn't in Spark. I don't know of a current hadoop tool which can do what you're asking either unfortunately. A potential workaround is using the "show create table mytable" statement in Hive. It will return the create table statements. You can parse this manually or pragmatically via awk and get the create tables in a file, then run this file against your oracle db. From there, you can use sqoop to populate the tables.
It won't be fun.
Sqoop can't copy metadata or create table in RDBMS on the basis of Hive table.
Table must be there in RDBMS to perform sqoop export.
Why is it so?
Mapping from RDBMS to Hive is easy because hive have only few datatypes(10-15). Mapping from multiple RDBMS datatypes to Hive datatype is easily achievable. But vice versa is not that easy. Typical RDBMS has 100s of datatypes (that too different in different RDBMS).
Also sqoop export is newly added feature. This feature may come in future.
We are trying to connect MS 9.4 to HBase via Impala connector.
First we created the hive tables liking them to HBase tables with following create table (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, columnName1 codClient, columnName2 clientName)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to crete two hive tables and their correspondent hbase tables, in order to perform a join between them later with MS.
For the connection between MS with HBase, we follow the steps by selecting the MicroStrategy ODBC Driver for Impala Wire Protocol, and filling in the Data Source Name (Impala Data Source previously created with the Impala Driver), host and port (both for Impala installation in our AWS infraestructure) and impala/impala for credentials.
The thing is that when we finish complete the wizard and select the default namespace (which is the only one available. No any other ns has been created), we can see the hive tables that we created before, instead of the hbase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And, since these are the only tables availables, we only can perform our report with these two tables: a very easy join between these two tables by one field.
Both tables have 200.000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here, and the process of linking hive tables to hbase ones are not completely right.
Is there a way to be able to connect to these two hbase tables instead of hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of "join" anyway.
2. Mapping Hive tables on HBase tables means that every Hive query triggers a full scan on HBase side, then the result is fed to a MapReduce batch job that does the filters and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minutes results, you might give a try to Apache Phoenix: it's a HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers though.