I have this environment:
Hadoop environment (1 master, 4 slaves) with several applications:
Ambari, Hue, Hive, Sqoop, HDFS ... A production server (separate
from the Hadoop cluster) with a MySQL database.
My goal is:
To optimize the queries made on this MySQL server, which are currently slow to execute.
What I did:
I imported the MySQL data into HDFS using Sqoop.
My doubts:
Can I run SELECTs directly on HDFS using Hive?
Do I have to load the data into Hive and then run the queries?
If new data is entered into the MySQL database, what is the best way
to get this data into HDFS, and then into Hive again? (Maybe in real time)
Thank you in advance.
Can I run SELECTs directly on HDFS using Hive?
You can. Create an external table in Hive that points to your HDFS location; then you can run any HQL over it.
Do I have to load the data into Hive and then run the queries?
With an external table you don't need to load the data into Hive; the data stays in the same HDFS directory.
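For example, a minimal external table over the Sqoop output directory might look like this (the column list, delimiter and path are assumptions; adjust them to your actual data):
CREATE EXTERNAL TABLE customers (
  id INT,
  name STRING,
  created_at STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hdfs/sqoop/customers';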
If new data is entered into the MySQL database, what is the best way to get this data?
You can use Sqoop incremental import for this. It fetches only newly added or updated rows (depending on the incremental mode). You can create a Sqoop job and schedule it as needed.
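A sketch of such a saved job using append mode (connection string, table, check column and paths are placeholders; append mode assumes a monotonically increasing id column):
sqoop job --create customers_incr -- import \
  --connect jdbc:mysql://<hostname>/<dbname> \
  --username <user> --password-file /user/hdfs/.mysql.pwd \
  --table customers --target-dir /user/hdfs/sqoop/customers \
  --incremental append --check-column id --last-value 0
Each run is then triggered with:
sqoop job --exec customers_incr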
You can try Impala, which is much faster than Hive for SQL queries. You need to define tables, most likely specifying a delimiter, the storage format and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.
I have no experience with real-time data ingestion from relational databases; however, you can try scheduling Sqoop jobs with cron.
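For example, a crontab entry that runs the saved Sqoop job sketched above every 15 minutes could look like this (the job name, binary path and log path are assumptions):
*/15 * * * * /usr/bin/sqoop job --exec customers_incr >> /var/log/sqoop_customers.log 2>&1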
Situation:
I'm reading files from HDFS with Spark and creating around 1000 Hive tables in 30 minutes.
Requirement:
As fast as possible, I need to have these 1000 tables in Oracle as well.
My thoughts:
1. Load the same DataFrame into Hive and then into Oracle via JDBC within the same Spark application.
2. Load the data into Hive, then Sqoop the tables from Hive into Oracle (a rough Sqoop export sketch follows below).
Any other ideas? Basically, I need to replicate a whole Hive database with ~1000 tables to Oracle.
I would greatly appreciate any advice.
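For option 2, a rough sketch of what exporting one Hive table's warehouse directory to Oracle with Sqoop could look like (connection string, schema, table and paths are placeholders; this assumes a text-format Hive table with the default ^A field delimiter, and that the Oracle table already exists):
sqoop export \
  --connect jdbc:oracle:thin:@//<host>:1521/<service> \
  --username <user> --password-file /user/etl/.oracle.pwd \
  --table MY_SCHEMA.MY_TABLE \
  --export-dir /user/hive/warehouse/mydb.db/my_table \
  --input-fields-terminated-by '\001' -m 8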
I have been working with Hive a lot and picked it up fairly easily, since it is very close to SQL and I was a DB developer earlier.
I also know about the Hive metastore, which is a MySQL-backed service storing the metadata of the Hive tables we create on top of HDFS data.
But then the terms HCatalog and HBase came up, which totally confuse me from a Hive developer's point of view.
How are they related and how can they be used? Is it true that:
HBase: it can be used like Hive to create tables with data stored in HDFS, but the difference is that it is NoSQL (it can accept unstructured data and is not strict about schema and column count)?
HCatalog: it is another service which consists of the SerDe and metastore layers and is used all the time by Hive. Hive can't work without this service since it contains the metastore DB?
I am really confused. Please help.
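For what it's worth, one concrete way Hive and HBase relate is Hive's HBase storage handler, which lets you define a Hive table over an existing HBase table; a minimal sketch (table, column family and column names are made up):
CREATE EXTERNAL TABLE hbase_events (rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload")
TBLPROPERTIES ("hbase.table.name" = "events");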
I am a little confused about the purpose of the MetaStore. When you create a table in Hive:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH '<HDFS_file_location>' INTO TABLE <table_name>;
So I know this command takes the contents of the file in HDFS and creates a MetaData form of it, storing it in the MetaStore (including column types, column names, the location in HDFS, etc.). It doesn't actually move the data from HDFS into Hive.
But what is the purpose of storing this MetaData?
When I connect to Hive using Spark SQL, for example, the MetaStore doesn't contain the actual data in HDFS, just MetaData. So is the MetaStore simply used by Hive for the parsing and compiling steps against the HiveQL query and for creating the MapReduce jobs?
The metastore stores schema information (table definitions including location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS to Hive, because Hive table data is stored in HDFS (or another compatible filesystem such as S3). You can define a new table, or even several tables, on top of some location in HDFS and put files in it. You can change an existing table's location or a partition's location; all of this information is stored in the metastore, so Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
See also this answer about the Hive query execution flow (high level): https://stackoverflow.com/a/45587873/2700344
Hive performs schema-on-read operations, which means that for the data to be processed in some structured manner (i.e. a table-like object), the layout of said data needs to be summarized in a relational structure.
takes the contents of the file in HDFS and creates a MetaData form of it
As far as I know, no files are actually read when you create a table.
Spark SQL connects to the metastore directly. Both Spark and HiveServer have their own query parsers; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are also not handled by the metastore. The metastore is just a relational database. If it's MySQL, Postgres, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.
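For example, with a MySQL-backed metastore you can see where each table's data lives with a query along these lines (TBLS and SDS are standard metastore schema tables; the database name hive is an assumption):
SELECT t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM hive.TBLS t
JOIN hive.SDS s ON t.SD_ID = s.SD_ID;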
Recently, I came across a situation:
There is a data file in a remote HDFS. We need to encrypt the data file and then create an Impala table to query the data in the local HDFS system. How can Impala query an encrypted data file? I don't know how to solve this.
It can be done by creating a user-defined function (UDF) in Hive. You can write one by implementing Hive's UDF interface, then build a jar out of your UDF class and put it in the Hive lib directory.
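A rough sketch of registering and using such a UDF from the Hive shell (the jar path, class name and function name are hypothetical):
ADD JAR /path/to/decrypt-udf.jar;
CREATE TEMPORARY FUNCTION decrypt_col AS 'com.example.udf.DecryptUDF';
SELECT decrypt_col(encrypted_column) FROM encrypted_table;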
What is the difference between Apache Sqoop and Hive? I know that Sqoop is used to import/export data between an RDBMS and HDFS, and Hive is a SQL abstraction layer on top of Hadoop. Can I use Sqoop for importing data into HDFS and then use Hive for querying?
Yes, you can. In fact, many people use Sqoop and Hive for exactly what you have described.
In my project, what I had to do was load the historical data from my RDBMS, which was Oracle, and move it to HDFS. I had Hive external tables defined over this path, which allowed me to run Hive queries for transformations. We also wrote MapReduce programs on top of this data for various kinds of analysis.
Sqoop transfers data between HDFS and relational databases. You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into HDFS and use MapReduce on the transferred data. Sqoop can export this transformed data back into an RDBMS as well. More info http://sqoop.apache.org/docs/1.4.3/index.html
Hive is data warehouse software that facilitates querying and managing large datasets residing in HDFS. Hive provides schema-on-read (as opposed to schema-on-write for an RDBMS) over the data and the ability to query it using a SQL-like language called HiveQL. More info https://hive.apache.org/
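For instance, once data imported by Sqoop is exposed as a Hive table, a HiveQL query over it is ordinary SQL-style syntax (the table and column names here are made up):
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;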
Yes you can. As a matter of fact, that's exactly how it is meant to be used.
Sqoop:
We can integrate any external data source with HDFS, i.e. SQL, NoSQL and data warehouse systems, using this tool, and since it is bi-directional we can export data back out as well.
Sqoop can also move data from a relational database into HBase (see the sketch below).
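A rough sketch of such an import into HBase (connection string, source table, HBase table, column family and row key column are placeholders):
sqoop import --connect jdbc:mysql://<hostname>/<dbname> --username <user> -P \
  --table <table name> --hbase-table <hbase table> \
  --column-family cf --hbase-row-key id --hbase-create-table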
Hive:
1. As per my understanding, we can import data from SQL databases into Hive, but not from NoSQL databases.
2. We can't export data from HDFS into SQL databases with Hive alone.
We can use both together, using the two options below.
sqoop create-hive-table --connect jdbc:mysql://<hostname>/<dbname> --table <table name> --fields-terminated-by ','
The above command will generate the Hive table; its name and schema will match the source table's.
Then load the data:
hive> LOAD DATA INPATH '<file path>' INTO TABLE <table name>;
This can be shortened to one step if you know that you want to import straight from a database directly into Hive:
sqoop import --connect jdbc:mysql://<hostname>/<dbname> --table <table name> -m 1 --hive-import