Access Hive Data from Java - jdbc

I need to access the data in Hive from Java. According to the documentation for the Hive JDBC driver, the current JDBC driver can only be used to query data from the default database of Hive.
Is there a way to access data from a Hive database other than the default one through Java?

For example, suppose you have a Hive table:
create table visit (
id int,
url string,
ref string
)
partitioned by (date string);
Then you can use the statement
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to export the data to HDFS and then write a MapReduce job to process it. Or you can use the statement
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hdfs_out' SELECT * FROM visit WHERE date='2013-05-15';
to export the data to the local file system and write a normal Java program to process it.
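For the second route, here is a minimal sketch of such a "normal Java program", assuming the export lands as plain text files using Hive's default ^A (\u0001) field separator; the directory and column layout match the visit table above.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

public class ReadExportedVisits {
    public static void main(String[] args) throws Exception {
        // Hive writes one or more plain files (e.g. 000000_0) under the export directory
        File dir = new File("/tmp/hdfs_out");
        File[] parts = dir.listFiles();
        if (parts == null) {
            throw new IllegalStateException("Export directory not found: " + dir);
        }
        for (File part : parts) {
            List<String> lines = Files.readAllLines(part.toPath(), StandardCharsets.UTF_8);
            for (String line : lines) {
                // Hive's default text output separates fields with the \u0001 control character
                String[] fields = line.split("\u0001");
                System.out.println(String.join(" | ", fields));
            }
        }
    }
}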

The JDBC documentation can be found in the Hive Confluence documentation. To use the JDBC driver you need access to a running HiveServer.
But there are other ways to access the data; it all depends on your setup. For example, you could also use Spark, assuming the Hive and Hadoop configurations are set appropriately.
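For reference, a minimal sketch of querying a non-default database through the HiveServer2 JDBC driver; the host, port, database name, credentials, and table here are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 driver class (optional with JDBC 4 auto-loading)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // The database name goes at the end of the URL ("mydb" is a placeholder);
        // alternatively run "USE mydb" or fully qualify tables as mydb.visit
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/mydb", "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM visit LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            }
        }
    }
}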

Related

Loading data from SQL Server to S3 as parquet - AWS EMR

We have our data in SQL Server at the moment, we are trying to move them to our s3 bucket as parquet files. The intention is to analyse this s3 data in AWS EMR (Spark, Hive & Presto mainly). We don't want to store our data in HDFS.
What are the choices here? As far as we know, we can use either Spark or Sqoop for this import. Although Sqoop is faster than Spark in this case thanks to parallelism (parallel DB connections), it seems that writing Parquet files from Sqoop directly to S3 is not possible - Sqoop + S3 + Parquet results in a "Wrong FS" error. The workaround is to move the data to HDFS first and then to S3, but that seems inefficient. How about using Spark SQL to pull this data from SQL Server and write it as Parquet to S3?
Once we have loaded the data as Parquet in this layout:
s3://mybucket/table_a/day_1/(parquet files 1 ... n)
s3://mybucket/table_a/day_2/(parquet files 1 ... n)
s3://mybucket/table_a/day_3/(parquet files 1 ... n)
how can we combine them into a single table and query it using Hive? I understand that we can create a Hive external table pointing to S3, but can it point to multiple files?
Thanks.
EDIT: Adding this as requested.
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Though I am a little late, for future reference: in our project we are doing exactly this, and I would prefer Sqoop over Spark.
Reason: I used Glue to read data from MySQL into S3 and the reads are not parallel (I had AWS Support look at it, and that is just how Glue, which uses PySpark, works - although writing to S3 once the read is complete is parallel). This is not efficient and it is slow: reading 100 GB of data and writing it to S3 takes 1.5 hours.
So I used Sqoop on EMR with the Glue Catalog turned on (so the Hive metastore is on AWS), and I am able to write to S3 directly from Sqoop, which is much faster: reading 100 GB of data takes 20 minutes.
You will have to set hive.metastore.warehouse.dir=s3:// and you should see your data being written to S3 whether you do a hive-import or just a direct write.
Spark's JDBC read pulls the data over multiple connections. Here is the relevant API documentation:
def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame
Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
url - JDBC database url of the form jdbc:subprotocol:subname.
table - name of the table in the external database.
columnName - the name of a column of integral type that will be used for partitioning.
lowerBound - the minimum value of columnName used to decide partition stride.
upperBound - the maximum value of columnName used to decide partition stride.
numPartitions - the number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.
connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value pairs. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
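To make that concrete, here is a minimal sketch of the parallel JDBC read in Java, writing Parquet to S3. The JDBC URL, credentials, table name, partition column and bounds, and the S3 path are placeholder assumptions; the Java API exposes the same jdbc() overload as the Scala one quoted above.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcToS3Parquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("JdbcToS3Parquet").getOrCreate();

        Properties props = new Properties();
        props.put("user", "sql_user");          // placeholder credentials
        props.put("password", "sql_password");
        props.put("fetchsize", "10000");        // rows fetched per round trip

        // Reads table_a over 8 parallel connections, partitioned on the numeric column "id",
        // assumed here to range between 1 and 10000000 in the source table.
        Dataset<Row> df = spark.read().jdbc(
                "jdbc:sqlserver://dbhost:1433;databaseName=mydb",  // placeholder URL
                "table_a",
                "id", 1L, 10000000L, 8,
                props);

        // Write directly to S3 as Parquet, one folder per day (see the layout above).
        df.write().parquet("s3://mybucket/table_a/day_1/");

        spark.stop();
    }
}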
Create a Hive table partitioned by a date column, and specify the following location:
create table table_name (
id int,
dtDontQuery string,
name string
)
partitioned by (date string) LOCATION 's3://mybucket/table_name/';
Add a column called date to your data and populate it with the current date. You do not need to add the column if it is not required, since you can simply write to the partition location directly, but it can also serve as an audit column for your analytics.
Use Spark to write the data partitioned by that column, for example dataframe.write().partitionBy("date").parquet("s3://mybucket/table_name/").
Perform MSCK REPAIR on the Hive table daily so that the new partitions are added to the table.
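A hedged sketch of the write-and-repair part of this flow in Java (imports as in the earlier sketches); the S3 path, table name, and HiveServer2 URL are assumptions.

// Assumes "df" is a Dataset<Row> read from the source database (e.g. via the jdbc() call above)
// and that it already contains the "date" column described in this answer.
static void writePartitionedAndRepair(Dataset<Row> df) throws Exception {
    // Each distinct value of "date" becomes a folder like s3://mybucket/table_name/date=2018-01-16/
    df.write().mode("append").partitionBy("date").parquet("s3://mybucket/table_name/");

    // Register any new partition folders in the Hive metastore via HiveServer2
    try (Connection con = DriverManager.getConnection("jdbc:hive2://emr-master:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {
        stmt.execute("MSCK REPAIR TABLE table_name");
    }
}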
To apply numPartitions to a non-numeric column, create a hash of that column modulo the number of connections you want, and partition the read on that derived value, as sketched below.
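A minimal sketch of that idea using the predicates overload of jdbc(): one WHERE clause per connection, each selecting one hash bucket. The bucket expression uses SQL Server's CHECKSUM function and the column name "order_ref" as assumptions; adapt both to your database.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HashPartitionedJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("HashPartitionedJdbcRead").getOrCreate();

        Properties props = new Properties();
        props.put("user", "sql_user");        // placeholder credentials
        props.put("password", "sql_password");

        // One predicate per desired connection; each selects one hash bucket
        // of the non-numeric column "order_ref" (a placeholder column name).
        int buckets = 4;
        String[] predicates = new String[buckets];
        for (int i = 0; i < buckets; i++) {
            predicates[i] = "ABS(CHECKSUM(order_ref)) % " + buckets + " = " + i;
        }

        Dataset<Row> df = spark.read().jdbc(
                "jdbc:sqlserver://dbhost:1433;databaseName=mydb",  // placeholder URL
                "table_a",
                predicates,
                props);

        df.write().parquet("s3://mybucket/table_a/day_1/");
        spark.stop();
    }
}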
Spark is a pretty good utility tool. You can easily connect to a JDBC data source, and you can write to S3 by specifying credentials and an S3 path (e.g. Pyspark Save dataframe to S3).
If you're using AWS, your best bet for Spark, Presto and Hive is to use the AWS Glue Metastore. This is a data catalog that registers your s3 objects as tables within databases, and provides an API for locating those objects.
The answer to your Q2 is yes, you can have a table that refers to multiple files. You'd normally want to do this if you have partitioned data.
You can create the Hive external table as follows:
create external table table_a (
siteid string,
nodeid string,
aggregation_type string
)
PARTITIONED BY (day string)
STORED AS PARQUET
LOCATION 's3://mybucket/table_a';
Then you can run the following command to register the partitions stored under each day's directory in the Hive metastore:
MSCK REPAIR TABLE table_a;
Now you can access your files through Hive queries. We have used this approach in our project and it is working well. After the above command, you can run the query:
select * from table_a where day='day_1';
Hope this helps.
-Ravi

Main purpose of the MetaStore in Hive?

I am a little confused about the purpose of the MetaStore. When you create a table in Hive:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH <HDFS_file_location> INTO TABLE <table_name>;
So I know these commands take the contents of the file in HDFS, create a metadata form of it, and store it in the MetaStore (including the column types, column names, the place where it is in HDFS, etc. of each row in the HDFS file). They don't actually move the data from HDFS into Hive.
But what is the purpose of storing this MetaData?
When I connect to Hive using Spark SQL, for example, the MetaStore doesn't contain the actual information from HDFS, just the metadata. So is the MetaStore simply used by Hive to do the parsing and compiling steps of a HiveQL query and to create the MapReduce jobs?
The metastore stores the schema (table definitions including location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS to Hive, because Hive table data is stored in HDFS (or another compatible filesystem such as S3). You can define a new table, or even several tables, on top of some location in HDFS and put files in it. You can change an existing table's location or a partition's location; all this information is stored in the metastore, so Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
See also this answer about the Hive query execution flow (high level): https://stackoverflow.com/a/45587873/2700344
Hive performs schema-on-read operations, which means that for the data to be processed in some structured manner (i.e. as a table-like object), the layout of said data needs to be summarized in a relational structure.
takes the contents of the file in HDFS and creates a MetaData form of it
As far as I know, no files are actually read when you create a table.
Spark SQL connects to the metastore directly. Both Spark and HiveServer have their own query parsers; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are also not handled by the metastore. It is just a relational database. If it is MySQL, Postgres, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.
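To make that last point concrete, here is a hedged sketch of inspecting a MySQL-backed metastore directly over plain JDBC. The host, database name, and credentials are placeholders; DBS and TBLS are tables of the standard Hive metastore schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InspectMetastore {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a MySQL-backed metastore database named "metastore"
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://metastore-host:3306/metastore", "hive", "hivepassword");
             Statement stmt = con.createStatement();
             // DBS holds databases, TBLS holds table definitions (standard metastore schema)
             ResultSet rs = stmt.executeQuery(
                     "SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "." + rs.getString(2) + " (" + rs.getString(3) + ")");
            }
        }
    }
}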

Unable to partition hive table backed by HDFS

Maybe this is an easy question but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded with protobuf 3.0.0. Then, using Elephant Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly for performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts ?
I see that you have created an EXTERNAL TABLE. So you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MR, or Spark. An EXTERNAL table can be read by Hive but is not managed by Hive. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows, you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
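Putting the pieces together, here is a hedged Java sketch of that fix, assuming the data currently sits under /test/20171117, the table's LOCATION is /test, and a HiveServer2 instance is reachable; the paths, table name, and JDBC URL are assumptions carried over from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FixPartitionLayout {
    public static void main(String[] args) throws Exception {
        // Move the data into a Hive-style partition directory: /test/dt=20171117
        Configuration conf = new Configuration();           // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.rename(new Path("/test/20171117"), new Path("/test/dt=20171117"));

        // Then register the partition in the metastore via HiveServer2
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            stmt.execute("MSCK REPAIR TABLE test_messages");
        }
    }
}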

How to encrypt data in hdfs , and then create hive or impala table to query it?

Recently, I came across a situation:
There is a data file in a remote HDFS. We need to encrypt the data file and then create an Impala table to query the data in the local HDFS system. I don't know how Impala can query an encrypted data file, or how to solve this.
It can be done by creating a User Defined Function (UDF) in Hive. You can create a UDF by implementing the Hive UDF interface; then build a jar from your UDF class and put it in the Hive lib directory.
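A minimal sketch of such a UDF, assuming a simple symmetric decryption routine; the class name, key handling, and cipher choice are illustrative assumptions, not a vetted encryption scheme.

import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical decryption UDF; once the jar is built, register and use it with:
//   ADD JAR /path/to/udf.jar;
//   CREATE TEMPORARY FUNCTION decrypt_col AS 'com.example.DecryptUDF';
//   SELECT decrypt_col(encrypted_column) FROM my_table;
public class DecryptUDF extends UDF {
    private static final byte[] KEY = "0123456789abcdef".getBytes(); // placeholder 128-bit key

    public Text evaluate(Text encrypted) {
        if (encrypted == null) {
            return null;
        }
        try {
            // Assumes values were AES-encrypted and Base64-encoded before landing in HDFS
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"));
            byte[] plain = cipher.doFinal(Base64.getDecoder().decode(encrypted.toString()));
            return new Text(new String(plain, "UTF-8"));
        } catch (Exception e) {
            throw new RuntimeException("Failed to decrypt value", e);
        }
    }
}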

delete duplicates using pig where there is no primary key

I'm a newbie to Hadoop, and I have a use case with three columns: name, value, and timestamp. The data is comma-separated and in CSV format. I need to check for duplicates and delete them using Pig. How can I achieve that?
You can use Pig's DISTINCT operator to remove duplicates.
Please refer to this link to learn more about the DISTINCT operator.
Since you say your data resides in a Hive table and you want to access that data through Pig, you can use HCatLoader() to access the Hive table from Pig. HCatalog can be used for both external and internal Hive tables. But before using this loader, please verify that HCatalog is configured on your cluster. If you are using Hadoop 2.x it should already be there.
Using HCatalog, your Pig LOAD command will look like this:
A = LOAD 'table_name' using HCatLoader();
If you don't want to use HCatalog, and your Hive tables are external tables and you know the HDFS location of the data, then you can use CSVLoader() to access the data. Using CSVLoader(), your Pig LOAD command will look like this:
REGISTER piggybank.jar
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader.
A = LOAD '/user/hdfs/dirtodata/MyData.csv' using CSVLoader AS (
name:chararray, value:chararray, timestamp:chararray
);
Hive external tables are designed so that users can access the data from outside Hive, for example from Pig or MapReduce programs. But if your Hive table is an internal table and you want to analyze the data using Pig, you can still use HCatLoader() to access the Hive table data through Pig.
In both scenarios the original data is not affected during the analysis: you are only reading the data, not modifying it.
Please refer to the useful links below to understand more about HCatalog.
http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

Resources