Repairing hive table using hiveContext in java - hadoop

I want to repair a Hive table for any newly added or deleted partitions. Instead of manually running the msck repair command in Hive, is there any way to achieve this in Java? I am trying to get all partitions from HDFS and from the Hive metastore, and then, after comparing them, put the newly added/deleted partitions into the Hive metastore. However, I am not able to find the right API on HiveContext. I tried to get all the partitions using HiveContext, but it throws a "table not found" error.
System.out.println(hiveContext.metadataHive().getTable("anshu","mytable").getAllPartitions());
Is there any way to add/remove partitions in hive using java?

Spark Option :
Using HiveContext you can execute it as in the example below; there is no need to do it manually.
sqlContext = HiveContext(sc)
sqlContext.sql("MSCK REPAIR TABLE your_table")
Is there any way to add/remove partitions in hive using java?
Plain java option :
If you want to do it in plain Java, without using Spark, you can use the HiveMetaStoreClient class to query the Hive metastore directly.
Please see my answer here with an example usage.
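As a rough sketch of that plain-Java route (the database and table names come from the question; the metastore URI is a placeholder and the exact client API can vary slightly between Hive versions):

import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionSync {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Placeholder: point the client at your metastore service.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // List every partition currently registered for anshu.mytable.
            List<Partition> partitions =
                    client.listPartitions("anshu", "mytable", Short.MAX_VALUE);
            for (Partition p : partitions) {
                System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
            }
            // Partitions found on HDFS but missing from the metastore can be
            // registered with client.add_partition(...), and stale ones removed
            // with client.dropPartition(db, table, partitionValues, false).
        } finally {
            client.close();
        }
    }
}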

Related

Spark(2.3) not able to identify new columns in Parquet table added via Hive Alter Table command

I have a Hive Parquet table which I create using the Spark 2.3 API df.write.saveAsTable. A separate Hive process alters the same Parquet table to add columns (based on requirements).
However, the next time I try to read the same Parquet table into a Spark dataframe, the new column that was added via the Hive ALTER TABLE command does not show up in the df.printSchema output.
Based on initial analysis, it seems there might be some conflict, and Spark is using its own schema instead of reading from the Hive metastore.
Hence, I tried the options below:
Changing the spark setting:
spark.sql.hive.convertMetastoreParquet=false
and Refreshing the spark catalog:
spark.catalog.refreshTable("table_name")
However, the above two options are not solving the problem.
Any suggestions or alternatives would be super helpful.
This sounds like a bug described in SPARK-21841. JIRA description also contains the idea for a possible workaround:
...Interestingly enough it appears that if you create the table differently, like:
spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")
run your alter table on mydb.t1, and then
val t1 = spark.table("mydb.t1")
then it works properly...
To apply this workaround, run the same ALTER TABLE command that was used in Hive from spark-shell (or your Spark application) as well:
spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

Questions about Hive

I have this environment:
A Hadoop environment (1 master, 4 slaves) with several applications: Ambari, Hue, Hive, Sqoop, HDFS, etc.
A production server (separate from Hadoop) with a MySQL database.
My goal is:
Optimize the queries made against this MySQL server, which are slow to execute today.
What did I do:
I imported the mysql data to HDFS using Sqoop.
My doubts:
Can I not run selects directly on HDFS using Hive?
Do I have to load the data into Hive before running the queries?
If new data is entered into the MySQL database, what is the best way to get this data, insert it into HDFS, and then load it into Hive again? (Maybe in real time.)
Thank you in advance
Can I not run selects directly on HDFS using Hive?
You can. Create an external table in Hive specifying your HDFS location; then you can run any HQL over it (see the sketch after this answer).
Do I have to load the data into Hive before running the queries?
With an external table you don't need to load the data into Hive; the data stays in the same HDFS directory.
If new data is entered into the MySQL database, what is the best way to get this data?
You can use Sqoop incremental import for this. It fetches only the newly added/updated data (depending on the incremental mode). You can create a Sqoop job and schedule it as per your needs.
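To illustrate the external-table part in code, here is a minimal hedged sketch using the HiveServer2 JDBC driver (host, credentials, table name, column layout, and the HDFS path from the Sqoop import are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExternalTableQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; adjust host, port, and database.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // External table: Hive only records the schema and the location;
            // the Sqoop-imported files stay where they are on HDFS.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS orders (" +
                "  id INT, customer STRING, amount DOUBLE) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
                "LOCATION '/user/hdfs/sqoop/orders'");

            // Any HQL can now be run directly over the data in HDFS.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT customer, SUM(amount) FROM orders GROUP BY customer")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getDouble(2));
                }
            }
        }
    }
}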
You can try Impala, which is much faster than Hive for SQL queries. You need to define the tables, most probably specifying a delimiter, the storage format, and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.
I have no experience with real-time data ingestion from relational databases, but you can try scheduling Sqoop jobs with cron.

Presto and hive partition discovery

I'm using Presto, mainly with the Hive connector, to connect to the Hive metastore.
All of my tables are external tables pointing to data stored in S3.
My main issue with this is that there is no way (at least none that I'm aware of) to do partition discovery in Presto, so before I start querying a table in Presto I need to switch to Hive and run msck repair table mytable.
Is there a more reasonable way to do this in Presto?
I'm on version 0.227 and the following helps me:
select * from hive.yourschema."yourtable$partitions"
This select returns all the partitions mapped in your catalog. You can filter, order, etc., just as in a normal query.
No.
If the Hive metastore doesn't see the partitions, PrestoDB will not see them.
Maybe a cron job can help you.
There is now a way to do this:
CALL system.sync_partition_metadata(schema_name=>'<your-schema>', table_name=>'<your-table>', mode=>'FULL')
Credit to this post and this video
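If you want to trigger the same call from application code rather than the Presto CLI, a rough sketch with the Presto JDBC driver could look like this (host, user, schema, and table names are placeholders, and it assumes a Presto build whose Hive connector ships the sync_partition_metadata procedure):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SyncPartitions {
    public static void main(String[] args) throws Exception {
        // Connect to the "hive" catalog; Presto requires a user name.
        String url = "jdbc:presto://presto-host:8080/hive/default";
        Properties props = new Properties();
        props.setProperty("user", "etl-user");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {

            // Ask the Hive connector to reconcile the metastore partitions
            // with what actually exists under the table's S3 location.
            stmt.execute(
                "CALL system.sync_partition_metadata(" +
                "schema_name => '<your-schema>', " +
                "table_name => '<your-table>', " +
                "mode => 'FULL')");
        }
    }
}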

delete duplicates using pig where there is no primary key

I'm a newbie to Hadoop and I have a use case with 3 columns: name, value, timestamp. The data is comma separated and in CSV format. I need to check for duplicates and delete them using Pig. How can I achieve that?
You can use Pig's DISTINCT operator to remove duplicates; for example, B = DISTINCT A; keeps only the unique (name, value, timestamp) tuples.
Please refer to this link to learn more about the DISTINCT function.
Since you say your data resides in a HIVE table and you want to access that data through Pig, you can use HCatLoader() to read the HIVE table from Pig. HCatalog can be used for both external and internal HIVE tables, but before using it, please verify that your cluster has HCatalog configured. If you are using Hadoop 2.X it should be there.
Using HCatalog, your Pig LOAD command will look like this:
A = LOAD 'table_name' using HCatLoader();
If you don't want to use HCatalog, and your HIVE tables are external tables and you know the HDFS location of the data, then you can use CSVLoader() to access the data. Using CSVLoader(), your Pig LOAD command will look like this:
REGISTER piggybank.jar
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader.
A = LOAD '/user/hdfs/dirtodata/MyData.csv' using CSVLoader AS (
    name:chararray, value:chararray, timestamp:chararray
);
Hive external tables are designed so that users can access the data from outside Hive, for example from Pig or MapReduce programs. But if your HIVE table is an internal table and you want to analyze the data using Pig, then you can use HCatLoader() to access the Hive table data through Pig.
In both scenarios the original data is not affected by the analysis: you are only reading the data, not modifying it.
Please refer to the links below to understand more about HCat.
http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

Hive is not showing tables

I am new to the Hadoop and Hive world.
I have a strange problem. While working at the hive prompt, I created a few tables and hive was showing those tables.
After exiting the Hive session, when I start a Hive terminal again, "show tables;" does not show any tables! I can see the tables in '/user/hive/warehouse' in HDFS.
What am I doing wrong? Can you please help me with this?
BalduZ is right. Set this in $HIVE_HOME/conf/hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true</value>
</property>
From then on you can run hive from any directory; this will solve your problem.
I assume you are using the default configuration, so the problem is the directory from which you start hive: the default embedded Derby metastore creates its metastore_db in the current working directory, so you need to start hive from the same directory in order to see the tables you created in the previous hive session.
For example, if you start hive while in ~/test/hive and create some tables, and the next time you start hive from ~/test, you will not see the tables you created earlier. The easiest solution is to always start hive from the same directory.
However, a better solution is to configure Hive to use a database like MySQL as its metastore. You can find how to do this here.
