Cannot query a Hive table partition after deleting the HDFS files related to the partition - hadoop

My Hadoop cluster runs a batch job every day at 11:00.
The job creates a Hive table partition (e.g. p_date=201702, p_domain=0) and imports RDBMS data into that partition, ETL-style (the Hive table is not an external table).
The job failed, and I removed some HDFS files (the partition location: p_date=20170228, p_domain=0) in order to reprocess.
That was my mistake; I should simply have typed a DROP PARTITION query in Beeline.
Now the query "select * from table_name where p_date=20170228 and p_domain=0" hangs, but "select * from table_name where p_date=20170228 and p_domain=6" succeeds.
I cannot find any error log, and no console message appears.
How can I solve this problem?
(I hope you understand me despite my limited English.)

You should not delete partitions of a Hive table that way. There is a dedicated command for it:
ALTER TABLE table_name DROP IF EXISTS PARTITION (partitioncolumn='somevalue');
Deleting the files from HDFS is not sufficient; you also need to clean the partition out of the metastore. To do this, connect to your relational database and remove the rows from the partition-related tables in the metastore database.
mysql
mysql> use hive;
mysql> SELECT * FROM PARTITIONS WHERE PART_NAME LIKE '%p_date=20170228/p_domain=0%';
+---------+-------------+------------------+----------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME      | SD_ID | TBL_ID |
+---------+-------------+------------------+----------------+-------+--------+
|       7 |  1487237959 |                0 | partition name |   336 |    329 |
+---------+-------------+------------------+----------------+-------+--------+
Then delete the partition's rows, child tables first (they reference PARTITIONS.PART_ID):
mysql> DELETE FROM PARTITION_KEY_VALS WHERE PART_ID=7;
mysql> DELETE FROM PARTITION_PARAMS WHERE PART_ID=7;
mysql> DELETE FROM PARTITIONS WHERE PART_ID=7;
After this Hive should stop using this partition in your queries.
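For the partition from the original question, a minimal sketch of the proper drop (assuming p_date and p_domain are string-typed partition columns; adjust the quoting if they are ints) would be:
ALTER TABLE table_name DROP IF EXISTS PARTITION (p_date='20170228', p_domain='0');
Once the partition is gone from the metastore, re-running the ETL job will recreate it cleanly.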

Related

Hive "insert into" doesnt add values

I'm new to Hadoop etc.
I connect via Beeline to HiveServer2. Then I create a table:
create table test02(id int, name string);
The table is created, and I try to insert values:
insert into test02(id, name) values (1, "user1");
Nothing happens: test02 and values__tmp__table__1 are created, but they are both empty.
The Hadoop directory "/user/$username/warehouse/test02" is empty too.
0: jdbc:hive2://localhost:10000> insert into test02 values (1,"user1");
No rows affected (2.284 seconds)
0: jdbc:hive2://localhost:10000> select * from test02;
+------------+--------------+
| test02.id  | test02.name  |
+------------+--------------+
+------------+--------------+
No rows selected (0.326 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+------------------------+
| tab_name               |
+------------------------+
| test02                 |
| values__tmp__table__1  |
+------------------------+
2 rows selected (0.137 seconds)
Temp tables like these are created when Hive needs to manage intermediate data during an operation. Hive automatically deletes all temporary tables at the end of the Hive session in which they are created. If you close the session and open it again, you won't find the temp table.
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/temp-tables.html
Insert data like this:
insert into test02 values (999, "user_new");
The data will be inserted into test02, and a temp table like values__tmp__table__1 will be created (the temp table will be gone once the Hive session ends).
I found a solution. I'm new to Hadoop&co, so the answer was not obvious to me.
First, I turned Hive logging to level ERROR to see the problem:
Find hive-exec-log4j2.properties ({your hive directory}/conf/)
Find property.hive.log.level and set the value to ERROR (..log.level = ERROR)
Then, while executing the command insert into via Beeline, I saw all of the errors. The main error was:
There are 0 datanode(s) running and no node(s) are excluded in this operation
I found the same question elsewhere. The top answer helped me, which was to delete all /tmp/* files (which stored all of my local HDFS data).
Then, like the first time, I initialized namenode (-format) and Hive (ran my metahive script).
The problem was solved—though it did expose another issue, which I'll need to look into: the insert into executes in 25+ seconds.

Impala via JDBC: retrieve number of dropped partitions

I am dropping multiple partitions of an Impala table via
ALTER TABLE foobar DROP IF EXISTS PARTITION (pkey='foo' OR pkey='bar');
When using impala-shell, I am presented with a result telling me how many partitions were actually dropped:
Starting Impala Shell without Kerberos authentication
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v3.2.0-cdh6.3.2 (1bb9836) built on Fri Nov 8 07:22:06 PST 2019)
The SET command shows the current value of all shell and query options.
***********************************************************************************
Opened TCP connection to impala:21000
Connected to impala:21000
Server version: impalad version 3.2.0-cdh6.3.2 RELEASE (build 1bb9836227301b839a32c6bc230e35439d5984ac)
[impala:21000] default> use my_schema;
Query: use my_schema
[impala:21000] my_schema> ALTER TABLE FOOBAR DROP IF EXISTS PARTITION (pkey='foo' OR pkey='bar');
Query: ALTER TABLE FOOBAR DROP IF EXISTS PARTITION (pkey='foo' OR pkey='bar')
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.13s
Now, in our production code, we are stuck with using only JDBC. When executing the same DDL statement via JDBC, my Statement st gives me st.getResultSet() == null and st.getUpdateCount() == -1.
Is there a way to retrieve the number of dropped partitions via JDBC only?

Presto Query HIVE Table Exception: Failed to list directory

I'm new to Presto. I have two machines running Presto 0.160: one is the coordinator, the other is a worker. I want to query a table in Hive. I can run "show tables" and "desc tablename", but when I run "select * from tablename" an exception occurs: "Query 20170728_123013_00011_q4s3a failed: Failed to list directory: hdfs://cdh-test/user/hive/warehouse/employee_hive"
presto> desc hive.default.employee_hive;
   Column    |  Type   | Comment
-------------+---------+---------
 eid         | integer |
 name        | varchar |
 salary      | varchar |
 destination | varchar |
(4 rows)
Query 20170728_123001_00010_q4s3a, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [4 rows, 268B] [40 rows/s, 2.68KB/s]
presto> select * from hive.default.employee_hive;
Query 20170728_123013_00011_q4s3a, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20170728_123013_00011_q4s3a failed: Failed to list directory: hdfs://cdh-test/user/hive/warehouse/employee_hive
Here is my configuration for the Hive catalog:
connector.name=hive-cdh4
hive.metastore.uri=thrift://***:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Where am I going wrong?
The path that the table is stored on needs to exist on HDFS for Presto to open it successfully. From the path it appears your table is an "internal" Hive table, meaning Hive should have created the path itself. Since it hasn't, you could create it yourself using a command similar to hdfs dfs -mkdir hdfs://cdh-test/user/hive/warehouse/employee_hive, although the exact command depends on your HDFS setup.
You can't access the Hadoop directory directly. I assume you created the table as a textfile and it is stored in the internal directory of the respective user.
Just create the table as an external table and you will be able to access it via Presto:
Create External Table tablename (columnames datatypes) row format delimited fields terminated by '\t' stored as textfile;
load data inpath 'Your_hadoop_directory' into table tablename;
Otherwise, create an internal table, load it, and then copy it into an external ORC table that you access via Presto:
Create Table internal_tablename (columnames datatypes) row format delimited fields terminated by '\t' stored as textfile;
load data inpath 'Your_hadoop_directory' into table internal_tablename;
Create external Table orc_tablename (columnames datatypes) STORED AS ORC;
insert into orc_tablename select * from internal_tablename;
I solved the above issue by creating an ORC table.
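For the employee_hive table from the question, a minimal sketch of that approach (the Hive column types are inferred from the Presto DESCRIBE output above, and the ORC table name is made up for illustration) could be:
Create external Table employee_hive_orc (eid int, name string, salary string, destination string) STORED AS ORC;
insert into employee_hive_orc select * from employee_hive;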

Search a table in all databases in hive

In Hive, how do we search for a table by name across all databases?
I am a Teradata user. Is there any counterpart in Hive to the Teradata system tables (such as dbc.tables and dbc.columns)?
You can use SQL LIKE syntax to search for a table.
Example:
I want to search for a table whose name starts with "benchmark", but I don't know the rest of it.
Input in the Hive CLI:
show tables like 'ben*';
Output:
+-----------------------+--+
| tab_name              |
+-----------------------+--+
| benchmark_core_month  |
| benchmark_core_qtr    |
| benchmark_core_year   |
+-----------------------+--+
3 rows selected (0.224 seconds)
Or you can try the command below if you are using Beeline:
!tables
Note: this works in Beeline only (it is based on the JDBC client).
More about beeline: http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/
You can also use HDFS to find a table across all databases.
The path of the Hive databases is:
/apps/hive/warehouse/
So, using HDFS:
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
You should query the metastore.
You can find the connection properties within hive-site.xml
bash
<$HIVE_HOME/conf/hive-site.xml grep -A1 jdo
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
--
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
--
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
--
<name>javax.jdo.option.ConnectionPassword</name>
<value>cloudera</value>
Within the metastore you can use a query similar to the following
mysql
select *
from metastore.DBS as d
join metastore.TBLS as t on t.DB_ID = d.DB_ID
where t.TBL_NAME like '% ... put something here ... %'
order by d.NAME, t.TBL_NAME
;
Searching for tables whose name contains infob across all Hive databases:
for i in `hive -e "show schemas"`; do echo "Hive DB: $i"; hive -e "use $i; show tables"|grep "infob"; done
Hive stores all its metadata information in the metastore. The metastore schema can be found here: https://issues.apache.org/jira/secure/attachment/12471108/HiveMetaStore.pdf
It has tables such as DBS for databases, TBLS for tables, and COLUMNS_V2 for columns. You may use the appropriate joins to find out table names or column names, as sketched below.
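As a hedged sketch of such a join against a MySQL-backed metastore (the table and column names follow the standard metastore schema but can differ between Hive versions), searching by column name would look something like:
mysql
select d.NAME as db_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
from metastore.DBS as d
join metastore.TBLS as t on t.DB_ID = d.DB_ID
join metastore.SDS as s on s.SD_ID = t.SD_ID
join metastore.COLUMNS_V2 as c on c.CD_ID = s.CD_ID   -- standard metastore schema assumed
where c.COLUMN_NAME like '%your_column%'
order by d.NAME, t.TBL_NAME, c.INTEGER_IDX;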
@hisi's answer is elegant. However, it triggered an out-of-memory (GC) error on our cluster. So here is another, less elegant approach that works for me.
Let foo be the table name to search for. Then:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/foo$'
If one does not remember the exact table name but only a substring bar of it, then the command is
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/[^/]\{1,\}$' | grep bar
That's an extension of Mantej Singh's answer: you can use pyspark to find tables across all Hive databases (not just one):
from functools import reduce
from pyspark import SparkContext
from pyspark.sql import DataFrame, HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# collect the names of all databases
dbnames = [row.databaseName for row in sqlContext.sql('SHOW DATABASES').collect()]
# run SHOW TABLES in every database and union the per-database results
tnames = []
for dbname in dbnames:
    tnames.append(sqlContext.sql('SHOW TABLES IN {} LIKE "%your_pattern%"'.format(dbname)))
tables = reduce(DataFrame.union, tnames)
tables.show()
The way to do this is to iterate through the databases, searching each one for a table with the specified name.
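In plain HiveQL that iteration is manual; a rough sketch (some_db is a placeholder for each database name returned) would be:
show databases;
use some_db;
show tables like '*your_pattern*';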

Can External Tables in Hive Intelligently Identify Partitions?

I need to run this whenever I need to mount a partition. Rather than doing it manually, is there a way to auto-detect partitions in external Hive tables?
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION () location 'locationpath';
Recover Partitions (MSCK REPAIR TABLE)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
MSCK REPAIR TABLE table_name;
Partitions will be added automatically.
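For the table from the question, a minimal sketch (assuming the partition directories under the table's HDFS location follow the key=value naming convention) would be:
MSCK REPAIR TABLE TableName;
SHOW PARTITIONS TableName;   -- the newly discovered partitions should now be listed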
With dynamic partitioning, the directories do not need to be created manually. But the dynamic partition mode needs to be set to nonstrict; by default it is strict.
CREATE External TABLE profile (
userId int
)
PARTITIONED BY (city String)
location '/user/test/profile';
set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into profile partition(city)
select * from nonpartition;
hive> select * from profile;
OK
1 Chicago
1 Chicago
2 Orlando
And in HDFS:
[cloudera@quickstart ~]$ hdfs dfs -ls /user/test/profile
Found 2 items
drwxr-xr-x   - cloudera supergroup          0 2016-08-26 22:40 /user/test/profile/city=Chicago
drwxr-xr-x   - cloudera supergroup          0 2016-08-26 22:40 /user/test/profile/city=Orlando
