Hive - Queries on Partitions return nothing - hadoop

I have a table that is being partitioned by a specific start date (ds). I can query the latest partition (the previous day's data) and it will use the partition fine.
hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 46281957 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds
However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.
hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.64 sec HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds
However, if I tell hive to run the query against the date field inside the table itself, and don't use the partition, I get the correct result.
hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63 Reduce: 1 Cumulative CPU: 453.52 sec HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds
What am I missing here? I'm running hadoop 1.03 and hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.
Thanks.
EDIT 1:
hive> describe formatted vtc4 partition (ds='2012-10-31');
Partition Value: [2012-10-31 ]
Database: default
Table: vtc4
CreateTime: Wed Oct 31 12:02:24 PDT 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Location: hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
transient_lastDdlTime 1351875579
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.191 seconds
The partition folders exist, but when I run hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31, it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder and see the /part-m-000* files. Running fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01 works fine.

This looks like either a permissions problem or something off in either Hive's or the namenode's metadata. Here's what I would try:
1. Copy the data in that partition to some other location in HDFS. You may need to do this as the hive or hdfs user, depending on how your permissions are set up.
2. alter table vtc4 drop partition (ds='2012-10-31');
3. alter table vtc4 add partition (ds='2012-10-31');
4. Copy the data back into that partition on HDFS.
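The steps above can be written out as a small script. What follows is only a sketch in Python that builds the commands as strings and executes nothing. The function name `recreate_partition_steps` and the backup path `/tmp/vtc4_backup` are hypothetical, and it assumes the partition lives at `<table location>/ds=<value>` as in the describe output from the question.

```python
def recreate_partition_steps(table, part_col, part_val, table_location, backup_dir):
    """Build the copy / drop / add / copy-back commands as strings.

    Nothing is executed here. backup_dir is assumed not to exist yet,
    so that `hadoop fs -cp` creates it as a copy of the partition directory.
    """
    part_dir = f"{table_location}/{part_col}={part_val}"
    spec = f"{part_col}='{part_val}'"
    return [
        # 1. copy the partition data somewhere safe
        f"hadoop fs -cp {part_dir} {backup_dir}",
        # 2. drop the partition from the metastore
        f'hive -e "alter table {table} drop partition ({spec});"',
        # 3. register the partition again
        f'hive -e "alter table {table} add partition ({spec});"',
        # 4. copy the data back into the partition directory
        f"hadoop fs -cp {backup_dir}/* {part_dir}/",
    ]


if __name__ == "__main__":
    for step in recreate_partition_steps(
        "vtc4", "ds", "2012-10-31",
        "/user/hive/warehouse/vtc4", "/tmp/vtc4_backup",
    ):
        print(step)
```

Printing the commands first makes it easy to review them, and to decide which user to run them as, before touching the cluster.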

Another thing to know about Hive partitions is that they are sometimes not registered in the metastore when created outside of Hive (e.g. from Spark SQL). You can also run MSCK REPAIR TABLE xc_bonus; (substituting your own table name) after any such change so that the partitions are reflected correctly.

Related

How to find recently updated values in hive without using time stamp

I have a table like
id name sal
1 Saa 45000
2 aaa 33000
after incremental load
id name sal
3 bbb 55000
How can I get only the most recently loaded rows without a timestamp column?
The easiest and most efficient way is using a partition. You can have a partitioned table and create a new partition every time you do the incremental load. This way the latest partition will only have the latest records.
Note that very frequent incremental loads can produce a lot of small partitions, which may not be an optimal data design.
There are a couple of other ways of doing this, but which one fits depends on your use case, the data rate, and the volume.
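As a rough illustration of the partition-per-load idea, the Python sketch below picks the newest partition from a list of partition specs in the `column=value` form that SHOW PARTITIONS prints, and builds the query that reads only that partition. The table name `emp` and the partition column `load_date` are hypothetical. It also assumes the partition values are dates in YYYY-MM-DD form, so that string order matches chronological order.

```python
def latest_partition(partition_specs):
    """Return the largest partition value from specs like 'load_date=2018-03-08'."""
    values = [spec.split("=", 1)[1] for spec in partition_specs]
    return max(values)


def latest_rows_query(table, part_col, partition_specs):
    """Build a query that reads only the most recent partition."""
    newest = latest_partition(partition_specs)
    return f"select * from {table} where {part_col}='{newest}';"


if __name__ == "__main__":
    specs = ["load_date=2018-03-07", "load_date=2018-03-08"]
    print(latest_rows_query("emp", "load_date", specs))
```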
Hope that helps!
Create a table.
hive> create table student(id int, name string);
OK
Time taken: 3.503 seconds
Insert one record into the table.
hive> insert into student values(1, 'first');
hive> select * from student;
OK
1 first
Time taken: 0.109 seconds, Fetched: 1 row(s)
Use the command below in the Hive terminal to find the location of the student table, i.e. where its data is stored on HDFS.
hive> describe formatted student;
You should get the details as shown below.
# Detailed Table Information
Database: retaildb
Owner: root
CreateTime: Thu Mar 08 15:52:47 PST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Now check the contents:
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student
Found 1 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
Insert one more record.
hive> insert into student values(1, 'second');
hive> select * from student;
OK
1 first
1 second
Time taken: 0.095 seconds, Fetched: 2 row(s)
Check the table location again.
[root@quickstart cloudera]# hdfs dfs -ls hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/
Found 2 items
-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0
-rwxr-xr-x 1 root supergroup 9 2018-03-08 15:57 hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/000000_0_copy_1
1second
[root@quickstart cloudera]# hdfs dfs -cat hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student/*
1first
1second
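The walkthrough above relies on each insert landing in a new file with a later modification time. A minimal Python sketch of picking out that newest file from `hdfs dfs -ls` output is shown below. The function name `newest_file` is hypothetical, the parser assumes the eight-column listing format seen above and paths without spaces, and the approach only helps when each load adds new files rather than rewriting existing ones.

```python
from datetime import datetime


def newest_file(ls_lines):
    """Return the path of the most recently modified file in `hdfs dfs -ls` output."""
    newest = None
    for line in ls_lines:
        fields = line.split()
        # Skip the "Found N items" header and directory entries.
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        if newest is None or modified > newest[0]:
            newest = (modified, fields[7])
    return newest[1] if newest else None


if __name__ == "__main__":
    base = "hdfs://quickstart.cloudera:8020/user/hive/warehouse/retaildb.db/student"
    listing = [
        "Found 2 items",
        f"-rwxr-xr-x 1 root supergroup 8 2018-03-08 15:53 {base}/000000_0",
        f"-rwxr-xr-x 1 root supergroup 9 2018-03-08 15:57 {base}/000000_0_copy_1",
    ]
    print(newest_file(listing))
```

The returned path can then be passed to hdfs dfs -cat, as in the last step of the walkthrough.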

Presto not returning rows from Hive Metabase

I'm very new to AWS EMR. I've got Hive up and running and have been querying external tables in S3 with no problems. I have now installed Presto onto the EMR cluster; it seems to be up and running and can read the Hive metastore. However, every query I run returns the column headers but no actual rows (query below).
presto:default> select count(*) from patrequests;
_col0
-------
0
(1 row)
Query 20171113_163811_00033_vdw6c, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
The same query in Hive runs fine:
hive> select * from patrequests limit 10;
OK
2017-10-01T00:00:18.6772628Z 779 ***** ***** ***** ***** 2017 10
Time taken: 2.876 seconds, Fetched: 10 row(s)
The data is stored in an S3 bucket in JSON format - no nesting.
Any help greatly appreciated.
Thanks
The problem seemed to be with the JSON Serde org.openx.data.jsonserde.JsonSerDe not being available to Presto. Bootstrapping the instance with the following from an S3 bucket seemed to resolve the issues:
#!/bin/bash
wget -P /usr/lib/presto/plugin/hive-hadoop2/ 'https://s3-eu-west-1.amazonaws.com/########/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar';
wget -P /usr/lib/hive-hcatalog/share/hcatalog/ 'https://s3-eu-west-1.amazonaws.com/########/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar';

Why does querying an external hive table require write access to the hdfs directory?

I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive.
The Avro files are in this directory :
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The server can write to this directory, but regular users cannot.
As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird, I'm only trying to read the contents of the file using hive, I'm not trying to write to it.
Oddly, I don't get the same problem when I partition the table like this:
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this?
One interesting thing I noticed is that the Location values for the two tables are different and have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /apps/hive/warehouse/my-database/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
user cannot write
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
anyone can do anything :)
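To make the two listings easier to compare, here is a simplified Python sketch of the owner/group/other check behind the AccessControlException above. It does not explain why Hive requests WRITE access in the first place, only why that request is denied on one directory and allowed on the other. The function name `has_access` and the group name `users` are hypothetical, and the sketch ignores superusers, ACLs, and the sticky bit.

```python
def has_access(mode, owner, group, user, user_groups, want):
    """Simplified permission check.

    mode is a listing string such as 'drwxr-xr-x'; want is 'r', 'w' or 'x'.
    """
    bits = mode[1:]  # drop the file-type character
    if user == owner:
        triplet = bits[0:3]   # owner permissions
    elif group in user_groups:
        triplet = bits[3:6]   # group permissions
    else:
        triplet = bits[6:9]   # permissions for everyone else
    return want in triplet


if __name__ == "__main__":
    joe_groups = {"users"}  # hypothetical group membership for regular.joe
    # /server/data/avrofiles: readable by others, not writable
    print(has_access("drwxr-xr-x", "myserver", "hdfs", "regular.joe", joe_groups, "w"))
    # the warehouse directory of test_table_partitioned: open to everyone
    print(has_access("drwxrwxrwx", "database_admin", "hadoop", "regular.joe", joe_groups, "w"))
```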

Can External Tables in Hive Intelligently Identify Partitions?

I need to run the statement below whenever I want to mount a partition. Rather than doing it manually, is there a way to auto-detect partitions in external Hive tables?
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION()location 'locationpath';
Recover Partitions (MSCK REPAIR TABLE)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
MSCK REPAIR TABLE table_name;
The partitions will be added automatically.
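For intuition about what this saves you from doing by hand, the Python sketch below turns partition directory paths into the ALTER TABLE ... ADD IF NOT EXISTS PARTITION statements from the question. It is an illustration of the idea only, not how Hive implements MSCK REPAIR TABLE. The function name `add_partition_statements` is hypothetical, and it assumes directories follow the `column=value` naming convention and that all partition values are strings.

```python
def add_partition_statements(table, table_location, partition_dirs):
    """Build one ADD PARTITION statement per 'col=value' directory under the table."""
    statements = []
    prefix = table_location.rstrip("/") + "/"
    for path in partition_dirs:
        if not path.startswith(prefix):
            continue  # not under this table's location
        spec = []
        for segment in path[len(prefix):].strip("/").split("/"):
            col, sep, val = segment.partition("=")
            if not sep:
                spec = []  # not a partition directory, ignore this path
                break
            spec.append(f"{col}='{val}'")
        if spec:
            spec_sql = ", ".join(spec)
            statements.append(
                f"ALTER TABLE {table} ADD IF NOT EXISTS "
                f"PARTITION ({spec_sql}) LOCATION '{path}';"
            )
    return statements


if __name__ == "__main__":
    dirs = ["/user/test/profile/city=Chicago", "/user/test/profile/city=Orlando"]
    for stmt in add_partition_statements("profile", "/user/test/profile", dirs):
        print(stmt)
```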
With dynamic partitioning, the partition directories do not need to be created manually. However, the dynamic partition mode must be set to nonstrict; by default it is strict.
CREATE External TABLE profile (
userId int
)
PARTITIONED BY (city String)
location '/user/test/profile';
set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into profile partition(city)
select * from nonpartition;
hive> select * from profile;
OK
1 Chicago
1 Chicago
2 Orlando
and in HDFS
[cloudera@quickstart ~]$ hdfs dfs -ls /user/test/profile
Found 2 items
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Chicago
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Orlando
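The mapping from rows to directories in this example can be sketched in a few lines of Python. With a dynamic partition insert, the partition column is taken from the last column of the select, and each distinct value gets its own `city=<value>` directory. The function name `route_rows` is hypothetical, and this only models the directory layout, not how Hive writes the files.

```python
from collections import defaultdict


def route_rows(rows, table_location, part_col):
    """Group rows by partition directory; the LAST element of each row is the partition value."""
    dirs = defaultdict(list)
    for row in rows:
        *data, part_val = row
        dirs[f"{table_location}/{part_col}={part_val}"].append(tuple(data))
    return dict(dirs)


if __name__ == "__main__":
    rows = [(1, "Chicago"), (1, "Chicago"), (2, "Orlando")]
    for directory, data in route_rows(rows, "/user/test/profile", "city").items():
        print(directory, data)
```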

Unable to see created database in hive in specified location

I'm new to Hive.
I've created a database in Hive and by default the database is created in Hive warehouse. When I run the -ls against Hive Warehouse I'm able to see created database practice.db.
Query Used to Create Database:
create database practice
COMMENT 'Holds all practice tables';
I've created another database in Hive with an explicit location. When I run the -ls command against the path where I created the database, I'm unable to see practice_first.db.
Query Used to Create Database:
create database practice_first
COMMENT 'Holds all practice tables'
LOCATION '/somepath in hdfs here';
I also checked the Hive warehouse, and practice_first.db is not there either.
When I run show databases, I can see practice_first in the list of databases.
Any suggestions as to where Hive created practice_first.db?
Thanks in Advance.
Just tested to make sure it should appear without any tables being added and it does. Try using describe on the database after you create to verify the path:
hive> create database test1 COMMENT 'testing' LOCATION '/user/testuser/test1.db';
OK
Time taken: 0.63 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/testuser/test1.db testuser USER
Time taken: 0.183 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/testuser/ | grep test
drwxr-xr-x - testuser testgrp 0 2015-02-12 15:43 /user/testuser/test1.db
Doing the same with the default location will automatically name the dir with a .db extension. Also it looks like the group in this case will be hive instead of the user's group:
hive> drop database test1;
OK
Time taken: 0.379 seconds
hive> create database test1 COMMENT 'testing';
OK
Time taken: 0.319 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/hive/warehouse/test1.db testuser USER
Time taken: 0.263 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/hive/warehouse | grep test1
drwxrwxrwt - testuser hive 0 2015-02-12 15:53 /user/hive/warehouse/test1.db
It will be created at the path specified by LOCATION.
See HIVE-1537.
You will see a practice_first.db directory only if either:
1) you create the database in default location.
2) you explicitly provide the name of the directory as practice_first.db in LOCATION in CREATE DATABASE statement.
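The rule in this answer can be written down as a tiny Python sketch: without LOCATION the directory is `<warehouse>/<name>.db`, and with LOCATION it is exactly the path given, with no `.db` suffix added. The function name `database_directory` and the example path `/data/practice` are hypothetical, and the default warehouse path is the one shown in the describe output above.

```python
def database_directory(name, location=None, warehouse="/user/hive/warehouse"):
    """Return the directory in which a database's tables will live."""
    if location is None:
        # Default location: Hive names the directory <name>.db
        return f"{warehouse}/{name}.db"
    # Explicit LOCATION: used exactly as given
    return location


if __name__ == "__main__":
    print(database_directory("practice"))
    print(database_directory("practice_first", "/data/practice"))
    print(database_directory("test1", "/user/testuser/test1.db"))
```

So with a LOCATION that does not end in practice_first.db, an -ls for practice_first.db finds nothing, even though the database exists and shows up in show databases.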
