Unable to see created database in Hive at the specified location - hadoop

I'm new to Hive.
I've created a database in Hive, and by default it is created in the Hive warehouse. When I run -ls against the Hive warehouse I'm able to see the created database practice.db.
Query Used to Create Database:
create database practice
COMMENT 'Holds all practice tables';
I've created another database in Hive. When I run the -ls command against the path where I created the database, I'm unable to see practice_first.db.
Query Used to Create Database:
create database practice_first
COMMENT 'Holds all practice tables'
LOCATION '/somepath in hdfs here';
I also checked the Hive warehouse; practice_first.db is not there either.
When I run show databases I'm able to see practice_first in the list of databases.
Any suggestion as to where Hive created practice_first.db?
Thanks in advance.

I just tested to make sure the directory appears even before any tables are added, and it does. Try running describe database after you create the database to verify the path:
hive> create database test1 COMMENT 'testing' LOCATION '/user/testuser/test1.db';
OK
Time taken: 0.63 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/testuser/test1.db testuser USER
Time taken: 0.183 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/testuser/ | grep test
drwxr-xr-x - testuser testgrp 0 2015-02-12 15:43 /user/testuser/test1.db
Doing the same with the default location automatically names the directory with a .db extension. Also, it looks like the group in this case will be hive instead of the user's group:
hive> drop database test1;
OK
Time taken: 0.379 seconds
hive> create database test1 COMMENT 'testing';
OK
Time taken: 0.319 seconds
hive> describe database test1;
OK
test1 testing hdfs://nameserviceHA/user/hive/warehouse/test1.db testuser USER
Time taken: 0.263 seconds, Fetched: 1 row(s)
hive> exit;
$ hadoop fs -ls /user/hive/warehouse | grep test1
drwxrwxrwt - testuser hive 0 2015-02-12 15:53 /user/hive/warehouse/test1.db

It will be created at the path specified by LOCATION.
Check HIVE-1537 for the background.

You will only see a practice_first.db directory if
1) you create the database in the default location, or
2) you explicitly name the directory practice_first.db in the LOCATION clause of the CREATE DATABASE statement (see the sketch below).
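For instance, a minimal sketch of option 2; the HDFS path here is hypothetical, so substitute your own:
hive> create database practice_first COMMENT 'Holds all practice tables' LOCATION '/user/youruser/practice_first.db';
hive> describe database practice_first;
describe database prints the resolved location, which is the directory you should then see with hadoop fs -ls.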

Related

Getting NULL values after loading data into Hive tables from an online dataset

I am trying to load data from an online dataset into my Hive table using the Hue interface, but I am getting NULL values.
Here's my dataset:
https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Here's my code:
CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
Here's how I loaded the data:
LOAD DATA LOCAL INPATH '/home/hadoop/aisles.csv' INTO TABLE aisles;
My workarounds so far, none of which helped:
FIELDS TERMINATED BY ','
FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY ''
FIELDS TERMINATED BY ' '
I also tried removing LINES TERMINATED BY '\n'.
This is how I downloaded the data:
[hadoop@ip-172-31-76-58 ~]$ wget -O aisles.csv "https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv"
--2020-10-14 23:50:06-- https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aisles.csv’
I checked the location of the table I created; this is what it says:
hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles
I tried browsing the directory and see where the file was saved:
[hadoop@ip-172-31-76-58 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - arjiesaenz hadoop 0 2020-10-15 00:57 /user/hive/warehouse/aisles
So, I tried to change my load script like this:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
But I got an error:
Error while compiling statement: FAILED: SemanticException line 6:61 Invalid path ''/user/hive/warehouse/aisles.csv'': No files matching path hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles.csv
Hopefully someone can help me pinpoint the problem with my code.
Thanks.
I tried the same on my hadoop cluster. The code worked without any issues.
Here's my execution snippet:
hive> CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.034 seconds
hive> load data inpath '/user/hirwuser1448/aisles.csv' into table AISLES;
Loading data to table revisit.aisles
Table revisit.aisles stats: [numFiles=1, totalSize=2603]
OK
Time taken: 0.183 seconds
hive> select * from AISLES limit 10;
OK
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
Time taken: 0.038 seconds, Fetched: 10 row(s)
I think you need to cross-check whether your dataset aisles.csv is actually at the HDFS location and not just stored in a local directory.
The problem is with your load command:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
I see you tried browsing the directory to find the saved file. Do you see aisles.csv under that directory? If the file is there, then you're giving the wrong path in your load command; otherwise the file isn't there at all.
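A quick way to check, using the paths from the question, is to look at both the local file and the table directory; note that the wget output above reports Length: unspecified [text/html], which suggests the download may be an HTML page rather than the actual CSV:
$ head -5 /home/hadoop/aisles.csv
$ hdfs dfs -ls /user/hive/warehouse/aisles
$ hdfs dfs -cat /user/hive/warehouse/aisles/* | head -5
If the first lines show HTML tags instead of comma-separated id and aisle values, the file itself is the problem rather than the table definition.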
I found a workaround: I downloaded the dataset, uploaded it to an Amazon S3 bucket, and used the S3 path in the LOAD command.
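A sketch of that workaround; the bucket and key names here are hypothetical, and this assumes the cluster (e.g. EMR) can read s3:// paths:
hive> LOAD DATA INPATH 's3://your-bucket/instacart/aisles.csv' INTO TABLE aisles;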

Why does querying an external hive table require write access to the hdfs directory?

I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive.
The Avro files are in this directory:
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The server can write to this directory, but regular users cannot.
As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird: I'm only trying to read the contents of the files using Hive, not write to them.
Oddly, I don't get the same problem when I partition the table like this:
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this?
One interesting thing I noticed is that the Location values for the two tables are different and have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /apps/hive/warehouse/my-database/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
user cannot write
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
anyone can do anything :)

Can External Tables in Hive Intelligently Identify Partitions?

I need to run this statement whenever I need to mount a partition. Rather than doing it manually, is there a way to auto-detect partitions in external Hive tables?
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION () LOCATION 'locationpath';
Recover Partitions (MSCK REPAIR TABLE)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
MSCK REPAIR TABLE table_name;
Partitions will be added automatically.
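For example, a minimal sketch assuming a table partitioned by a day column whose data directories follow the day=<value> naming convention; the table name and paths here are hypothetical:
$ hdfs dfs -mkdir /user/test/mytable/day=20160826
$ hdfs dfs -put newdata.csv /user/test/mytable/day=20160826/
hive> MSCK REPAIR TABLE mytable;
hive> show partitions mytable;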
With dynamic partitioning, the partition directories do not need to be created manually. But dynamic partition mode needs to be set to nonstrict; by default it is strict:
CREATE External TABLE profile (
userId int
)
PARTITIONED BY (city String)
location '/user/test/profile';
set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into profile partition(city)
select * from nonpartition;
hive> select * from profile;
OK
1 Chicago
1 Chicago
2 Orlando
and in HDFS
[cloudera@quickstart ~]$ hdfs dfs -ls /user/test/profile
Found 2 items
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Chicago
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Orlando

Apache hive MSCK REPAIR TABLE new partition not added

I am new to Apache Hive. While working with external table partitions, if I add a new partition directly in HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below is the code I tried:
-- creating external table
hive> create external table factory(name string, empid int, age int) partitioned by(region string)
> row format delimited fields terminated by ',';
--Detailed Table Information
Location: hdfs://localhost.localdomain:8020/user/hive/warehouse/factory
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1438579844
-- creating directory in HDFS to load data for table factory
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Table data
cat factory1.txt
emp1,500,40
emp2,501,45
emp3,502,50
cat factory2.txt
EMP10,200,25
EMP11,201,27
EMP12,202,30
-- copying from local to HDFS
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory1.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory2.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Altering table to update in the metastore
hive> alter table factory add partition(region='southregion') location '/user/hive/testing/testing1/factory2';
hive> alter table factory add partition(region='northregion') location '/user/hive/testing/testing1/factory1';
hive> select * from factory;
OK
emp1 500 40 northregion
emp2 501 45 northregion
emp3 502 50 northregion
EMP10 200 25 southregion
EMP11 201 27 southregion
EMP12 202 30 southregion
Now I created a new file factory3.txt to add as a new partition for the table factory:
cat factory3.txt
user1,100,25
user2,101,27
user3,102,30
-- creating the path and copying table data
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
Now I executed the below query to update the metastore for the newly added partition:
MSCK REPAIR TABLE factory;
Now the table is not showing the content of the factory3 file for the new partition. Can someone tell me where I am making a mistake while adding the partition for table factory?
Whereas, if I run the alter command below, then it shows the new partition data:
hive> alter table factory add partition(region='eastregion') location '/user/hive/testing/testing1/factory3';
Can someone explain why the MSCK REPAIR TABLE command is not working?
For MSCK to work, the naming convention /partition_name=partition_value/ should be used. For example, in the root directory of the table:
# hadoop fs -ls /user/hive/root_of_table/*
/user/hive/root_of_table/day=20200101/data1.parq
/user/hive/root_of_table/day=20200101/data2.parq
/user/hive/root_of_table/day=20200102/data3.parq
/user/hive/root_of_table/day=20200102/data4.parq
When you run msck repair table <tablename>, the partitions for day=20200101 and day=20200102 will be added automatically.
You have to put the data in a directory named 'region=eastregion' inside the table location directory:
$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
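After the file is in place, MSCK should pick up the new partition; a sketch of the follow-up verification:
hive> MSCK REPAIR TABLE factory;
hive> show partitions factory;
hive> select * from factory where region='eastregion';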

Checking table existence and loading data into HBase and Hive tables

I have data in HDFS, and I want to load that data into an HBase table and a Hive table.
I have written a bash shell script that runs a Pig script to load the data from HDFS into HBase and a Hive script to load the data from HDFS into a Hive table, and both are working perfectly fine. My HDFS data files all have the same structure, and I'm loading all of them into a single HBase table and a single Hive table.
Now my question is: suppose I receive some more data files in the HDFS directory and run the shell script again. It will try to create the HBase and Hive tables again with the same names and report that the tables already exist. How can I write the Hive and HBase queries so that they first check for the table's existence, create the table and load the data from HDFS if it does not exist, and otherwise just insert the data into the existing HBase and Hive tables? It should not overwrite the data that already exists in the tables.
How can this be done?
Below is my script file: myScript.sh
echo "create 'goodtable','gt'" | hbase shell
pig -f a.pig -param input=/user/user/d/
hive -f h.hql
Where a.pig :
G = LOAD '$input' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://goodtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('gt:name gt:state gt:phone_no gt:gender');
h.hql:
create external table hive_table(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/user/d/' INTO TABLE hive_table;
I just wanted to add an example for HBase as Hive was already covered before:
if [[ $(echo "exists 'goodtable'" | hbase shell | grep 'not exist') ]]; then
  echo "create 'goodtable','gt'" | hbase shell;
fi
For Hive, you can add the IF NOT EXISTS clause to the CREATE TABLE statement. See the documentation.
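For example, applying that to the h.hql from the question:
hive> create external table if not exists hive_table(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
Re-running LOAD DATA ... INTO TABLE (without OVERWRITE) then appends to the existing table rather than replacing its contents.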
I don't have much experience with HBase, but I believe you can use the exists 'table_name' command to check whether the table exists and then create the table if it doesn't exist.
@visakh is correct: you can see if a table exists in HBase by entering the HBase shell and typing: exists '<tablename>'
In order to do this without entering the HBase shell interactively, you can create a simple ruby script such as the following:
exists 'mytable'
exit
Let's say you save this to a file called tabletest.rb. You can then execute this script by calling hbase shell tabletest.rb. This will create the following output, which you can then parse from your shell script:
Table tableisthere does exist
0 row(s) in 0.9830 seconds
OR
Table tableisNOTthere does not exist
0 row(s) in 0.9830 seconds
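For instance, a sketch of how the myScript.sh from the question might parse that output before creating the table, assuming the script above was saved as tabletest.rb:
if hbase shell tabletest.rb | grep -q 'does not exist'; then
  echo "create 'goodtable','gt'" | hbase shell
fi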
Adding more details for 'all in one' script:
Alternatively, you can create a more advanced script in ruby that checks for table existence and then creates it if needed; this is done by calling the HBaseAdmin Java API from within the ruby script:
conf = HBaseConfiguration.new
hbaseAdmin = HBaseAdmin.new(conf)
if !hbaseAdmin.tableExists('mytable')
  hbaseAdmin.createTable('mytable',...)
end
