Output is not as expected after static partitioning in Hive (Hadoop)
I am working with static partitioning. The data to be processed is as follows:
Id Name Salary Dept Doj
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789, Dev,2012
4,Asif ,145687, Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Admin,2011
8,Mohsin,569875,2012
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,563895,Qa,2010
12,Anuj,546987,Dev,2010
The HQL for creating the partitioned table and loading data into it is as follows:
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2010);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2011);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2012);
Select * from murtaza.PartSalaryReport;
Now the problem: in the HDFS location where the external table is stored, I should get the data organized into partition directories, and up to that point it's OK:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2011
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2012
But when I look at the data inside /user/cts573151/externaltables/doj=2010, it shows the data for all of 2010, 2011, and 2012, though it should contain only the 2010 records:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables/doj=2010
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rwxr-xr-x 3 cts573151 supergroup 270 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010/partition.txt
[cts573151@aster2 ~]$ hadoop dfs -cat /user/cts573151/externaltables/doj=2010/partition.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789,Dev,2012
4,Asif,145687,Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Qa,2011
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,53895,Qa,2010
12,Anuj,54987,Dev,2010
[cts573151@aster2 ~]$
Where is it going wrong?
Your LOAD statements copy the entire partition.txt into every partition directory, regardless of the Doj value in each row, which is why each directory shows all three years. Since you are creating an external table in Hive, you have to follow this set of commands instead:
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
alter table murtaza.PartSalaryReport add partition (Doj=2010);
hdfs dfs -put /home/cts573151/partition1.txt /user/cts573151/externaltables/doj=2010/
alter table murtaza.PartSalaryReport add partition (Doj=2011);
hdfs dfs -put /home/cts573151/partition2.txt /user/cts573151/externaltables/doj=2011/
alter table murtaza.PartSalaryReport add partition (Doj=2012);
hdfs dfs -put /home/cts573151/partition3.txt /user/cts573151/externaltables/doj=2012/
These commands work for me; hoping it helps you!
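For completeness, here is one way the per-year files used above could be produced from the original partition.txt, plus a quick check that each partition now holds only its own rows. This is a minimal sketch: the partition1/2/3.txt names simply mirror the answer, and it assumes Doj is always the fifth comma-separated field (the Mohsin row is missing its Dept field and would be skipped by this split).
awk -F',' '$5 == "2010"' /home/cts573151/partition.txt > /home/cts573151/partition1.txt
awk -F',' '$5 == "2011"' /home/cts573151/partition.txt > /home/cts573151/partition2.txt
awk -F',' '$5 == "2012"' /home/cts573151/partition.txt > /home/cts573151/partition3.txt
hive> show partitions murtaza.PartSalaryReport;
hive> select * from murtaza.PartSalaryReport where doj='2010';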
Related
Why does querying an external hive table require write access to the hdfs directory?
I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive. The Avro files are in this directory:
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The server can write to this directory, but regular users cannot. As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird: I'm only trying to read the contents of the files using Hive, I'm not trying to write to them. Oddly, I don't get the same problem when I partition the table like this.
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this? One interesting thing I noticed is that the Location values for the two tables are different and have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /apps/hive/warehouse/my-database/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
(user cannot write)
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
(anyone can do anything :))
Can External Tables in Hive Intelligently Identify Partitions?
I need to run this whenever I need to mount a partition. Rather than doing it manually, is there a way to auto-detect partitions in external Hive tables?
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION() location 'locationpath';
Recover Partitions (MSCK REPAIR TABLE): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
MSCK REPAIR TABLE table_name;
The partitions will be added automatically.
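Applied to the table from the original question, a minimal sketch would look like the following (the doj=2013 directory and partition2013.txt file are hypothetical; the directory name has to follow the column=value convention for MSCK to find it):
hdfs dfs -mkdir /user/cts573151/externaltables/doj=2013
hdfs dfs -put /home/cts573151/partition2013.txt /user/cts573151/externaltables/doj=2013/
hive> msck repair table murtaza.PartSalaryReport;
hive> show partitions murtaza.PartSalaryReport;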
With dynamic partitioning, the directories do not need to be created manually. But the dynamic partition mode needs to be set to nonstrict; by default it is strict.
CREATE EXTERNAL TABLE profile (userId int) PARTITIONED BY (city String) location '/user/test/profile';
set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into profile partition(city) select * from nonpartition;
hive> select * from profile;
OK
1 Chicago
1 Chicago
2 Orlando
and in HDFS:
[cloudera@quickstart ~]$ hdfs dfs -ls /user/test/profile
Found 2 items
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Chicago
drwxr-xr-x - cloudera supergroup 0 2016-08-26 22:40 /user/test/profile/city=Orlando
Apache Hive MSCK REPAIR TABLE new partition not added
I am new to Apache Hive. While working with external table partitions, if I add a new partition directly to HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below is the code I tried.
-- creating external table
hive> create external table factory(name string, empid int, age int) partitioned by(region string)
    > row format delimited fields terminated by ',';
-- Detailed Table Information
Location: hdfs://localhost.localdomain:8020/user/hive/warehouse/factory
Table Type: EXTERNAL_TABLE
Table Parameters: EXTERNAL TRUE transient_lastDdlTime 1438579844
-- creating directories in HDFS to load data for table factory
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Table data
cat factory1.txt
emp1,500,40
emp2,501,45
emp3,502,50
cat factory2.txt
EMP10,200,25
EMP11,201,27
EMP12,202,30
-- copying from local to HDFS
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory1.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory2.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- altering the table to update the metastore
hive> alter table factory add partition(region='southregion') location '/user/hive/testing/testing1/factory2';
hive> alter table factory add partition(region='northregion') location '/user/hive/testing/testing1/factory1';
hive> select * from factory;
OK
emp1 500 40 northregion
emp2 501 45 northregion
emp3 502 50 northregion
EMP10 200 25 southregion
EMP11 201 27 southregion
EMP12 202 30 southregion
Then I created a new file, factory3.txt, to add as a new partition for the table factory:
cat factory3.txt
user1,100,25
user2,101,27
user3,102,30
-- creating the path and copying table data
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
Then I executed the query below to update the metastore for the newly added partition:
MSCK REPAIR TABLE factory;
Now the table does not show the content of the factory3 file as a new partition. Can you tell me where I am making a mistake while adding the partition for table factory? Whereas, if I run the alter command, it shows the new partition data:
hive> alter table factory add partition(region='eastregion') location '/user/hive/testing/testing1/factory3';
Can I know why the MSCK REPAIR TABLE command is not working?
For MSCK to work, the naming convention /partition_name=partition_value/ has to be used. For example, in the root directory of the table:
# hadoop fs -ls /user/hive/root_of_table/*
/user/hive/root_of_table/day=20200101/data1.parq
/user/hive/root_of_table/day=20200101/data2.parq
/user/hive/root_of_table/day=20200102/data3.parq
/user/hive/root_of_table/day=20200102/data4.parq
When you run msck repair table <tablename>, the partitions day=20200101 and day=20200102 will be added automatically.
You have to put the data in a directory named 'region=eastregion' inside the table location directory:
$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
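Once factory3.txt sits under region=eastregion inside the table location, the MSCK REPAIR TABLE command from the question should register it. A short sketch to verify, assuming the warehouse location shown above:
hive> msck repair table factory;
hive> show partitions factory;
hive> select * from factory where region='eastregion';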
Zero-length file in S3 folder possibly prevents accessing that folder with Hive?
I cannot access a folder on AWS S3 with Hive; presumably, a zero-length file in that directory is the reason. In the AWS management console, a folder is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think that Hive or Hadoop may have a bug in how they define a folder scheme on S3. Here is what I have done.
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username@client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username@client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine. The two commands below both work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read the original location, because this is where the data keeps flowing. I have no control over the processes that created the directory and placed the log files in it, so I cannot check anything on that end.
I ran an experiment: I created a key/folder on S3 and placed a file in it in two different ways, using the AWS Management Console and using hadoop fs. I can see a zero-byte file in the folder when I use the AWS Console, and I get a null-pointer exception accessing it with Hive. With hadoop fs I don't have that problem. I assume the zero-byte file was supposed to be deleted, but it was not in the case of the AWS Console. I am sure that, in my case, the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.
Seems like a Hive bug. Hive 0.12.0 does not have that problem.
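If upgrading is not an option, one possible workaround (untested, and only if you are allowed to modify the bucket) is to remove the zero-byte marker object before querying the original location:
hadoop fs -rm s3n://bucketname/logs/data
hive> SELECT * FROM is_data_original LIMIT 10;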
Copy Table from Hive to HDFS
I would like to copy a Hive table from Hive to HDFS. Please suggest the steps. Later I would like to use this HDFS file for Mahout machine learning.
I have created a Hive table using data stored in HDFS. Then I transformed a few variables in that data set and created a new table from it. Now I would like to dump the Hive table from Hive to HDFS so that it can be read by Mahout.
When I type
hadoop fs -ls -R /user/hive/
I can see the table I have created:
drwxr-xr-x - hdfs supergroup 0 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr/000000_0
I tried to copy the file from Hive to HDFS:
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr/* /user/hdfs/tele_copy
Here I was expecting tele_copy to be a CSV file stored in HDFS. But when I do
hadoop fs -tail /user/hdfs/tele_copy
I get the result below, which is not comma separated:
7.980.00.00.0-9.0-30.00.00.670.00.00.00.06.00.06.670.00.670.00.042.02.02.06.04.0198.032.030.00.03.00.01.01.00.00.00.01.00.01.01.00.00.00.01.00.00.00.00.00.00.06.00.040.09.990.01.01
32.64296.7544.990.016.00.0-6.75-27.844.672.3343.334.671.3331.4725.05.3386.6754.07.00.00.044.01.01.02.02.0498.038.00.00.07.01.00.00.00.01.00.00.01.00.00.00.00.00.01.01.01.00.01.00.00.03.00.010.029.991.01.01
30.52140.030.00.250.00.0-42.0-0.520.671.339.00.00.034.6210.677.3340.09.332.00.00.040.02.02.01.01.01214.056.050.01.05.00.00.00.00.00.00.01.00.01.01.00.00.01.01.00.00.01.00.00.00.06.00.001.00.00.01.01
60.68360.2549.990.991.250.038.75-10.692.331.6715.670.00.0134.576.00.0102.6729.674.00.00.3340.02.01.08.03.069.028.046.00.05.00.01.00.00.00.00.00.01.01.01.00.00.00.01.00.00.01.00.00.00.02.00.020.0129.990.01.01
I received the same result after running this command:
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom' SELECT * FROM telecom_tr;
When I do an -ls:
drwxr-xr-x - hdfs supergroup 0 2014-04-29 17:34 /user/hdfs/data/telecom
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-29 17:34 /user/hdfs/data/telecom/000000_0
When I cat the file, the result is not a CSV.
What you're really asking is for Hive to store the file as a CSV file. Try using ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; see Row Format, Storage Format, and SerDe.
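As a sketch (the telecom_tr_csv table name is made up here, and it assumes the copy is created in the same telecom database), either write a comma-delimited copy of the table and take its file from the warehouse, or, on Hive 0.11.0 and later, put the row format directly on the export:
CREATE TABLE telecom_tr_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM telecom_tr;
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr_csv/* /user/hdfs/tele_copy
-- or, on Hive 0.11.0+:
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM telecom_tr;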