Hive: understanding table creation - Hadoop

I am taking a MOOC. It told us to upload a few files from our PC to HDFS using commands like the one below:
azure storage blob upload local_path container data/logs/2008-01.txt.gz
I did the same.
Later on, when I typed the command below in a PuTTY SSH session, I was able to see those files:
hdfs dfs -ls /data/logs
Found 6 items
-rwxrwxrwx 1 331941 2016-03-03 15:56 /data/logs/2008-01.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-02.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-03.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-04.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-05.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-06.txt.gz
Then we started a Hive shell, created a table, and loaded data into it using:
load data inpath '/data/logs' into TABLE rawlog;
Then we created an external table using the command below:
CREATE EXTERNAL TABLE cleanlog
(log_date DATE,
log_time STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_user_agent STRING,
cs_referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/cleanlog';
We inserted data into the table using:
INSERT INTO TABLE cleanlog
SELECT *
FROM rawlog
WHERE SUBSTR(log_date, 1, 1) <> '#';
I exited Hive and typed the command below:
hdfs dfs -ls /data/logs
I don't see anything in that folder. Why? Where did the uploaded log files go?
Where is the rawlog table? Does it exist in the same folder? Why don't I see it?
Why do I see a file 000000_0 in my cleanlog folder? Is it the new table? If I type the command
hdfs dfs -ls /data/cleanlog
the output that I get is:
Found 1 items
-rwxr-xr-x 1 sshuser supergroup 71323206 2016-03-03 16:11 /data/cleanlog/000000_0
Update 1
What would happen if I loaded one more data file into /data/logs/ and then ran select * from rawlog? Would it automatically pull data from the new file?

If you don't want to lose the data in the source folder, use an external table (see the sketch at the end of this answer). Have a look at this SE question:
Difference between `load data inpath` and `location` in Hive?
I don't see anything in that folder. Why? Where did the uploaded log files go?
They have been moved out of that folder, because you loaded the data with LOAD DATA INPATH instead of using an external table.
Where is the rawlog table? Does it exist in the same folder? Why don't I see it?
The table definition does not exist in the folder where the data resides. In your CREATE TABLE statement you already specified the location where the cleanlog table's data is stored: /data/cleanlog.
Have a look at the questions below on where Hive stores files in HDFS:
Where does Hive store files in HDFS?
I have created a table in hive, I would like to know which directory my table is created in?
Why do I see a file 000000_0 in my cleanlog folder? Is it the new table?
It's not a new table. Execute this command in the Hive shell:
describe formatted <table_name>;
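In the output, look at the Location row; it tells you where the table's data files live. For the managed rawlog table it points into the Hive warehouse directory. An illustrative line (the exact URI depends on your cluster's configuration):
Location:               hdfs://<namenode>/hive/warehouse/rawlog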
EDIT: Regarding incremental updates to the table, follow the steps in this article and this question: Delta/Incremental Load in Hive
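For illustration, here is a hedged sketch of what an external raw-log table over the original folder could look like, assuming it uses the same columns and space delimiter as cleanlog (log_date is kept as STRING so the '#' header lines can still be filtered); adjust to your actual schema:
CREATE EXTERNAL TABLE rawlog_ext
(log_date STRING, log_time STRING, c_ip STRING, cs_username STRING, s_ip STRING,
 s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
 sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
 cs_user_agent STRING, cs_referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/data/logs';
With this definition, dropping rawlog_ext removes only the table metadata; the .txt.gz files stay in /data/logs, and any new file you upload into that folder is picked up by the next query.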

You used the LOAD command, which MOVED the files from their original location to the folder for the rawlog table (which by default will be /hive/warehouse/rawlog).
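You can confirm this from the shell; a quick check (the exact warehouse path depends on hive.metastore.warehouse.dir, assumed here to be /hive/warehouse as this answer notes):
hdfs dfs -ls /hive/warehouse/rawlog
The six 2008-*.txt.gz files should now be listed there instead of under /data/logs.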

Related

How to load multiple files into a table in Hive?

There is a directory which contains multiple files yet to be analyzed, for example, file1, file2, file3.
I want to
load data inpath 'path/to/*' overwrite into table demo
instead of
load data inpath 'path/to/file1' overwrite into table demo
load data inpath 'path/to/file2' overwrite into table demo
load data inpath 'path/to/file3' overwrite into table demo
However, it just doesn't work. Are there any easier ways to implement this?
1.
load data inpath is an HDFS metadata operation.
The only thing it does is move files from their current location to the table location.
And again, "moving" (unlike "copying") is a metadata operation, not a data operation.
2.
If the OVERWRITE keyword is used then the contents of the target table
(or partition) will be deleted and replaced by the files referred to
by filepath; otherwise the files referred by filepath will be added to
the table.
Language Manual DML-Loading files into tables
3.
load data inpath 'path/to/file1' into table demo;
load data inpath 'path/to/file2' into table demo;
load data inpath 'path/to/file3' into table demo;
or
load data inpath 'path/to/file?' into table demo;
or
dfs -mv path/to/file? ...{path to demo}.../demo
or (from bash)
hdfs dfs -mv path/to/file? ...{path to demo}.../demo
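One caveat on the snippets in the question: they repeat OVERWRITE on every call, which, per point 2, leaves only the last file in the table. A hedged contrast:
-- appends file1's contents to whatever is already in demo
load data inpath 'path/to/file1' into table demo;
-- wipes demo first, so only file2's contents remain afterwards
load data inpath 'path/to/file2' overwrite into table demo;
So for loading several files one after another, drop the OVERWRITE keyword, or use the glob / directory forms above in a single statement.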
Creating a Hive table with the path as the LOCATION parameter will automatically pick up all the files in said location.
for example:
CREATE [EXTERNAL] TABLE db.tbl(
  column1 string,
  column2 int ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<delimiter>'
LINES TERMINATED BY '\n'
LOCATION '/path/to/';  -- do NOT point to a specific file, point to the directory
Hive will automatically parse all data within the folder and "force feed" it to the table you created.
As long as all files in that path are in the same format, you are good to go.
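A quick way to see this behaviour, assuming a hypothetical new data file file4 in the same format (paths are illustrative):
hdfs dfs -put /tmp/file4 /path/to/
and then, back in Hive, the very next scan of the table already includes the new rows:
select count(*) from db.tbl;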
1) Directory contains three files
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall1.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:54 /hallfolder/hall2.csv
2) Enable this setting:
SET mapred.input.dir.recursive=true;
3) hive>
load data inpath '/hallfolder/*' into table alltable;
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
- a relative path, such as project/data1
- an absolute path, such as /user/hive/project/data1
- a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
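In other words, a glob is not even needed; pointing LOAD at the directory itself is enough. A sketch, reusing the demo table from the question:
load data inpath 'path/to' into table demo;
Hive moves every file under path/to into the table's storage directory in one go.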

Why does querying an external hive table require write access to the hdfs directory?

I've hit an interesting permissions problem when setting up an external table to view some Avro files in Hive.
The Avro files are in this directory:
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
The server can write to this directory, but regular users cannot.
As the database admin, I create an external table in Hive referencing this directory:
hive> create external table test_table (data string) stored as avro location '/server/data/avrofiles';
Now as a regular user I try to query the table:
hive> select * from test_table limit 10;
FAILED: HiveException java.security.AccessControlException: Permission denied: user=regular.joe, access=WRITE, inode="/server/data/avrofiles":myserver:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
Weird; I'm only trying to read the contents of the files using Hive, not write to them.
Oddly, I don't get the same problem when I partition the table like this:
As database_admin:
hive> create external table test_table_partitioned (data string) partitioned by (value string) stored as avro;
OK
Time taken: 0.104 seconds
hive> alter table test_table_partitioned add if not exists partition (value='myvalue') location '/server/data/avrofiles';
OK
As a regular user:
hive> select * from test_table_partitioned where value = 'some_value' limit 10;
OK
Can anyone explain this?
One interesting thing I noticed is that the Location values for the two tables are different and have different permissions:
hive> describe formatted test_table;
Location: hdfs://server.companyname.com:8020/server/data/avrofiles
$ hadoop fs -ls /server/data/
drwxr-xr-x - myserver hdfs 0 2017-01-03 16:29 /server/data/avrofiles/
user cannot write
hive> describe formatted test_table_partitioned;
Location: hdfs://server.companyname.com:8020/apps/hive/warehouse/my-database.db/test_table_partitioned
$ hadoop fs -ls /apps/hive/warehouse/my-database.db/
drwxrwxrwx - database_admin hadoop 0 2017-01-04 14:04 /apps/hive/warehouse/my-database.db/test_table_partitioned
anyone can do anything :)

After static partitioning, output is not as expected in Hive

I am working with static partitioning.
The data for processing is as follows:
Id Name Salary Dept Doj
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789, Dev,2012
4,Asif ,145687, Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Admin,2011
8,Mohsin,569875,2012
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,563895,Qa,2010
12,Anuj,546987,Dev,2010
The HQL for creating the partitioned table and loading data into it is as follows:
create external table if not exists murtaza.PartSalaryReport (
  ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2010);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2011);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2012);
Select * from murtaza.PartSalaryReport;
Now the problem is that in my HDFS location, where the external table sits, I should get the data organized directory-wise, and up to that point it is OK:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2011
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2012
But when I look at the data inside
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
it shows the data for all of 2010, 2011 and 2012, though it should show only the 2010 data:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables/doj=2010
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rwxr-xr-x 3 cts573151 supergroup 270 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010/partition.txt
[cts573151@aster2 ~]$ hadoop dfs -cat /user/cts573151/externaltables/doj=2010/partition.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789,Dev,2012
4,Asif,145687,Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Qa,2011
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,53895,Qa,2010
12,Anuj,54987,Dev,2010
[cts573151@aster2 ~]$
Where is it going wrong?
Since you are creating an external table in Hive, you have to follow the set of commands below:
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
alter table murtaza.PartSalaryReport add partition (Doj=2010);
hdfs dfs -put /home/cts573151/partition1.txt /user/cts573151/externaltables/Doj=2010/
alter table murtaza.PartSalaryReport add partition (Doj=2011);
hdfs dfs -put /home/cts573151/partition2.txt /user/cts573151/externaltables/Doj=2011/
alter table murtaza.PartSalaryReport add partition (Doj=2012);
hdfs dfs -put /home/cts573151/partition3.txt /user/cts573151/externaltables/Doj=2012/
These commands work for me. Hoping it helps you!
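Note that this answer assumes the data has already been split into per-year files (partition1.txt, partition2.txt, partition3.txt). If you only have the single mixed file, here is a hedged sketch of letting Hive do the splitting instead, via a hypothetical staging table (StageSalary is an illustrative name):
create table murtaza.StageSalary (ID int, Name string, Salary string, Dept string, Doj string)
row format delimited fields terminated by ','
stored as textfile;

load data local inpath '/home/cts573151/partition.txt' into table murtaza.StageSalary;

-- one static partition at a time, keeping only the matching rows
insert overwrite table murtaza.PartSalaryReport partition (Doj='2010')
select ID, Name, Salary, Dept from murtaza.StageSalary where Doj = '2010';
Repeat the INSERT for Doj='2011' and Doj='2012', or switch to dynamic partitioning to load all years in one statement.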

Zero-length file in S3 folder possibly prevents accessing that folder with Hive?

I cannot access a folder on AWS S3 with Hive; presumably, a zero-length file in that directory is the reason. A folder created in the AWS Management Console is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think that Hive or Hadoop may have a bug in how they handle this folder scheme on S3.
Here is what I have done.
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username@client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username@client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine.
The two commands below both work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read the original location, because this is where that data keeps flowing.
I have no control over the processes that created the directory and placed the log files in there, so I cannot check anything on that end.
I carried out an experiment: I created a key/folder on S3 and placed a file in it in two different ways: using the AWS Management Console and using hadoop fs.
When I used the AWS Console I can see a zero-byte file in the folder, and I get a null-pointer exception when accessing it with Hive. With hadoop fs I don't have that problem. I assume the zero-byte file was supposed to be deleted, but it was not in the AWS Console case. I am sure that in my case the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.
Seems like a Hive bug. Hive 0.12.0 does not have that problem.

Copy Table from Hive to HDFS

I would like to copy a Hive table from Hive to HDFS. Please suggest the steps. Later I would like to use this HDFS file for Mahout machine learning.
I have created a Hive table using data stored in HDFS. Then I transformed a few variables in that data set and created a new table from it.
Now I would like to dump the Hive table to HDFS so that it can be read by Mahout.
When I type this
hadoop fs -ls -R /user/hive/
I am able to see the list of tables I have created:
drwxr-xr-x - hdfs supergroup 0 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr/000000_0
I tried to copy the file from Hive to HDFS:
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr/* /user/hdfs/tele_copy
Here I was expecting tele_copy to be a CSV file stored in HDFS.
But when I do hadoop fs -tail /user/hdfs/tele_copy I get the result below.
7.980.00.00.0-9.0-30.00.00.670.00.00.00.06.00.06.670.00.670.00.042.02.02.06.04.0198.032.030.00.03.00.01.01.00.00.00.01.00.01.01.00.00.00.01.00.00.00.00.00.00.06.00.040.09.990.01.01
32.64296.7544.990.016.00.0-6.75-27.844.672.3343.334.671.3331.4725.05.3386.6754.07.00.00.044.01.01.02.02.0498.038.00.00.07.01.00.00.00.01.00.00.01.00.00.00.00.00.01.01.01.00.01.00.00.03.00.010.029.991.01.01
30.52140.030.00.250.00.0-42.0-0.520.671.339.00.00.034.6210.677.3340.09.332.00.00.040.02.02.01.01.01214.056.050.01.05.00.00.00.00.00.00.01.00.01.01.00.00.01.01.00.00.01.00.00.00.06.00.001.00.00.01.01
60.68360.2549.990.991.250.038.75-10.692.331.6715.670.00.0134.576.00.0102.6729.674.00.00.3340.02.01.08.03.069.028.046.00.05.00.01.00.00.00.00.00.01.01.01.00.00.00.01.00.00.01.00.00.00.02.00.020.0129.990.01.01
This is not comma separated.
I also got the same result after running this command:
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom' SELECT * FROM telecom_tr;
When I do a -ls:
drwxr-xr-x - hdfs supergroup 0 2014-04-29 17:34 /user/hdfs/data/telecom
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-29 17:34 /user/hdfs/data/telecom/000000_0
When I do a cat, the result is not a CSV.
What you're really asking is for Hive to store the file as a CSV file. Try using ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; see Row Format, Storage Format, and SerDe.
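A hedged sketch of that approach, using a hypothetical comma-delimited copy of the table (telecom_csv is an illustrative name):
create table telecom_csv
row format delimited fields terminated by ','
stored as textfile
as select * from telecom_tr;
The file(s) under the new table's warehouse directory (e.g. /user/hive/warehouse/telecom.db/telecom_csv/000000_0) are then plain comma-separated text that you can copy out with hadoop fs -cp or read from Mahout directly.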

Resources