Copy Table from Hive to HDFS - hadoop

I would like to copy a Hive table from Hive to HDFS. Please suggest the steps. Later I would like to use this HDFS file for Mahout machine learning.
I have created a Hive table using data stored in HDFS. Then I transformed a few variables in that data set and created a new table from it.
Now I would like to dump the Hive table from Hive to HDFS so that it can be read by Mahout.
When I type this
hadoop fs -ls -R /user/hive/
I can see the list of tables I have created:
drwxr-xr-x - hdfs supergroup 0 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr/000000_0
I tried to copy the file from Hive to HDFS,
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr/* /user/hdfs/tele_copy
Here I was expecting tele_copy to be a CSV file stored in HDFS.
But when I run hadoop fs -tail /user/hdfs/tele_copy I get the result below.
7.980.00.00.0-9.0-30.00.00.670.00.00.00.06.00.06.670.00.670.00.042.02.02.06.04.0198.032.030.00.03.00.01.01.00.00.00.01.00.01.01.00.00.00.01.00.00.00.00.00.00.06.00.040.09.990.01.01
32.64296.7544.990.016.00.0-6.75-27.844.672.3343.334.671.3331.4725.05.3386.6754.07.00.00.044.01.01.02.02.0498.038.00.00.07.01.00.00.00.01.00.00.01.00.00.00.00.00.01.01.01.00.01.00.00.03.00.010.029.991.01.01
30.52140.030.00.250.00.0-42.0-0.520.671.339.00.00.034.6210.677.3340.09.332.00.00.040.02.02.01.01.01214.056.050.01.05.00.00.00.00.00.00.01.00.01.01.00.00.01.01.00.00.01.00.00.00.06.00.001.00.00.01.01
60.68360.2549.990.991.250.038.75-10.692.331.6715.670.00.0134.576.00.0102.6729.674.00.00.3340.02.01.08.03.069.028.046.00.05.00.01.00.00.00.00.00.01.01.01.00.00.00.01.00.00.01.00.00.00.02.00.020.0129.990.01.01
Which is not comma-separated.
I also got the same result after running this command:
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom' SELECT * FROM telecom_tr;
When I do a -ls
drwxr-xr-x - hdfs supergroup 0 2014-04-29 17:34 /user/hdfs/data/telecom
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-29 17:34 /user/hdfs/data/telecom/000000_0
When I cat the file, the result is not a CSV.

What you're really asking is for Hive to store the file as a CSV. Right now the table is using Hive's default Ctrl-A (\001) field delimiter, which is invisible, so the output looks run together. Try using ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' — see Row Format, Storage Format, and SerDe in the Hive documentation.
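For example, one way to get a comma-delimited copy is a CREATE TABLE ... AS SELECT with an explicit delimiter (a sketch; the new table name telecom_tr_csv is mine, the source table is the one from the question):
CREATE TABLE telecom.telecom_tr_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM telecom.telecom_tr;
The files under /user/hive/warehouse/telecom.db/telecom_tr_csv/ should then be comma-separated, and you can -cp them or point Mahout at that directory directly.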

Related

After Static Partitioning output is not as expected in hive

I am working with static partitioning.
The data for processing is as follows:
Id Name Salary Dept Doj
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789, Dev,2012
4,Asif ,145687, Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Admin,2011
8,Mohsin,569875,2012
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,563895,Qa,2010
12,Anuj,546987,Dev,2010
The HQL for creating the partitioned table and loading data into it is as follows:
create external table if not exists murtaza.PartSalaryReport (ID int,Name
string,Salary string,Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2010);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2011);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2012);
Select * from murtaza.PartSalaryReport;
Now the problem is that in the HDFS location where the external table is stored I should get the data organized directory-wise, and up to that point it is OK:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2011
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2012
But when I look at the data inside
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
it shows data for all of 2010, 2011 and 2012, though it should show only the 2010 data.
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables/doj=2010
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rwxr-xr-x 3 cts573151 supergroup 270 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010/partition.txt
[cts573151@aster2 ~]$ hadoop dfs -cat /user/cts573151/externaltables/doj=2010/partition.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789,Dev,2012
4,Asif,145687,Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Qa,2011
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,53895,Qa,2010
12,Anuj,54987,Dev,2010
[cts573151@aster2 ~]$
Where is it going wrong?
Since you are creating an external table in Hive, you have to follow the below set of commands. (LOAD DATA does not filter rows by the partition value; it simply moves the whole file into that partition's directory, so loading the same full file into each partition puts every year's rows in every partition.)
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
alter table murtaza.PartSalaryReport add partition (Doj=2010);
hdfs dfs -put /home/cts573151/partition1.txt /user/cts573151/externaltables/Doj=2010/
alter table murtaza.PartSalaryReport add partition (Doj=2011);
hdfs dfs -put /home/cts573151/partition2.txt /user/cts573151/externaltables/Doj=2011/
alter table murtaza.PartSalaryReport add partition (Doj=2012);
hdfs dfs -put /home/cts573151/partition3.txt /user/cts573151/externaltables/Doj=2012/
These commands work for me. Hoping it helps you!
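Alternatively, if you want Hive itself to route each row to the right year instead of placing per-year files by hand, a dynamic-partition insert from a non-partitioned staging table is a common pattern (a sketch; the staging table name is illustrative):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create table murtaza.SalaryStage (ID int, Name string, Salary string, Dept string, Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.SalaryStage;
INSERT OVERWRITE TABLE murtaza.PartSalaryReport PARTITION (Doj)
SELECT ID, Name, Salary, Dept, Doj FROM murtaza.SalaryStage;
Each row lands in the doj=... directory matching its own Doj value, so the 2010 directory holds only 2010 rows.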

hive understanding table creation

I am taking a MOOC.
It told us to upload a few files from our PC to HDFS using the command below:
azure storage blob upload local_path container data/logs/2008-01.txt.gz
I did the same.
Later on, when I typed the command below in a PuTTY secure shell, I was able to see those files:
hdfs dfs -ls /data/logs
Found 6 items
-rwxrwxrwx 1 331941 2016-03-03 15:56 /data/logs/2008-01.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-02.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-03.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-04.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-05.txt.gz
-rwxrwxrwx 1 331941 2016-03-03 15:58 /data/logs/2008-06.txt.gz
Then we started a Hive terminal, first created a table, and then loaded data into that table using
load data inpath '/data/logs' into TABLE rawlog;
Then we created an external table using the command below:
CREATE EXTERNAL TABLE cleanlog
(log_date DATE,
log_time STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_user_agent STRING,
cs_referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/cleanlog';
We inserted data into the table using
INSERT INTO TABLE cleanlog
SELECT *
FROM rawlog
WHERE SUBSTR(log_date, 1, 1) <> '#';
I exited out of Hive and typed in the command below:
hdfs dfs -ls /data/logs
I don't see anything in that folder. Why? Where did the uploaded log files go?
Where is the rawlog table? Does it exist in the same folder? Why don't I see it?
Why do I see the file 000000_0 in my cleanlog folder? Is it the new table? If I type the command
hdfs dfs -ls /data/cleanlog
the output that I get is:
Found 1 items
-rwxr-xr-x 1 sshuser supergroup 71323206 2016-03-03 16:11 /data/cleanlog/000000_0
Update 1:
What would happen if I load one more data file at /data/logs/ and then run select * from rawlog? Would it automatically pull data from the new file?
If you don't want to lose the data in the source folder, use an external table. Have a look at this SE question:
Difference between `load data inpath ` and `location` in hive?
I don't see anything in that folder. Why? Where did the uploaded log files go?
They have been moved, because the data was loaded into the table and you used LOAD DATA INPATH instead of an external table.
Where is the rawlog table? Does it exist in the same folder? Why don't I see it?
The table definition does not exist in the folder where the data resides; rawlog's data was moved into its warehouse directory. In your CREATE TABLE statement you already specified the location of the cleanlog table's data as /data/cleanlog.
Have a look at the questions below on where Hive stores files in HDFS:
Where does Hive store files in HDFS?
I have created a table in hive, I would like to know which directory my table is created in?
Why do I see the file 000000_0 in my cleanlog folder? Is it the new table?
It's not a new table; it is the output file written by the INSERT into cleanlog. Execute this command in the Hive shell to see the table's details:
describe formatted <table_name>;
EDIT: Regarding incremental updates to the table, follow the steps in this article and this question: Delta/Incremental Load in Hive
You used the LOAD command, which MOVED the files from their original location to the folder for the rawlog table (which by default will be /hive/warehouse/rawlog).
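If you want to confirm that yourself (a small sketch; the warehouse path varies by distribution, so trust the Location field reported by describe formatted):
hdfs dfs -ls /hive/warehouse/rawlog
and, in the Hive shell,
describe formatted rawlog;
where the Location line in the output shows the exact directory the loaded .gz files were moved into.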

Zero-length file in S3 folder possibly prevents accessing that folder with Hive?

I cannot access a folder on AWS S3 with Hive; presumably, a zero-length file in that directory is the reason. A folder created in the AWS Management Console is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think Hive or Hadoop may have a bug in how they handle this folder scheme on S3.
Here is what I have done:
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username@client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username@client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine.
The two commands below both work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read the original location, because this is where that data keeps flowing.
I have no control over the processes that created the directory and placed the log files in there, so I cannot check anything on that end.
I carried out an experiment: I created a key/folder on S3 and placed a file in it in two different ways: using the AWS Management Console and using hadoop fs.
When the folder was created via the AWS Console, I can see a zero-byte file in it and I get a NullPointerException when accessing it with Hive. With hadoop fs I don't have that problem. I assume the zero-byte file is supposed to be deleted, but it was not in the AWS Console case. I am sure that in my case the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.
Seems like a Hive bug. Hive 0.12.0 does not have that problem.
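If upgrading is not an option, a possible workaround (untested here, and only if nothing else depends on that object) is to delete the zero-byte marker object from the prefix before creating the table, using the zero-byte key shown in the listing above, e.g.:
hadoop fs -rm s3n://bucketname/logs/data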

Is there an apache pig equivalent of "SHOW TABLES"?

I have a Hadoop data store I'm accessing in Pig and not a lot of documentation on it, plus I'm new to Pig, so I am looking for the Pig equivalent of "SHOW TABLES". When I have a connection to a MySQL db I can do this and get a general sense of what data is in there; I have found several tutorials but nothing on point. If not, is there some other way to orient myself to a Hadoop data store I know nothing about?
ETA: This would be when running Pig in interactive mode, rather than loading a script. Probably obvious, but I thought I should mention it.
The closest thing I can see to 'show tables' is the 'history' command, which effectively lists all aliases created.
grunt> history
1 a = LOAD 'iris.csv' USING PigStorage (',') AS
(sl:double,sw:double,pl:double,pw:double,spec:int);
2 b = FILTER a BY spec==1;
3 c = GROUP b BY pw;
4 d = FOREACH c GENERATE COUNT(b);
Pig doesn't have a concept of tables. It can read any file that is on your HDFS filesystem and stores the parsed result in a relation.
Note that you can also run HDFS filesystem commands from the grunt shell
It's probably best to familiarise yourself with HDFS first and make sure you can comfortably navigate the filesystem, so you can find the data you want to process with Pig.
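For example, from the grunt shell (a minimal sketch; the paths are illustrative):
grunt> fs -ls /
grunt> fs -ls /user/hduser/input
grunt> fs -cat /user/hduser/input/part-m-00000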
We had also come across a similar situation and applied all the solutions from Stack Overflow, but none of them solved the issue. The solution to this problem is to use Pig's store command and provide a dedicated folder for it.
The setup we prefer is:
grunt> fs -mkdir /user/hduser/AllPigTableStructures/
grunt> fs -chmod 777 /user/hduser/AllPigTableStructures/
Now we will store all table information in this folder named "AllPigTableStructures".
Then you should use the "store" function as in the code below:
grunt> store extract_details into '/user/hduser/AllPigTableStructures/SchemaTwit' using PigStorage('\t', '-schema');
The last line of the output for this should be:
2017-09-18 02:13:56,566 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Now you should see a folder named SchemaTwit, like this:
grunt> fs -ls /user/hduser/AllPigTableStructures
Found 12 items
drwxr-xr-x - hduser supergroup 0 2017-09-18 02:13 /user/hduser/AllPigTableStructures/SchemaTwit
Finally, if you look at the contents of the SchemaTwit directory, it contains the schema of your table and all the details about it; the command for this is below, and the part-m-xxx files contain the data.
grunt> fs -ls /user/hduser/AllPigTableStructures/SchemaTwit
Found 4 items
-rw-r--r-- 2 hduser supergroup 8 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_header
-rw-r--r-- 2 hduser supergroup 239 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
-rw-r--r-- 2 hduser supergroup 0 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/_SUCCESS
-rw-r--r-- 2 hduser supergroup 140 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/part-m-00000
Now you can use the cat command below on the schema file to see your table's schema, or on the part-m-xxx files to browse your data:
grunt> fs -cat /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
{"fields":[{"name":"id","type":50,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"text","type":50,"description":"autogenerated from Pig Field Schema","schema":null}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
Now, to load your table with its schema, this command helps:
WithSchema = LOAD '/user/hduser/AllPigTableStructures/SchemaTwit';
PS: We are running our Pig in MapReduce mode.
It looks like you have misunderstood Pig. As @seedhead has specified, you handle files with Pig. Folks quite often mistake it for a database (like HBase) or a warehouse (like Hive), which it is not. As far as visualizing the data is concerned, you can list the files and directories through the Pig shell. And if you need to see how many records (or lines) a particular file has, you could do something like this:
Records = LOAD '/path_of_the_file';
Records_Group= GROUP Records ALL;
Records_Count = FOREACH Records_Group GENERATE COUNT(Records);
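To actually print the count to the console you would then dump the relation:
DUMP Records_Count;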

Issue with sqoop export with hive table partitioned by timestamp

I'm unable to sqoop export a hive table that's partitioned by timestamp.
I have a Hive table that's partitioned by timestamp. The HDFS path it creates contains spaces, which I think is causing issues with Sqoop.
fs -ls
2013-01-28 16:31 /user/hive/warehouse/my_table/day=2013-01-28 00%3A00%3A00
The error from sqoop export:
13/01/28 17:18:23 ERROR security.UserGroupInformation: PriviledgedActionException as:brandon (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/my_table/day=2012-10-29 00%3A00%3A00
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1239)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1192)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1165)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1147)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:383)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
If you do
fs -ls /user/hive/warehouse/my_table/day=2013-01-28 00%3A00%3A00
ls: /user/hive/warehouse/my_table/day=2013-01-28': No such file or directory
ls:00%3A00%3A00': No such file or directory
It works if you add quotes:
brandon@prod-namenode-new:~$ fs -ls /user/hive/warehouse/my_table/day="2013-01-28 00%3A00%3A00"
Found 114 items
-rw-r--r-- 2 brandon supergroup 4845 2013-01-28 16:30 /user/hive/warehouse/my_table/day=2013-01-28%2000%253A00%253A00/000000_0
...
You can try the path as "/user/hive/warehouse/my_table/day=2013-01-28*".
So what you can do is:
Select all the data from Hive and write it to a directory in HDFS
(using INSERT OVERWRITE DIRECTORY '..path..' SELECT a.column_1, a.column_n FROM table a),
and in the sqoop command specify the directory location using --export-dir ..dir..
Hope this will help.
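A hedged sketch of that approach (table, path, and connection values below are placeholders; INSERT OVERWRITE DIRECTORY writes Ctrl-A-delimited text by default, hence the delimiter option on the Sqoop side):
INSERT OVERWRITE DIRECTORY '/user/brandon/my_table_export'
SELECT a.col_1, a.col_2, a.day FROM my_table a;
and then, from the shell:
sqoop export \
  --connect jdbc:mysql://dbhost/dbname --username dbuser -P \
  --table my_export_target \
  --export-dir /user/brandon/my_table_export \
  --input-fields-terminated-by '\001'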
Filenames with a colon (:) are not supported in HDFS paths, as mentioned in this JIRA, but they can be made to work by converting the colon to hex. However, when Sqoop tries to read that path it converts it back to a colon (:), and hence it cannot find the path. I suggest removing the time part from your directory name and trying again. Hope this answers your question.
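If you go that route, one hedged way to drop the time part is to repartition on a plain date string via a dynamic-partition insert (the column names here are illustrative, not from the question):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE TABLE my_table_by_day (user_id STRING, value STRING)
PARTITIONED BY (day STRING);
INSERT OVERWRITE TABLE my_table_by_day PARTITION (day)
SELECT user_id, value, to_date(day) FROM my_table;
The new partition directories come out as day=2013-01-28, with no space or colon, which Sqoop can then read.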
