Issue with sqoop export with hive table partitioned by timestamp - hadoop

I'm unable to sqoop export a hive table that's partitioned by timestamp.
I have a hive table that's partitioned by timestamp. The HDFS path it creates contains spaces, which I think is causing issues with sqoop.
fs -ls
2013-01-28 16:31 /user/hive/warehouse/my_table/day=2013-01-28 00%3A00%3A00
The error from sqoop export:
13/01/28 17:18:23 ERROR security.UserGroupInformation: PriviledgedActionException as:brandon (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/my_table/day=2012-10-29 00%3A00%3A00
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1239)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1192)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1165)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1147)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:383)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
If you do
fs -ls /user/hive/warehouse/my_table/day=2013-01-28 00%3A00%3A00
ls: `/user/hive/warehouse/my_table/day=2013-01-28': No such file or directory
ls: `00%3A00%3A00': No such file or directory
It works if you add quotes:
brandon@prod-namenode-new:~$ fs -ls /user/hive/warehouse/my_table/day="2013-01-28 00%3A00%3A00"
Found 114 items
-rw-r--r-- 2 brandon supergroup 4845 2013-01-28 16:30 /user/hive/warehouse/my_table/day=2013-01-28%2000%253A00%253A00/000000_0
...

You can try passing the path with a wildcard, as "/user/hive/warehouse/my_table/day=2013-01-28*".
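For example, a rough sketch of the export command with that wildcard; the JDBC URL, table name, and delimiter below are placeholders I'm assuming, not values from the question:
sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --table my_table \
  --export-dir '/user/hive/warehouse/my_table/day=2013-01-28*' \
  --input-fields-terminated-by '\001'
# \001 (Ctrl-A) is Hive's default field delimiter; adjust if your table uses something else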

So what you can do is:
Select all the data from Hive and write it to a directory in HDFS
(using INSERT OVERWRITE DIRECTORY '..path..' SELECT a.column_1, a.column_n FROM table a),
then in the sqoop command specify that directory with --export-dir ..dir.., as in the sketch below.
Hope this will help.
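A minimal sketch of that approach, assuming a table named my_table and an export directory under /tmp (both placeholders); INSERT OVERWRITE DIRECTORY writes Ctrl-A-delimited text by default:
-- in hive
INSERT OVERWRITE DIRECTORY '/tmp/my_table_export'
SELECT a.column_1, a.column_n FROM my_table a;
# then point sqoop at that directory (same placeholder flags as in the sketch above)
sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --table my_table \
  --export-dir /tmp/my_table_export \
  --input-fields-terminated-by '\001'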

Filenames containing a colon (:) are not supported in HDFS paths, as mentioned in these JIRAs, but they do work once the colon is hex-encoded. When sqoop tries to read the path it converts the hex back to a colon, and hence it cannot find the path. I suggest removing the time part from your partition directory name and trying again (see the sketch below). Hope this answers your question.
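A minimal sketch of that idea, with hypothetical column names col1 and col2 since the real schema isn't shown in the question: partition by a date-only string so the directory becomes day=2013-01-28, with no space or colon to encode.
-- hive: a copy of the table partitioned by date instead of timestamp (names assumed)
CREATE TABLE my_table_by_day (col1 STRING, col2 STRING)
PARTITIONED BY (day STRING)
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE my_table_by_day PARTITION (day='2013-01-28')
SELECT col1, col2 FROM my_table WHERE day='2013-01-28 00:00:00';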

Related

Why am I getting a "Permission denied" error in Hadoop? And why am I unable to import a .csv file?

I have this table called 'emp' in hbase.
hbase(main):006:0> create 'emp', 'personel data'
0 row(s) in 1.3110 seconds
I've inserted 2 rows into it via the put command.
hbase(main):020:0> scan 'emp'
ROW COLUMN+CELL
1 column=personel data:name, age, gender, timestamp=1641624361341, value=Pratik, 24, Male
2 column=personel data:name, age, gender, timestamp=1641624514176, value=Emma, 21, Female
2 row(s) in 0.0320 seconds
But, now I want to add data from the csv file.
[cloudera@quickstart ~]$ cat sample_kpi.csv
sam,24,m
emma,21,f
richard,23,m
susane,22,f
You can see that my current working directory is /home/cloudera and that it contains sample_kpi.csv. Hence, the path is /home/cloudera/sample_kpi.csv.
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ ls
avro data1 Documents emp.java external_jars kerberos Music part_dir rakeshdata sample_kpi.txt Videos
cloudera-manager Desktop Downloads enterprise-deployment.json external-unified kpifolder parcels Pictures rakeshdata1 sparkjars_exec workspace
cm_api.py devices.json eclipse express-deployment.json input.txt lib parquet_write Public sample_kpi.csv Templates zeyo_tab.java
So I've written this command to import the data and run it a few times, but it shows an error.
hbase(main):015:0> $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
SyntaxError: (hbase):15: syntax error, unexpected null
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
^
hbase(main):016:0> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
SyntaxError: (hbase):16: syntax error, unexpected tSYMBEG
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
^
hbase(main):017:0> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
SyntaxError: (hbase):17: syntax error, unexpected tSYMBEG
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY, personal_data:name, personal_data:age, personal_data:gender emp /home/cloudera/sample_kpi.csv
I also thought this must be occurring because I haven't uploaded the sample_kpi.csv file to HDFS. So I tried uploading it, but I got this "Permission denied" error:
[cloudera@quickstart ~]$ hadoop dfs -put /home/cloudera/sample_kpi.csv /hbase
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
put: Permission denied: user=cloudera, access=WRITE, inode="/hbase":hbase:supergroup:drwxr-xr-x
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/sample_kpi.csv /hbase
put: Permission denied: user=cloudera, access=WRITE, inode="/hbase":hbase:supergroup:drwxr-xr-x
And sometimes it says: No such file or directory
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/sample_kpi.csv /etc/hbase
put: `/etc/hbase': No such file or directory
[cloudera@quickstart ~]$ hadoop dfs -put /home/cloudera/sample_kpi.csv /hbase
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Can somebody tell me what exactly the error is?
And "WHY" is it happening?!
Thank you so much!
Your error tells you exactly what the problem is. You're currently the cloudera user. The folder is owned by the hbase user and doesn't allow other users to write: inode="/hbase":hbase:supergroup:drwxr-xr-x
The permissions are for HDFS, not your local filesystem.
Run sudo su - hbase first before you can put data into the /hbase HDFS path; and no, hdfs:///etc doesn't exist at all by default.
Beyond that, I don't think putting files directly into the HBase data path is the proper way to store them. Your syntax errors occur because you're putting unquoted, space-separated items in the value for -Dimporttsv.columns, so the command interprets them as separate arguments. Also, the HDFS path for the files you import would be /user/cloudera, not /home/cloudera. See the sketch below.
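Roughly, something like this, run from the Linux (bash) shell rather than the hbase shell (which is why the shell reported SyntaxError); note the quoted, space-free column list. The column families named here follow the command in the question and must actually exist on the emp table (yours was created with the family 'personel data'):
# upload the file to your HDFS home directory, not /hbase
hdfs dfs -mkdir -p /user/cloudera
hdfs dfs -put /home/cloudera/sample_kpi.csv /user/cloudera/
# run ImportTsv from bash, with the column mapping as a single quoted value
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns='HBASE_ROW_KEY,personal_data:name,personal_data:age,personal_data:gender' \
  emp /user/cloudera/sample_kpi.csv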

Impala : Error loading data using load inpath : AccessControlException: Permission denied by sticky bit: user=impala

All,
I am new and trying a few hands-on use cases.
I have a file in HDFS and want to load it into an Impala table.
-- File location on hdfs : hdfs://xxx/user/hive/warehouse/impala_test
-- Table : CREATE TABLE impala_test_table
(File_Format STRING ,Rank TINYINT, Splitable_ind STRING )
Row format delimited
Fields terminated by '\,'
STORED AS textfile;
-- Load syntax in impala-shell : Load data inpath 'hdfs://xxx/user/hive/warehouse/impala_test' into table impala_test_table;
P.S.: I am able to load it successfully with the Hive shell.
ERROR: AccessControlException: Permission denied by sticky bit: user=impala, path="/user/hive/warehouse/impala_test":uabc:hive:-rwxrwxrwx, parent="/user/hive/warehouse":hive:hive:drwxrwxrwt at ......
All permissions(777) are granted on the file impala_test.
Any suggestions ?
Thanks.
I know it is too late to answer this question, but maybe it will help others searching in the future.
Refer to the HDFS Permissions Guide:
The Sticky bit can be set on directories, preventing anyone except the superuser, directory owner or file owner from deleting or moving the files within the directory. Setting the sticky bit for a file has no effect.
So, to the best of my knowledge, you should sign in as the hdfs superuser and remove the sticky bit with hdfs dfs -chmod 0755 /dir_with_sticky_bit or hdfs dfs -chmod -t /dir_with_sticky_bit (see the sketch below).
Hope this answer helps somebody.
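A minimal sketch, assuming the directory with the sticky bit is /user/hive/warehouse as shown in the error (parent="/user/hive/warehouse":hive:hive:drwxrwxrwt):
# as the hdfs superuser, drop the sticky bit (0755 as suggested above; note this also removes group/other write access)
sudo -u hdfs hdfs dfs -chmod 0755 /user/hive/warehouse
# then retry the load from impala-shell:
# LOAD DATA INPATH 'hdfs://xxx/user/hive/warehouse/impala_test' INTO TABLE impala_test_table;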

After Static Partitioning output is not as expected in hive

I am working with static partitioning.
The data for processing is as follows:
Id Name Salary Dept Doj
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789, Dev,2012
4,Asif ,145687, Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Admin,2011
8,Mohsin,569875,2012
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,563895,Qa,2010
12,Anuj,546987,Dev,2010
The HQL for creating the partitioned table and loading data into it is as follows:
create external table if not exists murtaza.PartSalaryReport (ID int,Name
string,Salary string,Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2010);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2011);
LOAD DATA LOCAL INPATH '/home/cts573151/partition.txt'
overwrite into table murtaza.PartSalaryReport partition (Doj=2012);
Select * from murtaza.PartSalaryReport;
Now the problem is that in the HDFS location where the external table sits I should get the data directory-wise, and up to that point it is fine:
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 4 items
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2011
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2012
But when I look at the data inside
drwxr-xr-x - cts573151 supergroup 0 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010
it shows the data for all of 2010, 2011 and 2012, though it should show only the 2010 data.
[cts573151@aster2 ~]$ hadoop dfs -ls /user/cts573151/externaltables/doj=2010
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 1 items
-rwxr-xr-x 3 cts573151 supergroup 270 2016-12-12 13:06 /user/cts573151/externaltables/doj=2010/partition.txt
[cts573151@aster2 ~]$ hadoop dfs -cat /user/cts573151/externaltables/doj=2010/partition.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
1,Murtaza,360000,Sales,2010
2,Soumya,478968,Admin,2011
3,Sneha,45789,Dev,2012
4,Asif,145687,Qa,2012
5,Shreyashi,36598,Qa,2011
6,Adil,25987,Dev,2010
7,Yashwant,23982,Qa,2011
9,Anil,56798,Sales,2010
10,Balaji,56489,Sales,2012
11,Utsav,53895,Qa,2010
12,Anuj,54987,Dev,2010
[cts573151@aster2 ~]$
Where is it going wrong?
Since you are creating an external table in Hive, you have to follow the set of commands below. LOAD DATA does not filter rows by the partition value; it copies the whole file into the partition directory, which is why every partition shows all the years. Put one file per partition, containing only that year's rows:
create external table if not exists murtaza.PartSalaryReport (
ID int, Name string, Salary string, Dept string)
partitioned by (Doj string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/user/cts573151/externaltables';
alter table murtaza.PartSalaryReport add partition (Doj=2010);
hdfs dfs -put /home/cts573151/partition1.txt /user/cts573151/externaltables/Doj=2010/
alter table murtaza.PartSalaryReport add partition (Doj=2011);
hdfs dfs -put /home/cts573151/partition2.txt /user/cts573151/externaltables/Doj=2011/
alter table murtaza.PartSalaryReport add partition (Doj=2012);
hdfs dfs -put /home/cts573151/partition3.txt /user/cts573151/externaltables/Doj=2012/
These commands work for me; hoping they help you! If you need to split the original file into per-year files first, see the sketch below.
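A quick sketch for producing those per-year files from the single partition.txt in the question; the output names partition_2010.txt etc. are my own and would stand in for partition1.txt/partition2.txt/partition3.txt above. The year is the last comma-separated field:
awk -F',' '{ print > ("/home/cts573151/partition_" $NF ".txt") }' /home/cts573151/partition.txt
# writes /home/cts573151/partition_2010.txt, partition_2011.txt and partition_2012.txt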

Zero-length file in S3 folder possibly prevents accessing that folder with Hive?

I cannot access a folder on AWS S3 with Hive; presumably, a zero-length file in that directory is the reason. In the AWS Management Console the folder is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think that Hive or Hadoop may have a bug in how they define a folder scheme on S3.
Here is what I have done.
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username@client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username@client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine.
The two commands below both work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read the original location, because that is where the data keeps flowing.
I have no control over the processes that created the directory and placed the log files there, so I cannot check anything on that end.
I carried out an experiment: I created a key/folder on S3 and placed a file in it in two different ways, using the AWS Management Console and using hadoop fs.
I can see a zero-byte file in the folder when I use the AWS Console, and I get a null-pointer exception accessing it with Hive. With hadoop fs I don't have that problem. I assume the zero-byte file was supposed to be deleted but was not in the AWS Console case. I am sure that in my case the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.
Seems like a Hive bug. Hive 0.12.0 does not have that problem.
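If upgrading Hive is not an option, one possible workaround (an assumption on my part: that the zero-byte /logs/data object in the listing above is the placeholder tripping Hive) is to remove that object and then query the original location again:
# delete the zero-byte placeholder object, assuming it is the one shown as /logs/data above
hadoop fs -rm s3n://bucketname/logs/data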

Copy Table from Hive to HDFS

I would like to copy a Hive table from Hive to HDFS. Please suggest the steps. Later I would like to use this HDFS file for Mahout machine learning.
I created a Hive table using data stored in HDFS. Then I transformed a few variables in that data set and created a new table from it.
Now I would like to dump the Hive table from Hive to HDFS so that it can be read by Mahout.
When I type this
hadoop fs -ls -R /user/hive/
I am able to see the list of tables I have created:
drwxr-xr-x - hdfs supergroup 0 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr/000000_0
I tried to copy the file from Hive to HDFS,
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr/* /user/hdfs/tele_copy
Here I was expecting tele_copy to be a CSV file stored in HDFS.
But when I do hadoop fs -tail /user/hdfs/tele_copy I get the below result.
7.980.00.00.0-9.0-30.00.00.670.00.00.00.06.00.06.670.00.670.00.042.02.02.06.04.0198.032.030.00.03.00.01.01.00.00.00.01.00.01.01.00.00.00.01.00.00.00.00.00.00.06.00.040.09.990.01.01
32.64296.7544.990.016.00.0-6.75-27.844.672.3343.334.671.3331.4725.05.3386.6754.07.00.00.044.01.01.02.02.0498.038.00.00.07.01.00.00.00.01.00.00.01.00.00.00.00.00.01.01.01.00.01.00.00.03.00.010.029.991.01.01
30.52140.030.00.250.00.0-42.0-0.520.671.339.00.00.034.6210.677.3340.09.332.00.00.040.02.02.01.01.01214.056.050.01.05.00.00.00.00.00.00.01.00.01.01.00.00.01.01.00.00.01.00.00.00.06.00.001.00.00.01.01
60.68360.2549.990.991.250.038.75-10.692.331.6715.670.00.0134.576.00.0102.6729.674.00.00.3340.02.01.08.03.069.028.046.00.05.00.01.00.00.00.00.00.01.01.01.00.00.00.01.00.00.01.00.00.00.02.00.020.0129.990.01.01
This is not comma-separated.
I also got the same result after running this command:
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom' SELECT * FROM telecom_tr;
When I do a -ls
drwxr-xr-x - hdfs supergroup 0 2014-04-29 17:34 /user/hdfs/data/telecom
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-29 17:34 /user/hdfs/data/telecom/000000_0
When I cat the file, the result is not a CSV.
What you're really asking is to have Hive store the file as a CSV file. Try using ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; see Row Format, Storage Format, and SerDe. A sketch is below.
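For example, a minimal sketch using a comma-delimited copy of the table (telecom_tr_csv is a name I've made up); the files it writes under the warehouse are then plain CSV and can be copied out as before:
-- in hive
CREATE TABLE telecom.telecom_tr_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM telecom.telecom_tr;
# then copy the now comma-delimited file(s) out, as you did before
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr_csv/* /user/hdfs/tele_copy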
