Can't import/load data to hive, why? - hadoop

I'm trying to import data (a simple file with two columns, int and string); the table looks like this:
hive> describe test;
id int
name string
and when I try to import:
hive> load data inpath '/user/test.txt' overwrite into table test;
Loading data to table default.test
rmr: org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=ALL, inode="/user/hive/warehouse/test":hive:hadoop:drwxrwxr-x
Failed with exception org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=WRITE, inode="/user/hive/warehouse/test":hive:hadoop:drwxrwxr-x
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
It looks like user hadoop has all the permissions, but it still can't load the data, even though I was able to create the table.
What's wrong?

Hive uses the Metastore for its metadata. All table definitions are created there, but the actual data is stored in HDFS. Currently Hive permissions and HDFS permissions are completely different things; they are unrelated. You have several workarounds:
Disable permissions in HDFS entirely
Use Storage Based Authorization (https://cwiki.apache.org/confluence/display/Hive/HCatalog+Authorization); in this case you will not be able to create tables if you don't own the database directory on HDFS
Submit all jobs as the hive user (sudo -u hive hive)
Create database:
create database hadoop;
and create the needed directory in HDFS with the correct permissions:
hdfs dfs -mkdir /user/hive/warehouse/hadoop.db;
hdfs dfs -chown hadoop:hive /user/hive/warehouse/hadoop.db
hdfs dfs -chmod g+w /user/hive/warehouse/hadoop.db
Of course, you should enable hive.metastore.client.setugi=true and hive.metastore.server.setugi=true. These parameters instruct Hive to execute jobs as the current shell user (it looks like these parameters are already enabled, because Hive can't create the directory).
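A quick way to check whether those properties are actually picked up (a minimal sketch; the property names are taken from this answer and may differ between Hive versions) is to print them from the Hive CLI, since set <name>; with no value just displays the current setting:
hive -e "set hive.metastore.client.setugi; set hive.metastore.server.setugi;"
If they come back false, set them to true in hive-site.xml on both the client and the metastore server, then restart the metastore.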

This issue is because of syntax.
The format declared when creating the table should match the format of the input file.
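For example, for the two-column file in the question, a minimal sketch (assuming the fields are tab-separated; use whatever delimiter the file actually contains):
-- declare a row format that matches the input file
create table test (id int, name string)
row format delimited
fields terminated by '\t'
stored as textfile;
load data inpath '/user/test.txt' overwrite into table test;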

Yes, this is a permission error on the destination directory in HDFS. An approach that worked for me:
If you don't know where the destination directory is, identify it with hive> describe extended [problem table name]; and look at the location parameter. Then
change the permissions on that directory:
hadoop fs -chmod [-R] nnn /problem/table/directory
You may have to run this as a superuser depending on your setup. Use the -R option to apply the new permissions to everything within the directory. Set nnn to whatever is appropriate for your system.
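Putting it together for the table from the question (a sketch; the warehouse path comes from the error message above, and mode 775 is just an example, pick what fits your system):
hive -e "describe extended test;"                   # note the location: field in the output
hadoop fs -chmod -R 775 /user/hive/warehouse/test   # then open up that directory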


How to create new user in hadoop

I am new to Hadoop. I have done an Apache Hadoop multi-node installation and the user name is hadoop.
I am using 3 nodes in total: 1 namenode and 2 datanodes.
I have to create a new user for data isolation. I have found a few links on Google, but those are not working and I am unable to access HDFS.
[user1@datanode1 ~]# hdfs dfs -ls -R /
bash: hdfs: command not found...
Can someone help me with the steps to create a new user that can access HDFS for data isolation? And on which node should I create the new user?
Thanks
Hadoop doesn't have users the way Linux does. Users are generally managed by external LDAP/Kerberos systems. By default, no security features are enabled: user names are taken from the HADOOP_USER_NAME environment variable and can be overridden with the export command. Also, by default, the current OS username is used; for example, hdfs dfs -ls run as user1 would actually run hdfs dfs -ls /user/user1, and return an error if that folder doesn't exist yet.
However, your actual error says that your OS PATH variable does not include $HADOOP_HOME/bin (for example). Edit your .bashrc to fix this, as sketched below.
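A minimal sketch of the .bashrc fix, assuming Hadoop is installed under /opt/hadoop (adjust the path to wherever your installation actually lives):
# ~/.bashrc for the new user
export HADOOP_HOME=/opt/hadoop                        # assumed install location
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload it with source ~/.bashrc and the hdfs command should be found.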
You'd create an HDFS home folder for the user "username" with
hdfs dfs -mkdir /user/username
hdfs dfs -chown username /user/username
hdfs dfs -chmod -R 770 /user/username
And you should also run the useradd command on the namenode machine to make sure it knows about a user named "username".
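For example, a sketch tying the points above together (run useradd as root on the namenode; note that without Kerberos any shell user can impersonate another simply by exporting HADOOP_USER_NAME):
# On the namenode, as root
useradd username
# Any shell user can then act as that HDFS user (no real security without Kerberos)
export HADOOP_USER_NAME=username
hdfs dfs -ls /user/username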

How determine Hive database size?

How do I determine a Hive database's size from Bash or from the Hive CLI?
The hdfs and hadoop commands are also available in Bash.
A database in Hive is metadata storage, meaning it holds information about tables and has a default location. Tables in a database can also be stored anywhere in HDFS if a location is specified when the table is created.
You can see all the tables in a database using the show tables command in the Hive CLI.
Then, for each table, you can find its location in HDFS using describe formatted <table name> (again in the Hive CLI).
Last, for each table, you can find its size using hdfs dfs -du -s -h /table/location/
I don't think there's a single command to measure the sum of the sizes of all the tables in a database. However, it should be fairly easy to write a script that automates the above steps, as sketched below. Hive can also be invoked from the Bash CLI using: hive -e '<hive command>'
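A minimal sketch of such a script (it assumes a database named mydb and parses the Location: line from describe formatted, whose layout can vary slightly between Hive versions, so treat it as a starting point):
#!/bin/bash
# Sum the HDFS sizes of all tables in one Hive database (sketch)
DB=mydb
total=0
for t in $(hive -e "use $DB; show tables;" 2>/dev/null); do
  # Extract the HDFS location of the table from 'describe formatted'
  loc=$(hive -e "use $DB; describe formatted $t;" 2>/dev/null | awk -F'\t' '/Location:/ {print $2}' | tr -d ' ')
  # First column of 'hdfs dfs -du -s' is the size in bytes
  size=$(hdfs dfs -du -s "$loc" | awk '{print $1}')
  echo "$t $size"
  total=$((total + size))
done
echo "Total bytes for $DB: $total"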
Show Hive databases on HDFS
sudo hadoop fs -ls /apps/hive/warehouse
Show Hive database size
sudo hadoop fs -du -s -h /apps/hive/warehouse/{db_name}
If you want the size of your complete database, run this on your "warehouse" directory:
hdfs dfs -du -h /apps/hive/warehouse
This gives you the size of each DB in your warehouse.
If you want the sizes of the tables in a specific DB, run:
hdfs dfs -du -h /apps/hive/warehouse/<db_name>
Run a "grep warehouse" on hive-site.xml to find your warehouse path.

Unable to create HDFS admin super user

I am trying to create an HDFS admin superuser. I referred to the post below on creating another superuser.
Creating HDFS Admin user
I followed the exact steps, but after running
hdfs dfsadmin -report
I get:
report: Access denied for user abc. Superuser privilege is required.
Any pointers here? How should I debug this?
Instead, use this command; it works:
sudo -u hdfs hdfs dfsadmin -report
It worked for me.
Create a local user,
and add the user to the hdfs group, or set up privileges for the local user using the Apache Ranger Web UI.
Assuming you aren't using Kerberos, you need to create a local Linux user on each Hadoop node. If you are using Kerberos/AD/LDAP, then create the user there and set up Kerberos, which takes a lot more effort.
Run this on each node as root/sudo to add a user:
useradd abc
passwd abc
usermod -aG hdfs abc
(Instead of hdfs above, it might be superuser on your setup.)
su - hdfs
hadoop fs -mkdir /user/abc
hadoop fs -chown abc:abc /user/abc
exit
su - abc
hadoop fs -ls /
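To see which group actually grants superuser rights on your cluster (a sketch; hdfs getconf reads the live configuration, and supergroup is only the default value):
# The group printed here is the HDFS superuser group (default: supergroup)
hdfs getconf -confKey dfs.permissions.superusergroup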

Hive Permission denied with insert overwrite directory

I got permission denied failure from hdfs while running the command below:
hive -e "insert overwrite directory '/user/hadoop/a/b/c/d/e/f' select * from table_name limit 10;"
The error message is:
Permission denied: user=hadoop, access=WRITE, inode="/user/hadoop/a/b":hdfs:hive:drwxrwxr-x
But when I run : hadoop fs -ls /user/hadoop/a, I get:
drwxrwxrwx - hadoop supergroup 0 2014-04-08 00:56 /user/hadoop/a/b
It seems I have opened up full permissions on the folder b, so why did I still get permission denied?
PS: I have set hive.insert.into.multilevel.dirs=true in the Hive config file.
I had the same problem and solved it simply by using the fully qualified HDFS path, like this:
hive -e "insert overwrite directory 'hdfs://<cluster>/user/hadoop/a/b/c/d/e/f' select * from table_name limit 10;"
See here for a mention of this issue.
I do not know the root cause, but it's not related to permissions.
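If you don't know which cluster prefix to use, you can read it from the live configuration (a sketch; fs.defaultFS is the standard property holding the hdfs://<cluster> prefix):
# Prints something like hdfs://mycluster or hdfs://namenode-host:8020
hdfs getconf -confKey fs.defaultFS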
Open a new terminal then try this:
1.) Change user to root:
su
2.) Change user to hdfs:
su hdfs
3.) Then run this command:
hadoop fs -chown -R hadoop /user/hadoop/a
Now you can try the command you were running.
Hope it helps...!!!
The issue is not actually with the directory permissions. Hive itself needs to be granted access to the path; what I mean is that this is not at the file level.
Below are the steps for granting access to the HDFS path and to the database for a user/group. Comments on each command start with #.
#Login as hive superuser to perform the below steps
create role <role_name_x>;
#For granting to database
grant all on database <database_name> to role <role_name_x>;
#For granting to HDFS path
grant all on URI '/hdfs/path' to role <role_name_x>;
#Granting the role to the user you will use to run the hive job
grant role <role_name_x> to group <your_user_name>;
#After you perform the above steps you can validate with the below commands
#show grant role should display the URI or database access when you run it on the role name as below
show grant role <role_name_x>;
#Now to validate if the user has access to the role
show role grant group <your_user_name>;
Here is one of my answers to a similar question, through Impala. More on Hive permissions.
One other suggestion, based on other answers and comments here: if you want to see the permissions on an HDFS path or file, hdfs dfs -ls is the old-school approach and does not tell you the whole story. Instead, hdfs dfs -getfacl /hdfs/path gives you the complete details; the result looks something like below.
hdfs dfs -getfacl /tmp/
# file: /tmp
# owner: hdfs
# group: supergroup
# flags: --t
user::rwx
group::rwx
other::rwx

Same hadoop setup to different user

I have installed and set up a single-node instance of Hadoop under my username. I want to make the same Hadoop setup available to a different user. How can I do this?
In hadoop we run different tasks and store data in HDFS.
If several users are doing tasks using the same user account, it will be difficult to trace the jobs and track the tasks/defects done by each user.
The other issue is security:
if everyone is given the same user account, all users have the same privileges and can access, modify, execute, and delete everyone's data.
This is a very serious issue.
For this we need to create multiple user accounts.
Benefits of Creating multiple users
1) The directories/files of other users cannot be modified by a user.
2) Other users cannot add new files to a user’s directory.
3) Other users cannot perform any tasks (mapreduce etc) on a user’s files.
In short, data is safe and is accessible only to the assigned user and the superuser.
Steps for setting up multiple user accounts
To add a new user capable of performing Hadoop operations, do the following steps.
Step 1
Creating a New User
For Ubuntu
sudo adduser --ingroup <groupname> <username>
For RedHat variants
useradd -g <groupname> <username>
passwd <username>
Then enter the user details and password.
Step 2
We need to change the permissions of the directory in HDFS where Hadoop stores its temporary data.
Open the core-site.xml file and
find the value of hadoop.tmp.dir.
In my core-site.xml it is /app/hadoop/tmp. In the following steps, I will be using /app/hadoop/tmp as my directory for storing Hadoop data (i.e. the value of hadoop.tmp.dir).
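If you'd rather not open the file by hand, the same value can be read from the live configuration (a sketch; hdfs getconf queries whatever configuration the client currently sees):
# Prints the value of hadoop.tmp.dir, e.g. /app/hadoop/tmp
hdfs getconf -confKey hadoop.tmp.dir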
Then, from the superuser account, do the following step:
hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging
Step 3
The next step is to give write permission to our user group on hadoop.tmp.dir (here /app/hadoop/tmp; open core-site.xml to get the path of hadoop.tmp.dir). This should be done only on the machine (node) where the new user is added.
chmod 777 /app/hadoop/tmp
Step 4
The next step is to create a directory structure in HDFS for the new user.
For that, from the superuser account, create the directory structure.
E.g.: hadoop fs -mkdir /user/username/
Step 5
With this alone we will not be able to run MapReduce programs, because the ownership of the newly created directory structure is with the superuser. So change the ownership of the newly created directory in HDFS to the new user:
hadoop fs -chown -R username:groupname <directory to access in HDFS>
E.g.: hadoop fs -chown -R username:groupname /user/username/
Step 6
Log in as the new user and perform Hadoop jobs:
su - username
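To verify the new account works end to end, you can run one of the bundled example jobs as that user (a sketch; the examples jar path assumes a standard Hadoop 2.x+ layout under $HADOOP_HOME):
# Run a small sample MapReduce job as the new user
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 5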
I had a similar file permission issue and it did not get fixed by executing hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging.
Instead, it got fixed by executing the following Unix command:
$ sudo chmod -R 1777 /app/hadoop/tmp/mapred
