I enabled the permission management in my hadoop cluster, but I'm facing a problem sending jobs with pig. This is the scenario:
1 - I have hadoop/hadoop user
2 - I have myuserapp/myuserapp user that runs PIG script.
3 - We setup the path /myapp to be owned by myuserapp
4 - We set pig.temp.dir to /myapp/pig/tmp
But when Pig tries to run the jobs, we get the following error:
job_201303221059_0009 all_actions,filtered,raw_data DISTINCT Message: Job failed! Error - Job initialization failed: org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=realtime, access=EXECUTE, inode="system":hadoop:supergroup:rwx------
The Hadoop JobTracker requires this permission to start up its server.
My hadoop policy looks like:
<property>
<name>security.client.datanode.protocol.acl</name>
<value>hadoop,myuserapp supergroup,myuserapp</value>
</property>
<property>
<name>security.inter.tracker.protocol.acl</name>
<value>hadoop,myuserapp supergroup,myuserapp</value>
</property>
<property>
<name>security.job.submission.protocol.acl</name>
<value>hadoop,myuserapp supergroup,myuserapp</value>
</property>
My hdfs-site.xml:
<property>
<name>dfs.permissions</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>755</value>
</property>
<property>
<name>dfs.web.ugi</name>
<value>hadoop,supergroup</value>
</property>
My core site:
...
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
...
And finally my mapred-site.xml
...
<property>
<name>mapred.local.dir</name>
<value>/tmp/mapred</value>
</property>
<property>
<name>mapreduce.jobtracker.jobhistory.location</name>
<value>/opt/logs/hadoop/history</value>
</property>
Is there a missing configuration? How can I deal with multiple users running jobs in a restricted HDFS cluster?
Your problem is probably the staging directory. Try adding this property to mapred-site.xml:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
Then make sure that the submitting user (e.g. 'realtime') has a home directory (e.g. '/user/realtime') and that they own it.
The fair scheduler is designed to run MapReduce jobs as the submitting user, and it creates separate pools for users/groups that share the cluster's resources. At first look, this scheduler has some permission-related issues: certain directories do not allow other users to execute or write in places where the job needs to.
So, one solution is to use Capacity scheduler:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
The Capacity Scheduler uses a number of named queues, where each queue has a configurable number of map and reduce slots. One good thing about it is the ability to limit the percentage of running tasks per user, so that users share a cluster with a quota.
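As a rough illustration of the per-user limit, a capacity-scheduler.xml fragment for Hadoop 1.x might look like the following. This is only a sketch: it assumes a single queue named default, and the values 100 and 25 are illustrative, not taken from the setup above.

```xml
<!-- capacity-scheduler.xml (Hadoop 1.x); queue name and values are assumptions -->
<property>
<name>mapred.capacity-scheduler.queue.default.capacity</name>
<!-- percentage of the cluster's slots given to this queue -->
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
<!-- when users compete for the queue, each one is held to a share of it; 25 is illustrative -->
<value>25</value>
</property>
```

If more queues are defined, they also need to be listed in mapred.queue.names in mapred-site.xml, and the queue capacities should add up to 100.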
Related
How can I access yarn job log via web ui?
I can view the job log via the YARN manager web site. But every time YARN restarts, the application list in the YARN manager is empty. The picture shows the state before the restart.
I can access the application log via a CLI command, even after I restart YARN.
$HADOOP_HOME/bin/yarn logs -applicationId application_1499949542308_0020
The jobhistory server web ui is empty all the time
My log settings in yarn-site.xml and mapred-site.xml
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/hadoop/hadoop/nodemanager-logs</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hdp03.hp.sp.prd.bmsre.com:19888/jobhistory/logs</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hdp03.hp.sp.prd.bmsre.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hdp03.hp.sp.prd.bmsre.com:19888</value>
</property>
On Data Node, you can check the folder:
${HADOOP_HOME}/logs/userlogs
You just need to go to the folder that has the same name as the application ID.
Yes, you can access retired YARN jobs from the web UI.
Open http://<jobtracker>:50030 to see the retired jobs.
Regarding your question: you restarted YARN, which means a new log thread wakes up and uploads the logs to the configured location.
Does the /app-logs path from your configuration actually exist in your file system? Please check.
There is also a retention period for how long the logs are kept in that path; it is defined by the yarn.log-aggregation.retain-seconds property.
To my understanding, the JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, and YARN has a JobHistory REST service that exposes details on finished applications.
I'm very confused by the proxyuser setting in HDFS and Hive. I have the doAs option enabled in hive-site.xml:
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
And proxyuser in core-site.xml
<property>
<name>hadoop.proxyuser.hdfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
But this will cause:
2017-03-29 16:24:59,022 INFO org.apache.hadoop.ipc.Server: Connection from 172.16.0.239:60920 for protocol org.apache.hadoop.hdfs.protocol.ClientProtocol is unauthorized for user hive (auth:PROXY) via hive (auth:SIMPLE)
2017-03-29 16:24:59,023 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 9000: readAndProcess from client 172.16.0.239 threw exception [org.apache.hadoop.security.authorize.AuthorizationException: User: hive is not allowed to impersonate hive]
I didn't set the proxyuser to "hive", as most examples do, because core-site.xml is shared by other services and I don't want every service to access HDFS as hive. I still gave it a try, so core-site.xml now looks like this:
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
I launched beeline again. The login was fine this time, but when a command ran, YARN threw an exception:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Permission denied: user=hive, access=WRITE, inode="/user/yarn/hive/.staging":hdfs:supergroup:drwxr-xr-x
The proxyuser "hive" is denied access to the staging folder, which is owned by "hdfs". I don't think giving 777 to the staging folder is a good idea, as it makes no sense to have HDFS protection but open the folder to everyone. So my question is: what is the best way to set up permissions between Hive, HDFS, and YARN?
Hadoop permission is just a nightmare to me, please help.
Adding proxyuser entries in core-site.xml would allow the superuser named hive to connect from any host (as value is *) to impersonate a user belonging to any group (as value is *).
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
This can be made more restrictive by passing actual hostnames and group names (refer to the Superusers documentation). The access privileges the superuser hive has on the file system will apply to the impersonating users.
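For example, a more restrictive variant could look like the sketch below. The hostname hiveserver.example.com and the group names analysts and etl are hypothetical placeholders, not values from the question; both properties accept comma-separated lists.

```xml
<!-- core-site.xml; hostname and group names below are hypothetical -->
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<!-- hive may impersonate only when connecting from this host -->
<value>hiveserver.example.com</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<!-- hive may impersonate only users that belong to one of these groups -->
<value>analysts,etl</value>
</property>
```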
For a multi-user Hadoop environment, the best practice would be to create dedicated directories for every superuser and configure the associated service to store files in it. And create a group supergroup for all these superusers so that group level access privileges can be given for the files, if required.
Add this property in hdfs-site.xml to configure the supergroup
<property>
<name>dfs.permissions.superusergroup</name>
<value>supergroup</value>
<description>The name of the group of super-users.</description>
</property>
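As one possible illustration of the dedicated-directory advice, the MapReduce staging root can be pointed at a location where each submitting user owns a subdirectory. This is a sketch under assumptions: the value /user is illustrative, and it presumes that each submitting user (for example hive) already owns /user/<name> in HDFS.

```xml
<!-- mapred-site.xml on a YARN cluster; the value /user is an assumption -->
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<!-- job files are staged under <this value>/<submitting user>/.staging -->
<value>/user</value>
</property>
```

With this value, a job submitted as hive would stage its files under /user/hive/.staging rather than under a directory owned by another account.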
I'm currently running Apache Ignite Hadoop accelerator for MapReduce. The jobs run, but I am unable to see them in the JobHistoryServer. I wouldn't expect to see the jobs in Yarn's Resource Manager (and don't).
I'm running my MapReduce jobs like
hadoop --config path/to/config/ jar path/to/jar ....
In the mapred-site.xml, I've added
<property>
<name>mapreduce.framework.name</name>
<value>ignite</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>[your_host]:11211</value>
</property>
My mapreduce.jobhistory.* settings have not been changed.
In the core-site.xml I've added
<property>
<name>fs.default.name</name>
<value>igfs://igfs#/</value>
</property>
<property>
<name>fs.igfs.impl</name>
<value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.igfs.impl</name>
<value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
</property>
I've also added ignite-core-1.6.0.jar, ignite-hadoop-1.6.0.jar, and ignite-shmem-1.0.0.jar to the $HADOOP_HOME path. Similarly, I've exported HADOOP_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, and HADOOP_MAPRED_HOME.
Is this functionality not supported by Ignite or am I doing something wrong?
Also, is there a way to track the MapReduce job running on Ignite?
Currently Ignite does not integrate with the Hadoop History Server in any way; the issue https://issues.apache.org/jira/browse/IGNITE-3766 requests that.
I am trying to integrate Elasticsearch 2.2.0 with Hadoop HDFS. In my environment, I have 1 master node and 1 data node. Elasticsearch is installed on the master node.
But while integrating it with HDFS, my ResourceManager application jobs get stuck in the ACCEPTED state.
I found a link suggesting that I change my yarn-site.xml settings:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
I have done this as well, but it does not give the expected result.
Configuration:
My core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
</description>
</property>
my mapred-site.xml,
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>
my hdfs-site.xml,
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
Please help me get my ResourceManager job into the RUNNING state so that I can use my Elasticsearch data on HDFS.
If the screenshot is correct, you have 0 NodeManagers, so the application can’t start running. You need to start at least 1 NodeManager so that the application master, and later the tasks, can be started.
I am using a Hadoop multi-node setup (1 master, 1 slave).
After running start-mapred.sh on the master, I found the error below in the TaskTracker logs on the slave:
org.apache.hadoop.mapred.TaskTracker: Failed to get system directory
Can someone help me understand what can be done to avoid this error?
I am using
Hadoop 1.2.0
jetty-6.1.26
java version "1.6.0_23"
mapred-site.xml file
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/workspace</value>
</property>
</configuration>
It seems that you just added hadoop.tmp.dir and started the job. You need to restart the Hadoop daemons after adding any property to the configuration files. You have specified in your comment that you added this property at a later stage. This means that all the data and metadata, along with other temporary files, are still in the /tmp directory. Copy all of those from there into your /home/hduser/workspace directory, restart Hadoop, and rerun the job.
Do let me know the result. Thank you.
If this is a Windows PC and you are using Cygwin to run Hadoop, then the TaskTracker will not work.