NiFi ListHDFS cannot find directory, FileNotFoundException - apache-nifi

I have a pipeline in NiFi of the form ListHDFS -> MoveHDFS. When attempting to run the pipeline, I see the following error log:
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Returning CLUSTER State: StandardStateMap[version=43, values={emitted.timestamp=1525468790000, listing.timestamp=1525468790000}]
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Found new-style state stored, latesting timestamp emitted = 1525468790000, latest listed = 1525468790000
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Fetching listing for /hdfs/path/to/dir
13:29:21 HST ERROR 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Failed to perform listing of HDFS due to File /hdfs/path/to/dir does not exist: java.io.FileNotFoundException: File /hdfs/path/to/dir does not exist
Changing the ListHDFS path to /tmp seems to run OK, which makes me think the problem is with my permissions on the directory I'm trying to list. However, changing the NiFi user to a user that can access that directory (e.g. hadoop fs -ls /hdfs/path/to/dir works for that user) by setting the bootstrap.conf value run.as=myuser and restarting (see https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#bootstrap_properties) still produces the same problem for the directory. The literal directory string being used that is not working is:
"/etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest"
Does anyone know what is happening here? Thanks.
** Note: The Hadoop cluster I am accessing does not have Kerberos enabled (it is a secured MapR Hadoop cluster).
Update: It appears that the MapR Hadoop implementation is different enough that it requires special steps for NiFi to work properly with it (see https://community.mapr.com/thread/10484 and http://hariology.com/integrating-mapr-fs-and-apache-nifi/). I may not get a chance to work on this problem for some time to confirm whether those steps work (as certain requirements have changed), so I am leaving the links here for others who may run into this problem in the meantime.
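For reference, the run.as change mentioned above amounts to a single property in NiFi's conf/bootstrap.conf; a minimal sketch, where myuser is just the placeholder account name used above:
# conf/bootstrap.conf -- run the NiFi process as a user that can read the HDFS directory
run.as=myuser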

Could you make sure you have entered the correct path and that the directory actually exists in HDFS?
It seems the ListHDFS processor is not able to find the directory you have configured in the Directory property, and the logs are not showing any permission-denied issues.
If the logs showed permission denied, then you could change the user NiFi runs as in bootstrap.conf (NiFi needs to be restarted for the change to apply), or change the permissions on the directory so that NiFi has access.
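As a minimal sketch of that check from a shell on the NiFi host (the path and the myuser account are the ones from the question; using sudo assumes that account exists locally):
# Does the directory actually exist, and what are its permissions?
sudo -u myuser hadoop fs -ls -d /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest
# If it does not exist, create it (then adjust owner/permissions so the NiFi user can read it)
sudo -u myuser hadoop fs -mkdir -p /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest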

Related

Why does h2o require write access on hdfs root directory?

Seeing error message
Job setup failed : org.apache.hadoop.security.AccessControlException: Permission denied: user=airflow, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399) at ...
when trying to connect to start the h2o cluster (h2o-3.28.0.1-hdp3.1). I.e. it appears that it does not like that the root HDFS dir hdfs:/// does not have write permissions for my user (and giving my user write access via Ranger does appear to fix the problem), but this seems wrong.
From past experience, I've seen this in cases where the launching user does not have write permissions to their own hdfs:///user/<username> folder, but it seems odd to me that h2o wants the user to have write access over the entire top-level HDFS dir. Is this normal? Can I change this?
Possibly related: I'm finding that after starting the cluster, I can't kill it manually in the YARN ResourceManager UI or by killing the PID; instead I need to go to the h2o cluster URL and use the Admin tab to shut down the cluster. Any ideas why this happens?
I found the problem. I can't find the docs / other post detailing this right now, but basically, when running the hadoop jar h2odriver.jar ... command, there is an optional param called -output where you would normally put an HDFS location that h2o will write to (from what I can recall, this is some legacy directory that is not super important).
I had forgotten that this is an HDFS location and had put a local temp folder's absolute path there. The error occurred because h2o was trying to create that folder by creating the entire HDFS path leading to it, which requires write access starting at the HDFS root dir. The correct value would be something like /user/<username>.
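For illustration, a corrected launch could look something like the sketch below (the node count and mapper memory are made-up values, not from my original command; the point is only the -output location):
# -output must be an HDFS path the launching user can write to, e.g. under their HDFS home dir
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output /user/<username>/h2o_output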

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount problem.
Things I have done so far:
Set up a Hadoop single-node cluster following the link below:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the WordCount program following the link below:
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is that when I execute the last command to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note, I also tried the following directory structure in the jar command.
No avail! :/
I would really appreciate if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr.
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
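For example, assuming the jar's manifest names the main class (as in your original command) and that /user/<username> is your HDFS home directory, either of these sketches illustrates the idea:
# Option 1: read the input from the local filesystem explicitly (output written to HDFS)
hadoop jar wordcount.jar file:///usr/local/hadoop/input /user/<username>/output
# Option 2: copy the input into HDFS first and use HDFS paths for both
hdfs dfs -mkdir -p /user/<username>/input
hdfs dfs -put /usr/local/hadoop/input/* /user/<username>/input
hadoop jar wordcount.jar /user/<username>/input /user/<username>/output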
Based on the stacktrace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so it's better to get familiar with that rather than mess around with having to install Hadoop yourself from scratch, in my opinion

Streamsets Mapr FS origin/dest. KerberosPrincipal exception (using hadoop impersonation (in mapr 6.0))

I am trying to do a simple data move from a mapr fs origin to a mapr fs destination (this is not my use case, just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:
HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal
Trying different variations of the Hadoop FS URI field (e.g. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see
2018-01-04 10:28:56,686 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource#16978460 DClusterSourceOffsetCommitter *admin preview-pool-1-thread-3
2018-01-04 10:28:56,697 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal ClusterHdfsSource *admin preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Authentication Config: ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 ERROR Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal''] ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,169 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal HdfsTargetConfigBean *admin 0 preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
However, to my knowledge, the system is not running Kerberos, so this error message is a bit confusing to me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file for native MapR authentication did not seem to help the problem (even when reinstalling and uncommenting this line before running the StreamSets MapR setup script).
Does anyone have any idea what is happening and how to fix it? Thanks.
This answer was provided on the MapR community forums and worked for me (using MapR v6.0). Note that the instructions here differ from those currently provided by the StreamSets documentation. Throughout these instructions, I was logged in as the root user.
After installing StreamSets (and the MapR prerequisites) as per the documentation:
Change the owner of the StreamSets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for Hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and /var/log/sdc.
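Spelled out with the default locations mentioned above (a sketch; adjust the paths to your install):
chown -R mapr:mapr /opt/streamsets-datacollector   # $SDC_DIST
chown -R mapr:mapr /etc/sdc                        # $SDC_CONF
chown -R mapr:mapr /var/lib/sdc /var/log/sdc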
In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to the mapr user "mapr" and enable MapR password login. The file should end up looking like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
....
# Indicate that MapR Username/Password security is enabled
export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
Edit the file /usr/lib/systemd/system/sdc.service to look like:
[Service]
User=mapr
Group=mapr
cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a drop-in file (systemd only picks it up if the name ends in .conf) with the following contents (no spaces around the = signs):
[Service]
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
If you are using MapR's SASL ticket auth system (or something similar), generate a ticket for this user on the node that is running StreamSets, in this case with the maprlogin password command.
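A sketch of that step for the mapr user (maprlogin prompts for the password; the maprlogin print check is just my addition for verification):
su - mapr -c 'maprlogin password'
su - mapr -c 'maprlogin print'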
Then finally, restart the sdc service: $systemctl daemon-reload then $systemctl restart sdc.
Run something like $ps -aux | grep sdc | grep maprlogin to check that the sdc process is owned by mapr and that the -Dmaprlogin.password.enabled=true parameter has been successfully set. Once this is done, you should be able to validate/run MapR FS to MapR FS operations in the StreamSets pipeline builder in batch processing mode.
** NOTE: If using the Hadoop Configuration Directory param. instead of the Hadoop FS URI, remember to have the files from your $HADOOP_HOME/conf directory (e.g. hadoop-site.xml, yarn-site.xml, etc.; in the case of MapR, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied into a directory $SDC_DIST/resources/<some hadoop config dir. you may need to create> (I just copy everything in the directory), and add this path to the Hadoop Configuration Directory param. for your MapRFS (or HadoopFS) stage. In the sdc web UI Hadoop Configuration Directory box, it would look like: Hadoop Configuration Directory: <the directory within $SDC_DIST/resources/ that holds the hadoop files>.
** NOTE: If you are still logging errors of the form
2018-01-16 14:26:10,883
ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41
ERROR Error in Slave Runner: ClusterRunner *admin
runner-pool-2-thread-29
com.streamsets.datacollector.runner.PipelineRuntimeException:
CONTAINER_0800 - Pipeline
'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41'
validation error : HADOOPFS_11 - Cannot connect to the filesystem.
Check if the Hadoop FS location: 'maprfs:///' is valid or not:
'java.io.IOException: Provided Subject must contain a
KerberosPrincipal'
you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's /cluster/Worker Java Options tab for the origin and destination hadoop FS stages.
** The video linked in the MapR community post also says to generate a MapR ticket for the sdc user (the default user the sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).

Spark - java IOException :Failed to create local dir in /tmp/blockmgr*

I was trying to run a long-running Spark job. After a few hours of execution, I get the exception below:
Caused by: java.io.IOException: Failed to create local dir in /tmp/blockmgr-bb765fd4-361f-4ee4-a6ef-adc547d8d838/28
I tried to get around it by checking:
Permission issues in the /tmp dir. The Spark server is not running as root, but the /tmp dir should be writable by all users.
The /tmp dir has enough space.
Assuming that you are working with several nodes, you'll need to check every node that participates in the Spark operation (master/driver + slaves/nodes/workers).
Please confirm that each worker/node has enough disk space (especially check the /tmp folder) and the right permissions.
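If /tmp itself turns out to be the problem, one possible workaround (not part of the original answer; the directory and script names are illustrative) is to move Spark's scratch space to a location you control on every node:
# Relocate Spark's block-manager/scratch directory to a path that exists and is writable on every node
# (note: when running on YARN, the NodeManager's local dirs generally take precedence over spark.local.dir)
spark-submit --conf spark.local.dir=/data/spark-tmp my_long_running_job.py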
Edit: The answer below did not eventually solve my case. The reason is that Spark (or one of its dependencies) was able to create some of the subfolders, but not all of them. The frequent need to create such paths would make any project unviable. Therefore I ran Spark (PySpark in my case) as an Administrator, which solved the case. So in the end it is probably a permission issue after all.
Original answer:
I solved the same problem I had on my local Windows machine (not a cluster). Since there was no problem with permissions, I created the dir that Spark was failing to create, i.e. I created the following folder as a local user and did not need to change any permissions on that folder.
C:\Users\<username>\AppData\Local\Temp\blockmgr-97439a5f-45b0-4257-a773-2b7650d17142
After verifying all the permissions and user access, I got the same issue when building components in Talend Studio. It was resolved by providing the correct "/" in the Spark scratch directory (temp directory) in the Spark Configuration tab. This is required when the jar is built on Windows and run on a Linux cluster.

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster set up -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does anyone have any suggestions as to what might be the problem? I know I would normally start by suspecting a CLASSPATH problem, but in this case the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker, so I would imagine any CLASSPATH would be set up by the mesos-slave itself.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In case this is related (some permissions issue maybe): when attempting to browse slave logs via the master UI, I get the error Error browsing path: ....
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its own location by running which $0. However, that will find an existing Hadoop installation if one is present (i.e. /usr/lib/hadoop) and will load the jars under that installation's lib directory instead of the downloaded distribution's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.
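Roughly, the change was of this shape (the exact variable names in bin/hadoop may differ; this is only to illustrate the substitution):
# before (problematic): resolves through the PATH, so an existing install like /usr/lib/hadoop wins
#   this=`which $0`
# after: resolve the script's location relative to itself
this=`dirname "$0"`/`basename "$0"`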
