Getting error while running Sqoop in EC2 instance - hadoop

I installed Sqoop on my EC2 instance following http://kontext.tech/docs/DataAndBusinessIntelligence/p/configure-sqoop-in-a-edge-node-of-hadoop-cluster, and my Hadoop cluster itself is working fine.
I first got the error Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster and solved it with the fix described in that link. Unfortunately, I now get another error while running a Sqoop import:
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild.
Please suggest how I can overcome this error.
This is how my sqoop-env.template.sh looks:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# included in all the hadoop scripts with source command
# should not be executable directly
# also should not be passed any arguments, since we need original $*
# Set Hadoop-specific environment variables here.
#Set path to where bin/hadoop is available
#export HADOOP_COMMON_HOME=$HOME/hadoop-3.1.0
#Set path to where hadoop-*-core.jar is available
#export HADOOP_MAPRED_HOME=$HOME/hadoop-3.1.0
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
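The YarnChild class ships with the MapReduce client jars, so this kind of error usually means the MapReduce installation is not visible where the job runs. A minimal sketch of the settings typically involved, assuming a Hadoop install under $HOME/hadoop-3.1.0 as in the commented-out template above (the paths are placeholders, not a confirmed fix):
# in conf/sqoop-env.sh (edit a copy of the template, not the .template file itself)
export HADOOP_COMMON_HOME=$HOME/hadoop-3.1.0
export HADOOP_MAPRED_HOME=$HOME/hadoop-3.1.0
# quick sanity check that the MapReduce jars are actually on the classpath
hadoop classpath | tr ':' '\n' | grep mapreduce
If the error comes from the YARN containers themselves, the cluster-side mapreduce.application.classpath setting (the same family of fix as for the MRAppMaster error) is the other place to look.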

Related

GCP dataproc cluster hadoop job to move data from gs bucket to s3 amazon bucket fails [CONSOLE]

First question on stack overflow, so please forgive me for any rookie mistakes.
I am currently working on moving a large amount of data (700+ GiB), consisting of many small files of about 1-10 MB each, from a folder in a GCS bucket to a folder in an S3 bucket.
Several attempts I made:
gsutil -m rsync -r gs://<path> s3://<path>
Results in a timeout due to the large amount of data
gsutil -m cp -r gs://<path> s3://<path>
Takes way too long. Even with many parallel processes and/or threads it still transfers at about 3.4MiB/s on average. I have made sure to upgrade the VM instance in this attempt.
using rclone
Same performance issue as cp
Recently I found another possible method of doing this. However, I am not familiar with GCP, so please bear with me.
This is the reference I found https://medium.com/swlh/transfer-data-from-gcs-to-s3-using-google-dataproc-with-airflow-aa49dc896dad
The method involves making a dataproc cluster through GCP console with the following configuration:
Name:
<dataproc-cluster-name>
Region:
asia-southeast1
Nodes configuration:
1 main node and 2 worker nodes (2 vCPU, 3.75 GB memory, 30 GB persistent disk each)
properties:
core fs.s3.awsAccessKeyId <key>
core fs.s3.awsSecretAccessKey <secret>
core fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
Then I submit the job through the console menu on the GCP website:
At this point I start noticing issues: I cannot find hadoop-mapreduce/hadoop-distcp.jar anywhere. I can only find /usr/lib/hadoop/hadoop-distcp.jar by browsing the filesystem on my main Dataproc cluster VM instance.
The job I submit:
Start time:
31 Mar 2021, 16:00:25
Elapsed time:
3 sec
Status:
Failed
Region
asia-southeast1
Cluster
<cluster-name>
Job type
Hadoop
Main class or JAR
file://usr/lib/hadoop/hadoop-distcp.jar
Arguments
-update
gs://*
s3://*
Returns an error
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_OPTS: invalid variable name
2021-03-31 09:00:28,549 ERROR tools.DistCp: Invalid arguments:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2638)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3342)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3374)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:126)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3425)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3393)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2542)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
... 18 more
Invalid arguments: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB, accepts
bandwidth as a fraction.
-blocksperchunk <arg> If set to a positive value, fileswith more
blocks than this value will be split into
chunks of <blocksperchunk> blocks to be
transferred in parallel, and reassembled on
the destination. By default,
<blocksperchunk> is 0 and the files will be
transmitted in their entirety without
splitting. This switch is only applicable
when the source file system implements
getBlockLocations method and the target
file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default
<copybuffersize> is 8192B.
-delete Delete from target, files missing in
source. Delete is applicable only with
update or overwrite options
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-direct Write files directly to the target
location, avoiding temporary file rename.
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify
changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missing files
or directories
-v Log additional info (path, size) in the
SKIP/COPY log
-xtrack <arg> Save information about missing source files
to the specified directory
How can I fix this problem? The fixes I found online aren't very helpful: they either use the Hadoop CLI or involve different jar files than mine. For example this one right here: Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow and https://github.com/CoorpAcademy/docker-pyspark/issues/13
Disclaimers: I do not use the Hadoop CLI or Airflow; I use the console to do this. Submitting the job through the Dataproc cluster's main VM instance shell also returns the same error. If the CLI is required, any detailed reference would be appreciated, thank you very much!
Update:
Fixed the misleading path replacement in the gsutil part.
The problem was that S3FileSystem is no longer supported by Hadoop, so I had to downgrade to an image with Hadoop 2.10 [FIXED]. The speed, however, was still not satisfactory.
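For context, Hadoop 3 removed the old block-based org.apache.hadoop.fs.s3.S3FileSystem; the maintained S3 connector is s3a (org.apache.hadoop.fs.s3a.S3AFileSystem, from the hadoop-aws jar). A hedged sketch of what a DistCp run through s3a might look like on an image that ships hadoop-aws (bucket names and keys are placeholders, not a tested command):
hadoop distcp \
  -D fs.s3a.access.key=<key> \
  -D fs.s3a.secret.key=<secret> \
  -update \
  gs://<source-bucket>/<folder> \
  s3a://<target-bucket>/<folder>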
I think the Dataproc solution is overkill in your case. Dataproc would make sense if you needed to daily or hourly copy something like a TB of data from GCS to S3. But it sounds like yours will just be a one-time copy that you can let run for hours or days. I'd suggest running gsutil on a Google Cloud (GCP) instance. I've tried an AWS EC2 instance for this and it is always markedly slower for this particular operation.
Create your source and destination buckets in the same region. For example, us-east4 (N. Virginia) for GCS and us-east-1 (N. Virginia) for S3. Then deploy your instance in the same GCP region.
gsutil -m cp -r gs://* s3://*
. . . is probably not going to work. It definitely does not work in Dataproc, which always errors if I don't have either an explicit file location or a bucket/folder that ends with /
Instead, first try to explicitly copy one file successfully. Then try a whole folder or bucket.
How many files are you trying to copy?
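To make that concrete, a minimal sketch of the "one file first, then a folder" approach suggested above (bucket and object names are placeholders):
# copy a single object first to confirm credentials and connectivity
gsutil cp gs://<source-bucket>/<folder>/<one-file> s3://<target-bucket>/<folder>/
# then copy a whole folder in parallel
gsutil -m cp -r gs://<source-bucket>/<folder> s3://<target-bucket>/<folder>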

Spark History server setup

I am trying to set up the Spark History Server locally. I am using Windows and PyCharm for PySpark programming. I am able to view the Spark Web UI at localhost:4040.
The things I have done are:
spark-defaults.conf (where I have added the last three lines):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///D:///tmp///spark-events
Run the history server
C:\Users\hp\spark>bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/08/09 08:58:04 INFO HistoryServer: Started daemon with process name: 13476#DESKTOP-B9KRC6O
20/08/09 08:58:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/09 08:58:23 INFO SecurityManager: Changing view acls to: hp
20/08/09 08:58:23 INFO SecurityManager: Changing modify acls to: hp
20/08/09 08:58:23 INFO SecurityManager: Changing view acls groups to:
20/08/09 08:58:23 INFO SecurityManager: Changing modify acls groups to:
20/08/09 08:58:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hp); groups with view permissions: Set(); users with modify permissions: Set(hp); groups with modify permissions: Set()
20/08/09 08:58:24 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/08/09 08:58:26 INFO Utils: Successfully started service on port 18080.
20/08/09 08:58:26 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://DESKTOP-B9KRC6O:18080
After I run my PySpark program successfully, I am unable to see job details in the Spark History Server web UI, even though the server has started. It looks like below:
The references I have already used:
Windows: Apache Spark History Server Config
How to run Spark History Server on Windows
The code I use is as follows:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("madhu").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc).builder.getOrCreate()
def readtable(dbname, table):
    dbname = dbname
    table = table
    hostname = "localhost"
    jdbcPort = 3306
    username = "root"
    password = "madhu"
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(hostname, jdbcPort, dbname, username, password)
    dataframe = spark.read.format('jdbc').options(driver='com.mysql.jdbc.Driver', url=jdbc_url, dbtable=table).load()
    return dataframe
t1 = readtable("db", "table1")
t2 = readtable("db2", "table2")
print(t2.show())
spark.stop()
Please help me with how I can achieve this. I will provide any data that is required.
I have also tried these directory path settings:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///D:/tmp/spark-events
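One thing to note about these settings: spark.history.fs.logDirectory only tells the History Server where to read; applications write their event logs to spark.eventLog.dir, which defaults to /tmp/spark-events. A minimal sketch of the three lines with both directories pointing at the same Windows folder (the path is an assumption, not from the original post):
spark.eventLog.enabled true
spark.eventLog.dir file:///D:/tmp/spark-events
spark.history.fs.logDirectory file:///D:/tmp/spark-events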
You must provide the correct master URL in the application and run the application with spark-submit.
You can find it in the Spark UI at localhost:4040
In the following example, the master URL is spark://XXXX:7077.
Your application should be:
conf = SparkConf().setAppName("madhu").setMaster("spark://XXXX:7077")
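For example, a hedged sketch of launching the script with spark-submit against that master (the script file name here is a placeholder):
spark-submit --master spark://XXXX:7077 my_pyspark_job.py
With spark.eventLog.enabled set, the completed application should then appear in the History Server UI at localhost:18080.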

Error: /usr/local/Cellar/sqoop/1.4.6/../hadoop does not exist! Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation

I installed Hadoop on Mac using brew and then configured it. Then I installed Sqoop and when I try to run Sqoop I get the following error:
Error: /usr/local/Cellar/sqoop/1.4.6/../hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.
My Hadoop is running fine, and I have even set HADOOP_COMMON_HOME in both ~/.bash_profile and sqoop-env.sh.
Here is my sqoop env file:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# included in all the hadoop scripts with source command
# should not be executable directly
# also should not be passed any arguments, since we need original $*
# Set Hadoop-specific environment variables here.
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME= /usr/local/Cellar/hadoop/3.0.0
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
#Set path to where hadoop-*-core.jar is available
#export HADOOP_MAPRED_HOME=
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
I have done brew install hadoop previously to do some Spark things. I don't think I had to edit any configurations in Hadoop itself,
but brew install sqoop seemed to work out of the box for finding the Hadoop configuration. Not sure where you got your sqoop-env.sh from, but having just installed it, here is mine.
[jomoore#libexec] $ cat /usr/local/Cellar/sqoop/1.4.6/libexec/conf/sqoop-env.sh
export HADOOP_HOME="/usr/local"
export HBASE_HOME="/usr/local"
export HIVE_HOME="/usr/local"
export ZOOCFGDIR="/usr/local/etc/zookeeper"
And if we grep the location that it wants to find hadoop in... $HADOOP_HOME/bin
[jomoore#libexec] $ ls -l /usr/local/bin/ | grep hadoop
lrwxr-xr-x 1 jomoore admin 45 Mar 13 10:24 container-executor -> ../Cellar/hadoop/3.0.0/bin/container-executor
lrwxr-xr-x 1 jomoore admin 33 Mar 13 10:24 hadoop -> ../Cellar/hadoop/3.0.0/bin/hadoop
lrwxr-xr-x 1 jomoore admin 31 Mar 13 10:24 hdfs -> ../Cellar/hadoop/3.0.0/bin/hdfs
lrwxr-xr-x 1 jomoore admin 33 Mar 13 10:24 mapred -> ../Cellar/hadoop/3.0.0/bin/mapred
lrwxr-xr-x 1 jomoore admin 50 Mar 13 10:24 test-container-executor -> ../Cellar/hadoop/3.0.0/bin/test-container-executor
lrwxr-xr-x 1 jomoore admin 31 Mar 13 10:24 yarn -> ../Cellar/hadoop/3.0.0/bin/yarn
And since I don't have Accumulo, Hive, or Zookeeper installed, these other warnings are just noise.
[jomoore#libexec] $ sqoop
Warning: /usr/local/Cellar/sqoop/1.4.6/libexec/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/Cellar/sqoop/1.4.6/libexec/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/local/Cellar/sqoop/1.4.6/libexec/bin/../../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
ERROR: Cannot execute /usr/local/libexec/hadoop-config.sh.
That error at the end is actually looking for the file shown below, which probably shouldn't be nested at libexec/libexec within the Brew package anyway:
$ find /usr/local/Cellar -name hadoop-config.sh
/usr/local/Cellar/hadoop/3.0.0/libexec/libexec/hadoop-config.sh
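One detail worth checking in the sqoop-env.sh posted in the question: the space after the = makes the shell set HADOOP_COMMON_HOME to an empty string (and complain that the path is not a valid identifier), so the export never takes effect. A corrected sketch, assuming the Homebrew hadoop 3.0.0 install from the question (pointing at libexec is an assumption based on where Homebrew keeps hadoop-config.sh):
# no space after '='
export HADOOP_COMMON_HOME=/usr/local/Cellar/hadoop/3.0.0/libexec
export PATH=$PATH:$HADOOP_COMMON_HOME/bin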

Streamsets Mapr FS origin/dest. KerberosPrincipal exception (using hadoop impersonation (in mapr 6.0))

I am trying to do a simple data move from a mapr fs origin to a mapr fs destination (this is not my use case, just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:
HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal
Trying different variations of the Hadoop FS URI field (e.g. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see
2018-01-04 10:28:56,686 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource#16978460 DClusterSourceOffsetCommitter *admin preview-pool-1-thread-3
2018-01-04 10:28:56,697 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal ClusterHdfsSource *admin preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Authentication Config: ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 ERROR Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal''] ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,169 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal HdfsTargetConfigBean *admin 0 preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
However, to my knowledge, the system is not running Kerberos, so this error message is a bit confusing for me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file for native MapR authentication did not seem to help the problem (even when reinstalling and uncommenting this line before running the StreamSets MapR setup script).
Does anyone have any idea what is happening and how to fix it? Thanks.
This answer was provided on the MapR community forums and worked for me (using MapR v6.0). Note that the instructions here differ from those currently provided by the StreamSets documentation. Throughout these instructions, I was logged in as user root.
After installing streamsets (and the mapr prerequisites) as per the documentation...
Change the owner of the StreamSets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for the Hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and /var/log/sdc.
In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to mapr user "mapr" and enable mapr password login. The file should end up looking like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
....
# Indicate that MapR Username/Password security is enabled
export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
Edit the file /usr/lib/systemd/system/sdc.service to look like:
[Service]
User=mapr
Group=mapr
$cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a file (with any name) and add the contents (without spaces):
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
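A minimal sketch of that drop-in step as shell commands (the file name override.conf is arbitrary; note that a systemd drop-in normally needs a [Service] section header for Environment= to take effect):
mkdir -p /etc/systemd/system/sdc.service.d
cat > /etc/systemd/system/sdc.service.d/override.conf <<'EOF'
[Service]
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
EOF
systemctl daemon-reload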
If you are using MapR's SASL ticket auth system (or something similar), generate a ticket for this user on the node that is running StreamSets. In this case, with the $maprlogin password command.
Then finally, restart the sdc service: $systemctl daemon-reload then $systemctl restart sdc.
Run something like $ps -aux | grep sdc | grep maprlogin to check that the sdc process is owned by mapr and that the -Dmaprlogin.password.enabled=true parameter has been successfully set. Once this is done, you should be able to validate/run maprFS to maprFS operations in the StreamSets pipeline builder in batch processing mode.
** NOTE: If using the Hadoop Configuration Directory param. instead of Hadoop FS URI, remember to have the files from your $HADOOP_HOME/conf directory (e.g. hadoop-site.xml, yarn-site.xml, etc.; in the case of MapR, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied to a directory $SDC_DIST/resources/<some hadoop config dir. you may need to create> (I just copy everything in the directory), and add this path to the Hadoop Configuration Directory param. for your MaprFS (or HadoopFS) stage. In the sdc web UI Hadoop Configuration Directory box, it would look like Hadoop Configuration Directory: <the directory within $SDC_DIST/resources/ that holds the hadoop files>.
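A hedged sketch of that copy step, assuming the MapR Hadoop conf path from the note above and an arbitrary directory name (hadoop-conf) under $SDC_DIST/resources/:
# the target directory name is hypothetical; adjust the hadoop-* glob to your MapR version
mkdir -p $SDC_DIST/resources/hadoop-conf
cp /opt/mapr/hadoop/hadoop-*/etc/hadoop/*.xml $SDC_DIST/resources/hadoop-conf/
chown -R mapr:mapr $SDC_DIST/resources/hadoop-conf
That relative directory name (hadoop-conf) is then what would go into the Hadoop Configuration Directory box.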
** NOTE: If you are still logging errors of the form
2018-01-16 14:26:10,883
ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41
ERROR Error in Slave Runner: ClusterRunner *admin
runner-pool-2-thread-29
com.streamsets.datacollector.runner.PipelineRuntimeException:
CONTAINER_0800 - Pipeline
'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41'
validation error : HADOOPFS_11 - Cannot connect to the filesystem.
Check if the Hadoop FS location: 'maprfs:///' is valid or not:
'java.io.IOException: Provided Subject must contain a
KerberosPrincipal'
you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's /cluster/Worker Java Options tab for the origin and destination hadoop FS stages.
** The video linked to in the mapr community link also says to generate a mapr ticket for the sdc user (the default user that sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).

Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode Tried all solution still error persists

I am following a tutorial to install Hadoop. It is explained with Hadoop 1.x, but I am using hadoop-2.6.0.
I have successfully completed all the steps just before executing the following command.
bin/hadoop namenode -format
I am getting the following error when I execute the above command.
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
My hadoop-env.sh file:
# The java implementation to use.
export JAVA_HOME="C:/Program Files/Java/jdk1.8.0_74"
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_PREFIX="/home/582092/hadoop-2.6.0"
export HADOOP_HOME="/home/582092/hadoop-2.6.0"
export HADOOP_COMMON_HOME=$HADOOP_HOME
#export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_PREFIX/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-2.6.0.jar
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
core-site.xml
(core-site.xml was attached as an image in the original post)
hdfs-site.xml
dfs.data.dir = /home/582092/hadoop-dir/datadir
dfs.name.dir = /home/582092/hadoop-dir/namedir
Kindly help me in fixing this issue.
One cause behind this problem might be a user-defined HDFS_DIR environment variable. This is picked up by scripts such as the following lines in libexec/hadoop-functions.sh:
HDFS_DIR=${HDFS_DIR:-"share/hadoop/hdfs"}
...
if [[ -z "${HADOOP_HDFS_HOME}" ]] &&
   [[ -d "${HADOOP_HOME}/${HDFS_DIR}" ]]; then
  export HADOOP_HDFS_HOME="${HADOOP_HOME}"
fi
The solution is to avoid defining an environment variable HDFS_DIR.
The recommendations in the comments of the question are correct: use the hadoop classpath command to identify whether the hadoop-hdfs-*.jar files are present in the classpath or not. They were missing in my case.
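For example, a quick check along those lines might look like this (a sketch, not from the original answer):
# make sure the variable is not defined in the current shell or profile scripts
unset HDFS_DIR
# list the classpath entries and look for the HDFS jars
hadoop classpath | tr ':' '\n' | grep hdfs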
