Sqoop Import - Job stuck while connecting to Resource Manager - hadoop

I am trying to import data from SQL Server into Hive using Sqoop. When I execute the import command, the SQL Server connection succeeds, but the job gets stuck when it tries to connect to the ResourceManager. Here is the log:
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/commons-jexl-2.1.1.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/avro.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/fastutil-6.3.jar
18/07/16 12:35:20 DEBUG mapreduce.JobBase: Adding to job classpath: file:/usr/lib/sqoop/lib/commons-codec-1.4.jar
18/07/16 12:35:21 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Here is the sqoop import command:
sqoop import --connect "jdbc:sqlserver://ip;databaseName=TEST" --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --username user1 --password pass1 --hive-import --create-hive-table --hive-table "customer_data_march" --table "customer_data_march_parsed" --split-by Account_Branch_Converted -m 1 --verbose
When I check the running processes on the ResourceManager host, jps shows the following:
[root@quickstart /]# jps
3527
1609 ThriftServer
850 JobHistoryServer
1135 ResourceManager
2611 HRegionServer
786 Bootstrap
506 NameNode
4225 Jps
3499
950 NodeManager
382 JournalNode
1983 RunJar
3469 Bootstrap
3124 Bootstrap
1745 RunJar
2459 Bootstrap
2493 HistoryServer
248 QuorumPeerMain
1446 HMaster
634 SecondaryNameNode
297 DataNode
Here are the configurations in yarn-site.xml:
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.dispatcher.exit-on-error</name>
<value>true</value>
</property>
<property>
<description>List of directories to store localized files in.</description>
<name>yarn.nodemanager.local-dirs</name>
<value>/var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
</property>
<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/log/hadoop-yarn/containers</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<!--
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>127.0.0.1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>127.0.0.1:8031</value>
</property>
-->
</configuration>
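The first log line shows the client connecting to the ResourceManager at /0.0.0.0:8032, i.e. the default address, which matches the fact that the explicit resourcemanager address properties in this yarn-site.xml are commented out. As an illustration only (not a confirmed fix), re-enabling an explicit ResourceManager address might look like the snippet below; the hostname quickstart.cloudera is an assumption taken from the docker run command further down.
<property>
  <!-- assumed RM host; adjust to wherever the ResourceManager actually runs -->
  <name>yarn.resourcemanager.hostname</name>
  <value>quickstart.cloudera</value>
</property>
<property>
  <!-- client-facing RM address built from the hostname above -->
  <name>yarn.resourcemanager.address</name>
  <value>quickstart.cloudera:8032</value>
</property>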
After leaving the job running for a while, it gives me the following error:
18/07/16 14:02:19 INFO mapreduce.Job: Job job_1531745885301_0001 running in uber mode : false
18/07/16 14:02:20 INFO mapreduce.Job: map 0% reduce 0%
18/07/16 14:02:20 INFO mapreduce.Job: Job job_1531745885301_0001 failed with state FAILED due to: Application application_1531745885301_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1531745885301_0001_000002 timed out. Failing the application.
18/07/16 14:02:20 INFO mapreduce.Job: Counters: 0
18/07/16 14:02:20 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
18/07/16 14:02:20 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 1,241.2774 seconds (0 bytes/sec)
18/07/16 14:02:20 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
18/07/16 14:02:20 INFO mapreduce.ImportJobBase: Retrieved 0 records.
18/07/16 14:02:20 ERROR tool.ImportTool: Error during import: Import job failed!
I am running the Docker container using the following command:
docker run --hostname=quickstart.cloudera --privileged=true -t -p 8889:8888 -i 00a03c98e0d2 /usr/bin/docker-quickstart
Where am I going wrong?
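For the "ApplicationMaster ... timed out. Failing the application." error above, the AM container logs usually say why the ApplicationMaster never registered with the ResourceManager. A minimal sketch of pulling them with the standard YARN CLI, using the application id from the log (log aggregation is already enabled in the yarn-site.xml shown):
yarn logs -applicationId application_1531745885301_0001
yarn application -status application_1531745885301_0001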

Related

Facing issue while running MR Job on Yarn

I am trying to run an MR job on YARN. While running, the job gets stuck while cleaning up the staging dir created by the process. Below is the log snippet:
2022-03-24 00:12:48 IST INFO [defaultEventExecutorGroup-2-1] org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at localhost/127.0.0.1:8032
2022-03-24 00:12:49 IST INFO [defaultEventExecutorGroup-2-1] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 3
2022-03-24 00:12:49 IST INFO [defaultEventExecutorGroup-2-1] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:3
2022-03-24 00:12:49 IST INFO [defaultEventExecutorGroup-2-1] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1648055218115_0023
2022-03-24 00:12:49 IST INFO [defaultEventExecutorGroup-2-1] org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /user/pk/.staging/job_1648055218115_0023
Below is the dir structure on HDFS
drwx------ - pk supergroup 0 2022-03-24 00:12 /user/pk/.staging
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Can someone suggest how we can resolve this issue?
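Two quick, non-authoritative checks with the standard HDFS/YARN CLIs (the path and RM address are taken from the log and configs above): confirm the submitting user can list the staging directory, and confirm at least one NodeManager is registered with the ResourceManager at localhost:8032.
hdfs dfs -ls /user/pk/.staging
yarn node -list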

hadoop cluster is not running mapreduce jobs

I set up a small Hadoop cluster following these instructions, but using Hadoop version 2.7.4. The cluster seems to work OK, but I cannot run MapReduce jobs. In particular, when trying the following:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar randomwriter out
the job prints
17/11/27 16:35:21 INFO client.RMProxy: Connecting to ResourceManager at ec2-yyy.eu-central-1.compute.amazonaws.com/xxx:8032
Running 0 maps.
Job started: Mon Nov 27 16:35:22 UTC 2017
17/11/27 16:35:22 INFO client.RMProxy: Connecting to ResourceManager at ec2-yyy.eu-central-1.compute.amazonaws.com/xxx:8032
17/11/27 16:35:22 INFO mapreduce.JobSubmitter: number of splits:0
17/11/27 16:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511799491035_0006
17/11/27 16:35:22 INFO impl.YarnClientImpl: Submitted application application_1511799491035_0006
17/11/27 16:35:22 INFO mapreduce.Job: The url to track the job: http://ec2-yyy.eu-central-1.compute.amazonaws.com:8088/proxy/application_1511799491035_0006/
17/11/27 16:35:22 INFO mapreduce.Job: Running job: job_1511799491035_0006
and never gets past this state.
My yarn-site.xml looks as follows
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ec2-yyy.eu-central-1.compute.amazonaws.com</value>
</property>
</configuration>
My mapred-site.xml looks as follows
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>ec2-yyy.eu-central-1.compute.amazonaws.com:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Do you have any idea how I could approach this issue?
Thanks
c14
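A job that is submitted but never gets past "Running job" is often still waiting for containers. As a sketch, one can check whether the NodeManagers on the worker nodes registered with the ResourceManager and report usable memory/vcores (standard YARN CLI, no cluster-specific assumptions beyond running it on the master):
yarn node -list -all
yarn application -list -appStates ACCEPTED,RUNNING
The same information is visible in the ResourceManager web UI at http://ec2-yyy.eu-central-1.compute.amazonaws.com:8088/cluster/nodes.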

Hadoop Wordcount example failing due to AM container

I've been trying to run the Hadoop wordcount example for a while now; however, I am facing some issues. I have Hadoop 2.7.1 and am running it on Windows. Below are the error details:
command:
yarn jar C:\hadoop-2.7.1\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.1.jar wordcount input output
Output:
INFO input.FileInputFormat: Total input paths to process : 1
INFO mapreduce.JobSubmitter: number of splits:1
INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1490853163147_0009
INFO impl.YarnClientImpl: Submitted application application_1490853163147_0009
INFO mapreduce.Job: The url to track the job: http://**********/proxy/application_1490853163147_0009/
INFO mapreduce.Job: Running job: job_1490853163147_0009
INFO mapreduce.Job: Job job_1490853163147_0009 running in uber mode : false
INFO mapreduce.Job: map 0% reduce 0%
INFO mapreduce.Job: Job job_1490853163147_0009 failed with state FAILED due to: Application application_1490853163147_0009 failed 2 times due to AM Container for appattempt_1490853163147_0009_000002 exited with exitCode: 1639
For more detailed output, check application tracking page:http://********:****/cluster/app/application_1490853163147_0009 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1490853163147_0009_02_000001
Exit code: 1639
Exception message: Incorrect command line arguments.
Stack trace: ExitCodeException exitCode=1639: Incorrect command line arguments.
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output: Usage: task create [TASKNAME] [COMMAND_LINE] |
task isAlive [TASKNAME] |
task kill [TASKNAME]
task processList [TASKNAME]
Creates a new task jobobject with taskname
Checks if task jobobject is alive
Kills task jobobject
Prints to stdout a list of processes in the task
along with their resource usage. One process per line
and comma separated info per process
ProcessId,VirtualMemoryCommitted(bytes),
WorkingSetSize(bytes),CpuTime(Millisec,Kernel+User)
Container exited with a non-zero exit code 1639
Failing this attempt. Failing the application.
INFO mapreduce.Job: Counters: 0
Yarn-site.xml:
<configuration>
<property>
<name>yarn.application.classpath</name>
<value>
C:\hadoop-2.7.1\etc\hadoop,
C:\hadoop-2.7.1\share\hadoop\common\*,
C:\hadoop-2.7.1\share\hadoop\common\lib\*,
C:\hadoop-2.7.1\share\hadoop\hdfs\*,
C:\hadoop-2.7.1\share\hadoop\hdfs\lib\*,
C:\hadoop-2.7.1\share\hadoop\mapreduce\*,
C:\hadoop-2.7.1\share\hadoop\mapreduce\lib\*,
C:\hadoop-2.7.1\share\hadoop\yarn\*,
C:\hadoop-2.7.1\share\hadoop\yarn\lib\*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>259200</value>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>3600</value>
</property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Any idea on what is going wrong?
exitCode 1639: it looks like you are running Hadoop on Windows.
https://github.com/OctopusDeploy/Issues/issues/1346
I faced exactly the same problem. I was following a guide on how to install Hadoop 2.6.0 (http://www.ics.uci.edu/~shantas/Install_Hadoop-2.6.0_on_Windows10.pdf) while actually installing Hadoop 2.8.0.
As soon as I was done, I ran:
hadoop jar D:\hadoop-2.8.0\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.8.0.jar wordcount /foo/bar/LICENSE.txt /out1
And got (from yarn nodemanager's logs):
17/06/19 13:15:30 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1497902417767_0004_01_000001
17/06/19 13:15:30 INFO nodemanager.DefaultContainerExecutor: launchContainer: [D:\hadoop-2.8.0\bin\winutils.exe, task, create, -m, -1, -c, -1, container_1497902417767_0004_01_000001, cmd /c D:/hadoop/temp/nm-localdir/usercache/******/appcache/application_1497902417767_0004/container_1497902417767_0004_01_000001/default_container_executor.cmd]
17/06/19 13:15:30 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1497902417767_0004_01_000001 is : 1639
17/06/19 13:15:30 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1497902417767_0004_01_000001 and exit code: 1639
ExitCodeException exitCode=1639: Incorrect command line arguments.
TaskExit: error (1639): Invalid command line argument. Consult the Windows Installer SDK for detailed command line help.
Another symptom was (from yarn nodemanager's logs):
17/06/19 13:25:49 WARN util.SysInfoWindows: Expected split length of sysInfo to be 11. Got 7
The solution was to get binaries compatible with Hadoop 2.8.0: https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.0-RC3/bin
Once I got a correct winutils.exe, my problem went away.
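A quick way to check whether the winutils.exe on the PATH matches the installed Hadoop version is to run its systeminfo subcommand directly (the path is the one used in this answer); an incompatible binary tends to print fewer fields than the NodeManager expects, which is what the "Expected split length of sysInfo to be 11. Got 7" warning is complaining about:
D:\hadoop-2.8.0\bin\winutils.exe systeminfo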

Unable to import data using Sqoop: exitCode=255

I am a noob in Hadoop/Spark. I have set up a Hadoop/Spark cluster (1 namenode, 2 datanodes). Now I am trying to import data from a DB (MySQL) into HDFS using Sqoop, but it always fails:
16/07/27 16:50:04 INFO mapreduce.Job: Running job: job_1469629483256_0004
16/07/27 16:50:11 INFO mapreduce.Job: Job job_1469629483256_0004 running in uber mode : false
16/07/27 16:50:11 INFO mapreduce.Job: map 0% reduce 0%
16/07/27 16:50:13 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:14 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:15 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:18 INFO mapreduce.Job: Job job_1469629483256_0004 failed with state FAILED due to: Application application_1469629483256_0004 failed 2 times due to AM Container for appattempt_1469629483256_0004_000002 exited with exitCode: 255
For more detailed output, check application tracking page:http://ip-172-31-55-182.ec2.internal:8088/cluster/app/application_1469629483256_0004Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1469629483256_0004_02_000001
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255
Failing this attempt. Failing the application.
16/07/27 16:50:18 INFO mapreduce.Job: Counters: 0
16/07/27 16:50:18 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
16/07/27 16:50:18 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 16.2369 seconds (0 bytes/sec)
16/07/27 16:50:18 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/07/27 16:50:18 INFO mapreduce.ImportJobBase: Retrieved 0 records.
16/07/27 16:50:18 ERROR tool.ImportTool: Error during import: Import job failed!
I am able to manually write to HDFS:
hdfs dfs -put <local file path> <hdfs path>
But it fails when I run the Sqoop import command:
sqoop import --connect jdbc:mysql://<host>/<db_name> --username <USERNAME> --password <PASSWORD> --table <TABLE_NAME> --enclosed-by '\"' --fields-terminated-by , --escaped-by \\ -m 1 --target-dir <hdfs location>
Can anyone please tell me what I am doing wrong?
Here is the list of things that I have already tried:
Shutting down cluster, formatting HDFS, then restarting cluster (didn't help)
Made sure that HDFS is not in SAFE MODE
All the nodes have this in their /etc/hosts:
127.0.0.1 localhost
172.31.55.182 namenode_hostname
172.31.58.123 datanode1_hostname
172.31.58.122 datanode2_hostname
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Configuration Files:
All Nodes: $HADOOP_CONF_DIR/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ip-172-31-55-182.ec2.internal:9000</value>
</property>
</configuration>
All Nodes: $HADOOP_CONF_DIR/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ip-172-31-55-182.ec2.internal</value>
</property>
</configuration>
All Nodes: $HADOOP_CONF_DIR/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>ip-172-31-55-182.ec2.internal:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
NameNode Specific Configurations
$HADOOP_CONF_DIR/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50475</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
</configuration>
$HADOOP_CONF_DIR/masters:
ip-172-31-55-182.ec2.internal
$HADOOP_CONF_DIR/slaves:
ip-172-31-58-123.ec2.internal
ip-172-31-58-122.ec2.internal
DataNode Specific Configurations
$HADOOP_CONF_DIR/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/hadoop_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50475</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
</configuration>
From where are you trying to import the data? I mean, from which machine are you trying to connect? Check the masters and slaves files on both the namenode and the datanodes.
Try to ping the IP address from a different server and check whether it shows as up.
Make these changes, restart your cluster, and try again.
Edit the parts as mentioned in the comments (#) below, then remove the comments.
/etc/hosts file on client node:
127.0.0.1 localhost yourcomputername #get the computer name with the "hostname -f" command and replace it here
172.31.55.182 namenode_hostname ip-172-31-55-182.ec2.internal
172.31.58.123 datanode1_hostname ip-172-31-58-123.ec2.internal
172.31.58.122 datanode2_hostname ip-172-31-58-122.ec2.internal
/etc/hosts file on cluster nodes:
198.22.23.212 yourcomputername #change to the public IP of the client node, and use the same computer name as on the client node
172.31.55.182 namenode_hostname ip-172-31-55-182.ec2.internal
172.31.58.123 datanode1_hostname ip-172-31-58-123.ec2.internal
172.31.58.122 datanode2_hostname ip-172-31-58-122.ec2.internal
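After editing /etc/hosts as above, a simple way to verify that the names resolve consistently and that the DataNodes and NodeManagers re-register under those names (hostnames are the ones used in this question) is:
ping -c 1 datanode1_hostname
hdfs dfsadmin -report
yarn node -list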
I am terminating this cluster and starting from scratch.

Need help in resolving the following error on Hadoop 2.6

Command executed:
hadoop jar /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -input /user/hduser/samples/x.txt -output /user/hduser/samples/hadoop_output_data1 -mapper mapper2.py -reducer reducer2.py -file mapper2.py -file reducer2.py
15/10/05 17:37:04 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
packageJobJar: [mapper2.py, reducer2.py] [] /tmp/streamjob1958064018029021376.jar tmpDir=null
15/10/05 17:37:04 INFO client.RMProxy: Connecting to ResourceManager at Hadoop1/192.168.10.2:8050
15/10/05 17:37:04 INFO client.RMProxy: Connecting to ResourceManager at Hadoop1/192.168.10.2:8050
15/10/05 17:37:06 INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/05 17:37:07 INFO mapreduce.JobSubmitter: number of splits:2
15/10/05 17:37:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1444080901412_0001
15/10/05 17:37:08 INFO impl.YarnClientImpl: Submitted application application_1444080901412_0001
15/10/05 17:37:08 INFO mapreduce.Job: The url to track the job: http://Hadoop1:8088/proxy/application_1444080901412_0001/
15/10/05 17:37:08 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hduser/mapred/local]
15/10/05 17:37:08 INFO streaming.StreamJob: Running job: job_1444080901412_0001
15/10/05 17:37:08 INFO streaming.StreamJob: Job running in-process (local Hadoop)
15/10/05 17:37:09 INFO streaming.StreamJob: map 0% reduce 0%
15/10/05 17:37:45 INFO streaming.StreamJob: map 100% reduce 100%
15/10/05 17:37:46 INFO streaming.StreamJob: Job running in-process (local Hadoop)
15/10/05 17:37:46 ERROR streaming.StreamJob: Job not successful. Error: Task failed task_1444080901412_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
15/10/05 17:37:46 INFO streaming.StreamJob: killJob...
15/10/05 17:37:46 INFO impl.YarnClientImpl: Killed application application_1444080901412_0001
Streaming Command Failed!
Mapper file:
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split(',')
    if len(words) == 6:
        key = (',').join([words[0], words[1], words[2], words[3], words[4]])
        value = (',').join([words[5], "1"])
        print '%s\t%s' % (key, value)
Sample Input File:
10.10.1.22,10.10.1.13,0,23772,6,9900
10.10.1.12,10.10.1.21,55570,0,6,9900
10.10.1.22,10.10.1.13,0,24028,6,9900
10.10.1.21,10.10.1.12,0,46864,6,9900
10.10.1.12,10.10.1.21,56594,0,6,9900
10.10.1.22,10.10.1.13,0,25308,6,9900
10.10.1.12,10.10.1.21,57618,0,6,9900
10.10.1.21,10.10.1.12,0,48144,6,9900
10.10.1.22,10.10.1.13,0,25564,6,9900
10.10.1.12,10.10.1.21,58642,0,6,9900
10.10.1.22,10.10.1.13,0,26844,6,9900
10.10.1.21,10.10.1.12,0,48400,6,9900
10.10.1.12,10.10.1.21,59410,0,6,9900
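Since the failure is a failed map task rather than a submission problem, one low-tech sanity check is to run the mapper locally against the sample input, completely outside Hadoop (file names are the ones used in the command above):
cat x.txt | python mapper2.py | sort
Also note that the job passes -mapper mapper2.py without an interpreter; for that to work, the script generally needs a shebang line (e.g. #!/usr/bin/env python) so the task can execute it directly. That is only a guess at the cause here, not a confirmed diagnosis.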
mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.tracker</name>
<value>Hadoop1:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
</configuration>
hdfs-site.xml for namenode
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop-2.6.0/hadoop_data/hdfs/namenode</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>Hadoop1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>Hadoop1:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>Hadoop1:8050</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://Hadoop1:9000</value>
</property>
</configuration>
