Nutch 1.7 with Hadoop 2.6.0 "Wrong FS" Error - hadoop

We have been trying to use Nutch 1.7 with Hadoop 2.6.0.
After installation, when we try to submit a job to Nutch, we receive the following error:
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://master:9000/user/ubuntu/crawl/crawldb/436075385, expected: file:///
Job is submitted using the following command:
./crawl urls crawl_results 1
We have also checked that the fs.default.name setting in core-site.xml uses the hdfs protocol:
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
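As a side note, fs.default.name is deprecated in Hadoop 2.x in favour of fs.defaultFS; a minimal sketch of the equivalent setting (reusing the master:9000 address from above) would be:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
Either key should resolve to HDFS, so the deprecation by itself does not explain the error; "Wrong FS ... expected: file:///" generally means the configuration actually seen by the submitting process fell back to the local file system default, i.e. this core-site.xml was not on its classpath.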
It happens when the crawl command is sent to Nutch, after it reads the input URLs from the file and attempts to insert the data into the crawl db.
Any insights would be appreciated.
Thanks in advance.

Related

INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id

I am trying to set up Hadoop 2.6.2. Almost everything has been set up.
My Ubuntu version: 15.10
My hadoop path is /usr/local/hadoop/hadoop-2.6.2
Java path is /usr/local/java/jdk1.8.0_65
I have set the Java and Hadoop paths in /etc/profile.
I have edited 4 files inside hadoop-2.6.2/etc/hadoop: core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml
But when I try to execute the following command from the Hadoop site
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar grep input output 'dfs[a-z.]+'
it gives me the following error:
INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/25 07:57:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
java.net.ConnectException: Call From jass-VirtualBox/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
What can be the reason?
I had the same problem, but on Ubuntu 14.04 LTS.
I solved it with the following commands:
sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh
The first command will stop all daemons.
The second will format the file system.
The third will start all daemons again.
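If the error persists after the restart, a quick sanity check (a sketch; the command paths assume you are still in the Hadoop install directory) is to confirm the NameNode is actually up and listening before re-running the example:
jps
bin/hdfs dfsadmin -report
jps should list a NameNode process, and dfsadmin -report should print cluster capacity figures rather than the same connection-refused error.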

Mapreduce job gzip compression failure

I have set up a new cluster (using HDP on Windows) and I am encountering a problem I haven't seen before.
When I run a simple word count job from the hadoop-examples jar, the MapReduce v2 job fails with the error below:
5/05/16 18:58:29 INFO mapreduce.Job: Task Id : attempt_1431802381254_0001_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#15
Now, when I go to the Application Master tracker and dig into the logs, I find that the reducer expects a gzip file but the mapper output wasn't gzipped:
2015-05-16 18:45:20,864 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle output of attempt_1431791182314_0011_m_000000_0 from <url>:13562
java.io.IOException: not a gzip file
When I drill into the map phase log specifically, I see this:
2015-05-16 18:45:09,532 WARN [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Failed to load/initialize native-zlib library
2015-05-16 18:45:09,532 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz]
2015-05-16 18:45:09,532 WARN [main] org.apache.hadoop.mapred.IFile: Could not obtain compressor from CodecPool
I have the following configurations in my core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
<description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
and in mapred-site.xml
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
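(These are the older mapred.* property names, which still work in Hadoop 2 but are deprecated. As an aside, a sketch of the two map-output settings under the newer mapreduce.* keys would be:)
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>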
Now I realise this points to an error loading the native zlib DLL, so I ran the job with the options overridden to run without compression, and it does work (a sketch of such an override is shown below).
I have downloaded zlib.dll from the zlib site and placed it in the Hadoop/bin, C:\system32 and C:\SystemWOW64 folders and restarted the cluster services, but I still get the same error, and I'm not sure why. I would appreciate any ideas on how to debug this further and resolve it.
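For reference, a no-compression run can be forced from the command line along these lines (a sketch only: the jar name, input and output paths are placeholders, and the -D overrides rely on the examples driver using GenericOptionsParser, which the stock word count example does):
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.map.output.compress=false -Dmapreduce.output.fileoutputformat.compress=false /input /output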
Hadoop 2.7.2
I ran into the same issue when I built and ran Hadoop 2.7.2 on Windows 7. To resolve it you need to do the following:
1) On the Build Machine: set ZLIB_HOME to the zlib headers folder zlib_unzip_folder\zlib128-dll\include and build the distribution.
2) On the Run Machine: make zlib1.dll (zlib_unzip_folder\zlib128-dll\zlib1.dll) available on the PATH.
I used zlib 1.2.8 and the download link can be found here: http://zlib.net/zlib128-dll.zip
Hadoop 2.4.1
This issue can also be reproduced on an older version of Hadoop by setting the native library flag to false and forcing map output to be compressed. For more detail, see: https://issues.apache.org/jira/browse/HADOOP-11334
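A related check: recent Hadoop 2.x releases ship a hadoop checknative -a command that prints, for each native library (hadoop, zlib, snappy, lz4, bzip2, ...), whether it could be loaded, so a false next to zlib confirms the DLL/PATH problem described above:
hadoop checknative -a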

Configuring HCatalog, WebHCat with Hive

I'm installing Hadoop and Hive, to be integrated with WebHCat, which will be used to run Hive queries via Hadoop MapReduce jobs.
I installed Hadoop 2.4.1 and Hive 0.13.0 (latest stable versions).
The request I'm sending using the web interface is:
POST: http://localhost:50111/templeton/v1/hive?user.name='hadoop'&statusdir='out'&execute='show tables'
And I got a response like the following:
{
"id": "job_local229830426_0001"
}
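(For reference, the same submission expressed as a curl call against the Templeton hive endpoint, a sketch with the quoting removed from the parameter values, would be:)
curl -s -d user.name=hadoop --data-urlencode execute="show tables;" -d statusdir=out http://localhost:50111/templeton/v1/hive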
But in the log webhcat-console-error.log I find that the exit value of this job is 1, which means an error occurred. Tracking this down, I found: Missing argument for option: hiveconf
This is the webhcat-site.xml, which contains the WebHCat configuration (WebHCat was previously known as Templeton):
<configuration>
<property>
<name>templeton.port</name>
<value>50111</value>
<description>The HTTP port for the main server.</description>
</property>
<property>
<name>templeton.hive.path</name>
<value>/usr/local/hive/bin/hive</value>
<description>The path to the Hive executable.</description>
</property>
<property>
<name>templeton.hive.properties</name>
<value>hive.metastore.local=false,hive.metastore.uris=thrift://localhost:9933,hive.metastore.sasl.enabled=false</value>
<description>Properties to set when running hive.</description>
</property>
</configuration>
But the executed command is odd, as it has some additional --hiveconf parameters with no values:
tool.TrivialExecService: Starting cmd: [/usr/local/hive/bin/hive, --service, cli, --hiveconf, --hiveconf, --hiveconf, hive.metastore.local=false, --hiveconf, hive.metastore.uris=thrift://localhost:9933, --hiveconf, hive.metastore.sasl.enabled=false, -e, show tables]
Any ideas?

hadoop 2.2.0 job -list throws NPE

I compiled Hadoop 2.2.0 for x64 and am running it on a cluster. When I do hadoop job -list or hadoop job -list all, it throws an NPE like this:
14/01/28 17:18:39 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/01/28 17:18:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.mapreduce.tools.CLI.listJobs(CLI.java:504)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:312)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
And in the Hadoop web apps, such as the job history UI (I turned the job history server on), it shows no jobs running and no jobs finished, although I was running jobs.
Please help me solve this problem.
I encountered this when trying to migrate MapReduce over to YARN. It turns out I was missing the directive in mapred-site.xml instructing MapReduce to use YARN:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
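Once that property is in place (and with the job history server running), a quick way to confirm that jobs are visible again is, as a sketch (note that hadoop job is deprecated in 2.x in favour of mapred job):
mapred job -list
yarn application -list
Both should now return a (possibly empty) job list instead of throwing the NullPointerException.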

Getting OOZIE error E0900: Jobtracker [localhost:8021] not allowed, not in Oozie's whitelist

I am trying to run the Oozie examples on the CDH virtual machine. I have Cloudera Manager running and I execute the following command:
oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
When I check the status, I get the HadoopAccessorException.
I checked the oozie log and I see the following stack trace:
2013-07-22 14:25:56,179 WARN org.apache.oozie.command.wf.ActionStartXCommand:
USER[cloudera] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000001-130722142323751-oozie
oozi-W] ACTION[0000001-130722142323751-oozie-oozi-W#mr-node] Error starting action
[mr-node]. ErrorType [ERROR], ErrorCode [HadoopAccessorException], Message
[HadoopAccessorException: E0900: Jobtracker not allowed, not in
Oozies whitelist] org.apache.oozie.action.ActionExecutorException:
HadoopAccessorException: E0900: Jobtracker not allowed, not in Oozies
Whitelist
The oozie-site.xml and the oozie-default.xml have the oozie.service.HadoopAccessorService.jobTracker.whitelist and oozie.service.HadoopAccessorService.nameNode.whitelist set.
Any help would be appreciated.
Thanks.
Dave
I believe Cloudera Manager doesn't read your oozie-site.xml file, but rather maintains its own config somewhere.
In the UI, you should be able to go to Oozie Server Role, Processes, Configuration Files/Environment and click Show; this is where you can define the whitelists for your Oozie server, as opposed to just doing it in the files.
Once this is changed, restart Oozie and you should be able to execute your command.
I know I am very late on this, but someone looking for answers might find it helpful. I got a similar error. I went into the same location in the Cloudera Manager UI: Oozie Server Role, Processes, Configuration Files/Environment.
I clicked on the oozie-site.xml link and looked at the properties below:
<property>
<name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
<value>server1:8020,server2:8020,<name></value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
<value>server1:8032,server2:8032,yarnRM</value>
</property>
I used yarnRM as my value for the jobtracker in the workflow.xml file, and it got past the error when running the workflow.
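In other words, the JobTracker/ResourceManager address used when submitting the workflow has to match an entry in the whitelist exactly (an empty whitelist allows any address). As a sketch, assuming the stock map-reduce example where the value is set in job.properties and referenced as ${jobTracker} in workflow.xml, and with host names that are assumptions:
nameNode=hdfs://server1:8020
jobTracker=yarnRM
With the whitelist above, a value such as localhost:8021 that is not listed would keep triggering E0900.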
