org.apache.nutch.crawl.Crawl missing in nutch 1.9 on hadoop 1.2.1

I have installed fully distributed Hadoop 1.2.1 and was trying to integrate Nutch with the steps below:
Download apache-nutch-1.9-src.zip
Add a value for http.agent.name in nutch-site.xml
Copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml,
masters, slaves into $NUTCH_HOME/conf
Compile using ant runtime
Create urls/seed.txt and put it on the Hadoop DFS
Edit $NUTCH_HOME/conf/regex-urlfilter.txt
Test the crawl using this command:
bin/hadoop jar nutch-1.9.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5
and get this error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
I tried extracting nutch-1.9.job and didn't find the class Crawl in org/apache/nutch/crawl.
Do I need to configure something?

Crawl.java was removed in the 1.8 release; use the crawl shell script (bin/crawl) for all crawling instead.
The deprecated class o.a.n.crawl.Crawler is still in the code base: https://issues.apache.org/jira/browse/NUTCH-1621
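For example, a minimal invocation sketch (the argument order follows the 1.9 bin/crawl usage of seed dir, crawl dir, Solr URL and number of rounds; the Solr URL here is a placeholder, so check the script's usage message for your exact version):
bin/crawl urls crawl http://localhost:8983/solr/ 1
When run from the deploy runtime, bin/crawl drives bin/nutch, which submits the same nutch-1.9.job to Hadoop for each step, so it fits your distributed setup.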


How can Apache Spark history-server refer to Amazon S3?

[version]
Apache Spark 2.2.0
Hadoop 2.7
I want to set up the Apache Spark history server.
The Spark event logs are located in Amazon S3.
I can save the log files to S3, but the history server cannot read them.
Apache Spark is installed at /usr/local/spark, so $SPARK_HOME is /usr/local/spark.
$ cd /usr/local/spark/sbin
$ sh start-history-server.sh
I got the following error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
....
My spark-defaults.conf is below:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.history.provider org.apache.hadoop.fs.s3a.S3AFileSystem
spark.history.fs.logDirectory s3a://xxxxxxxxxxxxx
spark.eventLog.enabled true
spark.eventLog.dir s3a://xxxxxxxxxxxxxxx
I installed these 2 jar files in /usr/local/spark/jars/:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.3.jar
but the error is the same.
What is wrong?
Add the following to the spark-defaults.conf file and retry (these paths assume the jars live under the standard Hadoop install directories; point them at wherever your jars actually are):
spark.driver.extraClassPath /usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*
spark.executor.extraClassPath /usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*
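If the error persists, a quick diagnostic sketch (assuming the jar locations from the question) is to confirm the missing class really is inside the hadoop-aws jar you installed:
jar tf /usr/local/spark/jars/hadoop-aws-2.7.3.jar | grep S3AFileSystem
If grep prints nothing, that jar does not contain org/apache/hadoop/fs/s3a/S3AFileSystem and you need a hadoop-aws build that matches your Hadoop 2.7 installation.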

Oozie Fails - Running MRv1 on CDH 5.4.2

I'm trying to run an Oozie workflow that should execute an MRv1 Hadoop job.
I started with the Cloudera QuickStart VM 5.4.2-0 and configured it to use MRv1 (configuration appended at the bottom).
But the workflow fails:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.MapReduceMain], main() threw exception, org/apache/hadoop/yarn/exceptions/YarnException
java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/exceptions/YarnException
at org.apache.oozie.action.hadoop.MapReduceMain.run(MapReduceMain.java:58)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:46)
at org.apache.oozie.action.hadoop.MapReduceMain.main(MapReduceMain.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:228)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
The strange thing is that when I look at the classpath, it is loading the jar files from the right location, namely /usr/lib/hadoop-0.20-mapreduce.
----- configurations -----
Configured Oozie to use MapReduce instead of YARN:
Changed the Oozie config in CDH by switching the MapReduce Service from YARN to MapReduce.
Changed the Oozie config by setting this oozie-site.xml safety-valve value:
<property>
<name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
<value>quickstart.cloudera:8021</value>
</property>
Changed the Tomcat config using alternatives:
sudo alternatives --config oozie-tomcat-deployment (and chose tomcat-conf.http.mr1)
Configured Hadoop to use MapReduce:
sudo alternatives --config hadoop-conf (and chose conf.cloudera.mapreduce)
I had forgotten to update the actions to use MR1:
$ sudo -u oozie hadoop fs -rmr /user/oozie/share
$ sudo oozie-setup sharelib create -fs <FS_URI> -locallib /usr/lib/oozie/oozie-sharelib-mr1
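As a sanity check afterwards (a sketch, assuming the default sharelib location), list the recreated sharelib and confirm the MR1 jars are in place:
$ sudo -u oozie hadoop fs -ls /user/oozie/share/lib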

Error in launching Spark REPL

I got pre-built Spark 1.4.1 and I'm running HDP 2.6. When I try to run spark-shell, it gives me the following error message.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:111)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:111)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:107)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
What is the issue?
A ClassNotFoundException occurs when the class loader cannot find the required class on the classpath. So, basically, you should check your classpath and add the missing class to it.
Check whether hadoop-common-0.21.0.jar is added to your classpath.
Is it possible that your Hadoop home is not set, as in here?
Cannot find hadoop installation: $HADOOP_HOME must be set or hadoop must be in the path
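If this is a "Hadoop free" Spark build, a common fix sketch is to hand Spark the cluster's Hadoop jars before launching the shell (this assumes the hadoop command is on your PATH):
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
$SPARK_HOME/bin/spark-shell
org.apache.hadoop.fs.FSDataInputStream lives in hadoop-common, so once those jars are on the classpath the NoClassDefFoundError should disappear.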

HiveServer Class Not Found Exception

I'm trying to run Hive from the command prompt and it works absolutely fine. But when I try to run HiveServer using the "hive --service hiveserver" command, I get the following exception.
Starting Hive Thrift Server
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.service.HiveServer
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
So I then tried the command "hive --service hiveserver2", but I still haven't found a solution.
Can anybody please suggest a solution for this problem?
Maybe another process (another HiveServer) is already listening on port 10000.
You can check with:
netstat -ntulp | grep ':10000'
If a process shows up, kill it; otherwise start the server on another port.
By the way, which version are you using?
This error occurred for me when Hadoop couldn't find hive-service-*.jar on its classpath. Just copy hive-service-*.jar to your Hadoop lib folder, or export the classpath in hadoop-env.sh. I have described how to add the classpath below.
Add this line in hadoop-env.sh (note that the JVM expands only a bare dir/* wildcard on the classpath, not patterns like hive-*.jar, so the whole lib directory is added):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/hive/lib/*
I have given the Hive path as /usr/local/hive since I have Hive installed at that location. Change it to point to your Hive installation.
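To verify before restarting (a sketch, assuming the /usr/local/hive layout above), check that the service class is actually present in your Hive lib directory:
ls /usr/local/hive/lib/hive-service-*.jar
jar tf /usr/local/hive/lib/hive-service-*.jar | grep HiveServer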

hcatalog with mapreduce

I get the following error while executing a MapReduce program.
I have placed all the jars in the hadoop/lib directory and have also mentioned the jars in -libjars.
This is the command I am executing:
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/conf jar /home/shash/distinct.jar HwordCount -libjars $LIB_JARS WordCount HWordCount2
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hcatalog.mapreduce.HCatOutputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:248)
at org.apache.hadoop.mapred.Task.initialize(Task.java:501)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:306)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.ClassNotFoundException: org.apache.hcatalog.mapreduce.HCatOutputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:943)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:994)
... 8 more
Make sure LIB_JARS is a comma-separated list (as opposed to colon-separated like CLASSPATH).
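A sketch of the difference (the jar paths are illustrative placeholders, not taken from the question):
# comma-separated for -libjars
LIB_JARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-metastore.jar
# colon-separated for the classpath
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-metastore.jar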
Applies to: CDH 5.0.x, CDH 5.1.x, CDH 5.2.x, CDH 5.3.x (Sqoop)
Cause: Sqoop cannot pick up the HCatalog libraries because Cloudera Manager does not set the HIVE_HOME environment variable. It needs to be set manually.
This problem is tracked in the JIRA below:
https://issues.apache.org/jira/browse/SQOOP-2145
The fix for this JIRA has been included in CDH since version 5.4.0.
Workaround (applicable to CDH versions lower than 5.4.0):
Execute the commands below in the shell before calling Sqoop, or add them to /etc/sqoop/conf/sqoop-env.sh (create the file if it does not already exist):
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive (for a parcel installation)
export HIVE_HOME=/usr/lib/hive (for a package installation)
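For example, a usage sketch (parcel layout assumed; the connect string and table names are placeholders, not from the question):
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
sqoop import --connect jdbc:mysql://db.example.com/mydb --username user --table mytable --hcatalog-table mytable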
