How to compile Nutch 2.3.1 with HBase 1.2.6 - hadoop

I have to set up a Hadoop stack with Nutch 2.3.1. The supported HBase version for Hadoop 2.7.4 is 1.2.6, which I have configured and tested successfully. But when I compile Nutch and crawl a sample page, I get the following error:
/usr/local/nutch/runtime/local/bin/nutch inject urls/ -crawlId kics
InjectorJob: starting at 2017-09-21 14:20:10
InjectorJob: Injecting urlDir: urls
Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
at org.apache.hadoop.hbase.client.HConnectionKey.<clinit>(HConnectionKey.java:43)
at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:267)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:194)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Error running:
According to my search (such as this and this), HBase 1.x can be compiled for Nutch 2.3.1, but I have no idea how. Can someone please guide me (steps, etc.)?

Apache Gora 0.7 is the one supporting HBase 1.2.3(+): https://issues.apache.org/jira/browse/GORA-443
You can take a look at https://stackoverflow.com/a/39837926/582789 where I wrote how to modify Nutch 2.3.1 to work with Apache Gora 0.7. About the patch https://paste.apache.org/jjqz in that answer, use "0.7" where it shows "0.7-SNAPSHOT".
By the way, Apache Gora 0.8 was released yesterday :) Just changing 0.7 to 0.8 should work.
http://gora.apache.org/#20-september-2017-apache-gora-08-release
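For concreteness, the change amounts to bumping the Gora artifacts in Nutch's ivy/ivy.xml and rebuilding. A minimal sketch, assuming a stock Nutch 2.3.1 tree that pins Gora 0.6.1 (check the actual revisions in your checkout before editing):

<!-- ivy/ivy.xml: bump the Gora revisions (0.8 should also work) -->
<dependency org="org.apache.gora" name="gora-core" rev="0.7" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.7" conf="*->default"/>

Then rebuild with ant runtime and rerun the inject job. The linked answer above covers the remaining source tweaks needed for the newer Gora API.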

Related

Zookeeper : java.lang.ClassNotFoundException: org.apache.zookeeper.admin.ZooKeeperAdmin after updating spring boot

I am trying to update a Spring Boot application which uses org.apache.zookeeper.zookeeper.
After updating the Spring Boot version, I am getting one of the two errors given below, depending on the version used.
Error 1 - (for the new versions provided below)
Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /service/**/test/**/************
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1836)
at org.apache.curator.framework.imps.CreateBuilderImpl$16.call(CreateBuilderImpl.java:1131)
at org.apache.curator.framework.imps.CreateBuilderImpl$16.call(CreateBuilderImpl.java:1113)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1110)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:593)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:583)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalRegisterService(ServiceDiscoveryImpl.java:237)
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.registerService(ServiceDiscoveryImpl.java:192)
at org.springframework.cloud.zookeeper.serviceregistry.ZookeeperServiceRegistry.register(ZookeeperServiceRegistry.java:71)
... 63 more
or
Error 2 - (for some other versions of ZooKeeper and Curator given in thread 1 below)
Caused by: java.lang.ClassNotFoundException: org.apache.zookeeper.admin.ZooKeeperAdmin
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 109 more
Old versions (working fine):
Java - 8
SpringBoot - 2.3.3.RELEASE
Zookeeper - 3.4.12
Curator - 4.0.1
New versions (Spring-managed):
Java - 8
SpringBoot - 2.7.4
Zookeeper - 3.6.0
Curator - 5.1.0
Many threads mention that the issue is caused by incompatible ZooKeeper and Curator versions.
There are some threads already available regarding the issue:
Zookeeper : java.lang.ClassNotFoundException: org.apache.zookeeper.admin.ZooKeeperAdmin - I tried every solution provided in that thread and also some other combinations, but none seems to work. I also tried keeping the old version and updating the rest; that didn't work either.
Apache Curator Unimplemented Errors When Trying to Create zNodes - I am not accessing Curator directly as described in that thread, and I believe the ZooKeeper integration here uses Curator internally.
Is there any other dependency I need to upgrade, or do I need to upgrade Java?
Please mention if you need more info.
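For context, the version pinning those threads suggest looks roughly like this; a hedged Maven sketch where the chosen artifact and versions are illustrative assumptions, not a verified combination:

<!-- Hypothetical pom.xml fragment: keep exactly one ZooKeeper on the classpath -->
<dependency>
  <groupId>org.apache.curator</groupId>
  <artifactId>curator-x-discovery</artifactId>
  <version>5.1.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Pin the ZooKeeper line that Curator 5.x is compiled against (3.6+) -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>3.6.0</version>
</dependency>

Running mvn dependency:tree afterwards confirms whether a single zookeeper artifact remains on the classpath.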

Is there an official way to support both Spark 1.6.2 and 2.0.0 on a Hadoop YARN 2.7.2 cluster?

I have a cluster running Hadoop YARN 2.7.2 with dynamic allocation enabled for Spark 1.6.2.
Is there an official way to support both Spark 1.6.2 and 2.0.0? When I tried to submit an application from a Spark 2.0.0 client, an exception occurred in the driver, as below:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.network.util.JavaUtils.byteStringAs(Ljava/lang/String;Lorg/apache/spark/network/util/ByteUnit;)J
at org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:63)
at org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:197)
at org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:197)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefaultString(ConfigBuilder.scala:131)
at org.apache.spark.internal.config.package$.<init>(package.scala:41)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:69)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:785)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:784)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
This feature is supported by Hortonworks' HDP distribution. I have a cluster running HDP 2.5, which supports Hadoop 2.7.3 and both Spark 1.6.2 and 2.0.0 on CentOS 7, and I have not experienced any problems running either Spark or Spark2 jobs.
How did you install and configure both Spark versions? You could give the HDP sandbox a try and use how Spark & Spark2 are configured there as inspiration for your own cluster.
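If you are rolling your own (non-HDP) setup, the usual pattern is two side-by-side installs where each client ships its own jars to YARN, so the 2.0.0 driver never picks up the 1.6.2 assembly (which is what the NoSuchMethodError above suggests happened). A hedged sketch; the local paths and HDFS locations are illustrative assumptions:

# Spark 1.6.2 client: 1.x uses a single assembly jar (spark.yarn.jar)
/opt/spark-1.6.2/bin/spark-submit --master yarn \
  --conf spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.6.2-hadoop2.7.2.jar \
  app-1.6.jar

# Spark 2.0.0 client: 2.x replaces the assembly with spark.yarn.jars
/opt/spark-2.0.0/bin/spark-submit --master yarn \
  --conf spark.yarn.jars='hdfs:///apps/spark2/jars/*' \
  app-2.0.jar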

Getting java.lang.NoSuchFieldError: INT_8 error while running a Spark job through Oozie

I am getting a java.lang.NoSuchFieldError: INT_8 error when trying to execute a Spark job through Oozie on Cloudera 5.5.1.
Any help on this will be appreciated.
Please find the error stack trace below.
16/01/28 11:21:17 WARN TaskSetManager: Lost task 0.2 in stage 20.0 (TID 40, Zlab-physrv1): java.lang.NoSuchFieldError: INT_8
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:327)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:521)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
at org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
at org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:277)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
In my experience, this error normally occurs whenever there is a mismatch between the jars used to compile the code and the jars present at runtime.
Note: when I submit the same job using the spark-submit command, it runs fine.
Finally I was able to debug and fix the issue. The issue was with the installation: one of the data nodes had an older version of the Parquet jars (from the CDH 5.2 distribution). After replacing them with the current version's jars, everything worked fine.
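A quick way to catch that kind of skew is to list the Parquet jars on every node and diff the versions; a hedged sketch, assuming a parcel-based CDH layout and SSH access to the nodes (the node list file and parcel path are illustrative):

for h in $(cat datanodes.txt); do
  echo "== $h"
  ssh "$h" 'ls /opt/cloudera/parcels/CDH/jars/parquet-*.jar'
done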

InjectorJob: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation (gora solr, nutch 2)

I need some help with Apache Nutch 2.3.1.
With HBase 0.94 everything works OK, but when I set it up for SolrStore I get the errors below.
InjectorJob: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:212)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Here are some of the steps I did:
solr: have solr 4.8.1 with webpage schema (no errors)
ivy.xml: uncommented dependency name="gora-solr" rev="0.5"
nutch-site.xml: org.apache.gora.solr.store.SolrStore
gora.properties
gora.datastore.default=org.apache.gora.solr.store.SolrStore
gora.solrstore.solr.url=http://localhost:8983/solr
gora.solrstore.solr.config=solrconfig.xml
gora.solrstore.solr.schema=gora-solr-webpage-schema.xml
gora.solrstore.solr.batchSize=100
gora.solrstore.solr.solrjserver=http
gora.solrstore.solr.commitWithin=1000
gora.solrstore.solr.resultsSize=100
ant runtime
Did anyone manage to use gora-solr 0.5 with Apache Nutch 2.3?
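No accepted fix here, but the UnsupportedOperationException above is the classic symptom of mismatched Hadoop jar versions on the classpath (a FileSystem base class from one Hadoop version paired with a DistributedFileSystem from another). A hedged thing to try is excluding the Hadoop artifacts that gora-solr 0.5 drags in, then rebuilding; an illustrative ivy.xml sketch, not a confirmed fix:

<dependency org="org.apache.gora" name="gora-solr" rev="0.5" conf="*->default">
  <exclude org="org.apache.hadoop"/>
</dependency>

followed by another ant runtime.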

Which version of Pig should I use for HBase 0.98.8

I have Hadoop 2.5.1 installed.
Hive version is 0.13.1.
Pig version is 0.13.0.
HBase version is 0.98.8.
If I want to load files from HDFS into HBase using Pig, will my Pig version work fine?
For now I am facing the following issue:
2014-12-24 16:11:24,783 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.hadoop.hbase.util.Bytes.equals([BLjava/nio/ByteBuffer;)Z
Make sure you build Pig with the following options:
-Dhbaseversion=95 -Dhadoopversion=23 -Dprotobuf-java.version=2.5.0
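For reference, a hedged example of the full rebuild from a Pig 0.13 source checkout (assuming ant and the Pig sources are already in place):

ant clean jar -Dhbaseversion=95 -Dhadoopversion=23 -Dprotobuf-java.version=2.5.0

The rebuilt Pig jar is the one to put on your classpath in place of the stock build.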