InjectorJob: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation (gora solr, nutch 2) - hadoop

I need some help with Apache Nutch 2.3.1.
With HBase 0.94 everything works fine, but when I set it up for SolrStore I get this error:
InjectorJob: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:372)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:212)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Here are some of the steps I took:
solr: I have Solr 4.8.1 with the webpage schema (no errors)
ivy.xml: uncommented the dependency name="gora-solr" rev="0.5"
nutch-site.xml: set the storage class to org.apache.gora.solr.store.SolrStore (full property shown below)
gora.properties
gora.datastore.default=org.apache.gora.solr.store.SolrStore
gora.solrstore.solr.url=http://localhost:8983/solr
gora.solrstore.solr.config=solrconfig.xml
gora.solrstore.solr.schema=gora-solr-webpage-schema.xml
gora.solrstore.solr.batchSize=100
gora.solrstore.solr.solrjserver=http
gora.solrstore.solr.commitWithin=1000
gora.solrstore.solr.resultsSize=100
ant runtime
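For completeness, the nutch-site.xml step above is the usual storage class property; it looks roughly like this (a sketch, the description text may differ):
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.solr.store.SolrStore</value>
<description>Default class for storing data</description>
</property>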
Did anyone manage to use gora-solr 0.5 with Apache Nutch 2.3?

Related

SAP Vora Thrift Server Error: Instantiating dialect 'sapsql' failed

I have deployed a Cloudera CDH 5.13.1 cluster with SAP Vora 1.4 Patch 4.
When I start the Vora Thrift Server everything looks fine, but as soon as I start SAP Vora Tools and log in, the following error shows up:
17/12/20 11:26:52 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
org.apache.spark.sql.catalyst.errors.package$DialectException: Instantiating dialect 'sapsql' failed.
Reverting to default dialect 'sapsql'
at org.apache.spark.sql.SQLContext.getSQLDialect(SQLContext.scala:225)
at org.apache.spark.sql.hive.HiveContext.getSQLDialect(HiveContext.scala:577)
at org.apache.spark.sql.hive.SapHiveContext$$anonfun$1.apply(SapHiveContext.scala:54)
at org.apache.spark.sql.hive.SapHiveContext$$anonfun$1.apply(SapHiveContext.scala:54)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.SapHiveContext$$anonfun$2.apply(SapHiveContext.scala:58)
at org.apache.spark.sql.hive.SapHiveContext$$anonfun$2.apply(SapHiveContext.scala:58)
at org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:43)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:334)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:829)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.extension.SapSQLDialect
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:177)
at org.apache.spark.sql.SQLContext.getSQLDialect(SQLContext.scala:215)
... 54 more
The installation guide says I need to grant the vora user authorization for the Hive Metastore.
Since this is only a test setup, authorization is disabled in Hive; the vora user can create and drop tables in the default database and has write access to Hive's warehouse location.
How can I solve this?
This issue is caused by an incompatibility between CDH 5.13 and Vora 1.4 Patch 4. The issue is currently being investigated by SAP.
Is it an option for you to move to a newer Vora version? The current version is Vora 2.1. Since version 2.0, Vora is deployed in a Kubernetes cluster instead of the Hadoop cluster, which could help you get around this CDH dependency issue.

How to compile Nutch 2.3.1 with HBase 1.2.6

I have to set up a Hadoop stack with Nutch 2.3.1. The supported HBase version for Hadoop 2.7.4 is 1.2.6, which I have configured and tested successfully. But when I compile Nutch and crawl a sample page, I get the following error.
/usr/local/nutch/runtime/local/bin/nutch inject urls/ -crawlId kics
InjectorJob: starting at 2017-09-21 14:20:10
InjectorJob: Injecting urlDir: urls
Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
at org.apache.hadoop.hbase.client.HConnectionKey.<clinit>(HConnectionKey.java:43)
at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:267)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:194)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Error running:
According to my research (such as this and this), Nutch 2.3.1 can be compiled against HBase 1.x, but I have no idea how. Can someone please guide me (steps etc.)?
Apache Gora 0.7 is the one supporting HBase 1.2.3(+): https://issues.apache.org/jira/browse/GORA-443
You can take a look at https://stackoverflow.com/a/39837926/582789, where I wrote how to modify Nutch 2.3.1 to work with Apache Gora 0.7. Regarding the patch https://paste.apache.org/jjqz in that answer, use "0.7" where it shows "0.7-SNAPSHOT".
By the way, Apache Gora 0.8 was released yesterday :) Just changing 0.7 to 0.8 should work.
http://gora.apache.org/#20-september-2017-apache-gora-08-release
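For reference, the corresponding dependency change in Nutch's ivy/ivy.xml would look roughly like this (a sketch; bump rev to 0.8 if you go with the newer release):
<dependency org="org.apache.gora" name="gora-hbase" rev="0.7" conf="*->default" />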

Is there an official way to support both Spark 1.6.2 and 2.0.0 on a Hadoop YARN 2.7.2 cluster?

I have a cluster running Hadoop YARN 2.7.2 with dynamic allocation enabled for Spark 1.6.2.
Is there an official way to support both Spark 1.6.2 and 2.0.0? When I tried to submit an application from a Spark 2.0.0 client, the following exception occurred in the driver:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.network.util.JavaUtils.byteStringAs(Ljava/lang/String;Lorg/apache/spark/network/util/ByteUnit;)J
at org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:63)
at org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:197)
at org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:197)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefaultString(ConfigBuilder.scala:131)
at org.apache.spark.internal.config.package$.<init>(package.scala:41)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:69)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:785)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:784)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
This is supported by Hortonworks' HDP distribution. I have a cluster running HDP 2.5, which ships Hadoop 2.7.3 with both Spark 1.6.2 and 2.0.0, on CentOS 7.
I have not experienced any problems running either Spark or Spark2 jobs.
How did you install and configure both Spark versions? You could give the HDP sandbox a try and use the way Spark and Spark2 are configured there as inspiration for your own cluster.
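One approach that also works outside HDP (a sketch, with hypothetical paths) is to keep the two clients completely separate and have each submission ship its own jars to YARN, so the ApplicationMaster never loads classes from the other version:
export SPARK_HOME=/opt/spark-1.6.2
$SPARK_HOME/bin/spark-submit --master yarn --conf spark.yarn.jar=hdfs:///spark/1.6.2/spark-assembly-1.6.2.jar ...
export SPARK_HOME=/opt/spark-2.0.0
$SPARK_HOME/bin/spark-submit --master yarn --conf spark.yarn.jars='hdfs:///spark/2.0.0/jars/*.jar' ...
The NoSuchMethodError above usually points to the 2.0.0 ApplicationMaster picking up 1.6.x jars, which keeping per-version jar locations avoids.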

Nutch 2.3.1 on OSX does not connect to MongoDB

I configured a local Nutch 2.3.1 instance on MacOS 10.11.5 (El Capitan) running in Eclipse as described here: https://wiki.apache.org/nutch/RunNutchInEclipse
As the data store I configured MongoDB 2.6.12, which is also running on my local macOS machine. I took the Gora config from here: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
gora.properties
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
# I tried several server settings like localhost, 127.0.0.1, 127.0.0.1:27017, ...
gora.mongodb.db=nutch
I did not change gora-mongodb-mapping.xml.
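Since gora.mongodb.mapping.file points at /gora-mongodb-mapping.xml, Gora typically resolves it from the runtime classpath; a quick way to confirm the Eclipse run configuration can actually see it is a throwaway check like this (just a sketch):
// sketch: verify gora-mongodb-mapping.xml is visible on the runtime classpath
public class MappingCheck {
    public static void main(String[] args) {
        java.net.URL url = MappingCheck.class.getResource("/gora-mongodb-mapping.xml");
        System.out.println(url == null ? "mapping file NOT on classpath" : "found at: " + url);
    }
}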
nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
If I run the inject command, hadoop.log shows this confusing result:
2016-07-12 23:23:16,818 INFO crawl.InjectorJob - InjectorJob: starting at 2016-07-12 23:23:16
2016-07-12 23:23:16,819 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: /Users/myaccount/Documents/Nutch/urls
2016-07-12 23:23:17,054 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-12 23:23:17,416 ERROR store.MongoStore -
2016-07-12 23:23:17,417 ERROR store.MongoStore - [Ljava.lang.StackTraceElement;@4b5189ac
2016-07-12 23:23:17,418 ERROR store.MongoStore - Error while initializing MongoDB store: java.lang.NullPointerException
2016-07-12 23:23:17,419 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:267)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:299)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:131)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:123)
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:118)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMapping.newDocumentField(MongoMapping.java:109)
at org.apache.gora.mongodb.store.MongoMapping.addClassField(MongoMapping.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.loadPersistentClass(MongoMappingBuilder.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:112)
... 10 more
After two days I've run out of ideas.
Within the log file I can't identify any valuable hint. The MongoDB logs don't show any connection attempts (not to mention an active connection). Using mongo I'm able to connect to the database and requesting http://localhost:27017 shows the expected message ("It looks like you are trying to access MongoDB over HTTP on the native driver port.") and corresponding log file entries. If I switch the data store to Cassandra, injecting works as expected, so Nutch itself also seems to work.
Does anybody know what I'm missing or understand what the hadoop.log is trying to tell me?
Any help would be appreciated! Thx.
Update: I also tried to use this configuration on an Ubuntu 14.04 server - works as expected. So I suppose my issue is related to the connection between Nutch & MongoDB running on a Mac. (If somebody wants to know: I'm trying to get the configuration working on my Mac because I want to do some local development with no need of a server connection.)

Getting java.lang.NoSuchFieldError: INT_8 error while running a Spark job through Oozie

I am getting a java.lang.NoSuchFieldError: INT_8 error when I try to execute a Spark job through Oozie on Cloudera 5.5.1.
Any help on this would be appreciated.
Please find the error stack trace below.
16/01/28 11:21:17 WARN TaskSetManager: Lost task 0.2 in stage 20.0 (TID 40, Zlab-physrv1): java.lang.NoSuchFieldError: INT_8
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:327)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:521)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
at org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
at org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:277)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
In my experience, this error usually shows up when the jars used to build the code differ from the jars available at runtime.
Note: when I submit the same job directly with spark-submit, it runs fine.
Regards
Nisith
I was finally able to debug and fix the issue. The problem was with the installation: one of the data nodes had an older version of the Parquet jars (from the CDH 5.2 distribution). After replacing them with the current version of the jars, everything worked fine.
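If you run into the same thing, one quick way to spot stale jars across nodes is something like this (assuming the standard CDH parcel layout):
find /opt/cloudera/parcels -name 'parquet-*.jar' 2>/dev/null | sort
Comparing the versions in the file names across data nodes should show which node still carries the older jars.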
