Nutch 2.3.1 on OS X does not connect to MongoDB

I configured a local Nutch 2.3.1 instance on OS X 10.11.5 (El Capitan) running in Eclipse, as described here: https://wiki.apache.org/nutch/RunNutchInEclipse
As the data store I configured MongoDB 2.6.12, which is also running on my local machine. I took the Gora config from here: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
gora.properties
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
# I tried several server settings like localhost, 127.0.0.1, 127.0.0.1:27017, ...
gora.mongodb.db=nutch
I did not change gora-mongodb-mapping.xml.
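For reference, the stock gora-mongodb-mapping.xml maps the WebPage class to a MongoDB collection with entries roughly like the following sketch (written from memory, so the exact attributes in the file shipped with Nutch 2.3.1 may differ; the stack trace below fails while parsing exactly this file):
<gora-otd>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" document="webpage">
    <!-- each field carries a name, a docfield and a type attribute; a missing or
         unknown type is one plausible way to get a NullPointerException out of
         the mapping parser -->
    <field name="baseUrl" docfield="baseUrl" type="string"/>
    <field name="status" docfield="status" type="int32"/>
  </class>
</gora-otd>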
nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
If I run the inject command, hadoop.log shows this confusing result:
2016-07-12 23:23:16,818 INFO crawl.InjectorJob - InjectorJob: starting at 2016-07-12 23:23:16
2016-07-12 23:23:16,819 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: /Users/myaccount/Documents/Nutch/urls
2016-07-12 23:23:17,054 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-12 23:23:17,416 ERROR store.MongoStore -
2016-07-12 23:23:17,417 ERROR store.MongoStore - [Ljava.lang.StackTraceElement;@4b5189ac
2016-07-12 23:23:17,418 ERROR store.MongoStore - Error while initializing MongoDB store: java.lang.NullPointerException
2016-07-12 23:23:17,419 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:267)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:299)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:131)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:123)
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:118)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMapping.newDocumentField(MongoMapping.java:109)
at org.apache.gora.mongodb.store.MongoMapping.addClassField(MongoMapping.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.loadPersistentClass(MongoMappingBuilder.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:112)
... 10 more
After two days I've run out of ideas.
I can't identify any valuable hint in the log file. The MongoDB logs don't show any connection attempts (let alone an active connection). Using the mongo shell I'm able to connect to the database, and requesting http://localhost:27017 shows the expected message ("It looks like you are trying to access MongoDB over HTTP on the native driver port.") along with corresponding log file entries. If I switch the data store to Cassandra, injecting works as expected, so Nutch itself also seems to work.
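One more isolation step I can think of: create the MongoStore directly through Gora, outside of Nutch's job setup, and see whether the same NullPointerException shows up. A minimal sketch against the Gora 0.6.1 API (class and method names taken from the stack trace above; run it with the same classpath Eclipse uses for Nutch so gora.properties is found):

import org.apache.gora.mongodb.store.MongoStore;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraMongoSmokeTest {
    public static void main(String[] args) throws Exception {
        // DataStoreFactory reads gora.properties from the classpath
        Configuration conf = new Configuration();
        DataStore<String, WebPage> store = DataStoreFactory.createDataStore(
                MongoStore.class, String.class, WebPage.class, conf);
        store.createSchema();  // forces the mapping parse and the MongoDB connection
        System.out.println("schemaExists = " + store.schemaExists());
        store.close();
    }
}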
Does anybody know what I'm missing or understand what the hadoop.log is trying to tell me?
Any help would be appreciated. Thanks!
Update: I also tried this configuration on an Ubuntu 14.04 server, where it works as expected. So I suppose my issue is related to the connection between Nutch & MongoDB running on a Mac. (If somebody wants to know: I'm trying to get the configuration working on my Mac because I want to do some local development without needing a server connection.)

Related

Apache Nifi Web Server keeps failing to start with Decryption exception

I have a setup in which the NiFi web server suddenly started failing to start after upgrading from 1.15.3 to 1.16.1. The following exception keeps occurring on the Apache NiFi cluster:
2022-05-11 22:53:40,570 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
org.apache.nifi.encrypt.EncryptionException: Decryption Failed with Algorithm [PBEWITHMD5AND256BITAES-CBC-OPENSSL]
at org.apache.nifi.encrypt.CipherPropertyEncryptor.decrypt(CipherPropertyEncryptor.java:78)
at org.apache.nifi.fingerprint.FingerprintFactory.decrypt(FingerprintFactory.java:931)
at org.apache.nifi.fingerprint.FingerprintFactory.getLoggableRepresentationOfSensitiveValue(FingerprintFactory.java:561)
at org.apache.nifi.fingerprint.FingerprintFactory.addParameter(FingerprintFactory.java:330)
at org.apache.nifi.fingerprint.FingerprintFactory.addParameterContext(FingerprintFactory.java:302)
at org.apache.nifi.fingerprint.FingerprintFactory.addFlowControllerFingerprint(FingerprintFactory.java:210)
at org.apache.nifi.fingerprint.FingerprintFactory.createFingerprint(FingerprintFactory.java:153)
at org.apache.nifi.fingerprint.FingerprintFactory.createFingerprint(FingerprintFactory.java:127)
at org.apache.nifi.controller.inheritance.FlowFingerprintCheck.checkInheritability(FlowFingerprintCheck.java:45)
at org.apache.nifi.controller.XmlFlowSynchronizer.sync(XmlFlowSynchronizer.java:200)
at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:43)
at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1524)
at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:815)
at org.apache.nifi.controller.StandardFlowService.load(StandardFlowService.java:457)
at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:1086)
at org.apache.nifi.NiFi.<init>(NiFi.java:170)
at org.apache.nifi.NiFi.<init>(NiFi.java:82)
at org.apache.nifi.NiFi.main(NiFi.java:330)
Caused by: javax.crypto.BadPaddingException: pad block corrupted
at org.bouncycastle.jcajce.provider.symmetric.util.BaseBlockCipher$BufferedGenericBlockCipher.doFinal(Unknown Source)
at org.bouncycastle.jcajce.provider.symmetric.util.BaseBlockCipher.engineDoFinal(Unknown Source)
at javax.crypto.Cipher.doFinal(Cipher.java:2168)
at org.apache.nifi.encrypt.CipherPropertyEncryptor.decrypt(CipherPropertyEncryptor.java:74)
... 18 common frames omitted
relevant nifi.properties:
nifi.sensitive.props.key=<hidden>
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.additional.keys=
I have already tried tearing it all down and re-installing 1.15.3 with no other changes, but the same issue persists. Can someone please share any ideas on how to fix this?
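From what I understand, "pad block corrupted" is the generic symptom of decrypting with a key other than the one the value was encrypted with — i.e., flow.xml.gz holds values encrypted under a different nifi.sensitive.props.key than the one currently configured. A minimal sketch of that failure mode (using the JDK's built-in PBEWithMD5AndDES as a stand-in for the BouncyCastle-backed algorithm above; the exact exception message differs per cipher):

import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

public class WrongKeyDemo {
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[8];  // fixed zero salt, demo only
        PBEParameterSpec params = new PBEParameterSpec(salt, 1000);
        SecretKeyFactory skf = SecretKeyFactory.getInstance("PBEWithMD5AndDES");

        // encrypt under one key...
        Cipher enc = Cipher.getInstance("PBEWithMD5AndDES");
        enc.init(Cipher.ENCRYPT_MODE,
                skf.generateSecret(new PBEKeySpec("rightKey".toCharArray())), params);
        byte[] ct = enc.doFinal("sensitive value".getBytes(StandardCharsets.UTF_8));

        // ...and decrypt under another: the padding no longer checks out
        Cipher dec = Cipher.getInstance("PBEWithMD5AndDES");
        dec.init(Cipher.DECRYPT_MODE,
                skf.generateSecret(new PBEKeySpec("wrongKey".toCharArray())), params);
        dec.doFinal(ct);  // typically throws javax.crypto.BadPaddingException
    }
}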

NiFi doesn't start on MacOS Big Sur

I have installed NiFi using Homebrew following the instructions on this page.
Once I start NiFi using
nifi start
I get the following:
Java home: /usr/local/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home
NiFi home: /usr/local/Cellar/nifi/1.15.0/libexec
Bootstrap Config File: /usr/local/Cellar/nifi/1.15.0/libexec/conf/bootstrap.conf
Error: Could not find or load main class org.apache.nifi.bootstrap.RunNiFi
Caused by: java.lang.ClassNotFoundException: org.apache.nifi.bootstrap.RunNiFi
I also see this error in the nifi-app.log
2021-12-08 13:06:37,463 ERROR [Write-Ahead Local State Provider Maintenance] o.a.n.c.s.p.l.WriteAheadLocalStateProvider Failed to checkpoint Write-Ahead Log used to stor$
java.io.FileNotFoundException: ./state/local/partition-0/1.journal (No such file or directory)
at java.base/java.io.FileOutputStream.open0(Native Method)
at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
at org.wali.MinimalLockingWriteAheadLog$Partition.rollover(MinimalLockingWriteAheadLog.java:788)
at org.wali.MinimalLockingWriteAheadLog.checkpoint(MinimalLockingWriteAheadLog.java:534)
at org.apache.nifi.controller.state.providers.local.WriteAheadLocalStateProvider$CheckpointTask.run(WriteAheadLocalStateProvider.java:286)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Any ideas?
I got the same error. Mine was due to port 8443 being occupied. You can either find out what is using 8443, or try changing the port (nifi.web.https.port=8443 in conf/nifi.properties) to something else. I hope it helps if you are still facing the issue.
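(On macOS you can see what is holding the port with lsof, assuming it is installed:)
lsof -nP -iTCP:8443 -sTCP:LISTEN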

NiFi fails to launch due to java.lang.IllegalArgumentException

I have been trying to launch NiFi, but every time I do so I get the following error:
2019-03-06 18:53:46,935 ERROR [main] org.apache.nifi.NiFi Failure to launch NiFi due to java.lang.IllegalArgumentException: java.security.NoSuchAlgorithmException: md5 MessageDigest not available
java.lang.IllegalArgumentException: java.security.NoSuchAlgorithmException: md5 MessageDigest not available
at org.apache.nifi.nar.NarUnpacker.calculateMd5sum(NarUnpacker.java:419)
at org.apache.nifi.nar.NarUnpacker.unpackNar(NarUnpacker.java:228)
at org.apache.nifi.nar.NarUnpacker.unpackNars(NarUnpacker.java:123)
at org.apache.nifi.NiFi.<init>(NiFi.java:128)
at org.apache.nifi.NiFi.<init>(NiFi.java:71)
at org.apache.nifi.NiFi.main(NiFi.java:296)
Caused by: java.security.NoSuchAlgorithmException: md5 MessageDigest not available
at sun.security.jca.GetInstance.getInstance(GetInstance.java:159)
at java.security.Security.getImpl(Security.java:695)
at java.security.MessageDigest.getInstance(MessageDigest.java:167)
at org.apache.nifi.nar.NarUnpacker.calculateMd5sum(NarUnpacker.java:407)
... 5 common frames omitted
2019-03-06 18:53:46,939 INFO [Thread-1] org.apache.nifi.NiFi Initiating shutdown of Jetty web server...
2019-03-06 18:53:46,940 INFO [Thread-1] org.apache.nifi.NiFi Jetty web server shutdown completed (nicely or otherwise).
I understand this is coming from the "calculateMd5sum" function that calculates the MD5 sum of a specified file. However, I have made no changes to any of the NARs, nor have I added any custom NARs. The same instance did launch before.
I have also tried to start afresh by extracting the setup again; however, I face the same error. I fail to understand why the issue is coming up all of a sudden. Please help!
I got it.
My Java home pointed to "C:\Program Files\Java\jdk1.8.0_65".
I changed the path to "C:\Program Files (x86)\Java\jre1.8.0_121".
It works fine now.
Thanks @BryanBende
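A quick sanity check along those lines: run a tiny class under the JAVA_HOME NiFi picks up and confirm that the JRE actually provides an MD5 MessageDigest (sketch):

import java.security.MessageDigest;

public class Md5Check {
    public static void main(String[] args) throws Exception {
        System.out.println("java.home = " + System.getProperty("java.home"));
        // throws NoSuchAlgorithmException if the JRE lacks an MD5 provider
        MessageDigest md = MessageDigest.getInstance("MD5");
        System.out.println("MD5 available via provider " + md.getProvider());
    }
}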

Apache Nutch 2.3: won't inject urls (hangs) & hadoop log shows warning

I'm stuck trying to set up Nutch 2.3 with Elasticsearch 5.4. The problem is in Nutch, as I cannot get it to inject my URLs. The Hadoop log shows the following warning:
Console:
aurora apache-nutch-2.3.1 # runtime/local/bin/nutch inject urls/seed.txt
InjectorJob: starting at 2017-06-14 17:08:28
InjectorJob: Injecting urlDir: urls/seed.txt
** it hangs here**
and the
Hadoop log:
aurora apache-nutch-2.3.1 # cat runtime/local/logs/hadoop.log
2017-06-14 17:08:28,339 INFO crawl.InjectorJob - InjectorJob: starting at 2017-06-14 17:08:28
2017-06-14 17:08:28,340 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/seed.txt
2017-06-14 17:08:28,992 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I've tried setting my Hadoop environment variables following this thread (Hadoop "Unable to load native-hadoop library for your platform" warning) but I'm still getting the same error.
Any ideas?
Don't worry about the warning. I believe you are running on a Linux distribution.
Nutch 2.3 is not compatible with ES 5.x. I wrote a custom IndexWriter that invokes Logstash on a given port, which in turn feeds Elasticsearch. You may try this approach or something along those lines.
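The gist of that workaround — posting each document as JSON to a Logstash http input and letting Logstash forward it to Elasticsearch 5.x — might look roughly like this sketch (the port and the JSON shape are assumptions, not details from the answer above):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LogstashClient {
    // POST one crawled document as JSON to a Logstash "http" input
    static void send(String json) throws Exception {
        URL url = new URL("http://localhost:8080");  // hypothetical http-input port
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() != 200) {
            throw new IllegalStateException("Logstash returned " + conn.getResponseCode());
        }
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        send("{\"url\":\"http://example.com/\",\"title\":\"Example\"}");
    }
}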

Why do the Spark examples fail to spark-submit on EC2 with spark-ec2 scripts?

I downloaded spark-1.5.2 and set up a cluster on EC2 using the spark-ec2 doc here.
After that I went to examples/, ran mvn package, and packaged the examples into a jar.
In the end I ran the submit with:
bin/spark-submit --class org.apache.spark.examples.JavaTC --master spark://url_here.eu-west-1.compute.amazonaws.com:7077 --deploy-mode cluster /home/aki/Projects/spark-1.5.2/examples/target/spark-examples_2.10-1.5.2.jar
Instead of it running, I get the error:
WARN RestSubmissionClient: Unable to connect to server spark://url_here.eu-west-1.compute.amazonaws.com:7077.
Warning: Master endpoint spark://url_here.eu-west-1.compute.amazonaws.com:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/12/22 17:36:07 WARN Utils: Your hostname, aki-linux resolves to a loopback address: 127.0.1.1; using 192.168.10.63 instead (on interface wlp4s0)
15/12/22 17:36:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/22 17:36:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
at org.apache.spark.deploy.Client$$anonfun$7.apply(Client.scala:233)
at org.apache.spark.deploy.Client$$anonfun$7.apply(Client.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.Client$.main(Client.scala:233)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
... 21 more
Are you sure the URL to the master contains "url_here"?
spark://url_here.eu-west-1.compute.amazonaws.com:7077
Or maybe you are trying to obfuscate it for this post.
If you can connect to the Spark UI at http://url_here.eu-west-1.compute.amazonaws.com:4040 or, depending on your Spark version, http://url_here.eu-west-1.compute.amazonaws.com:8080, make sure you are using the URL shown on the Spark UI for your spark://...:7077 command-line argument.
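(A quick reachability check for the master port, assuming nc is available on the machine you submit from:)
nc -vz url_here.eu-west-1.compute.amazonaws.com 7077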
