Apache Nutch 2.3: won't inject URLs (hangs) & Hadoop log shows warning

I'm stuck trying to set up Nutch 2.3 with Elasticsearch 5.4. The problem is in Nutch, as I cannot get it to inject my URLs. The Hadoop log shows the following warning:
Console:
aurora apache-nutch-2.3.1 # runtime/local/bin/nutch inject urls/seed.txt
InjectorJob: starting at 2017-06-14 17:08:28
InjectorJob: Injecting urlDir: urls/seed.txt
(it hangs here)
and the Hadoop log:
aurora apache-nutch-2.3.1 # cat runtime/local/logs/hadoop.log
2017-06-14 17:08:28,339 INFO crawl.InjectorJob - InjectorJob: starting at 2017-06-14 17:08:28
2017-06-14 17:08:28,340 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/seed.txt
2017-06-14 17:08:28,992 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I've tried setting my Hadoop environment variables following this thread (Hadoop "Unable to load native-hadoop library for your platform" warning) but I'm still getting the same error.
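For reference, the fix suggested in that thread amounts to exporting something like the following (a sketch only - the paths assume a standard $HADOOP_HOME layout and may differ on your machine):
export HADOOP_HOME=/usr/local/hadoop                               # adjust to wherever Hadoop is installed
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH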
Any ideas?

Don't worry about the warning; I believe you are running on a Linux distribution.
Nutch 2.3 is not compatible with ES 5.x. I wrote a custom IndexWriter that sends documents to Logstash on a given port, which in turn feeds Elasticsearch. You may try this approach or something along those lines.
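The Logstash side of that approach could look roughly like the pipeline below. This is only a sketch: the port, index name and Elasticsearch address are placeholders rather than values from the original setup.
input {
  http {
    port => 5043                      # the custom IndexWriter would POST documents here
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nutch"
  }
}
With a pipeline like this running, the custom IndexWriter only needs to send each document as JSON to the HTTP port, and Logstash handles talking to Elasticsearch 5.x.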

Related

PyCharm error while running PySpark code on Windows

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/05/21 12:55:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
It's just a warning; you can still run the PySpark code. From PyCharm you have to use spark-submit to execute your code. Use pyspark when you want a REPL.
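A minimal illustration of the difference (script name and master URL are placeholders):
spark-submit --master local[*] your_script.py     # run a job non-interactively, e.g. from PyCharm's terminal
pyspark                                           # start the interactive REPL instead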

HBase Shell works but Unable to connect to Web UI

Following the "2. Quick Start - Standalone HBase" instructions I installed HBase standalone on my MacBook. The shell works - I can create tables, drop tables, etc.
However, I can't seem to get the Web UI working. I've tried a variety of configurations in the hbase-site.xml file. I do get one HBase error, which I don't think is related, but the browser just shows "Unable to connect".
Has anyone else had this issue and solved it?
Errors:
Thu Nov 1 06:26:42 PDT 2018 Starting master on USERS-MacBook-Pro.local
.
.
.
2018-11-01 06:26:49,477 INFO [main] server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181
2018-11-01 06:26:49,621 ERROR [main] server.ZooKeeperServer: ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes
2018-11-01 06:26:49,632 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:56148
.
.
.
2018-11-01 06:26:50,248 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
.
.
.
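For comparison, a minimal standalone hbase-site.xml that pins the master web UI port explicitly might look like the sketch below (paths and port are assumptions, not taken from the original setup; recent HBase releases serve the master UI on 16010 by default, older 0.9x releases on 60010):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///Users/youruser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/Users/youruser/zookeeper</value>
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>16010</value>
  </property>
</configuration>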

Nutch 2.3.1 on OSX does not connect to MongoDB

I configured a local Nutch 2.3.1 instance on MacOS 10.11.5 (El Capitan) running in Eclipse as described here: https://wiki.apache.org/nutch/RunNutchInEclipse
As data store to use I configured MongoDB 2.6.12 which is also running on my local MacOS machine. I took the Gora config from here: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
gora.properties
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
# I tried several server settings like localhost, 127.0.0.1, 127.0.0.1:27017, ...
gora.mongodb.db=nutch
I did not change gora-mongodb-mapping.xml.
nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
If I run the inject command, hadoop.log shows this confusing result:
2016-07-12 23:23:16,818 INFO crawl.InjectorJob - InjectorJob: starting at 2016-07-12 23:23:16
2016-07-12 23:23:16,819 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: /Users/myaccount/Documents/Nutch/urls
2016-07-12 23:23:17,054 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-12 23:23:17,416 ERROR store.MongoStore -
2016-07-12 23:23:17,417 ERROR store.MongoStore - [Ljava.lang.StackTraceElement;@4b5189ac
2016-07-12 23:23:17,418 ERROR store.MongoStore - Error while initializing MongoDB store: java.lang.NullPointerException
2016-07-12 23:23:17,419 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:267)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:299)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:131)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:123)
at org.apache.gora.mongodb.store.MongoStore.initialize(MongoStore.java:118)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.gora.mongodb.store.MongoMapping.newDocumentField(MongoMapping.java:109)
at org.apache.gora.mongodb.store.MongoMapping.addClassField(MongoMapping.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.loadPersistentClass(MongoMappingBuilder.java:169)
at org.apache.gora.mongodb.store.MongoMappingBuilder.fromFile(MongoMappingBuilder.java:112)
... 10 more
After two days I've run out of ideas.
Within the log file I can't identify any valuable hint. The MongoDB logs don't show any connection attempts (not to mention an active connection). Using mongo I'm able to connect to the database and requesting http://localhost:27017 shows the expected message ("It looks like you are trying to access MongoDB over HTTP on the native driver port.") and corresponding log file entries. If I switch the data store to Cassandra, injecting works as expected, so Nutch itself also seems to work.
Does anybody know what I'm missing or understand what the hadoop.log is trying to tell me?
Any help would be appreciated! Thx.
Update: I also tried to use this configuration on an Ubuntu 14.04 server - works as expected. So I suppose my issue is related to the connection between Nutch & MongoDB running on a Mac. (If somebody wants to know: I'm trying to get the configuration working on my Mac because I want to do some local development with no need of a server connection.)
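Since the stack trace dies inside MongoMappingBuilder.fromFile while parsing gora-mongodb-mapping.xml, one hedged sanity check (the conf path depends on whether the job runs from Eclipse or from runtime/local) is to confirm that the mapping file actually picked up is well-formed and maps the WebPage class:
# adjust the path to the conf directory your run configuration really uses
xmllint --noout conf/gora-mongodb-mapping.xml                              # does the XML parse at all?
grep -n "org.apache.nutch.storage.WebPage" conf/gora-mongodb-mapping.xml   # is the WebPage mapping present?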

Spark shell throwing exception after trying to integrate s3 / hadoop

I'm working on a Windows machine trying to set up a Spark test stack - the aim is to read/write files to an S3 bucket.
I'm running Spark 1.6.1. When I run spark-shell I now receive an error:
16/03/22 15:19:48 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
16/03/22 15:19:48 INFO HiveMetaStore.audit: ugi=Administrator ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/03/22 15:19:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
Doing some reading led me to believe that I need to add the AWS jars as an argument - the jars are included in the Hadoop directory structure.
I then run C:\Spark\hadoop\share\hadoop\tools\lib>spark-shell --jars aws-java-sdk-1.7.4.jar, hadoop-aws-2.7.1.jar
thinking that I'm now including the jars and so it must be ok...how foolish of me - I get the exact same error.
I then tried to include just the hadoop-aws jar, and all kinds of exceptions were thrown: Hive could not be instantiated, s3a could not be instantiated, AWSCredentials wasn't happy, and so on.
I'm at a bit of a loss, if anyone can shed some light on what I might be doing wrong I'll happily buy them a pint :)
EDIT:
I've since updated the core-site.xml file by removing the fs.defaultFS property with a value of s3n://mybucketname; Spark will now load.
In its stead I have hdfs://0.0.0.0:19000, which is working fine.
So I guess my question changes from 'gaaaaah' to 'gaaaaah, how does one include S3 correctly as a filesystem?'
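One common way to wire S3 in for this Spark/Hadoop combination is to register the filesystem implementation and credentials in core-site.xml and let the matching jars be resolved at launch. The snippet below is a hedged sketch rather than a known-good config for this exact setup; the keys are placeholders and hadoop-aws 2.7.1 is assumed to match the bundled Hadoop version:
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
Then launch with the dependencies pulled in for you instead of hand-listing jars (note that --jars takes a comma-separated list with no spaces; a space after the comma makes the second jar a separate argument rather than part of --jars):
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1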

Hortonworks Data node install: Exception in secureMain

I'm trying to install a Hortonworks Hadoop single-node cluster. I am able to start the namenode and secondary namenode, but the datanode fails with the following error. How do I solve this issue?
2014-04-04 18:22:49,975 FATAL datanode.DataNode (DataNode.java:secureMain(1841)) - Exception in secureMain
java.lang.RuntimeException: Although a UNIX domain socket path is configured as /var/lib/hadoop-hdfs/dn_socket, we cannot start a localDataXceiverServer because libhadoop cannot be loaded.
See the Native Libraries Guide. Make sure libhadoop.so is available in $HADOOP_HOME/lib/native. Look in the logs for this message:
INFO util.NativeCodeLoader - Loaded the native-hadoop library
If instead you find
INFO util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
then it means libhadoop.so is not available, and you'll have to investigate why. Alternatively, you can turn off HDFS short-circuit reads, or enable the legacy short-circuit reader instead via dfs.client.use.legacy.blockreader.local, to remove the libhadoop dependency. But I reckon it would be better to find out what the problem with your library is.
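For reference, those two options map to hdfs-site.xml settings roughly like this (a hedged sketch; pick one, restart the affected services afterwards, and for option 1 you may also need to clear dfs.domain.socket.path, which is what triggers the localDataXceiverServer):
<!-- option 1: turn off short-circuit reads entirely -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>false</value>
</property>
<!-- option 2: keep short-circuit reads but use the legacy block reader, which does not need libhadoop -->
<property>
  <name>dfs.client.use.legacy.blockreader.local</name>
  <value>true</value>
</property>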
Make sure you read and understand the articles linked before asking further questions.
