Elasticsearch Sink Kafka Connector fails with ConnectionClosedException - elasticsearch

We are running a Confluent ElasticsearchSinkConnector on a dedicated Kafka Connect cluster on K8s. Everything seems to work well, and records appear in our Elasticsearch cluster.
Once in a while we get an unrecoverable error, which fails the task(s) and requires a manual restart of the connector(s).
There is not much detail in the error:
Caused by: org.apache.kafka.connect.errors.ConnectException: Bulk request failed
due to
Caused by: org.apache.http.ConnectionClosedException: Connection is closed
We are running with the following configurations:
Class: io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
Config:
batch.size: 1000
behavior.on.malformed.documents: warn
behavior.on.null.values: delete
connection.compression: true
connection.password: my-password
connection.timeout.ms: 30000
connection.url: https://es-http.com:9200
connection.username: elastic
errors.log.enable: true
errors.log.include.messages: true
errors.tolerance: all
key.converter: org.apache.kafka.connect.storage.StringConverter
read.timeout.ms: 30000
retry.backoff.ms: 60000
schema.ignore: true
Topics: my-topic
Transforms: ExtractField
transforms.ExtractField.field: metadata
transforms.ExtractField.type: org.apache.kafka.connect.transforms.ExtractField$Value
value.converter: org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable: false
Tasks Max: 10
We are running a 3-node Elasticsearch cluster from this image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0, in case that is relevant.
There are no additional logs on either the Elasticsearch cluster or the Kafka Connect cluster.
Any suggestions?

Expected hostname at index 7 for neo4j bolt (3.5.21)

We run neo4j (3.5.21) on an EC2 instance. Today, after I restarted the server, I noticed this error:
Expected hostname at index 7: bolt://:7687". Starting Neo4j failed: Component 'org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter#75401424' was successfully initialized, but failed to start
Service start logs:
Active database: graph.db
Directories in use:
home: /var/lib/neo4j
config: /etc/neo4j
logs: /var/log/neo4j
plugins: /var/lib/neo4j/plugins
import: /var/lib/neo4j/import
data: /var/lib/neo4j/data
certificates: /var/lib/neo4j/certificates
run: /var/run/neo4j
Starting Neo4j.
WARNING: Max 1024 open files allowed, minimum of 40000 recommended. See the Neo4j manual.
Started neo4j (pid 22577). It is available at http://0.0.0.0:7474/
There may be a short delay until the server is ready.
See /var/log/neo4j/neo4j.log for current status.
This is what I see in neo4j.log:
2022-12-03 20:29:49.886+0000 INFO Bolt enabled on 0.0.0.0:7687.
2022-12-03 20:29:51.968+0000 INFO Started.
2022-12-03 20:29:52.121+0000 INFO Stopping...
2022-12-03 20:29:52.231+0000 INFO Stopped.
2022-12-03 20:29:52.233+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter#75401424' was successfully initialized, but failed to start. Please see the attached cause exception "Expected hostname at index 7: bolt://:7687". Starting Neo4j failed: Component 'org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter#75401424' was successfully initialized, but failed to start. Please see the attached cause exception "Expected hostname at index 7: bolt://:7687".
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter#75401424' was successfully initialized, but failed to start. Please see the attached cause exception "Expected hostname at index 7: bolt://:7687".
at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:45)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:187)
at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:124)
at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:91)
at org.neo4j.server.CommunityEntryPoint.main(CommunityEntryPoint.java:32)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter#75401424' was successfully initialized, but failed to start. Please see the attached cause exception "Expected hostname at index 7: bolt://:7687".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:473)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:180)
... 3 more
Caused by: org.neo4j.graphdb.config.InvalidSettingException: Unable to construct bolt discoverable URI using '' as hostname: Expected hostname at index 7: bolt://:7687
at org.neo4j.server.rest.discovery.DiscoverableURIs$Builder.add(DiscoverableURIs.java:133)
at org.neo4j.server.rest.discovery.DiscoverableURIs$Builder.lambda$addBoltConnectorFromConfig$1(DiscoverableURIs.java:155)
at java.util.Optional.ifPresent(Optional.java:159)
at org.neo4j.server.rest.discovery.DiscoverableURIs$Builder.addBoltConnectorFromConfig(DiscoverableURIs.java:145)
at org.neo4j.server.rest.discovery.CommunityDiscoverableURIs.communityDiscoverableURIs(CommunityDiscoverableURIs.java:38)
at org.neo4j.server.CommunityNeoServer.lambda$createDBMSModule$0(CommunityNeoServer.java:99)
at org.neo4j.server.modules.DBMSModule.start(DBMSModule.java:59)
at org.neo4j.server.AbstractNeoServer.startModules(AbstractNeoServer.java:249)
at org.neo4j.server.AbstractNeoServer.access$700(AbstractNeoServer.java:102)
at org.neo4j.server.AbstractNeoServer$ServerComponentsLifecycleAdapter.start(AbstractNeoServer.java:541)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 5 more
Caused by: java.net.URISyntaxException: Expected hostname at index 7: bolt://:7687
at java.net.URI$Parser.fail(URI.java:2847)
at java.net.URI$Parser.failExpecting(URI.java:2853)
at java.net.URI$Parser.parseHostname(URI.java:3389)
at java.net.URI$Parser.parseServer(URI.java:3235)
at java.net.URI$Parser.parseAuthority(URI.java:3154)
at java.net.URI$Parser.parseHierarchical(URI.java:3096)
at java.net.URI$Parser.parse(URI.java:3052)
at java.net.URI.<init>(URI.java:673)
at org.neo4j.server.rest.discovery.DiscoverableURIs$Builder.add(DiscoverableURIs.java:128)
... 15 more
2022-12-03 20:29:52.243+0000 INFO Neo4j Server shutdown initiated by request
EC2: t3.large
OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1092-aws x86_64)
I have already tried restarting the server and restarting the service multiple times, without success. We have not changed anything in the networking setup (VPC, subnet, security groups, network interface, etc.).
I am curious whether there is a config I am missing. Any help would be much appreciated.

Setup multiple kafka connect sinks

I am working on streaming data from PostgreSQL to HDFS. I have set up a Confluent environment on the HDP 2.6 sandbox. My JDBC source config for PostgreSQL is:
name=jdbc_1
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:postgresql://host:port/db?currentSchema=schema&user=user&password=password
mode=timestamp
timestamp.column.name=col1
validate.non.null=false
topic.prefix=psql-
All other connection properties are fine, and I am running it with:
./bin/connect-standalone ./etc/kafka/connect-standalone.properties ./etc/kafka-connect-jdbc/source.properties
It works fine and creates one topic per table in the database, for example:
psql-table1
psql-table2
Now I want to run an HDFS sink on all the topics to create a separate directory for every table in the PostgreSQL database.
But when I run the HDFS sink with the command
./bin/connect-standalone ./etc/kafka/connect-standalone.properties ./etc/kafka-connect-hdfs/hdfs-postGres.properties
while the source is still running, I get this error:
ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:113)
org.apache.kafka.connect.errors.ConnectException: Unable to start REST server
at org.apache.kafka.connect.runtime.rest.RestServer.start(RestServer.java:214)
at org.apache.kafka.connect.runtime.Connect.start(Connect.java:53)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:95)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:331)
at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:299)
at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:235)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.server.Server.doStart(Server.java:398)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.kafka.connect.runtime.rest.RestServer.start(RestServer.java:212)
... 2 more
If I stop the source connector and then start the sink, it works fine.
Can anyone help me set up multiple sink connectors?
Kafka Connect starts a REST server on port 8083 by default.
If you run more than one standalone Connect process on a single machine, you need to give each one a different port with the rest.port property.
Alternatively, you can run connect-distributed and POST your source and sink configurations individually as JSON payloads, so that both run on a single Connect server. Then you wouldn't hit this "Address already in use" issue.
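For the distributed-mode option, the JSON body can be derived from the same key=value properties used in standalone mode. The following is a minimal sketch, assuming the usual Connect REST payload shape of a top-level "name" plus a "config" object; the helper name properties_to_connect_payload is hypothetical, and the sample properties are a shortened copy of the source config from the question.

```python
import json


def properties_to_connect_payload(text):
    """Convert standalone-style key=value connector properties into a
    {"name": ..., "config": {...}} dictionary (hypothetical helper)."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only, so values such as JDBC URLs
        # that contain "=" themselves are preserved.
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    name = config.pop("name")
    return {"name": name, "config": config}


# Shortened copy of the source connector properties from the question.
source_properties = """
name=jdbc_1
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
mode=timestamp
timestamp.column.name=col1
topic.prefix=psql-
"""

payload = properties_to_connect_payload(source_properties)
print(json.dumps(payload, indent=2))
```

The printed JSON is what would be sent in a POST to the /connectors endpoint of the Connect worker (port 8083 unless configured otherwise), once for the source and once for the sink.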

Unable to connect Hive with Zookeeper Service Discovery mode via JDBC

I am creating a JDBC connection to Hive using javax.sql.DataSource and passing the ZooKeeper service discovery string (obtained from Ambari) to Hive.
ZooKeeper Hive URL: jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice
If I make a direct JDBC connection with the HiveServer host and port, the connection works properly, but it fails with the ZooKeeper string.
I then tested the ZooKeeper string with beeline, and it worked fine.
Below is the exception thrown when the connection is made.
Caused by: java.sql.SQLException: Could not open client transport for any of the Server URI's in ZooKeeper: Unable to read HiveServer2 uri from ZooKeeper
at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:205)
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:163)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:307)
at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:200)
at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:710)
at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:644)
at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:466)
at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:143)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:115)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:102)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:126)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:85)
at com.thinkbiganalytics.kerberos.KerberosUtil.getConnectionWithOrWithoutKerberos(KerberosUtil.java:60)
at com.thinkbiganalytics.hive.service.RefreshableDataSource.getConnectionForValidation(RefreshableDataSource.java:113)
at com.thinkbiganalytics.hive.service.RefreshableDataSource.testAndRefreshIfInvalid(RefreshableDataSource.java:133)
at com.thinkbiganalytics.hive.service.RefreshableDataSource.getConnection(RefreshableDataSource.java:145)
at com.thinkbiganalytics.kerberos.KerberosUtil.getConnectionWithOrWithoutKerberos(KerberosUtil.java:60)
at com.thinkbiganalytics.schema.DBSchemaParser.listCatalogs(DBSchemaParser.java:80)
... 118 more
Caused by: org.apache.hive.jdbc.ZooKeeperHiveClientException: Unable to read HiveServer2 uri from ZooKeeper
at org.apache.hive.jdbc.ZooKeeperHiveClientHelper.getNextServerUriFromZooKeeper(ZooKeeperHiveClientHelper.java:86)
at org.apache.hive.jdbc.Utils.updateConnParamsFromZooKeeper(Utils.java:506)
at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
... 136 more
Caused by: org.apache.hive.jdbc.ZooKeeperHiveClientException: Tried all existing HiveServer2 uris from ZooKeeper.
at org.apache.hive.jdbc.ZooKeeperHiveClientHelper.getNextServerUriFromZooKeeper(ZooKeeperHiveClientHelper.java:73)
... 138 more
Did anyone encounter this?
After spending two days on it, I figured out the problem. My code had a Hive 0.14 dependency, which is where the problem occurred. To fix it, I updated the two Hive Maven dependencies below:
Hive Services - https://mvnrepository.com/artifact/org.apache.hive/hive-service/1.2.1000.2.4.2.10-1
Hive JDBC - https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/1.2.1000.2.4.2.10-1

Kafka | Unable to publish data to broker - ClosedChannelException

I am trying to run a simple Kafka producer/consumer example on HDP, but I am facing the exception below.
[2016-03-03 18:26:38,683] WARN Fetching topic metadata with correlation id 0 for topics [Set(page_visits)] from broker [BrokerEndPoint(0,sandbox.hortonworks.com,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:120)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
at kafka.producer.SyncProducer.send(SyncProducer.scala:115)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:68)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:89)
at kafka.utils.Logging$class.swallowError(Logging.scala:106)
at kafka.utils.CoreUtils$.swallowError(CoreUtils.scala:51)
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:68)
at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:105)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:88)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:68)
at scala.collection.immutable.Stream.foreach(Stream.scala:547)
at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:67)
at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:45)
[2016-03-03 18:26:38,688] ERROR fetching topic metadata for topics [Set(page_visits)] from broker [ArrayBuffer(BrokerEndPoint(0,sandbox.hortonworks.com,9092))] failed (kafka.utils.CoreUtils$)
kafka.common.KafkaException: fetching topic metadata for topics [Set(page_visits)] from broker [ArrayBuffer(BrokerEndPoint(0,sandbox.hortonworks.com,9092))] failed
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:73)
at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
at kafka.producer.async.DefaultEventHandler$$anonfun$handle$1.apply$mcV$sp(DefaultEventHandler.scala:68)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:89)
at kafka.utils.Logging$class.swallowError(Logging.scala:106)
at kafka.utils.CoreUtils$.swallowError(CoreUtils.scala:51)
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:68)
at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:105)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:88)
at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:68)
at scala.collection.immutable.Stream.foreach(Stream.scala:547)
at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:67)
at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:45)
Caused by: java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:120)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
at kafka.producer.SyncProducer.send(SyncProducer.scala:115)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
... 12 more
[2016-03-03 18:26:38,693] WARN Fetching topic metadata with correlation id 1 for topics [Set(page_visits)] from broker [BrokerEndPoint(0,sandbox.hortonworks.com,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
Here is the command that I am using for the producer.
./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:9092 --topic page_visits
After a bit of googling, I found that I need to add the advertised.host.name property to the server.properties file.
Here is my server.properties file.
# Generated by Apache Ambari. Thu Mar 3 18:12:50 2016
advertised.host.name=sandbox.hortonworks.com
auto.create.topics.enable=true
auto.leader.rebalance.enable=true
broker.id=0
compression.type=producer
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.message.queue.size=10
controller.socket.timeout.ms=30000
default.replication.factor=1
delete.topic.enable=false
fetch.purgatory.purge.interval.requests=10000
host.name=sandbox.hortonworks.com
kafka.ganglia.metrics.group=kafka
kafka.ganglia.metrics.host=localhost
kafka.ganglia.metrics.port=8671
kafka.ganglia.metrics.reporter.enabled=true
kafka.metrics.reporters=org.apache.hadoop.metrics2.sink.kafka.KafkaTimelineMetricsReporter
kafka.timeline.metrics.host=sandbox.hortonworks.com
kafka.timeline.metrics.maxRowCacheSize=10000
kafka.timeline.metrics.port=6188
kafka.timeline.metrics.reporter.enabled=true
kafka.timeline.metrics.reporter.sendInterval=5900
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
listeners=PLAINTEXT://sandbox.hortonworks.com:6667
log.cleanup.interval.mins=10
log.dirs=/kafka-logs
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.bytes=-1
log.retention.hours=168
log.roll.hours=168
log.segment.bytes=1073741824
message.max.bytes=1000000
min.insync.replicas=1
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
num.replica.fetchers=1
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=86400000
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
offsets.topic.segment.bytes=104857600
producer.purgatory.purge.interval.requests=10000
queued.max.requests=500
replica.fetch.max.bytes=1048576
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.max.messages=4000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
zookeeper.connect=sandbox.hortonworks.com:2181
zookeeper.connection.timeout.ms=15000
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=2000
After adding the property I still get the same exception.
Any suggestions?
I had a similar problem. First, I checked the listeners property for the Kafka broker in Ambari.
It is also possible to check it with:
[root@sandbox bin]# cat /usr/hdp/current/kafka-broker/conf/server.properties | grep listeners
listeners=PLAINTEXT://sandbox.hortonworks.com:6667
As you can see, Ambari replaces localhost with the hostname, and the port is the same: 6667.
Then I checked that the broker really listens on that port:
[root@sandbox bin]# netstat -tulpn | grep 6667
tcp 0 0 10.0.2.15:6667 0.0.0.0:* LISTEN 11137/java
The next step was to launch the producer:
./kafka-console-producer.sh --broker-list 10.0.2.15:6667 --topic test
Finally, I launched the consumer:
./kafka-console-consumer.sh --zookeeper 10.0.2.15:2181 --topic test --from-beginning
After I typed a few words on the producer side, pressing Enter after each, the consumer received the messages.
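The check above comes down to comparing the port in the broker's listeners setting with the port passed to --broker-list. Here is a minimal sketch of that comparison, assuming the listeners value is a comma-separated list of protocol://host:port entries; the function name listener_endpoints is hypothetical, and the sample text is an excerpt of the server.properties shown in the question.

```python
def listener_endpoints(properties_text):
    """Return (protocol, host, port) tuples parsed from the 'listeners'
    line of a server.properties text (hypothetical helper)."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if line.startswith("listeners="):
            endpoints = []
            for item in line.split("=", 1)[1].split(","):
                protocol, _, hostport = item.strip().partition("://")
                # rpartition keeps any ":" inside the host part intact
                host, _, port = hostport.rpartition(":")
                endpoints.append((protocol, host, int(port)))
            return endpoints
    return []


# Excerpt of the server.properties from the question.
server_properties = """
advertised.host.name=sandbox.hortonworks.com
host.name=sandbox.hortonworks.com
listeners=PLAINTEXT://sandbox.hortonworks.com:6667
"""

endpoints = listener_endpoints(server_properties)
broker_list_port = 9092  # port used with --broker-list in the question
matches = any(port == broker_list_port for _, _, port in endpoints)
print(endpoints, matches)
```

With the values from the question, no listener matches port 9092, which is consistent with the producer only working once it was pointed at port 6667.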
According to the log, it seems the Kafka server (broker) is not running. The broker must be started first.
Producers and consumers are client programs that interact with the broker servers and also with ZooKeeper.
Before running the producer or consumer, check that the broker and ZooKeeper are running successfully.
Run the server:
./kafka-server-start.sh ../config/server.properties
Check the logs for errors; if there are none, start producing messages to the server.
Check the ZooKeeper service as well.
I modified the file /usr/hdp/current/kafka-broker/config/server.properties with the following 2 lines:
advertised.host.name=sandbox.hortonworks.com
listeners=PLAINTEXT://sandbox.hortonworks.com:6667,PLAINTEXT://0.0.0.0:6667
Then I ran the following commands:
./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic tst2
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tst2 --from-beginning
With this, it works fine.

Nutch 2.3.1 on cassandra couldn't start

I'm trying to run Nutch 2.3.1 with Cassandra. I followed the steps at http://wiki.apache.org/nutch/Nutch2Cassandra. Finally, when I try to start Nutch with the command:
bin/crawl urls/ test http://localhost:8983/solr/ 2
I get the following exception:
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: java.lang.RuntimeException: job failed: name=[test]generate: 1454483370-31180, jobid=job_local1380148534_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:256)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:322)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:330)
Error running:
/home/user/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId webmd -batchId 1454483370-31180
Failed with exit value 255.
When I check logs/hadoop.log, here's the error message:
2016-02-03 15:18:14,741 ERROR connection.HConnectionManager - Could not start connection pool for host localhost(127.0.0.1):9160
...
2016-02-03 15:18:15,185 ERROR store.CassandraStore - All host pools marked down. Retry burden pushed out to client.
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390)
But my Cassandra server is up:
runtime/local$ netstat -l |grep 9160
tcp 0 0 172.16.230.130:9160 *:* LISTEN
Can anyone help with this issue? Thanks.
Cassandra is not listening on localhost; it is listening on 172.16.230.130. That is why Nutch cannot connect to the Cassandra store.
Hope this helps,
Le Quoc Do
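The difference between the two addresses can be confirmed with a plain TCP connection attempt. This is a minimal sketch using only the standard library; the helper name can_connect and the 2-second timeout are assumptions, and the two host/port pairs are taken from the log and the netstat output above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened
    within the timeout, False otherwise (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address Nutch tried (from hadoop.log) versus the address
    # Cassandra is bound to (from the netstat output).
    for host in ("localhost", "172.16.230.130"):
        print(host, 9160, can_connect(host, 9160))
```

If the first check fails and the second succeeds on the machine running Nutch, the store configuration is pointing at an address Cassandra is not listening on.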