spring embedded kafka, LEADER_NOT_AVAILABLE, rebalancing - spring-boot

I was using spring embedded kafka to test a Kafka listener.
When I start the test, I randomly get a lot of Kafka error logs, including the following:
[2022-07-03T03:24:48.452Z] 2022-07-03 05:24:48,225 WARN 
[EventListener-1-C-1] - [org.apache.kafka.clients.NetworkClient] -
Error while fetching metadata with correlation id 106 :
{xxtopic=LEADER_NOT_AVAILABLE} ... [2022-07-03T03:24:48.452Z]
2022-07-03 05:24:48,231 INFO  [data-plane-kafka-request-handler-2] -
[state.change.logger] - [Broker id=0] Finished LeaderAndIsr request in
16291ms correlationId 11 from controller 0 for 51 partitions

> [2022-07-03T03:24:48.452Z] 2022-07-03 05:24:48,234 INFO 
[data-plane-kafka-request-handler-5] - [state.change.logger] - [Broker
id=0] Add 51 partitions and deleted 0 partitions from metadata cache
in response to UpdateMetadata request sent by controller 0 epoch 1
with correlation id 12 ... [2022-07-03T03:21:29.735Z] 2022-07-03
05:21:29,659 INFO  [EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - []
Request joining group due to: group is already rebalancing  
[2022-07-03T03:21:29.735Z] 2022-07-03 05:21:29,659 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] -
[] Revoke previously assigned partitions  [2022-07-03T03:21:29.735Z]
2022-07-03 05:21:29,659 INFO  [EventListener-0-C-1] -
[org.springframework.kafka.listener.KafkaMessageListenerContainer] - :
partitions revoked: []

> [2022-07-03T03:21:29.735Z] 2022-07-03 05:21:29,659 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - []
(Re-)joining group  ... [2022-07-03T03:24:16.535Z] 2022-07-03
05:24:16,145 WARN  [EventListener-0-C-1] -
[org.apache.kafka.clients.NetworkClient] - [] Error while fetching
metadata with correlation id 10 :
{xxxtopic=UNKNOWN_TOPIC_OR_PARTITION}   [2022-07-03T03:24:18.160Z]
2022-07-03 05:24:17,327 WARN  [EventListener-0-C-1] -
[org.apache.kafka.clients.NetworkClient] - [] Error while fetching
metadata with correlation id 26 : {xxx=LEADER_NOT_AVAILABLE}  
[2022-07-03T03:25:12.365Z] 2022-07-03 05:25:12,259 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - []
Request joining group due to: need to re-join with the given member-id

> [2022-07-03T03:25:12.365Z] 2022-07-03 05:25:12,260 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - []
(Re-)joining group  ... [2022-07-03T03:25:33.381Z] 2022-07-03
05:25:32,825 INFO  [EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - []
Notifying assignor about the new Assignment(partitions=[xxx.topic-0])

> [2022-07-03T03:25:33.381Z] 2022-07-03 05:25:32,825 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] -
Adding newly assigned partitions: ... [2022-07-03T03:25:45.674Z]
        ... 1 common frames omitted
> [2022-07-03T03:25:45.674Z] 2022-07-03 05:25:45,413 INFO 
[EventListener-0-C-1] -
[org.apache.kafka.clients.consumer.KafkaConsumer] - [] Seeking to
offset 1 for partition  ...
This issue does not occur every time, but when it does, my test fails. The spring-kafka test dependency I am using is:
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka-test</artifactId>
    <scope>test</scope>
</dependency>
Is this a bug in the embedded Kafka?
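For what it's worth, these LEADER_NOT_AVAILABLE warnings and rebalances are commonly startup noise: the embedded broker is still creating topics and the consumer group is still forming when the test begins. A minimal sketch of one common mitigation, waiting for partition assignment before the test body runs (the class name and topic are assumptions, not taken from the question):

import org.junit.jupiter.api.BeforeEach;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.kafka.test.EmbeddedKafkaBroker;
import org.springframework.kafka.test.context.EmbeddedKafka;
import org.springframework.kafka.test.utils.ContainerTestUtils;

@SpringBootTest
@EmbeddedKafka(partitions = 1, topics = {"xxtopic"}) // pre-create the topic in the embedded broker
class EventListenerTest {

    @Autowired
    private KafkaListenerEndpointRegistry registry;

    @Autowired
    private EmbeddedKafkaBroker embeddedKafka;

    @BeforeEach
    void waitForAssignment() {
        // Block until every listener container has been assigned its partitions,
        // so the test does not produce records during the initial rebalance.
        for (MessageListenerContainer container : registry.getListenerContainers()) {
            ContainerTestUtils.waitForAssignment(container, embeddedKafka.getPartitionsPerTopic());
        }
    }
}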

Kafka consumer does not fetch new records when using topic pattern and large messages

I hope someone can help me.
I'm using Spring Boot 2.3.4 with Spring Kafka 2.5.6. I recently had to reset an offset and saw some strange behavior: we consumed the messages, but after every X (varying) messages there was a 10-second pause before consumption continued.
This is my configuration:
spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      enable-auto-commit: false
      auto-offset-reset: earliest
      heartbeat-interval: 1000
      max-poll-records: 50
      group-id: kafka-fetch-demo
      fetch-max-wait: 10000
    listener:
      type: single
      concurrency: 1
      poll-timeout: 1000
      no-poll-threshold: 2
      monitor-interval: 10
      ack-mode: manual
    producer:
      acks: all
      batch-size: 0
      retries: 0
This is an example listener:
@KafkaListener(id = LISTENER_ID, idIsGroup = false, topicPattern = "#{demoProperties.getTopicPattern()}")
public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
    log.info("Received record on topic {}, partition {} and offset {}",
            record.topic(),
            record.partition(),
            record.offset());
    acknowledgment.acknowledge();
}
Analysis
I figured out that the 10-second timeout came from the fetch.max.wait.ms property, but I'm not able to figure out why this property applies.
As far as I understand, the fetch-max-wait property only determines the maximum time the broker waits before providing the consumer with new records, even if fetch.min.bytes is not yet exceeded (which in my case is set to the default of 1 and should always be fulfilled).
Furthermore, I found that this problem only occurs when using topic patterns and "larger" messages.
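For reference, a plain-consumer sketch of the two fetch settings discussed above (the names and values here are illustrative; the demo sets fetch-max-wait through Spring Boot instead):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FetchConfigExample {
    public static KafkaConsumer<byte[], String> newConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka-fetch-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The broker answers a fetch as soon as fetch.min.bytes of data is
        // available, or after fetch.max.wait.ms at the latest.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);        // default
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 10000);  // what the demo config sets
        return new KafkaConsumer<>(props);
    }
}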
Reproduction
I uploaded a demo application to GitHub to reproduce the issue: https://github.com/kraennix/kafka-fetch-demo.
How I reproduced it:
I put a thousand messages of 17.1 KB each on a Kafka topic.
I start my consuming application, which listens to this topic via a topic pattern. Then you can see the stalling behaviour.
Note: if I do the same with "small" messages (89 bytes), it works as expected.
Logs
In the logs you can see the successful commit, but then it says Skipping fetch:
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}}
2021-01-16 15:04:40.773 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Sending OffsetCommit request with {publish.LargeTopic.2.test-0=OffsetAndMetadata{offset=488, leaderEpoch=null, metadata=''}} to coordinator localhost:9092 (id: 2147483647 rack: null)
2021-01-16 15:04:40.773 DEBUG 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Using older server API v7 to send OFFSET_COMMIT {group_id=kafka-fetch-demo,generation_id=4,member_id=consumer-kafka-fetch-demo-1-cf8e747f-531d-457a-aca8-18960c518ef9,group_instance_id=null,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,committed_offset=488,committed_leader_epoch=-1,committed_metadata=}]}]} with correlation id 62 to node 2147483647
2021-01-16 15:04:40.778 TRACE 19244 --- [_LISTENER-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Completed receive from node 2147483647 for OFFSET_COMMIT with correlation id 62, received {throttle_time_ms=0,topics=[{name=publish.LargeTopic.2.test,partitions=[{partition_index=0,error_code=0}]}]}
2021-01-16 15:04:40.779 DEBUG 19244 --- [_LISTENER-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Committed offset 488 for partition publish.LargeTopic.2.test-0
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.1.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
2021-01-16 15:04:40.779 TRACE 19244 --- [_LISTENER-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-kafka-fetch-demo-1, groupId=kafka-fetch-demo] Skipping fetch for partition publish.LargeTopic.2.test-0 because previous request to localhost:9092 (id: 0 rack: null) has not been processed
When the size of the messages changes, you might need to adjust these two properties:
heartbeat-interval: 1000
max-poll-records: 50
Your heartbeat interval is 1 s and your max poll wait is 10 s. If the messages are large and you process the consumed records on the same thread, the heartbeat check can fail by the time the next poll is triggered. Make sure to process messages via an Executor using a Callable, as sketched below.
Increase the heartbeat interval to 5-10 s and reduce max-poll-records to 15 when the message size is large. Hope this helps.
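A rough sketch of that suggestion (the thread-pool size and the process() method are assumptions; note that acknowledging before processing finishes trades delivery guarantees for consumer liveness):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

public class OffloadingListener {

    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    @KafkaListener(id = "demo-listener", topicPattern = "#{demoProperties.getTopicPattern()}")
    public void onEvent(Acknowledgment acknowledgment, ConsumerRecord<byte[], String> record) {
        // Hand the record to a worker thread so the listener thread returns
        // to poll() quickly instead of blocking on slow processing.
        executor.submit(() -> process(record));
        acknowledgment.acknowledge(); // ack-before-processing: at-most-once semantics
    }

    private void process(ConsumerRecord<byte[], String> record) {
        // slow work goes here
    }
}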

Spring Kafka always rebalances after 5 min even if I pause the consumer

There is a time-consuming operation (about 10 min), but Kafka always rebalances after 5 min, even if I pause the consumer. The consumer method:
@KafkaListener(topics = {TopicAppoint.EXECUTE_SCHOOL_DATA_STATICS_TASK})
public void receiveMessage(@Payload String payload, Consumer<String, String> consumer) {
    Set<TopicPartition> assignment = consumer.assignment();
    consumer.pause(assignment);
    if (StringUtils.isNotEmpty(payload)) {
        SchoolStatisticsTaskDTO staticsTaskDTO = JSONObject.parseObject(payload, SchoolStatisticsTaskDTO.class);
        Optional<SchoolStatisticsTaskDO> taskOptional = schoolStatisticsTaskRepository.findById(staticsTaskDTO.getTrackId());
        taskOptional.ifPresent(schoolStaticsTaskDO -> {
            // handler
        });
    }
    consumer.resume(assignment);
}
This is my config:
kafka:
  bootstrap-servers: 192.168.0.230:9092
  producer:
    key-serializer: org.apache.kafka.common.serialization.StringSerializer
    value-serializer: org.apache.kafka.common.serialization.StringSerializer
    retries: 3
    properties:
      max.request.size: 12582912
  consumer:
    key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    group-id: dc-fitness-data-consumer-group
    properties:
      max.partition.fetch.bytes: 12582912
    #enable-auto-commit: false
  listener:
    ack-mode: record
    concurrency: 6
LOGS
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.records.incremental.update.task-4, ft.es.records.incremental.update.task-5, ft.es.records.incremental.update.task-2, ft.es.records.incremental.update.task-3, ft.es.records.incremental.update.task-0, ft.es.records.incremental.update.task-1]
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.records.incremental.update.task-4, ft.es.records.incremental.update.task-5, ft.es.records.incremental.update.task-2, ft.es.records.incremental.update.task-3, ft.es.records.incremental.update.task-0, ft.es.records.incremental.update.task-1]
13:09:20.219 [org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-25, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.220 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.student.batch.upload.task-17, ft.student.batch.upload.task-14, ft.student.batch.upload.task-13, ft.student.batch.upload.task-16, ft.student.batch.upload.task-15, ft.student.batch.upload.task-12]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.student.batch.upload.task-17, ft.student.batch.upload.task-14, ft.student.batch.upload.task-13, ft.student.batch.upload.task-16, ft.student.batch.upload.task-15, ft.student.batch.upload.task-12]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#9-2-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-21, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.record.batch.upload.task-32, ft.record.batch.upload.task-31, ft.record.batch.upload.task-34, ft.record.batch.upload.task-33, ft.record.batch.upload.task-30, ft.record.batch.upload.task-35]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.record.batch.upload.task-32, ft.record.batch.upload.task-31, ft.record.batch.upload.task-34, ft.record.batch.upload.task-33, ft.record.batch.upload.task-30, ft.record.batch.upload.task-35]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#8-5-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-18, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.class.batch.upload.task-18, ft.class.batch.upload.task-19, ft.class.batch.upload.task-20, ft.class.batch.upload.task-21, ft.class.batch.upload.task-22, ft.class.batch.upload.task-23]
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.221 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.class.batch.upload.task-26, ft.class.batch.upload.task-27, ft.class.batch.upload.task-28, ft.class.batch.upload.task-29, ft.class.batch.upload.task-24, ft.class.batch.upload.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.class.batch.upload.task-18, ft.class.batch.upload.task-19, ft.class.batch.upload.task-20, ft.class.batch.upload.task-21, ft.class.batch.upload.task-22, ft.class.batch.upload.task-23]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.class.incremental.update.task-26, ft.es.class.incremental.update.task-27, ft.es.class.incremental.update.task-28, ft.es.class.incremental.update.task-29, ft.es.class.incremental.update.task-24, ft.es.class.incremental.update.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.class.batch.upload.task-26, ft.class.batch.upload.task-27, ft.class.batch.upload.task-28, ft.class.batch.upload.task-29, ft.class.batch.upload.task-24, ft.class.batch.upload.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.class.incremental.update.task-26, ft.es.class.incremental.update.task-27, ft.es.class.incremental.update.task-28, ft.es.class.incremental.update.task-29, ft.es.class.incremental.update.task-24, ft.es.class.incremental.update.task-25]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-5, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#6-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-4, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-53, groupId=dc-fitness-data-consumer-group] (Re-)joining group
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] Revoking previously assigned partitions [ft.es.class.incremental.update.task-22, ft.es.class.incremental.update.task-23, ft.es.class.incremental.update.task-18, ft.es.class.incremental.update.task-19, ft.es.class.incremental.update.task-20, ft.es.class.incremental.update.task-21]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.s.k.l.KafkaMessageListenerContainer - partitions revoked: [ft.es.class.incremental.update.task-22, ft.es.class.incremental.update.task-23, ft.es.class.incremental.update.task-18, ft.es.class.incremental.update.task-19, ft.es.class.incremental.update.task-20, ft.es.class.incremental.update.task-21]
13:09:20.223 [org.springframework.kafka.KafkaListenerEndpointContainer#5-4-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-47, groupId=dc-fitness-data-consumer-group] Attempt to heartbeat failed since group is rebalancing
13:09:20.225 [org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-52, groupId=dc-fitness-data-consumer-group] (Re-)joining group
Let's look at the documentation for a moment, beginning with the pause() method:
public void pause(Collection partitions) here
Suspend fetching from the requested partitions. Future calls to poll(long) will not return any records from these partitions until they have been resumed using resume(Collection). Note that this method does not affect partition subscription. In particular, it does not cause a group rebalance when automatic assignment is used.
So, from the description above, the pause() method suspends fetching from the given partitions, but it does not pause the consumer thread: the consumer thread keeps making its subsequent poll() requests to the paused partitions, it just fetches no records.
In your application, the partitions are paused, but the consumer thread is busy executing the time-consuming operation for more than 10 minutes without making any poll request.
Detecting consumer failures: when you stop calling poll, heartbeats are no longer sent to the cluster.
After subscribing to a set of topics, the consumer will automatically join the group when poll(long) is invoked. The poll API is designed to ensure consumer liveness. As long as you continue to call poll, the consumer will stay in the group and continue to receive messages from the partitions it was assigned. Underneath the covers, the poll API sends periodic heartbeats to the server; when you stop calling poll (perhaps because an exception was thrown), then no heartbeats will be sent. If a period of the configured session timeout elapses before the server has received a heartbeat, then the consumer will be kicked out of the group and its partitions will be reassigned.
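In other words, pausing only helps if something keeps calling poll(). With a raw KafkaConsumer, the pattern the docs imply looks roughly like this (an illustrative sketch; the executor and longRunningTask are assumptions, not code from the question):

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PauseAndKeepPolling {
    // Pause the partitions, run the slow work on another thread, and keep
    // calling poll() so the broker still considers this consumer alive;
    // poll() returns no records from paused partitions.
    static void processSlowly(KafkaConsumer<String, String> consumer,
                              ExecutorService executor,
                              Runnable longRunningTask) {
        consumer.pause(consumer.assignment());
        Future<?> work = executor.submit(longRunningTask);
        while (!work.isDone()) {
            consumer.poll(Duration.ofSeconds(1));
        }
        consumer.resume(consumer.assignment());
    }
}

Inside a Spring @KafkaListener you cannot drive poll() yourself, which is why the listener in the question still hits the 5-minute limit and the timeout-based fix below is usually the practical one.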
Solution 1: increase the timeouts for the properties below. Since the default is 5 minutes, you are seeing a rebalance every 5 minutes (personally, I would not recommend just increasing timeouts). From the docs here:
The new Java Consumer now supports heartbeating from a background thread. There is a new configuration max.poll.interval.ms which controls the maximum time between poll invocations before the consumer will proactively leave the group (5 minutes by default). The value of the configuration request.timeout.ms must always be larger than max.poll.interval.ms because this is the maximum time that a JoinGroup request can block on the server while the consumer is rebalancing, so we have changed its default value to just above 5 minutes. Finally, the default value of session.timeout.ms has been adjusted down to 10 seconds, and the default value of max.poll.records has been changed to 500.
Solution 2: if you need to process a large amount of data in less time, add more partitions and consume each partition on its own thread (i.e. more concurrency), so that less data has to be processed within the 5 minutes.
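A minimal sketch of Solution 1 with spring-kafka's DefaultKafkaConsumerFactory (the bean wiring and the 15-minute value are assumptions; pick something longer than your worst-case processing time):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

public class ConsumerFactoryConfig {

    public DefaultKafkaConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.230:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dc-fitness-data-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Default is 300000 (5 min): exceed it between polls and the consumer
        // proactively leaves the group, causing the rebalance seen in the logs.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 15 * 60 * 1000); // 15 min
        return new DefaultKafkaConsumerFactory<>(props);
    }
}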

Flume: kafka channel and hdfs sink get unable to deliver event error

I want to try the new Flafka flow: use only a Kafka channel to transfer data to an HDFS sink. I first tried a Kafka channel with a logger sink, which is easier to monitor. My configuration file is:
# Name the components on this agent
a1.sinks = sink1
a1.channels = channel1
a1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.channel1.brokerList = localhost:9093,localhost:9094
a1.channels.channel1.topic = par4
a1.channels.channel1.zookeeperConnect = localhost:2181
a1.channels.channel1.parseAsFlumeEvent = false
a1.channels.cnannel1.kafka.consumer.timeout.ms = 1000000
a1.sinks.sink1.channel = channel1
a1.sinks.sink1.type = logger
I set up ZooKeeper and two brokers locally using the port numbers above, and I have a producer client continuously pushing messages to Kafka.
I got the following messages:
2015-07-02 20:22:37,619 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2015-07-02 20:22:37,623 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:conf/example.conf
2015-07-02 20:22:37,629 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1017)] Processing:sink1
2015-07-02 20:22:37,629 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1017)] Processing:sink1
2015-07-02 20:22:37,629 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:931)] Added sinks: sink1 Agent: a1
2015-07-02 20:22:37,633 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(FlumeConfiguration.java:508)] Agent configuration for 'a1' has no sources.
2015-07-02 20:22:37,635 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:141)] Post-validation flume configuration contains configuration for agents: [a1]
2015-07-02 20:22:37,635 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:145)] Creating channels
2015-07-02 20:22:37,639 (conf-file-poller-0) [INFO - org.apache.flume.channel.DefaultChannelFactory.create(DefaultChannelFactory.java:42)] Creating instance of channel channel1 type org.apache.flume.channel.kafka.KafkaChannel
2015-07-02 20:22:37,650 (conf-file-poller-0) [INFO - org.apache.flume.channel.kafka.KafkaChannel.configure(KafkaChannel.java:168)] Group ID was not specified. Using flume as the group id.
2015-07-02 20:22:37,658 (conf-file-poller-0) [INFO - org.apache.flume.channel.kafka.KafkaChannel.configure(KafkaChannel.java:188)] {metadata.broker.list=localhost:9093,localhost:9094, request.required.acks=-1, group.id=flume, zookeeper.connect=localhost:2181, consumer.timeout.ms=100, auto.commit.enable=false}
2015-07-02 20:22:37,665 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:200)] Created channel channel1
2015-07-02 20:22:37,666 (conf-file-poller-0) [INFO - org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:42)] Creating instance of sink: sink1, type: logger
2015-07-02 20:22:37,669 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:114)] Channel channel1 connected to [sink1]
2015-07-02 20:22:37,674 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{} sinkRunners:{sink1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#3362ba9e counterGroup:{ name:null counters:{} } }} channels:{channel1=org.apache.flume.channel.kafka.KafkaChannel{name: channel1}} }
2015-07-02 20:22:37,675 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel channel1
2015-07-02 20:22:37,677 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.kafka.KafkaChannel.start(KafkaChannel.java:96)] Starting Kafka Channel: channel1
2015-07-02 20:22:37,885 (lifecycleSupervisor-1-0) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Verifying properties
2015-07-02 20:22:37,903 (lifecycleSupervisor-1-0) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property auto.commit.enable is not valid
2015-07-02 20:22:37,903 (lifecycleSupervisor-1-0) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property consumer.timeout.ms is not valid
2015-07-02 20:22:37,903 (lifecycleSupervisor-1-0) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property group.id is not valid
2015-07-02 20:22:37,904 (lifecycleSupervisor-1-0) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property metadata.broker.list is overridden to localhost:9093,localhost:9094
2015-07-02 20:22:37,904 (lifecycleSupervisor-1-0) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property request.required.acks is overridden to -1
2015-07-02 20:22:37,904 (lifecycleSupervisor-1-0) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property zookeeper.connect is not valid
2015-07-02 20:22:37,929 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.channel.kafka.KafkaChannel.start(KafkaChannel.java:99)] Topic = par4
2015-07-02 20:22:37,929 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: channel1: Successfully registered new MBean.
2015-07-02 20:22:37,930 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: channel1 started
2015-07-02 20:22:37,930 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink sink1
2015-07-02 20:22:37,939 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Verifying properties
2015-07-02 20:22:37,939 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property auto.commit.enable is overridden to false
2015-07-02 20:22:37,939 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property consumer.timeout.ms is overridden to 100
2015-07-02 20:22:37,939 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property group.id is overridden to flume
2015-07-02 20:22:37,939 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property metadata.broker.list is not valid
2015-07-02 20:22:37,940 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Property request.required.acks is not valid
2015-07-02 20:22:37,942 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Property zookeeper.connect is overridden to localhost:2181
2015-07-02 20:22:37,951 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] [flume_MACC02PHH5LG3QC-1435893757951-c4c69fb7], Connecting to zookeeper instance at localhost:2181
2015-07-02 20:22:37,952 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
java.lang.IllegalStateException: close() called when transaction is OPEN - you must either commit or rollback first
at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
at org.apache.flume.channel.BasicTransactionSemantics.close(BasicTransactionSemantics.java:179)
at org.apache.flume.sink.LoggerSink.process(LoggerSink.java:105)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
^C2015-07-02 20:22:39,497 (agent-shutdown-hook) [INFO - org.apache.flume.lifecycle.LifecycleSupervisor.stop(LifecycleSupervisor.java:79)] Stopping lifecycle supervisor 12
2015-07-02 20:22:39,499 (agent-shutdown-hook) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Shutting down producer
2015-07-02 20:22:39,499 (agent-shutdown-hook) [INFO - kafka.utils.Logging$class.info(Logging.scala:68)] Closing all sync producers
2015-07-02 20:22:39,501 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:150)] Component type: CHANNEL, name: channel1 stopped
2015-07-02 20:22:39,501 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:156)] Shutdown Metric for type: CHANNEL, name: channel1. channel.start.time == 1435893757930
2015-07-02 20:22:39,501 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:162)] Shutdown Metric for type: CHANNEL, name: channel1. channel.stop.time == 1435893759501
2015-07-02 20:22:39,501 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.capacity == 0
2015-07-02 20:22:39,502 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.current.size == 0
2015-07-02 20:22:39,502 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.event.put.attempt == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.event.put.success == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.event.take.attempt == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.event.take.success == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.kafka.commit.time == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.kafka.event.get.time == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.kafka.event.send.time == 0
2015-07-02 20:22:39,504 (agent-shutdown-hook) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:178)] Shutdown Metric for type: CHANNEL, name: channel1. channel.rollback.count == 0
2015-07-02 20:22:39,505 (agent-shutdown-hook) [INFO - org.apache.flume.channel.kafka.KafkaChannel.stop(KafkaChannel.java:123)] Kafka channel channel1 stopped. Metrics: CHANNEL:channel1{channel.event.put.attempt=0, channel.event.put.success=0, channel.kafka.event.get.time=0, channel.current.size=0, channel.event.take.attempt=0, channel.event.take.success=0, channel.kafka.event.send.time=0, channel.capacity=0, channel.kafka.commit.time=0, channel.rollback.count=0}
2015-07-02 20:22:39,505 (agent-shutdown-hook) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop(PollingPropertiesFileConfigurationProvider.java:83)] Configuration provider stopping
I don't understand why I get this "unable to deliver event" error. (I also tried to set up an HDFS sink, which gives me the same error.)
I also don't understand why I didn't successfully set consumer.timeout.ms.
Looking for help, thanks!
Based on the answer from the community, this question can be solved by following these two JIRA issues:
https://issues.apache.org/jira/browse/FLUME-2734
https://issues.apache.org/jira/browse/FLUME-2735

Graylog2 - Startup fail. Address already in use

I am trying to install graylog2. I have installed OpenJDK 7, and I have also installed Elasticsearch and MongoDB using apt on Ubuntu 14.04.
I am new to both Graylog and Elasticsearch. I just want to do a trial installation and try them out. I did search similar questions and tried their suggestions, but none of them worked in my case.
I have followed the installation instructions on graylog.org, but when I try to start the graylog2 server I get the following error.
2015-02-12 03:19:36,216 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.IndexerClusterCheckerThread] periodical in [0s], polling every [30s].
2015-02-12 03:19:36,222 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.GarbageCollectionWarningThread] periodical, running forever.
2015-02-12 03:19:36,225 INFO : org.graylog2.periodical.IndexerClusterCheckerThread - Indexer not fully initialized yet. Skipping periodic cluster check.
2015-02-12 03:19:36,229 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.ThroughputCounterManagerThread] periodical in [0s], polling every [1s].
2015-02-12 03:19:36,280 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.DeadLetterThread] periodical, running forever.
2015-02-12 03:19:36,295 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.ClusterHealthCheckThread] periodical in [0s], polling every [20s].
2015-02-12 03:19:36,299 INFO : org.graylog2.periodical.Periodicals - Starting [org.graylog2.periodical.InputCacheWorkerThread] periodical, running forever.
2015-02-12 03:19:36,334 DEBUG: org.graylog2.periodical.ClusterHealthCheckThread - No input running in cluster!
2015-02-12 03:19:36,368 DEBUG: org.graylog2.caches.DiskJournalCache - Committing output-cache (entries 0)
2015-02-12 03:19:36,383 DEBUG: org.graylog2.caches.DiskJournalCache - Committing input-cache (entries 0)
2015-02-12 03:19:36,885 ERROR: com.google.common.util.concurrent.ServiceManager - Service IndexerSetupService [FAILED] has failed in the STARTING state.
org.elasticsearch.transport.BindTransportException: Failed to bind to [9300]
at org.elasticsearch.transport.netty.NettyTransport.doStart(NettyTransport.java:396)
at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
at org.elasticsearch.transport.TransportService.doStart(TransportService.java:90)
at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
at org.elasticsearch.node.internal.InternalNode.start(InternalNode.java:242)
at org.graylog2.initializers.IndexerSetupService.startUp(IndexerSetupService.java:101)
at com.google.common.util.concurrent.AbstractIdleService$2$1.run(AbstractIdleService.java:54)
at com.google.common.util.concurrent.Callables$3.run(Callables.java:95)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ChannelException: Failed to bind to: /127.0.0.1:9300
at org.elasticsearch.common.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.elasticsearch.transport.netty.NettyTransport$3.onPortNumber(NettyTransport.java:387)
at org.elasticsearch.common.transport.PortsRange.iterate(PortsRange.java:58)
at org.elasticsearch.transport.netty.NettyTransport.doStart(NettyTransport.java:383)
... 8 more
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Elasticsearch is showing the following status:
{
  "cluster_name" : "graylog2",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
The following are the changes I made to elasticsearch.yml
cluster.name: graylog2
network.bind_host: 127.0.0.1
network.host: 127.0.0.1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["127.0.0.1", MYSYS IP]
and graylog2.conf
is_master = true
password_secret = changed
root_password_sha2 = changed
elasticsearch_max_docs_per_index = 20000000
elasticsearch_shards = 1
elasticsearch_replicas = 0
elasticsearch_cluster_name = graylog2
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_discovery_zen_ping_unicast_hosts = IP_ARR:9300
mongodb_useauth = false
I tried killing the process on port 9300 and starting graylog again, but I got the following error:
2015-02-12 04:01:24,976 INFO : org.elasticsearch.transport - [graylog2-server] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
2015-02-12 04:01:25,227 INFO : org.elasticsearch.discovery - [graylog2-server] graylog2/LGkZJDz1SoeENKj6Rr0e8w
2015-02-12 04:01:25,252 DEBUG: org.elasticsearch.cluster.service - [graylog2-server] processing [update local node]: execute
2015-02-12 04:01:25,253 DEBUG: org.elasticsearch.cluster.service - [graylog2-server] cluster state updated, version [0], source [update local node]
2015-02-12 04:01:25,259 DEBUG: org.elasticsearch.cluster.service - [graylog2-server] set local cluster state to version 0
2015-02-12 04:01:25,259 DEBUG: org.elasticsearch.cluster.service - [graylog2-server] processing [update local node]: done applying updated cluster_state (version: 0)
2015-02-12 04:01:25,325 WARN : org.elasticsearch.transport.netty - [graylog2-server] exception caught on transport layer [[id: 0x82f30fa7]], closing connection
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
.......
2015-02-12 04:01:28,536 DEBUG: org.elasticsearch.action.admin.cluster.health - [graylog2-server] no known master node, scheduling a retry
2015-02-12 04:01:28,564 DEBUG: org.elasticsearch.transport.netty - [graylog2-server] disconnected from [[graylog2-server][LGkZJDz1SoeENKj6Rr0e8w][ubuntu-greylog-9945][inet[/127.0.0.1:9300]]{client=true, data=false, master=false}]
2015-02-12 04:01:28,573 DEBUG: org.elasticsearch.discovery.zen - [graylog2-server] filtered ping responses: (filter_client[true], filter_data[false]) {none}
2015-02-12 04:01:28,590 WARN : org.elasticsearch.transport.netty - [graylog2-server] exception caught on transport layer [[id: 0xe27feaff]], closing connection
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
Can you please point out what I am doing wrong here and what I am missing?
If ES and graylog2 are running on the same server, try deleting/commenting out the following in elasticsearch.yml:
#transport.tcp.port: 9300
and adding/uncommenting the following in graylog2.conf:
elasticsearch_transport_tcp_port = 9350

Setup a graylog2 server with elasticsearch in a vagrant machine

I'm trying to install a graylog2 server on my local dev machine and am encountering problems with the Elasticsearch setup.
My Elasticsearch is installed as a service on a Vagrant machine running on my dev machine, so it isn't at 127.0.0.1 but at 192.168.50.4 (the IP of the Vagrant machine). I have port 9200 forwarded from the Vagrant machine, but the graylog2 server seems to fail to find it and stops running with:
ERROR: Could not successfully connect to ElasticSearch. Check that
your cluster state is not RED and that ElasticSearch is running
properly.
Forwarding port 9300 from the Vagrant machine as well changed the error to:
Caused by: org.elasticsearch.common.netty.channel.ChannelException:
Failed to bind to: 0.0.0.0/0.0.0.0:9350
I tried this setting in the graylog conf file:
elasticsearch_network_host = 192.168.50.4
but that only changes the error to an exception about failing to bind:
Caused by: org.elasticsearch.common.netty.channel.ChannelException:
Failed to bind to: /192.168.50.4:9350 at
org.elasticsearch.common.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
But that didn't help.
I'd be glad for any direction on what I am doing wrong (either with the Elasticsearch configuration, the Vagrant setup, or graylog2).
Thanks!
Update: following the advice in the answer below, I changed the following config:
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.50.4:9300
I now get this error:
2014-06-16 23:04:34,946 WARN : org.elasticsearch.transport.netty - [graylog2-server] Message not fully read (response) for [6] handler org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$4#67bd250a, error [true], resetting
2014-06-16 23:04:36,451 WARN : org.elasticsearch.discovery.zen.ping.unicast - [graylog2-server] failed to send ping to [[#zen_unicast_1#][inet[/192.168.50.4:9300]]]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:169)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:123)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.InvalidClassException: failed to read class descriptor
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1603)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
It looks like graylog2 still fails to connect to Elasticsearch correctly.
Details (update): graylog2-server-0.20.2, elasticsearch 1.1.0 (I think) - I can replace it if that's the problem. Java: OpenJDK 64-Bit, version "1.7.0_55".
More updates (thanks @sheena): when downgrading the Elasticsearch version to 0.90.10 we got some progress, but it's still not working.
Here is the current log:
2014-06-17 13:27:16,394 INFO : org.graylog2.Main - Graylog2 0.20.2 starting up. (JRE: Oracle Corporation 1.7.0_55 on Linux 3.13.0-29-generic)
2014-06-17 13:27:16,475 INFO : org.graylog2.plugin.system.NodeId - Node ID: e7245f12-2e8b-4803-9e88-7529169b5a91
2014-06-17 13:27:16,670 INFO : org.graylog2.buffers.ProcessBuffer - Initialized ProcessBuffer with ring size <1024> and wait strategy <BlockingWaitStrategy>.
2014-06-17 13:27:16,692 INFO : org.graylog2.buffers.OutputBuffer - Initialized OutputBuffer with ring size <1024> and wait strategy <BlockingWaitStrategy>.
2014-06-17 13:27:16,964 DEBUG: com.ning.http.client.providers.netty.NettyAsyncHttpProvider - Number of application's worker threads is 8
2014-06-17 13:27:17,272 INFO : org.elasticsearch.node - [graylog2-server] version[0.90.10], pid[24419], build[0a5781f/2014-01-10T10:18:37Z]
2014-06-17 13:27:17,273 INFO : org.elasticsearch.node - [graylog2-server] initializing ...
2014-06-17 13:27:17,273 DEBUG: org.elasticsearch.node - [graylog2-server] using home [/home/alon/Downloads/graylog2-server-0.20.2], config [/home/alon/Downloads/graylog2-server-0.20.2/config], data [[/home/alon/Downloads/graylog2-server-0.20.2/data]], logs [/home/alon/Downloads/graylog2-server-0.20.2/logs], work [/home/alon/Downloads/graylog2-server-0.20.2/work], plugins [/home/alon/Downloads/graylog2-server-0.20.2/plugins]
2014-06-17 13:27:17,281 INFO : org.elasticsearch.plugins - [graylog2-server] loaded [], sites []
2014-06-17 13:27:17,320 DEBUG: org.elasticsearch.common.compress.lzf - using [UnsafeChunkDecoder] decoder
2014-06-17 13:27:18,655 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [generic], type [cached], keep_alive [30s]
2014-06-17 13:27:18,740 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [index], type [fixed], size [4], queue_size [200]
2014-06-17 13:27:18,744 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [bulk], type [fixed], size [4], queue_size [50]
2014-06-17 13:27:18,745 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [get], type [fixed], size [4], queue_size [1k]
2014-06-17 13:27:18,745 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [search], type [fixed], size [12], queue_size [1k]
2014-06-17 13:27:18,745 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [suggest], type [fixed], size [4], queue_size [1k]
2014-06-17 13:27:18,745 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [percolate], type [fixed], size [4], queue_size [1k]
2014-06-17 13:27:18,746 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [management], type [scaling], min [1], size [5], keep_alive [5m]
2014-06-17 13:27:18,747 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [flush], type [scaling], min [1], size [2], keep_alive [5m]
2014-06-17 13:27:18,747 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [merge], type [scaling], min [1], size [2], keep_alive [5m]
2014-06-17 13:27:18,747 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [refresh], type [scaling], min [1], size [2], keep_alive [5m]
2014-06-17 13:27:18,748 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [warmer], type [scaling], min [1], size [2], keep_alive [5m]
2014-06-17 13:27:18,748 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [snapshot], type [scaling], min [1], size [2], keep_alive [5m]
2014-06-17 13:27:18,748 DEBUG: org.elasticsearch.threadpool - [graylog2-server] creating thread_pool [optimize], type [fixed], size [1], queue_size [null]
2014-06-17 13:27:18,768 DEBUG: org.elasticsearch.transport.netty - [graylog2-server] using worker_count[8], port[9350], bind_host[null], publish_host[null], compress[false], connect_timeout[30s], connections_per_node[2/3/6/1/1], receive_predictor[512kb->512kb]
2014-06-17 13:27:18,784 DEBUG: org.elasticsearch.discovery.zen.ping.unicast - [graylog2-server] using initial hosts [192.168.50.4:9300], with concurrent_connects [10]
2014-06-17 13:27:18,787 DEBUG: org.elasticsearch.discovery.zen - [graylog2-server] using ping.timeout [3s], master_election.filter_client [true], master_election.filter_data [false]
2014-06-17 13:27:18,788 DEBUG: org.elasticsearch.discovery.zen.elect - [graylog2-server] using minimum_master_nodes [-1]
2014-06-17 13:27:18,790 DEBUG: org.elasticsearch.discovery.zen.fd - [graylog2-server] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
2014-06-17 13:27:18,801 DEBUG: org.elasticsearch.discovery.zen.fd - [graylog2-server] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
2014-06-17 13:27:18,845 DEBUG: org.elasticsearch.monitor.jvm - [graylog2-server] enabled [true], last_gc_enabled [false], interval [1s], gc_threshold [{old=GcThreshold{name='old', warnThreshold=10000, infoThreshold=5000, debugThreshold=2000}, default=GcThreshold{name='default', warnThreshold=10000, infoThreshold=5000, debugThreshold=2000}, young=GcThreshold{name='young', warnThreshold=1000, infoThreshold=700, debugThreshold=400}}]
2014-06-17 13:27:18,846 DEBUG: org.elasticsearch.monitor.os - [graylog2-server] Using probe [org.elasticsearch.monitor.os.JmxOsProbe#7b01e044] with refresh_interval [1s]
2014-06-17 13:27:18,849 DEBUG: org.elasticsearch.monitor.process - [graylog2-server] Using probe [org.elasticsearch.monitor.process.JmxProcessProbe#3103c203] with refresh_interval [1s]
2014-06-17 13:27:18,854 DEBUG: org.elasticsearch.monitor.jvm - [graylog2-server] Using refresh_interval [1s]
2014-06-17 13:27:18,854 DEBUG: org.elasticsearch.monitor.network - [graylog2-server] Using probe [org.elasticsearch.monitor.network.JmxNetworkProbe#1cc7580f] with refresh_interval [5s]
2014-06-17 13:27:18,857 DEBUG: org.elasticsearch.monitor.network - [graylog2-server] net_info
host [stox-alonisser]
vboxnet0 display_name [vboxnet0]
address [/fe80:0:0:0:800:27ff:fe00:0%4] [/192.168.50.1]
mtu [1500] multicast [true] ptp [false] loopback [false] up [true] virtual [false]
wlan0 display_name [wlan0]
address [/fe80:0:0:0:e8b:fdff:fe62:dc9d%3] [/192.168.20.107]
mtu [1500] multicast [true] ptp [false] loopback [false] up [true] virtual [false]
lo display_name [lo]
address [/0:0:0:0:0:0:0:1%1] [/127.0.0.1]
mtu [65536] multicast [false] ptp [false] loopback [true] up [true] virtual [false]
2014-06-17 13:27:18,858 DEBUG: org.elasticsearch.monitor.fs - [graylog2-server] Using probe [org.elasticsearch.monitor.fs.JmxFsProbe#2c8807d7] with refresh_interval [1s]
2014-06-17 13:27:19,196 DEBUG: org.elasticsearch.indices.store - [graylog2-server] using indices.store.throttle.type [MERGE], with index.store.throttle.max_bytes_per_sec [20mb]
2014-06-17 13:27:19,204 DEBUG: org.elasticsearch.cache.memory - [graylog2-server] using bytebuffer cache with small_buffer_size [1kb], large_buffer_size [1mb], small_cache_size [10mb], large_cache_size [500mb], direct [true]
2014-06-17 13:27:19,220 DEBUG: org.elasticsearch.script - [graylog2-server] using script cache with max_size [500], expire [null]
2014-06-17 13:27:19,234 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using node_concurrent_recoveries [2], node_initial_primaries_recoveries [4]
2014-06-17 13:27:19,235 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster.routing.allocation.allow_rebalance] with [indices_all_active]
2014-06-17 13:27:19,236 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster_concurrent_rebalance] with [2]
2014-06-17 13:27:19,243 DEBUG: org.elasticsearch.gateway.local - [graylog2-server] using initial_shards [quorum], list_timeout [30s]
2014-06-17 13:27:19,424 DEBUG: org.elasticsearch.indices.recovery - [graylog2-server] using max_bytes_per_sec[20mb], concurrent_streams [3], file_chunk_size [512kb], translog_size [512kb], translog_ops [1000], and compress [true]
2014-06-17 13:27:19,486 DEBUG: org.elasticsearch.indices.memory - [graylog2-server] using index_buffer_size [265.4mb], with min_shard_index_buffer_size [4mb], max_shard_index_buffer_size [512mb], shard_inactive_time [30m]
2014-06-17 13:27:19,487 DEBUG: org.elasticsearch.indices.cache.filter - [graylog2-server] using [node] weighted filter cache with size [20%], actual_size [530.8mb], expire [null], clean_interval [1m]
2014-06-17 13:27:19,489 DEBUG: org.elasticsearch.indices.fielddata.cache - [graylog2-server] using size [-1] [-1b], expire [null]
2014-06-17 13:27:19,507 DEBUG: org.elasticsearch.gateway.local.state.meta - [graylog2-server] using gateway.local.auto_import_dangled [YES], with gateway.local.dangling_timeout [2h]
2014-06-17 13:27:19,511 DEBUG: org.elasticsearch.bulk.udp - [graylog2-server] using enabled [false], host [null], port [9700-9800], bulk_actions [1000], bulk_size [5mb], flush_interval [5s], concurrent_requests [4]
2014-06-17 13:27:19,514 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using node_concurrent_recoveries [2], node_initial_primaries_recoveries [4]
2014-06-17 13:27:19,514 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster.routing.allocation.allow_rebalance] with [indices_all_active]
2014-06-17 13:27:19,515 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster_concurrent_rebalance] with [2]
2014-06-17 13:27:19,516 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using node_concurrent_recoveries [2], node_initial_primaries_recoveries [4]
2014-06-17 13:27:19,516 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster.routing.allocation.allow_rebalance] with [indices_all_active]
2014-06-17 13:27:19,516 DEBUG: org.elasticsearch.cluster.routing.allocation.decider - [graylog2-server] using [cluster_concurrent_rebalance] with [2]
2014-06-17 13:27:19,528 INFO : org.elasticsearch.node - [graylog2-server] initialized
2014-06-17 13:27:19,529 INFO : org.elasticsearch.node - [graylog2-server] starting ...
2014-06-17 13:27:19,552 DEBUG: org.elasticsearch.netty.channel.socket.nio.SelectorUtil - Using select timeout of 500
2014-06-17 13:27:19,552 DEBUG: org.elasticsearch.netty.channel.socket.nio.SelectorUtil - Epoll-bug workaround enabled = false
2014-06-17 13:27:19,618 DEBUG: org.elasticsearch.transport.netty - [graylog2-server] Bound to address [/0:0:0:0:0:0:0:0:9350]
2014-06-17 13:27:19,622 INFO : org.elasticsearch.transport - [graylog2-server] bound_address {inet[/0:0:0:0:0:0:0:0:9350]}, publish_address {inet[/192.168.20.107:9350]}
2014-06-17 13:27:19,658 DEBUG: org.elasticsearch.transport.netty - [graylog2-server] connected to node [[#zen_unicast_1#][inet[/192.168.50.4:9300]]]
2014-06-17 13:27:22,628 WARN : org.elasticsearch.discovery - [graylog2-server] waited for 3s and no initial state was set by the discovery
2014-06-17 13:27:22,628 INFO : org.elasticsearch.discovery - [graylog2-server] graylog2/vWsYLp5JQoOJMva0FZgRsA
2014-06-17 13:27:22,629 DEBUG: org.elasticsearch.gateway - [graylog2-server] can't wait on start for (possibly) reading state from gateway, will do it asynchronously
2014-06-17 13:27:22,629 INFO : org.elasticsearch.node - [graylog2-server] started
2014-06-17 13:27:22,642 DEBUG: org.elasticsearch.transport.netty - [graylog2-server] disconnected from [[#zen_unicast_1#][inet[/192.168.50.4:9300]]]
2014-06-17 13:27:22,644 DEBUG: org.elasticsearch.discovery.zen - [graylog2-server] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[Crimson Daffodil][vPHcWzoCQteDG19hofaayA][inet[/10.0.2.15:9300]]], master [[Crimson Daffodil][vPHcWzoCQteDG19hofaayA][inet[/10.0.2.15:9300]]]
2014-06-17 13:27:27,634 ERROR: org.graylog2.Main -
elasticsearch_network_host is not what you think. It is about the Elasticsearch client within graylog, not the Elasticsearch server you want to connect to. So graylog is trying to listen on 192.168.50.4, which isn't a valid IP address on the graylog system (your dev machine).
You most likely want to set these variables in graylog2 config:
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.50.4:9300
Here is where I got stuck, but that was because I had elasticsearch 1.0 installed when I needed 0.90. I'll know more once my puppet/vagrant stack finishes re-provisioning. =)
EDIT: Mine is working now.
