I am using Spring Boot 2.2 together with a Kafka cluster (Bitnami Helm chart) and I'm seeing some pretty strange behaviour.
I have a Spring Boot app with several consumers on several topics.
Calling kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-app gives:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-app event.topic-a 0 2365079 2365090 11 consumer-4-0c9a5616-3e96-413b-b770-b813c3d38a28 /10.244.3.47 consumer-4
my-app event.topic-a 1 2365080 2365091 11 consumer-4-0c9a5616-3e96-413b-b770-b813c3d38a28 /10.244.3.47 consumer-4
my-app batch.topic-a 0 278363 278363 0 consumer-3-14cb199e-646f-46ad-8ee2-98f37107fa37 /10.244.3.47 consumer-3
my-app batch.topic-a 1 278362 278362 0 consumer-3-14cb199e-646f-46ad-8ee2-98f37107fa37 /10.244.3.47 consumer-3
my-app batch.topic-b 0 1434 1434 0 consumer-5-a2f940c8-75e6-43d2-8d79-77d03e1ad640 /10.244.3.47 consumer-5
my-app event.topic-b 0 2530 2530 0 consumer-6-45a32d6d-eac9-4abe-b14f-47173338e62c /10.244.3.47 consumer-6
my-app batch.topic-c 0 1779 1779 0 consumer-1-d935a29f-ad3c-4292-9ace-5efdfff864d6 /10.244.3.47 consumer-1
my-app event.topic-c 0 12308 13502 1194 - - -
Calling it again gives:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-app event.topic-a 0 2365230 2365245 15 consumer-4-0c9a5616-3e96-413b-b770-b813c3d38a28 /10.244.3.47 consumer-4
my-app event.topic-a 1 2365231 2365246 15 consumer-4-0c9a5616-3e96-413b-b770-b813c3d38a28 /10.244.3.47 consumer-4
my-app batch.topic-a 0 278363 278363 0 consumer-3-14cb199e-646f-46ad-8ee2-98f37107fa37 /10.244.3.47 consumer-3
my-app batch.topic-a 1 278362 278362 0 consumer-3-14cb199e-646f-46ad-8ee2-98f37107fa37 /10.244.3.47 consumer-3
my-app batch.topic-b 0 1434 1434 0 consumer-5-a2f940c8-75e6-43d2-8d79-77d03e1ad640 /10.244.3.47 consumer-5
my-app event.topic-b 0 2530 2530 0 consumer-6-45a32d6d-eac9-4abe-b14f-47173338e62c /10.244.3.47 consumer-6
my-app batch.topic-c 0 1779 1779 0 consumer-1-d935a29f-ad3c-4292-9ace-5efdfff864d6 /10.244.3.47 consumer-1
my-app event.topic-c 0 12308 13505 1197 consumer-2-d52e2b96-f08c-4247-b827-4464a305cb20 /10.244.3.47 consumer-2
As you can see, the consumer for event.topic-c is now there, but it lags 1197 entries.
The app itself reads from the topic, but it always receives the same events (roughly the amount of the lag) and the offset never changes.
I get no errors or log entries, either from Kafka or from Spring Boot.
All I see is that, for this specific topic, the same events are processed again and again; all the other topics in the app work correctly.
Here is the client config:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [kafka:9092]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = sap-integration
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
Any idea? I am a little bit lost.
Edit:
Spring config is pretty standard:
configProps[ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG] = bootstrapAddress
configProps[ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG] = StringDeserializer::class.java
configProps[ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG] = MyJsonDeserializer::class.java
configProps[JsonDeserializer.TRUSTED_PACKAGES] = "*"
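For context, a property map like this is normally handed to a DefaultKafkaConsumerFactory and a listener container factory. A minimal sketch of that wiring (in Java; the bean layout is my illustration, not code from the post):

import java.util.HashMap;
import java.util.Map;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    // filled with the entries shown above
    private final Map<String, Object> configProps = new HashMap<>();

    @Bean
    public ConsumerFactory<String, Object> consumerFactory() {
        return new DefaultKafkaConsumerFactory<>(configProps);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, Object> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }
}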
Here are some examples from the logs:
2019-11-01 18:39:46.268 DEBUG 1 --- [ntainer#0-0-C-1] .a.RecordMessagingMessageListenerAdapter : Processing [GenericMessage [payload=..., headers={kafka_offset=37603361, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer#6ca11277, kafka_timestampType=CREATE_TIME, kafka_receivedMessageKey=null, kafka_receivedPartitionId=0, kafka_receivedTopic=topic-c, kafka_receivedTimestamp=1572633584589, kafka_groupId=my-app}]]
2019-11-01 18:39:46.268 DEBUG 1 --- [ntainer#0-0-C-1] .a.RecordMessagingMessageListenerAdapter : Processing [GenericMessage [payload=..., headers={kafka_offset=37603362, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer#6ca11277, kafka_timestampType=CREATE_TIME, kafka_receivedMessageKey=null, kafka_receivedPartitionId=0, kafka_receivedTopic=topic-c, kafka_receivedTimestamp=1572633584635, kafka_groupId=my-app}]]
2019-11-01 18:39:46.268 DEBUG 1 --- [ntainer#0-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {topic-c-0=OffsetAndMetadata{offset=37603363, leaderEpoch=null, metadata=''}}
2019-11-01 18:39:46.268 DEBUG 1 --- [ntainer#0-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {topic-c-0=OffsetAndMetadata{offset=37603363, leaderEpoch=null, metadata=''}}
....
2019-11-01 18:39:51.475 DEBUG 1 --- [ntainer#0-0-C-1] essageListenerContainer$ListenerConsumer : Received: 0 records
2019-11-01 18:39:51.475 DEBUG 1 --- [ntainer#0-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {}
while the consumer is lagging:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-app topic-c 0 37603363 37720873 117510 consumer-3-2b8499c0-7304-4906-97f8-9c0f6088c469 /10.244.3.64 consumer-3
No error, no warning... just no more messages.
Thanks
You need to look for logs like this...
2019-11-01 16:33:31.825 INFO 35182 --- [ kgh1231-0-C-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-2, groupId=kgh1231] (Re-)joining group
...
2019-11-01 16:33:31.872 INFO 35182 --- [ kgh1231-0-C-1] o.s.k.l.KafkaMessageListenerContainer : partitions assigned: [kgh1231-0, kgh1231-2, kgh1231-1, kgh1231-4, kgh1231-3]
...
2019-11-01 16:33:31.897 DEBUG 35182 --- [ kgh1231-0-C-1] essageListenerContainer$ListenerConsumer : Received: 10 records
...
2019-11-01 16:33:31.902 DEBUG 35182 --- [ kgh1231-0-C-1] .a.RecordMessagingMessageListenerAdapter : Processing [GenericMessage [payload=foo1, headers={kafka_offset=80, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer#3d00c543, kafka_timestampType=CREATE_TIME, kafka_receivedMessageKey=null, kafka_receivedPartitionId=0, kafka_receivedTopic=kgh1231, kafka_receivedTimestamp=1572640411869}]]
...
2019-11-01 16:33:31.906 DEBUG 35182 --- [ kgh1231-0-C-1] .a.RecordMessagingMessageListenerAdapter : Processing [GenericMessage [payload=foo5, headers={kafka_offset=61, kafka_consumer=org.apache.kafka.clients.consumer.KafkaConsumer#3d00c543, kafka_timestampType=CREATE_TIME, kafka_receivedMessageKey=null, kafka_receivedPartitionId=3, kafka_receivedTopic=kgh1231, kafka_receivedTimestamp=1572640411870}]]
2019-11-01 16:33:31.907 DEBUG 35182 --- [ kgh1231-0-C-1] essageListenerContainer$ListenerConsumer : Commit list: {kgh1231-0=OffsetAndMetadata{offset=82, metadata=''}, kgh1231-2=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-1=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-4=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-3=OffsetAndMetadata{offset=62, metadata=''}}
2019-11-01 16:33:31.908 DEBUG 35182 --- [ kgh1231-0-C-1] essageListenerContainer$ListenerConsumer : Committing: {kgh1231-0=OffsetAndMetadata{offset=82, metadata=''}, kgh1231-2=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-1=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-4=OffsetAndMetadata{offset=62, metadata=''}, kgh1231-3=OffsetAndMetadata{offset=62, metadata=''}}
If you don't see anything like that then your consumer is not configured properly.
If you can't figure it out, post your log someplace like PasteBin.
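If those categories do not show up in your log at all, enable DEBUG logging for Spring Kafka first; in a Spring Boot application.properties this is typically:

logging.level.org.springframework.kafka=DEBUG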
Mysteriously fixed this issue by doing all of the following:
Upgrade Kafka to a newer version
Upgrade Spring Boot to a newer version
Improve the performance of the software
Switch to batch processing (sketched below)
Add a health check and combine it with a liveness probe
It has now run for more than a week without that error appearing again.
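For reference, "switch to batch processing" here means a Spring Kafka batch listener. A minimal sketch (my own illustration, assuming spring-kafka; topic and bean names are placeholders):

import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class BatchConsumerSketch {

    // container factory that hands the listener the whole poll() result at once
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Object> batchFactory(
            ConsumerFactory<String, Object> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, Object> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        return factory;
    }

    // one invocation per poll, up to max.poll.records messages
    @KafkaListener(topics = "event.topic-c", containerFactory = "batchFactory")
    public void onBatch(List<ConsumerRecord<String, Object>> records) {
        // process the whole batch; one slow record no longer stalls each poll cycle
    }
}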
I'm writing a Java Spring Boot app that connects to MQ, reads messages (placed by another app), and returns a response to them.
I'm trying to set the correlation ID to the message ID we've just received.
public void sendMessage(String destination, Item item, Message originalMessage) throws IOException {
    jmsTemplate.send(destination, session -> {
        Message message = createBytesMessage(item, session);
        // correlate the reply with the message we just received
        message.setJMSCorrelationID(originalMessage.getJMSMessageID());
        log.info("Message: " + message);
        return message;
    });
}
The original message ID is:
ID:414d5120514d44454c30315020202020f8af4d6202558722
The reply message has a correlation ID as expected:
ID:414d5120514d44454c30315020202020f8af4d6202558722
However, when I investigate MQ I can see the correlation ID is actually:
343134643531323035313464343434353463333033313530
Curiously, if I take my expected value for the correlation ID and convert it to hexadecimal, I get the following:
343134643531323035313464343434353463333033313530323032303230323066386166346436323032383338343232
// The first 48 characters match what is in MQ
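To make that concrete, here is a minimal, self-contained sketch (my own illustration, not code from the app) that hex-encodes the ASCII bytes of the expected correlation ID string. Its first 48 output characters are exactly the value MQ reports, which suggests MQ's 24-byte correlation ID field was filled with the first 24 characters of the string rather than with the decoded binary value:

import java.nio.charset.StandardCharsets;

public class CorrelationIdHexDemo {
    public static void main(String[] args) {
        // expected correlation ID, without the "ID:" prefix
        String expected = "414d5120514d44454c30315020202020f8af4d6202558722";
        StringBuilder hex = new StringBuilder();
        for (byte b : expected.getBytes(StandardCharsets.US_ASCII)) {
            hex.append(String.format("%02x", b));
        }
        // the first 48 characters printed are
        // 343134643531323035313464343434353463333033313530,
        // exactly what MQ shows as the correlation ID
        System.out.println(hex);
    }
}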
Here is the debug information from the Java application:
This is the original message:
JMSMessage class: jms_bytes
JMSType: null
JMSDeliveryMode: 1
JMSDeliveryDelay: 0
JMSDeliveryTime: 0
JMSExpiration: 0
JMSPriority: 0
JMSMessageID: ID:414d5120514d44454c30315020202020f8af4d6202558722
JMSTimestamp: 1649452421210
JMSCorrelationID: null
JMSDestination: null
JMSRedelivered: false
JMSXAppID: DataFlowEngine
JMSXDeliveryCount: 1
JMSXUserID: mqm
JMS_IBM_Character_Set: UTF-8
JMS_IBM_Encoding: 546
JMS_IBM_Format:
JMS_IBM_MsgType: 1
JMS_IBM_PutApplType: 6
JMS_IBM_PutDate: 20220408
JMS_IBM_PutTime: 21134121
This is the debug output for the outbound message:
2022-04-08 22:13:42.020 DEBUG 31180 --- [enerContainer-1] o.springframework.jms.core.JmsTemplate : Sending created message:
JMSMessage class: jms_bytes
JMSType: null
JMSDeliveryMode: 2
JMSDeliveryDelay: 0
JMSDeliveryTime: 0
JMSExpiration: 0
JMSPriority: 4
JMSMessageID: null
JMSTimestamp: 0
JMSCorrelationID: ID:414d5120514d44454c30315020202020f8af4d6202558722
JMSDestination: null
JMSReplyTo: null
JMSRedelivered: false
7b22747261636b696e674964223a312c22737461747573223a2244454c495645524544222c226465
6c697665727944617465223a22323032322d30312d3031222c2274797065223a225345434f4e445f
434c415353227d
I'm using a sample JMeter test plan with 1 Thread Group running 5 VUs for 60 seconds:
JMeter Test Plan
I am running JMeter in non-GUI, distributed mode. I have one master coordinating the test and one slave running the JMX test plan. I am using the "influxdbMetricsSender" BackendListener to store test data in InfluxDB so that my Grafana dashboard can display real-time results to the user.
I am using the Summariser to debug the master.
I have configured the JMX (via Java code) to use a ResultCollector with a Summariser:
...
HashTree clonedTree = JMeter.convertSubTree(tree, true);
String summariserName = JMeterUtils.getPropDefault("summariser.name", "");
Summariser summariser = new Summariser(summariserName);
ResultCollector resultCollector = new ResultCollector(summariser);
resultCollector.setFilename("<path_to_logfile>");
clonedTree.add(clonedTree.getArray()[0], resultCollector);
...
I have also configured a BackendListener (via Java code):
public static void activateInfluxDBListener(HashTree testPlanTree, String scheme) {
SearchByClass<org.apache.jmeter.testelement.TestPlan> ts = new SearchByClass<>(
org.apache.jmeter.testelement.TestPlan.class);
testPlanTree.traverse(ts);
Collection<org.apache.jmeter.testelement.TestPlan> objs = ts.getSearchResults();
if (objs.size() != 1) {
logger.error("Testplans {} found in jmx are more than expeected size of 1", objs.size());
throw new TestPlanJmxMalformedException("TestPlan entries found in jmx must be exactly one");
}
Iterator<org.apache.jmeter.testelement.TestPlan> testPlanIter = objs.iterator();
org.apache.jmeter.testelement.TestPlan theTestPlan = testPlanIter.next();
logger.info("Test Plan {}", theTestPlan);
BackendListener beListener = new BackendListener();
beListener.setProperty(TestElement.TEST_CLASS, BackendListener.class.getName());
beListener.setProperty(TestElement.GUI_CLASS, BackendListenerGui.class.getName());
beListener.setName("InfluxDB Backend Listener");
beListener.setEnabled(true);
Arguments arguments = new Arguments();
arguments.setProperty(TestElement.GUI_CLASS, ArgumentsPanel.class.getName());
arguments.setProperty(TestElement.TEST_CLASS, Arguments.class.getName());
arguments.setEnabled(true);
arguments.addArgument("influxdbMetricsSender",
"org.apache.jmeter.visualizers.backend.influxdb.HttpMetricsSender");
arguments.addArgument("influxdbUrl", scheme + "influx-svc:8086/write?db=jmeterdb", "=");
arguments.addArgument("application", "TestApp", "=");
arguments.addArgument("measurement", "jmeter", "=");
arguments.addArgument("summaryOnly", "false", "=");
arguments.addArgument("samplersRegex", ".*", "=");
arguments.addArgument("percentiles", "90;95;99", "=");
arguments.addArgument("testTitle", "Test name", "=");
arguments.addArgument("eventTags", "", "=");
beListener.setClassname(
org.apache.jmeter.visualizers.backend.influxdb.InfluxdbBackendListenerClient.class.getCanonicalName());
beListener.setArguments(arguments);
testPlanTree.add(theTestPlan, beListener);
logger.info("Registered influx backend listener with url {}",
scheme + "influx-svc:8086/write?db=jmeterdb");
}
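For completeness: since the URL is assembled as scheme + "influx-svc:8086/write?db=jmeterdb", the scheme argument is just the protocol prefix, so a hypothetical call looks like:

activateInfluxDBListener(testPlanTree, "http://");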
When I check the master logs I get the Summariser sample data, so the RMI connection between the slave and master is working correctly, but none of the thread counts come through (Active, Started, and Finished all report 0):
2022-03-03 16:05:30.438 INFO 14 --- [3)-10.240.1.210] org.apache.jmeter.reporters.Summariser : summary + 1002 in 00:00:24 = 42.0/s Avg: 27 Min: 1 Max: 2499 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
summary + 1002 in 00:00:24 = 42.0/s Avg: 27 Min: 1 Max: 2499 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
2022-03-03 16:06:00.237 INFO 14 --- [3)-10.240.1.210] org.apache.jmeter.reporters.Summariser : summary + 2300 in 00:00:30 = 77.2/s Avg: 7 Min: 1 Max: 352 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
summary + 2300 in 00:00:30 = 77.2/s Avg: 7 Min: 1 Max: 352 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
2022-03-03 16:06:00.238 INFO 14 --- [3)-10.240.1.210] org.apache.jmeter.reporters.Summariser : summary = 3302 in 00:00:54 = 61.5/s Avg: 13 Min: 1 Max: 2499 Err: 0 (0.00%)
summary = 3302 in 00:00:54 = 61.5/s Avg: 13 Min: 1 Max: 2499 Err: 0 (0.00%)
2022-03-03 16:06:08.003 INFO 14 --- [ Thread-2] o.a.j.v.backend.BackendListener : Worker ended
2022-03-03 16:06:08.004 INFO 14 --- [1)-10.240.1.210] .a.j.v.b.i.InfluxdbBackendListenerClient : Sending last metrics
2022-03-03 16:06:08.005 INFO 14 --- [1)-10.240.1.210] o.a.j.v.b.influxdb.HttpMetricsSender : Destroying
2022-03-03 16:06:08.091 INFO 14 --- [3)-10.240.1.210] org.apache.jmeter.reporters.Summariser : summary + 741 in 00:00:08 = 94.4/s Avg: 5 Min: 1 Max: 188 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
summary + 741 in 00:00:08 = 94.4/s Avg: 5 Min: 1 Max: 188 Err: 0 (0.00%) Active: 0 Started: 0 Finished: 0
2022-03-03 16:06:08.091 INFO 14 --- [3)-10.240.1.210] org.apache.jmeter.reporters.Summariser : summary = 4043 in 00:01:02 = 65.7/s Avg: 12 Min: 1 Max: 2499 Err: 0 (0.00%)
summary = 4043 in 00:01:02 = 65.7/s Avg: 12 Min: 1 Max: 2499 Err: 0 (0.00%)
The Grafana dashboard (using the InfluxDB data) shows Active Thread count as 0 too:
Grafana Dashboard
However, when I generate a JMeter report from the result JTL file, it shows the correct Active Thread Count data (5 VUs constant until the test ends):
JMeter HTML report - Active Threads Over Time
Is there something I'm missing?
I was able to figure it out.
There was a bug report for this exact issue from 2012: https://bz.apache.org/bugzilla/show_bug.cgi?id=54152
The solution is to add a "dummy" test element to your JMX as a workaround:
var remoteListener = new RemoteThreadsListenerTestElement();
clonedTree.add(clonedTree.getArray()[0], remoteListener);
XML element:
<org.apache.jmeter.threads.RemoteThreadsListenerTestElement/>
The JMeter client class org.apache.jmeter.engine.ConvertListeners finds this TestElement and replaces it with a RemoteThreadsListener, which properly reports the thread counts to the JMeterContextService that both the Summariser and the InfluxdbBackendListenerClient check periodically.
After this change, the Summariser and my Grafana dashboards show the correct number of JMeter VUs.
I'm training DialoGPT on my own dataset, following this tutorial.
When I follow the tutorial exactly with the provided dataset, I have no issues. I then swapped in my own dataset; the only difference between the example and my code is that my dataset is 256397 lines long compared to the tutorial's 1906 lines.
I am not sure whether the error relates to the column labels in my dataset, to an issue in one of the text values in a particular row, or to the size of my data.
06/12/2020 09:23:08 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.tokenization_utils - Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json from cache at cached/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt from cache at cached/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/added_tokens.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/special_tokens_map.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/tokenizer_config.json from cache at None
06/12/2020 09:23:19 - INFO - filelock - Lock 140392381680496 acquired on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:19 - INFO - transformers.file_utils - https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin not found in cache or force_download set to True, downloading to /content/drive/My Drive/Colab Notebooks/cached/tmpj1dveq14
Downloading: 100%
351M/351M [00:34<00:00, 10.2MB/s]
06/12/2020 09:23:32 - INFO - transformers.file_utils - storing https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin in cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:32 - INFO - transformers.file_utils - creating metadata file for cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:33 - INFO - filelock - Lock 140392381680496 released on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:33 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin from cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:39 - INFO - transformers.modeling_utils - Weights of GPT2LMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias']
06/12/2020 09:23:54 - INFO - __main__ - Training/evaluation parameters <__main__.Args object at 0x7fafa60a00f0>
06/12/2020 09:23:54 - INFO - __main__ - Creating features from dataset file at cached
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-523c0d2a27d3> in <module>()
----> 1 main(trn_df, val_df)
7 frames
<ipython-input-11-d6dfa312b1f5> in main(df_trn, df_val)
59 # Training
60 if args.do_train:
---> 61 train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)
62
63 global_step, tr_loss = train(args, train_dataset, model, tokenizer)
<ipython-input-9-3c4f1599e14e> in load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate)
40
41 def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
---> 42 return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)
43
44 def set_seed(args):
<ipython-input-9-3c4f1599e14e> in __init__(self, tokenizer, args, df, block_size)
24 self.examples = []
25 for _, row in df.iterrows():
---> 26 conv = construct_conv(row, tokenizer)
27 self.examples.append(conv)
28
<ipython-input-9-3c4f1599e14e> in construct_conv(row, tokenizer, eos)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
<ipython-input-9-3c4f1599e14e> in <listcomp>(.0)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, **kwargs)
1432 pad_to_max_length=pad_to_max_length,
1433 return_tensors=return_tensors,
-> 1434 **kwargs,
1435 )
1436
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1574 )
1575
-> 1576 first_ids = get_input_ids(text)
1577 second_ids = get_input_ids(text_pair) if text_pair is not None else None
1578
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
1554 else:
1555 raise ValueError(
-> 1556 "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
1557 )
1558
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
I am running Hive on MapReduce and some of the mappers run for too long, ~8 hrs (mostly the last few mappers).
I can see a lot of [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 59 and
org.apache.hadoop.mapred.MapTask: Spilling map output entries in the logs.
I need your help tuning this.
Please find below the sample query I am running:
CREATE TABLE schema.test_t AS
SELECT
demo,
col1,
col2 as col2,
col3 as col3,
col4,
col5,
col6,
col7,
SUM(col8) AS col8,
COUNT(1) AS col9,
count(distinct col10) as col10,
col11,
col12
FROM
schema.srce_t
WHERE col13 IN ('a','b')
GROUP BY
col1,col2,col3,col4,col5,col6,col7,col11,col12
GROUPING SETS ((col1,col2,col3,col4,col5,col6,col7,col11,col12),
(col1,col11,col2,col3,col5,col6,col12,col7),
(col1,col11,col2,col3,col6,col12,col7),
(col1,col11,col2,col3,col4,col6,col12,col7),
(col1,col11,col2,col4,col5,col6,col12,col7),
(col1,col11,col2,col4,col6,col12,col7),
(col1,col11,col2,col5,col6,col12,col7),
(col1,col11,col4,col5,col6,col12,col7),
(col1,col11,col3,col4,col5,col6,col12,col7),
(col1,col11,col3,col5,col6,col12,col7),
(col1,col11,col3,col4,col6,col12,col7),
(col1,col11,col4,col6,col12,col7),
(col1,col11,col3,col6,col12,col7),
(col1,col11,col5,col6,col12,col7),
(col1,col11,col2, col6,col12,col7),
(col1,col11,col6, col12,col7));
Hive properties:
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
Logs:
2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 714424619; bufvoid = 1073741824
2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 232293228(929172912); length = 36142225/67108864
2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 750592747 kvi 187648180(750592720)
2019-05-15 05:34:41,305 INFO [main] org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: RS[4]: records written - 10000000
2019-05-15 05:35:01,944 INFO [SpillThread] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
2019-05-15 05:35:07,479 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2019-05-15 05:35:07,480 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 750592747 kv 187648180(750592720) kvi 178606160(714424640)
2019-05-15 05:35:34,178 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[13]: records read - 1000000
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 750592747; bufend = 390854476; bufvoid = 1073741791
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 187648180(750592720); kvend = 151400696(605602784); length = 36247485/67108864
2019-05-15 05:35:58,141 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 427407372 kvi 106851836(427407344)
2019-05-15 05:36:31,831 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1
2019-05-15 05:36:31,833 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 427407372 kv 106851836(427407344) kvi 97806648(391226592)
2019-05-15 05:37:19,180 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
Check the current values of these parameters and reduce them until you get more mappers running in parallel:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=16000; -- 16 KB
set mapreduce.input.fileinputformat.split.maxsize=128000000; -- 128Mb
-- files bigger than the max size will be split
-- files smaller than the min size will be combined and processed by the same mapper
If your files are in a non-splittable format such as gzip, this will not help.
Play with these settings to get more, smaller mappers (see the worked example below).
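As a worked example: with mapreduce.input.fileinputformat.split.maxsize=128000000, a splittable input file of 1,280,000,000 bytes is read by roughly 10 mappers; halving maxsize to 64000000 roughly doubles that to about 20 mappers, each processing half as much data.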
These settings may also help to improve the performance of the query:
set hive.optimize.distinct.rewrite=true;
set hive.map.aggr=true;
--if files are ORC, check PPD:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
I used OpenTSDB over HBase (pseudo-distributed Hadoop on VirtualBox) to send data at a very high load (~50,000 records/s). The system worked properly for a while, but it went down suddenly. I terminated OpenTSDB and HBase. Unfortunately, I could never bring them up again. Every time I tried to run HBase and OpenTSDB, they showed error logs. Here I list the logs:
regionserver:
2015-07-01 18:15:30,752 INFO [sync.3] wal.FSHLog: Slow sync cost: 112 ms, current pipeline: [192.168.56.101:50010]
2015-07-01 18:15:41,277 INFO [regionserver/node1.vmcluster/192.168.56.101:16201.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/node1.vmcluster,16201,1435738612093/node1.vmcluster%2C16201%2C1435738612093.default.1435742101122 with entries=3841, filesize=123.61 MB; new WAL /hbase/WALs/node1.vmcluster,16201,1435738612093/node1.vmcluster%2C16201%2C1435738612093.default.1435742141109
2015-07-01 18:15:41,278 INFO [regionserver/node1.vmcluster/192.168.56.101:16201.logRoller] wal.FSHLog: Archiving hdfs://node1.vmcluster:9000/hbase/WALs/node1.vmcluster,16201,1435738612093/node1.vmcluster%2C16201%2C1435738612093.default.1435742061805 to hdfs://node1.vmcluster:9000/hbase/oldWALs/node1.vmcluster%2C16201%2C1435738612093.default.1435742061805
2015-07-01 18:15:42,249 INFO [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72., current region memstore size 132.20 MB
2015-07-01 18:15:42,381 INFO [MemStoreFlusher.1] regionserver.HRegion: Started memstore flush for tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72., current region memstore size 133.09 MB
2015-07-01 18:15:42,382 WARN [MemStoreFlusher.1] regionserver.DefaultMemStore: Snapshot called again without clearing previous. Doing nothing. Another ongoing flush or did we fail last attempt?
2015-07-01 18:15:42,391 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server node1.vmcluster,16201,1435738612093: Replay of WAL required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72.
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2001)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1772)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1704)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:494)
at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1955)
... 7 more
2015-07-01 18:15:42,398 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
Master:
This is the same as the one on the region server:
2015-07-01 18:15:42,596 ERROR [B.defaultRpcServer.handler=15,queue=0,port=16020] master.MasterRpcServices: Region server node1.vmcluster,16201,1435738612093 reported a fatal error:
ABORTING region server node1.vmcluster,16201,1435738612093: Replay of WAL required. Forcing server shutdown
Cause:
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72.
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2001)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1772)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1704)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:494)
at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1955)
... 7 more
and this:
2015-07-01 18:17:16,971 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:21,976 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:26,979 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:31,983 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:36,985 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:41,992 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:45,283 WARN [CatalogJanitor-node1:16020] master.CatalogJanitor: Failed scan of catalog table
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions:
Wed Jul 01 18:17:45 KST 2015, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68275: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node1.vmcluster,16201,1435738612093, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:264)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:215)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:56)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:288)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:267)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:139)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:134)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:823)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:187)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:169)
at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:121)
at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:222)
at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:103)
at org.apache.hadoop.hbase.Chore.run(Chore.java:80)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68275: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node1.vmcluster,16201,1435738612093, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:310)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:291)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupConnection(RpcClientImpl.java:403)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:709)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:880)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:849)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1173)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:31889)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:344)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:188)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 6 more
2015-07-01 18:17:46,995 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:52,002 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:17:57,004 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:18:02,006 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-07-01 18:18:07,011 INFO [node1.vmcluster,16020,1435738420751.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fnode1.vmcluster%2C16201%2C1435738612093-splitting%2Fnode1.vmcluster%252C16201%252C1435738612093..meta.1435738616764.meta=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
After that, when I restarted HBase, the region server showed this error:
2015-07-02 09:17:49,151 INFO [RS_OPEN_REGION-node1:16201-0] regionserver.HRegion: Replaying edits from hdfs://node1.vmcluster:9000/hbase/data/default/tsdb/1a692e2668a2b4a71aaf2805f9b00a72/recovered.edits/0000000000000169657
2015-07-02 09:17:49,343 INFO [RS_OPEN_REGION-node1:16201-0] regionserver.HRegion: Started memstore flush for tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72., current region memstore size 132.20 MB; wal is null, using passed sequenceid=169615
2015-07-02 09:17:49,428 ERROR [RS_OPEN_REGION-node1:16201-0] handler.OpenRegionHandler: Failed open of region=tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72., starting to roll back the global memstore size.
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72.
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2001)
at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:3694)
at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3499)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:889)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:769)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:742)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4921)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4887)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4858)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4814)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4765)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:356)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:126)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:490)
at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1955)
... 16 more
2015-07-02 09:17:49,430 INFO [RS_OPEN_REGION-node1:16201-0] coordination.ZkOpenRegionCoordination: Opening of region {ENCODED => 1a692e2668a2b4a71aaf2805f9b00a72, NAME => 'tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72.', STARTKEY => '', ENDKEY => '\x00\x15\x08M|kp\x00\x00\x01\x00\x00\x01'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 1
2015-07-02 09:17:49,443 INFO [PriorityRpcServer.handler=9,queue=1,port=16201] regionserver.RSRpcServices: Open tsdb,,1435740133573.1a692e2668a2b4a71aaf2805f9b00a72.
2015-07-02 09:17:49,458 INFO [StoreOpener-1a692e2668a2b4a71aaf2805f9b00a72-1] hfile.CacheConfig: blockCache=LruBlockCache{blockCount=3, currentSize=533360, freeSize=509808592, maxSize=510341952, heapSize=533360, minSize=484824864, minFactor=0.95, multiSize=242412432, multiFactor=0.5, singleSize=121206216, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
2015-07-02 09:17:49,458 INFO [StoreOpener-1a692e2668a2b4a71aaf2805f9b00a72-1] compactions.CompactionConfiguration: size [134217728, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
2015-07-02 09:17:49,519 INFO [RS_OPEN_REGION-node1:16201-2] regionserver.HRegion: Replaying edits from hdfs://node1.vmcluster:9000/hbase/data/default/tsdb/1a692e2668a2b4a71aaf2805f9b00a72/recovered.edits/0000000000000169567
Update
Today I ran the test at very light traffic, 1000 records/sec for 2000 seconds, and the problem came back.
According to HBASE-13329, a short type overflow occurs at "short diffIdx = 0" in hbase-common/src/main/java/org/apache/hadoop/hbase/CellComparator.java. A patch became available today (2015-07-06); in it, the HBase developers change the declaration from "short diffIdx = 0" to "int diffIdx = 0".
I ran my tests with the patch applied, and it worked properly.
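For illustration only (this is not the HBase source, just a minimal reproduction of the overflow class that HBASE-13329 describes): a short counter that walks past 32,767 wraps negative, and using it as an array size throws the same NegativeArraySizeException seen in the stack traces above.

public class ShortOverflowDemo {
    public static void main(String[] args) {
        short diffIdx = 0; // the patch changes this declaration to: int diffIdx = 0;
        for (int i = 0; i < 40000; i++) {
            diffIdx++; // silently wraps past 32,767 into negative values
        }
        System.out.println("diffIdx = " + diffIdx); // prints a negative number
        byte[] midpoint = new byte[diffIdx]; // throws NegativeArraySizeException
    }
}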
It seems like you might have a disk I/O issue. Hint: "[sync.3] wal.FSHLog: Slow sync cost".
Refer to http://search-hadoop.com/m/YGbbn8F9AuR0s42