Kafka Connect - ExtractTopic transformation with HDFS sink connector throws NullPointerException

I am using the Confluent HDFS sink connector 5.0.0 with Kafka 2.0.0 and I need to use the ExtractTopic transformation (https://docs.confluent.io/current/connect/transforms/extracttopic.html). My connector works fine, but when I add this transformation I get a NullPointerException, even on a simple data sample with only two attributes.
ERROR Task hive-table-test-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:482)
java.lang.NullPointerException
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:352)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:109)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Here is the configuration of the connector:
name=hive-table-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=hive_table_test
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=${env.SCHEMA_REGISTRY_URL}
value.converter.schema.registry.url=${env.SCHEMA_REGISTRY_URL}
schema.compatibility=BACKWARD
# HDFS configuration
# Use store.url instead of hdfs.url (deprecated) in later versions. Property store.url does not work, yet
hdfs.url=${env.HDFS_URL}
hadoop.conf.dir=/etc/hadoop/conf
hadoop.home=/opt/cloudera/parcels/CDH/lib/hadoop
topics.dir=${env.HDFS_TOPICS_DIR}
# Connector configuration
format.class=io.confluent.connect.hdfs.avro.AvroFormat
flush.size=100
rotate.interval.ms=60000
# Hive integration
hive.integration=true
hive.metastore.uris=${env.HIVE_METASTORE_URIS}
hive.conf.dir=/etc/hive/conf
hive.home=/opt/cloudera/parcels/CDH/lib/hive
hive.database=kafka_connect
# Transformations
transforms=InsertMetadata, ExtractTopic
transforms.InsertMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertMetadata.partition.field=partition
transforms.InsertMetadata.offset.field=offset
transforms.ExtractTopic.type=io.confluent.connect.transforms.ExtractTopic$Value
transforms.ExtractTopic.field=name
transforms.ExtractTopic.skip.missing.or.null=true
I am using Schema Registry, the data is in Avro format, and I am sure the given attribute name is never null. Any suggestions? What I need is basically to extract the content of a given field and use it as the topic name.
EDIT:
It happens even on a simple JSON record like this in Avro format:
{
"attr": "tmp",
"name": "topic1"
}

The short answer is: it happens because you change the name of the topic in your transformation.
The HDFS connector keeps a separate TopicPartitionWriter for each topic partition. When the SinkTask that is responsible for processing messages is created, a TopicPartitionWriter is created for each assigned partition in the open(...) method.
When it processes SinkRecords, it looks up the TopicPartitionWriter based on the topic name and partition number and tries to append the record to its buffer. In your case it couldn't find a writer for the message: the topic name was changed by the transformation, and no TopicPartitionWriter was ever created for that (topic, partition) pair.
The SinkRecords that are passed to HdfsSinkTask::put(Collection<SinkRecord> records) already have their topic and partition set, so you don't have to apply any transformation there.
I think io.confluent.connect.transforms.ExtractTopic should rather be used with a SourceConnector.
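A rough sketch of the failing lookup, in Scala (illustrative only, not the actual connector source):
object NpeSketch extends App {
  // open(...) created writers only for the ORIGINAL topic the task is subscribed to
  val writers = Map(("hive_table_test", 0) -> "TopicPartitionWriter for hive_table_test-0")
  // by the time put(...) runs, ExtractTopic has already rewritten the record's topic
  val lookupKey = ("topic1", 0)
  // the connector effectively does writers.get(topicPartition).buffer(record);
  // the lookup misses, so in the connector's Java map it comes back null and buffer(...) throws the NPE
  println(writers.get(lookupKey)) // None here; null in DataWriter.write
}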

Related

Restart kafka connect sink and source connectors to read from beginning

I have searched quite a lot on this but there doesn't seem to be a good guide around it.
From what I have searched there are a few things to consider:
Resetting Sink Connector internal topics (status, config and offset).
How source connector offsets are stored is implementation specific.
Question: Is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option), but it doesn't seem to be the right thing to do.
Resetting the consumer group offsets with --reset-offsets --to-earliest.
Using the REST API (does it provide the functionality to reset and read from the beginning?).
What would be the best way to restart both a sink and a source connector to read from the beginning?
Source Connector:
Standalone mode: remove the offset file (/tmp/connect.offsets) or change the connector name.
Distributed mode: change the name of the connector.
Sink connector (both modes), one of the following methods:
Change the name.
Reset the offsets for the consumer group. The name of the group is the same as the connector name.
To reset the offsets you first have to delete the connector, reset the offsets (./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName), and then add the same configuration one more time, as sketched below.
You can check the following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
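A rough sketch of that sequence for a distributed worker (the REST endpoint, connector name, topic, and config file below are placeholders, not taken from the question):
# 1. delete the sink connector via the Connect REST API
curl -X DELETE http://localhost:8083/connectors/connectorName
# 2. reset the consumer group offsets to the earliest position
./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName
# 3. re-create the connector with the same configuration
curl -X POST -H "Content-Type: application/json" --data @connector-config.json http://localhost:8083/connectors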
The source connector in distributed mode has another option, which is producing a new message to the offset topic.
For example, I use the JDBC source connector:
When looking at the offset topic I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
Now, in order to reset this, I just produce another message with incrementing:0.
For example, to produce from the shell with a key:
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following:
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
A bit late, but I found another way. Just set offset.storage.file.filename in standalone mode to /dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

Confluent: HDFS sink to Avro format, but while reading the Avro file in Hive my time is 5:30 hours ahead of "timezone": "Asia/Kolkata"

My HDFS sink connector configuration:
{
  "name": "hdfs-sink1",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "3",
    "topics": "mysql-prod-registrations-",
    "hadoop.conf.dir": "/usr/hdp/current/hadoop-client/conf",
    "hadoop.home": "/usr/hdp/current/hadoop-client",
    "hdfs.url": "hdfs://HACluster:8020",
    "topics.dir": "/topics",
    "logs.dir": "/logs",
    "flush.size": "100",
    "rotate.interval.ms": "60000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "value.converter.schemas.enable": "true",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "1800000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/",
    "locale": "kor",
    "timezone": "Asia/Kolkata"
  }
}
But while reading in Hive I am 5:30 hours ahead of "timezone": "Asia/Kolkata". How can I get the timestamp value in the Indian time zone?
The connector was running fine for two days, but now I am getting the error below:
ERROR WorkerSinkTask{id=hdfs-sink1-2} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:177)
java.lang.NullPointerException
at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:133)
at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:612)
at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:69)
at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:672)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:283)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:422)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:352)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:337)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:343)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1218)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1181)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:444)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:317)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2018-12-14 12:22:39,670] ERROR WorkerSinkTask{id=hdfs-sink1-2} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:178)
Asia/Kolkata is +05:30 ahead of UTC, so that makes some sense... And the timezone config only applies to the path.format values, not to the internal values of your Kafka records.
I'm not sure which tool you are using to query with, but the problem might be there: I have had some tools assume the data is written only in UTC, and the tool then "shifts" and "displays" a formatted local timestamp... Therefore, I would suggest making the HDFS sink connector actually write in UTC time, then letting your SQL tools and operating system handle the actual TZ conversions themselves.
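To illustrate that last point: the stored values stay in UTC and the shift to IST belongs to the query side (in Hive, from_utc_timestamp(col, 'Asia/Kolkata') is the usual way, assuming the column really holds UTC). A minimal JVM-side sketch of the same conversion, using a made-up timestamp value:
import java.time.{Instant, ZoneId}

object TzDemo extends App {
  // a UTC instant as it would be stored in the Avro record (example value only)
  val utc = Instant.parse("2018-12-14T06:52:39Z")
  // rendering it in Asia/Kolkata adds the +05:30 offset; the stored value itself is unchanged
  val ist = utc.atZone(ZoneId.of("Asia/Kolkata"))
  println(ist) // 2018-12-14T12:22:39+05:30[Asia/Kolkata]
}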

Spring DefaultTuple not serializable

I'm using the spring-xd 1.3.0 release to process a tuple message.
After using some taps to enrich a message, I've made my own aggregator to re-assemble the resulting messages.
Now I would like to use a PostgreSQL message store, to have persistence in case of a crashing node.
So I roughly copy-pasted the XML configuration file of the original spring-xd aggregator.
Then I built and deployed the following stream:
stream create aggregate --definition "queue:scoring > scoring-aggregator --store=jdbc --username=${spring.datasource.username} --password=${spring.datasource.password} --driverClassName=${spring.datasource.driverClassName} --url=${spring.datasource.url} --initializeDatabase=false > queue:endAggr"
but when I send my usual tuple message to this stream, which was processed correctly with an in-memory store, I get:
xd_container_2 | Caused by: org.springframework.core.serializer.support.SerializationFailedException: Failed to serialize object using DefaultSerializer; nested exception is java.io.NotSerializableException: org.springframework.xd.tuple.DefaultTuple
xd_container_2 | at org.springframework.core.serializer.support.SerializingConverter.convert(SerializingConverter.java:68) ~[spring-core-4.2.2.RELEASE.jar:4.2.2.RELEASE]
xd_container_2 | at org.springframework.integration.jdbc.JdbcMessageStore.addMessage(JdbcMessageStore.java:345) ~[spring-integration-jdbc-4.2.4.RELEASE.jar:na]
xd_container_2 | at org.springframework.integration.jdbc.JdbcMessageStore.addMessageToGroup(JdbcMessageStore.java:386) ~[spring-integration-jdbc-4.2.4.RELEASE.jar:na]
xd_container_2 | at org.springframework.integration.aggregator.AbstractCorrelatingMessageHandler.store(AbstractCorrelatingMessageHandler.java:604) ~[spring-integration-core-4.2.2.RELEASE.jar:na]
xd_container_2 | at org.springframework.integration.aggregator.AbstractCorrelatingMessageHandler.handleMessageInternal(AbstractCorrelatingMessageHandler.java:400) ~[spring-integration-core-4.2.2.RELEASE.jar:na]
xd_container_2 | at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:127) ~[spring-integration-core-4.2.2.RELEASE.jar:na]
and now... well, I'm stuck and don't have any idea how to proceed.
Any hint appreciated.
The Tuple is not Serializable - I am not sure why - but XD uses Kryo for serialization internally (by default); you could add kryo transformers before/after the aggregator as a workaround.
EDIT
Another option is to convert the Tuple to JSON, using --inputType=application/json on the aggregator.
See type conversion.
The aggregator output would be a collection of JSON strings; getting them back to a Tuple would depend on what you are doing downstream of the aggregator.
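For instance, the original stream definition with the content-type conversion added might look like this (a sketch, not tested; all other options unchanged):
stream create aggregate --definition "queue:scoring > scoring-aggregator --inputType=application/json --store=jdbc --username=${spring.datasource.username} --password=${spring.datasource.password} --driverClassName=${spring.datasource.driverClassName} --url=${spring.datasource.url} --initializeDatabase=false > queue:endAggr"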

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producer gets some data from an external source and writes it into Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The Consumer is based on Spark Streaming and every 5 seconds fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES, because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]: Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [3/26560] (maybe ES was overloaded?). Bailing out...
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57) ~[spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [na:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [na:na]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.scheduler.Task.run(Task.scala:58) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) [spark-core_2.10-1.2.1.jar:1.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages; I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1, AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending them to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES: elasticsearch-spark_2.10, the latest beta version.
After many retries, I found a way to write to Elasticsearch without getting any error. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error, considering that I would expect an error message if I were trying to write more than the allowed max batch size, not less.
Moreover, I've noticed that I actually was writing to ES, just not all of my messages; I was losing between 1 and 3 messages per batch.
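For reference, a sketch of the write with that setting applied (the call is the same as in the question's code, with only the extra option added):
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo",
    Map("es.input.json" -> "true", "es.batch.size.entries" -> "1"))
})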
When I pushed a DataFrame to ES from Spark, I had the same error message, even with the "es.batch.size.entries" -> "1" configuration.
Once I increased the thread pool in ES, the issue went away.
For example, the bulk pool settings:
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When they are written to Elasticsearch as part of the same batch, this produces the error above.
Possible solutions:
Generate a unique id for each record. Depending on your use case, this can be done in a few different ways. For example, one common solution is to create a new field by combining the id and lastModifiedDate fields and to use that field as the id when writing to Elasticsearch.
Perform de-duplication of records based on the id: select only one record for a particular id and discard the other duplicates. Depending on your use case, this could be the most current record (based on a timestamp field), the most complete (most of the fields contain data), etc.
Solution #1 will store all records that you receive in the stream.
Solution #2 will store only one record for a specific id, based on your de-duplication logic. The result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit performance by writing one record at a time.
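A minimal sketch of option 2, assuming the Kafka message key is (or maps one-to-one to) the field used as the Elasticsearch document id; if the id lives inside the JSON payload instead, key the RDD by that field first:
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on older Spark versions)
import org.elasticsearch.spark._         // saveToEs

convertedResult.foreachRDD { rdd =>
  rdd.reduceByKey((first, _) => first)   // keep a single record per key within the micro-batch
     .map(_._2)
     .saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
}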
One possibility is the cluster/shard status being RED. Please address this issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
The result is Elasticsearch receiving multiple updates for a single document at once - from multiple sources / through multiple nodes / containing different data.
"I was losing between 1 and 3 messages per batch."
This matches the symptoms: a fluctuating number of failures when the batch size is > 1, and success when the batch write size is "1".
Just adding another potential reason for this error; hopefully it helps someone.
If your Elasticsearch index has child documents, then:
If you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed. This might cause issues while updating from Spark.
If you are using the standard _id, uniqueness will be preserved; however, you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing
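A sketch of how those options might be passed from Spark (the index name and field names are placeholders, not from the question):
// given an RDD[String] of JSON documents called docs (placeholder)
docs.saveToEs("index/type", Map(
  "es.input.json"      -> "true",
  "es.mapping.join"    -> "myJoinField",     // join field of the parent/child mapping
  "es.mapping.routing" -> "myRoutingField")) // field whose value is used as the routing key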

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka consumer group name. By looking at the storm-kafka code, I followed the setting id and found that it is never used when dealing with a consumer configuration, but is used in creating the ZooKeeper path at which offset information is stored. Here in this link is an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and the format in which it stores consumer offsets in ZooKeeper is its own (JSON format).
If you write the spout config like below,
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
        "topic name",
        "/kafka/consumers",  // just an example: the ZooKeeper root path under which consumer offsets are stored
        "yourTopic");        // the id used for this spout's offset node
it will write the consumer offsets in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same ZooKeeper that your Storm cluster uses.
