Kafka HDFS Sink Connector error [Top level type must be STRUCT...]

I'm testing Kafka 2.7 with Kafka Connect, and I'm facing a problem that I don't understand.
I started a distributed Connect worker first with a configuration like the one below.
bootstrap.servers=..:9092,...:9092, ...
group.id=kafka-connect-test
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
... some internal topic configuration
plugin.path=<plugin path>
This Connect worker is served on port 8083.
I want to write ORC-format data with the Snappy codec to HDFS,
so I created a new Kafka HDFS sink connector through the REST API with the JSON below.
I don't use Schema Registry.
curl -X POST <connector url:8083> \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-d '
{
"name": "hdfs-sinkconnect-test",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"store.url": "hdfs:~",
"hadoop.conf.dir": "<my hadoop.conf dir>",
"hadoop.home": "<hadoop home dir>",
"tasks.max": "5",
"key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer": "org.apache.kafka.common.serialization.ByteArrayDeserializer",
"format.class": "io.confluent.connect.hdfs.orc.OrcFormat",
"flush.size": 1000,
"avro.codec": "snappy",
"topics": "<topic name>",
"topics.dir": "/tmp/connect-logs",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"locale": "ko_KR",
"timezone": "Asia/Seoul",
"partition.duration.ms": "3600000",
"path.format": "'hour'=YYYYMMddHH/"
}
}'
Then I get an error message like this:
# connectDistributed.out
[2021-06-28 17:14:11,596] ERROR Exception on topic partition <topic name>-<partition number>: (io.confluent.connect.hdfs.TopicPartitionWriter:409)
org.apache.kafka.connect.errors.ConnectException: Top level type must be STRUCT but was bytes
at io.confluent.connect.hdfs.orc.OrcRecordWriterProvider$1.write(OrcRecordWriterProvider.java:98)
at io.confluent.connect.hdfs.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:742)
at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:385)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:333)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:126)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:329)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think this error message has something to do with schema information.
Is Schema Registry essential for the Kafka connector?
Any ideas or solutions for this error message? Thanks.

Writing ORC files requires a Struct type.
The converter options provided by Confluent include plain JSON, JSON Schema, Avro, and Protobuf. The only option that doesn't require Schema Registry is the plain JsonConverter.
Note that key.deserializer and value.deserializer are not valid Connect properties. You need to set your key.converter and value.converter properties instead.
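As a minimal sketch (an illustration, not the asker's actual data format), the converter properties in the connector config could look like this if the topic carried JSON messages with embedded Connect schemas; JsonConverter only yields a Struct when the messages use the schema/payload envelope:
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true"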
If you're not willing to modify the converter, you can attempt to use a HoistField transform to create a Struct; this will produce an ORC file with a schema of only one field.
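A minimal sketch of that SMT configuration, where the transform alias hoist and the field name message are arbitrary illustrative names:
"transforms": "hoist",
"transforms.hoist.type": "org.apache.kafka.connect.transforms.HoistField$Value",
"transforms.hoist.field": "message"
The resulting ORC files would then contain a single column (here message) holding the original record value.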

Related

kafka connect elastic sink Could not connect to Elasticsearch. General SSLEngine problem

I'm trying to deploy the Confluent Kafka Connect Elasticsearch sink. My Elastic Stack is deployed on Kubernetes with HTTP encryption and authentication. I'm forwarding Elasticsearch from Kubernetes to localhost.
Caused by: org.apache.kafka.connect.runtime.rest.errors.BadRequestException: Connector configuration is invalid and contains the following 3 error(s):
Could not connect to Elasticsearch. Error message: General SSLEngine problem
Could not authenticate the user. Check the 'connection.username' and 'connection.password'. Error message: General SSLEngine problem
Could not authenticate the user. Check the 'connection.username' and 'connection.password'. Error message: General SSLEngine problem
I'm sure that the username and password are right. The Elasticsearch sink properties file looks like this:
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=pwp-alerts
key.ignore=true
connection.url=https://localhost:9200
type.name=kafka-connect
errors.tolerance = all
behavior.on.malformed.documents=warn
schema.ignore = true
connection.username ="elastic"
connection.password ="my_password"
Does anyone know what can cause the problem?
I guess the failure is caused by an unsuccessful connection to your Elasticsearch instance. That can happen for many reasons, for example a wrong port, or the listener type (an advertised listener instead of a simple one). As an alternative, I recommend using Logstash with a Kafka input: you can easily set the consumer group, bootstrap servers, and many other properties in the input, and the Elasticsearch index, port, and authorization in the output.
A Logstash configuration file with a Kafka input may look like this:
input {
  kafka {
    group_id => "Your consumer group id"
    topics => ["Your topic name"]
    bootstrap_servers => "Your broker host:port (default port is 9092)"
    codec => json
  }
}
filter {
}
output {
  file {
    path => "Some path"
  }
  elasticsearch {
    hosts => ["localhost:9200"]
    document_type => "_doc"
    index => "Your index name"
    user => "your username"
    password => "your password"
  }
  stdout {
    codec => rubydebug
  }
}
You can remove the file output if you don't want to additionally store your data on disk beside the Elasticsearch output.
You can find out more about the Logstash Kafka input properties in the Logstash documentation.

Is there a way to debug transactions in the RSK network, and what would be the best way?

We are running an RSK node. Some smart contract transactions show internal errors, but the message related to the failed require condition doesn't appear in those error messages...
We only see "internal error" and are unable to tell which specific error occurred.
If your contract emits reason messages in its reverts, you can retrieve them using debug_traceTransaction.
NOTE: The RSK public nodes do not expose the debug RPC module, so you must run your own node with the debug module enabled in order to use this feature.
The following assumes that you have a local node running with RPC exposed on port 4444.
First, you need to enable the debug module in your node's config file:
modules = [
  ...
  {
    "name": "debug",
    "version": "1.0",
    "enabled": "true",
  },
  ...
]
Then, you can execute the RPC method, passing the transaction ID as a parameter, as in this example:
curl \
-X POST \
-H "Content-Type:application/json" \
--data '{"jsonrpc":"2.0","method":"debug_traceTransaction","params":["0xa9ae08f01437e32973649cc13f6db44e3ef370cbcd38a6ed69806bd6ea385e49"],"id":1}' \
http://localhost:4444
You will get the following response (truncated for brevity):
{
...
"result": "08c379a00000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000001e536166654d6174683a207375627472616374696f6e206f766572666c6f770000",
"error": "",
"reverted": true,
...
}
Finally, convert the result from hexadecimal to ASCII to obtain a readable message:
Ãy SafeMath: subtraction overflow
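For example, a quick way to do that conversion from a shell, assuming xxd and strings are available (strings simply drops the non-printable ABI-encoding bytes around the reason text):
echo "<the result hex string from the response above>" | xxd -r -p | strings
# prints: SafeMath: subtraction overflow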

Bulk data update from MSSQL via Kafka connector to Elasticsearch with nested types is failing

MSSQL value: column name=prop --> value=100, and column name=role --> value=[{"role":"actor"},{"role":"director"}]
NOTE: the role column is saved in JSON format.
Read from the Kafka topic:
{
"schema":{
"type":"struct",
"fields":[
{
"type":"int32",
"optional":false,
"field":"prop"
},
{
"type":"string",
"optional":true,
"field":"roles"
}
],
"optional":false
},
"payload":{ "prop":100, "roles":"[{"role":"actor"},{"role":"director"}]"}
It is failing with the reason:
Error was [{"type":"mapper_parsing_exception","reason":"object mapping for [roles] tried to parse field [roles] as object, but found a concrete value"}
The reason for the failure is that the connector is not able to create the schema as an array for roles.
The above input message is created by the Confluent JdbcSourceConnector, and the sink connector used is the Confluent ElasticsearchSinkConnector. (See the sketch below for the target schema shape.)
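For reference, a hedged sketch of what the Connect schema for the roles field would have to look like for Elasticsearch to map it as an array of objects (this only illustrates the target shape; it is not what the JDBC source connector actually emits):
{
  "type": "array",
  "items": {
    "type": "struct",
    "fields": [ { "type": "string", "optional": true, "field": "role" } ],
    "optional": true
  },
  "optional": true,
  "field": "roles"
}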
Configuration details:
Sink config:
name=prop-test
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
connection.url=<elasticseach url>
tasks.max=1
topics=test_prop
type.name=prop
#transforms=InsertKey, ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=prop
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=prop
Source config:
name=test_prop_source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlserver://*.*.*.*:1433;instance=databaseName=test;
connection.user=*****
connection.password=*****
query=EXEC <store proc>
mode=bulk
batch.max.rows=2000000
topic.prefix=test_prop
transforms=createKey,extractInt
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=prop
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=prop
connect-standalone.properties :
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
I need to understand how I can explicitly make the schema an ARRAY for roles and not a string.
The JDBC source connector always treats the column name as the field name and the column value as a plain value. This conversion from string to array is not possible with the existing JDBC source connector support; it would take a custom transformation or custom plugin to enable it.
The best option available for getting the data transferred from MSSQL and inserted into Elasticsearch is to use Logstash. It has rich filter plugins which let the data from MSSQL flow in any desired format and to any desired JSON output environment (Logstash / Kafka topic), as sketched below.
Flow: MSSQL --> Logstash --> Kafka topic --> Kafka Elasticsearch sink connector --> Elasticsearch
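A minimal sketch of such a Logstash pipeline, assuming the JDBC input and Kafka output plugins are installed (connection details are placeholders, the statement and topic name are reused from the question's configs, and the json filter is what turns the roles string into a real array):
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://<host>:1433;databaseName=test"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "<user>"
    jdbc_password => "<password>"
    statement => "EXEC <store proc>"
  }
}
filter {
  # parse the JSON string held in the roles column into a real array of objects
  json {
    source => "roles"
    target => "roles"
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id => "test_prop"
    codec => json
  }
}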

how to configure an HdfsSinkConnector with Kafka Connect?

I'm trying to set up an HdfsSinkConnector. This is my worker.properties config:
bootstrap.servers=kafkacluster01.corp:9092
group.id=nycd-og-kafkacluster
config.storage.topic=hive_conn_conf
offset.storage.topic=hive_conn_offs
status.storage.topic=hive_conn_stat
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://my-schemaregistry.co:8081
schema.registry.url=http://my-schemaregistry.co:8081
hive.integration=true
hive.metastore.uris=dev-hive-metastore
schema.compatibility=BACKWARD
value.converter.schemas.enable=true
logs.dir = /logs
topics.dir = /topics
plugin.path=/usr/share/java
And this is the POST request that I'm calling to set up the connector:
curl -X POST localhost:9092/connectors -H "Content-Type: application/json" -d '{
"name":"hdfs-hive_sink_con_dom16",
"config":{
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"topics": "dom_topic",
"hdfs.url": "hdfs://hadoop-sql-dev:10000",
"flush.size": "3",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"http://my-schemaregistry.co:8081"
}
}'
The topic dom_topic already exists (it is Avro), but I get the following error from my worker:
INFO Couldn't start HdfsSinkConnector: (io.confluent.connect.hdfs.HdfsSinkTask:72)
org.apache.kafka.connect.errors.ConnectException: java.io.IOException:
Failed on local exception: com.google.protobuf.InvalidProtocolBufferException:
Protocol message end-group tag did not match expected tag.;
Host Details : local host is: "319dc5d70884/172.17.0.2"; destination host is: "hadoop-sql-dev":10000;
at io.confluent.connect.hdfs.DataWriter.<init>(DataWriter.java:202)
at io.confluent.connect.hdfs.HdfsSinkTask.start(HdfsSinkTask.java:64)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:207)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:139)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
The hdfs.url I've taken from Hive is jdbc:hive2://hadoop-sql-dev:10000.
If I change the port to, say, 9092, I get:
INFO Retrying connect to server: hadoop-sql-dev/xxx.xx.x.xx:9092. Already tried 0 time(s); maxRetries=45 (org.apache.hadoop.ipc.Client:837)
I'm running this all on Docker, and my Dockerfile is very simple
#FROM coinsmith/cp-kafka-connect-hdfs
FROM confluentinc/cp-kafka-connect:5.3.1
COPY confluentinc-kafka-connect-hdfs-5.3.1 /usr/share/java/kafka-connect-hdfs
COPY worker.properties worker.properties
# start
ENTRYPOINT ["connect-distributed", "worker.properties"]
Any help would be appreciated.

Apache Kafka JDBC Connector - SerializationException: Unknown magic byte

We are trying to write back the values from a topic to a postgres database using the Confluent JDBC Sink Connector.
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
connection.password=xxx
tasks.max=1
topics=topic_name
auto.evolve=true
connection.user=confluent_rw
auto.create=true
connection.url=jdbc:postgresql://x.x.x.x:5432/Datawarehouse
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
We can read the value in the console using:
kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic topic_name
The schema exists and the values are correctly deserialized by kafka-avro-console-consumer (it gives no errors), but the connector gives these errors:
{
"name": "datawarehouse_sink",
"connector": {
"state": "RUNNING",
"worker_id": "x.x.x.x:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "x.x.x.x:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:511)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:491)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:226)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:194)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: f_machinestate_sink\n\tat io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:103)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:511)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)\n\t... 13 more\nCaused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1\nCaused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!\n"
}
],
"type": "sink"
}
The final error is:
org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is registered in the schema registry.
Does the problem sit with the configuration file of the connector ?
The error org.apache.kafka.common.errors.SerializationException: Unknown magic byte! means that a message on the topic was not valid Avro and could not be deserialized. There are several reasons this could happen:
Some messages are Avro, but others are not. If this is the case, you can use the error handling capabilities in Kafka Connect to ignore the invalid messages with config like this:
"errors.tolerance": "all",
"errors.log.enable":true,
"errors.log.include.messages":true
The value is Avro but the key isn't. If this is the case then use the appropriate key.converter.
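For example, if the keys on the topic are plain strings rather than Avro, a sketch of the relevant sink properties (keeping the question's Schema Registry URL) might be:
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081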
More reading: https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/
This means that the deserializer checked the first bytes of the message and found something unexpected. There is more about how the serializer packs messages in the Confluent documentation; check the 'wire format' section. The guess is that the zeroth (magic) byte of your message is != 0.
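As a rough way to check this, you can dump the first bytes of a record with the console consumer and hexdump (a sketch, assuming shell access to a broker host); a value encoded in the Confluent wire format starts with a 0x00 magic byte followed by a 4-byte schema ID:
kafka-console-consumer --bootstrap-server localhost:9092 --topic topic_name \
  --from-beginning --max-messages 1 | hexdump -C | head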
