I am trying to stream a Kafka queue into ClickHouse following the steps on the official page https://clickhouse.yandex/docs/en/table_engines/kafka.html, but I cannot get it to run correctly.
I've checked the Kafka configuration and it's fine, because I've created a feeder for this queue, and I've added the ZooKeeper host and port to the ClickHouse configuration.
For example, the statement executed from Eclipse is:
System.out.println(ck.connection.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS test.clickhouseMd5 (" +
    "st1 String, " +
    "st2 String, " +
    "st3 String" +
    ") ENGINE = Kafka('node2:2181', 'TestTopic', 'testConsumerGroup', 'JSONEachRow')"));
The printed result of execute() is always false and no exception is thrown.
Any ideas?
Thanks,
Kind regards.
Could you try running your query via the command-line clickhouse-client on the ClickHouse node?
clickhouse-client --query "CREATE TABLE IF NOT EXISTS test.clickhouseMd5 (st1 String, st2 String, st3 String) ENGINE = Kafka('node2:2181', 'TestTopic', 'testConsumerGroup', 'JSONEachRow')"
Note that JDBC's Statement.execute() returns false whenever the statement produces no result set, which is normal for a CREATE TABLE, so false by itself does not indicate a failure.
You use port 2181, which is the default port for ZooKeeper.
But according to the documentation you mentioned (https://clickhouse.yandex/docs/en/table_engines/kafka.html), you should specify a comma-separated list of Kafka brokers (e.g. localhost:9092) in the first argument.
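For example, assuming the broker on node2 listens on the default broker port 9092 (check the listeners setting in your server.properties), the statement would be:
CREATE TABLE IF NOT EXISTS test.clickhouseMd5 (
    st1 String,
    st2 String,
    st3 String
) ENGINE = Kafka('node2:9092', 'TestTopic', 'testConsumerGroup', 'JSONEachRow') -- comma-separated broker list, not ZooKeeper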
Also note that it may not work with old Kafka versions, for example 0.9.0.1: with that version the CREATE TABLE command returns OK, but the Kafka logs show an error like 'ERROR Processor got uncaught exception. (kafka.network.Processor) java.lang.ArrayIndexOutOfBoundsException: 18'.
With the latest Kafka, 0.11.0.2, it works fine for me.
I have a graylog instance that's running a UDP-Syslog-Input on Port 1514.
It's working wonderfully well for all the system logs of the linux servers.
When I try to ingest Payara logs though [1], the "source" of the message is set to "localhost" in Graylog, whereas it is normally the hostname of the sending server.
This is suboptimal, because ideally I want the application logs in Graylog with the correct source as well.
I googled around and found:
https://github.com/payara/Payara/blob/payara-server-5.2021.5/nucleus/core/logging/src/main/java/com/sun/enterprise/server/logging/SyslogHandler.java#L122
It seems like the syslog "source" is hard-coded to "localhost" in Payara.
Is there a way to accomplish sending payara-logs with the correct "source" set?
I have nothing to do with the application server itself; I just want to receive the logs with the correct source (the hostname of the sending server).
Example log entry in /var/log/syslog for Payara:
Mar 10 10:00:20 localhost [ INFO glassfish ] Bootstrapping Monitoring Console Runtime
I suspect I want the "localhost" in the above example set to the FQDN of the host.
Any ideas?
Best regards
[1]
logging.properties:com.sun.enterprise.server.logging.SyslogHandler.useSystemLogging=true
Try enabling "store full message" in the syslog input settings.
That will add the full_message field to your log messages, containing the full header in addition to what you see in the message field. Then you can check whether the source IP is present in the UDP packet. If so, collect those messages via a raw/plaintext UDP input and the source should show up correctly.
You may have to parse the rest of the message via an extractor or pipeline rule, but at least you'll have the source.
Well, this might not exactly be a good solution, but I tweaked the rsyslog template for Graylog.
I deploy the rsyslog-config via Puppet, so I can generate "$YOURHOSTNAME-PAYARA" dynamically using the facts.
This way, I at least have the correct source set.
# Custom RFC 5424 template with the real hostname baked in
# ("YOURHOSTNAME-PAYARA" is filled in per host by Puppet):
$template GRAYLOGRFC5424,"<%PRI%>%PROTOCOL-VERSION% %TIMESTAMP:::date-rfc3339% YOURHOSTNAME-PAYARA %APP-NAME% %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n"
# Forward glassfish/payara messages with the custom template (@ = UDP),
# then discard them ("& ~") so they are not processed again;
# everything else goes out with the standard RFC 5424 format.
if $msg contains 'glassfish' then {
    *.* @loghost.domain:1514;GRAYLOGRFC5424
    & ~
} else {
    *.* @loghost.domain:1514;RSYSLOG_SyslogProtocol23Format
}
The other thing we did is actually activating application logging through log4j and its syslog appender:
<Syslog name="syslog_app" appName="DEMO" host="loghost" port="1514" protocol="UDP" format="RFC5424" facility="LOCAL0" enterpriseId="">
<LoggerFields>
<KeyValuePair key="thread" value="%t"/>
<KeyValuePair key="priority" value="%p"/>
<KeyValuePair key="category" value="%c"/>
<KeyValuePair key="exception" value="%ex"/>
</LoggerFields>
</Syslog>
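For completeness, the appender only takes effect once a logger references it; a minimal sketch, assuming a root logger at info level (adjust to your log4j2.xml):
<Loggers>
    <Root level="info">
        <AppenderRef ref="syslog_app"/>
    </Root>
</Loggers>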
This way, we can ingest the glassfish server logs and the independent application logs into graylog.
The "LoggerFields" in log4j.xml appear to be key-value pairs for the "StructuredDataElements" according to RFC5424.
https://logging.apache.org/log4j/2.x/manual/appenders.html
https://datatracker.ietf.org/doc/html/rfc5424
That's the problem with UDP Syslog. The sender gets to set the source in the header. There is no "best answer" to this question. When the information isn't present, it's hard for Graylog to pass it along.
It sounds like you may have found an answer that works for you. Go with it. Using log4j solves two problems and lets you define the source yourself.
For those who face a similar issue, a simpler way to solve the source problem might be to use a static field. If you send the Payara syslog messages to their own input, you can create a static field that substitutes for the source and identifies traffic from that input. Call it "app_name" or "app_source" or something, and use that field for whatever sorting you need to do.
Alternatively, if you have just one source for application messages, you could use a pipeline to set the value of the source field to the IP or FQDN of the payara server. Then it displays like all the rest.
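A minimal sketch of such a pipeline rule (the FQDN is an assumption; adapt the condition to whatever identifies the Payara messages on your input):
rule "rewrite payara source"
when
    to_string($message.source) == "localhost"
then
    set_field("source", "payara01.example.com");
end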
I'm running a JDBC source connector and trying to monitor its status via the exposed JMX metrics and a Prometheus exporter. However, the status of the connector and all its tasks remains "running" even when the query fails or the DB can't be reached.
In earlier versions it seems that no value for source-record-poll-total in the source-task-metrics was exported when the query failed; with the versions I use (connect-runtime-6.2.0-ccs, confluentinc-kafka-connect-jdbc-10.2.0, jmx_prometheus_javaagent-0.14.0), the metric is exported with value 0.0 even when the query fails.
Any ideas how I could detect such a failing query or DB connection?
This is resolved in version 10.2.4 of the jdbc connector. Tasks now fail when a SQLNonTransientException occurs and this can be detected using the exported metrics. See https://github.com/confluentinc/kafka-connect-jdbc/pull/1096
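Independent of the metrics, a failed task also shows up in Kafka Connect's REST interface, which can complement the JMX-based monitoring; a quick sketch (host and connector name are placeholders, jq is used to pick out the state):
curl -s http://connect-host:8083/connectors/jdbc-source/status | jq '.tasks[].state'
A failed task reports "FAILED" here.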
I'm using Kafka Connect with 2 connectors:
Debezium to pull data from Postgres to Kafka
S3 connector to save data from Kafka to S3
While running, I got this error from the S3 connector:
java.lang.NullPointerException: Array contains a null element at 0
I have found the related message, which has the following as part of its body:
"some_key": [
"XCVB",
null
]
How can I process this message?
I have tried adding the following to the S3 connector config:
"behavior.on.null.values": "ignore",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name":"dlq_s3_sink"
to try to skip those messages and send them to the DLQ, but it doesn't seem to be working and the task failed with this error. I also saw this in the log:
Set parquet.avro.write-old-list-structure=false to turn on support for arrays with null elements.
but I'm not sure where I should add this. As part of the connector config?
Add parquet.avro.write-old-list-structure=false to the sink connector config (a sketch follows below).
Also use version 10.1.0 or above.
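As a sketch, the relevant part of the sink connector config might then look like this (connector.class and format.class are the usual values for Parquet output with the Confluent S3 sink; the DLQ settings are the ones from the question):
{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "parquet.avro.write-old-list-structure": "false",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq_s3_sink"
}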
Reference: https://docs.confluent.io/kafka-connectors/s3-sink/current/changelog.html#version-10-1-0
I am using Confluent Kafka Go for my project. When writing tests, because topic creation in Kafka is asynchronous, I might get errors (error code 3: UNKNOWN_TOPIC_OR_PARTITION) when I create a topic and then read it back immediately.
As I understand it, if I can query the controller directly, I can always get the latest metadata. So my question is: how can I get the Kafka controller's IP or ID when using Confluent Kafka Go?
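One possible approach, sketched below: ask the cluster for the controller ID via the AdminClient and match it against the broker metadata. This is a minimal sketch; ControllerID and GetMetadata are assumed to be available in the confluent-kafka-go version in use (ControllerID was added in a later 1.x release), and the broker address is a placeholder.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    // Placeholder bootstrap address; replace with your brokers.
    admin, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
    if err != nil {
        panic(err)
    }
    defer admin.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Ask the cluster which broker currently acts as controller.
    controllerID, err := admin.ControllerID(ctx)
    if err != nil {
        panic(err)
    }

    // List all brokers and match the controller ID to get its host and port.
    md, err := admin.GetMetadata(nil, true, 5000)
    if err != nil {
        panic(err)
    }
    for _, b := range md.Brokers {
        if b.ID == controllerID {
            fmt.Printf("controller is broker %d at %s:%d\n", b.ID, b.Host, b.Port)
        }
    }
}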
I'm intermittently getting the error message
DAG did not succeed due to VERTEX_FAILURE.
when running Hive queries via PyHive. Hive is running on an EMR cluster where hive.vectorized.execution.enabled is set to false in the hive-site.xml file for this reason.
I can set the above property through the configuration on the Hive connection, and my query has run successfully every time I've executed it since then. However, I want to confirm that this is what fixed the issue and that hive-site.xml is indeed being ignored.
Can anyone confirm whether this is the expected behavior? Alternatively, is there any way to inspect the Hive configuration via PyHive? I've not been able to find a way of doing this.
Thanks!
PyHive is a thin client that connects to HiveServer2, just like a Java or C client (via JDBC or ODBC). It does not use any Hadoop configuration files on your local machine. The HS2 session starts with whatever properties are set server-side.
Same goes for ImPyla BTW.
So it's your responsibility to set custom session properties from your Python code, e.g. execute this statement...
SET hive.vectorized.execution.enabled=false
... before running your SELECT.