ClickHouse Kafka Engine Throwing Exception

I am trying to use the ClickHouse Kafka engine to ingest data in CSV format. During ingestion I sometimes get the following exception:
2018.01.08 08:41:47.016826 [ 3499 ] <Debug> StorageKafka (consumer_queue): Started streaming to 1 attached views
2018.01.08 08:41:47.016906 [ 3499 ] <Trace> StorageKafka (consumer_queue): Creating formatted reader
2018.01.08 08:41:49.680816 [ 3499 ] <Error> void DB::StorageKafka::streamThread(): Code: 117, e.displayText() = DB::Exception: Expected end of line, e.what() = DB::Exception, Stack trace:
0. clickhouse-server(StackTrace::StackTrace()+0x16) [0x3221296]
1. clickhouse-server(DB::Exception::Exception(std::string const&, int)+0x1f) [0x144a02f]
2. clickhouse-server() [0x36e6ce1]
3. clickhouse-server(DB::CSVRowInputStream::read(DB::Block&)+0x1a0) [0x36e6f60]
4. clickhouse-server(DB::BlockInputStreamFromRowInputStream::readImpl()+0x64) [0x36e3454]
5. clickhouse-server(DB::IProfilingBlockInputStream::read()+0x16e) [0x2bcae0e]
6. clickhouse-server(DB::KafkaBlockInputStream::readImpl()+0x6c) [0x32f6e7c]
7. clickhouse-server(DB::IProfilingBlockInputStream::read()+0x16e) [0x2bcae0e]
8. clickhouse-server(DB::copyData(DB::IBlockInputStream&, DB::IBlockOutputStream&, std::atomic<bool>*)+0x55) [0x35b3e25]
9. clickhouse-server(DB::StorageKafka::streamToViews()+0x366) [0x32f54f6]
10. clickhouse-server(DB::StorageKafka::streamThread()+0x143) [0x32f58c3]
11. clickhouse-server() [0x40983df]
12. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f4d115d06ba]
13. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4d10bf13dd]
Below are the table definitions:
CREATE TABLE test.consumer_queue (ID Int32, DAY Date) ENGINE = Kafka('broker-ip:port', 'clickhouse-kyt-test', 'clickhouse-kyt-test-group', 'CSV')
CREATE TABLE test.consumer_request ( ID Int32, DAY Date) ENGINE = MergeTree PARTITION BY DAY ORDER BY (DAY, ID) SETTINGS index_granularity = 8192
CREATE MATERIALIZED VIEW test.consumer_view TO test.consumer_request (ID Int32, DAY Date) AS SELECT ID, DAY FROM test.consumer_queue
CSV Data
10034,"2018-01-05"
10035,"2018-01-05"
10036,"2018-01-05"
10037,"2018-01-05"
10038,"2018-01-05"
10039,"2018-01-05"
ClickHouse server version 1.1.54318.

It seems that ClickHouse reads a batch of messages from Kafka and then tries to decode the whole batch as a single CSV stream.
The messages in this stream must be separated by a newline character, so every message should end with a newline.
I am not sure whether this is a feature or a bug of ClickHouse.
You can try sending only one message to Kafka and check whether it appears correctly in ClickHouse.
If you send messages to Kafka with the kafka-console-producer.sh script, note that this script (class ConsoleProducer.scala) reads lines from a file and sends each line to the Kafka topic without a newline character, so such messages cannot be processed correctly.
If you send messages with your own script/application, you can modify it to append a newline character to the end of each message. This should solve the problem.
Or you can use another format for the Kafka engine, for example JSONEachRow, as sketched below.
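For instance, a minimal sketch of the queue table switched to JSONEachRow, reusing the broker/topic/group from the question (the sample message is illustrative):
CREATE TABLE test.consumer_queue (ID Int32, DAY Date) ENGINE = Kafka('broker-ip:port', 'clickhouse-kyt-test', 'clickhouse-kyt-test-group', 'JSONEachRow')
with each Kafka message then carrying one JSON object per row, e.g. {"ID":10034,"DAY":"2018-01-05"}.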

Agreeing with @mikhail's answer: I would also try kafka_row_delimiter = '\n' in the Kafka engine SETTINGS.
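As a sketch only: the named-settings syntax below is not available on server 1.1.54318 and needs a newer ClickHouse release, but on such a release the queue table would look roughly like this:
CREATE TABLE test.consumer_queue (ID Int32, DAY Date)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker-ip:port',
         kafka_topic_list = 'clickhouse-kyt-test',
         kafka_group_name = 'clickhouse-kyt-test-group',
         kafka_format = 'CSV',
         kafka_row_delimiter = '\n'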

Related

Error using Polybase to load Parquet file: class java.lang.Integer cannot be cast to class parquet.io.api.Binary

I have a snappy.parquet file with a schema like this:
{
"type": "struct",
"fields": [{
"name": "MyTinyInt",
"type": "byte",
"nullable": true,
"metadata": {}
}
...
]
}
Update: parquet-tools reveals this:
############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
When I try and run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase I get the error:
11:16:21 Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
The load into the external table works fine when all columns are varchars:
CREATE EXTERNAL TABLE [domain].[TempTable]
(
...
MyTinyInt tinyint NULL,
...
)
WITH
(
LOCATION = ''' + @Location + ''',
DATA_SOURCE = datalake,
FILE_FORMAT = parquet_snappy
)
The data will eventually be merged into a Data Warehouse Synapse table. In that table the column will have to be of type tinyint.
I had the same issue and a good support plan in Azure, so I got an answer from Microsoft:
There is a known bug in ADF for this particular scenario: the date type in Parquet should be mapped as data type date in SQL Server; however, ADF incorrectly converts this type to Datetime2, which causes a conflict in PolyBase. I have confirmation from the core engineering team that this will be rectified with a fix by the end of November and will be published directly into the ADF product.
In the meantime, as a workaround:
Create the target table with data type DATE as opposed to DATETIME2
Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
But even the Copy command did not work for me, so the only remaining workaround was to use Bulk insert; Bulk insert is extremely slow, though, and would be a problem on big datasets.
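For reference, the "Copy Command" option corresponds to Synapse's COPY INTO statement; a minimal hand-written sketch looks roughly like this (the storage URL, credential, and target table are placeholders, not values from the question):
COPY INTO [dbo].[StagingTable]
FROM 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);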

Confluent Kafka-connect-JDBC connector showing hexadecimal data in the Kafka topic

I'm trying to copy the data from a table in an Oracle DB into a Kafka topic. I've used the following JDBC source connector configuration for that:
name=JDBC-DB-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.password = *******
connection.url = jdbc:oracle:thin:@1.1.1.1:1111/ABCD
connection.user = *****
table.types=TABLE
query= select * from (SELECT * FROM JENNY.WORKFLOW where ID = '565231')
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
mode=timestamp+incrementing
incrementing.column.name=ID
timestamp.column.name=MODIFIED
topic.prefix=workflow_data12
poll.interval.ms=6000
timestamp.delay.interval.ms=60000
transforms:createKey
transforms.createKey.type:org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields:ID
So far so good: I'm able to get the data into my Kafka topic. But the output looks like the following:
key - {"ID":"\u0001"}
value - {"ID":"\u0001","MODIFIED":1874644537368}
You can observe that my key "ID" is being printed in hexadecimal format, even though I'm using Avro in my JDBC properties file.
(I'm using the kafka-avro-console-consumer to view the data on the command line.)
(And the column "ID" is of type "NUMBER" in the Oracle DB.)
Could anyone point out whether I'm missing some property needed to print the data properly in Avro format?
Thanks in advance!!
Add this property to your .properties file, e.g. before query:
numeric.mapping=best_fit
A detailed explanation can be found here.
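In the connector config from the question, that one line is the only change; the likely cause of the escaped output is that, without it, the connector maps Oracle NUMBER columns to Connect's Decimal logical type, which is serialized as raw bytes:
# Without this, Oracle NUMBER maps to the Decimal logical type (raw bytes),
# which the Avro console consumer prints as escaped strings like "\u0001".
# best_fit maps the column to int/long/double based on its precision and scale.
numeric.mapping=best_fit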

CloudWatch to Elasticsearch: parse/tokenize log events before pushing to ES

Appreciate your help in advance.
In my scenario, CloudWatch multi-line logs need to be shipped to the Elasticsearch service.
ECS --awslogs--> CloudWatch --Lambda--> ES domain
(basic flow, though I am very open to changing how data is shipped from CW to ES)
I was able to solve the multi-line issue using multi_line_start_pattern, BUT
the main issue I am experiencing now is that my logs are in ODL format (the following format):
[yyyy-mm-ddThh:mm:ss.SSS-Z][ProductName-Version][Log Level]
[Message ID][LoggerName][Key Value Pairs][[
Message]]
and I would like to parse and tokenize log events before storing them in ES (rather than storing the complete log line).
For example:
[2018-05-31T11:08:49.148-0400] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=43 _ThreadName=Thread-8] [timeMillis: 1527692929148] [levelValue: 800] [[
[] INFO : (DummyApplicationFunctionJPADAO) EntityManagerFactory located under resource lookup name [null], resource name=AuthorizationPU]]
Needs to be parsed and tokenized into this form:
timestamp            2018-05-31T11:08:49.148-0400
ProductName-Version  glassfish 4.1
LogLevel             INFO
MessageID
LoggerName
KeyValuePairs        tid: _ThreadID=43 _ThreadName=Thread-8
Message              [] INFO : (DummyApplicationFunctionJPADAO) EntityManagerFactory located under resource lookup name [null], resource name=AuthorizationPU
In the above, the key-value pairs repeat and are variable; for simplicity I can store them all as one long string.
From what I have gathered about CloudWatch, the Subscription Filter Pattern regex support seems very limited, and I am really not sure how to fit the above pattern. For a Lambda function that pushes the data to ES, I have not seen AWS docs or examples that use Lambda as the means to parse and push to ES.
I would appreciate it if someone could advise what/where the best option is to parse CW logs before they get into ES: the Subscription Filter Pattern, the Lambda function, or any other way.
Thank you.
From what I can see, your best bet is what you're suggesting: a CloudWatch-log-triggered Lambda that reformats the logged data into your preferred ES format and then posts it into ES.
You'll need to subscribe this Lambda to your CloudWatch logs. You can do this from the Lambda console or the CloudWatch console (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html).
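If you prefer the AWS CLI, the equivalent is roughly this (log group name, filter name, and function ARN are placeholders, and the function also needs a resource policy allowing CloudWatch Logs to invoke it):
aws logs put-subscription-filter \
  --log-group-name /ecs/my-app \
  --filter-name ship-to-es \
  --filter-pattern "" \
  --destination-arn arn:aws:lambda:us-east-1:123456789012:function:cw-to-es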
The Lambda's event payload will be { "awslogs": { "data": "encoded-logs" } }, where encoded-logs is a Base64 encoding of gzipped JSON.
The sample event (https://docs.aws.amazon.com/lambda/latest/dg/eventsources.html#eventsources-cloudwatch-logs), for example, can be decoded in node using:
const zlib = require('zlib');
const data = event.awslogs.data;
const gzipped = Buffer.from(data, 'base64');
const json = zlib.gunzipSync(gzipped);
const logs = JSON.parse(json);
console.log(logs);
/*
{ messageType: 'DATA_MESSAGE',
owner: '123456789123',
logGroup: 'testLogGroup',
logStream: 'testLogStream',
subscriptionFilters: [ 'testFilter' ],
logEvents:
[ { id: 'eventId1',
timestamp: 1440442987000,
message: '[ERROR] First test message' },
{ id: 'eventId2',
timestamp: 1440442987001,
message: '[ERROR] Second test message' } ] }
*/
From what you've outlined, you'll want to extract the logEvents array and parse each entry's message; a rough sketch of that step follows below. I'm happy to give some more help on this too if you need it (but I'll need to know what language you're writing your Lambda in; there are libraries for tokenizing ODL, so hopefully it's not too hard).
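For instance, in node that step could look roughly like this (the regex only captures the first six bracketed fields from your example and is an illustration, not a complete ODL parser):
// assumes `logs` is the decoded payload from the snippet above
const odlPattern = /^\[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\]/;
const records = logs.logEvents
  .map(evt => {
    const m = evt.message.match(odlPattern);
    if (!m) return null; // skip lines that are not ODL-formatted
    return {
      timestamp: m[1],
      productNameVersion: m[2],
      logLevel: m[3],
      messageId: m[4],
      loggerName: m[5],
      keyValuePairs: m[6],
      raw: evt.message,
    };
  })
  .filter(Boolean);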
At this point you can then POST these new records directly into your AWS ES domain. Somewhat cryptically, the S3-to-ES guide gives a good outline of how to do this in python: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-s3-lambda-es
You can find a full example for a lambda that does all this (by someone else) here: https://github.com/blueimp/aws-lambda/tree/master/cloudwatch-logs-to-elastic-cloud

Unable to load data into parquet file format?

I am trying to parse log data into Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the Hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HS2 server.
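For example, a quick Hive session applying the fix to the tables from the question might look like this (the jar path is illustrative; adjust it to wherever hive-contrib*.jar actually lives on your cluster):
-- illustrative path; use the hive-contrib jar location for your HDP version
add jar /usr/hdp/current/hive-client/lib/hive-contrib.jar;
-- then re-run the failing conversion
insert into log_par select * from log;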
Please check these further references:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

Spark: Avro RDD to CSV

I am able to read an Avro file into an avroRDD and am trying to convert it into a csvRDD containing all the values comma-separated. With the following code I am able to read a specific field into the csvRDD:
val csvRDD = avroRDD.map({case (u, _) => u.datum.get("empname")})
How can I read all the values into the csvRDD instead of specifying field names? My resulting csvRDD should contain records as follows:
(100,John,25,IN)
(101,Ricky,38,AUS)
(102,Chris,68,US)
Using Spark 1.2+ with the Spark-Avro integration library by Databricks, you can convert an Avro RDD to a CSV RDD as follows:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
val csv = episodes.map(_.mkString(","))
Running csv.collect().foreach(println) using this sample avro file prints
The Eleventh Hour,3 April 2010,11
The Doctor's Wife,14 May 2011,11
Horror of Fang Rock,3 September 1977,4
An Unearthly Child,23 November 1963,1
The Mysterious Planet,6 September 1986,6
Rose,26 March 2005,9
...
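Note that sqlContext.avroFile comes from the Databricks spark-avro package, so it must be on the classpath (and imported, as above). A rough sbt excerpt, with the artifact version assumed for the Spark 1.2/1.3 line; adjust it to your build:
// build.sbt (assumed coordinates for the Databricks spark-avro library)
libraryDependencies += "com.databricks" %% "spark-avro" % "1.0.0"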
