Kafka Connect - JDBC sink connectivity with KSQL - Oracle
I am having an issue getting the JDBC sink connector to consume a topic created by KSQL.
I have tried the following options to make it work:
1. with key and without key
2. with schema registry and with a schema created manually
3. with AVRO and with JSON
I am facing two types of errors.
With scenario 3 the error looks like this:
[2023-02-07 07:20:27,821] ERROR WorkerSinkTask{id=oracle-sink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:223)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:149)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:513)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:493)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:332)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:189)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:244)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error of topic MY_EMPLOYEE:
at io.confluent.connect.json.JsonSchemaConverter.toConnectData(JsonSchemaConverter.java:119)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:88)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$3(WorkerSinkTask.java:513)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:173)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:207)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing JSON message for id -1
at io.confluent.kafka.serializers.json.AbstractKafkaJsonSchemaDeserializer.deserialize(AbstractKafkaJsonSchemaDeserializer.java:180)
at io.confluent.kafka.serializers.json.AbstractKafkaJsonSchemaDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaJsonSchemaDeserializer.java:235)
at io.confluent.connect.json.JsonSchemaConverter$Deserializer.deserialize(JsonSchemaConverter.java:165)
at io.confluent.connect.json.JsonSchemaConverter.toConnectData(JsonSchemaConverter.java:108)
... 17 more
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
at io.confluent.kafka.serializers.AbstractKafkaSchemaSerDe.getByteBuffer(AbstractKafkaSchemaSerDe.java:244)
at io.confluent.kafka.serializers.json.AbstractKafkaJsonSchemaDeserializer.deserialize(AbstractKafkaJsonSchemaDeserializer.java:115)
... 20 more
[2023-02-07 07:20:27,822] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask)
With scenarios 1 & 2 the error says:
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:223)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:149)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:516)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:493)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:332)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:189)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:244)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:328)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:88)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$4(WorkerSinkTask.java:516)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:173)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:207)
... 13 more
I followed a few articles and videos addressing similar issues, but none of them worked.
Reference articles:
https://forum.confluent.io/t/ksqldb-and-the-kafka-connect-jdbc-sink/187
https://github.com/confluentinc/ksql/issues/3487
My configurations and the topics I am trying to sink into the Oracle database are given below.
Scenario without key:
{
  "name": "destination-connector-simple",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "MY_STREAM1",
    "tasks.max": "1",
    "connection.url": "jdbc:oracle:thin:@oracle21:1521/orclpdb1",
    "connection.user": "c__sinkuser",
    "connection.password": "sinkpw",
    "table.name.format": "kafka_customers",
    "auto.create": "true",
    "key.ignore": "true",
    "pk.mode": "none",
    "value.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}
Scenario with key:
{
  "name": "oracle-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "MY_EMPLOYEE",
    "table.name.format": "kafka_customers",
    "connection.url": "jdbc:oracle:thin:@oracle21:1521/orclpdb1",
    "connection.user": "c__sinkuser",
    "connection.password": "sinkpw",
    "auto.create": true,
    "auto.evolve": true,
    "pk.fields": "ID",
    "insert.mode": "upsert",
    "delete.enabled": true,
    "delete.retention.ms": 100,
    "pk.mode": "record_key",
    "key.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
Topics to be consumed by the sink (any one of them would be okay):
Without key:
print 'MY_STREAM1' from beginning;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2023/02/05 18:16:16.553 Z, key: , value: {"L_EID":"101","NAME":"Dhruv","LNAME":"S","L_ADD_ID":"201"}, partition: 0
rowtime: 2023/02/05 18:16:16.554 Z, key: , value: {"L_EID":"102","NAME":"Dhruv1","LNAME":"S1","L_ADD_ID":"202"}, partition: 0
Topic with key:
ksql> print 'MY_EMPLOYEE' from beginning;
Key format: JSON or KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2023/02/05 18:16:16.553 Z, key: 101, value: {"EID":"101","NAME":"Dhruv","LNAME":"S","ADD_ID":"201"}, partition: 0
rowtime: 2023/02/05 18:16:16.554 Z, key: 102, value: {"EID":"102","NAME":"Dhruv1","LNAME":"S1","ADD_ID":"202"}, partition: 0
Topic with schema (manually created):
ksql> print 'E_SCHEMA' from beginning;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2023/02/06 20:01:25.824 Z, key: , value: {"SCHEMA":{"TYPE":"struct","FIELDS":[{"TYPE":"int32","OPTIONAL":false,"FIELD":"L_EID"},{"TYPE":"int32","OPTIONAL":false,"FIELD":"NAME"},{"TYPE":"int32","OPTIONAL":false,"FIELD":"LAME"},{"TYPE":"int32","OPTIONAL":false,"FIELD":"L_ADD_ID"}],"OPTIONAL":false,"NAME":""},"PAYLOAD":{"L_EID":"201","NAME":"Vishuddha","LNAME":"Sh","L_ADD_ID":"401"}}, partition: 0
Topic with Avro:
ksql> print 'MY_STREAM_AVRO' from beginning;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: AVRO or KAFKA_STRING
rowtime: 2023/02/05 18:16:16.553 Z, key: , value: {"L_EID": "101", "NAME": "Dhruv", "LNAME": "S", "L_ADD_ID": "201"}, partition: 0
rowtime: 2023/02/05 18:16:16.554 Z, key: , value: {"L_EID": "102", "NAME": "Dhruv1", "LNAME": "S1", "L_ADD_ID": "202"}, partition: 0
rowtime: 2023/02/05 18:16:16.553 Z, key: , value: {"L_EID": "101", "NAME": "Dhruv", "LNAME": "S", "L_ADD_ID": "201"}, partition: 0
rowtime: 2023/02/05 18:16:16.554 Z, key: , value: {"L_EID": "102", "NAME": "Dhruv1", "LNAME": "S1", "L_ADD_ID": "202"}, partition: 0
Could you please help me complete my POC in time?
With
Value format: JSON or KAFKA_STRING
you need
"value.converter.schemas.enable": "true",
And, as the error says, your JSON needs "schema" and "payload" fields, not uppercased. Refer to https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/
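For example, a minimal sketch of one record in the envelope that JsonConverter with schemas.enable=true expects. The field names mirror your MY_STREAM1 values, the struct name "employee" is just a placeholder, and note the lowercase "schema"/"payload" keys (unlike the uppercase ones in your E_SCHEMA topic):

{
  "schema": {
    "type": "struct",
    "fields": [
      {"type": "string", "optional": false, "field": "L_EID"},
      {"type": "string", "optional": false, "field": "NAME"},
      {"type": "string", "optional": false, "field": "LNAME"},
      {"type": "string", "optional": false, "field": "L_ADD_ID"}
    ],
    "optional": false,
    "name": "employee"
  },
  "payload": {"L_EID": "101", "NAME": "Dhruv", "LNAME": "S", "L_ADD_ID": "201"}
}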
In order to use JsonSchemaConverter, you would need to produce the data using the Schema Registry and the JSON Schema serializer. Refer to the ksqlDB docs on VALUE_FORMAT = JSON_SR:
https://docs.ksqldb.io/en/latest/reference/serialization/#json
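For example, a sketch of deriving a registry-backed copy of your data in ksqlDB (this assumes MY_STREAM1 is already registered as a stream in ksqlDB; the stream/topic name MY_STREAM1_SR is just a placeholder):

-- writes the same rows to a new topic and registers a JSON schema in Schema Registry
CREATE STREAM MY_STREAM1_SR
  WITH (KAFKA_TOPIC='MY_STREAM1_SR', VALUE_FORMAT='JSON_SR') AS
  SELECT * FROM MY_STREAM1;

The sink connector would then consume MY_STREAM1_SR instead of MY_STREAM1.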
Similar for Avro, but with AvroConverter and VALUE_FORMAT = AVRO.
You can ignore the keys, but the same concept applies. Ideally, you just have plain String/Integer keys (no schemas), so only set key.converter to use the respective class for those.
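Putting that together, a sketch of the converter part of the sink config for the JSON_SR case (MY_STREAM1_SR is the placeholder stream/topic from the sketch above, and the registry URL is the one from your config; for AVRO you would swap in io.confluent.connect.avro.AvroConverter):

"topics": "MY_STREAM1_SR",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081"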
In any case, JDBC sink will not accept plain JSON. You need a schema.
Related
Stream from MSK to RDS PostgreSQL with MSK Connector
I've been going in circles with this for a few days now. I'm sending data to Kafka using kafkajs. Each time I produce a message, I assign a UUID to the message.key value, and the message.value is set to an event like this and then stringified:

// the producer is written in typescript
const event = {
  eventtype: "event1",
  eventversion: "1.0.1",
  sourceurl: "https://some-url.com/source"
};
// stringified because the kafkajs producer only accepts `string` or `Buffer`
const stringifiedEvent = JSON.stringify(event);

I start my connect-standalone JDBC Sink Connector with the following configurations:

# connect-standalone.properties
name=local-jdbc-sink-connector
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
dialect.name=PostgreSqlDatabaseDialect
connection.url=jdbc:postgresql://postgres:5432/eventservice
connection.password=postgres
connection.user=postgres
auto.create=true
auto.evolve=true
topics=topic1
tasks.max=1
insert.mode=upsert
pk.mode=record_key
pk.fields=id

# worker.properties
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
value.converter.schema.registry.url=http://schema-registry:8081
key.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
bootstrap.servers=localhost:9092
group.id=jdbc-sink-connector-worker
worker.id=jdbc-sink-worker-1
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1

When I start the connector with connect-standalone worker.properties connect-standalone.properties, it spins up and connects to PostgreSQL with no issue. However, when I produce an event, it fails with this error message:

WorkerSinkTask{id=local-jdbc-sink-connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Sink connector 'local-jdbc-sink-connector' is configured with 'delete.enabled=false' and 'pk.mode=record_key' and therefore requires records with a non-null Struct value and non-null Struct schema, but found record at (topic='topic1',partition=0,offset=0,timestamp=1676309784254) with a HashMap value and null value schema. (org.apache.kafka.connect.runtime.WorkerSinkTask:609)

With this stack trace:

org.apache.kafka.connect.errors.ConnectException: Sink connector 'local-jdbc-sink-connector' is configured with 'delete.enabled=false' and 'pk.mode=record_key' and therefore requires records with a non-null Struct value and non-null Struct schema, but found record at (topic='txningestion2',partition=0,offset=0,timestamp=1676309784254) with a HashMap value and null value schema.
at io.confluent.connect.jdbc.sink.RecordValidator.lambda$requiresValue$2(RecordValidator.java:86)
at io.confluent.connect.jdbc.sink.RecordValidator.lambda$and$1(RecordValidator.java:41)
at io.confluent.connect.jdbc.sink.BufferedRecords.add(BufferedRecords.java:81)
at io.confluent.connect.jdbc.sink.JdbcDbWriter.write(JdbcDbWriter.java:74)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:85)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:581)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:189)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:244)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

I've been going back and forth trying to get it to read my messages, but I'm not sure what is going wrong. One solution just leads to another error, and the solution for the new error leads back to the previous error. What is the correct configuration? How do I resolve this?
NiFi exception while reading Avro to JSON
In NiFi, first I'm converting JSON to Avro and then from Avro back to JSON. While converting from Avro to JSON I'm getting the below exception:

2019-07-23 12:48:04,043 ERROR [Timer-Driven Process Thread-9] o.a.n.processors.avro.ConvertAvroToJSON ConvertAvroToJSON[id=1db0939d-016c-1000-caa3-80d0993c3468] ConvertAvroToJSON[id=1db0939d-016c-1000-caa3-80d0993c3468] failed to process session due to org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40; Processor Administratively Yielded for 1 sec: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
at org.apache.nifi.processors.avro.ConvertAvroToJSON$1.process(ConvertAvroToJSON.java:161)
at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2887)
at org.apache.nifi.processors.avro.ConvertAvroToJSON.onTrigger(ConvertAvroToJSON.java:148)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:209)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Below is the template file:
https://community.hortonworks.com/storage/attachments/109978-avro-to-json-and-json-to-avro.xml

The flow that I have drawn is: [NiFi flow screenshot]

The input JSON is:
{
  "name": "test",
  "company": {
    "exp": "1.5"
  }
}

The converted Avro data is:
Objavro.schema
{"type":"record","name":"MyClass","namespace":"com.acme.avro","fields":[{"name":"name","type":"string"},{"name":"company","type":{"type":"record","name":"company","fields":[{"name":"exp","type":"string"}]}}]}
avro.codec deflate [followed by deflate-compressed binary record data]
For an Avro data file the schema is embedded, so if you are writing all the fields in CSV format you don't need to set up a schema registry. If you are writing only specific columns (not all of them), or other formats such as JSON, then a schema registry is required. - Shu
See also: "NiFi Malformed Data. Length is negative" and "More about internals".
What is 152³305746
I am getting 152³305746 as some bad data in my input file. I am trying to filter it, but have been unsuccessful in telling Hive how to detect and filter it. I am not even sure what data type it is. I am expecting bigint values instead of what I see here. I have tried the things below and their various combinations, which did not help skip the few bad records I have in my input:
1) CAST(mycol AS string) RLIKE "^[0-9]+$"
2) mycol < 2147483647
3) CREATE TEMPORARY MACRO isNumber(s string) CAST(s as BIGINT) IS NOT NULL;
4) isNumber(mycol) != false
5) SET mapreduce.map.skip.maxrecords = 100000000;
None of the above approaches worked. Hive fails with the following error:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NumberFormatException: For input string: "152³305746"
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:416)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:126)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
... 9 more
Caused by: java.lang.NumberFormatException: For input string: "152³305746"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseLong(ParsePrimitiveUtils.java:49)
at org.openx.data.jsonserde.objectinspector.primitive.JavaStringLongObjectInspector.get(JavaStringLongObjectInspector.java:46)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:400)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:279)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:239)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:201)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.makeValueWritable(ReduceSinkOperator.java:565)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:395)
... 17 more
IllegalDataException from DateUtil.java when saving spark streaming dataframe to phoenix
I am using Kafka + Spark Streaming to stream messages and do analytics, then saving to Phoenix. Some Spark jobs fail several times per day with the following error message:

org.apache.phoenix.schema.IllegalDataException: java.lang.IllegalArgumentException: Invalid format: ""
at org.apache.phoenix.util.DateUtil$ISODateFormatParser.parseDateTime(DateUtil.java:297)
at org.apache.phoenix.util.DateUtil.parseDateTime(DateUtil.java:163)
at org.apache.phoenix.util.DateUtil.parseTimestamp(DateUtil.java:175)
at org.apache.phoenix.schema.types.PTimestamp.toObject(PTimestamp.java:95)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:194)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:172)
at org.apache.phoenix.expression.LiteralExpression.newConstant(LiteralExpression.java:159)
at org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:979)
at org.apache.phoenix.compile.UpsertCompiler$UpsertValuesCompiler.visit(UpsertCompiler.java:963)
at org.apache.phoenix.parse.BindParseNode.accept(BindParseNode.java:47)
at org.apache.phoenix.compile.UpsertCompiler.compile(UpsertCompiler.java:832)
at org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:578)
at org.apache.phoenix.jdbc.PhoenixStatement$ExecutableUpsertStatement.compilePlan(PhoenixStatement.java:566)
at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:331)
at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:326)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:324)
at org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:245)
at org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:172)
at org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:177)
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.write(PhoenixRecordWriter.java:79)
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.write(PhoenixRecordWriter.java:39)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Invalid format: ""
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:673)
at org.apache.phoenix.util.DateUtil$ISODateFormatParser.parseDateTime(DateUtil.java:295)

My code:

val myDF = sqlContext.createDataFrame(myRows, myStruct)
myDF.write
  .format(sourcePhoenixSpark)
  .mode("overwrite")
  .options(Map("table" -> (myPhoenixNamespace + myTable), "zkUrl" -> myPhoenixZKUrl))
  .save()

I am using phoenix-spark version 4.7.0-HBase-1.1. Any suggestion to solve the problem would be appreciated. Thanks.
You are trying to process dirty data. That error comes from here: https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/util/DateUtil.java#L301, where it's trying to parse a string that is expected to be a date in ISO format but the provided string is empty (""). You need to prepare and clean your data before attempting to write it to storage.
ERROR 2103: doing work on Longs
I have data:

store  trn_date    dept_id  sale_amt
1      2014-12-15  101      10007655
1      2014-12-15  101      10007654
1      2014-12-15  101      10007544
6      2014-12-15  104      100086544
8      2014-12-14  101      1000000
8      2014-12-15  101      100865761

I'm trying to aggregate the data using the code below (I tried loading the data both ways, using HCatLoader() and using PigStorage()):

data = LOAD 'data' USING org.apache.hcatalog.pig.HCatLoader();
group_table = GROUP data BY (store, tran_date, dept_id);
group_gen = FOREACH group_table GENERATE FLATTEN(group) AS (store, tran_date, dept_id), SUM(data.sale_amt) AS total_sale_amt;

Below is the error stack trace which I'm getting while running the job:

================================================================================
Pig Stack Trace
---------------
ERROR 2103: Problem doing work on Longs

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grouped_all: Local Rearrange[tuple]{tuple}(false) - scope-1317 Operator Key: scope-1317): org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1645)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:108)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:102)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:369)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:333)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
================================================================================

As I was looking for a solution, many said it is due to loading the data using the HCatalog loader, so I tried loading the data using PigStorage(). I am still getting the same error.
This may be because of the way you are storing the data in Hive. If any aggregation is going to happen on a column, declare its data type as integer or numeric.
Basically, each aggregation function returns data with its default data type, for example:
AVG returns DOUBLE
SUM returns DOUBLE
COUNT returns LONG
I don't think the issue is how the data is stored in Hive, because you already tried PigStorage(); it is a data type issue with the column you are passing to the aggregation. Try changing the data type before passing it to the aggregation, as in the sketch below.
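A minimal Pig Latin sketch of that idea, assuming sale_amt currently arrives as a chararray; field and alias names follow the script in the question and may need adjusting:

data = LOAD 'data' USING org.apache.hcatalog.pig.HCatLoader();
-- cast sale_amt to long before grouping so SUM operates on numbers, not strings
data_typed = FOREACH data GENERATE store, tran_date, dept_id, (long)sale_amt AS sale_amt;
group_table = GROUP data_typed BY (store, tran_date, dept_id);
group_gen = FOREACH group_table GENERATE FLATTEN(group) AS (store, tran_date, dept_id), SUM(data_typed.sale_amt) AS total_sale_amt;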