Flink consuming S3 parquet file Kryo serialization error - protocol-buffers

We want to consume parquet files from S3.
My code snippet is below. My input files are protobuf-encoded parquet files; the protobuf class is Pageview.class.
import com.twitter.chill.protobuf.ProtobufSerializer;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.proto.ProtoParquetInputFormat;
import org.apache.hadoop.fs.Path;
import scala.Tuple2;
public class ParquetReadJob {
    public static void main(String... args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        ee.getConfig().registerTypeWithKryoSerializer(StandardLog.Pageview.class, ProtobufSerializer.class);
        String path = args[0];
        Job job = Job.getInstance();
        job.setInputFormatClass(ProtoParquetInputFormat.class);
        HadoopInputFormat<Void, StandardLog.Pageview> hadoopIF =
                new HadoopInputFormat<>(new ProtoParquetInputFormat<>(), Void.class, StandardLog.Pageview.class, job);
        ProtoParquetInputFormat.addInputPath(job, new Path(path));
        DataSource<Tuple2<Void, StandardLog.Pageview>> dataSet = ee.createInput(hadoopIF).setParallelism(10);
        dataSet.print();
    }
}
It always fails with the following error:
com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
supportCrtSize_ (access.Access$AdPositionInfo)
adPositionInfo_ (access.Access$AccessRequest)
accessRequest_ (com.adshonor.proto.StandardLog$Pageview$Builder)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:315)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:216)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1055)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
... 23 more
Can anyone advise me on how to write a batch processing program that can consume this kind of file?

I also encountered this issue.
I found this and this in a pending PR for flink-protobuf which solved it.
You would need to add the NonLazyProtobufSerializer and ProtobufKryoSerializer classes to your project
and register NonLazyProtobufSerializer as a default Kryo serializer for the Message type:
env.getConfig().addDefaultKryoSerializer(Message.class, NonLazyProtobufSerializer.class);
From the author's JavaDocs:
This is a workaround for an issue that surfaces when consuming a DataSource from Kafka in a Flink
TableEnvironment. For fields declared with type 'string' in .proto, the corresponding field on the
Java class has declared type 'Object'. The actual type of these fields on objects returned by
Message.parseFrom(byte[]) is 'ByteArray'. But the getter methods for these fields return 'String',
lazily replacing the underlying ByteArray field with a String, when necessary.
Hope this helps.
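For context, a minimal sketch of how that registration might sit alongside the original job setup (assuming you have copied NonLazyProtobufSerializer from the pending PR into your own project; the package below is made up):
import com.google.protobuf.Message;
import org.apache.flink.api.java.ExecutionEnvironment;
// assumed location of the class copied from the pending flink-protobuf PR
import com.example.flink.serde.NonLazyProtobufSerializer;
public class ParquetReadJobWithWorkaround {
    public static void main(String... args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        // Register the workaround serializer for every protobuf Message subtype,
        // so Kryo no longer trips over the lazily materialized string fields.
        ee.getConfig().addDefaultKryoSerializer(Message.class, NonLazyProtobufSerializer.class);
        // ...then build the ProtoParquetInputFormat/HadoopInputFormat source
        // and call ee.createInput(...) exactly as in the question.
    }
}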

Related

AWS Lambda Java - Implement simple cache to read a file

I have a Lambda process in Java that reads a JSON file containing a table every time it is triggered. I'd like to implement a simple cache to keep that file in memory, and I wonder how to do something simple. I don't want to use ElastiCache or Redis.
I read about a similar approach in JavaScript that declares a global variable with let, but I'm not sure how to do it in Java: where should the variable be declared, and how do I test it? Any idea or example you can provide? Thanks
There are global variables in Lambda which can help, but they have to be used wisely.
They are the variables declared outside of the handler function.
There are pros and cons to using them:
You can't rely on this behavior, but you must be aware it exists. When you call your Lambda function several times, you MIGHT get the same container, which optimises run duration and setup delay (see the discussion on use of global variables). At the same time you should be aware of, and avoid, misusing it (caching issues).
If you don't want to use ElastiCache/Redis then I guess you have very few options left; maybe DynamoDB or S3 is all I can think of.
Again, the connection to DynamoDB or S3 can be cached here. It won't be as fast as ElastiCache though.
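For instance, a minimal sketch (AWS SDK for Java v1; the bucket and key names are made up) of caching both the S3 client and the fetched file outside the handler, so warm invocations reuse them:
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
public class CachedFileHandler implements RequestHandler<Map<String, String>, String> {
    // Both fields live for the lifetime of the container, not of a single request.
    private static final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private static String cachedJson;
    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        if (cachedJson == null) {
            // Cold start: fetch the file once; warm invocations skip this call.
            cachedJson = s3.getObjectAsString("my-config-bucket", "table.json");
        }
        return cachedJson;
    }
}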
In Java it's not too hard to do. Just create your cache outside of the handler:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.amazonaws.services.lambda.runtime.RequestStreamHandler;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
public class SampleHandler implements RequestStreamHandler {
    private static final Logger logger = LogManager.getLogger(SampleHandler.class);
    private static Map<String, String> theCache = null;

    public SampleHandler() {
        logger.info("filling cache...");
        theCache = new HashMap<>();
        theCache.put("key1", "value1");
        theCache.put("key2", "value2");
        theCache.put("key3", "value3");
        theCache.put("key4", "value4");
        theCache.put("key5", "value5");
    }

    public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) throws IOException {
        logger.info("handlingRequest");
        LambdaLogger lambdaLogger = context.getLogger();
        ObjectMapper objectMapper = new ObjectMapper();
        JsonNode jsonNode = objectMapper.readTree(inputStream);
        String requestedKey = jsonNode.get("requestedKey").asText();
        if (theCache.containsKey(requestedKey)) {
            // read from the cache
            String result = "{\"requestedValue\": \"" + theCache.get(requestedKey) + "\"}";
            outputStream.write(result.getBytes());
        }
        logger.info("done with run, remaining time in ms is " + context.getRemainingTimeInMillis());
    }
}
(Run with the AWS CLI using aws lambda invoke --function-name lambda-cache-test --payload '{"requestedKey":"key4"}' out, with the output going to the file out.)
When this runs with a "cold start" you'll see the "filling cache..." message and then "handlingRequest" in the CloudWatch log. As long as the Lambda is kept "warm" you will not see the cache message again.
Note that if you had hundreds of the same Lambdas running, they would each have their own independent cache. Ultimately this does what you want though - it's a lazy load of the cache during a cold start, and the cache is reused for warm calls.

Flink throwing serialization error when reading from HBase

When I read from HBase using a RichFlatMapFunction inside a map I am getting a serialization error. What I am trying to do is: if an element of the datastream equals a particular string, read from HBase, otherwise ignore it. Below is the sample program and the error I am getting.
package com.abb.Flinktest
import java.text.SimpleDateFormat
import java.util.Properties
import scala.collection.concurrent.TrieMap
import org.apache.flink.addons.hbase.TableInputFormat
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.java.tuple.Tuple2
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.createTypeInformation
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.util.Collector
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.Level
import org.apache.flink.api.common.functions.RichMapFunction
object Flinktesthbaseread {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.createLocalEnvironment()
    val kafkaStream = env.fromElements("hello")
    val c = kafkaStream.map(x => if (x.equals("hello")) kafkaStream.flatMap(new ReadHbase()))
    env.execute()
  }

  class ReadHbase extends RichFlatMapFunction[String, Tuple11[String, String, String, String, String, String, String, String, String, String, String]] with Serializable {
    var conf: org.apache.hadoop.conf.Configuration = null;
    var table: org.apache.hadoop.hbase.client.HTable = null;
    var hbaseconnection: org.apache.hadoop.hbase.client.Connection = null
    var taskNumber: String = null;
    var rowNumber = 0;
    val serialVersionUID = 1L;

    override def open(parameters: org.apache.flink.configuration.Configuration) {
      println("getting table")
      conf = HBaseConfiguration.create()
      val in = getClass().getResourceAsStream("/hbase-site.xml")
      conf.addResource(in)
      hbaseconnection = ConnectionFactory.createConnection(conf)
      table = new HTable(conf, "testtable");
      // this.taskNumber = String.valueOf(taskNumber);
    }

    override def flatMap(msg: String, out: Collector[Tuple11[String, String, String, String, String, String, String, String, String, String, String]]) {
      // flatmap operation here
    }

    override def close() {
      table.flushCommits();
      table.close();
    }
  }
}
Error:
log4j:WARN No appenders could be found for logger (org.apache.flink.api.scala.ClosureCleaner$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:172)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:164)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:617)
at org.apache.flink.streaming.api.scala.DataStream.clean(DataStream.scala:959)
at org.apache.flink.streaming.api.scala.DataStream.map(DataStream.scala:484)
at com.abb.Flinktest.Flinktesthbaseread$.main(Flinktesthbaseread.scala:45)
at com.abb.Flinktest.Flinktesthbaseread.main(Flinktesthbaseread.scala)
Caused by: java.io.NotSerializableException: org.apache.flink.streaming.api.scala.DataStream
- field (class "com.abb.Flinktest.Flinktesthbaseread$$anonfun$1", name: "kafkaStream$1", type: "class org.apache.flink.streaming.api.scala.DataStream")
- root object (class "com.abb.Flinktest.Flinktesthbaseread$$anonfun$1", <function1>)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1182)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:301)
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:170)
... 6 more
I tried wrapping the field inside a method and a class, and making the class serializable as well, but no luck. Could someone shed some light on this or suggest a workaround?
The problem is that you're trying to access the kafkaStream variable in the map function, and a DataStream is simply not serializable. It is just an abstract representation of the data; it doesn't contain anything, which invalidates your function in the first place.
Instead, do something like this:
kafkaStream.filter(x => x.equals("hello")).flatMap(new ReadHBase())
The filter function will only retain the elements for which the condition is true, and those will be passed to your flatMap function.
I would highly recommend you read the basic API concepts documentation, as there appears to be some misunderstanding as to what actually happens when you specify a transformation.
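For comparison, here is a self-contained sketch of the same filter-then-flatMap pattern in Flink's Java DataStream API (the asker's code is Scala; HbaseLookup is just a stand-in for the ReadHbase function):
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class FilterThenFlatMap {
    // Stand-in for ReadHbase: open() would create the HBase connection on the
    // worker, and flatMap() would do the lookup for each surviving element.
    public static class HbaseLookup extends RichFlatMapFunction<String, String> {
        @Override
        public void flatMap(String msg, Collector<String> out) {
            out.collect("looked up: " + msg);
        }
    }
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        DataStream<String> kafkaStream = env.fromElements("hello", "ignore-me");
        // Keep only the matching elements, then hand them to the rich function;
        // no DataStream is captured inside another operator's closure.
        kafkaStream
            .filter(value -> value.equals("hello"))
            .flatMap(new HbaseLookup())
            .print();
        env.execute("filter-then-flatmap");
    }
}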

Access Data from REST API in HIVE

Is there a way to create a Hive table where the location for that table is an HTTP JSON REST API? I don't want to import the data into HDFS every time.
I encountered a similar situation in a project a couple of years ago. This is a low-key way of ingesting data from a RESTful source into HDFS, after which you use Hive analytics to implement the business logic. I hope you are familiar with core Java and MapReduce (if not, you might look into Hortonworks DataFlow, HDF, which is a Hortonworks product).
Step 1: Your data ingestion workflow should not be tied to the Hive workflow that contains your business logic. It should be executed independently, in a timely manner based on your requirements (volume and velocity of the data flow), and monitored regularly. I am writing this code in a text editor. WARNING: it's not compiled or tested!
The code below uses a Mapper which takes in the URL (or can be tweaked to accept a list of URLs from the file system). The payload, i.e. the requested data, is stored as a text file in the specified job output directory (forget about the structure of the data for now).
Mapper Class:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class HadoopHttpClientMap extends Mapper<LongWritable, Text, Text, Text> {
    private int file = 0;
    private String jobOutDir;
    private String taskId;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        jobOutDir = context.getOutputValueClass().getName();
        taskId = context.getJobID().toString();
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Path httpDest = new Path(jobOutDir, taskId + "_http_" + (file++));
        InputStream is = null;
        OutputStream os = null;
        URLConnection connection;
        try {
            connection = new URL(value.toString()).openConnection();
            // implement connection timeout logics
            // authenticate.. etc
            is = connection.getInputStream();
            os = FileSystem.getLocal(context.getConfiguration()).create(httpDest, true);
            IOUtils.copyBytes(is, os, context.getConfiguration(), true);
        } catch (Throwable t) {
            t.printStackTrace();
        } finally {
            IOUtils.closeStream(is);
            IOUtils.closeStream(os);
        }
        context.write(value, null);
        // context.write(new Text(httpDest.getName()), new Text(os.toString()));
    }
}
Mapper Only Job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class HadoopHttpClientJob {
    private static final String data_input_directory = "YOUR_INPUT_DIR";
    private static final String data_output_directory = "YOUR_OUTPUT_DIR";

    public HadoopHttpClientJob() {
    }

    public static void main(String... args) {
        try {
            Configuration conf = new Configuration();
            Path test_data_in = new Path(data_input_directory, "urls.txt");
            Path test_data_out = new Path(data_output_directory);
            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "HadoopHttpClientMap" + System.currentTimeMillis());
            job.setJarByClass(HadoopHttpClientJob.class);
            FileSystem fs = FileSystem.get(conf);
            fs.delete(test_data_out, true);
            job.setMapperClass(HadoopHttpClientMap.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, test_data_in);
            FileOutputFormat.setOutputPath(job, test_data_out);
            job.waitForCompletion(true);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
Step 2: Create an external table in Hive on top of the HDFS directory. Remember to use a Hive SerDe for the JSON data (in your case); then you can copy the data from the external table into managed master tables. This is the step where you implement your incremental logic, compression, etc.
Step 3: Point your Hive queries (which you might have already created) at the master table to implement your business needs.
Note: If you are referring to realtime analysis or a streaming API, you might have to change your application's architecture. Since you have asked an architectural question, I am using my best educated guess to support you. Please go through this once; if you feel you can implement it in your application, then you can ask specific questions and I will try my best to address them.

Meaning of context.getConfiguration() in Hadoop

I have a doubt about this code that searches by argument.
What is the meaning of context.getConfiguration().get("Uid2Search")?
package SearchTxnByArg;
// This is the Mapper Program for SearchTxnByArg
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String Txn = value.toString();
        String TxnParts[] = Txn.split(",");
        String Uid = TxnParts[2];
        String Uid2Search = context.getConfiguration().get("Uid2Search");
        if (Uid.equals(Uid2Search)) {
            context.write(null, value);
        }
    }
}
Driver Program
package SearchTxnByArg;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("Uid2Search", args[0]);
        Job job = new Job(conf, "Map Reduce Search Txn by Arg");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMap.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I don't know how you have written your driver program. But in my experience, if you set a property either with the -D option on the command line or via System.setProperty, these values will by default end up in the context's Configuration (a ToolRunner sketch below the quoted documentation illustrates the -D route).
As per the documentation:
Configurations are specified by resources. A resource contains a set
of name/value pairs as XML data. Each resource is named by either a
String or by a Path. If named by a String, then the classpath is
examined for a file with that name. If named by a Path, then the local
filesystem is examined directly, without referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two
resources, loaded in-order from the classpath: core-default.xml :
Read-only defaults for hadoop. core-site.xml: Site-specific
configuration for a given hadoop installation. Applications may add
additional resources, which are loaded subsequent to these resources
in the order they are added.
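To illustrate the -D route mentioned above, here is a minimal sketch (the SearchDriver class is hypothetical) of a driver that goes through ToolRunner, so that a command-line option such as -DUid2Search=42 lands in the Configuration that the mapper later reads via context.getConfiguration():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class SearchDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();          // already contains the -D values
        System.out.println("Uid2Search = " + conf.get("Uid2Search"));
        // ...build and submit the Job with this conf, as in the question's driver...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SearchDriver(), args));
    }
}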
Please see this answer as well
The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress,
to set application-level status messages,
to update Counters,
to indicate they are alive,
and to get the values stored in the job configuration across the map/reduce phases (see the sketch below).
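For example, a minimal mapper sketch (the class name is made up, and it reuses the question's Uid2Search key) touching each of those Context uses:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class ContextUsageMap extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Read a value the driver placed in the job configuration.
        String uid2Search = context.getConfiguration().get("Uid2Search");

        // Update a counter and a status message; both show up in the job UI.
        context.getCounter("Search", "RecordsSeen").increment(1);
        context.setStatus("searching for " + uid2Search);

        // Tell the framework this task is still alive during long processing.
        context.progress();

        if (uid2Search != null && value.toString().contains(uid2Search)) {
            context.write(NullWritable.get(), value);
        }
    }
}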

camus-example work with kafka

My use case is that I want to push Avro data from Kafka to HDFS. Camus seems to be the right tool; however, I am not able to make it work.
I am new to Camus and am trying to make camus-example work:
https://github.com/linkedin/camus
However, I am still facing issues.
Code snippet for DummyLogKafkaProducerClient:
package com.linkedin.camus.example.schemaregistry;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Random;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder;
import com.linkedin.camus.example.records.DummyLog;
public class DummyLogKafkaProducerClient {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:6667");
        // props.put("serializer.class", "kafka.serializer.StringEncoder");
        // props.put("partitioner.class", "example.producer.SimplePartitioner");
        // props.put("request.required.acks", "1");
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, byte[]> producer = new Producer<String, byte[]>(config);
        KafkaAvroMessageEncoder encoder = get_DUMMY_LOG_Encoder();
        for (int i = 0; i < 500; i++) {
            KeyedMessage<String, byte[]> data = new KeyedMessage<String, byte[]>("DUMMY_LOG", encoder.toBytes(getDummyLog()));
            producer.send(data);
        }
    }

    public static DummyLog getDummyLog() {
        Random random = new Random();
        DummyLog dummyLog = DummyLog.newBuilder().build();
        dummyLog.setId(random.nextLong());
        dummyLog.setLogTime(new Date().getTime());
        Map<CharSequence, CharSequence> machoStuff = new HashMap<CharSequence, CharSequence>();
        machoStuff.put("macho1", "abcd");
        machoStuff.put("macho2", "xyz");
        dummyLog.setMuchoStuff(machoStuff);
        return dummyLog;
    }

    public static KafkaAvroMessageEncoder get_DUMMY_LOG_Encoder() {
        KafkaAvroMessageEncoder encoder = new KafkaAvroMessageEncoder("DUMMY_LOG", null);
        Properties props = new Properties();
        props.put(KafkaAvroMessageEncoder.KAFKA_MESSAGE_CODER_SCHEMA_REGISTRY_CLASS, "com.linkedin.camus.example.schemaregistry.DummySchemaRegistry");
        encoder.init(props, "DUMMY_LOG");
        return encoder;
    }
}
I also added a default no-arg constructor to DummySchemaRegistry, as it was throwing an instantiation exception:
package com.linkedin.camus.example.schemaregistry;
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import com.linkedin.camus.example.records.DummyLog;
import com.linkedin.camus.example.records.DummyLog2;
import com.linkedin.camus.schemaregistry.MemorySchemaRegistry;
/**
* This is a little dummy registry that just uses a memory-backed schema registry to store two dummy Avro schemas. You
* can use this with camus.properties
*/
public class DummySchemaRegistry extends MemorySchemaRegistry<Schema> {
    public DummySchemaRegistry(Configuration conf) {
        super();
        super.register("DUMMY_LOG", DummyLog.newBuilder().build().getSchema());
        super.register("DUMMY_LOG_2", DummyLog2.newBuilder().build().getSchema());
    }

    public DummySchemaRegistry() {
        super();
        super.register("DUMMY_LOG", DummyLog.newBuilder().build().getSchema());
        super.register("DUMMY_LOG_2", DummyLog2.newBuilder().build().getSchema());
    }
}
Below is the exception trace I get after running the program:
Exception in thread "main"
com.linkedin.camus.coders.MessageEncoderException:
org.apache.avro.AvroRuntimeException:
org.apache.avro.AvroRuntimeException: Field id type:LONG pos:0 not set
and has no default value at
com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder.init(KafkaAvroMessageEncoder.java:55)
at
com.linkedin.camus.example.schemaregistry.DummyLogKafkaProducerClient.get_DUMMY_LOG_Encoder(DummyLogKafkaProducerClient.java:57)
at
com.linkedin.camus.example.schemaregistry.DummyLogKafkaProducerClient.main(DummyLogKafkaProducerClient.java:32)
Caused by: org.apache.avro.AvroRuntimeException:
org.apache.avro.AvroRuntimeException: Field id type:LONG pos:0 not set
and has no default value at
com.linkedin.camus.example.records.DummyLog$Builder.build(DummyLog.java:214)
at
com.linkedin.camus.example.schemaregistry.DummySchemaRegistry.(DummySchemaRegistry.java:16)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method) at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at java.lang.Class.newInstance(Class.java:438) at
com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageEncoder.init(KafkaAvroMessageEncoder.java:52)
... 2 more Caused by: org.apache.avro.AvroRuntimeException: Field id
type:LONG pos:0 not set and has no default value at
org.apache.avro.data.RecordBuilderBase.defaultValue(RecordBuilderBase.java:151)
at
com.linkedin.camus.example.records.DummyLog$Builder.build(DummyLog.java:209)
... 9 more
I suppose Camus expects the Avro schema to have default values. I changed my dummyLog.avsc to the following and recompiled:
{
  "namespace": "com.linkedin.camus.example.records",
  "type": "record",
  "name": "DummyLog",
  "doc": "Logs for not so important stuff.",
  "fields": [
    {
      "name": "id",
      "type": "int",
      "default": 0
    },
    {
      "name": "logTime",
      "type": "int",
      "default": 0
    }
  ]
}
Let me know if it works for you.
Thanks,
Ambarish
You can default any String or Long field as follows
{"type":"record","name":"CounterData","namespace":"org.avro.usage.tutorial","fields":[{"name":"word","type":["string","null"]},{"name":"count","type":["long","null"]}]}
Camus doesn't assume the schema will have default values. I recently used Camus and ran into the same issue. Actually, the way the schema registry is used in the default example is not correct. I have made some modifications to the Camus code; you can check out https://github.com/chandanbansal/camus - there are minor changes to make it work.
They don't have a decoder for Avro records, so I have written that as well.
I was getting this issue because I was initializing the registry like so:
super.register("DUMMY_LOG_2", LogEvent.newBuilder().build().getSchema());
When I changed it to:
super.register("logEventAvro", LogEvent.SCHEMA$);
That got me past the exception.
I also used Garry's com.linkedin.camus.etl.kafka.coders.AvroMessageDecoder.
I also found this blog (Alvin Jin's Notebook) very useful. It pinpoints every issue you could have with the Camus example and solves them!
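Applying the same change to the question's registry would look roughly like this (a sketch only; the registry keys are whatever your camus.properties expects):
package com.linkedin.camus.example.schemaregistry;
import org.apache.avro.Schema;
import com.linkedin.camus.example.records.DummyLog;
import com.linkedin.camus.example.records.DummyLog2;
import com.linkedin.camus.schemaregistry.MemorySchemaRegistry;
// Registering the generated SCHEMA$ constants avoids calling newBuilder().build(),
// which is what throws "Field id ... not set and has no default value" when the
// schema has required fields without defaults.
public class DummySchemaRegistry extends MemorySchemaRegistry<Schema> {
    public DummySchemaRegistry() {
        super();
        super.register("DUMMY_LOG", DummyLog.SCHEMA$);
        super.register("DUMMY_LOG_2", DummyLog2.SCHEMA$);
    }
}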
