Kindly forgive me if this question is silly; I am finding it difficult to understand what it really means. When I read 'Hadoop: The Definitive Guide', it says that the best advantage of Avro is that code generation is optional in Avro. This link has a program for Avro serialization/deserialization with and without code generation. Could someone help me understand exactly what with/without code generation means, and the real context in which each is used?
It's not a silly question -- it's actually a very important aspect of Avro.
With code-generation usually means that before compiling your Java application, you have an Avro schema available. You, as a developer, will use an Avro compiler to generate a class for each record in the schema and you use these classes in your application.
In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc, and then uses the student_marks class directly.
In this case, each instance of the class student_marks inherits from SpecificRecord, with custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).
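For illustration, here is a minimal sketch of the code-generation path. It assumes the student_marks class generated from the linked student.avsc is on the classpath; the setStudentId() setter and field name are taken from the example above and may differ in your actual schema.

import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class WithCodeGen {
    public static void main(String[] args) throws Exception {
        // student_marks is the class generated by avro-tools from student.avsc
        student_marks record = new student_marks();
        record.setStudentId(1);   // typed, field-specific setter (assumed field name)

        try (DataFileWriter<student_marks> writer =
                new DataFileWriter<>(new SpecificDatumWriter<>(student_marks.class))) {
            writer.create(student_marks.getClassSchema(), new File("students.avro"));
            writer.append(record);
        }
    }
}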
Without code-generation usually means that your application doesn't depend on any one specific schema being known at compile time (for example, it may have to handle different kinds of data).
In this case, there's no student_marks class generated, but you can still read Avro records from an Avro container file. You won't have instances of student_marks, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can use methods like get("student_id") or get(0).
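And here is a matching sketch of the no-code-generation path, reading the same container file back as GenericRecords; the schema is taken from the file itself, and "student_id" is again just the assumed field name.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class WithoutCodeGen {
    public static void main(String[] args) throws Exception {
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(new File("students.avro"), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                // No typed getters: fields are looked up by name (or position)
                System.out.println(record.get("student_id"));
            }
        }
    }
}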
Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.
A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.
Hey there StackOverflow community,
I have a question regarding nested Avro schemas, and what would be a best practice on how to store them in the schema registry when using them with Kafka.
TL;DR & Question: What’s the best practice for storing complex, nested types inside an Avro schema registry?
a) all subtypes as a separate subject (like demonstrated below)
b) a nested supertype as a single subject, containing all subtypes
c) something different altogether?
A little context: Our schema consists of a main type that has a few complex subtypes (with some of the subtypes themselves having subtypes). To keep things clean, we moved every complex type to its own *.avsc file. This leaves us with ~10 *.avsc files. All messages we produce have the main type; subtypes are never sent separately.
For uploading/registering the schema, we use a Gradle plugin. In order for this to work, we need to fully specify every subtype as a separate subject and then define the references between them, like so (in build.gradle.kts):
schemaRegistry {
    url.set("https://$schemaRegistryPath")
    register {
        subject("SubSubType1", "$projectDir/src/main/avro/SubSubType1.avsc", "AVRO")
        subject("SubType1", "$projectDir/src/main/avro/SubType1.avsc", "AVRO")
            .addReference("SubSubType1", "SubSubType1", -1)
        subject("MyMainType", "$projectDir/src/main/avro/MyMainType.avsc", "AVRO")
            .addReference("SubType1", "SubType1", -1)
        // remaining config omitted for brevity
    }
}
This results in all subtypes being registered in the schema registry as a separate subject:
curl -X GET http://schema-registry:8085/subjects
["MyMainType","Subtype1","Subtype2","Subtype3","SubSubType1","SubSubType2"]%
This feels awkward; we only ever produce Kafka messages with a payload of MyMainType, so I would only need that one type in the registry, with all subtypes nested inside it, like so:
curl -X GET http://schema-registry:8085/subjects
["MyMainType"]%
This doesn't appear to be possible with this particular Gradle plugin, and other plugins seem to handle it the same way. So apparently, when Avro subtypes are specified in separate files, the only way to register them is as separate subjects.
What should I do here? Register all subtypes, or merge all *.avsc into one big file?
Thanks for any pointers everybody!
Unfortunately, there doesn't seem to be a whole lot of information available on this topic, but this is what I found out regarding your options with complex Avro schemas:
for simple schemas with few complex types, use Avro Schemas (*.avsc)
for more complex schemas with lots of nesting, use the Avro Interface Definition Language (*.avdl), which natively supports imports
So it would probably be worthwhile to convert the definitions to *.avdl. In case you insist on keeping your *.avsc style definitions, there are Maven plugins available for merging these (see https://michalklempa.com/2020/04/composing-avro-schemas-from-subtypes/).
However, the impression that I get is that whenever things get complex, it would be preferable to use Avro IDL. This blog post supports this hypothesis.
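To illustrate what the IDL import support looks like, here is a hypothetical MyMainType.avdl; the type names follow your question, but the protocol name and the fields are made up.

@namespace("com.example")
protocol MyMainTypeProtocol {
  // pulls the record definitions in from a separate IDL file
  import idl "SubType1.avdl";

  record MyMainType {
    com.example.SubType1 subType1;
    string someOtherField;
  }
}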
I am working on a new project using Spring Data JDBC because it is very easy to handle and indeed splendid.
In my scenario I have three (maybe more in the future) types of projects, so my domain model could easily be modelled with plain old Java objects using type inheritance.
First question:
As I am using Spring Data JDBC, is this (inheritance) even supported the way it is in JPA?
Second question, as an addition to the first one:
I could not find anything regarding this in the official docs, so I am assuming there are good reasons why it is not supported. Speaking of that, might I be on the wrong track modelling entities with inheritance in general?
Currently Spring Data JDBC does not support inheritance.
The reason for this is that inheritance makes things rather complicated, and it was not at all clear what the correct approach would be.
I have a couple of vague ideas how one might create something usable. Different repositories per type is one option; using a single type for persisting, but with some post-processing to obtain the correct type upon reading, is another.
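To make the second idea a little more concrete, here is a rough, hypothetical sketch of persisting one flat type and mapping it onto the inheritance hierarchy after reading; none of these names come from an official Spring Data JDBC feature.

import org.springframework.data.annotation.Id;
import org.springframework.data.repository.CrudRepository;

// One flat entity is persisted for all project kinds, with a discriminator column
class ProjectEntity {
    @Id Long id;
    String type;   // e.g. "INTERNAL" or "CUSTOMER"
    String name;
}

interface ProjectEntityRepository extends CrudRepository<ProjectEntity, Long> {}

// The domain model keeps its inheritance; it is never persisted directly
abstract class Project { String name; }
class InternalProject extends Project {}
class CustomerProject extends Project {}

class ProjectMapper {
    // Post-processing step: pick the concrete subtype based on the discriminator
    Project toDomain(ProjectEntity e) {
        Project p = "INTERNAL".equals(e.type) ? new InternalProject() : new CustomerProject();
        p.name = e.name;
        return p;
    }
}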
Is it possible to mark fields as read only in a .proto file such that when the code is generated, these fields do not have setters?
Ultimately, I think the answer here will be "no". There's a good basic guidance rule that applies to DTOs:
DTOs should generally be as simple as possible to convey the data for serialization in a manner well-suited to the specific serializer.
if that basic model is sufficient for you to work with above that layer, then fine
but if not: do not fight the serializer; instead, create a separate domain model above the DTO layer, and simply map between the two models before serialization or after deserialization
Or put another way: the fact that the generator doesn't want to expose read-only members is irrelevant, because if you need something exotic, you shouldn't be using the generated type outside of the code that directly touches serialization. So: in your domain type that mirrors the DTO: make it read-only there.
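As a rough illustration of that separation (sketched in Java; the names are made up, and the DTO shape merely stands in for whatever the code generator emits):

// Generated-style DTO: everything is settable so the serializer can populate it
final class PersonDto {
    public long id;
    public String name;
}

// Hand-written domain type: read-only, used everywhere above the serialization layer
final class Person {
    private final long id;
    private final String name;
    Person(long id, String name) { this.id = id; this.name = name; }
    long id() { return id; }
    String name() { return name; }
}

final class PersonMapper {
    static Person fromDto(PersonDto dto) { return new Person(dto.id, dto.name); }
    static PersonDto toDto(Person p) {
        PersonDto dto = new PersonDto();
        dto.id = p.id();
        dto.name = p.name();
        return dto;
    }
}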
As for why read-only fields aren't usually a thing in serialization tools: you presumably want to be able to give it a value. Serialization tools usually want to be able to write everything they can read, and read everything they can write.
Minor note for completeness since you mention C#: if you are using a code-first approach with protobuf-net, it'll work fine with {get;}-only auto-props, and with {get;}-only manual props if all public members trivially map to an obvious constructor.
I use AvroParquetInputFormat. The use case requires scanning multiple input directories, and each directory will have files with one schema. Since the AvroParquetInputFormat class could not handle multiple input schemas, I created a workaround by statically creating multiple dummy classes, like MyAvroParquetInputFormat1, MyAvroParquetInputFormat2, etc., where each class just inherits from AvroParquetInputFormat. For each directory I set a different MyAvroParquetInputFormat, and that worked (please let me know if there is a cleaner way to achieve this).
My current problem is as follows:
Each file has a few hundred columns, and based on metadata I construct a projection schema for each directory to reduce unnecessary disk and network IO. I use the static setRequestedProjection() method on each of my MyAvroParquetInputFormat classes. But, being static, the last call's projection schema is used for reading data from all directories, which is not the required behavior.
Any pointers to workarounds or solutions would be highly appreciated.
Thanks & Regards
MK
Keep in mind that if your Avro schemas are compatible (see the Avro documentation for the definition of schema compatibility), you can access all the data with a single schema. Building on this, it is also possible to construct a Parquet-friendly schema (no unions) that is compatible with all your schemas, so you can use just that one.
As for the approach you took, there is no easy way of doing this that I know of. You have to extend MultipleInputs functionality somehow to assign a different schema for each of your input formats. MultipleInputs works by setting two configuration properties in your job configuration:
mapreduce.input.multipleinputs.dir.formats   // contains a comma-separated list of InputFormat classes
mapreduce.input.multipleinputs.dir.mappers   // contains a comma-separated list of Mapper classes
These two lists must be the same length, and this is where it gets tricky: this information is used deep within the Hadoop code to initialize mappers and input formats, so that is where you would have to hook in your own code.
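For reference, this is roughly how those two properties get populated (a self-contained sketch using stand-in classes; in your job you would pass your MyAvroParquetInputFormat subclasses and your own mapper instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // One (path, input format, mapper) triple per input directory
        MultipleInputs.addInputPath(job, new Path("/data/dir1"), TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/dir2"), TextInputFormat.class, Mapper.class);

        // Both lists end up serialized into the job configuration
        System.out.println(job.getConfiguration().get("mapreduce.input.multipleinputs.dir.formats"));
        System.out.println(job.getConfiguration().get("mapreduce.input.multipleinputs.dir.mappers"));
    }
}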
As an alternative, I would suggest that you do the projection using one of the tools already available, such as Hive. If there are not too many different schemas, you can write a set of simple Hive queries to do the projection for each of the schemas, and after that you can use a single mapper to process the data, or whatever the hell you want.
I have some types of data that I have to upload on HDFS as Sequence Files.
Initially, I had thought of creating a .jr file at runtime depending on the type of schema and using Hadoop's rcc DDL tool to create these classes and use them.
But looking at rcc documentation, I see that it has been deprecated. I was trying to see what other options I have to create these value classes per type of data.
This is a problem, as I only get to know the metadata of the data to be loaded at runtime, along with the data stream. So I have no choice but to create the value class at runtime, use it for writing (key, value) pairs to SequenceFile.Writer, and finally save the result on HDFS.
Is there any solution for this problem?
You can try looking at other serialization frameworks, like Protocol Buffers, Thrift, or Avro. You might want to look at Avro first, since it doesn't require static code generation, which might be more suitable for you.
Or, if you want something really quick and dirty, each record in the SequenceFile can be a HashMap where the keys are the field names and the values are the field values.
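To make the Avro suggestion concrete, here is a small sketch that builds the schema at runtime (in your case it would be assembled from the metadata that arrives with the data stream) and writes GenericRecords without any generated classes; the schema and file name are made up for the example:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class RuntimeSchemaWriter {
    public static void main(String[] args) throws Exception {
        // In your scenario this JSON would be built from the runtime metadata
        String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"payload\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1L);
        rec.put("payload", "hello");

        // Avro container files can live on HDFS just like SequenceFiles
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("events.avro"));
            writer.append(rec);
        }
    }
}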