I am working on an open-source project (https://github.com/google/science-journal/tree/master/OpenScienceJournal). With this application, I can record an experiment. Recorded experiments are stored with the .proto extension. I tried to compile them to generate classes, but that failed.
Is there any way to open this kind of file?
In Protocol Buffers, .proto files are usually the text-based schema DSL that describes messages, not data; however, it is possible that these files do indeed contain binary data instead (just... unusual). Double-check the files: if they look like this:
message Foo {
    int32 bar = 1;
    // etc
}
then it is the schema; if it is binary-looking, it is probably data.
As to how to read it: the simplest option is to already have the schema. If you don't, the data is technically ambiguous - you can probably reverse-engineer it by examining the data, but it can be awkward. You may find tools such as https://protogen.marcgravell.com/decode useful for that purpose.
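If you have protoc installed locally, its --decode_raw mode gives a similar rough, schema-less dump of a binary file (it shows field numbers and wire values only, no field names):
protoc --decode_raw < experiment.proto
(the file name here is just a placeholder for one of your recorded files).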
Once you have a schema and the data, you would:
generate the necessary stubs in your chosen platform from the schema (https://protogen.marcgravell.com/ may be useful here)
then: use the protobuf library's "deserialize" API for your chosen platform to load the data into an object model
finally: inspect the object model, now populated with the data
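To make that concrete, here's a rough Java sketch (not specific to Science Journal): it assumes you generated stubs from the schema above with something like protoc --java_out=src/main/java experiment.proto, and that the recorded file contains one serialized Foo message.

import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadRecording {
    public static void main(String[] args) throws Exception {
        // read the raw bytes of the recorded file (name is a placeholder)
        byte[] bytes = Files.readAllBytes(Paths.get("experiment.proto"));
        // every generated message class exposes a static parseFrom(byte[])
        Foo foo = Foo.parseFrom(bytes);
        // the populated object model; toString() prints the field values
        System.out.println(foo);
    }
}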
Hey there StackOverflow community,
I have a question regarding nested Avro schemas, and what would be a best practice on how to store them in the schema registry when using them with Kafka.
TL;DR & Question: What’s the best practice for storing complex, nested types inside an Avro schema registry?
a) all subtypes as a separate subject (like demonstrated below)
b) a nested supertype as a single subject, containing all subtypes
c) something different altogether?
A little context: Our schema consists of a main type that has a few complex subtypes (with some of the subtypes themselves having subtypes). To keep things clean, we moved every complex type into its own *.avsc file. This leaves us with ~10 *.avsc files. All messages we produce have the main type; subtypes are never sent separately.
For uploading/registering the schema, we use a Gradle plugin. In order for this to work, we need to fully specify every subtype as a separate subject and then define the references between them, like so (in build.gradle.kts):
schemaRegistry {
    url.set("https://$schemaRegistryPath")
    register {
        subject("SubSubType1", "$projectDir/src/main/avro/SubSubType1.avsc", "AVRO")
        subject("SubType1", "$projectDir/src/main/avro/SubType1.avsc", "AVRO")
            .addReference("SubSubType1", "SubSubType1", -1)
        subject("MyMainType", "$projectDir/src/main/avro/MyMainType.avsc", "AVRO")
            .addReference("SubType1", "SubSubType1", -1)
        // remaining config omitted for brevity
    }
}
This results in every subtype being registered in the schema registry as a separate subject:
curl -X GET http://schema-registry:8085/subjects
["MyMainType","Subtype1","Subtype2","Subtype3","SubSubType1","SubSubType2"]%
This feels awkward; we only ever produce Kafka messages with a payload of MyMainType, so I only need that type in the registry, with all subtypes nested inside, like so:
curl -X GET http://schema-registry:8085/subjects
["MyMainType"]%
It appears this isn't possible with this particular Gradle plugin; however, other plugins seem to handle this identically. So apparently, when Avro subtypes are specified in separate files, the only way to register them is as separate subjects.
What should I do here? Register all subtypes, or merge all *.avsc into one big file?
Thanks for any pointers everybody!
Unfortunately, there doesn't seem to be a whole lot of information available on this topic, but this is what I found out regarding your options with complex Avro schemas:
for simple schemas with few complex types, use Avro Schemas (*.avsc)
for more complex schemas and loads of nesting, use Avro Interface Definitions (*.avdl) - these natively support imports
So it would probably be worthwhile to convert the definitions to *.avdl. In case you insist on keeping your *.avsc style definitions, there are Maven plugins available for merging these (see https://michalklempa.com/2020/04/composing-avro-schemas-from-subtypes/).
However, the impression that I get is that whenever things get complex, it would be preferable to use Avro IDL. This blog post supports this hypothesis.
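To illustrate the import support (a purely hypothetical sketch, not your actual types), a main-type *.avdl could pull the subtypes in directly:

@namespace("com.example")
protocol MyMainTypeProtocol {
  // SubType1.avdl would declare record SubType1 inside its own protocol
  import idl "SubType1.avdl";

  record MyMainType {
    SubType1 subType1;
    string id;
  }
}

avro-tools can then flatten this back into *.avsc files via its idl2schemata command if other tooling still needs them.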
I have a question about Go project structure.
Assume that this is my project structure, at a high level:
The project fzr defines a YAML structure, which I need to parse, and I need to provide functions to get data on top of this YAML file content.
the model package contains all the structs
the provider package contains the yaml.Unmarshal call that parses the YAML into those structs and provides objects containing all the YAML file data
Assume that I need to provide functions on top of the structs data, such as:
getUserApps
getServices
getUserServices
getApps
getUserByIde
etc..
Where should these functions be placed? Maybe in a newly created package under fzr? I don't want to use the flat layout.
Of course, I could place some files under the provider package that contain all of these functions, but I'm not sure whether that would be clean. Go package structure is quite confusing to me.
I have an interface which has certain properties defined.
For example:
interface Student {
    name: string;
    dob: string;
    age: number;
    city: string;
}
I read a JSON file which contains a record in this format and assign it to a variable.
let s1:Student = require('./student.json');
Now, I want to verify if s1 contains all the properties mentioned in interface Student. At runtime this is not validated. Is there any way I can do this?
There is the option of type guards, but that won't serve the purpose here: I do not know which fields will come from the JSON file. I also cannot add a discriminator (no data manipulation allowed).
Without explicitly writing code to do so, this isn't possible in TypeScript. Why? Because once it's gone through the compiler, your code looks like this:
let s1 = require('./student.json');
Everything relating to types gets erased once compilation completes, leaving you with just the pure JavaScript. TypeScript will never emit code to verify that the type checks will actually hold true at runtime - this is explicitly outside of the language's design goals.
So, unfortunately, if you want this functionality, you're going to have to write checks like if (s1.name), if (s1.dob), etc. yourself.
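A minimal sketch of such a check, using the Student fields from the question (purely illustrative):

const s1 = require('./student.json');

// verify that every property declared on the Student interface is present
const required = ['name', 'dob', 'age', 'city'];
const missing = required.filter(key => !(key in s1));
if (missing.length > 0) {
    throw new Error(`student.json is missing fields: ${missing.join(', ')}`);
}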
(That said, it is worth noting that there are third-party projects which aim to add runtime type checking to TypeScript, but they're still experimental and it's doubtful they'll ever become part of the actual TypeScript language.)
Please excuse me if this question is silly; I am finding it difficult to grasp what this really means. When I read 'Hadoop: The Definitive Guide', it says that the best advantage of Avro is that code generation is optional. This link has a program for Avro serialization/deserialization with and without code generation. Could someone help me understand exactly what with/without code generation means, and the real context of the same?
It's not a silly question -- it's actually a very important aspect of Avro.
With code-generation usually means that before compiling your Java application, you have an Avro schema available. You, as a developer, will use an Avro compiler to generate a class for each record in the schema and you use these classes in your application.
In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc, and then uses the student_marks class directly.
In this case, each instance of the class student_marks inherits from SpecificRecord, with custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).
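In code, reading such a container file with the generated class might look roughly like this (a sketch; the file name students.avro is assumed):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.specific.SpecificDatumReader;

public class ReadSpecific {
    public static void main(String[] args) throws Exception {
        // the reader is bound to the generated student_marks class
        DataFileReader<student_marks> reader = new DataFileReader<>(
                new File("students.avro"), new SpecificDatumReader<>(student_marks.class));
        while (reader.hasNext()) {
            student_marks record = reader.next();
            System.out.println(record.getStudentId()); // typed getter from the generated class
        }
        reader.close();
    }
}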
Without code-generation usually means that your application isn't tied to one specific schema known in advance (for example, it can handle different kinds of data).
In this case, there's no student class generated, but you can still read Avro records from an Avro container file. You won't have instances of student, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can use methods like get("student_id") or get(0).
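The generic equivalent of the same read, with no generated class involved, might look like this (again assuming a students.avro container file):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadGeneric {
    public static void main(String[] args) throws Exception {
        // the schema is read from the container file itself
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("students.avro"), new GenericDatumReader<>());
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            System.out.println(record.get("student_id")); // fields accessed by name, not by getter
        }
        reader.close();
    }
}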
Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.
A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.
I have some types of data that I have to upload on HDFS as Sequence Files.
Initially, I had thought of creating a .jr file at runtime, depending on the type of schema, and using Hadoop's rcc DDL tool to create these classes and use them.
But looking at the rcc documentation, I see that it has been deprecated, so I was trying to see what other options I have for creating these value classes per type of data.
This is a problem, because I only learn the metadata of the data to be loaded at runtime, along with the data stream. So I have no choice but to create the value class at runtime, use it for writing (key, value) pairs to a SequenceFile.Writer, and finally save the file to HDFS.
Is there any solution for this problem?
You can try looking at other serialization frameworks, like Protocol Buffers, Thrift, or Avro. You might want to look at Avro first, since it doesn't require static code generation, which might make it more suitable for you.
Or, if you want something really quick and dirty, each record in the SequenceFile can be a HashMap-style map where the keys are the field names and the values are the field values.
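A sketch of that quick-and-dirty approach using Hadoop's MapWritable (the Writable counterpart of a HashMap); the path, key, and field names below are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteDynamicRecords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/records.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(MapWritable.class));

        // one record: field name -> field value, discovered at runtime
        MapWritable record = new MapWritable();
        record.put(new Text("user_id"), new IntWritable(42));
        record.put(new Text("city"), new Text("Berlin"));

        writer.append(new Text("record-1"), record);
        writer.close();
    }
}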