Issue with setting multiple projectionSchemas for AvroParquetInputFormat - hadoop

I use AvroParquetInputFormat. The use case requires scanning multiple input directories, where each directory contains files with one schema. Since AvroParquetInputFormat cannot handle multiple input schemas, I created a workaround by statically creating multiple dummy classes such as MyAvroParquetInputFormat1, MyAvroParquetInputFormat2, etc., where each class just inherits from AvroParquetInputFormat. For each directory I set a different MyAvroParquetInputFormat, and that worked (please let me know if there is a cleaner way to achieve this).
My current problem is as follows:
Each file has a few hundred columns, and based on metadata I construct a projectionSchema for each directory to reduce unnecessary disk and network IO. I use the static setRequestedProjection() method on each of my MyAvroParquetInputFormat classes. But because the method is static, the last call's projectionSchema is used for reading data from all directories, which is not the required behavior.
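For concreteness, my setup looks roughly like this (class, mapper, and path names are simplified; the parquet-avro package name and generics vary by version):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class MultiDirJobSetup {

    // Dummy subclasses whose only purpose is to give MultipleInputs a distinct
    // InputFormat class per directory.
    public static class MyAvroParquetInputFormat1 extends AvroParquetInputFormat<GenericRecord> { }
    public static class MyAvroParquetInputFormat2 extends AvroParquetInputFormat<GenericRecord> { }

    public static class Dir1Mapper extends Mapper<Void, GenericRecord, Text, Text> { }
    public static class Dir2Mapper extends Mapper<Void, GenericRecord, Text, Text> { }

    static void configure(Job job, Schema projection1, Schema projection2) {
        MultipleInputs.addInputPath(job, new Path("/data/dir1"),
                MyAvroParquetInputFormat1.class, Dir1Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/dir2"),
                MyAvroParquetInputFormat2.class, Dir2Mapper.class);

        // The problem: setRequestedProjection() is static and writes a single key
        // into the shared job configuration, so the second call overwrites the
        // first and every directory is read with projection2.
        AvroParquetInputFormat.setRequestedProjection(job, projection1);
        AvroParquetInputFormat.setRequestedProjection(job, projection2);
    }
}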
Any pointers to workarounds/solutions would be highly appreciated.
Thanks & Regards
MK

Keep in mind that if your Avro schemas are compatible (see the Avro docs for the definition of schema compatibility), you can access all the data with a single schema. Building on this, it is also possible to construct a Parquet-friendly schema (no unions) that is compatible with all your schemas, so you can use just that one.
As for the approach you took, there is no easy way of doing this that I know of. You have to extend the MultipleInputs functionality somehow to assign a different schema to each of your input formats. MultipleInputs works by setting two configuration properties in your job configuration:
mapreduce.input.multipleinputs.dir.formats //contains a comma-separated list of InputFormat classes
mapreduce.input.multipleinputs.dir.mappers //contains a comma-separated list of Mapper classes.
These two lists must be the same length. And this is where it gets tricky. This information is used deep within Hadoop code to initialize mappers and input formats, so that's where you should add your own code.
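As an untested sketch of that direction, each dummy subclass could carry its own projection under a private configuration key and install it just before the record reader is created (the key name my.projection.dir1 is made up, and whether the reader actually picks up a per-task configuration change like this would need to be verified):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.parquet.avro.AvroParquetInputFormat;
import org.apache.parquet.avro.AvroReadSupport;

public class MyAvroParquetInputFormat1 extends AvroParquetInputFormat<GenericRecord> {

    @Override
    public RecordReader<Void, GenericRecord> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // "my.projection.dir1" is a made-up key; the driver would set it with
        // job.getConfiguration().set("my.projection.dir1", projection1.toString())
        String projectionJson = context.getConfiguration().get("my.projection.dir1");
        if (projectionJson != null) {
            // per-Configuration variant of the projection setting
            AvroReadSupport.setRequestedProjection(
                    context.getConfiguration(), new Schema.Parser().parse(projectionJson));
        }
        return super.createRecordReader(split, context);
    }
}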
As an alternative, I would suggest that you do the projection using one of the tools already available, such as Hive. If there are not too many different schemas, you can write a set of simple Hive queries to do the projection for each of the schemas, and after that you can use a single mapper to process the data or whatever the hell you want.

Related

Is there a best practice for nested Avro Types with Kafka?

Hey there StackOverflow community,
I have a question regarding nested Avro schemas, and what would be a best practice on how to store them in the schema registry when using them with Kafka.
TL;DR & Question: What’s the best practice for storing complex, nested types inside an Avro schema registry?
a) all subtypes as a separate subject (like demonstrated below)
b) a nested supertype as a single subject, containing all subtypes
c) something different altogether?
A little context: Our schema consists of a main type that has a few complex subtypes (with some of the subtypes themselves having subtypes). To keep things clean, we moved every complex type to its own *.avsc file. This leaves us with ~10 *.avsc files. All messages we produce have the main type, and subtypes are never sent separately.
For uploading/registering the schema, we use a Gradle plugin. In order for this to work, we need to fully specify every subtype as a separate subject and then define the references between them, like so (in build.gradle.kts):
schemaRegistry {
    url.set("https://$schemaRegistryPath")
    register {
        subject("SubSubType1", "$projectDir/src/main/avro/SubSubType1.avsc", "AVRO")
        subject("SubType1", "$projectDir/src/main/avro/SubType1.avsc", "AVRO")
            .addReference("SubSubType1", "SubSubType1", -1)
        subject("MyMainType", "$projectDir/src/main/avro/MyMainType.avsc", "AVRO")
            .addReference("SubType1", "SubSubType1", -1)
        // remaining config omitted for brevity
    }
}
This results in all subtypes being registered in the schema registry as a separate subject:
curl -X GET http://schema-registry:8085/subjects
["MyMainType","Subtype1","Subtype2","Subtype3","SubSubType1","SubSubType2"]%
This feels awkward; we only ever produce Kafka messages with a payload of MyMainType, so I only need that type in the registry, with all subtypes nested inside, like so:
curl -X GET http://schema-registry:8085/subjects
["MyMainType"]%
This doesn't appear to be possible with this particular Gradle plugin, and other plugins seem to handle it the same way. So apparently, when Avro subtypes are specified in separate files, the only way to register them is as separate subjects.
What should I do here? Register all subtypes, or merge all *.avsc into one big file?
Thanks for any pointers everybody!
Unfortunately, there doesn't seem to be a whole lot of information available on this topic, but this is what I found out regarding your options with complex Avro schemas:
for simple schemas with few complex types, use Avro Schemas (*.avsc)
for more complex schemas and loads of nesting, use Avro Interface Definitions (*.avdl) - these natively support imports
So it would probably be worthwhile to convert the definitions to *.avdl. If you want to keep your *.avsc-style definitions, there are Maven plugins available for merging them (see https://michalklempa.com/2020/04/composing-avro-schemas-from-subtypes/).
However, the impression that I get is that whenever things get complex, it would be preferable to use Avro IDL. This blog post supports this hypothesis.

What does code generation mean in avro - hadoop

Please forgive me if this question is silly; I am finding it difficult to grasp what this really means. When I read 'Hadoop: The Definitive Guide', it says that a key advantage of Avro is that code generation is optional. This link has a program for Avro serialization/deserialization with and without code generation. Could someone help me understand exactly what "with/without code generation" means and the real context in which each is used?
It's not a silly question -- it's actually a very important aspect of Avro.
"With code generation" usually means that you have an Avro schema available before compiling your Java application. You, as a developer, use an Avro compiler to generate a class for each record in the schema, and you use these classes in your application.
In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc, and then uses the student_marks class directly.
In this case, the generated student_marks class implements SpecificRecord and has custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).
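For illustration, reading with the generated class looks roughly like this (the file name students.avro is made up; the getter names depend on your schema):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.specific.SpecificDatumReader;

public class ReadWithCodegen {
    public static void main(String[] args) throws Exception {
        // student_marks is the class generated by "avro-tools compile schema student.avsc"
        DataFileReader<student_marks> reader = new DataFileReader<>(
                new File("students.avro"),
                new SpecificDatumReader<>(student_marks.class));
        while (reader.hasNext()) {
            student_marks record = reader.next();
            System.out.println(record.getStudentId());
        }
        reader.close();
    }
}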
"Without code generation" usually means that your application doesn't require any specific schema at compile time (for example, because it has to handle many different kinds of data).
In this case, there's no student_marks class generated, but you can still read Avro records from an Avro container file. You won't have instances of student_marks, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can use get("student_id") or get(0) to access the fields.
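The same read done generically looks roughly like this (again, the file name is made up):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithoutCodegen {
    public static void main(String[] args) throws Exception {
        // The reader picks the schema up from the container file itself
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("students.avro"), new GenericDatumReader<>());
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            System.out.println(record.get("student_id"));
        }
        reader.close();
    }
}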
Code that uses specific records (with code generation) is often easier to read and easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.
A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.

Yaml properties as a Map in Spring Boot

We have a Spring Boot project and are using application.yml files. This works exactly as described in the Spring Boot documentation: Spring Boot automatically looks in several locations for the files and obeys any environment overrides we use for the location of those files.
Now we want to also expose those YAML properties as a Map. According to the documentation this can be done with YamlMapFactoryBean. However, YamlMapFactoryBean wants me to specify which YAML files to use via the resources property. I want it to use the same YAML files and processing hierarchy that it used when creating the properties, so that I can still take advantage of "magical" features such as placeholder resolution in property values.
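For reference, the documented YamlMapFactoryBean usage looks roughly like this, which is exactly the limitation: the resources have to be listed explicitly instead of being resolved through Spring Boot's normal config-file search:

import java.util.Map;
import org.springframework.beans.factory.config.YamlMapFactoryBean;
import org.springframework.core.io.ClassPathResource;

public class YamlAsMap {
    static Map<String, Object> load() {
        YamlMapFactoryBean factory = new YamlMapFactoryBean();
        // hard-coded resource list, not Spring Boot's search path
        factory.setResources(new ClassPathResource("application.yml"));
        return factory.getObject();
    }
}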
I didn't see any documentation on if this was possible.
I was thinking of writing a MapFactoryBean that looked at the environment and simply reversed the "flattening" performed by the YamlProcessor when creating the properties representation of the file.
Any other ideas?
ConfigFileApplicationListener contains the logic for searching for files in various locations, and PropertySourcesLoader loads a file (Resource) into property sources. Neither is really designed for standalone use, but you could easily duplicate them if you want more control. The PropertySourcesLoader delegates to a collection of PropertySourceLoaders, so you could add one of the latter that delegates to your YamlMapFactoryBean.
A slightly awkward but workable solution would be to use the existing machinery to collect the YAML on startup. Add a new PropertySourceLoader to your META-INF/spring.factories and let it create new property sources, then post process the Environment to extract the source map(s).
Beware, though: creating a single Map from multiple YAML files, or even a single one with multiple documents (let alone multiple files with multiple documents), isn't as easy as you might think. You have a map-merge problem, and someone is going to have to define the algorithm. The flattening done in YamlPropertiesFactoryBean and the merge in YamlMapFactoryBean are just two choices out of (probably) a larger set of possibilities.
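For what it's worth, here is a rough, untested sketch of the "reverse the flattening" idea against the Environment. It deliberately ignores the merge questions above (a scalar and a nested key with the same prefix will collide, and indexed keys like servers[0] are not turned back into lists); the class and method names are made up:

import java.util.LinkedHashMap;
import java.util.Map;
import org.springframework.core.env.ConfigurableEnvironment;
import org.springframework.core.env.EnumerablePropertySource;
import org.springframework.core.env.PropertySource;

public class EnvironmentMapBuilder {

    @SuppressWarnings("unchecked")
    public static Map<String, Object> buildNestedMap(ConfigurableEnvironment env) {
        Map<String, Object> root = new LinkedHashMap<>();
        for (PropertySource<?> source : env.getPropertySources()) {
            // note: this walks every enumerable source, not just the YAML ones;
            // you would probably want to filter by source name in practice
            if (source instanceof EnumerablePropertySource) {
                for (String key : ((EnumerablePropertySource<?>) source).getPropertyNames()) {
                    // env.getProperty() resolves placeholders the same way @Value would
                    Object value = env.getProperty(key);
                    Map<String, Object> current = root;
                    String[] parts = key.split("\\.");
                    for (int i = 0; i < parts.length - 1; i++) {
                        current = (Map<String, Object>) current
                                .computeIfAbsent(parts[i], k -> new LinkedHashMap<String, Object>());
                    }
                    current.put(parts[parts.length - 1], value);
                }
            }
        }
        return root;
    }
}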

Create Value class for Sequence Files at runtime

I have some types of data that I have to upload on HDFS as Sequence Files.
Initially, I had thought of creating a .jr file at runtime, depending on the type of schema, and using Hadoop's rcc DDL tool to create these classes and use them.
But looking at the rcc documentation, I see that it has been deprecated, so I was trying to see what other options I have for creating these value classes per type of data.
This is a problem because I only learn the metadata of the data to be loaded at runtime, along with the data stream. So I have no choice but to create the value class at runtime, use it for writing (key, value) pairs to a SequenceFile.Writer, and finally save the result on HDFS.
Is there any solution for this problem?
You can try looking at other serialization frameworks, like Protocol Buffers, Thrift, or Avro. You might want to look at Avro first, since it doesn't require static code generation, which might be more suitable for you.
Or, if you want something really quick and dirty, each record in the SequenceFile can be a HashMap-like map whose keys are the field names and whose values are the field values.
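The Writable analogue of that HashMap idea is MapWritable; a minimal sketch of the quick-and-dirty option (paths and field names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DynamicSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/records.seq");   // example path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(MapWritable.class))) {

            MapWritable record = new MapWritable();
            // field names would come from the metadata known only at runtime
            record.put(new Text("name"), new Text("alice"));
            record.put(new Text("age"), new Text("42"));

            writer.append(new LongWritable(1L), record);
        }
    }
}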

Appropriate data structure for flat file processing?

Essentially, I have to get a flat file into a database. The flat files come in with the first two characters on each line indicating which type of record it is.
Do I create a class for each record type with properties matching the fields in the record? Should I just use arrays?
I want to load the data into some sort of data structure before saving it in the database so that I can use unit tests to verify that the data was loaded correctly.
Here's a sample of what I have to work with (BAI2 bank statements):
01,121000358,CLIENT,050312,0213,1,80,1,2/
02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/
03,600812345678,GBP,fab1,111319005,,V,050314,0000/
88,fab2,113781251,,V,050315,0000,fab3,113781251,,V,050316,0000/
88,fab4,113781251,,V,050317,0000,fab5,113781251,,V,050318,0000/
88,010,0,,,015,0,,,045,0,,,100,302982205,,,400,302982205,,/
16,169,57626223,V,050311,0000,102 0101857345,/
88,LLOYDS TSB BANK PL 779300 99129797
88,TRF/REF 6008ABS12300015439
88,102 0101857345 K BANK GIRO CREDIT
88,/IVD-11 MAR
49,1778372829,90/
98,1778372839,1,91/
99,1778372839,1,92
I'd recommend creating classes (or structs, or whatever value type your language supports), as
record.ClientReference
is so much more descriptive than
record[0]
and, if you're using the (wonderful!) FileHelpers Library, then your terms are pretty much dictated for you.
Validation logic usually has at least two levels: the coarser level is "is it well-formed?" and the finer level is "is the data correct?".
There are a few separate problems here. One issue is simply verifying the data, or writing tests to make sure that your parsing is accurate. A simple way to do this is to parse into a class that accepts a given range of values and throws an appropriate error when a value falls outside that range,
e.g.
public void setField1(int i)
{
    if (i > 100) throw new InvalidDataException("field1 out of range: " + i);
    this.field1 = i;
}
Creating different classes for each record type is something you might want to do if the parsing logic is significantly different for different codes, so you don't have conditional logic like
public void setField2(String s)
{
    if (field1 == 88 && s.equals("...")) { /* ... */ }
    else if (field1 == 22 && s.startsWith("...")) { /* ... */ }
}
yechh.
When I have had to load this kind of data in the past, I have put it all into a work table with the first two characters in one field and the rest in another. Then I have parsed it out to the appropriate other work tables based on the first two characters. Then I have done any cleanup and validation before inserting the data from the second set of work tables into the database.
In SQL Server you can do this through a DTS (2000) or SSIS package. Using SSIS, you may be able to process the data on the fly without storing it in work tables first, but the process is similar: use the first two characters to determine the data-flow branch to use, then parse the rest of the record into some type of holding mechanism, and then clean up and validate before inserting. I'm sure other databases also have some type of mechanism for importing data and would use a similar process.
I agree that if your data format has any sort of complexity, you should create a set of custom classes to parse and hold the data, perform validation, and do any other appropriate model tasks (for instance, returning a human-readable description, although some would argue this would be better placed in a separate view class). This is probably a good situation for inheritance: a parent class (possibly abstract) defines the properties and methods common to all record types, and each child class can override them to provide its own parsing and validation, or add its own properties and methods.
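A rough sketch of that inheritance idea, using record codes from the sample above (everything else is illustrative only):

public abstract class BaiRecord {
    protected final String[] fields;

    protected BaiRecord(String line) {
        this.fields = line.split(",");
    }

    public String recordCode() {
        return fields[0];
    }

    // each record type knows how to check its own fields
    protected abstract void validate();

    public static BaiRecord parse(String line) {
        BaiRecord record;
        switch (line.substring(0, 2)) {
            case "01": record = new FileHeaderRecord(line); break;
            case "16": record = new TransactionDetailRecord(line); break;
            // ... "02", "03", "88", "49", "98", "99" ...
            default: throw new IllegalArgumentException("Unknown record code: " + line.substring(0, 2));
        }
        record.validate();
        return record;
    }
}

class FileHeaderRecord extends BaiRecord {
    FileHeaderRecord(String line) { super(line); }

    public String senderId() { return fields[1]; }   // e.g. 121000358

    @Override
    protected void validate() {
        if (fields.length < 9) throw new IllegalArgumentException("File header too short");
    }
}

class TransactionDetailRecord extends BaiRecord {
    TransactionDetailRecord(String line) { super(line); }

    public String typeCode() { return fields[1]; }   // e.g. 169
    public String amount()   { return fields[2]; }   // e.g. 57626223

    @Override
    protected void validate() {
        if (fields.length < 4) throw new IllegalArgumentException("Transaction detail too short");
    }
}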
Creating a class for each type of row would be a better solution than using arrays.
That said, in the past I have used ArrayLists of Hashtables to accomplish the same thing. Each item in the ArrayList is a row, and each entry in the Hashtable is a key/value pair representing the column name and cell value.
Why not start by designing the database that will hold the data? Then you can use the Entity Framework to generate the classes for you.
Here's a wacky idea:
If you were working in Perl, you could use DBD::CSV to read data from your flat file, provided you gave it the correct values for the separator and EOL characters. You'd then read rows from the flat file by means of SQL statements; DBI will turn them into standard Perl data structures for you, and you can run whatever validation logic you like. Once each row passes all the validation tests, you'd be able to write it into the destination database using DBD::whatever.
-steve
