hadoop CustomWritables - hadoop

I have more of a design question regarding the necessity of a CustomWritable for my use case:
So I have a document pair that I will process through a pipeline and write out intermediate and final data to HDFS. My key will be something like ObjectId - DocId - Pair - Lang. I do not see why/if I will need a CustomWritable for this use case. I guess if I did not have a key, I would need a CustomWritable? Also, when I write data out to HDFS in the Reducer, I use a Custom Partitioner. So, that would kind of eliminate my need for a Custom Writable?
I am not sure if I got the concept of the need for a Custom Writable right. Can someone point me in the right direction?

Writables can be used for de/serializing objects. For example a log entry can contain a timestamp, an user IP and the browser agent. So you should implement your own WritableComparable for a key that identifies this entry and you should implement a value class that implements Writable that reads and writes the attributes in your log entry.
These serializations are just a handy way to get the data from a binary format to an object. Some Frameworks like HBase still require byte arrays to persist the data. So you'll have a lot of overhead in transfering this by yourself and messes up your code.

Thomas' answer explains a bit. Its way too late but I'd like to add the following for prospective readers:
Partitioner only comes into play between the map and reduce phase and has no role to play in writing from reducer to output files.
I don't believe writing INTERMEDIATE data to hdfs is a requirement in most cases, although there are some hacks that can be applied to do the same.
When you write from a reducer to hdfs, the keys will automatically be sorted and each reducer will write to ONE SEPARATE file. Based on their compareTo method, keys are sorted. So if you want to sort based on multiple variables, go for a Custom key class that extends WritableComparable, and implement the write, readFields and compareTo methods. You can now control the way the keys are sorted, based on the compareTo implementation

Related

Spring data Mongodb - Encrypting a single field using convertor

I have a collection which has several array of objects. In one of the sub-objects there is a field called secret which has to be stored in encrypted format, the field is of type String.
What is the best way of achieving?
I don't think writing a custom writer for the entire document is feasible.
How to write a String convertor that will be only applied for this single field?
There are many answers to this questions and different approaches that depend on your actual requirements.
The first question that you want to ask is, whether MongoDB is a good place to store encrypted values at all or whether there is a better option that gives you features like rewrapping (re-encrypt), key rotation, audit logging, access control, key management, …
Another thing that comes into play is decryption: Every time you retrieve data from MongoDB, the secret is decrypted. Also, encrypting a lot of entries with the same key allows facilitates cryptanalysis so you need to ensure regular key rotation. Least, but not last, you're in charge of storing the crypto keys securely and making sure it's hard to get hold of these.
Having a dedicated data type makes it very convenient writing a with a signature of e.g. Converter<Secret, String> or Converter<Secret, Binary> as you get full control over serialization.
Alternatively, have a look at https://github.com/bolcom/spring-data-mongodb-encrypt or external crypto tools like HashiCorp Vault.

How to pass different set of data to two different mappers of the same job

I have one Single Mapper , say SingleGroupIdentifierMapper.java
Now this is a generic mapper which does all the filtration on a single line of mapper-input value/record based on property file (containing filters and key-value field indexes) passed to it from the driver class using cache.
Only the reducer business logic is is different and has been implemented keeping the mapper logic generic and to be implemented using the PropertyFile as mentioned above.
Now my problem statement is I have input from multiple Sources now, having different formats. That means I have to do some thing like
MultipleInputs.addInputPath(conf, new Path("/inputA"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, SingleGroupIdentifierMapper.class);
But the cached property file which I pass from the driver class to the mapper for implementing filter based on field indexes is common, So how can I pass two different property file to the same mapper, where if it processes, say Input A, then it will use PropertyFileA (to filter and create key value pair) and if it processes, say Input B then it will use PropertyFileB (to filter and create key value pair).
It is possible to change the Generic Code of the Mapper to take care of this scenario BUT how to approach this problem in the Generic Class and how to identify in the same Mapper Class if the input is from inputA/inputB and accordingly apply the propertyFile Configuration on the data.
Can we pass arguments to the constructor of this mapper class to specify it is from inputB or it needs to read which property file in cache.?
Eg Something like :
MultipleInputs.addInputPath(conf, new Path("/inputB"),TextInputFormat.class, args[], SingleGroupIdentifierMapper.class);
where args[] is passed to the SingleGroupIdentifierMapper class's constructor which we define to take as input and set it as a attribure.
Any thoughts or expertise is most welcomed.
Hope I was able to express my problem clearly, kindly ask me in case there needs to be more clarity in the question.
Thanks in Advance,
Cheers :)
Unfortunately MultipleInputs is not that flexible. But there is a workaround which matches InputSplit paths to the property files in the setup method of the Mapper. If you are not using any sort of Combine*Format, than a single mapper will process a single split from a single file:
When adding prop files into cache use /propfile_1#PROPS_A and /propfile_2#PROPS_B
Add input path into job.getConfiguration().set("PROPS_A", "/inputA") and job.getConfiguration().set("PROPS_B", "/inputB")
In the Mapper.setup(Context context) method, use context.getInputSplit().toString() to get the path of the split. Than match it to the paths saved in the context.getConfiguration().get("PROPS_A") or PROPS_B
If you are using some Combine*Format, than you would need to extend it, override getSplits that use information from the JobContext to build the PathFilter[] and call createPool, which will create splits that contain files from the same group (inputA or inputB).

Issue with setting multiple projectionSchemas for AvroParquetInputFormat

I use AvroParquetInputFormat. The usecase requires scanning of multiple input directories and each directory will have files with one schema. Since AvroParquetInputFormat class could not handle multiple input schemas, I created a workaround by statically creating multiple dummy classes like MyAvroParquetInputFormat1, MyAvroParquetInputFormat2 etc where each class just inherits from AvroParquetInputFormat. And for each directory, I set a different MyAvroParquetInputFormat and that worked (please let me know if there is a cleaner way to achieve this).
My current problem is as follows:
Each file has a few hundred columns and based on meta-data I construct a projectionSchema for each directory, to reduce unnecessary disk & network IO. I use the static setRequestedProjection() method on each of my MyAvroParquetInputFormat classes. But, being static, the last call’s projectionSchema is used for reading data from all directories, which is not the required behavior.
Any pointers to workarounds/solutions would is highly appreciated.
Thanks & Regards
MK
Keep in mind that if your avro schemas are compatible (see avro doc for definition of schema compatibility) you can access all the data with a single schema. Extending on this, it is also possible to construct a parquet friendly schema (no unions) that is compatible with all your schemas so you can use just that one.
As for the approach you took, there is no easy way of doing this that I know of. You have to extend MultipleInputs functionality somehow to assign a different schema for each of your input formats. MultipleInputs works by setting two configuration properties in your job configuration:
mapreduce.input.multipleinputs.dir.formats //contains a comma separated list of InputFormat classes
mapreduce.input.multipleinputs.dir.mappers //contains a comma separated list of Mapper classes.
These two lists must be the same length. And this is where it gets tricky. This information is used deep within hadoop code to initialize mappers and input formats, so that's where you should add your own code.
As an alternative, I would suggest that you do the projection using one of the tools already available, such as hive. If there are not too many different schemas, you can write a set of simple hive queries to do the projection for each of the schemas, and after that you can use a single mapper to process the data or whatever the hell you want.

Create Value class for Sequence Files at runtime

I have some types of data that I have to upload on HDFS as Sequence Files.
Initially, I had thought of creating a .jr file at runtime depending on the type of schema and use rcc DDL tool by Hadoop to create these classes and use them.
But looking at rcc documentation, I see that it has been deprecated. I was trying to see what other options I have to create these value classes per type of data.
This is a problem as I get to know the metadata of the data to be loaded at runtime along with the data-stream. So, I have, no choice, but to create Value class at runtime and then use it for writing (key, vale) to SequenceFile.Writer and finally saving it on HDFS.
Is there any solution for this problem?
You can try looking other serialization frameworks, like Protocol Buffers, Thrift, or Avro. You might want to look at Avro first, since it doesn't require static code generation, which might be more suitable for you.
Or if you want something really quick and dirty, each record in the SequenceFile can be a HashMap where the key/values are the name of the field and the value.

Appropriate data structure for flat file processing?

Essentially, I have to get a flat file into a database. The flat files come in with the first two characters on each line indicating which type of record it is.
Do I create a class for each record type with properties matching the fields in the record? Should I just use arrays?
I want to load the data into some sort of data structure before saving it in the database so that I can use unit tests to verify that the data was loaded correctly.
Here's a sample of what I have to work with (BAI2 bank statements):
01,121000358,CLIENT,050312,0213,1,80,1,2/
02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/
03,600812345678,GBP,fab1,111319005,,V,050314,0000/
88,fab2,113781251,,V,050315,0000,fab3,113781251,,V,050316,0000/
88,fab4,113781251,,V,050317,0000,fab5,113781251,,V,050318,0000/
88,010,0,,,015,0,,,045,0,,,100,302982205,,,400,302982205,,/
16,169,57626223,V,050311,0000,102 0101857345,/
88,LLOYDS TSB BANK PL 779300 99129797
88,TRF/REF 6008ABS12300015439
88,102 0101857345 K BANK GIRO CREDIT
88,/IVD-11 MAR
49,1778372829,90/
98,1778372839,1,91/
99,1778372839,1,92
I'd recommend creating classes (or structs, or what-ever value type your language supports), as
record.ClientReference
is so much more descriptive than
record[0]
and, if you're using the (wonderful!) FileHelpers Library, then your terms are pretty much dictated for you.
Validation logic usually has at least 2 levels, the grosser level being "well-formatted" and the finer level being "correct data".
There are a few separate problems here. One issue is that of simply verifying the data, or writing tests to make sure that your parsing is accurate. A simple way to do this is to parse into a class that accepts a given range of values, and throws the appropriate error if not,
e.g.
public void setField1(int i)
{
if (i>100) throw new InvalidDataException...
}
Creating different classes for each record type is something you might want to do if the parsing logic is significantly different for different codes, so you don't have conditional logic like
public void setField2(String s)
{
if (field1==88 && s.equals ...
else if (field2==22 && s
}
yechh.
When I have had to load this kind of data in the past, I have put it all into a work table with the first two characters in one field and the rest in another. Then I have parsed it out to the appropriate other work tables based on the first two characters. Then I have done any cleanup and validation before inserting the data from the second set of work tables into the database.
In SQL Server you can do this through a DTS (2000) or an SSIS package and using SSIS , you may be able to process the data onthe fly with storing in work tables first, but the prcess is smilar, use the first two characters to determine the data flow branch to use, then parse the rest of the record into some type of holding mechanism and then clean up and validate before inserting. I'm sure other databases also have some type of mechanism for importing data and would use a simliar process.
I agree that if your data format has any sort of complexity you should create a set of custom classes to parse and hold the data, perform validation, and do any other appropriate model tasks (for instance, return a human readable description, although some would argue this would be better to put into a separate view class). This would probably be a good situation to use inheritance, where you have a parent class (possibly abstract) define the properties and methods common to all types of records, and each child class can override these methods to provide their own parsing and validation if necessary, or add their own properties and methods.
Creating a class for each type of row would be a better solution than using Arrays.
That said, however, in the past I have used Arraylists of Hashtables to accomplish the same thing. Each item in the arraylist is a row, and each entry in the hashtable is a key/value pair representing column name and cell value.
Why not start by designing the database that will hold the data then you can use the entity framwork to generate the classes for you.
here's a wacky idea:
if you were working in Perl, you could use DBD::CSV to read data from your flat file, provided you gave it the correct values for separator and EOL characters. you'd then read rows from the flat file by means of SQL statements; DBI will make them into standard Perl data structures for you, and you can run whatever validation logic you like. once each row passes all the validation tests, you'd be able to write it into the destination database using DBD::whatever.
-steve

Resources