Hadoop - How to switch from implementing the Writable interface to using an Avro object? - hadoop

I’m using Hadoop to convert JSONs into CSV files to access them with Hive.
At the moment the Mapper fills a custom data structure while parsing the JSONs with JSON-Smart. The Reducer then reads that object out and writes it to a file, separated by commas.
To make this faster I have already implemented the Writable interface in the data structure...
Now I want to use Avro for the data structure object to get more flexibility and performance. How could I change my classes so that they exchange an Avro object instead of a Writable?

Hadoop offers a pluggable serialization mechanism via the SerializationFactory.
By default, Hadoop uses the WritableSerialization class to handle the deserialization of classes which implement the Writable interface, but you can register custom serializers that implement the Serialization interface by setting the Hadoop configuration property io.serializations (a CSV list of classes that implement the Serialization interface).
Avro has an implementation of the Serialization interface in the AvroSerialization class - so this would be the class you configure in the io.serializations property.
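As a rough sketch of that registration (assuming the class name org.apache.avro.hadoop.io.AvroSerialization; older Avro releases ship an org.apache.avro.mapred variant instead):

import org.apache.hadoop.conf.Configuration;

public class AvroSerializationSetup {
  public static Configuration withAvroSerialization(Configuration conf) {
    // io.serializations is a CSV list of Serialization implementations;
    // append Avro's implementation to the default list.
    conf.set("io.serializations",
        conf.get("io.serializations") + ","
            + "org.apache.avro.hadoop.io.AvroSerialization");
    return conf;
  }
}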
Avro actually has a whole bunch of helper classes which help you write Map/Reduce jobs that use Avro as input/output - there are some examples in the source (Git copy).
I can't seem to find any good documentation for Avro & Map Reduce at the moment, but I'm sure there are some other good examples out there.
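For instance, a minimal job-setup sketch with those helpers, using the newer org.apache.avro.mapreduce API (the job name and schema variable here are just placeholders, not anything from your code):

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
  public static Job configure(Configuration conf, Schema recordSchema) throws Exception {
    Job job = Job.getInstance(conf, "json-to-avro");
    // Declare the schemas of the Avro records flowing between mapper and
    // reducer, and of the records the reducer writes out.
    AvroJob.setMapOutputValueSchema(job, recordSchema);
    AvroJob.setOutputKeySchema(job, recordSchema);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    return job;
  }
}

The mapper and reducer then exchange the records wrapped in AvroKey / AvroValue instead of your Writable data structure.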

Related

jackson - root element read Tree vs pojo

Hi, I want to parse a JSON that I retrieve by hitting a legacy system and build a response JSON. We are using Spring Boot with a Jackson dependency. The problem I have is that almost 75% of the fields from the legacy system can be mapped directly or on the basis of simple rules (0: false, 1: true). But there are some complex rules as well, e.g. based on certain conditions and the data present in some fields we may need to map them to a nested object. To cater to this requirement, which approach should we consider?
POJO approach: fetch the data from the legacy target, use BeanUtils.copyProperties to populate the response bean (75% of the properties), and then apply the business transformations on this POJO based on the business logic. (Would we need two POJOs here: a. one to copy into via BeanUtils.copyProperties and b. one for the final response DTO?)
Do not use a POJO: directly parse the JSON, apply the transformations and then create a new POJO or response DTO. (But this may not be a generic solution and would need to be done on a case-by-case basis.)
The main considerations are that the approach should be fast and generic enough to be applied like a framework. Thanks, aakash
The considerations should be as follows:
- Are the POJOs reusable?
- Is the JSON multilevel and very large?
If the answer is yes for both, then it is better to choose POJOs for a cleaner implementation; otherwise use JsonObject parsing.
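As a rough, self-contained sketch of both approaches with Jackson (the LegacyRecord class and its fields are made-up placeholders, not your actual legacy model):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TreeVsPojoDemo {
  // Hypothetical minimal POJO for the fields that map directly
  public static class LegacyRecord {
    public String name;
    public int status;
  }

  public static void main(String[] args) throws Exception {
    String json = "{\"name\":\"aakash\",\"status\":1}";
    ObjectMapper mapper = new ObjectMapper();

    // POJO approach: bind directly, then apply business rules on the object
    LegacyRecord legacy = mapper.readValue(json, LegacyRecord.class);
    boolean activeFromPojo = legacy.status == 1;   // rule: 1 -> true, 0 -> false

    // Tree approach: walk the JsonNode and build the response case by case
    JsonNode root = mapper.readTree(json);
    boolean activeFromTree = root.path("status").asInt() == 1;

    System.out.println(activeFromPojo + " " + activeFromTree);
  }
}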
Hope this will help.

Where to find details for the API of Context in hadoop?

I have coded some routine Hadoop MapReduce jobs and call the context.write() method based only on examples from the Apache Hadoop source code. But copying like that doesn't help me understand the Hadoop API any more deeply.
Therefore, I recently started to read the Hadoop API documentation (https://hadoop.apache.org/docs/r2.7.0/api/) more carefully and tried to figure out whether there are any other methods in Context besides context.write(). For instance, in the teragen example, context.getCounter() is used.
But to my surprise, I couldn't find the Context class documentation at all from the link above.
Where can I find the documentation for the Context class in Hadoop?
You can start to work out what's going on if you dig into the standard Mapper class source (around line 106).
public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
So this is just an abstract class which implements the MapContext interface found here (Javadoc link).
The concrete implementation is MapContextImpl found here.
It looks like the ContextFactory (source) is responsible for creating the different implementations of Context.
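So the methods you can call on Context are the ones declared on MapContext and its parent interfaces, which is where context.write() and context.getCounter() come from. For instance, a toy mapper (hypothetical, not one of the Hadoop examples) that uses both:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the line as the key with its length as the value
    context.write(value, new LongWritable(value.getLength()));
    // Increment a custom counter, as teragen does with context.getCounter()
    context.getCounter("demo", "lines").increment(1);
  }
}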

How to read sqoop options in PutTransformer class

We have implemented a PutTransformer class for our use case. We want to use a bucketing concept in the row key ID, so we want to pass in the number of buckets from outside, configured in a config file. Is it possible to read Sqoop options in the PutTransformer class?
I am passing my customized PutTransformer class using the option "sqoop.hbase.insert.put.transformer.class".
Any idea on this?
The PutTransformer class currently does not expose Hadoop's configuration object. I would suggest filing a new feature request on the Sqoop JIRA to add such a capability!

Mapping validation property paths using bean mappers like ModelMapper, Orika or Dozer

Given that a bean mapper is used to map a business model into a DTO model and vice versa:
Which ones provide easy ways to convert property paths?
Let's say my business logic reports that "person.street.number" is mandatory, and now I want to tell the client this in the correct way, i.e. as "personDto.streetnumber".
I'm pretty sure every bean mapper has this knowledge, but which ones provide easy ways to access it?
Based on this excellent list of bean mappers and further reading on their websites, I found ModelMapper and Orika to look pretty attractive.
ModelMapper seems to offer good ways to access this information. Depending on the configuration, most specifically the MatchingStrategy, ModelMapper will automatically convert values from source paths to destination paths so long as they match.
To access this information one can use ModelMapper.getTypeMap and TypeMap.getMappings.
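A minimal sketch of that introspection (the Person / PersonDto classes are made up, and field matching is enabled so their public fields are picked up):

import org.modelmapper.ModelMapper;
import org.modelmapper.TypeMap;
import org.modelmapper.config.Configuration.AccessLevel;
import org.modelmapper.spi.Mapping;

public class MappingIntrospection {
  public static class Street { public int number; }
  public static class Person { public Street street; }
  public static class PersonDto { public int streetNumber; }

  public static void main(String[] args) {
    ModelMapper modelMapper = new ModelMapper();
    // Let ModelMapper match the public fields of these toy classes
    modelMapper.getConfiguration()
        .setFieldMatchingEnabled(true)
        .setFieldAccessLevel(AccessLevel.PUBLIC);
    TypeMap<Person, PersonDto> typeMap =
        modelMapper.createTypeMap(Person.class, PersonDto.class);
    // Each Mapping describes one resolved source path -> destination path pair
    for (Mapping mapping : typeMap.getMappings()) {
      System.out.println(mapping);
    }
  }
}

Whether a nested path like person.street.number is resolved automatically depends on the configured MatchingStrategy; printing the mappings shows exactly which property paths ModelMapper has matched.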

Hadoop: How to save Map object in configuration

Any idea how I can set a Map object into org.apache.hadoop.conf.Configuration?
Serialize your map into JSON and then put it as a string in your configuration.
There is no way to put a whole object into it, because the whole configuration will be written as an XML file.
GSON is quite good at it: http://code.google.com/p/google-gson/
Here is the tutorial about how to serialize collections: http://sites.google.com/site/gson/gson-user-guide#TOC-Collections-Examples
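A small sketch of that round trip with Gson (the configuration key my.job.bucket.map is just a made-up name):

import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import org.apache.hadoop.conf.Configuration;

public class MapInConfiguration {
  public static void main(String[] args) {
    Map<String, Integer> buckets = new HashMap<>();
    buckets.put("a", 1);
    buckets.put("b", 2);

    Gson gson = new Gson();
    Configuration conf = new Configuration();
    // Store the map as a JSON string under a custom key
    conf.set("my.job.bucket.map", gson.toJson(buckets));

    // Later, e.g. inside a Mapper's setup(), read it back
    Type type = new TypeToken<Map<String, Integer>>() {}.getType();
    Map<String, Integer> restored = gson.fromJson(conf.get("my.job.bucket.map"), type);
    System.out.println(restored);
  }
}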
