Multiple Custom Writable formats - hadoop

I have multiple input sources and I have used Sqoop's codegen tool to generate a custom class for each input source:
public class SQOOP_REC1 extends SqoopRecord implements DBWritable, Writable
public class SQOOP_REC2 extends SqoopRecord implements DBWritable, Writable
On the Map side, based on the input source, I create objects of the above 2 classes accordingly.
I have the key as type "Text" and since I have 2 different types of values, I kept the value output type as "Writable".
On the reduce side, I accept the value type as Writable.
public class SkeletonReduce extends Reducer<Text, Writable, Text, Text> {
    public void reduce(Text key, Iterable<Writable> values, Context context) throws IOException, InterruptedException {
    }
}
I also set
job.setMapOutputValueClass(Writable.class);
During execution, it does not enter the reduce function at all.
Could someone tell me if it is possible to do this? If so, what am I doing wrong?

You can't specify Writable as your output type; it has to be a concrete type. All records need to have the same (concrete) key and value types, in Mappers and Reducers. If you need different types you can create some kind of hybrid Writable that contains either an "A" or "B" inside. It's a little ugly but works and is done a lot in Mahout for example.
But I don't know why any of this would make the reducer not run; this is likely something quite separate and not answerable based on this info.

Look into extending GenericWritable for your value type. You need to define the set of classes that are allowed (SQOOP_REC1 and SQOOP_REC2 in your case). It is slightly less efficient because it creates new object instances in its readFields method, but if you have a small set of classes you can override this: keep instance variables of both types plus a flag that denotes which one is valid.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/io/GenericWritable.html
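For illustration, here is a minimal sketch of such a wrapper, assuming the two generated Sqoop classes from the question (the wrapper class name is made up):
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

// Wraps either SQOOP_REC1 or SQOOP_REC2 so both can travel as a single map output value type.
public class SqoopRecordWrapper extends GenericWritable {

    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES = new Class[] {
            SQOOP_REC1.class,
            SQOOP_REC2.class
    };

    public SqoopRecordWrapper() {
        // No-arg constructor required by Hadoop's reflection-based instantiation
    }

    public SqoopRecordWrapper(Writable instance) {
        set(instance);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}
The mapper would then emit context.write(key, new SqoopRecordWrapper(record)), the job would use job.setMapOutputValueClass(SqoopRecordWrapper.class), and the reducer would call get() on each wrapper and check the result with instanceof.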

Ok, I think I figured out how to do this, based on a suggestion given by Doug Cutting himself:
http://grokbase.com/t/hadoop/common-user/083gzhd6zd/multiple-output-value-classes
I wrapped the class using ObjectWritable:
ObjectWritable obj = new ObjectWritable(SQOOP_REC2.class, sqoop_rec2);
And then on the reduce side, I can get the name of the wrapped class and cast it back to the original class:
if (val.getDeclaredClass().getName().equals("SQOOP_REC2")) {
    SQOOP_REC2 temp = (SQOOP_REC2) val.get();
}
And don't forget
job.setMapOutputValueClass(ObjectWritable.class);
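For completeness, a rough sketch of how the reduce side could look with this approach (class names follow the question; note that getDeclaredClass().getName() returns the fully qualified class name, so an instanceof check on the unwrapped object is usually safer than comparing name strings):
import java.io.IOException;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReduce extends Reducer<Text, ObjectWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<ObjectWritable> values, Context context)
            throws IOException, InterruptedException {
        for (ObjectWritable val : values) {
            Object inner = val.get();
            if (inner instanceof SQOOP_REC1) {
                SQOOP_REC1 rec1 = (SQOOP_REC1) inner;
                // handle records from the first source ...
            } else if (inner instanceof SQOOP_REC2) {
                SQOOP_REC2 rec2 = (SQOOP_REC2) inner;
                // handle records from the second source ...
            }
        }
    }
}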

Related

What's the usage of org.springframework.data.repository.query.parser.Part?

As the title says, I'd appreciate it if somebody could explain the usage of this class.
There's an inner enum Type; how is it used?
public static enum Type {
BETWEEN(2, "IsBetween", "Between"), IS_NOT_NULL(0, "IsNotNull", "NotNull"), IS_NULL(0, "IsNull", "Null"), LESS_THAN(
"IsLessThan", "LessThan"), LESS_THAN_EQUAL("IsLessThanEqual", "LessThanEqual"), GREATER_THAN("IsGreaterThan",
"GreaterThan"), GREATER_THAN_EQUAL("IsGreaterThanEqual", "GreaterThanEqual"), BEFORE("IsBefore", "Before"), AFTER(
"IsAfter", "After"), NOT_LIKE("IsNotLike", "NotLike"), LIKE("IsLike", "Like"), STARTING_WITH("IsStartingWith",
"StartingWith", "StartsWith"), ENDING_WITH("IsEndingWith", "EndingWith", "EndsWith"), NOT_CONTAINING(
"IsNotContaining", "NotContaining", "NotContains"), CONTAINING("IsContaining", "Containing", "Contains"), NOT_IN(
"IsNotIn", "NotIn"), IN("IsIn", "In"), NEAR("IsNear", "Near"), WITHIN("IsWithin", "Within"), REGEX(
"MatchesRegex", "Matches", "Regex"), EXISTS(0, "Exists"), TRUE(0, "IsTrue", "True"), FALSE(0, "IsFalse",
"False"), NEGATING_SIMPLE_PROPERTY("IsNot", "Not"), SIMPLE_PROPERTY("Is", "Equals");
// Need to list them again explicitly as the order is important
// (esp. for IS_NULL, IS_NOT_NULL)
private static final List<Part.Type> ALL = Arrays.asList(IS_NOT_NULL, IS_NULL, BETWEEN, LESS_THAN, LESS_THAN_EQUAL,
GREATER_THAN, GREATER_THAN_EQUAL, BEFORE, AFTER, NOT_LIKE, LIKE, STARTING_WITH, ENDING_WITH, NOT_CONTAINING,
CONTAINING, NOT_IN, IN, NEAR, WITHIN, REGEX, EXISTS, TRUE, FALSE, NEGATING_SIMPLE_PROPERTY, SIMPLE_PROPERTY);
...}
Part is internal to Spring Data. It is not intended to be used by client code. So unless you implement your own Spring Data module, you shouldn't use it at all, nor anything inside it.
A Part is basically an element of an AST that will probably result in an element of a where clause or equivalent depending on the store in use.
E.g. if you have a method findByNameAndDobBetween(String, Date, Date) parsing the method name will result in two parts. One for the name condition and one for the DOB between condition.
The type enum lists all the different types of conditions that are possible.
The parameters of the elements are the number of method arguments required and (possibly multiple) Strings that identify this type inside a method name.
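For example, a derived query method in a repository is parsed into such parts; the entity and method names below are made up purely for illustration:
import java.util.Date;
import java.util.List;
import org.springframework.data.repository.CrudRepository;

public interface PersonRepository extends CrudRepository<Person, Long> {

    // Parsed into two Parts: SIMPLE_PROPERTY ("Is"/"Equals", implicit here) for "name"
    // and BETWEEN for "dob"; BETWEEN(2, ...) consumes two of the method arguments.
    List<Person> findByNameAndDobBetween(String name, Date from, Date to);

    // Parsed into a single Part of type IS_NULL, which consumes no arguments (hence the 0).
    List<Person> findByNicknameIsNull();
}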

Creating composite key class for Secondary Sort

I am trying to create a composite key class consisting of a String uniqueCarrier and an int month for secondary sort. Can anyone tell me what the steps are?
Looks like you have an equality problem since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in your compareTo and equals methods (also define an equals method). From the java.lang.Comparable documentation:
The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
You can also implement a RawComparator so that you can compare them without deserializing for some faster performance.
However, I recommend (as I always do) to not write things like Secondary Sort yourself. These have been implemented (as well as dozens of other optimizations) in projects like Pig and Hive. E.g. if you were using Hive, all you need to write is:
SELECT ...
FROM my_table
ORDER BY month, carrier;
The above is a lot simpler to write than trying to figure out how to write Secondary Sorts (and eventually when you need to use it again, how to do it in a generic fashion). MapReduce should be considered a low level programming paradigm and should only be used (IMHO) when you need high performance optimizations that you don't get from higher level projects like Pig or Hive.
EDIT: Forgot to mention about Grouping comparators, see Matt's answer
Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break equality:
@Override
public int compareTo(CompositeKey other) {
    if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
        return this.getMonth().compareTo(other.getMonth());
    } else {
        return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
    }
}
One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWritable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also use their compareTo. Less code to write is always good...
Speaking of less code, you don't have to call the parent constructor for your composite key.
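For illustration, a minimal sketch of what such a composite key could look like with Writable attributes (names are assumptions based on this answer; hashCode() and the comparators are covered below):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {

    private final Text uniqueCarrier = new Text();
    private final IntWritable month = new IntWritable();

    public CompositeKey() {
        // No-arg constructor required by Hadoop for deserialization
    }

    public CompositeKey(String uniqueCarrier, int month) {
        this.uniqueCarrier.set(uniqueCarrier);
        this.month.set(month);
    }

    public Text getUniqueCarrier() {
        return uniqueCarrier;
    }

    public IntWritable getMonth() {
        return month;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        uniqueCarrier.write(out);
        month.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        uniqueCarrier.readFields(in);
        month.readFields(in);
    }

    @Override
    public int compareTo(CompositeKey other) {
        // Carrier first, month to break ties
        int cmp = uniqueCarrier.compareTo(other.uniqueCarrier);
        return cmp != 0 ? cmp : month.compareTo(other.month);
    }
}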
Now for what is left to be done:
My guess is you are still missing a hashCode() method, which should only return the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitioner to distribute work across reducers.
I would also write custom GroupingComparator and SortingComparator to make sure grouping happens only on uniqueCarrier, and that sorting behaves according to CompositeKey compareTo():
public class CompositeGroupingComparator extends WritableComparator {

    public CompositeGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;
        return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
    }
}
public class CompositeSortingComparator extends WritableComparator {

    public CompositeSortingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;
        return first.compareTo(second);
    }
}
Then, tell your Driver to use those two:
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
Edit: Also see Pradeep's suggestion of implementing RawComparator to prevent having to unmarshall to an Object each time, if you want to optimize further.

How to sort values fed to a reducer in a specific order

In my map-reduce job, the mapper's output type is <Text, FileAlias> and the class FileAlias has two attributes as follows
public class FileAlias extends Configured implements WritableComparable<FileAlias> {
    public boolean isAlias;
    public String value;
    ...
}
For each output key (of type Text), only one of the output values (of type FileAlias) has attribute isAlias set as true. I would like this output value to be the first item in the OutputCollector fed to the reducer. Is there any way to do that?
Take a look at the setGroupingComparatorClass method on the Job object. You should be able to implement a comparator that makes FileAlias with isAlias == true first in the Iterable that is passed to the reduce task.
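A rough sketch of that idea follows. Note that a grouping comparator only sees keys, so in practice the isAlias flag (or the whole FileAlias) has to be promoted into the map output key first, as in the secondary-sort pattern from the previous question; the composite key class TextFileAliasKey and its accessors below are hypothetical:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sort comparator: within one Text key, the record whose isAlias flag is true comes first.
// A matching grouping comparator would compare only the Text part of the key.
public class AliasFirstSortComparator extends WritableComparator {

    protected AliasFirstSortComparator() {
        super(TextFileAliasKey.class, true); // TextFileAliasKey is hypothetical
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TextFileAliasKey first = (TextFileAliasKey) a;
        TextFileAliasKey second = (TextFileAliasKey) b;
        int cmp = first.getTextKey().compareTo(second.getTextKey());
        if (cmp != 0) {
            return cmp;
        }
        // true sorts before false, so the alias record is the first value the reducer sees
        return Boolean.compare(second.getFileAlias().isAlias, first.getFileAlias().isAlias);
    }
}
The driver would then register it with job.setSortComparatorClass(AliasFirstSortComparator.class).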

Class: Immutability vs Not Extensible

I was reading in SO threads, and also in an article, that there are many reasons for making a class final.
Two of them were:
1. to remove extensibility
2. to make the class immutable.
Does making a class immutable also imply that it (and its methods) are final? I don't see the difference between the two.
An immutable object does not allow its state to be changed. A final class does not allow itself to be subclassed. For example, class Foo (see below) is immutable (its state, i.e. _name, is never changed) and class Bar is mutable (the rename method allows the state to change):
final class Foo
{
    private String _name;

    public Foo(String name)
    {
        _name = name;
    }

    public String getName()
    {
        return _name;
    }
}

final class Bar
{
    private String _name;

    public Bar(String name)
    {
        _name = name;
    }

    public String getName()
    {
        return _name;
    }

    public void rename(String newName)
    {
        _name = newName;
    }
}
It can sometimes be useful to recognize types as "verifiably deeply immutable", meaning that static analysis can demonstrate that (1) once an instance is constructed, none of its properties will ever change, and (2) every object instance to which it holds a reference is verifiably deeply immutable. Classes which are open to extension cannot be verifiably deeply immutable, because a static analyzer would have no way of knowing whether a mutable subclass might be created, and a reference to that mutable subclass stored within what's supposed to be a verifiably deeply immutable object.
On the other hand, it can sometimes be useful to have abstract (and thus extensible) classes which are specified to be deeply immutable. The abstract class would have no way of forcing derived classes to be immutable, but any mutable derived classes should be considered "broken". The situation would be somewhat analogous to the requirement that two object instances which report themselves as "equal" to each other should report the same hash code. It's possible to design classes which violate that requirement, but any errant hash-table behavior that results is the fault of the broken hash-code function, rather than the hash table.
For example, one might have an abstract ImmutableMatrix class with a method to read the element at a given (row, column) location. One possible implementation would be to back an NxM ImmutableMatrix with an array of N*M elements. On the other hand, it may also be useful to define some subclasses like ImmutableDiagonalMatrix, with an array of N elements, where Value(R,C) would yield 0 for R!=C, and Arr[R] for R==C. If a significant fraction of the matrices one is using will be diagonal matrices, one could save a lot of memory for each such instance. While leaving the class extensible would leave open the possibility that someone might extend it in a fashion which is open to mutation, it would also leave open the possibility that a programmer who knew that many of the matrices a program used would fit some particular form could design a class to optimally store that form.
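A minimal sketch of that idea (class and method names are illustrative):
// Abstract, extensible, but specified to be deeply immutable; subclasses that
// mutate state would be considered "broken", as described above.
public abstract class ImmutableMatrix {
    public abstract int rows();
    public abstract int columns();
    public abstract double value(int row, int column);
}

// Dense NxM backing: all N*M elements stored in a defensively copied array.
class ImmutableDenseMatrix extends ImmutableMatrix {
    private final int rows;
    private final int columns;
    private final double[] elements;

    ImmutableDenseMatrix(int rows, int columns, double[] elements) {
        this.rows = rows;
        this.columns = columns;
        this.elements = elements.clone(); // defensive copy keeps the state frozen
    }

    @Override public int rows() { return rows; }
    @Override public int columns() { return columns; }

    @Override
    public double value(int row, int column) {
        return elements[row * columns + column];
    }
}

// Diagonal backing: only N elements stored; value(R, C) is 0 whenever R != C.
class ImmutableDiagonalMatrix extends ImmutableMatrix {
    private final double[] diagonal;

    ImmutableDiagonalMatrix(double[] diagonal) {
        this.diagonal = diagonal.clone();
    }

    @Override public int rows() { return diagonal.length; }
    @Override public int columns() { return diagonal.length; }

    @Override
    public double value(int row, int column) {
        return row == column ? diagonal[row] : 0.0;
    }
}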

Gson, How to write a JsonDeserializer for Generic Typed Classes?

Situation
I have a class that holds a generic type, and it also has a non-zero arg constructor. I don't want to expose a zero arg constructor because it can lead to erroneous data.
public class Geometries<T extends AbstractGeometry> {
    private final GeometryType geometryType;
    private Collection<T> geometries;

    public Geometries(Class<T> classOfT) {
        this.geometryType = lookup(classOfT); // strict typing.
    }
}
There are several (known and final) classes that may extend AbstractGeometry.
public final class Point extends AbstractGeometry { .... }
public final class Polygon extends AbstractGeometry { .... }
Example json:
{
    "geometryType" : "point",
    "geometries" : [
        { ...contents differ... hence AbstractGeometry },
        { ...contents differ... hence AbstractGeometry },
        { ...contents differ... hence AbstractGeometry }
    ]
}
Question
How can I write a JsonDeserializer that will deserialize a generic typed class (such as Geometries)?
CHEERS :)
p.s. I don't believe I need a JsonSerializer, this should work out of the box :)
Note: This answer was based on the first version of the question. The edits and subsequent question(s) change things.
p.s. I don't believe I need a JsonSerializer, this should work out of the box :)
That's not the case at all. The JSON example you posted does not match the Java class structure you apparently want to bind to and generate.
If you want JSON like that from Java like that, you'll definitely need custom serialization processing.
The JSON structure is
an object with two elements
element 1 is a string named "geometryType"
element 2 is an object named "geometries", with differing elements based on type
The Java structure is
an object with two fields
field 1, named "geometryType", is a complex type GeometryType
field 2, named "geometries" is a Collection of AbstractGeometry objects
Major Differences:
JSON string does not match Java type GeometryType
JSON object does not match Java type Collection
Given this Java structure, a matching JSON structure would be
an object with two elements
element 1, named "geometryType", is a complex object, with elements matching the fields in GeometryType
element 2, named "geometries", is a collection of objects, where the elements of the different objects in the collection differ based on specific AbstractGeometry types
Are you sure that what you posted is really what you intended? I'm guessing that either or both of the structures should be changed.
Regarding any question on polymorphic deserialization, please note that the issue was discussed a few times on StackOverflow.com already. I posted a link to four different such questions and answers (some with code examples) at Can I instantiate a superclass and have a particular subclass be instantiated based on the parameters supplied.
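As a hedged sketch of one common approach (not taken from the answers above): register a custom JsonDeserializer for Geometries that reads the "geometryType" discriminator and lets the context deserialize each array element as the matching concrete class. The addGeometry method and the type-name mapping below are assumptions, since the question doesn't show how Geometries is populated:
import java.lang.reflect.Type;
import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParseException;

public class GeometriesDeserializer implements JsonDeserializer<Geometries<? extends AbstractGeometry>> {

    @Override
    @SuppressWarnings("unchecked")
    public Geometries<? extends AbstractGeometry> deserialize(JsonElement json, Type typeOfT,
            JsonDeserializationContext context) throws JsonParseException {
        JsonObject root = json.getAsJsonObject();

        // Use the discriminator field to pick the concrete class
        String geometryType = root.get("geometryType").getAsString();
        Class<? extends AbstractGeometry> clazz =
                "point".equals(geometryType) ? Point.class : Polygon.class;

        Geometries<AbstractGeometry> result = new Geometries<>((Class<AbstractGeometry>) clazz);
        for (JsonElement element : root.getAsJsonArray("geometries")) {
            // Delegate each element to Gson using the concrete type
            result.addGeometry(context.deserialize(element, clazz)); // addGeometry is hypothetical
        }
        return result;
    }
}
It would be registered with something like new GsonBuilder().registerTypeAdapter(Geometries.class, new GeometriesDeserializer()).create().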
