How do I get sorted output using Hadoop MapReduce programming?
Is there any way to get the final key-value pairs in sorted order (either by key or by value)?
Any pointers on this greatly appreciated.
Thank You
R
By default, MapReduce sorts records by their keys during the shuffle, so each reducer receives its input in key order.
However, it might help you more to download the latest Hadoop release and check out the examples it ships with. There are several different sort examples among them.
If you need more information on sort order, this is how it can be changed.
The sort order for keys is controlled by a RawComparator, which is found as follows:
If the property mapred.output.key.comparator.class is set, an instance of that class is used. (The setOutputKeyComparatorClass() method on JobConf is a convenient way to set this property.)
Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to the WritableComparable’s compareTo() method.
These rules reinforce why it’s important to register optimized versions of RawComparators for your own custom Writable classes, and also that it’s straightforward to override the sort order by setting your own comparator.
"Hadoop: The Definitive Guide" 2nd edition describes global sort in chapter 8 with code samples.
"Sets can be declared using instances of the Set and RangeSet classes or by assigning set expressions. The simplest set declaration creates a set and postpones creation of its members."
That isn't a definition. What should a set be used for?
A set is an indexing mechanism. If you are familiar with basic Python, you index lists by a numerical index. You can “index” a dictionary by keys that are hashable, etc.
So in most models you have collections of things, perhaps products, with variables and parameters (constants) that are related to these collections. So you might have a group of products {pc, tablet, iphone} and parameters that are logically indexed by this set: cost[pc], cost[tablet], etc.
In Pyomo, you can declare a set and use that set to index a variable or parameter, etc. At the simplest level, you can just use a range of numbers, but you might use something more logical, depending on the model.
If this is confusing, you might consider locating an introductory textbook on Linear Programming.
The Pyomo Documentation (Release 6.4.2) defines Set as "a component used to index other components" (cf. page 255).
Pyomo Set objects are compatible with Python set objects. It might help to look at the Python documentation:
A set is an unordered collection with no duplicate elements.
Sets can be used to model the presence or absence of properties (colors, brands, etc). Sets are basic data structures for all sorts of algorithms. They can be used to model the relations between other objects. The theory of sets was historically promoted/discussed as the basis of mathematics (Wikipedia link).
I have an array of custom classes. I've defined <=> on them, and have tested to make sure that my custom definition behaves as it should. I assumed that I could then call [].uniq and have it filter out my duplicates, but that isn't happening. Is there another operator I need to overload?
Array#uniq is based on equality, not on ordering, so your objects need to respond to eql?. Also, it uses hashing to speed up performance, so you need to implement hash as well.
Unfortunately, this contract isn't specified in the documentation, but it usually is specified in pretty much every Ruby book or course.
What I needed to implement was .hash.
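For reference, here is a minimal sketch of the contract described above. The Item class and its name and size fields are made up for illustration; the point is only that <=> handles ordering while eql? and hash are what Array#uniq relies on.

```ruby
# Hypothetical class for illustration; the name and fields are assumptions.
class Item
  include Comparable

  attr_reader :name, :size

  def initialize(name, size)
    @name = name
    @size = size
  end

  # Ordering only: used by sort, min, max, and Comparable's ==.
  def <=>(other)
    [name, size] <=> [other.name, other.size]
  end

  # Equality used by hash-based lookups (Array#uniq, Hash keys).
  def eql?(other)
    other.is_a?(Item) && name == other.name && size == other.size
  end

  # Objects that are eql? must return the same hash value.
  def hash
    [name, size].hash
  end
end

items = [Item.new("a", 1), Item.new("a", 1), Item.new("b", 2)]
unique_items = items.uniq
puts unique_items.length
```

Removing either eql? or hash from this sketch should bring the duplicates back, which matches the behaviour described in the question.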
I have a Core Data entity with four boolean non-optional properties, defaulted to NO. A class gets the entity object when the class is initialized, so this is not a result of an NSFetchRequest, and one of these four properties will be set to YES.
The class needs to know which property is YES.
Of course, I can use nested IF/Else statements (or ternaries) to find out which property is YES, but I'm wondering if there is a better (meaning more cocoa-ish) way to look at the entity and say 'is there a boolean value YES in your properties?'.
Also, I can remodel so that the booleans have no default value and only look for the boolean that is YES, but that seems to be the same question.
Well, there are several different possibilities. Using four different boolean properties is a clean solution. You then have to use if ... else if statements to find out which one is set.
A more C-like way of doing that would be to define bitmasks which can be OR'ed together and stored as an NSUInteger. If it makes sense semantically, you could group them together in an enum, but that is the C way.
You could also define a custom subclass of NSManagedObject and write some convenience methods to check these options. Depends a bit on what they are good for.
You could use reflection (e.g., class_copyPropertyList and class_getProperty) to check what properties the class has, and examine their values, but that's a pretty heavy-handed approach when you already know which four properties are relevant. I wouldn't suggest this approach, and I wouldn't call it more Cocoa-ish, just more abstracted.
If you're looking at specific combinations of states, I think GorillaPatch's suggestion is right: You would turn those four booleans into a single 4-bit integer and compare it against bit masks representing the various combinations you're interested in.
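The bitmask idea itself is language-neutral, so here is a rough sketch of it written in Ruby rather than Objective-C. The flag names and the names_for helper are invented for illustration; in the real project the integer would be the NSUInteger mentioned above.

```ruby
# Hypothetical flag names; each flag occupies its own bit.
module Option
  FIRST  = 1 << 0
  SECOND = 1 << 1
  THIRD  = 1 << 2
  FOURTH = 1 << 3

  NAMES = {
    FIRST => :first, SECOND => :second, THIRD => :third, FOURTH => :fourth
  }.freeze

  # Returns the names of every flag that is set in the stored integer.
  def self.names_for(flags)
    NAMES.select { |bit, _name| (flags & bit) != 0 }.values
  end
end

stored = Option::THIRD                      # exactly one option set
combined = Option::FIRST | Option::FOURTH   # two options OR'ed together

single_names = Option.names_for(stored)
combined_names = Option.names_for(combined)
fourth_set = (combined & Option::FOURTH) != 0 # test one specific flag

p single_names
p combined_names
p fourth_set
```

With a single stored integer, "which option is on?" becomes one lookup instead of a chain of conditionals, at the cost of a less self-describing data model.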
I am working in a group that is writing some APIs for tools that we are using in Ruby. When writing API methods, many of my team mates use hash tables as the method's only parameter while I write my methods with each value specified.
For example, a class Apple defined as:
class Apple
#commonName
#volume
#color
end
I would instantiate the class with method:
Apple.new( commonName, volume, color )
My team mates would write it so the method looked like:
Apple.new( {"commonName"=>commonName, "volume"=>volume, "color"=>color} )
I don't like using a hash table as the input. To me it seems unnecessarily bulky and doesn't add any clarity to the code. While it doesn't appear to be a big deal in this example, some of our methods have more than 10 parameters, and there will often be hash tables nested inside other hash tables. I also noticed that using hash tables in this way is extremely uncommon in public APIs (net/telnet is the only exception that I can think of right now).
Question: What arguments could I make to my team members to not use hash tables as input parameters? The bulkiness of the code isn't a sufficient justification (they are not afraid of writing 200-400 character lines), and excessive memory/processing overhead won't work because it won't become an issue with the way our tools will be used.
Actually, if your method takes more than 10 arguments, you should either redesign your class or eat dirt and use hashes. For any method that takes more than 4 arguments, using positional arguments can be counter-intuitive when calling the method, because you have to remember the order correctly.
I think the best solution would be to simply redesign such methods and use something like the builder or fluent patterns.
First of all, you should chide them for using strings instead of symbols for hash keys.
One issue with using a hash is that you then have to check that all the appropriate keys are in it. This makes it useful for optional parameters, but for mandatory ones, why not use the built-in functionality of the language? For example, with their method, what happens if I do this:
Apple.new( {"commonName"=>commonName, "volume"=>volume} )
Whereas, with Apple.new(commonName, volume), you know you'll get an ArgumentError.
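To make the difference concrete, here is a small sketch. PositionalApple and HashApple are hypothetical stand-ins for the two styles of Apple constructor, not anyone's real code; the last part shows Hash#fetch as one way to make a hash-based constructor fail loudly.

```ruby
# Hypothetical implementations of the two styles discussed above.
class PositionalApple
  attr_reader :common_name, :volume, :color

  def initialize(common_name, volume, color)
    @common_name = common_name
    @volume = volume
    @color = color
  end
end

class HashApple
  attr_reader :common_name, :volume, :color

  def initialize(options)
    @common_name = options["commonName"]
    @volume = options["volume"]
    @color = options["color"] # silently nil when the key is missing
  end
end

# Positional style: a missing argument is rejected by the language itself.
positional_error =
  begin
    PositionalApple.new("Gala", 3)
    nil
  rescue ArgumentError => e
    e
  end

# Hash style: the object is created with a nil color and no complaint.
hash_apple = HashApple.new({ "commonName" => "Gala", "volume" => 3 })

# Hash#fetch raises KeyError for a missing key instead of returning nil.
fetch_error =
  begin
    { "commonName" => "Gala" }.fetch("color")
    nil
  rescue KeyError => e
    e
  end

puts positional_error.class
puts hash_apple.color.inspect
puts fetch_error.class
```

So a hash-based constructor can be made strict, but only by writing the checks that positional parameters give you for free.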
Named parameters make for more self-documenting code which is nice. But other than that there's not a lot of difference. The Hash allows for more flexibility, especially if you start doing any method aliasing. Also, the various Hash methods in ActiveSupport make setting defaults and verifying inputs pretty painless. I guess this probably wasn't the answer you were looking for.
Update: Please read this question in the context of design principles, elegance, expression of intent, and especially the "signals" sent to other programmers by design choices.
I have two "views" of a set of objects. One is a dictionary/map indexing the objects by a string value. The other is a dictionary/map indexing the objects by an ordinal (ordering integer). There is no "master" collection of the objects by themselves that can serve as the authoritative source for the number of objects, but the two dictionaries should always both contain references to all the objects.
When a new item is added to the set a reference is added to both dictionaries, and then some processing needs to be done which is affected by the new total number of objects.
What should I use as the authoritative source to reference for the current size of the set of objects? It seems that all my options are flawed in one dimension or another. I can just consistently reference one of the dictionaries, but that would codify an implication of that dictionary's superiority over the other. I could add a 3rd collection, a simple list of the objects to serve as the authoritative list, but that increases redundancy. Storing a running count seems simplest, but also increases redundancy and is more brittle than referencing a collection's self-tracked count on the fly.
Is there another option that will allow me to avoid choosing the lesser evil, or will I have to accept a compromise on elegance?
I would create a class that has (at least) two collections.
A version of the collection that is sorted by string
A version of the collection that is sorted by ordinal
(Optional) A master collection
The class would handle the nitty gritty management:
The syncing of the contents for the collections
Standard collection actions (e.g. let users get the size, add or retrieve items)
Let users get by string or ordinal
That way you can use the same collection wherever you need either behavior, but still abstract away the "indexing" behavior you are going for.
The separate class gives you a single interface with which to explain your intent regarding how this class is to be used.
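A minimal sketch of such a class, written in Ruby only for illustration (the question does not name a language). DualIndex and all of its method names are assumptions; the optional master collection is left out.

```ruby
# Hypothetical wrapper; class and method names are assumptions.
class DualIndex
  def initialize
    @by_name = {}
    @by_ordinal = {}
  end

  # Adds the object to both indexes in one place so they cannot drift apart.
  def add(name, ordinal, object)
    if @by_name.key?(name) || @by_ordinal.key?(ordinal)
      raise ArgumentError, "duplicate name or ordinal"
    end
    @by_name[name] = object
    @by_ordinal[ordinal] = object
    self
  end

  def fetch_by_name(name)
    @by_name.fetch(name)
  end

  def fetch_by_ordinal(ordinal)
    @by_ordinal.fetch(ordinal)
  end

  # The only place that decides where the count comes from.
  def size
    @by_name.size
  end
end

index = DualIndex.new
index.add("alpha", 1, :first_object).add("beta", 2, :second_object)
puts index.size
```

Callers only ever ask the wrapper for size, so which dictionary backs the count becomes a private detail that signals nothing about one index being superior to the other.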
I'd suggest encapsulation: create a class that hides the "management" details (such as the current count) and use it to expose immutable "views" of the two collections.
Clients will ask the "manglement" object for an appropriate reference to one of the collections.
Clients adding a "term" (for lack of a better word) to the collections will do so through the "manglement" object.
This way your assumptions and implementation choices are "hidden" from clients of the service and you can document that the choice of collection for size/count was arbitrary. Future maintainers can change how the count is managed without breaking clients.
BTW, yes, I meant "manglement" - my favorite malapropism for management (in any context!)
If both dictionaries contain references to every object, the count should be the same for both of them, correct? If so, just pick one and be consistent.
I don't think it is a big deal at all. Just reference the sets in the same order each time you need to get access to them.
If you really are concerned about it you could encapsulate the collections with a wrapper that exposes the public interfaces - like
Add(item)
Count()
This way it will always be consistent and atomic - or at least you could implement it that way.
But, I don't think it is a big deal.