Throughout a Giraph graph, I need to maintain an array on a Vertex basis to store the results of several "health" checks done at the Vertex level.
Is it as simple as writing a new input format that will get carried over?
My worry goes to the fact that the actual data that will feed the graph does not need to know about this array.
You don’t need to read the data from anywhere: if the array only holds temporary calculations between supersteps, you need neither to read it from the input nor to write it to the output.
You will need to create a new class which implements Writable. You’ll store the array within this class and take care of the serialisation/deserialisation between the supersteps. This is done in the two methods:
@Override
public void write(DataOutput dataOutput) throws IOException {
. . . .
}
@Override
public void readFields(DataInput dataInput) throws IOException {
. . . .
}
Make sure that you read and write the fields in the same order: they are written sequentially into a buffer, so reading them back in a different order would corrupt the values.
Afterwards you just need to specify this class as the generic type parameter for the vertex value type.
If you don’t initialize the vertex value during set-up (when you read the input file), you should do it in the first superstep (superstep 0).
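As a rough illustration of the above, here is a minimal sketch of such a value class. The class name HealthCheckValue, the boolean[] representation and the roundTrip helper are assumptions made for the example. In a real Giraph job the class would additionally declare "implements org.apache.hadoop.io.Writable", which is omitted here so that the sketch compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical vertex value holding the per-vertex "health" check results.
// In Giraph this would also implement org.apache.hadoop.io.Writable.
class HealthCheckValue {
    private boolean[] checks = new boolean[0];

    // A no-argument constructor, so an empty instance can be created
    // and then filled by readFields().
    HealthCheckValue() { }

    HealthCheckValue(boolean[] checks) { this.checks = checks.clone(); }

    boolean[] getChecks() { return checks.clone(); }

    // Serialise: the array length first, then every element.
    public void write(DataOutput out) throws IOException {
        out.writeInt(checks.length);
        for (boolean check : checks) {
            out.writeBoolean(check);
        }
    }

    // Deserialise in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        checks = new boolean[length];
        for (int i = 0; i < length; i++) {
            checks[i] = in.readBoolean();
        }
    }

    // Helper (assumption, for trying things out locally): write a value
    // to a byte buffer and read it back into a fresh instance.
    static HealthCheckValue roundTrip(HealthCheckValue value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            value.write(new DataOutputStream(bytes));
            HealthCheckValue copy = new HealthCheckValue();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the length before the elements is what lets readFields() allocate the array before reading it; swapping the order of the two would break the round trip.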
I’ve written a blog post about complex data types in Giraph about a year ago; maybe it will help you further, although some things might have changed in the meantime.
We are creating a dictionary like application on Hadoop and Hive.
The general process is batch-scanning billions of log records (e.g. words) against a big fixed dictionary (about 100G, like a multi-language WordNet dictionary).
We already have a single-machine version of the Java application (let's call it "singleApp") to query this dictionary. We currently cannot modify either this Java application or the dictionary file, so we cannot redesign and rewrite it as a completely new MapReduce application. We need to use this single-machine Java application as the building block and extend it to a MapReduce version.
Currently, we are able to create a MapReduce application that calls "singleApp" and passes a subset of the dictionary (e.g. a 1G dictionary) using the distributed cache. However, if we use the full dictionary (100G), the app is very slow to start. Furthermore, we really want to install these dictionaries on the Hadoop cluster once, without shipping them each time using the -file or distributed cache options.
We tried copying the dictionary files directly onto the local disks of the slave nodes and pointing the Java app to them, but it could not find the dictionary. Is there any documentation on what needs to be done if we want to debug this approach further?
Any suggestions on the best practice/process for handling situations like this (very large dictionary files that we would prefer to keep installed all the time)?
You don't need to use Hadoop for 100GB of data. You can use your distributed cache as a processing platform as well.
Consider your distributed cache as an In-Memory Data Grid.
Try TayzGrid, an open-source in-memory data grid that supports MapReduce use cases such as yours.
public class ProductAnalysisMapper implements
com.alachisoft.tayzgrid.runtime.mapreduce.Mapper {
@Override
public void map(Object ikey, Object ivalue, OutputMap omap)
{
//This line emits value count to the map.
omap.emit(ivalue, 1);
}
}
public class ProductAnalysisReducer implements
com.alachisoft.tayzgrid.runtime.mapreduce.Reducer {
public ProductAnalysisReducer(Object k) { /* ... */ }
@Override
public void reduce(Object iv) { /* ... */ }
@Override
public void finishReduce(KeyValuePair kvp) { /* ... */ }
}
I'm working on the implementation of a recommendation algorithm with a "special" feature and I would like to perform just this small customization on basic algorithms provided by Apache Mahout.
These are the steps I'm following (*=non-basic steps):
1. create a DataModel: DataModel userModel = new GenericDataModel(userItemPreference);
2. compute similarity: UserSimilarity euclideanDistanceUserSimilarity = new EuclideanDistanceSimilarity(userModel);
3. get a basic neighborhood: UserNeighborhood e_n_neighborhood = new NearestNUserNeighborhood(20, euclideanDistanceUserSimilarity, userModel);
4. compute a "weight" value for each of these users (*)
5. change the similarity of the users inside the neighborhood using the weighted values (e.g. similarity*=weight) (*)
6. get a sub-neighborhood containing only the 10 users with the highest weighted similarity (*)
7. get a recommendation using only this new neighborhood (*)
It should be very simple, but I don't understand why, to build a Recommender, Mahout needs userModel, neighborhood and distanceSimilarity again, so I can't figure out which objects I need to modify and pass to the constructor (apart from the new neighborhood).
UserBasedRecommender t_recommender =
new GenericUserBasedRecommender(userModel, e_n_neighborhood,
euclideanDistanceUserSimilarity, null);
Do you have any suggestion to help me to implement steps 5 and 7 (considering the GenericUserBasedRecommender concerns I just spoke about)?
Thank you
I think I was able to solve the problem, so I will explain here my solution... maybe it will be useful for someone else.
As I said I need to run
UserBasedRecommender t_recommender =
new GenericUserBasedRecommender(userModel, e_n_neighborhood,
euclideanDistanceUserSimilarity, null);
but with the similarity modified by a list of weights, and with the set of users in the neighborhood restricted.
To achieve this, one way to proceed is to create your own implementations of the interfaces expected by the GenericUserBasedRecommender constructor.
These implementations don't need to compute any metric or value; they just need to return the expected elements.
So, after applying the weight and filtering the user list, the main points are the following.
Create a new userModel (like the existing one) containing only the users of interest, with all items and all scores
Create a UserNeighborhood implementing just
public long[] getUserNeighborhood(long userID) throws TasteException;
This class should implement a basic neighborhood container, and the method should just return the static list of neighbors for any user (static, because you already have it, so you can simply pass it in through the constructor and keep it in memory).
Create a UserSimilarity implementing just
public double userSimilarity(long userID1, long userID2) throws TasteException;
The behaviour should be the same as that of the previous class (so you can use the values from the initial UserSimilarity, remembering to apply the weight as well).
Now you can proceed with the recommendation, passing these new objects.
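To make the idea more concrete, here is a minimal sketch of the weighting and filtering step in plain Java, without any Mahout dependency. The class WeightedNeighborhood, its constructor arguments, the default weight of 1.0 for users without a weight, and returning NaN for unknown users are all assumptions made for the example. The two methods only mirror the signatures quoted above; in a real project the logic would be split across classes implementing Mahout's UserNeighborhood and UserSimilarity interfaces, which declare further methods not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container: holds the weighted similarities of one target
// user's neighbors and the static list of the N best ones.
class WeightedNeighborhood {
    private final Map<Long, Double> weightedSimilarity = new HashMap<>();
    private final long[] neighbors;

    // similarity: neighbor ID -> similarity from the initial UserSimilarity
    // weight:     neighbor ID -> custom weight (assumed 1.0 when missing)
    // keep:       how many of the best weighted neighbors to retain
    WeightedNeighborhood(Map<Long, Double> similarity,
                         Map<Long, Double> weight, int keep) {
        for (Map.Entry<Long, Double> e : similarity.entrySet()) {
            double w = weight.getOrDefault(e.getKey(), 1.0);
            weightedSimilarity.put(e.getKey(), e.getValue() * w);
        }
        // Sort neighbor IDs by weighted similarity, highest first.
        List<Long> ids = new ArrayList<>(weightedSimilarity.keySet());
        ids.sort((a, b) -> Double.compare(
                weightedSimilarity.get(b), weightedSimilarity.get(a)));
        int size = Math.min(keep, ids.size());
        neighbors = new long[size];
        for (int i = 0; i < size; i++) {
            neighbors[i] = ids.get(i);
        }
    }

    // Same shape as UserNeighborhood.getUserNeighborhood: returns the
    // precomputed list, regardless of which user is asked for.
    public long[] getUserNeighborhood(long userID) {
        return neighbors.clone();
    }

    // Same shape as UserSimilarity.userSimilarity: looks up the stored
    // weighted value for the second user; NaN marks an unknown user.
    public double userSimilarity(long userID1, long userID2) {
        Double value = weightedSimilarity.get(userID2);
        return value == null ? Double.NaN : value;
    }
}
```

Because everything is computed once in the constructor, both methods are plain lookups, which matches the point that these implementations only need to return the expected elements.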
I will explain with an example. My GWT project has a Company module, which lets a user add, edit, delete, select and list companies.
Of these, the add, edit and delete operations land the user back on the CompanyList page.
Thus, having three different events - CompanyAddedEvent, CompanyUpdatedEvent and CompanyDeletedEvent, and their respective event handlers - seems overkill to me, as there is absolutely no difference in their function.
Is it OK to let a single event manage the three operations?
One alternative, I think, is to use an event like CompanyListInvokedEvent. However, I feel it is not appropriate, as the event is actually not the list being invoked, but a company being added/updated/deleted.
If it had been only a single module, I would have got the task done with three separate events. But 10 other such modules face this dilemma. That means 10x3 = 30 event classes along with their 30 respective handlers. The number is large enough for me to reconsider.
What would be a good solution to this?
UPDATE -
@ColinAlworth's answer made me realize that I could easily use generics instead of my stupid solution. The following code represents an event, EntityUpdatedEvent, which would be raised whenever an entity is updated.
Event class -
public class EntityUpdatedEvent<T> extends GwtEvent<EntityUpdatedEventHandler<T>>{
private Type<EntityUpdatedEventHandler<T>> type;
private final String statusMessage;
public EntityUpdatedEvent(Type<EntityUpdatedEventHandler<T>> type, String statusMessage) {
this.statusMessage = statusMessage;
this.type = type;
}
public String getStatusMessage() {
return this.statusMessage;
}
@Override
public com.google.gwt.event.shared.GwtEvent.Type<EntityUpdatedEventHandler<T>> getAssociatedType() {
return this.type;
}
@Override
protected void dispatch(EntityUpdatedEventHandler<T> handler) {
handler.onEventRaised(this);
}
}
Event handler interface -
public interface EntityUpdatedEventHandler<T> extends EventHandler {
void onEventRaised(EntityUpdatedEvent<T> event);
}
Adding the handler to event bus -
eventBus.addHandler(CompanyEventHandlerTypes.CompanyUpdated, new EntityUpdatedEventHandler<Company>() {
@Override
public void onEventRaised(EntityUpdatedEvent<Company> event) {
History.newItem(CompanyToken.CompanyList.name());
Presenter presenter = new CompanyListPresenter(serviceBundle, eventBus, new CompanyListView(), event.getStatusMessage());
presenter.go(container);
}
});
Likewise, I have two other generic events, Added and Deleted, thus eliminating all the redundancy from my event-related codebase.
Are there any suggestions on this solution?
P.S. > This discussion provides more insight on this problem.
To answer this question, let me first pose another way of thinking about this same kind of problem - instead of events, we'll just use methods.
In my tiered application, two modules communicate via an interface (notice that these methods are all void, so they are rather like events - the caller doesn't expect an answer back):
package com.acme.project;
public interface CompanyServiceInteface {
public void addCompany(CompanyDto company) throws AcmeBusinessLogicException;
public void updateCompany(CompanyDto company) throws AcmeBusinessLogicException;
public void deleteCompany(CompanyDto company) throws AcmeBusinessLogicException;
}
This seems like overkill to me - why not just reduce the size of this API to one method, and add an enum argument to simplify this. This way, when I build an alternative implementation or need to mock this in my unit tests, I just have one method to build instead of three. This gets to be clearly overkill when I make the rest of my application - why not just ObjectServiceInterface.modify(Object someDto, OperationEnum invocation); to work for all 10 modules?
One answer is that you might want to drastically modify the implementation of one but not the others - now that you've reduced this to just one method, all of this belongs inside that switch case. Another is that once simplified in this way, the inclination is often to simplify further - perhaps to combine create and update into just one method. Once this is done, all call sites must make sure to fulfill all possible details of that method's contract instead of just the one specific one.
If the receivers of those events are simple and will remain so, there may be no good reason to not just have a single ModelModifiedEvent that clearly is generic enough for all possible use cases - perhaps just wrapping the ID to request that all client modules refresh their view of that object. If a future use case arises where only one kind of event is important, now the event must change, as must all sites that cause the event to be created so that they properly populate this new field.
Java shops typically don't use Java because it is the prettiest language, or because it is the easiest language to write or find developers for, but because it is relatively easy to maintain and refactor. When designing an API, it is important to consider future needs, but also to think about what it will take to modify the current API - your IDE almost certainly has a shortcut key to find all invocations of a particular method or constructor, allowing you to easily find all places where that is used and update them. So consider what other use cases you expect, and how easily the rest of the codebase can be updated.
Finally, don't forget about generics - for my example above, I would probably make a DtoServiceInterface to simplify matters, so that I just declare the one interface with three methods, and implement it and refer to it as needed. In the same way, you can make one set of three GwtEvent types (with *Handler interfaces and possibly Has*Handlers as well), but keep them generic for all possible types. Consider com.google.gwt.event.logical.shared.SelectionEvent<T> as an example here - in your case you would probably want to make the model object type a parameter so that handlers can check which type of event they are dealing with (remember that generics are erased in Java), or source from one EventBus for each model type.
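As a rough sketch of the generic service idea, the three-method interface could be declared once and reused for every module. The names DtoService and InMemoryDtoService, the id-extractor function and the use of IllegalStateException are assumptions made for the example; the checked AcmeBusinessLogicException from the interface above is left out to keep the sketch self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical generic counterpart of the three-method service interface:
// declared once, parameterised by the DTO type of each module.
interface DtoService<T> {
    void add(T dto);
    void update(T dto);
    void delete(T dto);
}

// Minimal in-memory implementation, e.g. for use as a mock in unit tests.
// K is the type of the DTO's identifier.
class InMemoryDtoService<K, T> implements DtoService<T> {
    private final Map<K, T> store = new LinkedHashMap<>();
    private final Function<T, K> idOf;

    InMemoryDtoService(Function<T, K> idOf) {
        this.idOf = idOf;
    }

    @Override
    public void add(T dto) {
        // putIfAbsent returns the existing value if the id is already taken.
        if (store.putIfAbsent(idOf.apply(dto), dto) != null) {
            throw new IllegalStateException("already exists: " + idOf.apply(dto));
        }
    }

    @Override
    public void update(T dto) {
        // replace returns null if there was nothing to replace.
        if (store.replace(idOf.apply(dto), dto) == null) {
            throw new IllegalStateException("unknown: " + idOf.apply(dto));
        }
    }

    @Override
    public void delete(T dto) {
        store.remove(idOf.apply(dto));
    }

    T find(K id) {
        return store.get(id);
    }

    int size() {
        return store.size();
    }
}
```

Each method keeps its own narrow contract (add rejects duplicates, update rejects unknown ids), which is the property that would be lost by collapsing everything into a single modify method with an enum argument.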
I am working on a real-time application in Java, and I have a data structure that looks like this:
HashMap<Integer, Object> myMap;
Now this works really well for storing the data that I need, but it kills me on getting data out. The underlying problem that I run into is that if I call
Collection<Object> myObjects = myMap.values();
Iterator<Object> it = myObjects.iterator();
while (it.hasNext()) { Object o = it.next(); }
I declare the iterator and collection as variable in my class, and assign them each iteration, but iterating over the collection is very slow. This is a real time application so need to iterate at least 25x per second.
Looking at the profiler I see that there is a new instance of the iterator being created every update.
I was thinking of a few ways of changing the HashMap to possibly fix my problems.
1. cache the iterator somehow, although I'm not sure if that's possible.
2. possibly change the return type of HashMap.values() to return a list instead of a collection
3. use a different data structure, but I don't know what I could use.
If this is still open, use Google Guava collections. They have things like Multimap for the structures you are defining. OK, these might not be an exact replacement, but close:
From the website here: https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.
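For comparison, this is roughly what the hand-rolled structure from the quote looks like in plain Java: a small sketch of an unlabeled directed graph stored as a Map of Sets. The class name DirectedGraph and its methods are assumptions made for the example. The bookkeeping around missing keys in addEdge and successorsOf is the part a Multimap takes off your hands.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical hand-rolled "multimap": each node maps to the set of
// nodes it has an edge to.
class DirectedGraph<N> {
    private final Map<N, Set<N>> successors = new HashMap<>();

    void addEdge(N from, N to) {
        // Create the value set on first use, then add the edge target.
        successors.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    Set<N> successorsOf(N node) {
        // Missing keys must be handled explicitly to avoid returning null.
        return successors.getOrDefault(node, Collections.emptySet());
    }
}
```

Note that this only addresses the key-to-many-values modelling; it does not by itself change how fast iterating over values() is.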
I'm writing a turn-based strategy game. Each player in the game has a team of units which can be individually controlled. On a user's turn, the game currently follows a pretty constant sequence of events:
Select a unit -> Move the selected unit -> Issue a command -> Confirm
I could implement this by creating a game class that keeps track of which of these stages the player is in and providing methods to move from one stage to the next, like this:
interface TeamCommander {
public void select(Coordinate where);
public void move(Coordinate to);
public void sendCommand(String command);
public void execute();
}
However, that would allow the possibility of a method being called at the wrong time (for example, calling move() before calling select()), and I would like to avoid that. So I currently have it implemented statelessly, like this:
interface UnitSelector {
public UnitMover select(Coordinate where);
}
interface UnitMover {
public UnitCommander move(Coordinate to);
}
interface UnitCommander {
public CommandExecutor sendCommand(String command);
}
interface CommandExecutor {
public void execute();
}
However, I'm having difficulty presenting this information to the user. Since this is stateless, the game model does not store any information about what the user is currently doing, and thus the view can't query the model about it. I could store some state in the GUI, but that would be bad form. So, my question is: does anyone have an idea about how to resolve this?
First, there's something I'm not getting here: You have to be storing persistent state somewhere, even if it is only in the View / GUI. Without persistent state you cannot have a game. I'm guessing you're using either ASP or PHP; if so, use sessions to track state.
Secondly, build your state logic into that so it is known where in the input sequence you are for each player / each unit in that player's team. Don't try to get fancy with it. B requires A, C requires B and so on. While you're writing it, just give yourself a scaffold by throwing exceptions if the call order comes up incorrect (which you should be checking on every user input as I assume this is an event driven rather than loop-driven game), and debug it from there.
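A minimal sketch of that scaffold, assuming a single tracker object per player turn: the class name TurnState, the Stage enum and the parameterless methods are made up for the example (the real methods would take the Coordinate and command arguments from your interface). Each call checks the expected stage and throws if it arrives out of order, and the view can query getStage() to find out what the user is currently doing.

```java
// Hypothetical stage tracker for one player's turn.
class TurnState {
    enum Stage { SELECT, MOVE, COMMAND, CONFIRM }

    private Stage stage = Stage.SELECT;

    // Lets the view ask the model where in the sequence the player is.
    Stage getStage() {
        return stage;
    }

    void select() {
        advance(Stage.SELECT, Stage.MOVE);
    }

    void move() {
        advance(Stage.MOVE, Stage.COMMAND);
    }

    void sendCommand() {
        advance(Stage.COMMAND, Stage.CONFIRM);
    }

    void execute() {
        // Confirming ends the sequence and starts the next selection.
        advance(Stage.CONFIRM, Stage.SELECT);
    }

    // B requires A, C requires B, and so on: reject calls made out of order.
    private void advance(Stage expected, Stage next) {
        if (stage != expected) {
            throw new IllegalStateException(
                    "expected stage " + expected + " but was " + stage);
        }
        stage = next;
    }
}
```

With this, calling move() before select() fails immediately with an exception instead of silently corrupting the turn.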
As an aside: I get suspicious when I see interfaces with a single method as in your second example above. An interface typically informs of there being a unique SET of functionalities which different classes each fulfill -- unless you are trying to construct multiple different classes which use slightly different sets of individual method signatures, don't do what you're doing there. It is all fine and good to say "code to an interface and not an implementation", but you need to first take the top down approach, saying, "How does my ultimate client code (in your root game logic class or method) need to call for such-and-such to occur?" and keep asking that question up the call stack (i.e. at each subsequent sub-call codepoint). If you try to build it from the bottom up, you will end up with the confusing and unnecessarily complicated code I see there. The only other exception to this which I see on a regular basis is the command pattern, and that is generally intended to look like
void execute();
or
void execute(Object data);
...But typically not a whole slew of slightly different method signatures (again possible, but unlikely). My gut feeling comes from my experience with such constructs in that they usually don't make sense and you end up completely refactoring code that uses them.