MapReduce Sorting on the basis of three fields - hadoop

This question is regarding the Map/Reduce sorting. I have three fields
XXID, Identifier, TimeStamp
The XXID can be any String value; the identifier has two possible values, 1 or 2.
I want the sorting to be such that all records with the same XXID go to the same reducer, and within the reducer's iterable the records with identifier 1 come first, in increasing timestamp order, followed by the records with identifier 2.
Can anybody help me with this?

You are definitely violating the MapReduce framework to do this, but you gotta do what you gotta do!
First of all, sorting is only done on the key, so you have to assume that the values will arrive in an arbitrary order. That means we need to figure out how to get the XXID, Identifier, and TimeStamp all into the key together. (You can probably just use NullWritable as the value now.)
To fit the three items into a key, you should make a new data type by implementing WritableComparable. Have this new class wrap the three values; let's call it JavanxTriple.
The way you customize the MapReduce sort of JavanxTriple items is to override the .compareTo method from Comparable. Make it compare the XXID first, then the identifier (1 before 2), then the timestamp.
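A minimal sketch of what that key class could look like (the field types here are assumptions, and in practice you might also register a faster RawComparator):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key wrapping XXID, Identifier (1 or 2), and TimeStamp.
public class JavanxTriple implements WritableComparable<JavanxTriple> {
    private String xxid = "";
    private int identifier;   // 1 or 2
    private long timeStamp;

    public JavanxTriple() {}  // no-arg constructor required by Hadoop

    public JavanxTriple(String xxid, int identifier, long timeStamp) {
        this.xxid = xxid;
        this.identifier = identifier;
        this.timeStamp = timeStamp;
    }

    public String getXxid() { return xxid; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(xxid);
        out.writeInt(identifier);
        out.writeLong(timeStamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        xxid = in.readUTF();
        identifier = in.readInt();
        timeStamp = in.readLong();
    }

    // Sort by XXID first, then identifier (1 before 2), then timestamp ascending.
    @Override
    public int compareTo(JavanxTriple other) {
        int cmp = xxid.compareTo(other.xxid);
        if (cmp != 0) return cmp;
        cmp = Integer.compare(identifier, other.identifier);
        if (cmp != 0) return cmp;
        return Long.compare(timeStamp, other.timeStamp);
    }

    @Override
    public int hashCode() {
        return xxid.hashCode();  // based on XXID only, matching the partitioner below
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof JavanxTriple)) return false;
        JavanxTriple t = (JavanxTriple) o;
        return xxid.equals(t.xxid) && identifier == t.identifier && timeStamp == t.timeStamp;
    }
}
```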
Next, you need to solve the problem that, since each of these triples is a separate key, the data will by default go to different reducers, and out of the box you won't get the streams of data that you want. To get around this, you write a custom partitioner. The partitioner tells the framework which reducer each record goes to; you override .getPartition. When calculating the partition, use only the XXID (not the Identifier and TimeStamp portions of the key). That way, all items with the same XXID are sent to the same reducer.
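A sketch of such a partitioner, assuming the JavanxTriple key above and NullWritable values:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Route records by XXID only, so every identifier/timestamp combination for the
// same XXID lands on the same reducer.
public class XxidPartitioner extends Partitioner<JavanxTriple, NullWritable> {
    @Override
    public int getPartition(JavanxTriple key, NullWritable value, int numPartitions) {
        // Mask off the sign bit to keep the result non-negative.
        return (key.getXxid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```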
Finally, you now have the problem that the reducer implementation will not be typical. The reduce method will only get called once per key, and the Iterable that gets passed in will only have a NullWritable in it.
To get around this, use some static variables (or instance fields) in the Reducer class to keep track of what is going on across reduce calls. You have to detect when the XXID changes so you know when to finish one analysis and start the next. You may have to use the setup and cleanup methods to set things up and finish things off.
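For instance, a skeleton reducer along those lines might look like the following (it uses an instance field rather than a static, which is usually enough because a single Reducer object handles all keys of a task):

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keys arrive already sorted: by XXID, then identifier (1 before 2), then timestamp.
// We remember the previous XXID so we can tell when a new group starts.
public class TripleReducer extends Reducer<JavanxTriple, NullWritable, Text, NullWritable> {
    private String currentXxid = null;

    @Override
    protected void reduce(JavanxTriple key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        if (!key.getXxid().equals(currentXxid)) {
            // A new XXID group has started: finish the previous analysis here,
            // then start a new one.
            currentXxid = key.getXxid();
        }
        // ... accumulate or emit per-record results here ...
        context.write(new Text(key.getXxid()), NullWritable.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Finish the analysis for the last XXID group here.
    }
}
```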

Related

AnyLogic: Same random value for every simulation run?

I want to build a discrete-event simulation which contains 6 processes modeled as delays. At model startup I want to initialize the delay times for every delay/process station.
I have written a Java class "Prozess2", and every "Prozess2" object contains 6 CustomDistributions. At object initialization of "Prozess2" I draw one random value from each of them. In the end, I aggregate the 6 random values into the delay time.
I therefore want to get different random values at startup for each delay time. However, when I run the simulation over and over again, I always get the same aggregated delay time by Math.round(time()).
In the constructor of "Prozess2" I hand over a RandomNumberGenerator called rng (which lives on the main agent) and an instance of the main agent:
Where is my mistake?
In my opinion, it is best practice to explicitly supply your model with a random seed as one of its parameters and then use this seed to create a random number generator for each and every process that has randomness. That way you can be assured that if you change one part of the model, either through logic or input, the other parts will still have the same random streams as they had before.
See example below:
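The original snippet did not survive here, but a minimal sketch of the idea, assuming a model-level parameter named seed, could be:

```java
// A model parameter, e.g. defined on Main (hypothetical name and value).
long seed = 12345;

// Derive an independent generator for each process that has randomness, so that
// changing one part of the model does not shift the random streams of the others.
java.util.Random processRng1 = new java.util.Random(seed + 1);
java.util.Random processRng2 = new java.util.Random(seed + 2);
```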
And then you can use it
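For example (only a sketch; AnyLogic's distribution functions have overloads that take a custom java.util.Random as their last argument, and the bounds here are placeholders):

```java
// Draw the delay time from the process-specific generator instead of the
// model's default generator (getDefaultRandomGenerator()).
double delayTime = uniform(2.0, 5.0, processRng1);
```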
So in your case, pass the seed to the model and use it instead of getDefaultRandomGenerator().
You can also see this question for a similar problem to do with randomness:
Why do two flowcharts set up exactly the same end with different results every time the simulation is run even when I use a fixed seed?
AnyLogic will generate the same random numbers if you manually run exactly the same model multiple times. The numbers will change if you add new blocks or delete some of the existing ones in your model. To my understanding, AnyLogic uses the data in your model (blocks, variables, parameters, etc.) to add noise to the random number generator. That being said, the correct way of getting different outputs for different runs is to use a Parameters Variation experiment, as follows:
Right-click on your model name in the AnyLogic window, select New, then Experiment, and in the dialog that appears select Parameters Variation.
In the Properties window, you have three options for Randomness. By selecting the first option, you will get different random numbers for different runs.

several questions about multi-paxos?

I have several questions about multi-paxos
Will each instance have its own proposal number, accepted ballot, and accepted value? Or do all the instances share the same proposal number, with one instance starting after another finishes?
If all the instances share the same proposal number, consider the following situation: server A sends a proposal, and the acceptor returns an accepted instanceId which might be greater or less than the proposal's instanceId. What should the proposer do then? Use that instanceId and its value for the accept phase, then increase its own instanceId, wait for the next round, and re-propose with its own value? If so, when is the previously accepted value removed? Because if it is never removed, the acceptor will return that instanceId and value again, and it seems to become a loop.
Multi-Paxos has a vague description, so two people may build two different systems based on it; in the context of one system the answer is "no," and in the context of another it's "yes."
Approach #1 - "No"
Mindset: Paxos is a two-phase protocol for building write-once registers. Multi-Paxos is a technique for creating a log on top of them.
One of the possible ways to build a log is
Create an array of completely independent write-once registers and initialize the first one with an initial value.
On new record we should:
A) Guess an index (X) of a vacant register and try to write a dummy record here (if it's already used then pick a register with a higher index and retry).
B) Start writing dummy records to every register with index smaller than X until we find a register filled with a non-dummy record.
C) Calculate a new record based on it (e.g., a record may carry an ordinal, and we can use it to calculate the ordinal of the new record; since some registers are filled with dummy records, the ordinals aren't equal to the indices) and write it to the X+1 register. In case of a conflict, we should restart the procedure from step A).
To read the log, we start writing dummy values from the first register, and on each conflict we increment the index and retry until a write succeeds, which indicates that the log's end has been reached.
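As a rough illustration only (not the author's code), the read procedure could be sketched like this, assuming a propose operation that returns whatever value the register ends up holding:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.IntFunction;

// Hypothetical interface for a Paxos-backed write-once register: propose(v) runs
// the two-phase protocol and returns the value the register ends up holding --
// v itself if the register was still vacant, or the value decided earlier.
interface WriteOnceRegister {
    String propose(String value);
}

class RegisterLog {
    private static final String DUMMY_PREFIX = "<dummy:";

    // Read the log by proposing dummies from index 0 upward. A register that
    // already holds a real record reveals one log entry; one that holds someone
    // else's dummy is skipped; the first register where our own dummy wins was
    // vacant, which marks the end of the log.
    static List<String> readLog(IntFunction<WriteOnceRegister> registerAt) {
        List<String> records = new ArrayList<>();
        String myDummy = DUMMY_PREFIX + UUID.randomUUID() + ">";
        for (int i = 0; ; i++) {
            String decided = registerAt.apply(i).propose(myDummy);
            if (decided.equals(myDummy)) {
                return records;                   // register was vacant: end of log
            }
            if (!decided.startsWith(DUMMY_PREFIX)) {
                records.add(decided);             // a real record was already there
            }
        }
    }
}
```

A real implementation would of course use proper record types instead of tagged strings.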
Of course, there is a lot of overhead in this approach, so please treat it just as a top-level overview of what Multi-Paxos is.
The log is a powerful concept, and we can use it as a recipe for building distributed state machines - just think of each record as an update command. Unfortunately, in some cases there is also a lot of overhead. For example, if you want to build a key/value storage and you care only about the current value, then you don't need history, and you probably need to implement garbage collection to remove past versions from the log to optimize storage costs.
Approach #2 - "Yes"
Mindset: rewritable register as a heavily optimized version of Multi-Paxos.
If you start with the approach described above, apply it to building a key/value storage, and then iterate in order to get rid of the overhead, e.g., by doing garbage collection on the fly, then eventually you may come up with the idea of updating the write-once register to be rewritable.
In that case, each instance uses the same ballot numbers just because all the instances are collapsed into one rewritable instance.
I described this approach in the How Paxos Works post and implemented it in the Gryadka project in about 500 lines of JavaScript. Also, the idea behind it was independently checked with TLA+ by Greg Rogers and Tobias Schottdorf.

How to check whether data is modified?

I have a web app which allows users to edit several configurations. I want to make users aware that they have made changes locally but have not committed them since their last load. When a user changes the configuration back, the "modified flag" should disappear.
What's a good approach to achieve this? Currently, I am thinking about keeping an original copy of the configurations and comparing the current configuration with the original one each time the user makes a change, but I am worried about the performance since the configuration data is fairly large. (The app client runs in browsers.)
Please help, thanks.
I don't think performance will be an issue, but if it were, I'd also create a "dirty field" map based on an associative array or a hashmap/object with field names as keys. Of course all field names must be unique. This map will initially be empty.
When you edit any field, catch the onChange or onBlur event and compare just this field with the saved one. If they differ, put it into the dirty map, like field_map['field1'] = true;. If they are equal, remove this key from the dirty map.
So, if your dirty map is not empty, you will know changes were made, and you will also know exactly what fields have changed.
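The answer is about browser-side JavaScript, but purely to illustrate the data structure, a hypothetical sketch of the same idea in Java looks like this (field names and types are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Tracks which fields differ from their originally loaded values; a browser
// version would keep the same two maps in plain JS objects.
public class DirtyFieldTracker {
    private final Map<String, Object> original = new HashMap<>();
    private final Map<String, Boolean> dirty = new HashMap<>();

    public DirtyFieldTracker(Map<String, Object> loadedConfig) {
        original.putAll(loadedConfig);
    }

    // Called from the equivalent of an onChange/onBlur handler: compare only
    // the edited field against its originally loaded value.
    public void onFieldChanged(String field, Object newValue) {
        if (Objects.equals(original.get(field), newValue)) {
            dirty.remove(field);       // edited back to the original value
        } else {
            dirty.put(field, true);
        }
    }

    public boolean isModified() {
        return !dirty.isEmpty();       // drives the "modified" flag in the UI
    }
}
```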

Choosing Redis datatypes for advanced data manipulation in a simple torrent tracker service

I need your advice on Redis datatypes for my project. The project is a torrent-tracker (ruby, simple sinatra-based) with pure in-memory data store for current information about peers. I feel like this is what Redis is made for. But I'm stuck at choosing proper data types for this. For now I tend to the following setup:
Use a list for seeders. What I actually need is more like a ring buffer, so I can get a sequential range of seeders (with a given size and start position) and save the new start position for the next time.
Use a sorted set for leechers. The score for each leecher is downloaded/(downloaded+left), so I can also extract a range for any specific case.
All values in the set and list are string (bencoded) representations of peer data.
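Purely for illustration (the project itself is Ruby/Sinatra), a hypothetical sketch of that setup using the Jedis Java client, with made-up key names and values:

```java
import redis.clients.jedis.Jedis;

public class TrackerStoreSketch {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String infoHash = "abcdef";             // hypothetical torrent id
            String peer = "bencoded-peer-data";     // bencoded peer representation

            // Seeders: a plain list, handed out range by range (ring-buffer style).
            redis.rpush("seeders:" + infoHash, peer);
            redis.lrange("seeders:" + infoHash, 0, 49);   // e.g. 50 seeders from offset 0

            // Leechers: a sorted set scored by download progress.
            long downloaded = 700, left = 300;
            double score = (double) downloaded / (downloaded + left);
            redis.zadd("leechers:" + infoHash, score, peer);
            redis.zrangeByScore("leechers:" + infoHash, 0.5, 1.0);  // peers at least 50% done
        }
    }
}
```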
What I actually lack in the setup above is:
The need to store an offset for seeders, which means data access needs synchronization.
No obvious way to find a specific seeder in the list. Here I might benefit from a set, but then I wouldn't be able to extract a range of items at once.
(General problem) A TTL for set/list members (in case a client shuts down without sending any data before then). A possible option is to make each peer an ordinary key/value (string or hash), give it a TTL, subscribe to its destruction, and delete it from the corresponding list or set.
What could you suggest? Any practical advice?

Is there a way to generate a short random id, avoiding collisions, without hitting persistent storage?

If you've used GoToMeeting, that's the type of ID I want. I'd like it to be random so that it obfuscates the number of items being tracked and short, so that it's easy to reference manually; UUIDs are way too long. I'd like to avoid hitting persistent storage merely for performance reasons, but I can't think of any other way to avoid collisions. Is 9 digits enough to do something time-based?
In response to questions:
I'm building a ticket-tracking application. This ID would be used as the primary key for a table, but it would be needed before the record is persisted, which would otherwise result in an extra database call that I'd like to avoid if possible.
I'd like to keep it at a 9 digit int. I consider a UUID to be too long because people are going to have to reference the ID manually (via email, phone, etc.).
I'm thinking of using the time of generation somehow. Since time is always ticking on forward, it would continually limit the set of potential IDs, excluding those that had already been generated.
One way is to take a unique number or string (like a random UUID), then calculate a fixed-length digest (such as MD5 or SHA-1) and/or encode it in a higher base (like base64) to shorten it further.
Git does something similar: it generates SHA-1 hashes for commits (and other objects), and users can reference these hashes manually to look up those commits. The trick is that the user doesn't have to enter the whole string to find the correct object; they only have to enter a prefix long enough that it doesn't collide with any other object currently in the repository. In general this requires only 5 or so hex digits for relatively large repositories.
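A minimal sketch of the hash-and-truncate idea (SHA-1 and the prefix length are arbitrary choices here; a shorter prefix raises the collision risk, so a uniqueness check at insert time is still advisable):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class ShortId {
    // Generate a short, hard-to-guess id by hashing a random UUID and keeping
    // only a prefix of the hex digest.
    public static String next(int hexDigits) {
        try {
            String uuid = UUID.randomUUID().toString();
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(uuid.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(0, hexDigits);   // e.g. 9 hex chars ~ 36 bits
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(next(9));   // e.g. "3fa1c90b2"
    }
}
```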
