I've been toying around with writing my own Javascript editor, with functionality similar to Google Docs (allowing multiple people to work on it at the same time). One thing I don't understand:
Let's say you've got User A and User B connected directly to each other with a network delay of 10ms. I'm assuming the editor uses a diff system (as I understand Docs does) where edits are represented like "insert 'text' at index 3," and that diffs are timestamped and forced to apply chronologically by all clients.
Let's start off with a document containing the text: "xyz123"
User A types "abc" at the beginning of the document at timestamp 001ms, while User B types "hello" between "xyz" and "123" at timestamp 005ms.
Both users would expect the result to be: "abcxyzhello123," however, taking into account network delay:
User B would receive User A's edits of "insert 'abc' at index 0" at time 011ms. In order to keep the chronological order, User B would undo User B's insertion at index 3, insert User A's "abc" at index 0, then re-insert User B's insertion at index 3, which is now between "abc" and "xyz," thus giving "abchelloxyz123"
User A would receive User B's edits of "insert 'hello' at index 3" at time 015ms. It would recognize that User B's insertion was done after User A's, and simply insert "hello" at index 3 (now between "abc" and "xyz"), giving "abchelloxyz123"
Of course, "abchelloxyz123" is not the same as "abcxyzhello123"
Other than literally assigning each and every character its own unique ID, I can't imagine how Google would manage to solve this problem effectively.
Some possibilities I've thought of:
Tracking insertion points instead of sending indexes with diffs would work, except you would have the exact same problem if User B moved his insertion point 1ms before editing.
You could have User B send some information with his diff, like "inserting after 'xyz'" so that User A could intelligently recognize this has happened, but what if User A inserts the text "xyz?"
User B could recognize that this has happened (when it receives User A's diff and sees that it's a conflict), then send out a diff undoing User B's edits and a new diff that inserts User B's "hello" "abc".length index further right. The problem with this is (1) User A would see a "jump" in the text and (2) if User A keeps editing then User B would have to continuously fix its diffs - even the "fixer" diffs would be off and need fixing, exponentially increasing the complexity.
User B could send along with its diff a property that the last timestamped diff it received was -005ms or something, then A could recognize that B didn't know about its changes (since A's diff was at 001ms) and do conflict resolution then. The issue is (1) all users timestamps will be slightly off, considering most computer clocks aren't accurate to the ms and (2) if there's a third User C with a 25ms lag with User A but a 2ms lag with User B, and User C adds some text between "x" and "y" at -003ms, then User B would use User C's edit as a reference point, but User A wouldn't know about User C's edit (and thus User B's reference point) until 22ms. I believe this could be solved if you used a common server to timestamp all edits, but that seems rather involved.
You could give each character a unique ID, then work off of those IDs instead of indexes, but that seems like overkill...
I'm reading through http://www.waveprotocol.org/whitepapers/operational-transform, but would love to hear any and all approaches to fixing this problem.
There are different possibilities for realizing concurrent changing of replicas, depending on the scenario's topology and with different trade-offs.
Using a central server
The most common scenario is a central server that all clients have to communicate with.
The server could keep track of how the document of each participant looks. Both A and B then send a diff with their changes to the server. The server would then apply the changes to the respective tracking documents. Then it would perform a three-way-merge and apply the changes to the master document. It would then send the diff between the master document and the tracking documents to the respective clients. This is called differential synchronization.
A different approach is called operation(al) transformation, which is similar to rebasing in traditional version control systems. It doesn't require a central server, but having one makes things much easier if you have more than 2 participants (see the OT FAQ). The gist is that you transform the changes in one edit so that the edit assumes that the changes of another edit already happened. E.g. A would transform B's edit insert(3, hello) against its edit insert(0, abc) with the result insert(6, hello).
The difference between rebasing and OT is that rebasing doesn't guarantee consistency if you apply edits in different orders (e.g. if B were to rebase A's edit against theirs the other way around, this can lead to diverging document states). The promise of OT on the other hand is to allow any order if you do the right transformations.
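To make the transformation concrete, here is a minimal sketch for the insert-vs-insert case only, using the "xyz123" example from the question. The Insert class, the site tie-breaker and the transform function are made up for illustration; a real OT implementation also has to handle deletes and longer histories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    index: int
    text: str
    site: str  # hypothetical tie-breaker for inserts at the same index


def apply(doc: str, op: Insert) -> str:
    """Apply an insert operation to a document string."""
    return doc[:op.index] + op.text + doc[op.index:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied."""
    shifts = against.index < op.index or (
        against.index == op.index and against.site < op.site
    )
    if shifts:
        return Insert(op.index + len(against.text), op.text, op.site)
    return op


base = "xyz123"
a = Insert(0, "abc", "A")    # User A's edit
b = Insert(3, "hello", "B")  # User B's edit, made without knowing about A's

# User A applies its own edit first, then B's edit transformed against it.
doc_a = apply(apply(base, a), transform(b, a))
# User B applies its own edit first, then A's edit transformed against it.
doc_b = apply(apply(base, b), transform(a, b))

print(doc_a, doc_b)
```

Both replicas end up with the same text even though they applied the two edits in opposite orders, which is the property plain index-based diffs were missing in the question.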
No central server
OT algorithms exist that can deal with peer-to-peer scenarios (with the trade-off of increased implementation complexity on the control layer and increased memory usage). Instead of a simple timestamp, a Version vector can be used to keep track of the state an edit is based on. Then (depending on the capability of your OT algorithm, specifically transform property 2), incoming edits can be transformed to fit in the order they are received, or the version vector can be used to impose a partial order on the edits -- in this case history needs to be "rewritten", by undoing and transforming edits, so that they adhere to the order imposed by the version vectors.
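As a rough illustration of what a version vector buys you over a timestamp, the sketch below compares two vectors and reports whether one edit happened before the other or whether they are concurrent. The dict representation (site name to counter) and the compare function are assumptions for this example, not part of any particular OT library.

```python
def compare(v1: dict, v2: dict) -> str:
    """Compare two version vectors given as {site: counter} dicts."""
    sites = set(v1) | set(v2)
    less_or_equal = all(v1.get(s, 0) <= v2.get(s, 0) for s in sites)
    greater_or_equal = all(v1.get(s, 0) >= v2.get(s, 0) for s in sites)
    if less_or_equal and greater_or_equal:
        return "equal"
    if less_or_equal:
        return "before"
    if greater_or_equal:
        return "after"
    return "concurrent"  # neither side has seen the other's edit


edit_a = {"A": 1}            # A's first edit, B's edits not yet seen
edit_b = {"B": 1}            # B's first edit, A's edits not yet seen
merged = {"A": 1, "B": 1}    # a state that includes both edits

print(compare(edit_a, edit_b))   # the two edits need to be transformed
print(compare(edit_a, merged))   # plain causal order, no transformation needed
```

Unlike wall-clock timestamps, this does not depend on the clocks of the participants agreeing with each other.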
Lastly, there is a group of algorithms based on CRDTs, such as WOOT, Treedoc or Logoot, which try to solve the problem with specially designed data types that allow operations to commute, so the order in which they are applied doesn't matter (this is similar to your idea of an ID for each character). The trade-offs here are memory consumption and overhead in operation construction.
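To show the "ID per character" idea in action, here is a deliberately simplified sketch. It is not WOOT, Treedoc or Logoot themselves: it uses fractions as position identifiers, ignores deletions, and does not try to keep two concurrent inserts at the same spot from interleaving. The SequenceCRDT class and its method names are made up for the example.

```python
from fractions import Fraction


class SequenceCRDT:
    """Toy sequence where every character has a unique, sortable (position, site) key."""

    def __init__(self, text: str, site: str = "init"):
        self.chars = {(Fraction(i + 1), site): ch for i, ch in enumerate(text)}

    def text(self) -> str:
        return "".join(self.chars[key] for key in sorted(self.chars))

    def make_insert(self, index: int, text: str, site: str):
        """Build (key, char) pairs that place `text` at visible index `index`."""
        keys = sorted(self.chars)
        left = keys[index - 1][0] if index > 0 else Fraction(0)
        right = keys[index][0] if index < len(keys) else left + len(text) + 1
        step = (right - left) / (len(text) + 1)
        return [((left + step * (i + 1), site), ch) for i, ch in enumerate(text)]

    def apply(self, ops) -> None:
        # Adding keys to a dict commutes, so the order of application is irrelevant.
        for key, ch in ops:
            self.chars[key] = ch


replica_a = SequenceCRDT("xyz123")
replica_b = SequenceCRDT("xyz123")

ops_a = replica_a.make_insert(0, "abc", "A")
ops_b = replica_b.make_insert(3, "hello", "B")

# Each replica applies its own edit first, then the remote one.
replica_a.apply(ops_a)
replica_a.apply(ops_b)
replica_b.apply(ops_b)
replica_b.apply(ops_a)

print(replica_a.text(), replica_b.text())
```

The memory trade-off mentioned above is visible here: every character carries its own key, and the keys only get longer as more text is inserted between existing characters.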
I am wondering how you make sure you are not adding the same person twice in your EventStore?
Let's say that in your application you add person data, but you want to make sure that the same person name and birthday are not added twice in different streams.
Do you ask your read models, or do you do it within your event store?
The generalized form of the problem that you are trying to solve is set validation.
Step #1 is to push back really hard on the requirement to ensure that the data is always unique - if it doesn't have to be unique always, then you can use a detect and correct approach. See Memories, Guesses, and Apologies by Pat Helland. Roughly translated, you do the best you can with the information you have, and back up if it turns out you have to revert an error.
If a uniqueness violation would expose you to unacceptable risk (for instance, getting sued to bankruptcy because the duplication violated government mandated privacy requirements), then you have to do the work.
To validate set uniqueness you need to lock the entire set; this lock could be pessimistic or optimistic in implementation. That's relatively straightforward when the entire set is stored in one place (which is to say, under a single lock), but something of a nightmare when the set is distributed (aka multiple databases).
If your set is an aggregate (meaning that the members of the set are being treated as a single whole for purposes of update), then the mechanics of DDD are straightforward. Load the set into memory from the "repository", make changes to the set, persist the changes.
This design is fine with event sourcing where each aggregate has a single stream -- you guard against races by locking "the" stream.
Most people don't want this design, because the members of the set are big, and for most data you need only a tiny slice of that data, so loading/storing the entire set in working memory is wasteful.
So what they do instead is move the responsibility for maintaining the uniqueness property from the domain model to the storage. RDBMS solutions are really good at sets. You define the constraint that maintains the property, and the database ensures that no writes which violate the constraint are permitted.
If your event store is a relational database, you can do the same thing -- the event stream and the table maintaining your set invariant are updated together within the same transaction.
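A minimal sketch of that idea, using SQLite from the Python standard library as a stand-in for whatever RDBMS holds your events. The table names, column names and the register_person function are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        stream_id TEXT NOT NULL,
        type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    -- Hypothetical table whose only job is to hold the set invariant.
    CREATE TABLE person_identity (
        name TEXT NOT NULL,
        birthday TEXT NOT NULL,
        stream_id TEXT NOT NULL,
        UNIQUE (name, birthday)
    );
""")


def register_person(stream_id: str, name: str, birthday: str) -> bool:
    """Append the event and claim (name, birthday) in one transaction."""
    payload = json.dumps({"name": name, "birthday": birthday})
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO events (stream_id, type, payload) VALUES (?, ?, ?)",
                (stream_id, "PersonRegistered", payload),
            )
            conn.execute(
                "INSERT INTO person_identity (name, birthday, stream_id) VALUES (?, ?, ?)",
                (name, birthday, stream_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: the event insert was rolled back as well


first = register_person("person-1", "Bob", "1980-02-03")
second = register_person("person-2", "Bob", "1980-02-03")
event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(first, second, event_count)
```

The point of doing both inserts in one transaction is that a rejected duplicate leaves no trace in the event table either.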
If your event store isn't a relational database? Well, again, you have to look at money -- if the risk is high enough, then you have to discard plumbing that doesn't let you solve the problem with plumbing that does.
In some cases, there is another approach: encoding the information that needs to be unique into the stream identifier. The stream comes to represent "All users named Bob", and then your domain model can make sure that the Bob stream contains at most one active user at a time.
Then you start needing to think about whether the name Bob is stable, and which trade-offs you are willing to make when an unstable name changes.
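Here is a rough sketch of the stream-identifier approach with a toy in-memory store. Everything in it is hypothetical: the InMemoryEventStore class, the expected-version check (standing in for the optimistic concurrency most event stores offer), and the way the stream ID is derived from name and birthday.

```python
import hashlib


class ConcurrencyError(Exception):
    pass


class InMemoryEventStore:
    """Toy store: appends succeed only if the stream is at the expected version."""

    def __init__(self):
        self.streams = {}

    def read(self, stream_id):
        return list(self.streams.get(stream_id, []))

    def append(self, stream_id, event, expected_version):
        stream = self.streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(f"{stream_id} is at version {len(stream)}")
        stream.append(event)


def identity_stream_id(name: str, birthday: str) -> str:
    # The normalization rule is an assumption; choosing it is the hard part.
    key = f"{name.strip().lower()}|{birthday}"
    return "person-identity-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def register_person(store, name: str, birthday: str) -> bool:
    stream_id = identity_stream_id(name, birthday)
    history = store.read(stream_id)
    active = False
    for event in history:
        if event["type"] == "PersonRegistered":
            active = True
        elif event["type"] == "PersonRemoved":
            active = False
    if active:
        return False  # the stream already holds an active person
    store.append(
        stream_id,
        {"type": "PersonRegistered", "name": name, "birthday": birthday},
        expected_version=len(history),
    )
    return True


store = InMemoryEventStore()
print(register_person(store, "Bob", "1980-02-03"))
print(register_person(store, " BOB ", "1980-02-03"))
```

Two writers racing to register the same person would both read version 0, and the second append would fail the version check instead of creating a duplicate.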
Names of people is a particularly miserable problem, because none of the things we believe about names are true. So you get all of the usual problems with uniqueness, dialed up to eleven.
If you are going to validate this kind of thing then it should be done in the aggregate itself IMO, and you'd have to use read models for that like you say. But you end up with infrastructure code/dependencies being sent into your aggregates/passed into your methods.
In this case I'd suggest creating a read model of Person.Id, Person.Name, Person.Birthday and then instead of creating a Person directly, create some service which uses the read model table to look up whether or not a row exists and either give you that aggregate back or create a new one and give that back. Then you won't need to validate at all, so long as all Person-creation is done via this service.
I'm trying to understand how Event Sourcing changes the data architecture of a service. I've been doing a lot of research, but I can't seem to understand how data is supposed to be properly stored with event sourcing.
Let's say I have a service that keeps track of vehicles transporting packages. The current non relational structure for the data model is that each document represents a vehicle, and has many fields representing origin location, destination location, types of packages, amount of packages, status of the vehicle, etc. Normally this gets queried for information to be read to the front end. When changes are made by the user, the appropriate changes are made to this document in order to update this.
With event sourcing, it seems that a snapshot of every event is stored, but there seem to be a few ways to interpret that:
The first is that the multiple versions of the document I described exist, each a new snapshot every time a change is made. Each event would create a new version of this document and alter it. This is the easiest way for me to wrap my head around it, but I believe this to be incorrect.
Another interpretation I have is that each event stores SPECIFIC information about what's been altered in the document. When the vehicle status changes from On Road to Available, for example, an event specifically for vehicle status changes is triggered. Let's say it's called VehicleStatusUpdatedEvent, and contains the Vehicle ID number, the new status, and the timestamp for this event. So this event is stored and is published to a messaging queue. When picked up from the queue, the appropriate changes are made to the current version of the document. I can understand this, but I think I still have some misconceptions here. My understanding is that event sourcing allows us to have a snapshot of data upon each change, so we can know what it looks like at any point. What I just described would keep a log of changes, but still only have one version of the file, as the events only contain specific pieces of the whole file.
Can someone describe how the data flow and architecture works with event sourcing? Using the vehicle data example I provided might help me frame it better. I feel that I am close to understanding this, but I am missing some fundamental pieces that I can't seem to understand by searching online.
The current non relational structure for the data model is that each document represents a vehicle
OK, let's start from there.
In the data model you've described, storage of a document destroys the earlier copy.
Now imagine that instead we were storing the document in a git repository. Then saving the document would also save metadata, and that metadata would include a pointer to the previous document.
Of course, we've probably got a lot of duplication in that case. So instead of storing the complete document every time, we'll store a patch document (think JSON Patch), and metadata pointing to the original patch.
Take that same idea again, but instead of storing generic patch documents, we use domain specific messages that describe what is going on in terms of the model.
That's what the data model of an event sourced entity looks like: a list of domain specific descriptions of document transformations.
When you need to reconstitute the current state, you start with a state you know (which could be the "null" state of the document before anything happened to it), and replay onto that document all of the patches (events) that have occurred since.
If you want to do a temporal query, the game is the same, you replay the events up to the point in time that you are interested in.
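A small sketch of replay and of a temporal query, using the vehicle example from the question. The event names, fields and the apply_event/replay functions are assumptions made up for illustration.

```python
from datetime import datetime


def apply_event(vehicle, event):
    """Return the vehicle document after one domain event has been applied."""
    kind, data = event["type"], event["data"]
    if kind == "VehicleRegistered":
        return {"vehicle_id": data["vehicle_id"], "status": "Available", "packages": 0}
    if kind == "PackagesLoaded":
        return {**vehicle, "packages": vehicle["packages"] + data["count"]}
    if kind == "VehicleStatusUpdated":
        return {**vehicle, "status": data["status"]}
    return vehicle  # unknown events leave the document unchanged


def replay(events, as_of=None):
    """Rebuild the document from the "null" state; stop at `as_of` if given."""
    state = None
    for event in events:  # events are assumed to be in write order
        if as_of is not None and event["at"] > as_of:
            break
        state = apply_event(state, event)
    return state


history = [
    {"at": datetime(2024, 1, 1, 8, 0), "type": "VehicleRegistered",
     "data": {"vehicle_id": "V-1"}},
    {"at": datetime(2024, 1, 1, 9, 0), "type": "PackagesLoaded",
     "data": {"count": 12}},
    {"at": datetime(2024, 1, 1, 9, 30), "type": "VehicleStatusUpdated",
     "data": {"status": "On Road"}},
    {"at": datetime(2024, 1, 1, 17, 0), "type": "VehicleStatusUpdated",
     "data": {"status": "Available"}},
]

current = replay(history)
at_noon = replay(history, as_of=datetime(2024, 1, 1, 12, 0))
print(current)
print(at_noon)
```

Only the list of events is stored; both the current document and the noon document are computed on demand from it.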
So essentially when referring to an older build, you reconstruct the document using the events, correct?
Yes, that's exactly right.
So is there still a "current status" document or is that considered bad practice?
"It depends". In the general case, there is no current status document; only the write-ordered list of events is "real", and everything else is derived from that.
Conversations about event sourcing often lead to consideration of dedicated message stores for managing persistence of those ordered lists, and it is common that the message stores do not also support document storage. So trying to keep a "current version" around would require commits to two different stores.
At this point, designers typically either decide that "recent version" is good enough, in which case they build eventually consistent representations of documents outside of the transaction boundary... OR they decide current version is important, and look into storage solutions that support storing the current version in the same transaction as the events (ex: using an RDBMS).
what is the procedure used to generate the snapshot you want using the events?
IF you want to generate a snapshot, then you'll normally end up using a pattern called a projection, to iterate over the events and either fold or reduce them to create the document.
Roughly, you have a function somewhere that looks like
document-with-meta-data = projection(event-history-with-metadata)
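One way to read that line as code, sketched with functools.reduce. The event shape (a version number plus a dict of changed fields) and the metadata that gets carried along are assumptions for the example.

```python
from functools import reduce


def step(acc, event):
    """Fold one event into the (document, metadata) pair."""
    document, meta = acc
    document = {**document, **event["changes"]}
    meta = {"version": event["version"], "event_count": meta["event_count"] + 1}
    return document, meta


def projection(history):
    empty = ({}, {"version": 0, "event_count": 0})
    return reduce(step, history, empty)


history = [
    {"version": 1, "type": "VehicleRegistered",
     "changes": {"vehicle_id": "V-1", "status": "Available"}},
    {"version": 2, "type": "VehicleStatusUpdated",
     "changes": {"status": "On Road"}},
]

document, meta = projection(history)
print(document, meta)
```

Keeping the version in the metadata lets a reader tell how stale a cached document is compared to the event list.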
I am looking for an alternative to GUIDs for key generation in a distributed app. For example, suppose I have Bob, James, and Jack all running a bug tracking application on their desktops, where they can do things like create bug tickets a la JIRA or Bugzilla. When a ticket is created it is assigned a number such as T-1, T-2, T-3, T-4, etc. Tickets need to have a stable ID and should be creatable without having to consult a central server.
I understand that this is what GUIDs are really good for, but in my case displaying a GUID in a UI is ugly: people can't just copy and paste it or discuss it on a phone call. I really want integers or some sort of short string that is easy to talk about and read in one glance.
Is there a way to use the bitcoin block chain as some sort of counter?
You may evaluate the approach taken by git. It uses the SHA-1 hash of the commit information as the ID, and it accepts abbreviated IDs, which are much shorter and easier to read or pass along manually.
As long as the number of bugs in your tracker is not going to reach the millions, that should be sufficient. Once it does, you'll just need a longer abbreviation.
There seems to be plenty of information around on how git calculates hash IDs and abbreviates them.
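For a feel of how this could look in a bug tracker, here is a minimal sketch. It does not reproduce git's actual object format; the ticket_id and abbreviate functions and the fields that get hashed are made up for the example.

```python
import hashlib


def ticket_id(author: str, created_at: str, title: str) -> str:
    """Hash the ticket's content into a 40-character hex ID (SHA-1, as git does)."""
    content = f"{author}\n{created_at}\n{title}".encode("utf-8")
    return hashlib.sha1(content).hexdigest()


def abbreviate(full_id: str, known_ids, min_length: int = 4) -> str:
    """Return the shortest prefix of full_id that no other known ID shares."""
    others = [other for other in known_ids if other != full_id]
    for length in range(min_length, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return full_id


ids = [
    ticket_id("Bob", "2024-01-01T10:00:00", "Crash on startup"),
    ticket_id("James", "2024-01-01T10:00:05", "Typo in settings dialog"),
    ticket_id("Jack", "2024-01-02T09:30:00", "Export button does nothing"),
]
for full in ids:
    print(abbreviate(full, ids), full)
```

The short form is only unique relative to the tickets a given node already knows about, so a prefix that is fine today may need one more character later.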
If I recall correctly how UUIDv1 works, it's "just" putting together the MAC address and a very exact timestamp, plus maybe some additional integer. As your MAC address should be unique (unless you've fiddled with it) and there are only so many UUIDs one computer can generate within a nanosecond, the resulting ID will be unique.
This is a very general and uninformed way to create IDs. If you'd implement a version of it yourself for your specific use case you could get much smaller IDs.
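If you want to look at those ingredients yourself, Python's standard uuid module exposes them. The node value and clock sequence below are arbitrary numbers picked for the example; normally the library fills them in for you.

```python
import uuid

# node stands in for the MAC address; clock_seq is the "additional integer".
u = uuid.uuid1(node=0x001122334455, clock_seq=7)

print(u)              # the usual 36-character form
print(u.version)      # UUID version
print(hex(u.node))    # the node (MAC address) part
print(u.time)         # the timestamp part
print(u.clock_seq)    # the clock sequence part
```

Seeing the parts side by side makes it clearer which of them you could shrink for a small, known set of nodes.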
Assuming you can identify each node with a bug tracking system with a simple and unique string - for instance "Bob", "James", "Jack" - and you can create unique continuous integers within each node, you could combine those two and have IDs like "Bob-1", "James-12", ...
As you can see, there still has to be one central point that assigns the unique strings; however, depending on the number of nodes and how long they stay within the system, this could just as well be done by a human being.
The additional disadvantage (or advantage, depends how you look at it) of this approach (as well as of UUIDv1) would be, that you'd know where the ticket has been created as well as order of the tickets within one system.
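A minimal sketch of the node-name-plus-counter scheme. The TicketIdGenerator class is made up for the example, and it assumes that each node persists its counter somewhere between runs, which is not shown here.

```python
import itertools


class TicketIdGenerator:
    """Creates IDs like "Bob-1", "Bob-2" without talking to any other node."""

    def __init__(self, node_name: str, start: int = 1):
        self.node_name = node_name            # must be unique across all nodes
        self._counter = itertools.count(start)

    def next_id(self) -> str:
        return f"{self.node_name}-{next(self._counter)}"


bob = TicketIdGenerator("Bob")
james = TicketIdGenerator("James", start=12)  # e.g. resuming after 11 earlier tickets

print(bob.next_id(), bob.next_id(), james.next_id())
```

IDs from different nodes can never collide as long as the node names differ, and they stay short enough to read out over the phone.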
I have a vendor defined database (about 140GB total) on Caché 2007. It uses the old style MUMPS programming environment and accesses globals directly in a hierarchical style. There is one global that accounts for about 75% of the total database size. The first subscript in this table is an artificial integer account number. The next 2-3 subscripts are constant subrecord identifiers that break up blocks of fields and denote repeating sub record kinds.
One of these repeating subrecords (record type 30) is for notes on an account. Because of the way the system is used, this dimension accounts for a very large portion of the global's total space; I'd estimate it to be at least 50%. Because of the way Caché stores data physically in the database, a scan of this global ends up loading all or most of these notes as a side effect even though they aren't relevant to most operations. It has the effect of greatly increasing the cost of IO operations on the global, especially when you only want one tiny detail from a bunch of accounts.
Example subscript references for this global:
^ACCT(3461,30,1)="NOTE1 blah blah"
^ACCT(3461,30,2)="NOTE2 blah blah"
^ACCT(3461,30,100)="NOTE100 blah blah"
I can't change the design of the database. It's controlled by an outside vendor and there is a large number of MUMPS-style hardcoded references in the database. I'm thinking that a big reason batch operations are so slow on the system is the high cost of these mostly irrelevant notes coming along for the IO ride whenever account data is accessed. Scanning this whole global (i.e. when there is no useful application maintained index) takes at least 8 hours.
One thought I had is to shift the note data from being stored alongside other details in the global to a separate database file by using the global mapping facility described in the Guide to Using Caché Globals and Guide to System Administration. If I could map all the subscript 30s to a separate database file in the same Caché database, most data operations (the ones that don't even care about notes) wouldn't be bringing those into memory along with the details they do care about.
In the global structure guide (1st link), this looks plausible as they show a particular 2nd subscript mapping separately than the 1st subscript. What they don't show in any of the examples is what the syntax is to make that happen. In the "Add a new global mapping" screen in the Caché Management Portal, I should be able to do something like
Global name: ACCT
Subscripts to be mapped: (BEGIN:END)(30)
But whatever variations I try in the syntax, I always get ERROR #657: Invalid subscript in reference 1 subscript #1.
Unfortunately, while it's possible to map 2nd level subscripts of a particular node, it's not possible to map 2nd level subscripts of all nodes.
There is an experienced Performance team on WRC, did you try to contact them?
We've got a surprisingly complex workflow that needs to be monitored by quasi-technical employees with an in-house webapp. There are about 30 steps, some of which are manual (editing), some are semi-automated stop points (like "the files have been received" or customer approval of certain templates), and some are completely automated (file conversion, search indexing, etc). The flowchart for all of these steps is large and complicated, and three people might be working on three completely different steps at any one time.
How would you present this vast amount of information as usefully as possible to your users? Just showing the whole diagram seems like the brute force solution. But it's big, and it'll likely get bigger as we do more things. Not to mention the complexity necessary to encode this entire diagram in HTML.
I assume you don't want to show these just for entertainment or mockery, but to help the users along the way, automate as much as possible, document the process, etc. It would probably help if you clearly define the goals or purpose of your app.
I don't see a point in showing the entire workflow, except for "debugging the business rules" or maybe the clients want to see it.
If your goal is to help users do their job, I would present the state the "project" (or whatever term fits better) is in, and the possible transitions to other states.
The State might be multiple mostly independent variables, e.g. one might describe the progress of content - e.g. "incomplete" / "complete" / "reviewed by 2nd staffer" / "signed off by 2nd staffer", others might contain a schedule that is developed in parallel, e.g. "test print date = not scheduled", "print date = not scheduled", "final delivery = tomorrow, preferably yesterday".
A transition might be "Sent to customer for review", "mark as content-complete", "content modified", etc.
Is this what you have in mind?
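In code, the state-plus-transitions view could be as small as a lookup table. The sketch below covers only the content-progress variable; the table itself is invented for illustration, so the real states and transitions would come from your workflow.

```python
# Hypothetical transition table: state -> {transition name: next state}
TRANSITIONS = {
    "incomplete": {"mark as content-complete": "complete"},
    "complete": {
        "content modified": "incomplete",
        "review by 2nd staffer": "reviewed by 2nd staffer",
    },
    "reviewed by 2nd staffer": {
        "content modified": "incomplete",
        "sign off": "signed off by 2nd staffer",
    },
    "signed off by 2nd staffer": {
        "content modified": "incomplete",
        "send to customer for review": "sent to customer for review",
    },
    "sent to customer for review": {},
}


def possible_transitions(state: str):
    """What the UI would offer the user for a project in this state."""
    return sorted(TRANSITIONS[state])


def fire(state: str, transition: str) -> str:
    """Return the next state, or refuse a transition that is not allowed."""
    try:
        return TRANSITIONS[state][transition]
    except KeyError:
        raise ValueError(f"'{transition}' is not allowed from '{state}'") from None


state = "incomplete"
state = fire(state, "mark as content-complete")
print(state, possible_transitions(state))
```

The user only ever sees the current state and the handful of transitions that leave it, never the whole diagram.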
I propose to divide your workflow in modules and represent the active state for each module.
A module is a subset of your main workflow. For example it could be divided by tasks, person, roles, department, etc. This will greatly simplify the representation of the workflow. Let's say someone is responsible for data entry at many critical moments. We can group all his tasks in one module (or sub-workflow) containing the same activities, inputs, outputs and conditions. Modules can be interdependent and related.
A state is where we are located in a module. In simple workflows there is only one active task. In real life we are multi-threaded! So maybe in one module many states could be active at the same time. The state also includes active inputs, outputs and memory bits.
An input is something required to perform an activity or to evaluate a boolean condition. It could be a document, a piece of data, a signal...
An output is something resulting from a task: a piece of information, a document, a signal...
Enough definitions?
Then simply convert your workflow into a LADDER LOGIC and you have your states!
See Ladder Logic definition on Wikipedia
You display only active states:
Active task(s) for the module
Inputs required / inputs confirmed
Output required / output realized
Conditions to continue
Seems abstract?
Here is a small example...
Janet enters data in the system. She manages the green tasks of the diagram. We focus only on her work, not other tasks. She knows how to do 16 tasks in the workflow. We are waiting for the following actions from her to continue, and her Intranet dashboard says:
Priority 1: You must send a PO to order enough pencils for the next month based on the sales report.
Task: Send a purchase order
Inputs: Forecast report from the marketing department
Outputs: PO, vendor, item, quantity
Condition for completion: PO sent and order confirmation received from supplier
Priority 2: You must enter into the financial system the number of erasers rejected by production
Task: Data entry
Inputs: Reject count from production
Outputs: Number of rejects
Condition for completion: data entered and confirmed
We do a lot of troubleshooting on automated production systems having hundreds of thousands of ladder steps (the workflow is too complex to be represented as a whole). When the system is blocked we look at each module and determine which inputs are missing for a task to complete.
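A rough sketch of how such a per-module dashboard could be computed. The Task class and the dashboard/blocked functions are hypothetical, and the data is taken from Janet's example above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    module: str                      # here: the person the task belongs to
    priority: int
    required_inputs: set
    confirmed_inputs: set = field(default_factory=set)
    done: bool = False

    def missing_inputs(self):
        return self.required_inputs - self.confirmed_inputs


def dashboard(tasks, module):
    """Active tasks of one module, most urgent first."""
    active = [t for t in tasks if t.module == module and not t.done]
    return sorted(active, key=lambda t: t.priority)


def blocked(tasks):
    """Tasks that cannot complete yet, with the inputs they are waiting for."""
    return [(t.name, sorted(t.missing_inputs()))
            for t in tasks if not t.done and t.missing_inputs()]


tasks = [
    Task("Data entry", "Janet", 2, {"Reject count from production"}),
    Task("Send a purchase order", "Janet", 1,
         {"Forecast report from the marketing department"},
         {"Forecast report from the marketing department"}),
    Task("File conversion", "Automation", 1, {"Uploaded files"}, done=True),
]

for task in dashboard(tasks, "Janet"):
    print(task.priority, task.name, sorted(task.missing_inputs()))
print(blocked(tasks))
```

The blocked list is the troubleshooting view: it names exactly which input each stuck task is still waiting for.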
Good luck!
This sounds like the sort of application for which BPEL is suited.
Of course you don't want to re-architect your system right now. But there are a number of BPEL implementations out there, some of which include graphical editing tools. One of these might help you in your current situation, because they are good at handling scope and hiding detail. So I think you might derive benefit from drawing your workflow as a BPEL diagram even if you don't do anything else with the language.
The Wikipedia page lists several of the available implementations. In addition, Oracle's JDeveloper IDE includes a BPEL Diagrammer as part of its SOA suite; unfortunately it is no longer part of the standard install but it is still available.
Try doing it in layers. You have the most detailed layer done, now add additional docs with the details hidden, grouped into higher-level business processes. Users should be able to safely ignore some of those details, but it's good for them to have visibility of how their part fits in to the whole.
You may need more than one higher-level document.
You can use Prezi to present this information to users in a lucid manner.
Split and present the work flow into phases such that the end user is easily able to identify the phase he is currently in.
Display as many phases as there are inputs. The workflow starts with 6 different inputs, so display six different buttons on screen, enabling the user to select the input that he wants.
On selecting the button zoom into the workflow depicting the next steps. This would also help the user to verify the actions that he has done so far to reach the current states.
But this way of presenting could become cumbersome for the users as the number of steps that they have completed goes up. Say the user has almost reached the end of the workflow. To check for the next step he would have to go through all the steps, which might frustrate the user.
To avoid this you can split the complete work flow chronologically into 3-5 phases. The phases should be split logically. The ultimate aim would be not to overwhelm the users with the full work flow. Personally I would try to avoid the task involving this workflow if presented the way you have shown. No offense. I bet you also feel the same.
Could give you a better picture if you could re-post the image after replacing the state names with numbers.
I'd recommend having the whole flow documented somewhere, but in terms of what is distributed to users, how about focusing on task-oriented flows? No one user will be responsible for the entire process I would imagine.
For example, let's say I have 2 roles, A and B, and 6 tasks, 1 through 6, executed in order. Each task may have multiple steps but is self-contained (e.g. download the file, review, run process, review again, upload). A does the even tasks and B does the odd tasks.
A would need to know about those detailed steps that comprise tasks 2, 4, and 6 but not about what goes on in 1, 3, and 5. So hand A a detailed set of flows for the tasks he is responsible for, along with a diagram that treats each task as a black box.
If the flow can't be made modular in this way, you may want to review the process itself to see why it's so complex.
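The A/B example above could be generated rather than drawn by hand. This is only a sketch: the task numbering, the shared step list and the view_for function are assumptions based on the example.

```python
STEPS = ["download the file", "review", "run process", "review again", "upload"]

# A does the even tasks, B does the odd tasks.
TASK_OWNERS = {n: ("A" if n % 2 == 0 else "B") for n in range(1, 7)}


def view_for(role: str):
    """Detailed steps for the role's own tasks, a black box for everyone else's."""
    lines = []
    for task, owner in sorted(TASK_OWNERS.items()):
        if owner == role:
            lines.append(f"Task {task}: " + " -> ".join(STEPS))
        else:
            lines.append(f"Task {task}: (handled by {owner})")
    return lines


for line in view_for("A"):
    print(line)
```

Each role gets the full sequence of six tasks for orientation, but only its own tasks are expanded into steps.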
How about showing an example of a workflow scenario, that is, showing the transitions in one possible passing through the workflow? You could cater this to a specific user profile and highlight the pertinent states, dimming the others. This allows them to get a clear idea of the transitions by seeing a real-life example.