Text difference patch - algorithm

Am trying to write a piece of code which will allows the user to type text into a textbox which then gets saved on the server. When the user types some more text in the textbox, I want only the difference to be sent to the server.
Is there a difference algorithm for JS which I can use to send only information about the difference. So it should be able to tell the difference between two text boxes essentially.
It could also be language agnostic and I can port it.
Thank you for your time.
UPDATE
In simple words. I have a text area which keeps saving the text in the box every X seconds. Now to save bandwidth I only want it to send the difference from the last saved revision (which I can say put in a variable. Initially this will be empty). Now the JS has to check the difference between the last revision and the current state of the textbox and generate a change list to send to the server.
UPDATE 2
Something like www.etherpad.com

Google DiffMatchPatch has a Javascript implementation, I've used it with much success.
http://code.google.com/p/google-diff-match-patch/

The Python difflib module does this and more. It's very flexible but might be challenging to port to Javascript.
Regarding your update, I'm first wondering why you need to worry about bandwidth. Unless your users are typing a lot of text into an edit box (which has its own usability issues) then there just aren't that many bytes to send. Send the whole text box each time you autosave. Users can't type fast enough to really notice the use of bandwidth.
Or, you could meet halfway. Every time you autosave, check to see whether the user has only added new text to the end compared to the last time. If so, send an "append" type update with just the new text. If the user has gone back and edited anything else, then send a "replace" type update where you send the whole text. This takes care of the common append-only case without severely complicating your implementation.

Instead of calculating a diff between 2 texts, which is difficult,
you could always, while people are editting, record the keystrokes and the caret position in the textbox. If you send this over every now and then (and clean the buffer), the server can playback the exact same sequence.

This code-smells of premature optimization. Perhaps you should implement your solution first and then see about optimizing your transfer rates using diffs. How much text are you looking at? Because the request and response packets are going to be more or less the same size with only a few bytes difference for your message, so the savings could be very minimal.
At the very least, complete your solution without optimization and profile your network traffic using tools like Firebug and then test to see how much worse the performance is with what you would consider to be the maximum text block that could be sent.
Finally, you could always use the TypeWatch JQuery plugin to listen for change events in the textbox. You can set a delay so that once the user finishes typing and the delay elapses, the callback function is triggered. This means that the text will only be sent when the user types something, and only when they are finished typing. This will be significantly more efficient than repeatedly polling the server.

Depends on how far you are ready to go.
You would like to check deltav algorithm, it is used by svn in particular: http://svn.apache.org/repos/asf/subversion/trunk/notes/svndiff

Related

Delays in updating content controls when more context.sync() are used on WORD for Mac

We update content control for every character typed in the task pane’s input field. So that user can see the live updates on the word document.
Recently we added functionality for locking content controls. And it happens as below:
User input (types a character) in a input field
We search a content control for that input field (involves context.sync)
Unlock the content control (involves context.sync)
Update value in content control (involves context.sync)
Lock back the content control (involves context.sync)
All this works nice in Word for windows without problems.
But is extremely (visibly) slow with Word for Mac (apple machines)
How should I overcome the delays happening on Mac?
As Juan mentioned in the comment, there are some important details that the team would need to investigate. Sample code would be good too.
That being said, just looking at what you describe, I think you can dramatically cut down on the context.sync() statements. Unlocking the content control, updating its value, and locking it should all be possible to do in one sync.
I have a bunch of details about minimizing sync-s in my book, "Building Office Add-ins using Office.js. Quoting one of the sections from it:
As an add-in author, your job is to minimize the number of context.sync()
calls. Each sync is an extra round-trip to the host application; and when
that application is Office Online, the cost of each of those round-trip adds up
quickly.
If you set out to write your add-in with this in principle in mind, you will
find that you need a surprisingly small number of sync calls. In fact, when
writing this chapter, I found that I really needed to rack my brain to come up with a scenario that did need more than two sync calls. The trick for
minimizing sync calls is to arrange the application logic in such a way that
you're initially scraping the document for whatever information you need
(and queuing it all up for loading), and then following up with a bunch
of operations that modify the document (based on the previously-loaded
data). You've seen several examples of this already: one in the intro chapter,
when describing why Office.js is async; and more recently in the "canonical
sample" section at the beginning of this chapter. For the latter, note that the
scenario itself was reasonably complex: reading document data, processing
it to determine which city has experienced the highest growth, and then
creating a formatted table and chart out of that data. However, given the
"time-travel" superpowers of proxy objects, you can still accomplish this task
as one group of read operations, followed by a group of write operations.
Still, there are some scenarios where multiple loads may be required. And in
fact, there may be legitimate scenarios where even doing an extra sync is the
right thing to do – if it saves on loading a bunch of unneeded data. You will
see an example of this later in the chapter.

Performance: Need to read from LONGTEXT

I'm building a CMS-type webapp that allows users to enter arbitrary-sized blocks of HTML. These blocks are entered by the user in their admin area and inserted into their template of choice when a page is delivered.
I'm guessing a user is not going to add more than 50-100 blocks and I'm not going to be getting more than 1000 users any time soon.
I was planning on using mySQL's LONGTEXT type to store these but I'm wondering if storing files in a directory will be more performant as the Linux OS will cache them? Given that I'm building for at most (1000 * 100) text blocks is there any reasonable performance worry with using mySQL?
Obviously I will be caching the HTML before delivery so I won't be reading these blocks on every delivery - reads will only occur when someone updates/creates new content.
I could use memcached/other cache/noSQL implementation or some other storage mechanism but I'm focusing on keeping it simple and delivering ASAP so don't want to introduce other stuff that I don't have experience with unless there's a significant performance worry.
Are the blocks of HTML content the only thing you are saving? If so, a file may be easiest.
However, it seems likely that you may want to save other bits of information along with the HTML and be able to query based on those bits of data. For example: date created, date last modified, name of the block, the user(s) who have edited the block.
If this is the case, then a database may be the best way to go. Since you said you do not expect to have many users (at least not a first) I would concentrate on finding the solution that is the fastest / most flexible to program and focus on performance and caching after your website begins to grow in size.
I advise you to use a flat file rather than Mysql to store this kind of data.
Html is more a "file" than a "value information" so it hasn't to be in a DB.
Moreover, you will certainly have better performances.
You can also read this post.

AJAX every form element?

Is it better-practice to AJAX every form element separately (eg. send request onChange, etc) or collect all the data, then submit with 1 click save?
Essentially, auto-save or user-initiated-save?
I would generally say that a user-initiated save is the way to go for most web-applications. If for nothing else, this is how users are used to interacting with web apps; familiarity and ease of use is extremely important in web applications. Not to mention it can cut down on unnecessary traffic.
This is not to say that auto-saving does not have it's place, but often it can be cause unnecessary traffic. For example, if I am auto-saving a contact form, fill out my name, then email, then back to name to change it, that is already 3 requests that have been sent with no benefit - this is extra work for no added advantage.
Once again, I think it does have a lot to do with your application or where you are planning on using it. Inline edits are something that often uses auto-saving and there I think it is useful, whereas a contact form/signup form would not be a good idea.
I'd say that depends on the nature of your application and whether "auto-save" is a behaviour desired by your users.
"User initiated save" is what a user would expect from their experience with web forms nowadays - I would not deviate from that unless there's a good reason.
Depends on following factors:
What kind of data are you trying to save. E.g. is it okay to be able to save the data partly or you need to save it all at once?
How much data do you want to save? If you have many fields, you might want to send data in chunks (In case of wizards) or save everything at once
Its also a good idea to have data saved (in background) for large forms in a temp way if the user may take a long time to fill in the data (e.g. emails saved as drafts)
It also depends on your web app and the way you have designed your forms. In some forms you may allow certain fields to be modified and saved inplace, so that you can fetch additional data for example
In most cases it would be good to have an explicit "Save" action for your data forms

Is there a generally acceptable definition of (soft) realtime delays?

I'm trying to find a benchmark for how long users are willing to wait for a response from a remote service. In my case the response is for very useful but not business critical validation of data entry. I guess that there must have been some work done in the HCI space on this.
If you know of a generally accepted definition for soft realtime responses then great but I'd also appreciate your well reasoned thoughts.
Chris
US DOD MIL-STD 1472-F Human Engineering Standard has the most widely accepted requirements for maximum allowed response time (from Table XXII, page 196, times in seconds):
Key Response (Key depression until positive response, e.g., "click"): 0.1
Key Print (Key depression until appearance of character): 0.2
Page Turn (End of request until first few lines are visible): 1.0
Page Scan (End of request until text begins to scroll): 0.5
XY Entry (From selection of field until visual verification): 0.2
Function (From selection of command until response): 2.0
Pointing (From input of point to display point): 0.2
Sketching (From input of point to display of line): 0.2
Local Update (Change to image using local data base, e.g., new menu list): 0.5
Host Update (from display buffer): 2.0
File Update (Change where data is at host in readily accessible form): 10.0
Inquiry - Simple (e.g., a scale change of existing image): 2.0
Inquiry - Complex (Image update requires an access to a host file): 10.0
Error Feedback (From command until display of a commonly used message): 2.0
As you can see, acceptable response time depends on what response the user is waiting for. For something like a pulldown menu appearing, it's 0.5 seconds max. For a full page load in a browser, you want something to appear in 1.0 s to 2.0 s and the full page loaded in 10.0 s. In all the above, shorter response times are better. Only in bizarre circumstances will users object to a 0.001s response time.
In any case, if the response time will be greater than 0.5 s, then you need to provide feedback such as a throbber or hourglass sprite. If response time is a minimum of 5-15 s (depending on what standard you use), provide a progress bar. With a progress bar, very long response times (on the order or minutes over even hours) may be acceptable as long as you set it up for the user as a “batch” process rather than being an interactive program. It's much better for the user to make all input and wait an hour than to make input on four occasions, waiting 15 minutes after each.
The above list has the accepted standards. How long your users are willing to wait (e.g., before giving up) essentially boils down to the user making a cost-benefit analysis. Is what I’m going to get worth the wait? What are my sunk costs? Is there an alternative (e.g., another web site) that can do it better? Can I do other things while I wait to make the most of my time? However, whatever users willing to do, you can bet they’ll resent delays greater than the standards above.
Human reaction time seems to be around 200 ms - anything around there will be perceived as instantaneous. That sort of number is hard to achieve, especially in an application that gets information from remote services.
If you take a look at Google's search suggestion box, the lag there is minimal - less than a second. It's astoundingly fast, and really remarkable for a web application. This is really nice for Google's users, but it's bad news for you. These days, users expect most applications to react with the same sort of speed an efficiency; anything slower is considered rather laggy. However, it's worth noting that people's patience usually varies with the complexity of the task at hand. A simple form submit should never take much time, but something like uploading photos is expected to take a while.
My feeling is this: go with your gut. If your application is fairly simple then you should try to get the wait/load time down to less than a second. If you can't, then your best bet is to add an indicator so the user knows that some computations are being done in the background. This can be in the form of a small animation or a progress bar.
Unfortunately, the answer to this question is not typically a well-defined number. Users expectations vary widely and can change depending on what it is you're talking about.
As computers continue to become more ubiquitous and we (the consumers) continue to have growing expectations of speed, remote services, websites, and even applications will need to continue to respond more quickly. Generally speaking, you want everything to be as fast as possible.
With this said, I would look at what your remote service is for. Since you said, "the response is for very useful..." to me, that means it probably will get used frequently. People tend to use what is useful. If that's the case, I would look for ways to make that remote service respond quickly.
Of course, there is also the caveat that you don't want to start optimizing before the service is written. What is the current response time? What is the context in which this will be used? Those factors will do a lot to determine the longest users are willing to wait for the service.
You might want to search for "SLA" or "Service Level Agreement". Those are the documents in a web business that make guarantees as to how long data will take to get back to the user, whether it's an HTML document or a web service call.

Speeding up text output on Windows, for a console

We have an application that has one or more text console windows that all essentially represent serial ports (text input and output, character by character). These windows have turned into a major performance problem in the way they are currently code... we manage to spend a very significant chunk of time in them.
The current code is structured by having the window living its own little life, and the main application thread driving it across "SendMessage()" calls. This message-passing seems to be the cause of incredible overhead. Basically, having a detour through the OS feels to be the wrong thing to do.
Note that we do draw text lines as a whole where appropriate, so that easy optimization is already done.
I am not an expert in Windows coding, so I need to ask the community if there is some other architecture to drive the display of text in a window than sending messages like this? It seems pretty heavyweight.
Note that this is in C++ or plain C, as the main application is a portable C/C++/some other languages program that also runs on Linux and Solaris.
We did some more investigations, seems that half of the overhead is preparing and sending each message using SendMessage, and the other half is the actual screen drawing. The SendMessage is done between functions in the same file...
So I guess all the advice given below is correct:
Look for how much things are redrawn
Draw things directly
Chunk drawing operations in time, to not send every character to the screen, aiming for 10 to 20 Hz update rate of the serial console.
Can you accept ALL answers?
I agree with Will Dean that the drawing in a console window or a text box is a performance bottleneck by itself. You first need to be sure that this isn't your problem. You say that you draw each line as a whole, but even this could be a problem, if the data throughput is too high.
I recommend that you don't use the SendMessage to pass data from the main application to the text window. Instead, use some other means of communication. Are these in the same process? If not, you could use shared memory. Even a file in the disk could do in some circumstances. Have the main application write to this file and the text console read from it. You could send a SendMessage notification to the text console to inform it to update the view. But do not send the message whenever a new line arrives. Define a minimum interval between two subsequent updates.
You should try profiling properly, but in lieu of that I would stop worrying about the SendMessage, which almost certainly not your problem, and think about the redrawing of the window itself.
You describe these are 'text console windows', but then say you have multiple of them - are they actually Windows Consoles? Or are they something your application is drawing?
If the latter, then I would be looking at measuring my paint code, and whether I'm invalidating too much of a window on each update.
Are the output windows part of the same application? It almost sounds like they aren't...
If they are, you should look into the Observer design pattern to get away from SendMessage(). I've used it for the same type of use case, and it worked beautifully for me.
If you can't make a change like that, perhaps you could buffer your output for something like 100ms so that you don't have so many out-going messages per second, but it should also update at a comfortable rate.
Are the output windows part of the
same application? It almost sounds
like they aren't...
Yes they are, all in the same process.
I did not write this code... but it seems like SendMessage is a bit heavy for this all in one application case.
You describe these are 'text console
windows', but then say you have
multiple of them - are they actually
Windows Consoles? Or are they
something your application is drawing?
Our app is drawing them, they are not regular windows consoles.
Note that we also need to get data back when a user types into the console, as we quite often have interactive serial sessions. Think of it as very similar to what you would see in a serial terminal program -- but using an external application is obviously even more expensive than what we have now.
If you can't make a change like that,
perhaps you could buffer your output
for something like 100ms so that you
don't have so many out-going messages
per second, but it should also update
at a comfortable rate.
Good point. Right now, every single character output causes a message to be sent.
And when we scroll the window up when a newline comes, then we redraw it line-by-line.
Note that we also have a scrollback buffer of arbitrary size, but scrolling back is an interactive case with much lower performance requirements.

Resources