Applying a diff-patch to a string/file - ruby

For an offline-capable smartphone app, I'm creating a one-way text sync for XML files. I'd like my server to send the delta/difference (e.g. a GNU diff-patch) to the target device.
This is the plan:
Time = 0
Server: has version_1 of the XML file (~800 kiB)
Client: has version_1 of the XML file (~800 kiB)
Time = 1
Server: has version_1 and version_2 of the XML file (each ~800 kiB)
computes the delta of these versions (= patch) (~10 kiB)
sends the patch to the Client (~10 kiB transferred)
Client: computes version_2 from version_1 and the patch <= this is the problem =>
Is there a Ruby library that can do this last step to apply a text patch to files/strings? The patch can be formatted as required by the library.
Thanks for your help!
(I'm using the Rhodes Cross-Platform Framework, which uses Ruby as programming language.)

Your first task is to choose a patch format. The hardest format for humans to read (IMHO) turns out to be the easiest format for software to apply: the ed(1) script. You can start off with a simple /usr/bin/diff -e old.xml new.xml to generate the patches; diff(1) will produce line-oriented patches but that should be fine to start with. The ed format looks like this:
36a
<tr><td class="eg" style="background: #182349;"> </td><td><tt>#182349</tt></td></tr>
.
34c
<tr><td class="eg" style="background: #66ccff;"> </td><td><tt>#xxxxxx</tt></td></tr>
.
20,23d
The numbers are line numbers; a range is given as two line numbers separated by a comma. Then there are three single-letter commands:
a: add the next block of text at this position.
c: change the text at this position to the following block. This is equivalent to a d followed by an a command.
d: delete these lines.
You'll also notice that the line numbers in the patch run from the bottom of the file up, so you don't have to worry about changes messing up the line numbers in subsequent chunks of the patch. The actual chunks of text to be added or changed follow the commands as a sequence of lines terminated by a line with a single period (i.e. /^\.$/ or patch_line == '.' depending on your preference). In summary, the format looks like this:
[line-number-range][command]
[optional-argument-lines...]
[dot-terminator-if-there-are-arguments]
So, to apply an ed patch, all you need to do is load the target file into an array (one element per line), parse the patch using a simple state machine, call Array#insert to add new lines and Array#delete_at to remove them. Shouldn't take more than a couple dozen lines of Ruby to write the patcher and no library is needed.
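For instance, here's a minimal sketch of such a patcher, assuming the patch contains only the a/c/d commands that diff -e emits (with bottom-up line numbers) and ignoring ed's escape trick for content lines consisting of a single period; the file names in the last line are placeholders:
def apply_ed_patch(lines, patch_text)
  result = lines.dup
  commands = patch_text.lines
  until commands.empty?
    cmd = commands.shift
    m = /\A(\d+)(?:,(\d+))?([acd])\s*\z/.match(cmd) or
      raise ArgumentError, "bad ed command: #{cmd.chomp}"
    first, last = m[1].to_i, (m[2] || m[1]).to_i
    body = []
    if m[3] != 'd' # a and c carry a dot-terminated block of text
      while (text = commands.shift) && text.chomp != '.'
        body << text.chomp
      end
    end
    case m[3]
    when 'a' then result.insert(first, *body)        # add after line `first`
    when 'd' then result.slice!(first - 1..last - 1) # delete lines first..last
    when 'c'                                         # change = delete, then add
      result.slice!(first - 1..last - 1)
      result.insert(first - 1, *body)
    end
  end
  result
end
new_version = apply_ed_patch(File.readlines('old.xml', chomp: true), File.read('old.patch'))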
If you can arrange your XML to come out like this:
<tag>
blah blah
</tag>
<other-tag x="y">
mumble mumble
</other-tag>
rather than:
<tag>blah blah</tag><other-tag x="y">mumble mumble</other-tag>
then the above simple line-oriented approach will work fine; the extra EOLs aren't going to cost much space so go for easy implementation to start.
There are Ruby libraries for producing diffs between two arrays (google "ruby algorithm::diff" to start). Combining a diff library with an XML parser will let you produce patches that are tag-based rather than line-based, and this might suit you better. The important thing is the choice of patch format; once you choose the ed format (and realize the wisdom of the patch working from the bottom to the top), everything else pretty much falls into place with little effort.
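For that route, here's a short sketch with the diff-lcs gem (one of the search results you'll find); the file names are placeholders, and the change set would need to be serialized (e.g. with Marshal) before shipping it to the client:
require 'diff/lcs'

old_lines = File.readlines('version_1.xml')
new_lines = File.readlines('version_2.xml')

changes = Diff::LCS.diff(old_lines, new_lines)   # the "patch" to transfer
patched = Diff::LCS.patch(old_lines, changes)    # what the client would do
patched == new_lines # => true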

I know this question is almost five years old, but I'm going to post an answer anyway. When searching for how to make and apply patches for strings in Ruby, even now, I was unable to find any resources that answer this question satisfactorily. For that reason, I'll show how I solved this problem in my application.
Making Patches
I'm assuming you're using Linux, or else have access to the program diff through Cygwin. In that case, you can use the excellent Diffy gem to create ed script patches:
patch_text = Diffy::Diff.new(old_text, new_text, :diff => "-e").to_s
Applying Patches
Applying patches is not quite as straightforward. I opted to write my own algorithm, asked for improvements on Code Review, and finally settled on the code below. This code is identical to 200_success's answer except for one change to improve its correctness.
require 'stringio'

def self.apply_patch(old_text, patch)
  text = old_text.split("\n")
  patch = StringIO.new(patch)
  current_line = 1

  while patch_line = patch.gets
    # Grab the command
    m = %r{\A(?:(\d+))?(?:,(\d+))?([acd]|s/\.//)\Z}.match(patch_line)
    raise ArgumentError.new("Invalid ed command: #{patch_line.chomp}") if m.nil?
    first_line = (m[1] || current_line).to_i
    last_line = (m[2] || first_line).to_i
    command = m[3]
    case command
    when "s/.//"
      (first_line..last_line).each { |i| text[i - 1].sub!(/./, '') }
    else
      if ['d', 'c'].include?(command)
        text[first_line - 1 .. last_line - 1] = []
      end
      if ['a', 'c'].include?(command)
        current_line = first_line - (command == 'a' ? 0 : 1) # Adds are 0-indexed, but changes and deletes are 1-indexed
        while (patch_line = patch.gets) && (patch_line.chomp! != '.') && (patch_line != '.')
          text.insert(current_line, patch_line)
          current_line += 1
        end
      end
    end
  end
  text.join("\n")
end
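A quick round-trip check tying the two halves together; Patcher here is a hypothetical module holding the apply_patch method above:
require 'diffy'

patch_text = Diffy::Diff.new(old_text, new_text, :diff => "-e").to_s
Patcher.apply_patch(old_text, patch_text) == new_text # => true for typical line-oriented text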

Related

Ruby modify file instead of creating new file

Say I have the following Ruby code which, given a hash of insert positions, reads a file and creates a new file with extra text inserted at those positions:
insertpos = {14 => 25, 16 => 25}

File.open('file.old', 'r') do |oldfile|
  File.open('file.new', 'w') do |newfile|
    oldfile.each_with_index do |line, linenum|
      inserthere = insertpos[linenum]
      if !inserthere.nil?
        line.insert(inserthere, "foo")
      end
      newfile.write(line)
    end
  end
end
Now, instead of creating that new file, I would like to modify this original (old) file. Can someone give me a hint on how to modify the code? Thanks!
At a very fundamental level, this is an extremely difficult thing to do, in any language, on any operating system. Envision a file as a contiguous series of bytes on disk (this is a very simplistic scenario, but it serves to illustrate the point). You want to insert some bytes in the middle of the file. Where do you put those bytes? There's no place to put them! You would have to basically "shift" the existing bytes after the insertion point "down" by the number of bytes you want to insert. If you're inserting multiple sections into an existing file, you would have to do this multiple times! It will be extremely slow, and you will run a high risk of corrupting your data if something goes awry.
You can, however, overwrite existing bytes, and/or append to the end of the file. Most Unix utilities give the appearance of modifying files by creating new files and swapping them with the old. Some more sophisticated schemes, such as those used by databases, allow inserts in the middle of files by:
1. reserving space for such operations (when the data is first written),
2. allowing non-contiguous blocks of data within the file through indexing and other techniques, and/or
3. copy-on-write schemes, where a new version of the data is written to the end of the file and the old version is invalidated by overwriting an indicator of some kind.
You are most likely not wanting to go through all this trouble for your simple use case!
Anyway, you've already found the best way to do what you're trying to do. The only thing you're missing is a FileUtils.mv('file.new', 'file.old') at the very end to replace the old file with the new. Please let me know in the comments if I can help explain this any further.
(Of course, you can read the entire file into memory, make your changes, and overwrite the old file with the updated contents, but I don't believe that's what you're asking here.)
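For clarity, here's a minimal sketch of the write-then-swap pattern described above, using the question's own insertpos hash:
require 'fileutils'

insertpos = {14 => 25, 16 => 25}
File.open('file.new', 'w') do |newfile|
  File.foreach('file.old').with_index do |line, linenum|
    line.insert(insertpos[linenum], "foo") if insertpos.key?(linenum)
    newfile.write(line)
  end
end
FileUtils.mv('file.new', 'file.old') # replace the old file with the new one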
Here's something that hopefully solves your purpose:
# 'source' param is a string, the entire source text
# 'lines' param is an array, a list of line numbers to insert after
# 'new' param is a string, the text to add
def insert(source, lines, new)
  results = []
  source.split("\n").each_with_index do |line, idx|
    if lines.include?(idx)
      results << (line + new)
    else
      results << line
    end
  end
  results.join("\n")
end

File.open("foo", "w") do |f|
  10.times do |i|
    f.write("#{i}\n")
  end
end
puts "initial text: \n\n"
txt = File.read("foo")
puts txt
puts "\n\n after inserting at lines 1,3, and 5: \n\n"
result = insert(txt, [1,3,5], "\nfoo")
puts result
Running this shows:
initial text:
0
1
2
3
4
5
6
7
8
9
after inserting at lines 1,3, and 5:
0
1
foo
2
3
foo
4
5
foo
6
7
8
If it's a relatively simple operation, you can do it with a Ruby one-liner, like this:
ruby -i -lpe '$_.reverse!' thefile.txt
(found e.g. at https://gist.github.com/KL-7/1590797).

Easier way to search through large files in Ruby?

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.
If users have log roll over turned off, the log files can sometimes get quite large. Currently I'm doing the following (though not done with it yet):
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
    if line.match(/\d+-\d+-\d+/)
      etc...
The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.
Thanks
fgrep as a standalone, or called from system('fgrep ...'), may be a faster solution.
file.readlines might be better in speed, but it's a time-space tradeoff.
Have a look at this little research - the last approaches seem to be rather fast.
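As a sketch of the fgrep suggestion, IO.popen streams the matching lines instead of capturing everything at once; 'ERROR' stands in for whatever fixed string you are hunting for:
IO.popen(['fgrep', 'ERROR', @log_file]) do |io|
  io.each_line do |line|
    # handle each matching line here
  end
end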
Here are some coding hints...
Instead of:
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
use:
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
foreach simplifies opening and looping over the file.
next unless... makes a tight loop skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.
Using an anchor at the start of your pattern, like \A, gives the regex engine a major hint about where to look in the line, and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+-\d+-\d+/] is a bit more concise.
If your log file is sorted by date, then you can avoid having search through the entire file by doing a binary search. In this case you'd:
Open the file like you are doing
Use seek to fast-forward to the middle of the file.
Check if the date at the beginning of the line is earlier or later than the date you are looking for.
Continue splitting the file in halves until you find what you need.
I do however think your file needs to be very large for the above to make sense.
Edit
Here is some code which shows the basic idea. It finds a line containing the search date, not necessarily the first such line. This can be fixed either by more binary searches or by doing a linear search from the last midpoint which did not contain the date. There is also no termination condition in case the date is not in the file. These small additions are left as an exercise to the reader :-)
require 'date'

def bin_fsearch(search_date, file)
  f = File.open file
  search = {min: 0, max: f.size}
  while true
    # go to the file midpoint
    f.seek((search[:max] + search[:min]) / 2)
    # read in until EOL
    f.gets
    # record the actual mid-point we are using
    pos = f.pos
    # read in the next line
    line = f.gets
    # get the date from the line
    line_date = Date.parse(line)
    if line_date < search_date
      search[:min] = f.pos
    elsif line_date > search_date
      search[:max] = pos
    else
      f.seek pos
      return
    end
  end
end
bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')
Try this; it will read one line at a time and should be pretty fast while taking little memory.
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
Another, faster option is to read the whole file into one array. It would be fast, but will take a LOT of memory.
File.readlines(file).each do |line|
  # do stuff with each line
end
Further, if you need the fastest approach with the least amount of memory, try grep, which is specifically tuned for searching through large files, so it should be fast and memory-friendly.
`grep -e regex bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
Faster than line-by-line is reading the file in chunks:
File.open('file.txt') do |f|
  while buff = f.read(10240)
    # ... process each 10 KiB chunk in buff ...
  end
end
But since you are using a regexp to match dates, you might get incomplete lines at the chunk boundaries. You will have to deal with that in your logic.
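One way to deal with that, as a sketch: carry the trailing partial line over into the next chunk, so the regexp only ever sees complete lines:
File.open('file.txt') do |f|
  carry = ''
  while buff = f.read(10240)
    lines = (carry + buff).split("\n", -1)
    carry = lines.pop # the last piece is an incomplete line (or "")
    lines.each do |line|
      # only complete lines reach this point
    end
  end
  # `carry` now holds the final line if the file lacks a trailing newline
end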
Also, if performance is that important, consider writing a really simple C extension.
If the log file can get huge, and that is your concern, then maybe you can consider saving the errors in a database. Then you will get a faster response.

Fastest way to skip lines while parsing files in Ruby?

I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case.
I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file. (Obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example.) Since I know it won't be in the first half, is there a quick way of disregarding that information?
Currently I'm using something along the lines of:
while buffer = file_in.gets and file_in.lineno < 600
  next unless file_in.lineno > 500
  if buffer.chomp.include? some_string
    do_func_whatever
  end
end
It works, but I just can't help but think it could work better.
I'm very new to Ruby and am interested in learning new ways of doing things in it.
file.lines.drop(500).take(100) # will get you lines 501-600
Generally, you can't avoid reading the file from the start up to the line you are interested in, as each line can be of a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach what you're looking for. Pretty much like your own example. You can just make it more Rubyish.
PS. the Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this could be avoided:
file.lines.select.with_index{|l,i| (501..600) === i}
PS2: Doh, the above code, while not making a huge array, iterates through the whole file, even the lines past 600. :( Here's a third version:
enum = file.lines
500.times{enum.next} # skip 500
enum.take(100) # take the next 100
or, if you prefer FP:
file.lines.tap{|enum| 500.times{enum.next}}.take(100)
Anyway, the good point of this monologue is that you can learn multiple ways to iterate a file. ;)
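As a footnote to the "make it more Rubyish" remark above, here's a minimal sketch of the plain line-by-line skip, with file_in and the 500..600 window taken from the question:
File.open(file_in) do |f|
  f.each_line.with_index(1) do |line, lineno|
    next if lineno <= 500
    break if lineno > 600
    # process lines 501..600 here
  end
end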
I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes.
See IO#seek, or see IO#open for information on the offset argument.
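For illustration, a byte-oriented skip might look like this (the 1024 offset is arbitrary; after seeking you land mid-line, so one gets discards the partial line):
File.open(file_in) do |f|
  f.seek(1024, IO::SEEK_SET) # jump 1024 bytes into the file
  f.gets                     # throw away the likely-partial current line
  f.each_line do |line|
    # complete lines from here to EOF
  end
end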
Sounds like rio might be of help here. It provides you with a lines() method.
You can use IO#readlines, which returns an array with all the lines:
IO.readlines(file_in)[500..600].each do |line|
  # line is each line in the file (including the trailing \n)
  # stuff
end
or
f = File.new(file_in)
f.readlines[500..600].each do |line|
  # line is each line in the file (including the trailing \n)
  # stuff
end

Convert a .rtf into a mac .r resource, in a scriptable way

I currently have an SLA in .rtf format, which is to be integrated into a .dmg via the intermediary .r Mac resource format used by the Rez utility. I had already done it by hand once, but propagating updates made to the .rtf file to the disk image is overwhelming and error-prone. I would like to automate this task, which could also help with adding other languages or variants.
How could the process of .rtf to .r text conversion be automated?
Thanks.
Only because I didn't fully understand how the accepted answer actually achieved the goal, I use a combination of a script to generate the hex encoding and a template into which it is inserted. The script:
#!/usr/bin/env ruby
# Makes resource (.r) text from binaries.

def usage
  puts "usage: #{$0} infile"
  puts ""
  puts "  infile  The file to convert (the output will go to stdout)"
  exit 1
end

infile = ARGV[0] || usage
data = File.read(infile)

data.bytes.each_slice(16) do |slice|
  hex = slice.each_slice(2).map { |pair| pair.pack('C*').unpack('H*')[0] }.join(' ')
  # We could put the comments in too, but it probably isn't a big deal.
  puts "\t$\"#{hex}\""
end
The output of this is inserted into a variable during the build and then the variable ends up in a template (we're using Ant to do this, but the specifics aren't particularly interesting):
data 'RTF ' (5000, "English SLA") {
    #english.licence#
};
The one bit of this which did take quite a while to figure out is that 'RTF ' can be used for the resource directly. The Apple docs say to separately insert 'TEXT' (with just the plain text) and 'styl' (with just the style). There are tools to do this of course, but it was one more tool to run and I could never figure out how to make hyperlinks work in the resulting DMG. With 'RTF ', hyperlinks just work.
Hoping that this saves someone time in the future.
Use the unrtf port (from MacPorts), then format the lines, heading, and tail with a shell script.

Reformatting text (or, better, LaTeX) in 80 columns in SciTE

I recently dived into LaTeX, starting with the help of a WYSIWYM editor like LyX. Now I'm starting to write .tex files in SciTE. It already has syntax highlighting, and I adapted the tex.properties file to work in Windows, showing a preview on Go [F5].
One pretty thing LyX does, and it's hard to achieve with a common text editor, is to format text in 80 columns: I can write a paragraph and hit Return each time I reach near the edge column, but if, after the first draft, I want to add or cut some words here and there, I end up breaking the layout and having to rearrange newlines.
It would be useful to have a tool in SciTE so I can select a paragraph of text I added or deleted some words in and have it rearranged in 80 columns. Probably not something working on the whole document, since that could break some intended line breaks.
I could probably easily write a Python plugin for Geany, and I saw Vim has something similar, but I'd like to know if it's possible in SciTE too.
I was a bit disappointed when I found no answer while searching for the same thing. Google was no help either, so I searched for Lua examples and syntax in the hope of crafting it myself. I don't know Lua, so this can perhaps be done differently or more efficiently, but it's better than nothing, I hope. Here is the Lua function, which needs to be put in the SciTE start-up Lua script:
function wrap_text()
  local border = 80
  local t = {}
  local pos = editor.SelectionStart
  local sel = editor:GetSelText()
  if #sel == 0 then return end
  local para = {}
  local function helper(line) table.insert(para, line) return "" end
  helper((sel:gsub("(.-)\r?\n", helper)))
  for k, v in pairs(para) do
    local line = ""
    for token in string.gmatch(v, "[^%s]+") do
      if string.len(token .. line) >= border then
        t[#t + 1] = line
        line = token .. " "
      else
        line = line .. token .. " "
      end
    end
    t[#t + 1] = line:gsub("%s$", "")
  end
  editor:ReplaceSel(table.concat(t, "\n"))
  editor:GotoPos(pos)
end
Usage is like any other function from the start-up script, but for completeness I'll paste my tool definition from my SciTE properties file:
command.name.8.*=Wrap Text
command.mode.8.*=subsystem:lua,savebefore:no,groupundo
command.8.*=wrap_text
command.replace.selection.8.*=2
It does respect paragraphs, so it can be used on broader selection, not just one paragraph.
This is one way to do it in SciTE: first, add this to your .SciTEUser.properties (Options / Open User Options file):
# Column guide, indicates long lines (https://wiki.archlinux.org/index.php/SciTE)
# this is what they call "margin line" in gedit (at right),
# in scite, "margin" is the area on left for line numbers
edge.mode=1
edge.column=80
... and save, so you can see a line at 80 characters.
Then scale the scite window, so the text you see is wrapped at the line.
Finally, select the long line text which is to be broken into lines, and do Edit / Paragraph / Split (for me the shortcut Ctrl-K also works for that).
Unfortunately, there seems to be no "break-lines-as-you-type" facility in SciTE, like the "Line Breaking" facility in Geany. Not anymore - now there's a plugin; see this answer.
Well, I was rather disappointed that there seemed to be no "break-lines-as-you-type" facility in SciTE, and I finally managed to code a small Lua plugin/add-on/extension for that, and released it here:
lua-users wiki: Scite Line Break
Installation and usage instructions are in the script itself. Here is how SciTE may look with the extension properly installed and the toggle activated after startup:
Note that it's pretty much the same functionality as in Geany - it inserts line breaks as you type text - but not on pressing backspace, nor upon copy/pasting.
The same, but easier, I think...
Put this in the user properties:
command.name.0.*=swrap
command.0.*=fold -s $(FileNameExt) > /tmp/scite_temp ; cat /tmp/scite_temp >$(FileNameExt)
command.is.filter.0.*=1
Ciao
Pietro
