Copy yaml formatting (indent) from one file to another - format

A translator completely messed up a yaml file by copying everything into word (don't ask).
I have already cleaned up the file using regexes, but the indent (spacing) is now missing; everything starts at the first character:
es:
default_blocks:
thank_you_html: "thank you text"
instead of
en:
default_blocks:
thank_you_html: "thank you text"
Do you have a good idea on how to automatically copy the format/structure/indent from the correct file (say en.yml) to the corrupt one (say es.yml)? (I'm using textmate 2.0 as editor)
Thanks!

Assuming the original and the translation contain exactly the same strings per line (except for the indentation problem), a quick&dirty script scanning the leading whitespace may solve this:
#!/usr/bin/env ruby
# encoding: UTF-8
indented = File.readlines(ARGV[0]).map do |l|
l.scan(/^\s+/)[0]
end.zip(File.readlines(ARGV[1])).map { |e| e.join }.join
File.open(ARGV[1], "w") { |io| io.write(indented) }
Save it, make it executable and call
./script_name.rb en.yml es.yml
Wouldn't mess with Textmate if this is not a regular task, but you could easily transform this to a command and either prompt for the two files via a dialog or select both in the file browser, open one of them in the current tab and differentiate them via environment variables ($TM_FILEPATH, $TM_SELECTED_FILES)

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
fixed_contents = File.open(file_path).read[123..-1]
File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage" and finally pass the opened file directly to nokogiri for parsing. That way, you only need read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
if xml_file.read(123).include? 'authenticate_response'
# header found, nothing to do
else
# no header found. We rewind and let nokogiri parse the whole file
xml_file.rewind
end
xml = Nokogiri::XML.parse(xml_file)
# Now to whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokigiri::XML::Document.parse for details about those methods.

Encoding problems with ruby while reading in command line arguments with optparse

I'm writing a small programm in ruby, which essentially changes some files within a zip-file. The zip-file is specified as a parameter on the command line and interpreted via the OptionParser.
The problem is, that when specifiying a file, which contains non-ascii characters, the file cannot be opened, saying that it could not be found. This problem occurs using cmd.exe under Windows.
Here is a minimal example:
# example.rb
require "zip"
require "optparse"
zip_file_name = String.new
# read and interprete command line arguments:
OptionParser.new do |opts|
opts.on("-f", "--file FILE", String, "The zip-file, which will be modified") do |f|
zip_file_name = f
end
end.parse!
# Open the zip file:
Zip::File.open(zip_file_name) do |zipfile|
end
If you create a zip-file test.zip and run example.rb -f test.zip everything is okay (it does finish without errors). Doing the same with a zip-file täst.zip gives me an error. I tried doing zip_file_name.encode!(Encoding::UTF_8), but this didn't solve the problem.
It seems to be an encoding problem (the encoding of zip_file_name is cp850) but the transcoding does not seem to work correctly.
So my question would be: How can I change my program to also allow non-ascii characters for specifying files on the command line?
Adding zip_file_name.force_encoding(Encoding::Windows_1252) before opening the file solves the issue (on Western Europe Windows).
Apparently, the CP850 file names encoding is a wrong assumption from Ruby. On my Windows system, it seems that filenames are encoded in Windows_1252 (a custom version of Latin1 or ISO 8859-1).

Automatically open a file as binary with Ruby

I'm using Ruby 1.9 to open several files and copy them into an archive. Now there are some binary files, but some are not. Since Ruby 1.9 does not open binary files automatically as binaries, is there a way to open them automatically anyway? (So ".class" would be binary, ".txt" not)
Actually, the previous answer by Alex D is incomplete. While it's true that there is no "text" mode in Unix file systems, Ruby does make a difference between opening files in binary and non-binary mode:
s = File.open('/tmp/test.jpg', 'r') { |io| io.read }
s.encoding
=> #<Encoding:UTF-8>
is different from (note the "rb")
s = File.open('/tmp/test.jpg', 'rb') { |io| io.read }
s.encoding
=> #<Encoding:ASCII-8BIT>
The latter, as the docs say, set the external encoding to ASCII-8BIT which tells Ruby to not attempt to interpret the result at UTF-8. You can achieve the same thing by setting the encoding explicitly with s.force_encoding('ASCII-8BIT'). This is key if you want to read binary into a string and move them around (e.g. saving them to a database, etc.).
Since Ruby 1.9.1 there is a separate method for binary reading (IO.binread) and since 1.9.3 there is one for writing (IO.binwrite) as well:
For reading:
content = IO.binread(file)
For writing:
IO.binwrite(file, content)
Since IO is the parent class of File, you could also do the following which is probably more expressive:
content = File.binread(file)
File.binwrite(file, content)
On Unix-like platforms, there is no difference between opening files in "binary" and "text" modes. On Windows, "text" mode converts line breaks to DOS style, and "binary" mode does not.
Unless you need linebreak conversion on Windows platforms, just open all the files in "binary" mode. There is no harm in reading a text file in "binary" mode.
If you really want to distinguish, you will have to match File.extname(filename) against a list of known extensions like ".txt" and ".class".

Why won't gsub! change my files?

I am trying to do a simple find/replace on all text files in a directory, modifying any instance of [RAVEN_START: by inserting a string (in this case 'raven was here') before the line.
Here is the entire ruby program:
#!/usr/bin/env ruby
require 'rubygems'
require 'fileutils' #for FileUtils.mv('your file', 'new location')
class RavenParser
rawDir = Dir.glob("*.txt")
count = 0
rawDir.each do |ravFile|
#we have selected every text file, so now we have to search through the file
#and make the needed changes.
rav = File.open(ravFile, "r+") do |modRav|
#Now we've opened the file, and we need to do the operations.
if modRav
lines = File.open(modRav).readlines
lines.each { |line|
if line.match /\[RAVEN_START:.*\]/
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
count = count + 1
end
}
printf("Total Changed: %d\n",count)
else
printf("No txt files found. \n")
end
end
#end of file replacing instructions.
end
# S
end
The program runs and compiles fine, but when I open up the text file, there has been no change to any of the text within the file. count increments properly (that is, it is equal to the number of instances of [RAVEN_START: across all the files), but the actual substitution is failing to take place (or at least not saving the changes).
Is my syntax on the gsub! incorrect? Am I doing something else wrong?
You're reading the data, updating it, and then neglecting to write it back to the file. You need something like:
# And save the modified lines.
File.open(modRav, 'w') { |f| f.puts lines.join("\n") }
immediately before or after this:
printf("Total Changed: %d\n",count)
As DMG notes below, just overwriting the file isn't properly paranoid as you could be interrupted in the middle of the write and lose data. If you want to be paranoid (which all of us should be because they really are out to get us), then you want to write to a temporary file and then do an atomic rename to replace the original file the new one. A rename generally only works when you stay within a single file system as there is no guarantee that the OS's temp directory (which Tempfile uses by default) will be on the same file system as modRav so File.rename might not even be an option with a Tempfile unless precautions are taken. But the Tempfile constructor takes a tmpdir parameter so we're saved:
modRavDir = File.dirname(File.realpath(modRav))
tmp = Tempfile.new(modRav, modRavDir)
tmp.write(lines.join("\n"))
tmp.close
File.rename(tmp.path, modRav)
You might want to stick that in a separate method (safe_save(modRav, lines) perhaps) to avoid further cluttering your block.
There is no gsub! in the post (except the title and question). I would actually recommend not using gsub!, but rather use the result of gsub -- avoiding mutability can help reduce a number of subtle bugs.
The line read from the file stream into a String is a copy and modifying it will not affect the contents of the file. (The general approach is to read a line, process the line, and write the line. Or do it all at once: read all lines, process all lines, write all processed lines. In either case, nothing is being written back to the file in the code in the post ;-)
Happy coding.
You're not using gsub!, you're using gsub. gsub! and gsub different methods, one does replacement on the object itself and the other does replacement then returns the result, respectively.
Change this
line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
to this :
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
or this:
line = line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
See String#gsub for more info

How to search file text for a pattern and replace it with a given value

I'm looking for a script to search a file (or list of files) for a pattern and, if found, replace that pattern with a given value.
Thoughts?
Disclaimer: This approach is a naive illustration of Ruby's capabilities, and not a production-grade solution for replacing strings in files. It's prone to various failure scenarios, such as data loss in case of a crash, interrupt, or disk being full. This code is not fit for anything beyond a quick one-off script where all the data is backed up. For that reason, do NOT copy this code into your programs.
Here's a quick short way to do it.
file_names = ['foo.txt', 'bar.txt']
file_names.each do |file_name|
text = File.read(file_name)
new_contents = text.gsub(/search_regexp/, "replacement string")
# To merely print the contents of the file, use:
puts new_contents
# To write changes to the file, use:
File.open(file_name, "w") {|file| file.puts new_contents }
end
Actually, Ruby does have an in-place editing feature. Like Perl, you can say
ruby -pi.bak -e "gsub(/oldtext/, 'newtext')" *.txt
This will apply the code in double-quotes to all files in the current directory whose names end with ".txt". Backup copies of edited files will be created with a ".bak" extension ("foobar.txt.bak" I think).
NOTE: this does not appear to work for multiline searches. For those, you have to do it the other less pretty way, with a wrapper script around the regex.
Keep in mind that, when you do this, the filesystem could be out of space and you may create a zero-length file. This is catastrophic if you're doing something like writing out /etc/passwd files as part of system configuration management.
Note that in-place file editing like in the accepted answer will always truncate the file and write out the new file sequentially. There will always be a race condition where concurrent readers will see a truncated file. If the process is aborted for any reason (ctrl-c, OOM killer, system crash, power outage, etc) during the write then the truncated file will also be left over, which can be catastrophic. This is the kind of dataloss scenario which developers MUST consider because it will happen. For that reason, I think the accepted answer should most likely not be the accepted answer. At a bare minimum write to a tempfile and move/rename the file into place like the "simple" solution at the end of this answer.
You need to use an algorithm that:
Reads the old file and writes out to the new file. (You need to be careful about slurping entire files into memory).
Explicitly closes the new temporary file, which is where you may throw an exception because the file buffers cannot be written to disk because there is no space. (Catch this and cleanup the temporary file if you like, but you need to rethrow something or fail fairly hard at this point.
Fixes the file permissions and modes on the new file.
Renames the new file and drops it into place.
With ext3 filesystems you are guaranteed that the metadata write to move the file into place will not get rearranged by the filesystem and written before the data buffers for the new file are written, so this should either succeed or fail. The ext4 filesystem has also been patched to support this kind of behavior. If you are very paranoid you should call the fdatasync() system call as a step 3.5 before moving the file into place.
Regardless of language, this is best practice. In languages where calling close() does not throw an exception (Perl or C) you must explicitly check the return of close() and throw an exception if it fails.
The suggestion above to simply slurp the file into memory, manipulate it and write it out to the file will be guaranteed to produce zero-length files on a full filesystem. You need to always use FileUtils.mv to move a fully-written temporary file into place.
A final consideration is the placement of the temporary file. If you open a file in /tmp then you have to consider a few problems:
If /tmp is mounted on a different file system you may run /tmp out of space before you've written out the file that would otherwise be deployable to the destination of the old file.
Probably more importantly, when you try to mv the file across a device mount you will transparently get converted to cp behavior. The old file will be opened, the old files inode will be preserved and reopened and the file contents will be copied. This is most likely not what you want, and you may run into "text file busy" errors if you try to edit the contents of a running file. This also defeats the purpose of using the filesystem mv commands and you may run the destination filesystem out of space with only a partially written file.
This also has nothing to do with Ruby's implementation. The system mv and cp commands behave similarly.
What is more preferable is to open a Tempfile in the same directory as the old file. This ensures that there will be no cross-device move issues. The mv itself should never fail, and you should always get a complete and untruncated file. Any failures, such as device out of space, permission errors, etc., should be encountered during writing the Tempfile out.
The only downsides to the approach of creating the Tempfile in the destination directory are:
Sometimes you may not be able to open a Tempfile there, such as if you are trying to 'edit' a file in /proc for example. For that reason you might want to fall back and try /tmp if opening the file in the destination directory fails.
You must have enough space on the destination partition in order to hold both the complete old file and the new file. However, if you have insufficient space to hold both copies then you are probably short on disk space and the actual risk of writing a truncated file is much higher, so I would argue this is a very poor tradeoff outside of some exceedingly narrow (and well-monitored) edge cases.
Here's some code that implements the full-algorithm (windows code is untested and unfinished):
#!/usr/bin/env ruby
require 'tempfile'
def file_edit(filename, regexp, replacement)
tempdir = File.dirname(filename)
tempprefix = File.basename(filename)
tempprefix.prepend('.') unless RUBY_PLATFORM =~ /mswin|mingw|windows/
tempfile =
begin
Tempfile.new(tempprefix, tempdir)
rescue
Tempfile.new(tempprefix)
end
File.open(filename).each do |line|
tempfile.puts line.gsub(regexp, replacement)
end
tempfile.fdatasync unless RUBY_PLATFORM =~ /mswin|mingw|windows/
tempfile.close
unless RUBY_PLATFORM =~ /mswin|mingw|windows/
stat = File.stat(filename)
FileUtils.chown stat.uid, stat.gid, tempfile.path
FileUtils.chmod stat.mode, tempfile.path
else
# FIXME: apply perms on windows
end
FileUtils.mv tempfile.path, filename
end
file_edit('/tmp/foo', /foo/, "baz")
And here is a slightly tighter version that doesn't worry about every possible edge case (if you are on Unix and don't care about writing to /proc):
#!/usr/bin/env ruby
require 'tempfile'
def file_edit(filename, regexp, replacement)
Tempfile.open(".#{File.basename(filename)}", File.dirname(filename)) do |tempfile|
File.open(filename).each do |line|
tempfile.puts line.gsub(regexp, replacement)
end
tempfile.fdatasync
tempfile.close
stat = File.stat(filename)
FileUtils.chown stat.uid, stat.gid, tempfile.path
FileUtils.chmod stat.mode, tempfile.path
FileUtils.mv tempfile.path, filename
end
end
file_edit('/tmp/foo', /foo/, "baz")
The really simple use-case, for when you don't care about file system permissions (either you're not running as root, or you're running as root and the file is root owned):
#!/usr/bin/env ruby
require 'tempfile'
def file_edit(filename, regexp, replacement)
Tempfile.open(".#{File.basename(filename)}", File.dirname(filename)) do |tempfile|
File.open(filename).each do |line|
tempfile.puts line.gsub(regexp, replacement)
end
tempfile.close
FileUtils.mv tempfile.path, filename
end
end
file_edit('/tmp/foo', /foo/, "baz")
TL;DR: That should be used instead of the accepted answer at a minimum, in all cases, in order to ensure the update is atomic and concurrent readers will not see truncated files. As I mentioned above, creating the Tempfile in the same directory as the edited file is important here to avoid cross device mv operations being translated into cp operations if /tmp is mounted on a different device. Calling fdatasync is an added layer of paranoia, but it will incur a performance hit, so I omitted it from this example since it is not commonly practiced.
There isn't really a way to edit files in-place. What you usually do when you can get away with it (i.e. if the files are not too big) is, you read the file into memory (File.read), perform your substitutions on the read string (String#gsub) and then write the changed string back to the file (File.open, File#write).
If the files are big enough for that to be unfeasible, what you need to do, is read the file in chunks (if the pattern you want to replace won't span multiple lines then one chunk usually means one line - you can use File.foreach to read a file line by line), and for each chunk perform the substitution on it and append it to a temporary file. When you're done iterating over the source file, you close it and use FileUtils.mv to overwrite it with the temporary file.
Another approach is to use inplace editing inside Ruby (not from the command line):
#!/usr/bin/ruby
def inplace_edit(file, bak, &block)
old_stdout = $stdout
argf = ARGF.clone
argf.argv.replace [file]
argf.inplace_mode = bak
argf.each_line do |line|
yield line
end
argf.close
$stdout = old_stdout
end
inplace_edit 'test.txt', '.bak' do |line|
line = line.gsub(/search1/,"replace1")
line = line.gsub(/search2/,"replace2")
print line unless line.match(/something/)
end
If you don't want to create a backup then change '.bak' to ''.
This works for me:
filename = "foo"
text = File.read(filename)
content = text.gsub(/search_regexp/, "replacestring")
File.open(filename, "w") { |file| file << content }
Here's a solution for find/replace in all files of a given directory. Basically I took the answer provided by sepp2k and expanded it.
# First set the files to search/replace in
files = Dir.glob("/PATH/*")
# Then set the variables for find/replace
#original_string_or_regex = /REGEX/
#replacement_string = "STRING"
files.each do |file_name|
text = File.read(file_name)
replace = text.gsub!(#original_string_or_regex, #replacement_string)
File.open(file_name, "w") { |file| file.puts replace }
end
require 'trollop'
opts = Trollop::options do
opt :output, "Output file", :type => String
opt :input, "Input file", :type => String
opt :ss, "String to search", :type => String
opt :rs, "String to replace", :type => String
end
text = File.read(opts.input)
text.gsub!(opts.ss, opts.rs)
File.open(opts.output, 'w') { |f| f.write(text) }
If you need to do substitutions across line boundaries, then using ruby -pi -e won't work because the p processes one line at a time. Instead, I recommend the following, although it could fail with a multi-GB file:
ruby -e "file='translation.ja.yml'; IO.write(file, (IO.read(file).gsub(/\s+'$/, %q('))))"
The is looking for white space (potentially including new lines) following by a quote, in which case it gets rid of the whitespace. The %q(')is just a fancy way of quoting the quote character.
Here an alternative to the one liner from jim, this time in a script
ARGV[0..-3].each{|f| File.write(f, File.read(f).gsub(ARGV[-2],ARGV[-1]))}
Save it in a script, eg replace.rb
You start in on the command line with
replace.rb *.txt <string_to_replace> <replacement>
*.txt can be replaced with another selection or with some filenames or paths
broken down so that I can explain what's happening but still executable
# ARGV is an array of the arguments passed to the script.
ARGV[0..-3].each do |f| # enumerate the arguments of this script from the first to the last (-1) minus 2
File.write(f, # open the argument (= filename) for writing
File.read(f) # open the argument (= filename) for reading
.gsub(ARGV[-2],ARGV[-1])) # and replace all occurances of the beforelast with the last argument (string)
end
EDIT: if you want to use a regular expression use this instead
Obviously, this is only for handling relatively small text files, no Gigabyte monsters
ARGV[0..-3].each{|f| File.write(f, File.read(f).gsub(/#{ARGV[-2]}/,ARGV[-1]))}
I am using the tty-file gem
Apart from replacing, it includes append, prepend (on a given text/regex inside the file), diff, and others.

Resources