Good Way to Handle Many Different Files? - ruby

I'm building a specialized pipeline, and basically, every step in the pipeline involves taking one file as input and creating a different file as output. Not all files are in the same directory, all output files are of a different format, and because I'm using several different programs, different actions have to be taken to appease the different programs.
This has led to some complicated file management in my code, and the more I try to organize the file directories, the more ugly it's getting. Just about every class involves some sort of code like the following:
@fileName = File.basename(file)
@dataPath = "#{$path}/../data/"
MzmlToOther.new("mgf", "#{@dataPath}/spectra/#{@fileName}.mzML", 1, false).convert
system("wine readw.exe --mzXML #{@file}.raw #{$path}../data/spectra/#{File.basename(@file + ".raw", ".raw")}.mzXML 2>/dev/null")
fileName = "#{$path}../data/" + parts[0] + parts[1][6..parts[1].length-1].chomp(".pep.xml")
Is there some sort of design pattern, or ruby gem, or something to clean this up? I like writing clean code, so this is really starting to bother me.

You could use a Makefile.
Make is essentially a DSL designed for converting one type of file to another by running an external program. As an added bonus, it performs only the steps needed to incrementally update your output when some set of source files changes.
If you really want to use Ruby, try a rakefile. Rake will do this, and it's still Ruby.
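As a rough sketch of the Rake suggestion, a single rule can describe how any .mzXML file is built from the matching .raw file, using the readw command from the question. The directory names and the convert task name are assumptions made for illustration; they are not taken from the original pipeline.

```ruby
# Rakefile (sketch). The explicit require is not needed inside a real
# Rakefile, but it lets the file be loaded as a plain Ruby script too.
require "rake"

RAW_DIR     = "data/raw"      # assumed location of the .raw inputs
SPECTRA_DIR = "data/spectra"  # assumed location of the .mzXML outputs

RAW_FILES   = Rake::FileList["#{RAW_DIR}/*.raw"]
MZXML_FILES = RAW_FILES.pathmap("#{SPECTRA_DIR}/%n.mzXML")

# Build data/spectra/NAME.mzXML from data/raw/NAME.raw.
# The lambda maps a target name back to the source file it depends on.
rule(%r{\A#{SPECTRA_DIR}/.+\.mzXML\z} => [
  ->(target) { target.pathmap("#{RAW_DIR}/%n.raw") }
]) do |t|
  mkdir_p File.dirname(t.name)
  # Passing separate arguments avoids shell quoting problems with odd names.
  sh "wine", "readw.exe", "--mzXML", t.source, t.name
end

desc "Convert every .raw file whose .mzXML output is missing or out of date"
task convert: MZXML_FILES
```

Running rake convert would then rebuild only the outputs that are older than their inputs, which keeps the path arithmetic in one place instead of in every class.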

You can make this as sophisticated as you want but this basic script will match a file suffix to a method which you can then call with the file path.
# a conversion method can be used for each file type if you want to
# make the code more readable or if you need to rearrange filenames.
def htm_convert file
"HTML #{file}"
end
# file suffix as key, lambda as value, the last uses an external method
routines = {
:log => lambda {|file| puts "LOG #{file}"},
:rb => lambda {|file| puts "RUBY #{file}"},
:haml => lambda {|file| puts "HAML #{file}"},
:htm => lambda {|file| puts htm_convert(file) }
}
# this loops recursively through the directory and sub folders
Dir['**/*.*'].each do |f|
suffix = f.split(".")[-1]
if routine = routines[suffix.to_sym]
routine.call(f)
else
puts "UNPROCESSED -- #{f}"
end
end

Related

`__FILE__` not working within `DATA`/`__END__`

I'm working on Practicing Ruby's Self-Guided Course on Stream, File Formats, and Sockets, and came across the following problem in the pre-built test for the first exercise. The following test script is supposed to change the directory to the data subdirectory of the project folder:
eval(DATA.read) # load the test helper script
... # various calls to test method defined below
__END__
dir = File.dirname(__FILE__)
Dir.chdir("#{dir}/data")
...
But this breaks because __FILE__ returns (eval) (instead of the path to the file) and File.dirname(__FILE__) returns . Why is this happening, and how should it be written to yield the intended output instead?
__END__ and DATA aren't really relevant here. You're simply passing a string to Kernel#eval. For example, a simple eval('__FILE__') also returns "(eval)" because that's the default filename. It can be changed by passing a filename string as the third argument:
eval('__FILE__', nil, 'hello.rb') # => "hello.rb"
Or in your case:
eval(DATA.read, nil, __FILE__)
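To see the effect on File.dirname specifically, here is a small sketch. The path is made up for illustration, and the optional fourth argument (the starting line number used in backtraces) is shown only because it can help when the evaluated code does not start at line 1 of the file.

```ruby
# The third argument to Kernel#eval sets what __FILE__ reports inside the
# evaluated string; the optional fourth sets the starting line number.
helper = 'File.dirname(__FILE__)'

# "/tmp/project/test.rb" is a made-up path used only for illustration.
dir = eval(helper, nil, "/tmp/project/test.rb", 1)
puts dir  # prints /tmp/project
```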

Test all subclasses on file update

I am learning unit testing with PHP and am following the TDD session on tutsplus: http://net.tutsplus.com/sessions/test-driven-php/
I have set up a ruby watchr script to run the PHPUnit unit tests every time a file is modified using Susan Buck's script: https://gist.github.com/susanBuck/4335092
I would like to change the ruby script so that in addition to testing a file when it is updated it will test all files that inherit from it. I name my files to indicate inheritance (and to group files) as Parent.php, Parent.Child.php, and Parent.Child.GrandChild.php, etc so the watchr script could just search by name. I just have no idea how to do that.
I would like to change:
watch("Classes/(.*).php") do |match|
run_test %{Tests/#{match[1]}_test.php}
end
to something like:
watch("Classes/(.*).php") do |match|
files = get all classes that inherit from {match[1]} /\b{match[1]}\.(.*)\.php/i
files.each do |file|
run_test %{Tests/{file}_test.php}
end
end
How do I do the search for file names in the directory? Or, is there an easier/better way to accomplish this?
Thanks
EDIT
This is what I ended up with:
watch("#{Library}/(.*/)?(.*).php") do |match|
file_moded(match[1], match[2])
end
def file_moded(path, file)
subclasses = Dir["#{Library}/#{path}#{file}*.php"]
p subclasses
subclasses.each do |file|
test_file = Tests + file.tap{|s| s.slice!(".php")}.tap{|s| s.slice!("#{Library}")} + TestFileEnd
run_test test_file
end
end
Where Library, Tests, and TestFileEnd are values defined at the top of the file. It was also changed so that it will detect changes in subfolders to the application library and load the appropriate test file.
I'm not entirely certain, but I think this will work:
watch("Classes/(.*).php") do |match|
subclasses = Dir["Classes/#{match[1]}*.php"]
filenames = subclasses.map do |file|
file.match(/Classes\/(.*)\.php/)[1]
end
filenames.each do |file|
run_test "Tests/#{file}_test.php"
end
end
It's probably not the cleanest way, but it should work.
The first line collects the relative paths of all files in the Classes directory whose names begin with the changed file's name and stores them in subclasses.
In the map block I use a regex to keep only the filename, without the folder name or the .php extension.
Hope this helps you
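One thing to watch with a plain prefix glob is that Parent*.php also matches an unrelated ParentFoo.php. A possible refinement, sketched below under the naming scheme from the question, is to match only Parent.php and Parent.<something>.php. The tests_for helper and its keyword arguments are hypothetical names, not part of watchr.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: returns the test paths for a class and for every file
# named as one of its descendants (Parent.php, Parent.Child.php, ...).
def tests_for(class_name, classes_dir: "Classes", tests_dir: "Tests")
  patterns = [
    File.join(classes_dir, "#{class_name}.php"),   # the class itself
    File.join(classes_dir, "#{class_name}.*.php")  # Parent.<anything>.php
  ]
  Dir.glob(patterns).sort.map do |path|
    File.join(tests_dir, "#{File.basename(path, '.php')}_test.php")
  end
end

# Usage against a throwaway directory; ParentFoo.php should be left out.
Dir.mktmpdir do |dir|
  classes = File.join(dir, "Classes")
  FileUtils.mkdir_p(classes)
  %w[Parent.php Parent.Child.php Parent.Child.GrandChild.php ParentFoo.php].each do |name|
    FileUtils.touch(File.join(classes, name))
  end
  puts tests_for("Parent", classes_dir: classes)
end
```

Inside the watch block, the result could be passed to run_test one path at a time.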

Why won't gsub! change my files?

I am trying to do a simple find/replace on all text files in a directory, modifying any instance of [RAVEN_START: by inserting a string (in this case 'raven was here') before the line.
Here is the entire ruby program:
#!/usr/bin/env ruby
require 'rubygems'
require 'fileutils' #for FileUtils.mv('your file', 'new location')
class RavenParser
rawDir = Dir.glob("*.txt")
count = 0
rawDir.each do |ravFile|
#we have selected every text file, so now we have to search through the file
#and make the needed changes.
rav = File.open(ravFile, "r+") do |modRav|
#Now we've opened the file, and we need to do the operations.
if modRav
lines = File.open(modRav).readlines
lines.each { |line|
if line.match /\[RAVEN_START:.*\]/
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
count = count + 1
end
}
printf("Total Changed: %d\n",count)
else
printf("No txt files found. \n")
end
end
#end of file replacing instructions.
end
# S
end
The program runs and compiles fine, but when I open up the text file, there has been no change to any of the text within the file. count increments properly (that is, it is equal to the number of instances of [RAVEN_START: across all the files), but the actual substitution is failing to take place (or at least not saving the changes).
Is my syntax on the gsub! incorrect? Am I doing something else wrong?
You're reading the data, updating it, and then neglecting to write it back to the file. You need something like:
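# And save the modified lines. Each line from readlines already ends in "\n",
# so join them without a separator. ravFile is the path from the outer loop.
File.open(ravFile, 'w') { |f| f.write lines.join }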
immediately before or after this:
printf("Total Changed: %d\n",count)
As DMG notes below, just overwriting the file isn't properly paranoid, as you could be interrupted in the middle of the write and lose data. If you want to be paranoid (which all of us should be because they really are out to get us), write to a temporary file and then do an atomic rename to replace the original file with the new one. A rename generally only works within a single file system, and there is no guarantee that the OS's temp directory (which Tempfile uses by default) is on the same file system as the file being replaced, so File.rename might not even be an option with a Tempfile unless precautions are taken. But the Tempfile constructor takes a tmpdir parameter, so we're saved:
require 'tempfile'
# Create the temp file next to the original so the rename stays on one file system.
ravDir = File.dirname(File.realpath(ravFile))
tmp = Tempfile.new(File.basename(ravFile), ravDir)
tmp.write(lines.join)
tmp.close
File.rename(tmp.path, ravFile)
You might want to stick that in a separate method (safe_save(ravFile, lines) perhaps) to avoid further cluttering your block.
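A minimal sketch of what such a safe_save method might look like follows. It assumes the lines still carry their trailing newlines (as readlines returns them) and that the target file already exists; the method name comes from the suggestion above, everything else is illustrative.

```ruby
require "tempfile"

# Hypothetical helper: write the lines to a temp file in the same directory
# as path, then rename that temp file over the original.
def safe_save(path, lines)
  dir = File.dirname(File.realpath(path))
  tmp = Tempfile.new(File.basename(path), dir)
  begin
    tmp.write(lines.join)  # lines from readlines already end in "\n"
    tmp.close
    File.rename(tmp.path, path)
  ensure
    tmp.close unless tmp.closed?
    tmp.unlink             # nothing left to remove once the rename succeeded
  end
end
```

Note that Tempfile creates files readable only by their owner, so the replaced file ends up with those permissions; a File.chmod after the rename may be needed if that matters.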
There is no gsub! in the post (except the title and question). I would actually recommend not using gsub!, but rather use the result of gsub -- avoiding mutability can help reduce a number of subtle bugs.
The line read from the file stream into a String is a copy and modifying it will not affect the contents of the file. (The general approach is to read a line, process the line, and write the line. Or do it all at once: read all lines, process all lines, write all processed lines. In either case, nothing is being written back to the file in the code in the post ;-)
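The "process all lines" step can be sketched without any mutation, assuming the goal from the question of putting a marker in front of each matching line. The helper name is hypothetical.

```ruby
# Hypothetical helper: returns a new array with the marker put in front of
# every line containing "[RAVEN_START:"; the input lines are not mutated.
def mark_raven_lines(lines, marker = "raven was here ")
  lines.map do |line|
    line.include?("[RAVEN_START:") ? marker + line : line
  end
end

original = ["intro\n", "[RAVEN_START: 1]\n", "outro\n"]
marked = mark_raven_lines(original)
puts marked
```

The returned array still has to be written back, for example with File.write(path, marked.join).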
Happy coding.
You're not using gsub!, you're using gsub. gsub! and gsub are different methods: the former does the replacement on the object itself, while the latter does the replacement and returns the result.
Change this
line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
to this :
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
or this:
line = line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
See String#gsub for more info

How to test a script that generates files

I am creating a Rubygem that will let me generate jekyll post files. One of the reasons I am developing this project is to learn TDD. This gem is strictly functional on the command line, and it has to make a series of checks to make sure that it finds the _posts directory. This depends on two things:
Whether or not a location option was passed:
  Is that location option valid?
A location option was not passed:
  Is the posts dir in the current directory?
  Is the posts dir the current working directory?
At that point, I am really having a hard time testing that part of the application. So I have two questions:
is it acceptable/okay to skip tests for small parts of the application like the one described above?
If not, how do you test file manipulation in ruby using minitest?
Some projects I've seen implement their command line tools as Command objects (for example: Rubygems and my linebreak gem). These objects are initialized with ARGV and simply have a call or execute method which then starts the whole process. This enables these projects to put their command line applications into a virtual environment. They could, for example, hold the input and output stream objects in instance variables of the command object to make the application independent of STDOUT/STDIN, which makes it possible to test the input/output of the command line application. In the same way, I imagine you could hold your current working directory in an instance variable to make your command line application independent of your real working directory. You could then create a temporary directory for each test and set it as the working directory for your Command object.
And now some code:
require 'pathname'
class MyCommand
  attr_accessor :input, :output, :error, :working_dir
  def initialize(options = {})
    @input = options[:input] ? options[:input] : STDIN
    @output = options[:output] ? options[:output] : STDOUT
    @error = options[:error] ? options[:error] : STDERR
    @working_dir = options[:working_dir] ? Pathname.new(options[:working_dir]) : Pathname.pwd
  end
  # Override the puts method to use the specified output stream
  def puts(output = nil)
    @output.puts(output)
  end
  def execute(arguments = ARGV)
    # Change to the given working directory
    Dir.chdir(working_dir) do
      # Analyze the arguments
      if arguments[0] == '--readfile'
        posts_dir = Pathname.new('posts')
        my_file = posts_dir + 'myfile'
        puts my_file.read
      end
    end
  end
end
# Start the command without mockups if the ruby script is called directly
if __FILE__ == $PROGRAM_NAME
  MyCommand.new.execute
end
Now in your test's setup and teardown methods you could do:
require 'pathname'
require 'tmpdir'
require 'stringio'
def setup
  @working_dir = Pathname.new(Dir.mktmpdir('mycommand'))
  @output = StringIO.new
  @error = StringIO.new
  @command = MyCommand.new(:working_dir => @working_dir, :output => @output, :error => @error)
end
def test_some_stuff
  @command.execute(['--readfile'])
  # ...
end
def teardown
  @working_dir.rmtree
end
(In the example I'm using Pathname, which is a really nice object-oriented file system API from Ruby's standard library, and StringIO, which is useful for mocking STDOUT as it's an IO object that streams into a simple String.)
In the actual test you could now use the @working_dir variable to test for existence or content of files:
path = @working_dir + 'posts' + 'myfile'
path.exist?
path.file?
path.directory?
path.read == "abc\n"
From my experience (and thus this is VERY subjective), I think it's sometimes OK to skip unit testing in areas that are difficult to test. You need to weigh what you get in return against the cost of testing or not testing. My rule of thumb is that the decision not to test a class should be very unusual (fewer than about 1 in 300 classes).
If what you're trying to test is very difficult, because of the dependencies with the file system, I think you could try to extract all the bits that interact with the file system.
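As one possible way to extract the file system part, the directory lookup described in the question could live in a small function that takes the working directory as a parameter, so a test can point it at a temporary directory. The function name, its keyword arguments, and the exact validity rule (the location only has to be an existing directory) are assumptions for this sketch.

```ruby
require "pathname"
require "tmpdir"

# Hypothetical lookup: returns the posts directory as a Pathname, or nil.
# working_dir is injected so tests never depend on the real Dir.pwd.
def find_posts_dir(location: nil, working_dir: Dir.pwd)
  cwd = Pathname.new(working_dir)

  # A location option was passed: it is valid only if it is a directory.
  if location
    candidate = Pathname.new(location).expand_path(cwd)
    return candidate.directory? ? candidate : nil
  end

  # No location option: look for _posts inside the working directory,
  # then check whether the working directory itself is _posts.
  nested = cwd + "_posts"
  return nested if nested.directory?
  return cwd if cwd.basename.to_s == "_posts" && cwd.directory?
  nil
end

# Usage against a throwaway directory:
Dir.mktmpdir("posts-sketch") do |dir|
  Dir.mkdir(File.join(dir, "_posts"))
  p find_posts_dir(working_dir: dir)
  p find_posts_dir(location: "missing", working_dir: dir)  # prints nil
end
```

Each branch can then be covered by a minitest case that builds the directory layout it needs in setup and removes it in teardown.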

command-line ruby scripts accessing a libs folder

I'm trying to create an application that will primarily consist of ruby scripts that will be run from the command-line (cron, specifically). I want to have a libs folder, so I can put encapsulated, reusable classes/modules in there, and be able to access them from any script.
I want to be able to put my scripts into a "bin" folder.
What is the best way to give them access to the libs folder? I know I can add to the load path via command-line argument, or at the top of each command-line script. In PHP, it sometimes made more sense to create a custom .ini file and point the cli to the ini file, so you got them all in one pop.
Anything similar for ruby? Based on your experience, what's the best way to go here?
At the top of each executable in bin, you can put this:
#!/usr/bin/env ruby
$:.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
require 'libfile'
[etc.]
Were you looking for something different?
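If the same two lines end up in many scripts, the path calculation can be written once as a function. This is only a sketch: add_lib_to_load_path is a hypothetical name, and it assumes the bin and lib folders sit side by side as described in the question.

```ruby
# Hypothetical helper: given the path of a script in bin/, put the sibling
# lib/ directory at the front of the load path and return that directory.
def add_lib_to_load_path(script_path)
  lib_dir = File.expand_path(File.join("..", "lib"), File.dirname(script_path))
  $LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)
  lib_dir
end

# In a real script under bin/ this would be called as:
#   add_lib_to_load_path(__FILE__)
#   require 'libfile'
```

Using an absolute path here means the scripts keep working when cron starts them from a different working directory.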
If you turn your application into a Ruby gem and install the gem on your system, you don't even need to put this stuff at the top. The require statement would suffice in that case.
Sean,
There is no way to not have to require a library, that I know of. I guess if you want to personalize your Ruby so much you could "roll your own" using eval.
The script below basically works as the interpreter. You can add your own functions and include libraries. Give the file executable permissions and put it in /usr/bin if you really want. Then just use
$ myruby <source>
Here's the code for a very minimal one. As an example I've included the md5 digest library and created a custom function called md5()
#!/usr/bin/ruby -w
require 'digest/md5';
def executeCode(file)
handle = File.open(file,'r');
for line in handle.readlines()
line = line.strip();
begin
eval(line);
rescue Exception => e
print "Problem with script '" + file + "'\n";
print e.message + "\n";
end
end
end
def checkFile(file)
if !File.exist?(file)
print "No such source file '" + file + "'\n";
exit(1);
elsif !File.readable?(file)
print "Cannot read from source file '" + file + "'\n";
exit(1);
else
executeCode(file);
end
end
# My custom function for our "interpreter"
def md5(key=nil)
if key.nil?
raise "md5 requires 1 parameter, 0 given!\n";
else
return Digest::MD5.hexdigest(key)
end
end
if ARGV[0].nil?
print "No input file specified!\n"
exit(1);
else
checkFile(ARGV[0]);
end
Save that as myruby or myruby.rb and give it executable permissions (755). Now you're ready to create a normal ruby source file
puts "I will now generate a md5 digest for mypass using the md5() function"
puts md5('mypass')
Save that and run it as you would a normal ruby script but with our new interpreter. You'll notice I didn't need to include any libraries or write the function in the source code because it's all defined in our interpreter.
It's probably not the most ideal method, but it's the only one I can come up with.
Cheers
There is a RUBYLIB environment variable that can be set to any folder on the system
If you want to use your classes/modules globally, why not just move them to your main Ruby lib directory? eg: /usr/lib/ruby/1.8/ ?
Eg:
$ cat > /usr/lib/ruby/1.8/mymodule.rb
module HelloWorld
def hello
puts("Hello, World!");
end
end
We have our module in the main lib directory - should be able to
require it from anywhere in the system now.
$ irb
irb(main):001:0> require 'mymodule'
=> true
irb(main):002:0> include HelloWorld
=> Object
irb(main):003:0> hello
Hello, World!
=> nil
