Anything external as fast as an array? So I don't need to re-load arrays each time I run scripts - ruby

While developing my application I need to run lots of math over and over again, tweaking it, re-running the script, and observing the results.
The math is done on arrays loaded from files that are many megabytes in size. That is not huge, but the problem is that every time I run the script it first has to load the files into arrays, which takes a long time.
I was wondering whether there is anything external that works like an array, in that I know the location of a piece of data and can fetch it directly, without having to reload everything each run.
I don't know much about databases, but my impression is that they don't work the way I need: they aren't ordered and seem to search through everything. Would an in-memory database still be a possibility?
If anyone has a solution it would be great to hear it.
Side question: wouldn't it also be possible to have user-entered scripts that my Ruby program runs, so the main Ruby program can keep running indefinitely? I don't know anything yet about user-entered options and how that would work, though.

Use Marshal:
# save an array to a file (binary mode, since Marshal output is binary)
File.open('array', 'wb') { |f| Marshal.dump(my_array, f) }
# load an array from the file
my_array = File.open('array', 'rb') { |f| Marshal.load(f) }
Your OS will keep the file cached between saves and loads, even between runs of separate processes using the data.
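For the "don't re-load every run" use case, you can wrap this in a small cache helper. Here is a rough sketch, assuming a hypothetical load_arrays_from_text method standing in for your slow parsing and arbitrary file names; the Marshal dump is rebuilt only when the text file is newer than the cache:
# Hypothetical slow loader; replace with your real parsing code.
def load_arrays_from_text(path)
  File.readlines(path).map { |line| line.split.map(&:to_f) }
end

def cached_array(text_path, cache_path)
  if File.exist?(cache_path) && File.mtime(cache_path) >= File.mtime(text_path)
    # Fast path: deserialize the existing Marshal dump (binary mode matters).
    File.open(cache_path, 'rb') { |f| Marshal.load(f) }
  else
    data = load_arrays_from_text(text_path)
    File.open(cache_path, 'wb') { |f| Marshal.dump(data, f) }
    data
  end
end

my_array = cached_array('big_data.txt', 'big_data.marshal')
Deserializing a Marshal dump is generally much faster than re-parsing the original text, though the exact gain depends on the data.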

Related

Reading and Writing the same CSV file in Ruby

I have some processing to do involving a third-party API, and I was planning to use a CSV file as a backlog of things to do.
Example:

Task to do      Resulting file
#1              data/1.json
#2              data/2.json
#3
So, #1 and #2 are already done. I want to work on #3, and save the CSV file as soon as data/3.json is completed.
As the task is unstable and error-prone, I want to save progress after each task in the CSV file.
I've written this script in Ruby and it works well, but because the tasks are numerous (> 100k), it rewrites the whole file, a couple of megabytes, to disk every time a task is processed. That seems like a good way to kill my HD:
class CSVResolver
  require 'csv'
  require 'json'

  attr_accessor :csv_path

  def initialize(csv_path:)
    self.csv_path = csv_path
  end

  def resolve
    csv = CSV.read(csv_path)
    csv.each_with_index do |row, index|
      next if row[1] # Skip if we've already processed this task and have its JSON data

      json = very_expensive_task_and_error_prone
      row[1] = "/data/#{index}.json"
      File.write row[1], JSON.pretty_generate(json)
      csv[index] = row
      # Rewrite the entire CSV after every single task
      CSV.open(csv_path, "wb") do |old_csv|
        csv.each do |csv_row|
          old_csv << csv_row
        end
      end
      resolve
    end
  end
end
Is there any way to improve on this, like making the write to CSV file atomic?
I'd use an embedded database for this purpose, such as SQLite or LevelDB.
Unlike with a regular database server, you still get many of the benefits of a CSV file, i.e. it can be stored in a single file/folder without any server or permissioning hassle. At the same time, you get better I/O characteristics than reading and writing a monolithic file on each update: the library is smart enough to index records, minimise the changes it writes, and keep things in memory while buffering output.
For data persistence you would, in most cases, be best served by a tool designed for the job: a database. You've already named reason enough not to use the hand-spun CSV design, as it is memory-inefficient and creates more problems than it solves. Also, depending on how much data you need to process via the third-party API, you may want multi-threaded processing, where reading/writing a single file won't work.
You might want to check out https://github.com/jeremyevans/sequel
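For illustration, here is a rough sketch of the SQLite route using the sqlite3 gem (the file names are assumptions, and very_expensive_task_and_error_prone is the question's placeholder). Each completed task becomes a single small UPDATE instead of rewriting the whole file:
require 'sqlite3'
require 'json'

db = SQLite3::Database.new('tasks.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS tasks (
    id          INTEGER PRIMARY KEY,
    result_path TEXT
  )
SQL
# (The tasks table is assumed to have been seeded once, e.g. imported from the existing CSV.)

# Pending tasks are the ones with no result yet.
db.execute('SELECT id FROM tasks WHERE result_path IS NULL').each do |(id)|
  json = very_expensive_task_and_error_prone   # the question's flaky API call
  path = "data/#{id}.json"
  File.write(path, JSON.pretty_generate(json))
  # Progress is recorded as one single-row UPDATE, durable immediately,
  # instead of a rewrite of the whole CSV.
  db.execute('UPDATE tasks SET result_path = ? WHERE id = ?', [path, id])
end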

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'

File.open(filename, File::RDWR) do |file|
  file.flock(File::LOCK_EX)
  puts "content: #{file.read}"
  puts 'doing some heavy-lifting now'
  sleep(10)
end
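The demo above only reads while holding the lock. A sketch of the read-process-write side under the same exclusive lock (the appended line is just an illustrative "process" step):
filename = 'test.txt'

File.open(filename, File::RDWR | File::CREAT) do |file|
  file.flock(File::LOCK_EX)      # blocks until the other process releases its lock
  data = file.read
  data += "another line\n"       # do your processing here
  file.rewind
  file.write(data)
  file.flush
  file.truncate(file.pos)        # drop leftover bytes in case the content shrank
end                              # lock is released when the file is closed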
Take a look at the transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me, so when I need to read and write atomically I (ab)use it to store the data as a Hash.
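For completeness, a minimal YAML::Store example (the file name and keys are arbitrary). transaction takes a lock on the backing file for the duration of the block, and with ultra_safe enabled it also tries to write via a temp file and rename:
require 'yaml/store'

store = YAML::Store.new('database.yml')
store.ultra_safe = true   # prefer temp-file-and-rename over rewriting in place

store.transaction do               # exclusive lock for read-write access
  store['visits'] ||= 0
  store['visits'] += 1
end

count = store.transaction(true) { store['visits'] }   # read-only, shared lock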

get_dir_file_info() hangs when run on a large directory

I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files it would remove and how much disk space that would free up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I've stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit (0);
echo "working<br />";
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds, then displays only the first message “working” and not the following message “still working”.
It doesn't seem to be a system/PHP memory problem, as the page does come back after 60 seconds, and the server respects my set_time_limit(), which I've had to use for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?
From the CI user guide, get_dir_file_info() is described as:
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you have 100k files, the best way to handle it is to cut the work into two steps:
First: use get_filenames('path/to/directory/') to retrieve all your file names without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info, as you might not need all of the file information immediately; it can be done when a file name is clicked, or at some other relevant point.
The idea is not to force your server to deal with a large amount of processing while in production; that would kill both responsiveness and performance.

How can I manipulate a local database with Perl?

I'm a Perl programmer with some nice scripts that fetch HTTP pages (from a text-file list of URLs) with cURL and save them to a folder.
However, the number of pages to get is in the tens of millions. Sometimes the script fails around page 170,000 and I have to start it again manually. It automatically reads each URL, checks whether the page has already been downloaded, and skips it if so. But with a few hundred thousand pages already done, it still takes a few hours to skip back up to where it left off. Obviously, this is not going to pan out in the end.
I've been told that instead of saving to a text file, which is hard to search and modify, I should use a database. I don't know much about databases; I just messed around with MySQL on a school server a year ago. I just need the ability to add millions of rows and a few static columns, search/modify one row quickly, and do this all locally on a LAN (or on a single computer if that's difficult). And of course, I need to access this database using Perl.
Where should I start? What do I need to download to get a server started on Windows? Which Perl modules should I use? (I'm using an ActiveState distro)
There are many sorts of databases, but if you've already decided on an SQL database and want to keep setup easy, you might want to have a look at SQLite and the DBI/DBD::SQLite modules, which let you use it from Perl.
Since you only need to search on one column, you may wish to consider a key/value store database like Berkeley DB, via either the BerkeleyDB or DB_File module.
Generally, you can think of these key/value databases as Perl hashes that operate from disk rather than memory. Exact key lookups are very fast. Everything else requires scanning the whole dataset.
Look into DBI. If you do not like SQL in your programs, try SQL::Abstract.

Increasing the Loading Speed of Large Files

There are two large text files (millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"

@proteins = Hash.new
@decoyProteins = Hash.new

File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end

File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
The files look like the example below. They started off as YAML, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the files and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough, this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
  ...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of the data followed by parsing, or use separate read and parse threads.
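As a rough sketch of the block-read idea (the chunk size and helper name are arbitrary), read fixed-size blocks and carry any partial trailing line over to the next block:
CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB per read; tune for your machine

def each_chunked_line(path)
  leftover = ''
  File.open(path, 'r') do |f|
    while (chunk = f.read(CHUNK_SIZE))
      chunk = leftover + chunk
      lines = chunk.split("\n", -1)
      leftover = lines.pop || ''          # the last piece may be an incomplete line
      lines.each { |line| yield line }
    end
  end
  yield leftover unless leftover.empty?   # final line without a trailing newline
end

proteins = {}
each_chunked_line(database) do |line|     # database is the path variable from the question
  key, value = line.split(': ', 2)
  proteins[key] = value
end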
Why not use the solution devised through decades of experience: a database, say SQLite3?
(Or, to be different, I'd first recommend looking at (Ruby) BDB and other "NoSQL" back-end engines, if they fit your need.)
If fixed-size records with a deterministic index are used, then you can perform a lazy load of each item through a proxy object. This would be a suitable candidate for mmap. However, it will not speed up the total access time; it merely amortizes the loading over the life-cycle of the program (at least until first use, and if some data is never used you get the benefit of never loading it). Without fixed-size records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (e.g. a B-tree in an SQL back-end, or whatever BDB uses :-).
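Here is a sketch of the fixed-size-record idea (the record width and file name are assumptions; the question's lines are variable-length, so they would first need padding or a separate offset index). Record n is then one seek away:
RECORD_SIZE = 128   # assumed fixed width per record, padded with spaces

class LazyRecordFile
  def initialize(path)
    @file = File.open(path, 'rb')
  end

  # Record n lives at byte offset n * RECORD_SIZE, so no scanning is needed.
  def [](index)
    @file.seek(index * RECORD_SIZE)
    record = @file.read(RECORD_SIZE)
    record && record.rstrip
  end
end

records = LazyRecordFile.new('proteins.dat')
puts records[42]   # loads only that one record, nothing else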
The general problems with threading here are:
The IO will likely be your bottleneck with Ruby's "green" threads.
You still need all the data before use.
You may be interested in the Widefinder Project, which, in general terms, was about "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with this problem before. I found the best way was to split the file up into chunks or separate files, then spawn threads to read the chunks in at the same time. Once the partitioned files are in memory, combining the results should be fast. Here is some information on threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
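A rough sketch of that split-and-merge idea (the chunk file names are assumptions, e.g. the output of a prior split -l). Note that on MRI the interpreter lock limits the gain for CPU-bound parsing, so this helps mostly when IO dominates:
paths = ['chunk_aa', 'chunk_ab', 'chunk_ac']   # pre-split pieces of the big file

threads = paths.map do |path|
  Thread.new do
    part = {}
    File.foreach(path) do |line|
      key, value = line.split(': ', 2)
      part[key] = value
    end
    part                      # the thread's return value is its partial hash
  end
end

# Wait for every thread and merge the partial hashes.
proteins = threads.map(&:value).reduce({}, :merge)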
Hope that helps.
