Caching a file in a class variable in Ruby on Rails

In my Rails application I need to query a binary file database on each page load. The query is read only. The file size is 1.4 MB. I have two questions:
1) Does it make sense to cache the File object in a class variable?
def some_controller_action
  @@file ||= File.open(filename, 'rb')
  # binary search in @@file
end
2) Will the cached object be shared across different requests in the same rails process?

If you use a constant in your class, e.g.
FILE = File.open(filename, 'rb').read
it gets evaluated at application load time. The fork happens afterwards, so the data sits in memory shared between the worker processes.
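A minimal sketch of that approach (the class name and file path are made up for illustration):

class BinaryLookup
  # Read once when the class is loaded at boot, before the workers fork.
  DATA = File.binread(Rails.root.join('db', 'lookup.bin')).freeze

  def self.search(key)
    # binary search over DATA goes here
  end
end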

It does make sense. The limitation, however, is that if you spawn multiple processes for your app, each process has to cache its own copy of the 1.4 MB. So the answer to your second question is yes: the object is shared across requests within one process, but it will not be shared across multiple processes.

Related

Reading and Writing the same CSV file in Ruby

I have some processing to do involving a third party API, and I was planning to use a CSV file as a backlog of things to do.
Example
Task to do    Resulting file
#1            data/1.json
#2            data/2.json
#3
So, #1 and #2 are already done. I want to work on #3, and save the CSV file as soon as data/3.json is completed.
As the task is unstable and error prone, I want to save progress after each task in the CSV file.
I've written this script in Ruby and it works well, but since the tasks are numerous (> 100k), it rewrites the whole file, a couple of megabytes, to disk every time a task is processed. That seems like a good way to kill my HD:
require 'csv'
require 'json'

class CSVResolver
  attr_accessor :csv_path

  def initialize(csv_path:)
    self.csv_path = csv_path
  end

  def resolve
    csv = CSV.read(csv_path)
    csv.each_with_index do |row, index|
      next if row[1] # Don't do anything if we've already processed this task and got JSON data
      json = very_expensive_task_and_error_prone
      row[1] = "/data/#{index}.json"
      File.write row[1], JSON.pretty_generate(json)
      csv[index] = row
      # Rewrite the entire CSV so progress survives a crash on the next task
      CSV.open(csv_path, "wb") do |old_csv|
        csv.each do |r|
          old_csv << r
        end
      end
      resolve
    end
  end
end
Is there any way to improve on this, like making the write to CSV file atomic?
I'd use an embedded database for this purpose, such as SQLite or LevelDB.
Unlike a regular database, you'll still get many of the benefits of a CSV file, i.e. it can be stored in a single file/folder without any server or permissioning hassle. At the same time, you'll get better I/O characteristics than reading and writing a monolithic file on each update: the library should be smart enough to index records, minimise changes, and buffer output in memory.
For data persistence you would, in most cases, be best served by selecting a tool designed for the job: a database. You've already named reason enough not to use the hand-spun CSV design, as it is memory-inefficient and creates more problems than it likely solves. Also, depending on the amount of data you need to process via the 3rd-party API, you may want multi-threaded processing, where reading/writing a single file won't work.
You might want to check out https://github.com/jeremyevans/sequel
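For illustration, a rough sketch of what that could look like with Sequel on SQLite (the table and column names are made up, very_expensive_task_and_error_prone stands in for the real work, and the sqlite3 gem is assumed to be installed):

require 'sequel'
require 'json'

DB = Sequel.sqlite('backlog.db')

# One row per task; result_path stays NULL until the task is done.
DB.create_table?(:tasks) do
  primary_key :id
  String :result_path
end

DB[:tasks].where(result_path: nil).all.each do |task|
  json = very_expensive_task_and_error_prone
  path = "data/#{task[:id]}.json"
  File.write(path, JSON.pretty_generate(json))
  # Only this one row is updated, instead of rewriting the whole file.
  DB[:tasks].where(id: task[:id]).update(result_path: path)
end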

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'
File.open(filename, File::RDWR) do |file|
  file.flock(File::LOCK_EX) # blocks until the other process releases its lock
  puts "content: #{file.read}"
  puts 'doing some heavy-lifting now'
  sleep(10)
end
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me, so when I need to read/write atomically I (ab)use it to store data as a Hash.
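For example (the file name and key are arbitrary):

require 'yaml/store'

store = YAML::Store.new('data.yml')

# Writes go inside a transaction; the store holds an exclusive lock on the file
# for the duration, so concurrent processes can't interleave.
store.transaction do
  store['counter'] = (store['counter'] || 0) + 1
end

# A read-only transaction takes a shared lock instead.
store.transaction(true) do
  puts store['counter']
end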

Anything external as fast as an array? So I don't need to re-load arrays each time I run scripts

While I am developing my application I need to do tons of math over and over again, tweaking it and running again and observing results.
The math is done on arrays loaded from large files: many megabytes. Not huge, but the problem is that each time I run my script it first has to load the files into arrays, which takes a long time.
I was wondering if there is anything external that works similarly to arrays, in that I can know the location of a piece of data and just fetch it, without having to reload everything.
I don't know much about databases except that they don't seem to work the way I need: they aren't ordered and always seem to have to search through everything. Would an in-memory database still be a possibility?
If anyone has a solution it would be great to hear it.
Side question: wouldn't it be possible to have user-entered scripts that my Ruby program runs, so the main Ruby program can run indefinitely? I still don't know anything about user-entered options and how that would work, though.
Use Marshal:
# save an array to a file (binary mode, since Marshal data is binary)
File.open('array', 'wb') { |f| f.write Marshal.dump(my_array) }
# load an array from the file
my_array = File.open('array', 'rb') { |f| Marshal.load(f) }
Your OS will keep the file cached between saves and loads, even between runs of separate processes using the data.

memory leaking when using aws/s3 gem for AWS::S3::Logging::Log

I have a ruby script which processes S3 logs something like this:
AWS::S3::Bucket.find(data_bucket).logs.each do |s3log|
  s3log.lines.each do |line|
    # accumulate stuff
  end
end
However, in the aws/s3 gem the :lines accessor memoizes the data. So memory grows as I process each file.
Aside from hacking the gem, capping the files read each run, and/or running the script frequently, how can I gracefully avoid a Ruby process that might grow to several GB? Am I missing a memory-management trick?
The root of this is really Bucket.objects, which is what Bucket.logs uses, and which keeps references to all the objects in #object_cache. The caching of lines is a subsidiary problem.
Solution: I simply use two loops: the inner one fetches and processes logs from S3 ten at a time, and the outer one repeats until is_truncated is false. This means we never keep references to more than ten S3 log files at a time, and my memory problem is gone.
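Roughly, the shape of that fix (a sketch only: it assumes the aws/s3 gem's Bucket.objects passes :max_keys and :marker through to the S3 listing call, and it stops when a batch comes back empty rather than checking is_truncated):

marker = nil
loop do
  batch = AWS::S3::Bucket.objects(data_bucket, :max_keys => 10, :marker => marker)
  break if batch.empty?
  batch.each do |object|
    object.value.each_line do |line|
      # accumulate stuff
    end
  end
  marker = batch.last.key # only the current batch of ten stays referenced
end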

Why is curl in Ruby slower than command-line curl?

I am trying to download more than 1M pages (URLs ending with a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:
require 'curb'

curl = Curl::Easy.new
batch_urls.each { |url_info|
  curl.url = url_info[:url]
  curl.perform
  file = File.new(url_info[:file], "wb")
  file << curl.body_str
  file.close
  # ... some other stuff
}
I have tried to download a sample of 8000 pages. When using the code above, I get 1000 in 2 minutes. When I write all URLs into a file and do this in the shell:
cat list | xargs curl
I get all 8000 pages in two minutes.
The thing is, I need to have it in Ruby code, because there is other monitoring and processing code.
I have tried:
Curl::Multi - it is somewhat faster, but misses 50-90% of files (it does not download them and gives no reason/error code)
multiple threads with Curl::Easy - around the same speed as single-threaded
Why is a reused Curl::Easy slower than subsequent command-line curl calls, and how can I make it faster? Or what am I doing wrong?
I would prefer to fix my download manager code than to make downloading for this case in a different way.
Before this, I was calling command-line wget, which I provided with a file containing the list of URLs. However, not all errors were handled, and it was also not possible to specify an output file for each URL separately when using a URL list.
Now it seems to me that the best way would be to use multiple threads with system calls to the 'curl' command. But why do that, when I can use Curl directly in Ruby?
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them to various values; it did not seem to help)
Any hints appreciated.
This could be a fitting task for Typhoeus
Something like this (untested):
require 'typhoeus'

def write_file(filename, data)
  file = File.new(filename, "wb")
  file.write(data)
  file.close
  # ... some other stuff
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
  req = Typhoeus::Request.new(url_info[:url])
  req.on_complete do |response|
    write_file(url_info[:file], response.body)
  end
  hydra.queue req
end

hydra.run
Come to think of it, you might run into memory problems because of the enormous amount of files. One way to prevent that is to never store the data in a variable but instead stream it straight to the file. You could use em-http-request for that.
EventMachine.run {
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| print chunk }
  # ...
}
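And a variant that streams straight to disk instead of printing (the file name is illustrative; it assumes em-http-request's stream/callback/errback hooks):

require 'em-http-request'

EventMachine.run {
  file = File.open('page.html', 'wb')
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| file.write(chunk) } # chunks go to disk, not into a Ruby string
  http.callback { file.close; EventMachine.stop }
  http.errback  { file.close; EventMachine.stop }
}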
So, if you don't set an on_body handler then curb will buffer the download. If you're downloading files you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface; a streaming on_body sketch follows the examples below.
require 'rubygems'
require 'curb'

urls_to_download = [
  'http://www.google.com/',
  'http://www.yahoo.com/',
  'http://www.cnn.com/',
  'http://www.espn.com/'
]
path_to_files = [
  'google.com.html',
  'yahoo.com.html',
  'cnn.com.html',
  'espn.com.html'
]

Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}
If you just want to download a single file:
Curl::Easy.download('http://www.yahoo.com/')
Here is a good resource: http://gist.github.com/405779
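And a rough sketch of the on_body route for a single URL, streaming to disk instead of buffering in body_str (the file name and URL are illustrative; curb expects the on_body handler to return the number of bytes it consumed):

require 'curb'

file = File.open('page.html', 'wb')
c = Curl::Easy.new('http://www.example.com/')
c.on_body { |data| file.write(data); data.size } # return the bytes handled
c.perform
file.close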
There have been benchmarks comparing curb with other methods such as HTTPClient. The winner, in almost all categories, was HTTPClient. Plus, there are documented scenarios where curb does NOT work in multi-threaded situations.
Like you, I've had the same experience. I ran system commands to curl in 20+ concurrent threads and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.
I've since switched to HTTPClient, and the difference is huge. Now it runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
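For reference, the rough shape of that setup (illustrative only: a shared HTTPClient instance, which is generally described as thread-safe, fed from a queue by 20 worker threads):

require 'httpclient'
require 'thread'

client = HTTPClient.new
queue = Queue.new
batch_urls.each { |url_info| queue << url_info }

threads = Array.new(20) do
  Thread.new do
    # pop(true) raises when the queue is empty, which ends the worker
    while (url_info = (queue.pop(true) rescue nil))
      File.open(url_info[:file], 'wb') do |f|
        f.write(client.get_content(url_info[:url]))
      end
    end
  end
end
threads.each { |t| t.join }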
First let me say that I know almost nothing about Ruby.
What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more.
Have you tried profiling your code to see where most of the time is being spent?
Stiivi,
any chance that Net::HTTP would suffice for simple downloading of HTML pages?
You didn't specify a Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing the CPUs anyway.
Spawn as many processes as the machine has memory for, and limit the reliance on threads.
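A bare-bones sketch of that (illustrative; it splits the URL list across forked workers and lets each one run its own downloads):

workers = 8 # tune to what the machine's memory allows
slice_size = (batch_urls.size / workers.to_f).ceil

batch_urls.each_slice(slice_size) do |slice|
  fork do
    slice.each do |url_info|
      # download url_info[:url] into url_info[:file] with your client of choice
    end
  end
end
Process.waitall # wait for every worker to exit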
