Reading and Writing the same CSV file in Ruby - ruby

I have some processing to do involving a third party API, and I was planning to use a CSV file as a backlog of things to do.
Example
Task to do Resulting file
#1 data/1.json
#2 data/2.json
#3
So, #1 and #2 are already done. I want to work on #3, and save the CSV file as soon as data/3.json is completed.
As the task is unstable and error prone, I want to save progress after each task in the CSV file.
I've written this script in Ruby, it's working well, but as tasks are numerous (> 100k), it's written couple Megabytes to disk each time a task is processed. The whole thing. It seems a good way to kill my HD:
class CSVResolver
require 'csv'
attr_accessor :csv_path
def initialize csv_path:
self.csv_path = csv_path
end
def resolve
csv = CSV.read(csv_path)
csv.each_with_index do |row, index|
next if row[1] # Don't do anything if we've already processed this task, and got a JSON data
json = very_expensive_task_and_error_prone
row[1] = "/data/#{index}.json"
File.write row[1], JSON.pretty_generate(json)
csv[index] = row
CSV.open(csv_path, "wb") do |old_csv|
csv.each do |row|
old_csv << row
end
end
resolve
end
end
end
Is there any way to improve on this, like making the write to CSV file atomic?

I'd use an embedded database for this purpose, such as SQLite or LevelDB.
Unlike a regular database, you'll still get many of the benefits of a CSV file, ie it can be stored in a single file/folder and without any server or permissioning hassle. At the same time, you'll get the benefit of better I/O characteristic than reading and writing a monolithic file upon each update ... the library should be smart enough to be able to index records, minimise changes, and store things in memory while buffering output.

For data persistence you would be, in most cases, best served to select a tool designed for the job, a database. You've already named enough of a reason to not use the hand spun CSV design as it is memory inefficient and proposes more problems then it likely solves. Also, depending on the amount of data you need to process via the 3rd part API, you may want to handle multi-threaded processes where reading/writing to a single file won't work.
You might wanna checkout https://github.com/jeremyevans/sequel

Related

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'
File.open(filename, File::RDWR) do |file|
file.flock(File::LOCK_EX)
puts "content: #{file.read}"
puts 'doing some heavy-lifting now'
sleep(10)
end
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me. So when I need to read/write atomically I (ab)use it to store data as a Hash.

speedup postgresql to add data from text file using Django python script

I am working with server who's configurations are as:
RAM - 56GB
Processor - 2.6 GHz x 16 cores
How to do parallel processing using shell? How to utilize all the cores of processor?
I have to load data from text file which contains millions of entries for example one file contains half million lines data.
I am using django python script to load data in postgresql database.
But it takes lot of time to add data in database even though i have such a good config. server but i don't know how to utilize server resources in parallel so that it takes less time to process data.
Yesterday i had loaded only 15000 lines of data from text file to postgresql and it took nearly 12 hours to do it.
My django python script is as below:
import re
import collections
def SystemType():
filename = raw_input("Enter file Name:")
in_file = file(filename,"r")
out_file = file("SystemType.txt","w+")
for line in in_file:
line = line.decode("unicode_escape")
line = line.encode("ascii","ignore")
values = line.split("\t")
if values[1]:
for list in values[1].strip("wordnetyagowikicategory"):
out_file.write(re.sub("[^\ a-zA-Z()<>\n""]"," ",list))
# Eliminate Duplicate Entries from extracted data using regular expression
def FSystemType():
lines_seen = set()
outfile = open("Output.txt","w+")
infile = open("SystemType.txt","r+")
for line in infile:
if line not in lines_seen:
l = line.lstrip()
# Below reg exp is used to handle Camel Case.
outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
lines_seen.add(line)
infile.close()
outfile.close()
sylist=[]
def create_system_type(stname):
syslist=Systemtype.objects.all()
for i in syslist:
sylist.append(str(i.title))
if not stname in sylist:
slu=slugify(stname)
st=Systemtype()
st.title=stname
st.slug=slu
# st.sites=Site.objects.all()[0]
st.save()
print "one ST added."
if you could express your requirements without the code (not every shell programmer can really read phython), possibly we could help here.
e.g. your report of 12 hours for 15000 lines suggests you have a too-busy "for" loop somewhere, and i'd suggest the nested for
for list in values[1]....
what are you trying to strip? individual characters, whole words? ...
then i'd suggest "awk".
If you are able to work out the precise data structure required by Django, you can load the database tables directly, using the psql "copy" command. You could do this by preparing a csv file to load into the db.
There are any number of reasons why loading is slow using your approach. First of all Django has a lot of transactional overhead. Secondly it is not clear in what way you are running the Django code -- is this via the internal testing server? If so you may have to deal with the slowness of that. Finally what makes a fast database is not normally to do with CPU, but rather fast IO and lots of memory.

Anything external as fast as an array? So I don't need to re-load arrays each time I run scripts

While I am developing my application I need to do tons of math over and over again, tweaking it and running again and observing results.
The math is done on arrays that are loaded from large files. Many megabytes. Not very large but the problem is each time I run my script it first has to load the files into arrays. Which takes a long time.
I was wondering if there is anything external that works similarly to arrays, in that I can know the location of data and just get it. And that it doesn't need to reload everything.
I don't know much about databases except that they seem to not work the way I need to. They aren't ordered and always need to search through everything. It seems. Still a possibility is in-memory databases?
If anyone has a solution it would be great to hear it.
Side question - isn't it just possible to have user entered scripts that my ruby program runs so I can have the main ruby program run indefinitely? I still don't know anything about user entered options and how that would work though.
Use Marshal:
# save an array to a file
File.open('array', 'w') { |f| f.write Marshal.dump(my_array) }
# load an array from file
my_array = File.open('array', 'r') { |f| Marshal.load(f.read) }
Your OS will keep the file cached between saves and loads, even between runs of separate processes using the data.

Caching File in a class variable in Ruby On Rails

In my Rails application I need to query a binary file database on each page load. The query is read only. The file size is 1.4 MB. I have two questions:
1) Does it make sense to cache the File object in a class variable?
def some_controller_action
##file ||= File.open(filename, 'rb')
# binary search in ##file
end
2) Will the cached object be shared across different requests in the same rails process?
If you use a constant in your class, aka
FILE = File.read(filename, 'rb').read
so it gets evaluated at application load time. The fork will happen afterwards, so it will be in the shared memory.
It does make sense. The limitation of this, however, is that if you spawn multiple procsesses for your app, each process will have to cache the 1.4 MB. So the answer to your second question is yes but it will not be shared across multiple processes.

Increasing the Loading Speed of Large Files

There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(#type).chomp("fasta") + "yml"
revDatabase = extractDatabase(#type + "-r").chomp("fasta.reverse") + "yml"
#proteins = Hash.new
#decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
parts = line.split(": ")
#proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
parts = line.split(": ")
#decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part the file into memory before parsing usually goes faster. If the database sizes are small enough this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
...
end
If they're too big to fit into memory, it gets more complicated, you have to setup block reads of data followed by parse, or threaded with separate read and parse threads.
Why not use the solution devised through decades of experience: a database, say SQLlite3?
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
The general problems with threading here are:
The IO will likely be your bottleneck around Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with the problem before. I found the best way was to split the file up into chunks or separate files then spawn threads to read each chunk in at a single time. Once the partitioned files are in memory combining the results should be fast. Here is some information on Threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.

Resources