Fastest Way to Parse a Large File in Ruby

I have a simple text file that is ~150 MB. My code reads each line, and if it matches certain regexes, it writes that line to an output file.
But right now, just iterating through all of the lines of the file takes a long time (several minutes), doing it like this:
File.open(filename).each do |line|
  # do some stuff
end
I know that it is the looping over the lines of the file that is taking a while, because even if I do nothing with the data in "# do some stuff", it still takes a long time.
I know that some Unix programs can parse large files like this almost instantly (like grep), so I am wondering why Ruby (MRI 1.9) takes so long to read the file, and whether there is some way to make it faster.

It's not really fair to compare to grep because that is a highly tuned utility that only scans the data, it doesn't store any of it. When you're reading that file using Ruby you end up allocating memory for each line, then releasing it during the garbage collection cycle. grep is a pretty lean and mean regexp processing machine.
You may find that you can achieve the speed you want by using an external program like grep, invoked with system, backticks, or the pipe facility:
`grep ABC bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end

File.readlines(filename).each do |line|
  # do stuff with each line
end
This will read the whole file into one array of lines. It should be a lot faster, but it takes more memory.

You should read it into memory and then parse it. Of course, it depends on what you are looking for. Don't expect miracle performance from Ruby, especially compared to C/C++ programs that have been optimized over the past 30 years ;-)
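For illustration, a minimal sketch of that slurp-then-scan idea (the pattern, file names, and output handling are placeholders, not from the question): read the whole file in one call, then let the regexp engine do the filtering instead of doing per-line work in Ruby.
pattern = /ABC/                     # placeholder pattern
matching_lines = File.read("bigfile").lines.grep(pattern)
File.open("output.txt", "w") { |out| out.puts(matching_lines) }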

Related

Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; i++)); do
  lineN=$(sed -n "$i!d;p;q" "$1")
  # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language above the assembly level) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
  # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, and then read the same file again and again in each loop iteration. With a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so reading it again does not actually require fetching it from the disk again), but the basic fact remains that reading from disk is on the order of 1000x slower than reading from memory, so you want to avoid it when you can. Especially with a large file, the cache will eventually fill up, so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of time simply spent waiting for the disk to deliver the same bytes, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
And make sure you are reasonably familiar with the standard palette of Unix text-processing tools: cut, paste, nl, pr, and so on.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception: when the body of the loop itself makes significant use of built-in shell commands.
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)

Opening file in Ruby

Code sample 1:
def count_lines1(file_name)
  open(file_name) do |file|
    count = 0
    while file.gets
      count += 1
    end
    count
  end
end
Code sample 2:
def count_lines2(file_name)
  file = open(file_name)
  count = 0
  while file.gets
    count += 1
  end
  count
end
I am wondering which is the better way to implement the counting of lines in a file, in terms of good Ruby style.
which is the better way to implement the counting of lines in a file.
Neither. Ruby can do it easily using foreach:
def count_lines(file_name)
  lines = 0
  File.foreach(file_name) { lines += 1 }
  lines
end
If I run that against my ~/.bashrc:
$ ruby test.rb
37
foreach is very fast and will avoid scalability problems.
Alternately, you could take advantage of tools in the OS, such as wc -l which were written specifically for the task:
`wc -l .bashrc`.to_i
which will return 37 again. If the file is huge, wc will likely outrun doing it in Ruby because wc is written in compiled code.
You can also read in large chunks with read and count newline characters.
Yes, read will allow you to do that, but the scalability issue will remain. In my environment read or readlines can be a script killer because we often have to process files well into the tens of GB. There's plenty of RAM to hold the data, but the I/O suffers because of the overhead of slurping the data. "Why is "slurping" a file not a good practice?" goes into this.
An alternate way of reading in big chunks is to tell Ruby to read a set block size, count the line-ends in that block, and loop until the file has been read completely. I didn't test that method in the above linked answer, but in the past I did similar things in Perl and found that it didn't really improve things while requiring a bit more code. At that point, if all I was doing was counting lines, it'd make more sense to call wc -l and let it do the work, as it'd be a lot faster in coding time and most likely in execution time too.
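For illustration, here is a minimal sketch of that block-read approach (the 1 MB chunk size is an arbitrary choice, not something from the answer above):
def count_lines_chunked(file_name, chunk_size = 1024 * 1024)
  lines = 0
  File.open(file_name, "rb") do |file|
    # Read fixed-size chunks and count the newlines in each one.
    while (chunk = file.read(chunk_size))
      lines += chunk.count("\n")
    end
  end
  lines
end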

How to pretty print a 48 GB JSON? (Wikidata)

I'm working with WikiData (a cross-referencing of multiple data sources, including Wikipedia) and they provide a ~50 GB JSON file with no white space. I want to extract certain kinds of data from it, which I could do with grep if it was pretty printed. I'm running on a mac.
Some methods of reformatting, e.g.,
cat ... | python -m json.tool
./jq . filename.json
will not work on a large file: python chokes and jq dies. There was a great thread here: How can I pretty-print JSON in a (unix) shell script? But I'm not sure how/if any of those can deal with large files.
This company uses "Akka streams" to do this very task (they claim <10 minutes to process all Wikidata), but I know nothing about it: http://engineering.intenthq.com/2015/06/wikidata-akka-streams/
Wikidata has a predictable format (https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON), and I am able to accomplish most of my goal by piping through a series of sed and tr commands, but it's clumsy and potentially error-prone, and I'd much prefer to be grepping a pretty-printed version.
Any suggestions?
There are several libraries out there for parsing JSON streams, which I think is what you want—you can pipe the JSON in and deal with it as a stream, which saves you from having to load the whole thing into memory.
Oboe.js looks like a particularly mature project, and the docs are very good. See the "Reading from Node.js streams" and "Loading JSON trees larger than the available RAM" sections on this page: http://oboejs.com/examples
If you'd rather use Ruby, take a look at yajl-ruby. The API isn't quite as simple as Oboe.js's, but it ought to work for you.
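Another option, as a rough sketch rather than a tested solution: the Wikidata dump is documented as putting one entity per line inside the top-level JSON array, so in Ruby you can stream it line by line and pretty-print one entity at a time without any special streaming parser. The file name below is a placeholder.
require 'json'

File.foreach("wikidata-dump.json") do |line|
  # Strip the trailing comma that separates entities, and skip the
  # lines holding the opening "[" and closing "]" of the big array.
  line = line.strip.chomp(",")
  next if line.empty? || line == "[" || line == "]"

  entity = JSON.parse(line)
  puts JSON.pretty_generate(entity)   # one grep-friendly entity at a time
end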
You could try this; it looks like it lets you just pipe in your JSON file and it will output a grep-friendly format...
json-liner

How do I create this File Input and Output assignment in Ruby

I have an assignment that I am not sure what to do. Was wondering if anyone could help. This is it:
Create a program that allows the user to input how many hours they exercised for today. Then the program should output the total of how many hours they have exercised for all time. To allow the program to persist beyond the first run the total exercise time will need to be written and retrieved from a file.
My code is this so far:
myFileObject2 = File.open("exercise.txt")
myFileObject2.read
puts "This is an exercise log. It keeps track of the number of hours of exercise."
hours = gets.to_f
myFileObject2.close
Write your code like:
File.open("exercise.txt", "r") do |fi|
file_content = fi.read
puts "This is an exercise log. It keeps track of the number hours of exercise."
hours = gets.chomp.to_f
end
Ruby's File.open takes a block. When that block exits, File will automatically close the file. Don't use the non-block form unless you are absolutely positive you know why you should do it another way.
chomp the value you get from gets. This is because gets won't return until it sees a trailing END-OF-LINE, which is usually a "\n" on Mac OS and *nix, or "\r\n" on Windows. Failing to remove that with chomp is the cause of much weeping and gnashing of teeth in unaware developers.
The rest of the program is left for you to figure out.
The code will fail if "exercise.txt" doesn't already exist. You need to figure out how to deal with that.
Using read is bad form unless you are absolutely positive the file will always fit in memory, because the entire file is read at once. Once it is in memory, it will be one big string of data, so you'll have to figure out how to break it into an array in order to iterate over it. There are better ways to handle reading than read, so I'd study the IO class, plus read what you can find on Stack Overflow. Hint: Don't slurp your files.
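As a hint for the missing-file case only (a sketch, with the 0.0 default as an assumption; the rest of the assignment is still yours to write): check whether the log exists before trying to read a previous total from it.
# One way to cope with the missing file on the first run (illustrative only;
# the file here holds just one small number, so reading it whole is fine):
previous_total = File.exist?("exercise.txt") ? File.read("exercise.txt").to_f : 0.0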

Increasing the Loading Speed of Large Files

There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough, this could be as simple as:
buffer = File.readlines(database)
buffer.each do |line|
  ...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of the data followed by parsing, or use threading with separate read and parse threads. A sketch of the block-read variant follows.
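For illustration, a rough sketch of that block-read idea against the "KEY: values" format from the question, reusing the database variable from the question's code (the 1 MB chunk size is an arbitrary choice): read large chunks, carry any incomplete trailing line over to the next chunk, and parse complete lines as they appear.
proteins = {}
File.open(database, "rb") do |f|
  leftover = ""
  while (chunk = f.read(1024 * 1024))
    chunk = leftover + chunk
    lines = chunk.split("\n", -1)
    leftover = lines.pop              # possibly incomplete last line
    lines.each do |line|
      key, value = line.split(": ", 2)
      proteins[key] = value
    end
  end
  unless leftover.empty?
    key, value = leftover.split(": ", 2)
    proteins[key] = value
  end
end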
Why not use the solution devised through decades of experience: a database, say SQLite3?
(Although, to be different, I'd first recommend looking at (Ruby) BDB and other "NoSQL" back-end engines, if they fit your needs.)
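For illustration, a minimal sketch of the SQLite route using the sqlite3 gem (the database file, table, and column names are placeholders; database is the file name variable from the question, and "MTMDK" comes from the sample data above): load the "KEY: values" lines once, then look keys up on demand instead of holding a giant Hash in memory.
require 'sqlite3'

db = SQLite3::Database.new("proteins.db")
db.execute("CREATE TABLE IF NOT EXISTS proteins (peptide TEXT PRIMARY KEY, accessions TEXT)")

# One-time load, wrapped in a transaction so the inserts are fast.
db.transaction do
  File.foreach(database) do |line|
    peptide, accessions = line.chomp.split(": ", 2)
    db.execute("INSERT OR REPLACE INTO proteins VALUES (?, ?)", [peptide, accessions])
  end
end

# Later lookups hit the on-disk index instead of an in-memory hash.
row = db.get_first_row("SELECT accessions FROM proteins WHERE peptide = ?", ["MTMDK"])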
If fixed-size records with a deterministic index are used, then you can perform a lazy load of each item through a proxy object. This would be a suitable candidate for mmap. However, this will not speed up the total access time; it merely amortizes the loading over the life-cycle of the program (at least until first use, and if some data is never used then you get the benefit of never loading it). Without fixed-size records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (e.g. a B-tree in an SQL back-end, or whatever BDB uses :-).
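To make the lazy-load idea concrete, here is a rough sketch (not the fixed-size-record scheme described above, but the related "index" variant; it uses the question's "KEY: values" line format, and the class name is illustrative): build a small key-to-byte-offset index up front, then seek and read a line only when its key is actually requested.
class LazyProteinStore
  def initialize(path)
    @path = path
    @offsets = {}
    # First pass: record where each line starts, but don't keep the values.
    File.open(path, "r") do |f|
      until f.eof?
        offset = f.pos
        key = f.readline.split(": ", 2).first
        @offsets[key] = offset
      end
    end
  end

  # Fetch a single entry on demand by seeking straight to its line.
  def [](key)
    offset = @offsets[key]
    return nil unless offset
    File.open(@path, "r") do |f|
      f.seek(offset)
      f.readline.chomp.split(": ", 2).last
    end
  end
end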
The general problems with threading here are:
The IO will likely be your bottleneck when using Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby, but I have had to deal with this problem before. I found the best way was to split the file up into chunks or separate files, then spawn threads to read the chunks in parallel. Once the partitioned files are in memory, combining the results should be fast. Here is some information on threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.

Resources