Parse a huge file (10+ GB) and write content to another one - ruby

I'm trying to use Sphinx Search Server to index a really huge file (around 14 GB).
The file is whitespace separated, one entry per line.
To be able to use it with Sphinx, I need to provide an XML file to the Sphinx server.
How can I do it without killing my computer?
What is the best strategy? Should I try to split the main file into several little files? What's the best way to do that?
Note: I'm doing it in Ruby, but I'm totally open to other hints.
Thanks for your time.

I think the main idea would be to parse the main file line by line while generating a result XML, and every time it gets large enough, feed it to Sphinx. Rinse and repeat.
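A rough sketch of that idea, assuming Sphinx's xmlpipe2 source format and a hypothetical flush_to_sphinx helper that writes one batch out; the batch size and field names are only illustrative:
BATCH_SIZE = 50_000   # tune to whatever your machine handles comfortably

docs = []
doc_id = 0
File.foreach("./my_very_big_file.txt") do |line|
  doc_id += 1
  fields = line.split  # whitespace separated, one entry per line
  # real code would also need to escape XML entities in the field values
  docs << %(<sphinx:document id="#{doc_id}"><content>#{fields.join(' ')}</content></sphinx:document>)
  if docs.size >= BATCH_SIZE
    flush_to_sphinx(docs)  # write this batch out as an xmlpipe2 chunk and index it
    docs.clear
  end
end
flush_to_sphinx(docs) unless docs.empty?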

What parsing do you need to do? If the transformations are restricted to a single input line at a time and are not too complicated, I would use awk instead of Ruby...

I hate it when people don't post their solution after asking a question, so I'll try not to be one of them; hopefully this will help somebody.
I added a simple reader method to the File class and then used it to loop over the file in chunks of a size of my choosing. Quite simple actually, and it works like a charm with Sphinx.
class File
  # New class method: stream a file in fixed-size chunks
  def self.seq_read(file_path, chunk_size = nil)
    open(file_path, "rb") do |f|
      f.each_chunk(chunk_size) do |chunk|
        yield chunk
      end
    end
  end

  # New instance method: yield chunks until EOF
  # (1.kilobyte / 10.megabytes need ActiveSupport; use plain integers like 1024 outside Rails)
  def each_chunk(chunk_size = nil)
    chunk_size ||= 1.kilobyte
    yield read(chunk_size) until eof?
  end
end
Then just use it like this:
source_path = "./my_very_big_file.txt"
CHUNK_SIZE = 10.megabytes

File.seq_read(source_path, CHUNK_SIZE) do |chunk|
  chunk.each_line do |line|
    ...
  end
end

Related

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the result (THAT alone might take a hot second).
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read the file line by line, sort of like how node.js's pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r").each_line do |row|
puts row
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
This way you aren't operating on the entire file contents at once, and you don't need to keep the whole file in memory either if you use readline.
file = File.open("sample.txt", 'r')
while !file.eof?
  line = file.readline
  puts line
end
file.close
Large files are best read with streaming methods like each_line, as shown in the other answer, or with foreach, which opens the file and reads it line by line. So if the process doesn't require having the whole file in memory, you should use the streaming methods. While streaming, the required memory won't grow even as the file size grows, as opposed to non-streaming methods like readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You can't process the file line by line here, because you don't want to process a file, you want to process an Array, and an Array is a strictly in-memory data structure.
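For illustration, a minimal sketch of that Array-based approach; the file names are made up:
lines = []
["big_a.log", "big_b.log"].each do |path|
  lines.concat(File.readlines(path))
end
lines.uniq!                                   # everything happens in memory at once
File.write("combined_unique.log", lines.join)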

Easier way to search through large files in Ruby?

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.
If users have log rollover turned off, the log files can sometimes get quite large. Currently I'm doing the following (though I'm not done with it yet):
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
    if line.match(/\d+-\d+-\d+/)
      etc...
The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.
Thanks
fgrep, as a standalone tool or called via system('fgrep ...'), may be a faster solution.
file.readlines might be better in speed, but it's a time-space tradeoff.
Take a look at this little research - the last approaches seem to be rather fast.
Here are some coding hints...
Instead of:
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
use:
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
foreach simplifies opening and looping over the file.
next unless... makes a tight loop skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.
Using an anchor at the start of your pattern, like \A, gives the regex engine a major hint about where to look in the line and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+-\d+-\d+/] is a bit more concise.
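Putting those hints together, a minimal sketch (using the question's @log_file; the per-line processing is a placeholder):
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
  # the rest of the logic for matching lines goes here
end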
If your log file is sorted by date, then you can avoid having to search through the entire file by doing a binary search. In this case you'd:
Open the file like you are doing
Use seek to fast-forward to the middle of the file.
Check whether the date at the beginning of the line is higher or lower than the date you are looking for.
Continue splitting the file in halves until you find what you need.
I do, however, think your file needs to be very large for the above to make sense.
Edit
Here is some code which shows the basic idea. It finds a line containing the search date, not necessarily the first one. This can be fixed either with more binary searches or by doing a linear search from the last midpoint that did not contain the date. There also isn't a termination condition for the case where the date is not in the file at all. These small additions are left as an exercise to the reader :-)
require 'date'

def bin_fsearch(search_date, file)
  f = File.open file
  search = { min: 0, max: f.size }
  while true
    # jump to the midpoint of the current search window
    f.seek((search[:max] + search[:min]) / 2)
    # read to the end of the current (probably partial) line
    f.gets
    # record the actual mid-point we are using
    pos = f.pos
    # read in the next full line
    line = f.gets
    # get the date from that line
    line_date = Date.parse(line)
    if line_date < search_date
      search[:min] = f.pos
    elsif line_date > search_date
      search[:max] = pos
    else
      f.seek pos
      return
    end
  end
end

bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')
Try this; it will read one line at a time, should be pretty fast, and takes little memory.
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
Another, faster option is to read the whole file into one array. It's fast but takes a LOT of memory.
File.readlines(file).each do |line|
  # do stuff with each line
end
Further, if you need the fastest approach with the least amount of memory, try grep, which is specifically tuned for searching through large files, so it should be fast and memory friendly.
`grep -e regex bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
Faster than reading line by line is reading the file in chunks:
File.open('file.txt') do |f|
  while buff = f.read(10240)
    # ...
  end
end
But since you are using a regexp to match dates, a chunk may end in the middle of a line, and you will have to deal with those incomplete lines in your logic; one way of doing that is sketched below.
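A minimal sketch of one way to handle that, carrying the trailing partial line over into the next chunk so the regexp only ever sees complete lines (same 10 KB chunk size as above):
File.open('file.txt') do |f|
  leftover = ""
  while buff = f.read(10240)
    buff = leftover + buff
    lines = buff.split("\n", -1)
    leftover = lines.pop || ""   # the last piece may be an incomplete line
    lines.each do |line|
      # match the date regexp against a complete line here
    end
  end
  # leftover now holds the final line if the file doesn't end with a newline
end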
Also, if performance is that important, consider writing a really simple C extension.
If the log file can get huge and that is your concern, then maybe you can consider saving the errors in a database. Then you will get faster responses.

Nokogiri builder performance on huge XML?

I need to build a huge XML file, about 1-50 MB. I thought that using a builder would be effective enough, and well, it is, somewhat. The problem is that after the program reaches its last line it doesn't end immediately; Ruby is still doing something for several seconds, maybe garbage collection? After that the program finally ends.
To give a real example, I measured the time of building an XML file. It reports 55 seconds (there is a database behind it, so it takes long) when the XML is built, but Ruby still keeps processing for about 15 more seconds, with the processor going crazy.
The pseudo/real code is as follows:
...
builder = Nokogiri::XML::Builder.with(doc) do |xml|
  build_node(xml)
end
...
def build_node(xml)
  ...
  xml["#{namespace}"] if namespace
  xml.send("#{elem_name}", attrs_hash) do |elem_xml|
    ...
    if has_children
      if type
        case type
        when XML::TextContent::PLAIN
          elem_xml.text text_content
        when XML::TextContent::COMMENT
          elem_xml.comment text_content
        when XML::TextContent::CDATA
          elem_xml.cdata text_content
        end
      else
        build_node(elem_xml)
      end
    end
  end
end
Note that I previously used a different approach with my own class structure; the build speed was the same, but at the last line the program ended normally. Now I am forced to use Nokogiri, so I have to find a solution.
What can I do to avoid that several-second overhead after the XML is built? Is it even possible?
UPDATE:
Thanks to a suggestion from Adiel Mittmann, I was able to locate the problem while creating my minimal working example. I now have a small (well, not that small) example demonstrating it.
The following code is causing the problem:
xml.send("#{elem_name}_") do |elem_xml|
...
elem_xml.text text_content #This line is the problem
...
end
So the line executes the following code, based on Nokogiri's documentation:
def create_text_node string, &block
  Nokogiri::XML::Text.new string.to_s, self, &block
end
The text node creation code then gets executed. So, what exactly is happening here?
UPDATE 2:
After some other tries, the problem can be easily reproduced by:
builder = Nokogiri::XML::Builder.new do |xml|
  0.upto(81900) do
    xml.text "test"
  end
end
puts "End"
So is it really Nokogiri itself? Is there any option for me?
Your example also takes a long time to execute here, and you were right: it's the garbage collector that's taking so long. Try this:
require 'nokogiri'

class A
  def a
    builder = Nokogiri::XML::Builder.new do |xml|
      0.upto(81900) do
        xml.text "test"
      end
    end
  end
end

A.new.a
puts "End1"
GC.start
puts "End2"
Here, the delay happens between "End1" and "End2". After "End2" is printed, the program closes immediately.
Notice that I created an object to demonstrate it. Otherwise, the data generated by the builder can only be garbage collected when the program finishes.
As for the best way to do what you're trying to accomplish, I suggest you ask another question giving details of what exactly you're trying to do with the XML files.
Try using the Ruby built-in (sic) Builder. I use it to generate large XML files as well, and it has such a small footprint.
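For example, a minimal sketch with the Builder gem (installed separately with gem install builder; the element names and output path are only illustrative):
require 'builder'

File.open("output.xml", "w") do |out|
  xml = Builder::XmlMarkup.new(target: out, indent: 2)
  xml.instruct!
  xml.entries do
    xml.entry "some text", id: 1
  end
end
Because the markup is written straight to the target IO, very little of the document has to be kept in memory at once.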

Fastest way to skip lines while parsing files in Ruby?

I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case.
I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file (obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example). Since I know it won't be in the first half, is there a quick way of disregarding that information?
Currently I'm using something along the lines of:
while buffer = file_in.gets and file_in.lineno < 600
  next unless file_in.lineno > 500
  if buffer.chomp!.include? some_string
    do_func_whatever
  end
end
It works, but I just can't help but think it could work better.
I'm very new to Ruby and am interested in learning new ways of doing things in it.
file.lines.drop(500).take(100) # will get you lines 501-600
Generally, you can't avoid reading the file from the start up to the line you are interested in, as each line can have a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach what you are looking for. Pretty much like your own example. You can just make it more Rubyish.
PS. the Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this could be avoided:
file.lines.select.with_index { |l, i| (500...600) === i }
PS2: Doh, the above code, while not building a huge array, iterates through the whole file, even the lines beyond 600. :( Here's a third version:
enum = file.lines
500.times{enum.next} # skip 500
enum.take(100) # take the next 100
or, if you prefer FP:
file.lines.tap{|enum| 500.times{enum.next}}.take(100)
Anyway, the good point of this monologue is that you can learn multiple ways to iterate a file. ;)
I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes.
See IO#seek, or see IO#open for information on the offset argument.
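For instance, a small sketch that skips a known number of bytes with IO#seek before reading lines (the byte offset here is made up; you'd need some way of knowing or estimating it):
File.open("big_file.txt") do |f|
  f.seek(1_000_000, IO::SEEK_SET)  # jump past the first megabyte
  f.gets                           # throw away the (probably partial) line we landed in
  f.each_line do |line|
    # process lines from here on
  end
end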
Sounds like rio might be of help here. It provides you with a lines() method.
You can use IO#readlines, which returns an array with all the lines:
IO.readlines(file_in)[500..600].each do |line|
  # line is each line in the file (including the trailing \n)
  # stuff
end
or
f = File.new(file_in)
f.readlines[500..600].each do |line|
  # line is each line in the file (including the trailing \n)
  # stuff
end

StringScanner scanning IO instead of a string

I've got a parser written using ruby's standard StringScanner. It would be nice if I could use it on streaming files. Is there an equivalent to StringScanner that doesn't require me to load the whole string into memory?
You might have to rework your parser a bit, but you can feed lines from a file to a scanner like this:
require 'strscan'

File.open('filepath.txt', 'r') do |file|
  scanner = StringScanner.new(file.readline)
  until file.eof?
    scanner.scan(/whatever/)
    scanner << file.readline
  end
end
StringScanner was intended for that: to load a big string and move back and forth over it with an internal pointer. If you turn it into a stream, those references get lost; you cannot use unscan, check_until, pre_match, or post_match.
Well, you can, but for that you need to buffer all the previous input.
If you are concerned about the buffer size, then just load the data chunk by chunk and use a simple regexp, or a gem called Parser.
The simplest way is to read a fixed amount of data at a time.
# iterate over fixed-length records
open("fixed-record-file") do |f|
  while record = f.read(1024)
    # parse the record here using a regexp or a parser
  end
end
[Updated]
Even with this loop you can use StringScanner; you just need to update the string with each new chunk of data:
string=(str)
Changes the string being scanned to str and resets the scanner.
Returns str
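Combining the fixed-size read loop above with string=, a minimal sketch (the file name, record size, and token pattern are illustrative; tokens spanning chunk boundaries would still need extra buffering):
require 'strscan'

scanner = StringScanner.new("")
open("fixed-record-file") do |f|
  while record = f.read(1024)
    scanner.string = record          # point the scanner at the new chunk
    until scanner.eos?
      scanner.skip(/\s+/)            # skip any whitespace
      token = scanner.scan(/\S+/)    # grab the next whitespace-separated token
      break unless token
      # handle token here
    end
  end
end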
There is StringIO.
Sorry, I misread your question. Take a look at this; it seems to have streaming options.
