Parsing a big string in Ruby - ruby

I have a file of a few hundred megabytes containing strings:
str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2
where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.
Each line has in "\n" in it. I have a list of strings str2 and str4.
I want to find the corresponding numbers for those strings.
What I'm doing is pretty straightforward (and, probably, not efficient performance-wise):
source_str = read_from_file() # source_str contains all file content of a few hundred Megabyte
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
index = source_str.index(x)
if index
a = source_str[index .. index + x.length] # a contains "str2"
#?? how do I "select" xx1 and xx2 ??
# and finally...
# res << num1
# res << num2
end
end
Note that I can't apply source_str.split("\n") due to the error ArgumentError: invalid byte sequence in UTF-8 and I can't fix it by changing a file in any way. The file can't be changed.

You want to avoid reading a hundred of megabytes into memory, as well as scanning them repeatedly. This has the potential of taking forever, while clogging the machine's available memory.
Try to re-frame the problem, so you can treat the large input file as a stream, so instead of asking for each string you want to find "does it exist in my file?", try asking for each line in the file "does it contain a string I am looking for?".
str_to_find = [str2, str4]
numbers = []
File.foreach('foo.txt') do |li|
columns = li.split
numbers += columns[2] if str_to_find.include?(columns.shift)
end
Also, read again #theTinMan's answer regarding the file encoding - what he is suggesting is that you may be able fine-tune the reading of the file to avoid the error, without changing the file itself.
If you have a very large number of items in str_to_find, I'd suggest that you use a Set instead of an Array for better performance:
str_to_find = [str1, str2, ... str5000].to_set

If you want to find a line in a text file, which it sounds like you are reading, then read the file line-by-line.
The IO class has the foreach method, which makes it easy to read a file line-by-line, which also makes it possible to easily locate lines that contain the particular string you want to find.
If you had your source input file saved as "foo.txt", you could read it using something like:
str2 = 'some value'
str4 = 'some other value'
numbers = []
File.foreach('foo.txt') do |li|
numbers << li.split[2] if li[str2] || li[str2]
end
At the end of the loop numbers should contain the numbers you want.
You say you're getting an encoding error, but you don't give us any clue what the characters are that are causing it. Without that information we can't really help you fix that problem except to say you need to tell Ruby what the file encoding is. You can do that when the file is opened; You'd properly set the open_args to whatever the encoding should be. Odds are good it should be an encoding of ISO-8859-1 or Win-1252 since those are very common with Windows machines.
I have to find a list of values, iterating through each line doesn't seem sensible because I'd have to iterate for each value over and over again.
We can only work with the examples you give us. Since that wasn't clearly explained in your question you got an answer based on what was initially said.
Ruby's Regexp has the tools necessary to make this work, but to do it correctly requires taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in ruby?" for more information.
Note that this will allow you to scan through a huge string in memory, however that is still not a good way to process what you are talking about. I'd use a database instead, which are designed for this sort of task.

Related

Ruby modify file instead of creating new file

Say I have the following Ruby code which, given a hash of insert positions, reads a file and creates a new file with extra text inserted at those positions:
insertpos = {14=>25,16=>25}
File.open('file.old', 'r') do |oldfile|
File.open('file.new', 'w') do |newfile|
oldfile.each_with_index do |line,linenum|
inserthere = insertpos[linenum]
if(!inserthere.nil?)then
line.insert(inserthere,"foo")
end
newfile.write(line)
end
end
end
Now, instead of creating that new file, I would like to modify this original (old) file. Can someone give me a hint on how to modify the code? Thanks!
At a very fundamental level, this is an extremely difficult thing to do, in any language, on any operating system. Envision a file as a contiguous series of bytes on disk (this is a very simplistic scenario, but it serves to illustrate the point). You want to insert some bytes in the middle of the file. Where do you put those bytes? There's no place to put them! You would have to basically "shift" the existing bytes after the insertion point "down" by the number of bytes you want to insert. If you're inserting multiple sections into an existing file, you would have to do this multiple times! It will be extremely slow, and you will run a high risk of corrupting your data if something goes awry.
You can, however, overwrite existing bytes, and/or append to the end of the file. Most Unix utilities give the appearance of modifying files by creating new files and swapping them with the old. Some more sophisticated schemes, such as those used by databases, allow inserts in the middle of files by 1. reserving space for such operations (when the data is first written), 2. allowing non-contiguous blocks of data within the file through indexing and other techniques, and/or 3. copy-on-write schemes where a new version of the data is written to the end of the file and the old version is invalidated by overwriting an indicator of some kind. You are most likely not wanting to go through all this trouble for your simple use case!
Anyway, you've already found the best way to do what you're trying to do. The only thing you're missing is a FileUtils.mv('file.new', 'file.old') at the very end to replace the old file with the new. Please let me know in the comments if I can help explain this any further.
(Of course, you can read the entire file into memory, make your changes, and overwrite the old file with the updated contents, but I don't believe that's what you're asking here.)
Here's something that hopefully solves your purpose:
# 'source' param is a string, the entire source text
# 'lines' param is an array, a list of line numbers to insert after
# 'new' param is a string, the text to add
def insert(source, lines, new)
results = []
source.split("\n").each_with_index do |line, idx|
if lines.include?(idx)
results << (line + new)
else
results << line
end
end
results.join("\n")
end
File.open("foo", "w") do |f|
10.times do |i|
f.write("#{i}\n")
end
end
puts "initial text: \n\n"
txt = File.read("foo")
puts txt
puts "\n\n after inserting at lines 1,3, and 5: \n\n"
result = insert(txt, [1,3,5], "\nfoo")
puts result
Running this shows:
initial text:
0
1
2
3
4
5
6
7
8
9
after inserting at lines 1,3, and 5:
0
1
foo
2
3
foo
4
5
foo
6
7
8
If its a relatively simple operation you can do it with a ruby one-liner, like this
ruby -i -lpe '$_.reverse!' thefile.txt
(found e.g. at https://gist.github.com/KL-7/1590797).

Deleting contents of file after a specific line in ruby

Probably a simple question, but I need to delete the contents of a file after a specific line number? So I wan't to keep the first e.g 5 lines and delete the rest of the contents of a file. I have been searching for a while and can't find a way to do this, I am an iOS developer so Ruby is not a language I am very familiar with.
That is called truncate. The truncate method needs the byte position after which everything gets cut off - and the File.pos method delivers just that:
File.open("test.csv", "r+") do |f|
f.each_line.take(5)
f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
I'm not aware of any methods to delete from a file so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File#readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range for the lines you want
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing.

Ruby - Files - gets method

I am following Wicked cool ruby scripts book.
here,
there are two files, file_output = file_list.txt and oldfile_output = file_list.old. These two files contain list of all files the program went through and going to go through.
Now, the file is renamed as old file if a 'file_list.txt' file exists .
then, I am not able to understand the code.
Apparently every line of the file is read and the line is stored in oldfile hash.
Can some one explain from 4 the line?
And also, why is gets used here? why cant a .each method be used to read through every line?
if File.exists?(file_output)
File.rename(file_output, oldfile_output)
File.open(oldfile_output, 'rb') do |infile|
while (temp = infile.gets)
line = /(.+)\s{5,5}(\w{32,32})/.match(temp)
puts "#{line[1]} ---> #{line[2]}"
oldfile_hash[line[1]] = line[2]
end
end
end
Judging from the redundant use of quantifiers ({5,5} and {32,32}) in the regex (which would be better written as {5}, {32}), it looks like the person who wrote that code is not a professional Ruby programmer. So you can assume that the choice taken in the code is not necessarily the best.
As you pointed out, the code could have used each instead of while with gets. The latter approach is sort of an old-school Ruby way of doing it. There is nothing wrong in using it. Until the end of file is reached, gets will return a string, and when it does reach the end of file, gets will return nil, so the while loop works as the same when you use each; in each iteration, it reads the next line.
It looks like each line is supposed to represent a key-value pair. The regex assumes that the key is not an empty string, and that the key and the value are separated by exactly five spaces, and the the value consists of exactly thirty-two letters. Each key-value pair is printed (perhaps for monitoring the progress), and is stored in oldfile_hash, which is most likely a hash.
So the point of using .gets is to tell when the file is finished being read. Essentially, it's tied to the
while (condition)
....
end
block. So gets serves as a little method that will keep giving ruby the next line of the file until there is no more lines to give.

Easier way to search through large files in Ruby?

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.
If users have log roll over turned off, the log files can sometimes get quite large. Currently I'm doing the following (though not done with it yet):
File.open(#log_file, "r") do |file_handle|
file_handle.each do |line|
if line.match(/\d+++-\d+-\d+/)
etc...
The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.
Thanks
fgrep as a standalone or called from system('fgrep ...') may be faster solution
file.readlines might be better in speed, but it's a time-space tradeoff
look at this little research - last approaches seem to be rather fast.
Here are some coding hints...
Instead of:
File.open(#log_file, "r") do |file_handle|
file_handle.each do |line|
use:
File.foreach(#log_file) do |line|
next unless line[/\A\d+++-\d+-\d+/]
foreach simplifies opening and looping over the file.
next unless... makes a tight loop skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.
Using an anchor at the start of your pattern, like \A gives the regex engine a major hint about where to look in the line, and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+++-\d+-\d+/] is a bit more concise.
If your log file is sorted by date, then you can avoid having search through the entire file by doing a binary search. In this case you'd:
Open the file like you are doing
Use lineo= to fast forward to the middle of the file.
Check if the date on the beging of the line is higher or lower than the date you are looking for.
Continue splitting the file in halves until you find what you need.
I do however think your file needs to be very large for the above to make sense.
Edit
Here is some code which shows the basic idea. It find a line containing search date, not the first. This can be fixed either by more binary searches or by doing an linear search from the last midpoint, which did not contain date. There also isn't a termination condition in case the date is not in the file. These small additions, are left as an exercise to the reader :-)
require 'date'
def bin_fsearch(search_date, file)
f = File.open file
search = {min: 0, max: f.size}
while true
# go to file midpoint
f.seek (search[:max] + search[:min]) / 2
# read in until EOL
f.gets
# record the actual mid-point we are using
pos = f.pos
# read in next line
line = f.gets
# get date from line
line_date = Date.parse(line)
if line_date < search_date
search[:min] = f.pos
elsif line_date > search_date
search[:max] = pos
else
f.seek pos
return
end
end
end
bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')
Try this, it will search one time at a time & should be pretty fast & takes less memory.
File.open(file, 'r') do |f|
f.each_line do |line|
# do stuff here to line
end
end
Another more faster option is to read the whole file into one array. it would be fast but will take LOT of memory.
File.readlines.each do |line|
#do stuff with each line
end
Further, if you need fastest approach with least amount of memory try grep which is specifically tuned for searching through large files. so should be fast & memory responsive
`grep -e regex bigfile`.split(/\n/).each do |line|
# ... (called on each matching line) ...
end
Faster than line-by-line is read the line by chunks:
File.open('file.txt') do |f|
buff = f.read(10240)
# ...
end
But you are using regexp to match dates, you might get incomplete lines. You will have to deal with it in your logic.
Also, if performance is that important, consider write a really simple C extension.
If the log file can get huge, and that is your concern, then maybe you can consider saving the errors in a database. Then, you will get faster response.

Ruby File.read vs. File.gets

If I want to append the contents of a src file into the end of a dest file in Ruby, is it better to use:
while line = src.gets do
or
while buffer = src.read( 1024 )
I have seen both used and was wondering when should I use each method and why?
One is for reading "lines", one is for reading n bytes.
While byte buffering might be faster, a lot of that may disappear into the OS which likely does buffering anyway. IMO it has more to do with the context of the read--do you want lines, or are you just shuffling chunks of data around?
That said, a performance test in your specific environment may be helpful when deciding.
You have a number of options when reading a file that are tailored to different situations.
Read in the file line-by-line, but only store one line at a time:
while (line = file.gets) do
# ...
end
Read in all lines of a file at once:
file.readlines.each do |line|
# ...
end
Read the file in as a series of blocks:
while (data = file.read(block_size))
# ...
end
Read in the whole file at once:
data = file.read
It really depends on what kind of data you're working with. Generally read is better suited towards binary files, or those where you want it as one big string. gets and readlines are similar, but readlines is more convenient if you're confident the file will fit in memory. Don't do this on multi-gigabyte log files or you'll be in for a world of hurt as your system starts swapping. Use gets for situations like that.
gets will read until the end of the line based on a separator
read will read n bytes at a time
It all depends on what you are trying to read.
It may be more efficient to use read if your src file has unpredictable line lengths.

Resources