Ruby: Length of a line of a file in bytes? - ruby

I'm writing this little HelloWorld as a followup to this and the numbers do not add up
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?
Note: I'm on Windows, and the file is encoded as type ANSI.
Edit: This produces the same results!
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
so anybody who can help now...

IO#gets works the same as if you were capturing input from the command line: the "Enter" isn't sent as part of the input; neither is it passed when #gets is called on a File or other subclass of IO, so the numbers are definitely not going to match up.
See the relevant Pickaxe section
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
def Chunkifier.to_chunks(path)
chunks, current_chunk_size = [""], 0
File.readlines(path).each do |line|
line.chomp! # strips off \n, \r or \r\n depending on OS
if chunks.last.size + line.size >= 4_000 # 4096?
chunks.last.chomp! # remove last line terminator
chunks << ""
end
chunks.last << line + "\n" # or whatever terminator you need
end
chunks
end
end
if __FILE__ == $0
require 'test/unit'
class TestFile < Test::Unit::TestCase
def test_chunking
chs = Chunkifier.to_chunks(PATH)
chs.each do |chunk|
assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
end
end
end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that whatever the OS is doing, the byts at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack(C*) options and monkey-patching String, something like this:
class String
def size_in_bytes
self.unpack("C*").size
end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.

You might try IO#each_byte, e.g.
total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"
That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.

You potentially have several overlapping issues here:
Linefeed characters \r\n vs. \n (as per your previous post). Also EOF file character (^Z)?
Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?
Interaction of the $KCODE global variable (deprecated in ruby 1.9. See String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?
Your format string for #unpack. I think you want C* here if you really want to count bytes.
Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).

The issue is that when you save a text file on windows, your line breaks are two characters (characters 13 and 10) and therefore 2 bytes, when you save it on linux there is only 1 (character 10). However, ruby reports both these as a single character '\n' - it says character 10. What's worse, is that if you're on linux with a windows file, ruby will give you both characters.
So, if you know that your files are always coming from windows text files and executed on windows, every time you get a newline character you can add 1 to your count. Otherwise it's a couple of conditionals and a little state machine.
BTW there's no EOF 'character'.

f = File.new("log.txt")
begin
while (line = f.readline)
line.chomp
puts line.length
end
rescue EOFError
f.close
end

Here is a simple solution, presuming that the current file pointer is set to the start of a line in the read file:
last_pos = file.pos
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
file.seek(backup_dist, IO::SEEK_CUR)
in this example "file" is the file from which you are reading. To do this in a loop:
last_pos = file.pos
begin loop
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
last_pos = current_pos
file.seek(backup_dist, IO::SEEK_CUR)
end loop

Related

How far does .each read? To the end of the line?

Sorry for the newbie question. Was loading a .txt file into the following code:
line_count = 0
File.open("text.txt").each {|line| line_count += 1}
puts line_count
Does .each simply read until the end of a line before passing its value to the code block? Little explanation would be great. Thanks!
You can use .each_line to be more explicit, but yes, http://www.ruby-doc.org/core-2.0.0/IO.html#method-i-each each reads a line.
f = File.new("testfile")
f.each {|line| puts "#{f.lineno}: #{line}" }
It's really important to read the documentation, because all sorts of things are explained there. For instance, the documentation for each says:
Executes the block for every line in ios, where lines are separated by sep.
sep means "\r", "\n" or "\r\n", depending on the OS the code is running on which is also the value of the special $/ global variable which contains the default line-ending character for that OS. You can tell Ruby to use a different value for the line-end/separator if you know the file uses something else.
Regarding your code:
I'd do it this way:
line_count = 0
File.foreach("text.txt") do |line|
line_count += 1
end
puts line_count
foreach is very self-explanatory, which is important when writing code. You want it to be self-documenting as much as possible. foreach iterates over "each" line in the file. It also assumes the line-ends are the same as $/, but you can force it to be something different, perhaps the letter "z" or "." or " ", depending on your whim and fancy at the moment.

How to print from specific column range?

I want to grab only the first line of columns 46 to 245 of source.txt and write it to output.txt
source_file.each { |line|
File.open(output_file,"a+") { |f|
f.print ???
}
Bonus: I also need to keep a count of the number of characters in this range, as some may be whitespace. i.e. 38 characters and the rest whitespace.
Example:
source_file: (first line only, columns 45 to 245): 13287912721981239854 + 180 blank columns
output_file: 13287912721981239854
count = 20 characters
Update: appending [46..245].delete(' ').size gives me the desired count.
If I am understanding what you are asking correctly, there's no reason to grab the whole file when you only want the first line. If this isn't what you're asking for, then you need to specify what you're trying to pull out of the source file more clearly.
This should grab the data you need:
output_line = source_file.gets [45..244]
If you write:
source_file.each { |line|
File.open(output_file,"a+") { |f|
f.print ???
}
}
You will open, then close, your output file for each line read from the output file. That is the wrong way to do it, even if you only want to read one line of input.
Instead try something like one of these:
File.open(output_file, 'a') do |fo|
File.open('path/to/input_file') do |fi|
fo.puts fi.readline[46..245]
end
end
This uses IO.readline, which reads a single line from the file. The block falls through afterwards, causing both the input and output files to be closed automatically. Also, it opens the output file as 'a' which is append-mode only. 'a+' is wrong unless you intend to append and read, which is rarely done. From the documentation:
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing
Or:
File.open(output_file, 'a') do |fo|
File.foreach('path/to/input_file') do |li|
fo.puts li[46..245]
break
end
end
foreach is used most often when we're reading a file line-by-line. It's the mainstay for reading files in a scalable manner. It wants to loop over the file inside the block, which is why break is there, to break out of that loop.
Or:
File.foreach('path/to/input_file') do |li|
File.write(output_file, li[46..245], -1, :mode => 'a')
break
end
File.write is useful when you have a blob of text or binary, and want to write it in one chunk, then move on. The -1 tells Ruby to move to the end of the file. :mode => 'a' overrides the default mode which would normally truncate an existing file.
Maybe this will do the job:
line = f.readline
columns = line.split
File.open("output.txt", "w") do |out|
columns[46, (245 - 46 + 1)].each do |column|
out.puts column
end
end
break # only process first line
I have used 245 - 46 + 1 to indicate this is the number of columns we are interested in. I have also assumed that columns are separate by whitespaces. If that is not the case you will need to change the delimiter of split.

How to read lines of a file in Ruby

I was trying to use the following code to read lines from a file. But when reading a file, the contents are all in one line:
line_num=0
File.open('xxx.txt').each do |line|
print "#{line_num += 1} #{line}"
end
But this file prints each line separately.
I have to use stdin, like ruby my_prog.rb < file.txt, where I can't assume what the line-ending character is that the file uses. How can I handle it?
Ruby does have a method for this:
File.readlines('foo').each do |line|
puts(line)
end
http://ruby-doc.org/core-1.9.3/IO.html#method-c-readlines
File.foreach(filename).with_index do |line, line_num|
puts "#{line_num}: #{line}"
end
This will execute the given block for each line in the file without slurping the entire file into memory. See: IO::foreach.
I believe my answer covers your new concerns about handling any type of line endings since both "\r\n" and "\r" are converted to Linux standard "\n" before parsing the lines.
To support the "\r" EOL character along with the regular "\n", and "\r\n" from Windows, here's what I would do:
line_num=0
text=File.open('xxx.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
print "#{line_num += 1} #{line}"
end
Of course this could be a bad idea on very large files since it means loading the whole file into memory.
Your first file has Mac Classic line endings (that’s "\r" instead of the usual "\n"). Open it with
File.open('foo').each(sep="\r") do |line|
to specify the line endings.
I'm partial to the following approach for files that have headers:
File.open(file, "r") do |fh|
header = fh.readline
# Process the header
while(line = fh.gets) != nil
#do stuff
end
end
This allows you to process a header line (or lines) differently than the content lines.
It is because of the endlines in each lines.
Use the chomp method in ruby to delete the endline '\n' or 'r' at the end.
line_num=0
File.open('xxx.txt').each do |line|
print "#{line_num += 1} #{line.chomp}"
end
how about gets ?
myFile=File.open("paths_to_file","r")
while(line=myFile.gets)
//do stuff with line
end
Don't forget that if you are concerned about reading in a file that might have huge lines that could swamp your RAM during runtime, you can always read the file piece-meal. See "Why slurping a file is bad".
File.open('file_path', 'rb') do |io|
while chunk = io.read(16 * 1024) do
something_with_the chunk
# like stream it across a network
# or write it to another file:
# other_io.write chunk
end
end

How to get a particular line from a file

Is it possible to extract a particular line from a file knowing its line number? For example, just get the contents of line N as a string from file "text.txt"?
You could get it by index from readlines.
line = IO.readlines("file.txt")[42]
Only use this if it's a small file.
Try one of these two solutions:
file = File.open "file.txt"
#1 solution would eat a lot of RAM
p [*file][n-1]
#2 solution would not
n.times{ file.gets }
p $_
file.close
def get_line_from_file(path, line)
result = nil
File.open(path, "r") do |f|
while line > 0
line -= 1
result = f.gets
end
end
return result
end
get_line_from_file("/tmp/foo.txt", 20)
This is a good solution because:
You don't use File.read, thus you don't read the entire file into memory. Doing so could become a problem if the file is 20MB large and you read often enough so GC doesn't keep up.
You only read from the file until the line you want. If your file has 1000 lines, getting line 20 will only read the 20 first lines into Ruby.
You can replace gets with readline if you want to raise an error (EOFError) instead of returning nil when passing an out-of-bounds line.
File has a nice lineno method.
def get_line(filename, lineno)
File.open(filename,'r') do |f|
f.gets until f.lineno == lineno - 1
f.gets
end
end
linenumber=5
open("file").each_with_index{|line,ind|
if ind+1==linenumber
save=line
# break or exit if needed.
end
}
or
linenumber=5
f=open("file")
while line=f.gets
if $. == linenumber # $. is line number
print "#{f.lineno} #{line}" # another way
# break # break or exit if needed
end
end
f.close
If you just want to get the line and do nothing else, you can use this one liner
ruby -ne '(print $_ and exit) if $.==5' file
If you want one liner and do not care about memory usage, use (assuming lines are numbered from 1)
lineN = IO.readlines('text.txt')[n-1]
or
lineN = f.readlines[n-1]
if you already have file opened.
Otherwise it would be better to do like this:
lineN = File.open('text.txt') do |f|
(n-1).times { f.gets } # skip lines preceeding line N
f.gets # read line N contents
end
These solutions work if you want only one line from a file, or if you want multiple lines from a file small enough to be read repeatedly. Large files (for example, 10 million lines) take much longer to search for a specific line so it's better to get the necessary lines sequentially in a single read so the large file doesn't get read multiple times.
Create a large file:
File.open('foo', 'a') { |f| f.write((0..10_000_000).to_a.join("\n")) }
Pick which lines will be read from it and make sure they're sorted:
lines = [9_999_999, 3_333_333, 6_666_666].sort
Print out those lines:
File.open('foo') do |f|
lines.each_with_index do |line, index|
(line - (index.zero? ? 0 : lines[index - 1]) - 1).times { f.gets }
puts f.gets
end
end
This solution works for any number of lines, does not load the entire file into memory, reads as few lines as possible, and only reads the file one time.

Quick Advice: How should this be written in Ruby?

I am a Java/C++ programmer and Ruby is my first scripting language. I sometimes find that I am not using it as productively as I could in some areas, like this one for example:
Objective: to parse only certain lines from a file. The pattern I am going with is that there is one very large line with a size greater than 15, the rest are definitely smaller. I want to ignore all the lines before (and including) the large one.
def do_something(str)
puts str
end
str =
'ignore me
me too!
LARGE LINE ahahahahha its a line!
target1
target2
target3'
flag1 = nil
str.each_line do |line|
do_something(line) if flag1
flag1 = 1 if line.size > 15
end
I wrote this, but I think it could be written a lot better, ie, there must be a better way than setting a flag. Recommendations for how to write beautiful lines of Ruby also welcome.
Note/Clarification: I need to print ALL lines AFTER the first appearance of the LARGE LINE.
str.lines.drop_while {|l| l.length < 15 }.drop(1).each {|l| do_something(l) }
I like this, because if you read it from left to right, it reads almost exactly like your original description:
Split the string in lines and drop lines shorter than 15 characters. Then drop another line (i.e. the first one with more than 14 characters). Then do something with each remaining line.
You don't even need to necessarily understand Ruby, or programming at all to be able to verify whether this is correct.
require 'enumerator' # Not needed in Ruby 1.9
str.each_line.inject( false ) do |flag, line|
do_something( line ) if flag
flag || line.size > 15
end
lines = str.split($/)
start_index = 1 + lines.find_index {|l| l.size > 15 }
lines[start_index..-1].each do |l|
do_something(l)
end

Resources