I'm trying to write a very simple ruby script that opens a text file, removes the \n from the end of lines UNLESS the line starts with a non-alphabetic character OR the line itself is blank (\n).
The code below works fine, except that it skips all of the content beyond the last \n line. When I add \n\n to the end of the file, it works perfectly. Examples: A file with this text in it works great and pulls everything to one line:
Hello
there my
friend how are you?
becomes Hello there my friend how are you?
But text like this:
Hello
there
my friend
how
are you today
returns just Hello and There, and completely skips the last 3 lines. If I add 2 blank lines to the end, it will pick up everything and behave as I want it to.
Can anybody explain to me why this happens? Obviously I know I can fix this instance by appending \n\n to the end of the source file at the start, but that doesn't help me understand why the .gets isn't working as I'd expect.
Thanks in advance for any help!
source_file_name = "somefile.txt"
destination_file_name = "some_other_file.txt"
source_file = File.new(source_file_name, "r")
para = []
x = ""
while (line = source_file.gets)
if line != "\n"
if line[0].match(/[A-z]/) #If the first character is a letter
x += line.chomp + " "
else
x += "\n" + line.chomp + " "
end
else
para[para.length] = x
x = ""
end
end
source_file.close
fixed_file = File.open(destination_file_name, "w")
para.each do |paragraph|
fixed_file << "#{paragraph}\n\n"
end
fixed_file.close
Your problem lies in the fact you only add your string x to the para array if and only if you encounter an empty line ('\n'). Since your second example does not contain the empty line at the end, the final contents of x are never added to the para array.
The easy way to fix this without changing any of your code, is add the following lines after closing your while loop:
if(x != "")
para.push(x)
end
I would prefer to add the strings to my array right away rather then appending them onto x until you hit an empty line, but this should work with your solution.
Also,
para.push(x)
para << x
both read much nicer and look more straightforward than
para[para.length] = x
That one threw me off for a second, since in non-dynamic languages, that would give you an error. I advise using one of those instead, simply because it's more readable.
Your code is like a c code to me, ruby way should be this, which substitutes your above 100 lines.
File.write "dest.txt", File.read("src.txt")
It's easier to use a multiline regex. Maybe:
source_file.read.gsub(/(?<!\n)\n([a-z])/im, ' \\1')
Related
I'm trying to receive multiple paragraphs at once from a user.
I've tried using gets, but it doesn't seem to be working... it discards the second paragraph:
#The code:
print("Paste your text here: ")
.. essay = gets
.. puts(essay)
# Getting user imput (the second sentance is a separate paragraph)
Paste your text here: I like cake.
It makes me happy.
# What the computer did for puts(essay):
I like cake.
=> nil
I expected the result to be something like this:
"I like cake.\nIt makes me happy.\n"
But it gave me "I like cake." instead.
How could I end up with my expected result?
Add paragraphs to a string until the input consists of a empty line:
str = ""
para = "init"
str << (para = gets) until para.chomp.empty? #or para == "\n"
p str
Here's an alternative, with a slightly different logic
def getps
save, $/ = $/, "\n\n"
gets.chomp
ensure
$/ = save
end
str = getps
The global variable $/ is what Ruby uses to find out what line end is. gets gets things till line end. If we tell Ruby that line end is two newlines, then gets waits till we have two newlines in a row till it exits. Since we don't need them, we'll just chomp them off. The rest of the code is just to ensure that $/ gets restored properly afterwards so normal gets is not messed up forever.
I have a file formatted by lines like this (I know it's a terrible format, I didn't write it):
id: 12345 synset: word1,word2
I want to read the entire file and check to see if every line is correct without having to look line by line.
I've looked into File and Regex, but couldn't find what I need. I tried to use File.read to read the entire file all at once, then use m modifier for regex to check multiple lines, but it's not working the way I anticipated (perhaps it's not what I need).
p.s. Ruby newbie :)
Assuming your file always ends with a newline, this should work:
/^(id: \d+ synset: \w+,\w+\n)+$/m
The full ruby:
content = ''
File.open('myfile.txt', 'r') { |f| content = f.read }
puts 'file is valid!' if content =~ /^(id: \d+ synset: \w+,\w+\n)+$/m
You can use this regex to check each line of the file: ^id:\s*\d+\s+synset:\s*(?:\w+,)*\w+$. You can try the following code, but I don't know any Ruby, I just searched and tested a little. It might work.
line_num = 0
text = File.open('file.txt').read
text.each_line do |line|
line_num += 1
if !/^id:\s*\d+\s+synset:\s*(?:\w+,)*\w+$/.match(line)
print "Line #{line_num} is incorrect"
end
end
I learned that gets creates a new line and asks the user to input something, and gets.chomp does the same thing except that it does not create a new line. gets must return an object, so you can call a method on it, right? If so, lets name that object returned by gets as tmp, then you can call the chomp method of tmp. But before gets returns tmp, it should print a new line on the screen. So what does chomp do? Does it remove the new line after the gets created it?
Another way to re-expound my question is: Are the following actions performed when I call gets.chomp?
gets prints a new line
gets returns tmp
tmp.chomp removes the new line
User input
Is this the right order?
gets lets the user input a line and returns it as a value to your program. This value includes the trailing line break. If you then call chomp on that value, this line break is cut off. So no, what you have there is incorrect, it should rather be:
gets gets a line of text, including a line break at the end.
This is the user input
gets returns that line of text as a string value.
Calling chomp on that value removes the line break
The fact that you see the line of text on the screen is only because you entered it there in the first place. gets does not magically suppress output of things you entered.
The question shouldn't be "Is this the right order?" but more "is this is the right way of approaching this?"
Consider this, which is more or less what you want to achieve:
You assign a variable called tmp the return value of gets, which is a String.
Then you call String's chomp method on that object and you can see that chomp removed the trailing new-line.
Actually what chomp does, is remove the Enter character ("\n") at the end of your string. When you type h e l l o, one character at a time, and then press Enter gets takes all the letters and the Enter key's new-line character ("\n").
1. tmp = gets
hello
=>"hello\n"
2. tmp.chomp
"hello"
gets is your user's input. Also, it's good to know that *gets means "get string" and puts means "put string". That means these methods are dealing with Strings only.
chomp is the method to remove trailing new line character i.e. '\n' from the the string.
whenever "gets" is use to take i/p from user it appends new line character i.e.'\n' in the end of the string.So to remove '\n' from the string 'chomp' is used.
str = "Hello ruby\n"
str = str.chomp
puts str
o/p
"Hello ruby"
chomp returns a new String with the given record separator removed from the end of str (if present).
See the Ruby String API for more information.
"gets" allows user input but a new line will be added after the string (string means text or a sequence of characters)
"gets.chomp" allows user input as well just like "gets", but there is
not going to be a new line that is added after the string.
Proof that there are differences between them:
Gets.chomp
puts "Enter first text:"
text1 = gets.chomp
puts "Enter second text:"
text2 = gets.chomp
puts text1 + text2
Gets:
puts "Enter first text:"
text1 = gets
puts "Enter second text:"
text2 = gets
puts text1 + text2
Copy paste the code I gave you, run and you will see and know that they are both different.
For example:
x = gets
y = gets
puts x+y
and
x = gets.chomp
y = gets.chomp
puts x+y
Now run the two examples separately and see the difference.
Sorry for the newbie question. Was loading a .txt file into the following code:
line_count = 0
File.open("text.txt").each {|line| line_count += 1}
puts line_count
Does .each simply read until the end of a line before passing its value to the code block? Little explanation would be great. Thanks!
You can use .each_line to be more explicit, but yes, http://www.ruby-doc.org/core-2.0.0/IO.html#method-i-each each reads a line.
f = File.new("testfile")
f.each {|line| puts "#{f.lineno}: #{line}" }
It's really important to read the documentation, because all sorts of things are explained there. For instance, the documentation for each says:
Executes the block for every line in ios, where lines are separated by sep.
sep means "\r", "\n" or "\r\n", depending on the OS the code is running on which is also the value of the special $/ global variable which contains the default line-ending character for that OS. You can tell Ruby to use a different value for the line-end/separator if you know the file uses something else.
Regarding your code:
I'd do it this way:
line_count = 0
File.foreach("text.txt") do |line|
line_count += 1
end
puts line_count
foreach is very self-explanatory, which is important when writing code. You want it to be self-documenting as much as possible. foreach iterates over "each" line in the file. It also assumes the line-ends are the same as $/, but you can force it to be something different, perhaps the letter "z" or "." or " ", depending on your whim and fancy at the moment.
I'm writing this little HelloWorld as a followup to this and the numbers do not add up
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?
Note: I'm on Windows, and the file is encoded as type ANSI.
Edit: This produces the same results!
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
so anybody who can help now...
IO#gets works the same as if you were capturing input from the command line: the "Enter" isn't sent as part of the input; neither is it passed when #gets is called on a File or other subclass of IO, so the numbers are definitely not going to match up.
See the relevant Pickaxe section
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
def Chunkifier.to_chunks(path)
chunks, current_chunk_size = [""], 0
File.readlines(path).each do |line|
line.chomp! # strips off \n, \r or \r\n depending on OS
if chunks.last.size + line.size >= 4_000 # 4096?
chunks.last.chomp! # remove last line terminator
chunks << ""
end
chunks.last << line + "\n" # or whatever terminator you need
end
chunks
end
end
if __FILE__ == $0
require 'test/unit'
class TestFile < Test::Unit::TestCase
def test_chunking
chs = Chunkifier.to_chunks(PATH)
chs.each do |chunk|
assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
end
end
end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that whatever the OS is doing, the byts at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack(C*) options and monkey-patching String, something like this:
class String
def size_in_bytes
self.unpack("C*").size
end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.
You might try IO#each_byte, e.g.
total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"
That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
You potentially have several overlapping issues here:
Linefeed characters \r\n vs. \n (as per your previous post). Also EOF file character (^Z)?
Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?
Interaction of the $KCODE global variable (deprecated in ruby 1.9. See String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?
Your format string for #unpack. I think you want C* here if you really want to count bytes.
Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).
The issue is that when you save a text file on windows, your line breaks are two characters (characters 13 and 10) and therefore 2 bytes, when you save it on linux there is only 1 (character 10). However, ruby reports both these as a single character '\n' - it says character 10. What's worse, is that if you're on linux with a windows file, ruby will give you both characters.
So, if you know that your files are always coming from windows text files and executed on windows, every time you get a newline character you can add 1 to your count. Otherwise it's a couple of conditionals and a little state machine.
BTW there's no EOF 'character'.
f = File.new("log.txt")
begin
while (line = f.readline)
line.chomp
puts line.length
end
rescue EOFError
f.close
end
Here is a simple solution, presuming that the current file pointer is set to the start of a line in the read file:
last_pos = file.pos
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
file.seek(backup_dist, IO::SEEK_CUR)
in this example "file" is the file from which you are reading. To do this in a loop:
last_pos = file.pos
begin loop
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
last_pos = current_pos
file.seek(backup_dist, IO::SEEK_CUR)
end loop