Reading a .txt file with escaped characters in Ruby - ruby

I'm having difficulty reading a file with escaped characters in Ruby...
My text file has the string "First Line\r\nSecond Line" and when I use File.read, I get back a string in which my escape sequences have themselves been escaped: "First Line\\r\\nSecond Line"
These two strings are not the same things...
1.9.2-p318 :006 > f = File.read("file.txt")
=> "First Line\\r\\nSecond Line"
1.9.2-p318 :007 > f.count('\\')
=> 2
1.9.2-p318 :008 > f = "First Line\r\nSecond Line"
=> "First Line\r\nSecond Line"
1.9.2-p318 :009 > f.count('\\')
=> 0
How can I get the File.read to not escape my escaped characters?

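File.read is not adding anything: the file contains a literal backslash followed by r and a literal backslash followed by n, and irb displays each literal backslash as \\. To convert those two-character sequences into the control characters they stand for, create a method like this: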
# Map each two-character escape sequence to the character it stands for
UNESCAPES = {
  "\\\\" => "\\", # Backslash
  '\\"'  => '"',  # Double quotes
  "\\'"  => "'",  # Single quotes
  "\\a"  => "\a", # Bell/alert
  "\\b"  => "\b", # Backspace
  "\\r"  => "\r", # Carriage Return
  "\\n"  => "\n", # New Line
  "\\s"  => "\s", # Space
  "\\t"  => "\t"  # Tab
}
# Define a method to handle unescaping the escape characters.
# It works in a single pass, so an escaped backslash followed by "n"
# is not mistaken for an escaped new line.
def unescape_escapes(s)
  s.gsub(/\\["'\\abrnst]/, UNESCAPES)
end
Then see it in action:
# Create your sample file
f = File.new("file.txt", "w")
f.write("First Line\\r\\nSecond Line")
f.close
# Use the method to solve your problem
f = File.read("file.txt")
puts "BEFORE:", f
puts f.count('\\')
f = unescape_escapes(f)
puts "AFTER:", f
puts f.count('\\')
# Here's a more elaborate use of it
f = File.new("file2.txt", "w")
f.write("He used \\\"Double Quotes\\\".")
f.write("\\nThen a Backslash: \\\\")
f.write('\\nFollowed by \\\'Single Quotes\\\'.')
f.write("\\nHere's a bell/alert: \\a")
f.write("\\nThis is a backspace\\b.")
f.write("\\nNow we see a\\rcarriage return.")
f.write("\\nWe've seen many\\nnew lines already.")
f.write("\\nHow\\sabout\\ssome\\sspaces?")
f.write("\\nWe'll also see some more:\\n\\ttab\\n\\tcharacters")
f.close
# Read the file without the method
puts "", "BEFORE:"
puts File.read("file2.txt")
# Read the file with the method
puts "", "AFTER:"
puts unescape_escapes(File.read("file2.txt"))
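A quick way to confirm what the file really contains is to compare character counts and byte values instead of relying on the inspect output. The following is only a minimal sketch: the variable names and the temporary file are illustrative and not part of the original question.

```ruby
require "tempfile"

# Single quotes keep the backslashes: this is the text stored in the file.
literal = 'First Line\r\nSecond Line'
# Double quotes turn \r and \n into a real carriage return and line feed.
real = "First Line\r\nSecond Line"

# Round-trip the literal text through a temporary file.
from_disk = nil
Tempfile.create("escapes") do |file|
  file.write(literal)
  file.flush
  from_disk = File.read(file.path)
end

puts from_disk == literal    # File.read returned the text unchanged
puts from_disk.count('\\')   # number of literal backslashes read from disk
puts real.count('\\')        # number of backslashes in the "real" string
p from_disk.bytes[10, 4]     # byte values of backslash, r, backslash, n
p real.bytes[10, 2]          # byte values of carriage return, line feed
```

If the first comparison prints true, nothing was added while reading; the backslashes were already in the file, and only their display in irb is doubled.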

You could just hack them back in.
foo = f.gsub("\r\n", "\\r\\n")
#=> "First Line\\r\\nSecond Line"
foo.count("\\")
#=> 2

Related

Using `gsub` inside (double quoted) heredoc does not work

It appears that using gsub inside a (double quoted) heredoc does not evaluate the result of gsub, as follows:
class Test
  def self.define_phone
    class_eval <<-EOS
      def _phone=(val)
        puts val
        puts val.gsub(/\D/,'')
      end
    EOS
  end
end
Test.define_phone
test = Test.new
test._phone = '123-456-7890'
# >> 123-456-7890
# >> 123-456-7890
The second puts should have printed 1234567890, just as it would in this case:
'123-456-7890'.gsub(/\D/,'')
# => "1234567890"
What is going on inside the heredoc?
The problem is with the \D in the regex. It will be evaluated when the heredoc is evaluated as a string, which results in D:
"\D" # => "D"
eval("/\D/") #=> /D/
On the other hand, \D inside a single quote will not be evaluated as D:
'\D' # => "\\D"
eval('/\D/') # => /\D/
So wrap the heredoc terminator EOS in single quotes to achieve what you want:
class Test
  def self.define_phone
    class_eval <<-'EOS'
      def _phone=(val)
        puts val
        puts val.gsub(/\D/,'')
      end
    EOS
  end
end
Test.define_phone
test = Test.new
test._phone = '123-456-7890'
# >> 123-456-7890
# >> 1234567890
If you run the above code without quoting EOS, gsub removes every literal "D" from val instead of the non-digit characters. See this:
test._phone = '123-D456-D7890DD'
# >> 123-D456-D7890DD
# >> 123-456-7890
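The difference between the two heredoc styles can be seen without class_eval at all. This is a small sketch with made-up variable names; it also shows a third option, doubling the backslash inside an unquoted heredoc, which the answer above does not mention.

```ruby
# Unquoted terminator: the body behaves like a double-quoted string,
# so the unknown escape \D collapses to a plain D.
interpolating = <<-EOS
/\D/
EOS

# Single-quoted terminator: the body is taken literally.
literal = <<-'EOS'
/\D/
EOS

# Alternative: keep the unquoted heredoc and double the backslash.
escaped = <<-EOS
/\\D/
EOS

p interpolating        # the backslash is gone
p literal              # the backslash survives
p escaped == literal   # both spellings yield the same text
```

Doubling the backslash is mainly useful when the heredoc also needs #{...} interpolation, which a single-quoted terminator would disable.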

How to remove word break and line break in pdf file?

I'm trying to parse a PDF file and I would like to get the text without the word breaks at the ends of lines, e.g.:
text.pdf
"hello guys I ne-
ed help"
How can I remove the "-" and the line break in order to join both parts of "need" together?
This is my current code:
require 'pdf-reader'

reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  page.text.each_line do |line|
    words = line.split(" ") # => ["hello", "guys", "I", "ne-"], then ["ed", "help"]
    words.each do |word|
      puts word
    end
  end
end
You can use String#gsub:
a = "hello guys I ne-
ed help"
#=> "hello guys I ne-\n" + "ed help"
a.gsub(/-|\n/, '-' => '', "\n" => '')
#=> "hello guys I need help"
With your code:
reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  # gsub returns a new string, so print it (or store it) rather than discarding it
  puts page.text.gsub(/-|\n/, '-' => '', "\n" => '')
end
Or, if the dash and the newline always appear together, substitute them as a pair (this leaves all other hyphens alone):
a.gsub(/-\n/, '')
#=> "hello guys I need help"
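A slightly more defensive variant is sketched below on a plain string, so it can be tried without a PDF. The helper name join_hyphenated and the sample text are hypothetical; the pattern only removes a hyphen that sits between two word characters directly before a line break, and it also tolerates Windows line endings.

```ruby
# Joins a word that was split with a hyphen at the end of a line.
# A hyphen in the middle of a line (as in "well-known") is left intact.
def join_hyphenated(text)
  text.gsub(/(\w)-\r?\n(\w)/, '\1\2')
end

sample = "hello guys I ne-\ned help, a well-known\nissue"
joined = join_hyphenated(sample)
words  = joined.split

puts joined
p words
```

One limitation of this approach: a genuinely hyphenated compound that happens to break at the end of a line would be merged as well, so the result may still need a dictionary check if that matters.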

How do I apply removing of characters to the string itself?

Using Ruby 2.4. How do I apply an edit to the string itself? I have this method
# Removes the word from the end of the string
def remove_word_from_end_of_str(str, word)
  str[0...(-1 * word.length)]
end
I want the parameter to be operated upon, but it isn't working ...
2.4.0 :001 > str = "abc def"
=> "abc def"
2.4.0 :002 > StringHelper.remove_word_from_end_of_str(str, "def")
=> "abc "
2.4.0 :003 > str
=> "abc def"
I want the string that was passed in to be equal to "abc " but that isn't happening. I don't want to set the variable to the result of the function (e.g. "str = StringHelper.remove(...)").
Ruby has the in-place String#delete! method, which happens to give the desired result for this input:
>> str = 'abc def'
=> "abc def"
>> word = 'def'
=> "def"
>> str.delete!(word)
=> "abc "
>> str
=> "abc "
Note that delete! treats its argument as a set of characters rather than as a word, so it removes every d, e and f anywhere in the string:
>> str = 'def abc def'
=> "def abc def"
>> str.delete!(word)
=> " abc "
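The character-set behaviour of delete is easier to see with an input where the letters of the word also occur elsewhere. The sketch below uses an invented sample string and contrasts delete with an anchored sub that only touches the end of the string.

```ruby
str = "defend the fed"

# delete removes every character listed in its argument, wherever it
# occurs; the argument is not matched as a whole word.
by_chars = str.delete("def")

# sub with a pattern anchored at the end of the string removes the word
# only when the string ends with it. Regexp.escape guards against
# special characters in the word.
word = "fed"
by_word = str.sub(/#{Regexp.escape(word)}\z/, "")

p by_chars
p by_word
```

The non-bang methods are used here so that both results can be compared against the same original string; delete! and sub! apply the same logic in place.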
To limit the effect to only the last word, you can do:
>> str = 'def abc def'
=> "def abc def"
>> str.slice!(-word.length..-1)
=> "def"
>> str
=> "def abc "
str[range] is just shorthand for str.slice(range). You just have to use the destructive method, like this:
# Removes the word from the end of the string
def remove_word_from_end_of_str(str, word)
  str.slice!((str.length - word.length)...(str.length))
end
For more information, see the documentation.
If you want your function to return the new string as well (slice! itself returns the removed part), you should use:
# Removes the word from the end of the string
def remove_word_from_end_of_str(str, word)
  str.slice!((str.length - word.length)...(str.length))
  str
end
Try:
def remove_word_from_end_of_str(str, word)
  str.slice!((str.length - word.length)..str.length)
end
Also, your explanation is a little confusing. You are calling the remove_word method as a class method but it is an instance method.
chomp! removes the given string from the end of the receiver in place (if present) and returns the modified string, or nil if nothing was removed.
def remove_word_from_end_of_str(str, word)
  str.chomp!(word)
end
str = "Aurora CO"
remove_word_from_end_of_str(str, "CO")
p str #=> "Aurora "
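Putting the pieces together, here is one possible shape for the helper from the question. It is a sketch under a few assumptions: StringHelper is taken to be a module (the question does not show its definition), module_function is used so the method can be called as StringHelper.remove_word_from_end_of_str, and the method returns str itself because chomp! returns nil when nothing was removed.

```ruby
module StringHelper
  module_function

  # Removes word from the end of str in place (if present) and returns str.
  def remove_word_from_end_of_str(str, word)
    str.chomp!(word) # returns nil when nothing was removed
    str
  end
end

# dup gives a mutable copy even when string literals are frozen.
str = "abc def".dup
StringHelper.remove_word_from_end_of_str(str, "def")
p str

other = "abc def".dup
StringHelper.remove_word_from_end_of_str(other, "xyz")
p other
```

Unlike the slice! versions, this only shortens the string when it really ends with the given word. Note that an in-place method raises an error on a frozen string, and that chomp! with an empty string as argument strips trailing newlines instead.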

Ruby regex to get text blocks including delimiters

When using scan in Ruby, we are searching for a block within a text file.
Sample file:
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
We want the following result in an array:
["begin\nsometext\nend","begin\nsometext2\nend"]
With this scan method:
textfile.scan(/begin\s.(.*?)end/m)
we get:
["sometext","sometext2"]
We want the begin and end still in the output, not cut off.
Any suggestions?
You may remove the capturing group completely:
textfile.scan(/begin\s.*?end/m)
See the IDEONE demo
The String#scan method returns only the captured values if the pattern contains capturing groups, so removing the group (or making it non-capturing) fixes the issue.
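The effect of the capturing group on scan can be checked with a short script. This is only a sketch: the sample text mirrors the file from the question, and the variable names are made up.

```ruby
text = "sometextbefore\nbegin\nsometext\nend\nsometextafter\n" \
       "begin\nsometext2\nend\nsometextafter2\n"

# With a capturing group, scan returns one array of captures per match.
with_group = text.scan(/begin\s(.*?)end/m)

# Without a group, scan returns the whole match as a string.
without_group = text.scan(/begin\s.*?end/m)

# A non-capturing group (?:...) behaves like no group at all.
non_capturing = text.scan(/begin\s(?:.*?)end/m)

p with_group
p without_group
p non_capturing == without_group
```

These patterns would also match "end" inside a longer word, so for real input it may be worth anchoring the delimiters to whole lines.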
UPDATE
If the lines inside the blocks must be trimmed from leading/trailing whitespace, you can just use a gsub against each matched block of text to remove all the horizontal whitespace (with the help of \p{Zs} Unicode category/property class):
textfile.scan(/begin\s.*?end/m).map { |s| s.gsub(/^\p{Zs}+|\p{Zs}+$/, "") }
Here, each match is passed to a block where /^\p{Zs}+|\p{Zs}+$/ matches either the start of a line with 1+ horizontal whitespace(s) (see ^\p{Zs}+), or 1+ horizontal whitespace(s) at the end of the line (see \p{Zs}+$).
See another IDEONE demo
Here's another approach, using Ruby's flip-flop operator. I cannot say I would recommend this approach, but Rubyists should understand how the flip-flop operator works.
First let's create a file.
str =<<_
some
text
at beginning
begin
 some
 text
 1
end
some text
between
begin
 some
 text
 2
end
some text at end
_
#=> "some\ntext\nat beginning\nbegin\n some\n text\n 1\nend\n...at end\n"
FName = "text"
File.write(FName, str)
Now read the file line-by-line into the array lines:
lines = File.readlines(FName)
#=> ["some\n", "text\n", "at beginning\n", "begin\n", " some\n", " text\n",
# " 1\n", "end\n", "some text\n", "between\n", "begin\n", " some\n",
# " text\n", " 2\n", "end\n", "some text at end\n"]
We can obtain the desired result as follows.
lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.
map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
The two steps are as follows.
First, select and group the lines of interest, using Enumerable#chunk with the flip-flop operator.
a = lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }
#=> #<Enumerator: #<Enumerator::Generator:0x007ff62b981510>:each>
We can see the objects that will be generated by this enumerator by converting it to an array.
a.to_a
#=> [[true, ["begin\n", " some\n", " text\n", " 1\n", "end\n"]],
# [true, ["begin\n", " some\n", " text\n", " 2\n", "end\n"]]]
Note that the flip-flop operator is distinguished from a range definition by making it part of a logical expression. For that reason we cannot write
lines.chunk { |line| line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.to_a
#=> ArgumentError: bad value for range
The second step is the following:
b = a.map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
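For readers who have not met the flip-flop before, a smaller example may help. The sketch below uses invented data; it relies on the same `true if a..b` form as above, so the range is parsed as a flip-flop inside a condition. Ruby 2.6 prints a deprecation warning for this operator, which was withdrawn again in Ruby 2.7.

```ruby
# The flip-flop turns on when the left condition is true and stays on
# up to and including the element for which the right condition is true.
picked = (1..10).select { |i| true if (i == 3)..(i == 6) }
p picked

# The same idea applied to lines of text.
lines = ["before\n", "begin\n", "a\n", "end\n", "after\n"]
inside = lines.select do |line|
  true if (line =~ /^begin\s*$/)..(line =~ /^end\s*$/)
end
p inside
```

select is enough here because there is only one block; chunk is what separates several consecutive blocks from each other.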
Ruby has some great methods in Enumerable. slice_before and slice_after can help with this sort of problem:
string = <<EOT
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
EOT
ary = string.split # => ["sometextbefore", "begin", "sometext", "end", "sometextafter", "begin", "sometext2", "end", "sometextafter2"]
.slice_after(/^end/) # => #<Enumerator: #<Enumerator::Generator:0x007fb1e20b42a8>:each>
.map{ |a| a.shift; a } # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"], []]
ary.pop # => []
ary # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"]]
If you want the resulting sub-arrays joined then that's an easy step:
ary.map{ |a| a.join("\n") } # => ["begin\nsometext\nend", "begin\nsometext2\nend"]

Setting end-of-line character for puts

I have an array of entries I would like to print.
With arr being the array, I used to just write:
puts arr
Then I needed to use the DOS format end-of-line: \r\n, so I wrote:
arr.each { |e| print "#{e}\r\n" }
This works correctly, but I would like to know if there is a way to specify what end-of-line format to use so that I could write something like:
$eol = "\r\n"
puts arr
UPDATE 1
I know that puts will use the correct line-endings depending on the platform it is run on, but I need this because I will write the output to a file.
UPDATE 2
As Mark suggested, setting $\ is useful. However, it only works for print.
For example,
irb(main):001:0> a = [1, 2, 3]
=> [1, 2, 3]
irb(main):002:0> $\ = "\r\n"
=> "\r\n"
irb(main):003:0> print a
123
=> nil
irb(main):004:0> puts a
1
2
3
=> nil
print prints all array items on a single line and then adds $\, while I would like the behaviour of puts: adding $\ after each item of the array.
Is this possible at all without using Array#each?
The Ruby variable $\ will set the record separator for calls to print and write:
>> $\ = '!!!'
=> "!!!"
>> print 'hi'
hi!!!=> nil
Alternatively you can refer to $\ as $OUTPUT_RECORD_SEPARATOR if you require the English library.
Kernel#puts is equivalent to STDOUT.puts, and IO#puts "writes a newline after every element that does not already end with a newline sequence". So you're out of luck with pure puts for arrays. However, the $, variable is the separator string used between elements by methods such as Kernel#print and Array#join. So if you can handle calling print arr.join, this might be the best solution for what you're doing:
>> [1,2,3].join
=> "123"
>> $, = '---'
=> "---"
>> [1,2,3].join
=> "1---2---3"
>> $\ = '!!!'
=> "!!!"
>> print [1,2,3].join
1---2---3!!!=> nil
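If changing global variables is undesirable, a small helper can stand in for puts. This is a sketch, not a built-in facility: write_lines and its eol parameter are hypothetical names, and StringIO is used only so the result can be inspected without touching a file.

```ruby
require "stringio"

# Writes each item followed by eol to the given IO object.
# The explicit "" separator keeps join independent of the global $, value.
def write_lines(io, items, eol = "\r\n")
  io.write(items.map { |item| "#{item}#{eol}" }.join(""))
end

out = StringIO.new
write_lines(out, [1, 2, 3])
p out.string
```

For a real file the same call should work with a File object, for example inside File.open("out.txt", "wb") { |f| write_lines(f, arr) }. Binary mode is suggested on the assumption that the script may run on Windows, where text mode could otherwise translate the "\n" a second time.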
