How to remove word break and line break in pdf file? - ruby

I'm trying to parse a PDF file and would like to get the text without word breaks at the ends of lines, e.g.:
text.pdf
"hello guys I ne-
ed help"
How can I remove the "-" and the line break so that both parts of "need" are joined together?
This is my current code:
require 'pdf-reader'
reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  page.text.each_line do |line|
    words = line.split(" ") # first line => ["hello", "guys", "I", "ne-"], second line => ["ed", "help"]
    words.each do |word|
      puts word
    end
  end
end

You can use String#gsub:
a = "hello guys I ne-
ed help"
#=> "hello guys I ne-\n" + "ed help"
a.gsub(/-|\n/, '-' => '', "\n" => '')
#=> "hello guys I need help"
With your code:
reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  # gsub returns a new string, so print it (or collect it) instead of discarding it
  page.text.each_line { |line| print line.gsub(/-|\n/, '-' => '', "\n" => '') }
end
Or, if the dash and the newline always appear together, replace them as a single unit:
a.gsub(/-\n/, '')
#=> "hello guys I need help"
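Note that removing every "-" also damages legitimate hyphens such as "well-known". A minimal sketch of a more conservative variant, assuming a word break is always a hyphen directly followed by a line break with a word character on each side; the helper name join_hyphenated is made up for illustration and works on any string, such as the page text as a whole:

```ruby
# Hypothetical helper: joins words that were split across lines with a hyphen.
# Assumes a break looks like "ne-\ned" (hyphen, optional \r, then newline).
# Hyphens in the middle of a line and ordinary line breaks are left alone.
def join_hyphenated(text)
  text.gsub(/(?<=\w)-\r?\n(?=\w)/, '')
end

sample = "hello guys I ne-\ned help with well-known\nissues"
puts join_hyphenated(sample)
```

Because the pattern needs to see the hyphen and the newline together, it should be applied to the whole text (or to lines that still carry their trailing newline), not to lines that have already been chomped.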

Related

ruby How I could print without leave newline space for each line?

this is my file
name1;name2;name3
name4;name5;name6
I have this command
File.open("text2xycff.txt", "r") do |fix|
fix.readlines.each do |line|
parts = line.chomp.split(';')
input3 = 'zero'
File.open('text2xyczzzz2.txt', 'a+') do |f|
f.puts parts[0] , parts[1], parts[2] ,input3
end
end
end
this is my output
name1
name2
name3
zero
name4
name5
name6
zero
I need to output this
name1;name2;name3;zero
name4;name5;name6;zero
Please help me with this problem.
A more minimal approach is to just append something to each line:
File.open("text2xycff.txt", "r") do |input|
File.open('text2xyczzzz2.txt', 'a+') do |output|
input.readlines.each do |line|
output.puts(line.chomp + ';zero')
end
end
end
Or if you want to actually parse things, which presents an opportunity for clean-up:
File.open("text2xycff.txt", "r") do |input|
File.open('text2xyczzzz2.txt', 'a+') do |output|
input.readlines.each do |line|
parts = line.chomp.split(/;/)
parts << 'zero'
output.puts(parts.join(';'))
end
end
end
You have two solutions.
The first one uses puts, as you currently do, but builds a single string:
File.open('yourfile.txt', 'a+') { |f|
  f.puts "#{parts[0]};#{parts[1]};#{parts[2]};#{input3}"
}
The second one uses write instead of puts (write does not append a newline, so add it yourself):
File.open('yourfile.txt', 'a+') { |f|
  f.write "#{parts[0]};"
  f.write "#{parts[1]};"
  f.write "#{parts[2]};"
  f.write "#{input3}\n"
}
If you call puts with comma-separated arguments, each one of them will be printed on a different line.
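A small sketch of that difference, writing to a StringIO instead of a real file so the result is easy to inspect (the variable names are only for illustration):

```ruby
require 'stringio'

parts = %w[name1 name2 name3]
out = StringIO.new

out.puts parts[0], parts[1]           # two arguments -> two separate lines
out.puts (parts + ['zero']).join(';') # one joined string -> one line

print out.string
```

The same calls behave identically on a File object, since both respond to puts in the same way.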
You can use ruby string interpolation here (http://ruby-for-beginners.rubymonstas.org/bonus/string_interpolation.html):
f.puts "#{parts[0]};#{parts[1]};#{parts[2]};#{input3}"
Try:
File.open("test_io.txt", "r") do |fix|
fix.readlines.each do |line|
File.open('new_file10.txt', 'a+') do |f|
next if line == "\n"
f.puts "#{line.chomp};zero"
end
end
end
I'm not sure why you're splitting the string by semicolon when you specified you wanted the below output. You would be better served just appending ";zero" to the end of the string rather than parsing an array.
name1;name2;name3;zero
name4;name5;name6;zero
You can specify an if statement to check for the zero value.
Example:
arr = ["name1", "name2", "name3", "zero", "name4", "name5", "name6", "zero"];
arr.each { |x|
if x != "zero" then
print x
else
puts x
end
}
Output:
name1name2name3zero
name4name5name6zero
print will print inline.
puts will print on a new line.
Just implement this logic in your code and you're good to go.

How to count how many lines are between specific parts of a file?

So, I'm trying to parse a Cucumber file (*.feature), in order to identify how many lines each Scenario has.
Example of file:
Scenario: Add two numbers
Given I have entered 50 into the calculator
And I have entered 70 into the calculator
When I press add
Then the result should be 120 on the screen
Scenario: Add many numbers
Given I have entered 50 into the calculator
And I have entered 20 into the calculator
And I have entered 20 into the calculator
And I have entered 30 into the calculator
When I press add
Then the result should be 120 on the screen
So, I'm expecting to parse this file and get results like:
Scenario: Add two numbers ---> it has 4 lines!
Scenario: Add many numbers ---> it has 6 lines!
What's the best approach to do that?
Enumerable#slice_before is pretty much tailor-made for this.
File.open('your cuke scenario') do |f|
f.slice_before(/^\s*Scenario:/).each do |scenario| # slice_before returns an enumerator; iterate it with each
title = scenario.shift.chomp
ct = scenario.map(&:strip).reject(&:empty?).size
puts "#{title} --> has #{ct} lines"
end
end
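The same idea can be tried without a file on disk. A minimal sketch, assuming the feature text is already in memory and that only non-blank lines after a "Scenario:" title should be counted; count_scenario_lines is a hypothetical helper name:

```ruby
# Hypothetical helper: returns { scenario title => number of non-blank lines after it }.
def count_scenario_lines(lines)
  lines.slice_before(/^\s*Scenario:/).each_with_object({}) do |chunk, counts|
    title = chunk.first.strip
    next unless title.start_with?('Scenario:') # skip anything before the first scenario
    counts[title] = chunk.drop(1).count { |line| !line.strip.empty? }
  end
end

feature = <<~FEATURE
  Scenario: Add two numbers
    Given I have entered 50 into the calculator
    And I have entered 70 into the calculator
    When I press add
    Then the result should be 120 on the screen

  Scenario: Add many numbers
    Given I have entered 50 into the calculator
    And I have entered 20 into the calculator
    And I have entered 20 into the calculator
    And I have entered 30 into the calculator
    When I press add
    Then the result should be 120 on the screen
FEATURE

counts = count_scenario_lines(feature.lines)
counts.each { |title, n| puts "#{title} ---> it has #{n} lines!" }
```

Passing File.foreach('some.feature') instead of feature.lines should work the same way, since slice_before is available on any Enumerable.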
Why don't you start simple? As @FeRtoll suggested, going line by line might be the easiest solution. Something as simple as the following might be what you are looking for:
scenario = nil
scenarios = Hash.new{ |h,k| h[k] = 0 }
File.open("file_or_argv[0]_or_whatever.features").each do |line|
next if line.strip.empty?
if line[/^Scenario/]
scenario = line
else
scenarios[scenario] += 1
end
end
p scenarios
Output :
{"Scenario: Add two numbers \n"=>4, "Scenario: Add many numbers\n"=>6}
This is the current piece of code I'm working on (based on Kyle Burton's approach):
def get_scenarios_info
@scenarios_info = [:scenario_name => "", :quantity_of_steps => []]
@all_files.each do |file|
line_counter = 0
File.open(file).each_line do |line|
line.chomp!
next if line.empty?
line_counter = line_counter + 1
if line.include? "Scenario:"
@scenarios_info << {:scenario_name => line, :scenario_line => line_counter, :feature_file => file, :quantity_of_steps => []}
next
end
@scenarios_info.last[:quantity_of_steps] << line
end
end
#TODO: fix me here!
@scenarios_info.each do |scenario|
if scenario[:scenario_name] == ""
@scenarios_info.delete(scenario)
end
scenario[:quantity_of_steps] = scenario[:quantity_of_steps].size
end
puts @scenarios_info
end
FeRtoll suggested a good approach: accumulating by section. The simplest way to parse it for me was to scrub out parts that I can ignore (i.e. comments) and then split into sections:
file = ARGV[0] or raise "Please supply a file name to parse"
def preprocess file
data = File.read(file)
data.gsub! /#.+$/, '' # strip (ignore) comments
data.gsub! /@.+$/, '' # strip (ignore) tags
data.gsub! /[ \t]+$/, '' # trim trailing whitespace
data.gsub! /^[ \t]+/, '' # trim leading whitespace
data.split /\n\n+/ # multiple blanks separate sections
end
sections = {
:scenarios => [],
:background => nil,
:feature => nil,
:examples => nil
}
parts = preprocess file
parts.each do |part|
first_line, *lines = part.split /\n/
if first_line.include? "Scenario:"
sections[:scenarios] << {
:name => first_line.strip,
:lines => lines
}
end
if first_line.include? "Feature:"
sections[:feature] = {
:name => first_line.strip,
:lines => lines
}
end
if first_line.include? "Background:"
sections[:background] = {
:name => first_line.strip,
:lines => lines
}
end
if first_line.include? "Examples:"
sections[:examples] = {
:name => first_line.strip,
:lines => lines
}
end
end
if sections[:feature]
puts "Feature has #{sections[:feature][:lines].size} lines."
end
if sections[:background]
puts "Background has #{sections[:background][:lines].size} steps."
end
puts "There are #{sections[:scenarios].size} scenarios:"
sections[:scenarios].each do |scenario|
puts " #{scenario[:name]} has #{scenario[:lines].size} steps"
end
if sections[:examples]
puts "Examples has #{sections[:examples][:lines].size} lines."
end
HTH

Reading a .txt file with escaped characters in Ruby

I'm having difficulty reading a file with escaped characters in Ruby...
My text file contains the string "First Line\r\nSecond Line" (with literal backslashes), and when I use File.read, I get back a string that escapes my escaped characters: "First Line\\r\\nSecond Line"
These two strings are not the same things...
1.9.2-p318 :006 > f = File.read("file.txt")
=> "First Line\\r\\nSecond Line"
1.9.2-p318 :007 > f.count('\\')
=> 2
1.9.2-p318 :008 > f = "First Line\r\nSecond Line"
=> "First Line\r\nSecond Line"
1.9.2-p318 :009 > f.count('\\')
=> 0
How can I get the File.read to not escape my escaped characters?
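Create a method that converts the literal escape sequences read from the file into the characters they stand for (File.read does not add anything; the file contains real backslashes, which inspect displays as \\), like this: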
# Define a method to handle unescaping the escape characters
def unescape_escapes(s)
s = s.gsub("\\\\", "\\") #Backslash
s = s.gsub('\\"', '"') #Double quotes
s = s.gsub("\\'", "\'") #Single quotes
s = s.gsub("\\a", "\a") #Bell/alert
s = s.gsub("\\b", "\b") #Backspace
s = s.gsub("\\r", "\r") #Carriage Return
s = s.gsub("\\n", "\n") #New Line
s = s.gsub("\\s", "\s") #Space
s = s.gsub("\\t", "\t") #Tab
s
end
Then see it in action:
# Create your sample file
f = File.new("file.txt", "w")
f.write("First Line\\r\\nSecond Line")
f.close
# Use the method to solve your problem
f = File.read("file.txt")
puts "BEFORE:", f
puts f.count('\\')
f = unescape_escapes(f)
puts "AFTER:", f
puts f.count('\\')
# Here's a more elaborate use of it
f = File.new("file2.txt", "w")
f.write("He used \\\"Double Quotes\\\".")
f.write("\\nThen a Backslash: \\\\")
f.write('\\nFollowed by \\\'Single Quotes\\\'.')
f.write("\\nHere's a bell/alert: \\a")
f.write("\\nThis is a backspaces\\b.")
f.write("\\nNow we see a\\rcarriage return.")
f.write("\\nWe've seen many\\nnew lines already.")
f.write("\\nHow\\sabout\\ssome\\sspaces?")
f.write("\\nWe'll also see some more:\\n\\ttab\\n\\tcharacters")
f.close
# Read the file without the method
puts "", "BEFORE:"
puts File.read("file2.txt")
# Read the file with the method
puts "", "AFTER:"
puts unescape_escapes(File.read("file2.txt"))
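One caveat with chaining several gsub calls is that a later call can re-process the output of an earlier one: a literal backslash followed by "n" in the data is written as \\n in the file, becomes \n after the first substitution, and is then turned into a newline by a later one. A minimal sketch of a single-pass alternative, assuming the same set of escapes as above; unescape_once and UNESCAPES are names made up for this example:

```ruby
# Maps the character after the backslash to the character it stands for.
UNESCAPES = {
  '\\' => '\\', '"' => '"', "'" => "'",
  'a' => "\a", 'b' => "\b", 'r' => "\r",
  'n' => "\n", 's' => ' ', 't' => "\t"
}.freeze

# Hypothetical single-pass variant: every escape sequence is matched once,
# so the result of one replacement is never scanned again.
def unescape_once(s)
  s.gsub(/\\([\\"'abrnst])/) { UNESCAPES[$1] }
end

raw = 'First Line\r\nSecond Line' # single quotes keep the backslashes literal
p raw.count('\\')
p unescape_once(raw)
```

Unknown sequences (for example \q) are left untouched by this version, which seems the least surprising behaviour, but that is a design choice rather than something the original answer specifies.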
You could just hack them back in.
foo = f.gsub("\r\n", "\\r\\n")
#=> "First Line\\r\\nSecond Line"
foo.count("\\")
#=> 2

Better way to parse "Description (tag)" to "Description, tag"

I have a text file with many 1000s of lines like this, which are category descriptions with the keyword enclosed in parentheses
Chemicals (chem)
Electrical (elec)
I need to convert these lines to comma separated values like so:
Chemicals, chem
Electrical, elec
What I am using is this:
lines = line.gsub!('(', ',').gsub!(')', '').split(',')
I would like to know if there is a better way to do this.
For posterity, this is the full code (based on the answers):
require 'rubygems'
require 'csv'
csvfile = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
f.readlines.each do |line|
(desc, cat) = line.split('(')
desc.strip!
cat.strip!
csvfile << [desc, cat[0,cat.length-1]]
end
end
Try something like this:
line.sub!(/ \((\w+)\)$/, ', \1')
The \1 is replaced with the first capture group of the given regexp (in this case always the category keyword). So it basically replaces " (chem)" with ", chem".
Let's create an example using a text file:
lines = []
File.open('categories.txt', 'r') do |file|
while line = file.gets
lines << line.sub(/ \((\w+)\)$/, ', \1')
end
end
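The same capture-group idea can also return the two fields separately instead of rewriting the string, which makes it easy to skip lines that do not fit the pattern. A minimal sketch, assuming each valid line ends with a single word-like tag in parentheses; split_category is a hypothetical helper name:

```ruby
# Hypothetical helper: "Description (tag)" -> ["Description", "tag"],
# or nil when the line does not end with a parenthesised tag.
def split_category(line)
  m = line.match(/\A(.+?)\s*\((\w+)\)\s*\z/)
  m && [m[1], m[2]]
end

lines = ["Chemicals (chem)\n", "Dyes & Intermediates (dyes)\n", "no tag here\n"]
rows = lines.map { |line| split_category(line) }.compact
rows.each { |desc, tag| puts "#{desc}, #{tag}" }
```

Descriptions containing spaces are handled because the first group is only required to stop where the trailing "(tag)" begins.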
Based on the question updates I can propose this:
require 'csv'
csv_file = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
f.each_line {|c| csv_file << c.scan(/^(.+) \((\w+)\)$/).flatten } # scan returns [[desc, tag]]; flatten gives one row
end
csv_file.close
Starting with Ruby 1.9, you can do it in one method call:
str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
')' => ''}
str.gsub(/ \(|\)/, mapping) #=> "Chemicals, chem\n"
In Ruby, a cleaner, more efficient, way to do it would be:
description, tag = line.split(' ', 2) # split(' ', 2) returns a 2-element array:
# all characters up to the first space, and all characters after it. We can then use
# multiple assignment to store each array element in its own local variable
tag = tag[1, (tag.length - 1) - 1] # extract the inside characters (not first or last) of the string
new_line = description << ", " << tag # rejoin the parts into a new string
This will be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions.
No need to manipulate the string. Just grab the data and output it to the CSV file.
Assuming that you have something like this in the data:
Chemicals (chem)
Electrical (elec)
Dyes & Intermediates (dyes)
This should work:
File.open('categories.txt', 'r') do |file|
file.each_line do |line|
csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
end
end
Benchmarks relevant to the discussion in @hundredwatt's answer:
require 'benchmark'
line = "Chemicals (chem)"
# @hundredwatt
puts Benchmark.measure {
100000.times do
description, tag = line.split(' ', 2)
tag = tag[1, (tag.length - 1) - 1]
new_line = description << ", " << tag
end
} # => 0.18
# NeX
puts Benchmark.measure {
100000.times do
line.sub!(/ \((\w+)\)$/, ', \1')
end
} # => 0.08
# steenslag
mapping = { ' (' => ', ',
')' => ''}
puts Benchmark.measure {
100000.times do
line.gsub(/ \(|\)/, mapping)
end
} # => 0.08
I know nothing about Ruby, but it is easy in PHP:
preg_match('~(.+) \((.+)\)~', 'Chemicals (chem)', $m);
$result = $m[1] . ', ' . $m[2];

Search box with Ruby Shoes

I have a short script that uses regular expressions to search a file for a specific phrase that a user types in. Basically, it's a simple search box.
I'm now trying to make this search box have a GUI, so that users are able to type into a box, and have their matches 'alerted' to them.
I'm new to using ruby shoes in any great detail, and have been using the examples on TheShoeBox website.
Can anyone point out where I'm going wrong with my code?
Here is my command line version that works:
string = File.read('db.txt')
puts "Enter what you're looking for below"
begin
while(true)
break if string.empty?
print "Search> "; STDOUT.flush; phrase = gets.chop
break if phrase.empty?
names = string.split(/\n/)
matches = names.select { |name| name[/#{phrase}/i] }
puts "\n \n"
puts matches
puts "\n \n"
end
end
Here is my attempt at using it within Ruby Shoes:
Shoes.app :title => "Search v0.1", :width => 300, :height => 150 do
string = File.read('db.txt')
names = string.split(/\n/)
matches = names.select { |name| name[/#{phrase}/i] }
def search(text)
text.tr! "A-Za-z", "N-ZA-Mn-za-m"
end
@usage = <<USAGE
Search - This will search for the inputted text within the database
USAGE
stack :margin => 10 do
para @usage
@input = edit_box :width => 200
end
flow :margin => 10 do
button('Search') { @output.matches }
end
stack(:margin => 0) { @output = para }
end
Many thanks
Well, for starters, the first code bit can be neatened up.
file = File.open 'db.txt', 'rb'
puts "Enter (regex) search term or quit:"
exit 1 unless file.size > 0
loop do
puts
print "query> "
redo if ( query = gets.chomp ).empty?
exit 0 if query == "quit"
file.each_line do |line|
puts "#{file.lineno}: #{line}" if line =~ /#{query}/i
end
file.rewind
end
The rb option lets it work as expected on Windows (especially with Shoes, you should try to be platform-independent). chomp strips a trailing \r\n or \n but leaves any other final character (an "a", for example) alone, while chop just blindly removes the last character. loop do ... end is nicer than while true. Also, why store the matches in a variable? Just read through the file line by line (which copes with CRLF endings) instead of splitting on \n, although the residual \r wouldn't really pose much of a problem...
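A quick sketch of the chomp/chop difference on a few sample inputs (the strings are only for illustration):

```ruby
samples = ["quit\r\n", "quit\n", "quit"]

samples.each do |s|
  # chomp only removes a trailing line ending; chop always removes something
  puts "#{s.inspect}: chomp=#{s.chomp.inspect} chop=#{s.chop.inspect}"
end
```

One nuance: Ruby's chop removes a trailing \r\n pair as a unit, so the two methods only disagree when the input has no line ending at all, which is exactly the case where chop eats a real character.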
As for the Shoes bit:
Shoes.app :title => "Search v0.2", :width => 500, :height => 600 do
@file = File.open 'db.txt', 'rb'
def search( file, query )
file.rewind
file.select {|line| line =~ /#{query}/i }.map {|match| match.chomp }
end
stack :margin => 10 do
@input = edit_line :width => 400
button "search" do
matches = search( @file, @input.text )
@output.clear
@output.append do
matches.empty? ?
title( "Nothing found :(" ) :
title( "Results\n" )
end
matches.each do |match|
@output.append { para match }
end
end
@output = stack { title "Search for something." }
end
end
You never defined @output.matches or called your search() method. See if it makes sense now.
