Parsing PDF removing month - ruby

I'm parsing a pdf that has some dates by splitting the lines and then searching them. The following are example lines:
Posted Date: 02/11/2015
Effective Date: 02/05/2015
When I find Posted Date, I split on the : and pull out 02/11/2015. But when I do the same for effective date, it only returns /05/2015. When I write all lines, it displays that date as /05/2015 while the PDF has the 02. Would 02 be converted to nil for some reason? Am I missing something?
lines = reader.pages[0].text.split(/\r?\n/)
lines.each_with_index do |line, index|
values_to_insert = []
if line.include? "Legal Name:"
name_line = line.split(":")
values_to_insert.push(name_line[1])
end
if line.include? "Active/Pending Insurance"
topLine = lines[index+2].split(" ")
middleLine = lines[index+5].split(" ")
insuranceLine = lines[index + 7]
insurance_line_split = insuranceLine.split(" ")
insurance_line_split.each_with_index do |word, i|
if word.include? "Insurance"
values_to_insert.push(insuranceLine.split(":")[1])
end
end
topLine.each_with_index do |word, i|
if word.include? "Posted"
values_to_insert.push(topLine[i + 2])
end
end
middleLine.each_with_index do |word, i|
if word.include? "Effective" or word.include? "Cancellation"
#puts middleLine[0]
puts middleLine[1]
#puts middleLine[i + 1].split(":")[1]
end
end
end
end
Here is what happens when I print all lines:
Active/Pending Insurance:
Form: 91X Type: BIPD/Primary Posted Date: 02/11
/2015
Policy/Surety Number:A 3491819 Coverage From: $0
To: $1,000,000
Effective Date:/05/2015 Cancellation Date:
Insurance Carrier: PROGRESSIVE EXPRESS INSURANCE COMPANY
Attn: CUSTOMER SERVICE
Address: P. O. BOX 94739
CLEVELAND, OH 44101 US
Telephone: (800) 444 - 4487 Fax: (440) 603 - 4555
Edited to show the code and even add a picture. I'm splitting by lines and then splitting again on colons and sometimes spaces. It's not amazingly clean but I don't think there's a much better way.

The problem occurs at positions where multiple pieces of text are on the same line but don't use exactly the same baseline. In the case of the PDF at hand, (at least) the policy number and the effective date are positioned slightly higher than their respective labels.
The cause for this is the way the pdf-reader library used by the OP brings together the text pieces drawn on the page:
It determines a number of columns and rows to arrange the letters in and
creates an array of the rows number of strings filled with the columns number of spaces.
It then combines consecutive text pieces from the PDF on exactly the same base line and
finally puts these combined text pieces into the string array starting from the position best matching their starting position in the PDF.
As fonts used in PDFs usually are not monospaced, this procedure can result in overlapping strings, i.e. erasure of one of the two. The step combining strings on the same baseline prevents erasure in that case, but for strings on slightly different base lines, this overlapping effect can still occur.
What one can do is increase the number of columns used here.
The library in page_layout.rb defines
def col_count
@col_count ||= ((@page_width / @mean_glyph_width) * 1.05).floor
end
As you see there already is some magic number 1.05 in use to slightly increase the number of columns. By increasing this number even more, no erasures as observed by the OP should occur anymore. One should not increase the factor too much, though, because that can introduce unwanted space characters where none belong.
The OP reported that increasing the magic number to 1.10 sufficed in his case.
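The overlap effect can be illustrated with a toy model. The sketch below is not pdf-reader's actual code: render_row, the x coordinates, the page width and the glyph width are all made-up values, chosen so that the date is drawn before its label and the two land on neighbouring columns, which mirrors the erasure the OP saw.

```ruby
# Toy model of a fixed-width text layout; NOT the pdf-reader implementation.
# pieces is a list of [x_position, text] pairs in drawing order (hypothetical data).
def render_row(pieces, page_width:, mean_glyph_width:, factor:)
  col_count = ((page_width / mean_glyph_width) * factor).floor
  row = " " * col_count
  pieces.each do |x, text|
    col = (x / page_width * col_count).round
    row[col, text.length] = text # a later piece overwrites an earlier one
  end
  row.strip
end

# Assumed coordinates: the date sits on a slightly different baseline,
# so it is not merged with its label and is placed on its own.
pieces = [[107.0, "02/05/2015"], [41.0, "Effective Date:"]]

narrow = render_row(pieces, page_width: 600.0, mean_glyph_width: 5.0, factor: 1.05)
wide   = render_row(pieces, page_width: 600.0, mean_glyph_width: 5.0, factor: 1.10)

puts narrow # => Effective Date:/05/2015
puts wide   # => Effective Date:02/05/2015
```

With the smaller factor the label ends on the columns where the date begins and wipes out its first two characters; with the larger factor the two pieces no longer collide.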


how do i make my code read random lines 37 different times?

def pick_random_line
chosen_line = nil
File.foreach("id'sForCascade.txt").each_with_index do |line, id|
chosen_line = line if rand < 1.0/(id+1)
end
return chosen_line
end
Hey, I'm trying to make that code pick 37 different lines. How would I do that? I'm stuck and confused.
Assuming you don't want the same line to repeat more than once, I would do it in one line like this:
File.read("test.txt").split("\n").shuffle.first(37)
File.read("test.txt") reads the entire file.
split("\n") splits the file into lines based on the \n delimiter (I assume your file is textual and has lines separated by newline characters).
shuffle is a very convenient method of Array that shuffles the lines randomly. You can read about it here:
http://docs.ruby-lang.org/en/2.0.0/Array.html#method-i-shuffle
Finally, first(37) gives you the first 37 lines out of the shuffled array. These are guaranteed to be random from the shuffle operation.
You can do something like this:
input_lines = File.foreach("test.txt").map(&:to_s)
output_lines = []
37.times do
output_lines << input_lines.delete_at(rand(input_lines.length))
end
puts output_lines
This will ensure that you aren't grabbing duplicate lines and you don't need to do any fancy checking.
However, if your file has fewer than 37 lines this may cause a problem, because delete_at then starts returning nil. It also assumes that your file exists.
EDIT:
What is happening is the rand call is now changing the range on which it is called based on the size of the input lines. And since you are deleting at an index when you take the line out, the length shrinks and you do not risk duplicating lines.
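The two caveats above (a short file, a missing file) can be handled with Array#sample, which never returns more elements than the array holds. This is only a sketch: pick_random_lines is a hypothetical helper name, and the temporary file stands in for the real input. Note that sample picks distinct positions, so a file containing the same text on two lines could still yield both.

```ruby
require "tempfile"

# Hypothetical helper: returns up to n lines taken from distinct positions
# in the file at path, or an empty array if the file does not exist.
def pick_random_lines(path, n)
  return [] unless File.exist?(path)

  File.readlines(path, chomp: true).sample(n)
end

# Stand-in input file with 50 distinct lines.
demo = Tempfile.new("ids")
demo.write((1..50).map { |i| "id-#{i}" }.join("\n"))
demo.close

picked  = pick_random_lines(demo.path, 37)
short   = pick_random_lines(demo.path, 80) # asks for more than the file has
missing = pick_random_lines(demo.path + ".nope", 37)

puts picked.length
puts short.length
```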
If you want to save relatively few lines from a large file, reading the entire file into an array (and then randomly selecting lines) could be costly. It might be better to count the number of lines in the file, randomly select line offsets and then save the lines at those offsets to an array. This approach is no more difficult to implement than the former one, but makes the method more robust, even if the files in the current application are not overly large.1
Suppose your filename were given by FName. Here are three ways to count the number of lines in the file:
Count lines, literally
cnt = File.foreach(FName).reduce(0) { |c,_| c+1 }
Use $.
File.foreach(FName) {}
cnt = $.
On Unix-family computers, shell-out to the operating system
cnt = %x{wc -l #{FName}}.split.first.to_i
The third option is very fast.
Random offsets (base 1) for n lines to be saved could be computed as follows:
lines = (1..cnt).to_a.sample(n).sort
Saving the lines at those offsets to an array is straightforward; for example:
File.foreach(FName).with_object([]) do |line,a|
if lines.first == $.
a << line
lines.shift
break a if lines.empty?
end
end
Note that $. #=> 1 after the first line is read, and $. is incremented by 1 after each successive line is read. (Hence base 1 for line offsets.)
1 Moreover, many programmers, not just Rubyists, are repelled by the idea of amassing large numbers of anything and then discarding all but a few.
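Another way to avoid loading the whole file is to generalise the asker's own pick_random_line, which keeps one line with probability 1/(id+1), into a reservoir of k lines filled in a single pass. This is a sketch of that swapped-in technique (reservoir sampling), not something the answers above propose; reservoir_sample_lines and the temporary file are hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: keeps k lines in a "reservoir" while streaming the file,
# so the whole file never has to be held in memory.
def reservoir_sample_lines(path, k)
  reservoir = []
  File.foreach(path).with_index do |line, i|
    if i < k
      reservoir << line.chomp            # fill the reservoir first
    else
      j = rand(i + 1)                    # random slot in 0..i
      reservoir[j] = line.chomp if j < k # replace with probability k/(i+1)
    end
  end
  reservoir
end

# Stand-in input file with 200 distinct lines.
demo = Tempfile.new("ids")
demo.write((1..200).map { |i| "id-#{i}\n" }.join)
demo.close

sampled = reservoir_sample_lines(demo.path, 37)
puts sampled.length
```

Unlike the offset approach, this needs no line count up front, at the price of reading every line once.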

How to best wrap Ruby optparse code and output?

For the following code, which according to the style guide should be wrapped at 80 chars:
opts.on('--scores_min <uint>', Integer, 'Drop reads if a single position in ',
'the index have a quality score ',
'below scores_main (default= ',
"#{DEFAULT_SCORE_MIN})") do |o|
options[:scores_min] = o
end
The resulting output is:
--scores_min <uint> Drop reads if a single position in
the index have a quality score
below scores_main (default=
16)
Which wraps at 72 chars and looks wrong :o(
I really want it wrapped at 80 chars and aligned like this:
--scores_min <uint> Drop reads if a single position in the
index have a quality score below
scores_min (default=16)
How can this be achieved in a clever way?
The easiest solution in this case is to stack parameters like this:
opts.on('--scores_min <uint>',
Integer,
"Drop reads if a single position in the ",
"index have a quality score below ",
"scores_min (default= #{DEFAULT_SCORE_MIN})") do |o|
options[:scores_min] = o
end
That results in a fairly pleasant output:
--scores_min <uint> Drop reads if a single position in the
index have a quality score below
scores_min (default= 16)
More generally, heredocs can make it easier to format output strings in a way that looks good both in the code and in the output:
# Deeply nested code
puts <<~EOT
Drop reads if a single position in the
index have a quality score below
scores_min (default= #{DEFAULT_SCORE_MIN})
EOT
But in this case it doesn't work so well since the description string is indented automatically.
So I think the solution is to follow the Ruby Style Guide:
When using heredocs for multi-line strings keep in mind the fact that
they preserve leading whitespace. It's a good practice to employ some
margin based on which to trim the excessive whitespace.
code = <<-END.gsub(/^\s+\|/, '')
|def test
| some_method
| other_method
|end
END
# => "def test\n some_method\n other_method\nend\n"
[EDIT] In Ruby 2.3 you can do (same ref):
code = <<~END
def test
some_method
other_method
end
END
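Tying the heredoc idea back to optparse: since opts.on accepts several description strings, one possible approach is to wrap the text by hand in a squiggly heredoc and splat its lines into the call. This is a sketch assuming Ruby 2.3 or later; the variable names scores_min_desc, parser and help_text are made up for the example.

```ruby
require "optparse"

DEFAULT_SCORE_MIN = 16
options = { scores_min: DEFAULT_SCORE_MIN }

# The squiggly heredoc strips the common indentation, so the text can be
# wrapped by hand exactly as it should appear in the help output.
scores_min_desc = <<~EOT
  Drop reads if a single position in the
  index have a quality score below
  scores_min (default=#{DEFAULT_SCORE_MIN})
EOT

parser = OptionParser.new do |opts|
  # Each heredoc line is passed as its own description argument.
  opts.on("--scores_min <uint>", Integer,
          *scores_min_desc.lines.map(&:chomp)) do |o|
    options[:scores_min] = o
  end
end

help_text = parser.help
puts help_text

# Parse a sample command line to show the option still works.
parser.parse(%w[--scores_min 20])
puts options[:scores_min]
```

The code stays within 80 columns, and the help text is wrapped where the heredoc lines break rather than where the source code happens to.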

Ruby Delete From Array On Criteria

I'm just learning Ruby and have been tackling small code projects to accelerate the process.
What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete the words from the array that are less than 5 characters long. Then where the stdout is at the bottom, I'm intending to use the array. My code currently works, but is very very slow since it has to read the entire file, then individually check each element and delete the appropriate ones. This seems like it's doing too much work.
goal = File.read('big.txt').split(/\s/).map do |word|
word.scan(/[[:alpha:]]+/).uniq
end
goal.each { |word|
if word.length < 5
goal.delete(word)
end
}
puts goal.sample
Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.
You might want to change your regex instead so that it only catches words of at least 5 characters to begin with:
goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
word.scan(/[[:alpha:]]{5,}/).uniq
end
A further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:
require 'set'
goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{5,}\b/).each do |w|
goal << w
end
In this case, use the delete_if method, where goal is your array:
goal.delete_if{|w|w.length < 5}
This removes the words shorter than 5 characters from the array in place and returns that same array, not a new one.
Hope this helps.
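To make the difference concrete, here is a small sketch with a made-up word list. It contrasts reject (new array), delete_if (in place), and the delete-inside-each pattern from the question, which silently skips elements because the array shrinks during iteration.

```ruby
words = %w[tree apple sky banana cat orange]

# reject leaves the receiver alone and returns a new array.
kept = words.reject { |w| w.length < 5 }

# delete_if mutates the receiver and returns that same object.
returned = words.delete_if { |w| w.length < 5 }

# Deleting inside each skips elements, because the array shrinks
# underneath the iteration: some short words survive.
buggy = %w[a b c d]
buggy.each { |w| buggy.delete(w) if w.length < 5 }

p kept
p returned.equal?(words)
p buggy
```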
I really don't understand what a lot of the stuff you are doing in the first loop is for.
You take every chunk of text separated by white space, and map it to a unique value in an array generated by chunking together groups of letter characters, and plug that into an array.
This is way too complicated for what you want. Try this:
goal = File.readlines('big.txt').map(&:chomp).select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5
end
This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:
goal = File.readlines('big.txt').map(&:chomp).select do |word|
word =~ /^[a-zA-Z]+$/ &&
word.length >= 5 &&
!word.upcase.include?('Q')
end
This assumes that each word in your dictionary is on its own line. You could go back to splitting it on white space, but it makes me wonder if the file you are reading in is written, human-readable text; a.k.a, it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace will not work.
Another note - map is the wrong array function to use. It transforms each value of one array and builds a new array out of the results. You want to select certain values from an array, but not modify them. The Array#select method is what you want.
Also, feel free to switch the regex back to the [[:alpha:]] character class if you are expecting non-ASCII letters.
Edit: Second version
goal = File.read('big.txt').scan(/[a-z][a-z']{4,}/i)
Explanation: Load the file and capture all occurrences of a group of letters, at least 5 long and possibly containing but not starting with a '. String#scan returns all those occurrences as an array.
This works well, and it's only one line for your whole task, but it'll match
sugar'
in
I'd like some 'sugar', if you know what I mean
Like above, if your word can't contain q or Q, you could change the regex to
/([a-pr-z][a-pr-z']{4,})[ .'",]/i
And an idea - do another select on goal, removing all those entries that end with a '. This overcomes the limitations of my Regex
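That clean-up step could look like the following sketch, which uses a made-up sentence in place of the file contents and String#scan to collect the matches.

```ruby
text = "I'd like some 'sugar', they're happy"

# Letters and apostrophes, at least 5 characters, not starting with an apostrophe.
goal = text.scan(/[a-z][a-z']{4,}/i)

# Drop the entries that picked up a closing quote.
cleaned = goal.reject { |w| w.end_with?("'") }

p goal
p cleaned
```

Here "sugar'" is matched with its closing quote and then removed, while "they're" survives because its apostrophe is in the middle of the word.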

How does this Ruby app know to select the middle third of sentences?

I am currently following Beginning Ruby by Peter Cooper and have put together my first app, a text analyzer. However, whilst I understand all of the concepts and the way in which they work, I can't for the life of me understand how the app knows to select the middle third of sentences sorted by length from this line:
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
I have included the whole app for context. Any help is much appreciated, as so far everything else is making sense.
#analyzer.rb --Text Analyzer
stopwords = %w{the a by on for of are with just but and to the my I has some in do}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join
#Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
#Count the words, sentances, and paragraphs
word_count = text.split.length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
#Make a list of words in the text that aren't stop words,
#count them, and work out the percentage of non-stop words
#against all words
all_words = text.scan(/\w+/)
good_words = all_words.select {|word| !stopwords.include?(word)}
good_percentage = ((good_words.length.to_f / all_words.length.to_f)*100).to_i
#Summarize the text by cherry picking some choice sentances
sentances = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentances_sorted = sentences.sort_by { |sentence| sentance.length }
one_third = sentences_sorted.length / 3
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentances = ideal_sentences.select{ |sentence| sentence =~ /is|are/ }
#Give analysis back to user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis."
Obviously I am a beginner so plain English would help enormously, cheers.
It gets a third of the number of sentences with one_third = sentences_sorted.length / 3. Then the line you posted, ideal_sentances = sentences_sorted.slice(one_third, one_third + 1), says "grab a slice of all the sentences, starting at the index equal to one third of the count and continuing for one third of the count plus 1 elements".
Make sense?
If you look up the slice method in the Ruby API, it says this:
If passed two Fixnum objects, returns a substring starting at the
offset given by the first, and a length given by the second.
This means that if you have a sentence broken into three pieces
ONE | TWO | THREE
slice(1/3, 1/3+1)
will return the string starting at 1/3 from the beginning
| TWO | THREE (this is what you are looking at now)
then you return the string that is 1/3+1 distance from where you are, which gives you
| TWO |
sentences is a list of all sentences. sentances_sorted is that list sorted by sentence length, so the middle third will be the sentences with the most average length. slice() grabs that middle third of the list, starting from the position represented by one_third and counting one_third + 1 from that point.
Note that the correct spelling is 'sentence' and not 'sentance'. I mention this only because you have some code errors that result from spelling it inconsistently.
I was stuck on this when I first started too. In plain English, you have to realize that the slice method can take 2 parameters here.
The first is the index. The second is how long slice goes for.
So let's say you start off with 6 sentences.
one_third = 2
slice(one_third, one_third+1)
1/3 of 6 is 2.
1) here one_third is 2, so you start at index 2, which is the third element (indexes start at 0)
2) then it takes one_third + 1 elements, so 2 + 1 = 3 elements in total
so it is affecting index 2 to index 4
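The six-sentence case can be checked directly. In this sketch the strings are just stand-ins for sentences that are already sorted by length.

```ruby
sentences_sorted = %w[a bb ccc dddd eeeee ffffff] # stand-ins, sorted by length
one_third = sentences_sorted.length / 3           # 6 / 3 = 2

# slice(start, length): start at index 2 and take 3 elements.
middle = sentences_sorted.slice(one_third, one_third + 1)

p middle # => ["ccc", "dddd", "eeeee"]
```

The shortest two and the longest one are left out, so what remains is the middle of the length-sorted list.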

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
require 'amatch'
module FuzzyFinder
def scanner( input )
out = [] unless block_given?
pos = 0
input.scan(/(\w+)(\W*)/) do |word, white|
startpos = pos
pos += word.length + white.length
if block_given?
yield startpos, word
else
out << [startpos, word]
end
end
out
end
def find( text, doc )
index = scanner(doc)
sstr = text.gsub(/\W/,'')
levenshtein = Amatch::Levenshtein.new(sstr)
minlen = sstr.length
maxndx = index.length
possibles = []
minscore = minlen*2
index.each_with_index do |x, i|
spos = x[0]
str = x[1]
si = i
while (str.length < minlen)
i += 1
break unless i < maxndx
str += index[i][1]
end
str = str.slice(0,minlen) if (str.length > minlen)
score = levenshtein.search(str)
if score < minscore
possibles = [spos]
minscore = score
elsif score == minscore
possibles << spos
end
end
[minscore, possibles]
end
end
Obviously there are numerous improvements possible and probably necessary! A few off the top:
Process the document once and store
the results, possibly in a database.
Determine a usable length of string
for an initial check, process
against that initial substring first
before trying to match the entire
fragment.
Following up on the previous,
precalculate starting fragments of
that length.
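For comparison, here is a dependency-free sketch of the same goal: finding where a noisy fragment starts in a document. It does not use amatch; instead it uses a standard dynamic-programming variant of Levenshtein distance in which the match may begin at any position of the document. fuzzy_index is a hypothetical name, and the running time grows with document length times fragment length, so for fragments of several pages it would likely need the optimisations listed above.

```ruby
# Hypothetical helper: returns the smallest edit distance between pattern and
# any substring of document, plus the index where that substring starts.
def fuzzy_index(document, pattern)
  m = pattern.length
  # prev[j] = [cost, start] for matching pattern[0, j] against a substring
  # of document that ends at the current position.
  prev = Array.new(m + 1) { |j| [j, 0] }
  best = prev[m]

  document.each_char.with_index do |ch, i|
    curr = [[0, i + 1]] # an empty match may start right after this character
    (1..m).each do |j|
      sub_cost = ch == pattern[j - 1] ? 0 : 1
      candidates = [
        [prev[j - 1][0] + sub_cost, prev[j - 1][1]], # match or substitution
        [prev[j][0] + 1, prev[j][1]],                # extra char in document
        [curr[j - 1][0] + 1, curr[j - 1][1]]         # extra char in pattern
      ]
      curr << candidates.min_by(&:first)
    end
    best = curr[m] if curr[m][0] < best[0]
    prev = curr
  end

  { distance: best[0], index: best[1] }
end

document = "Ulputat non nullandigna tortor dolessi illam sectem laor acipsus."
result = fuzzy_index(document, "tortor - dolessi illam")
p result
```

The fragment contains two stray characters, so the reported distance is 2, and the index points at the start of "tortor" in the document.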
A simple one is fuzzy_match
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborate one (you wouldn't guess it from this example, though) is levenshtein, which computes the number of differences.
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z], you can parse the substring and then use Regexp#match on the document:
document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam
(Here, as we do not set any parentheses in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.)
If you are only interested in the start position, you can use =~ operator:
start_pos = document =~ re
I have used none of them, but I found some libraries just by doing a search for 'diff' on rubygems.org. All of them can be installed with gem. You might want to try them. I am interested myself, so if you already know these or if you try them out, it would be helpful if you left a comment.
diff
diff-lcs
differ
difflcs
pretty_diff
diffy
kronk
khtmldiff
gdiff
ruby_diff
tdiff
diffrenderer
diffplex
dbdiff
diff_dirs
rsyncdiff
wdiff
diff4all
davidtrogers-htmldiff
edouard-htmldiff
diff2xml
dirdiff
rrdiff
nokogiri-diff
pretty-diff
easy_diff
smartdiff
