How can I extract the meaning of a paragraph? - ruby

I need to develop method that extracts the meaning from a string for a record in a database. Here is an example of the a string:
MyString = "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
Given the string, I need to process it in such a way that I can create a race_record:
race_record[:purse] = 75000
race_record[:race_type] = "Maidens"
race_record[:sex] = "Fillies And Mares"
race_record[:age] = "Three Year Old And Upward"
race_record[:distance] = "One And One Eighth Miles"
race_record[:surface] = "inner turf"
I was planning on to use ruby and a series of regular expressions to extract the data. For example:
race_record[:purse] = Mystring.scan(/(?<=\Purse\s[$])(.*?)(?=\.)/)
race_record[:race_type] = Mystring.sub(....)
etc.
My question isn't so much what the correct regular expressions are. Given the objective, is the approach I proposed the right way to go, or is there a better approach or even a gem that can do the heavy lifting?

You could use one regex to extract all the relevant parts into capturing groups at once;
regexp =
/Purse\s\$ # Leading text
([\d,]+) # Group 1
.*?For\s # Intervening text
(\w+) # Group 2
,\s # Intervening text
(\w+\sAnd\s\w+) # Group 3, etc. etc.
\s
([^.]*)
\.[^;]*;[^.]*\.\s
([^.]*)
\.\s\(
([^()]*)
\)/x
Then you can do
irb(main):025:0> match = regexp.match(mystring)
=> #<MatchData "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
1:"75,000" 2:"Maidens" 3:"Fillies And Mares" 4:"Three Years Old And Upward"
5:"One And One Eighth Miles" 6:"Inner turf">
irb(main):026:0> match[1]
=> "75,000"
irb(main):027:0> match[2]
=> "Maidens"
...etc.

If your input is fairly structured, i.e. it has a specific and know grammar, you could build a 'parser' to parse the grammar.
In the old days, we'd do this with yacc and lex, two old unix tools used to build compilers. Yacc and Lex have Ruby implementations. While the original intent was to output lower level code (such as machine assembly codes when building a real compiler), there is nothing that prevents you from calling any ruby code when a specific grammatical construct has been recognized by your parser.
NOTE: even though there is a Yacc/lex Ruby gem out there, I wouldn't say it will 'DO THE HEAVY LIFTING', learning yacc and lex has a small learning curve. Using something like yacc/lex would make your life easier in the long run, especially if you have a large grammar and must constantly adjust it.

Related

Python3 Make tie-breaking lambda sort more pythonic?

As an exercise in python lambdas (just so I can learn how to use them more properly) I gave myself an assignment to sort some strings based on something other than their natural string order.
I scraped apache for version number strings and then came up with a lambda to sort them based on numbers I extracted with regexes. It works, but I think it can be better I just don't know how to improve it so it's more robust.
from lxml import html
import requests
import re
# Send GET request to page and parse it into a list of html links
jmeter_archive_url='https://archive.apache.org/dist/jmeter/binaries/'
jmeter_archive_get=requests.get(url=jmeter_archive_url)
page_tree=html.fromstring(jmeter_archive_get.text)
list_of_links=page_tree.xpath('//a[#href]/text()')
# Filter out all the non-md5s. There are a lot of links, and ultimately
# it's more data than needed for his exercise
jmeter_md5_list=list(filter(lambda x: x.endswith('.tgz.md5'), list_of_links))
# Here's where the 'magic' happens. We use two different regexes to rip the first
# and then the second number out of the string and turn them into integers. We
# then return them in the order we grabbed them, allowing us to tie break.
jmeter_md5_list.sort(key=lambda val: (int(re.search('(\d+)\.\d+', val).group(1)), int(re.search('\d+\.(\d+)', val).group(1))))
print(jmeter_md5_list)
This does have the desired effect, The output is:
['jakarta-jmeter-2.5.1.tgz.md5', 'apache-jmeter-2.6.tgz.md5', 'apache-jmeter-2.7.tgz.md5', 'apache-jmeter-2.8.tgz.md5', 'apache-jmeter-2.9.tgz.md5', 'apache-jmeter-2.10.tgz.md5', 'apache-jmeter-2.11.tgz.md5', 'apache-jmeter-2.12.tgz.md5', 'apache-jmeter-2.13.tgz.md5']
So we can see that the strings are sorted into an order that makes sense. Lowest version first and highest version last. Immediate problems that I see with my solution are two-fold.
First, we have to create two different regexes to get the numbers we want instead of just capturing groups 1 and 2. Mainly because I know there are no multiline lambdas, I don't know how to reuse a single regex object instead of creating a second.
Secondly, this only works as long as the version numbers are two numbers separated by a single period. The first element is 2.5.1, which is sorted into the correct place but the current method wouldn't know how to tie break for 2.5.2, or 2.5.3, or for any string with an arbitrary number of version points.
So it works, but there's got to be a better way to do it. How can I improve this?
This is not a full answer, but it will get you far along the road to one.
The return value of the key function can be a tuple, and tuples sort naturally. You want the output from the key function to be:
((2, 5, 1), 'jakarta-jmeter')
((2, 6), 'apache-jmeter')
etc.
Do note that this is a poor use case for a lambda regardless.
Originally, I came up with this:
jmeter_md5_list.sort(key=lambda val: list(map(int, re.compile('(\d+(?!$))').findall(val))))
However, based on Ignacio Vazquez-Abrams's answer, I made the following changes.
def sortable_key_from_string(value):
version_tuple = tuple(map(int, re.compile('(\d+(?!$))').findall(value)))
match = re.match('^(\D+)', value)
version_name = ''
if match:
version_name = match.group(1)
return (version_tuple, version_name)
and this:
jmeter_md5_list.sort(key = lambda val: sortable_key_from_string(val))

Regex causing high CPU load, causing Rails to not respond

I have a Ruby 1.8.7 script to parse iOS localization files:
singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm
out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
It used to work fine, but today we tested it with a file that is 800Kb, and that doesn't have ; at the end of each line. The result was a high CPU load and no response from the Rails server. My assumption is that it took the whole file as a single string in the capturing group and that blocked the server.
The solution was to add ? (regex quantificator, 0 or 1 time) to the ; literal character:
/\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm
Now it works fine again even with those files in the old iOS format, but my fear now is, what if a user submits a malformed file, like one with no ending ". Will my server get blocked again?
And how do I prevent this? Is there any way to try to run this only for five seconds? What I can I do to avoid halting my whole Rails application?
It looks like you're trying to parse an entire configuration as if it was a string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can end up wasting a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the higher the chance of something happening that will mess you up.
From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.
I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.
Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
If I had a string like:
config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]
command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h
command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"
slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.
The same data could be parsed.
Of course, without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is, break down your input into small manageable chunks and go from there.
As a comment on how you're defining your patterns.
Instead of using /\/.../ (which is known as "leaning-toothpicks syndrome") use %r which allows you to define a different delimiter:
singleline_comment = /\/\/(.*)$/ # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$# # => /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m
The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.
You can even have Regexp help you by escaping things for you:
NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'
Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m
Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
As far as halting your Rails app, don't try to process the files in the same Ruby process. Run a separate job that watches for the files and process them and store whatever you're looking for to be accessed as needed later. That way your server will continue to respond rather than lock up. I wouldn't do it in a thread, but would write a separate Ruby script that looks for incoming data, and if nothing is found, sleeps for some interval of time then looks again. Ruby's sleep method will help with that, or you could use the cron capability of your OS.

Ruby regular expression for asterisks/underscore to strong/em?

As part of a chat app I'm writing, I need to use regular expressions to match asterisks and underscores in chat messages and turn them into <strong> and <em> tags. Since I'm terrible with regex, I'm really stuck here. Ideally, we would have it set up such that:
One to three words, but not more, can be marked for strong/em.
Patterns such as "un*believ*able" would be matched.
Only one or the other (strong OR em) work within one line.
The above parameters are in order of importance, with only #1 being utterly necessary - the others are just prettiness. The closest I came to anything that worked was:
text = text.sub(/\*([(0-9a-zA-Z).*])\*/,'<b>\1<\/b>')
text = text.sub(/_([(0-9a-zA-Z).*])_/,'<i>\1<\/i>')
But it obviously doesn't work with any of our params.
It's odd that there's not an example of something similar already out there, given the popularity of using asterisks for bold and whatnot. If there is, I couldn't find it outside of plugins/gems (which won't work for this instance, as I really only need it in in one place in my model). Any help would be appreciated.
This should help you finish what you are doing:
sub(/\*(.*)\*/,'<b>\1</b>')
sub(/_(.*)_/,'<i>\1</i>')
Firstly, your criteria are a little strange, but, okay...
It seems that a possible algorithm for this would be to find the number of matches in a message, count them to see if there are less than 4, and then try to perform one set of substitutions.
strong_regexp = /\*([^\*]*)\*/
em_regexp = /_([^_]*)_/
def process(input)
if input ~= strong_regexp && input.match(strong_regexp).size < 4
input.sub strong_regexp, "<b>\1<\b>"
elsif input ~= em_regexp && intput.match(em_regexp).size < 4
input.sub em_regexp, "<i>\1<\i>"
end
end
Your specifications aren't entirely clear, but if you understand this, you can tweak it yourself.

Applying a diff-patch to a string/file

For an offline-capable smartphone app, I'm creating a one-way text sync for Xml files. I'd like my server to send the delta/difference (e.g. a GNU diff-patch) to the target device.
This is the plan:
Time = 0
Server: has version_1 of Xml file (~800 kiB)
Client: has version_1 of Xml file (~800 kiB)
Time = 1
Server: has version_1 and version_2 of Xml file (each ~800 kiB)
computes delta of these versions (=patch) (~10 kiB)
sends patch to Client (~10 kiB transferred)
Client: computes version_2 from version_1 and patch <= this is the problem =>
Is there a Ruby library that can do this last step to apply a text patch to files/strings? The patch can be formatted as required by the library.
Thanks for your help!
(I'm using the Rhodes Cross-Platform Framework, which uses Ruby as programming language.)
Your first task is to choose a patch format. The hardest format for humans to read (IMHO) turns out to be the easiest format for software to apply: the ed(1) script. You can start off with a simple /usr/bin/diff -e old.xml new.xml to generate the patches; diff(1) will produce line-oriented patches but that should be fine to start with. The ed format looks like this:
36a
<tr><td class="eg" style="background: #182349;"> </td><td><tt>#182349</tt></td></tr>
.
34c
<tr><td class="eg" style="background: #66ccff;"> </td><td><tt>#xxxxxx</tt></td></tr>
.
20,23d
The numbers are line numbers, line number ranges are separated with commas. Then there are three single letter commands:
a: add the next block of text at this position.
c: change the text at this position to the following block. This is equivalent to a d followed by an a command.
d: delete these lines.
You'll also notice that the line numbers in the patch go from the bottom up so you don't have to worry about changes messing up the lines numbers in subsequent chunks of the patch. The actual chunks of text to be added or changed follow the commands as a sequence of lines terminated by a line with a single period (i.e. /^\.$/ or patch_line == '.' depending on your preference). In summary, the format looks like this:
[line-number-range][command]
[optional-argument-lines...]
[dot-terminator-if-there-are-arguments]
So, to apply an ed patch, all you need to do is load the target file into an array (one element per line), parse the patch using a simple state machine, call Array#insert to add new lines and Array#delete_at to remove them. Shouldn't take more than a couple dozen lines of Ruby to write the patcher and no library is needed.
If you can arrange your XML to come out like this:
<tag>
blah blah
</tag>
<other-tag x="y">
mumble mumble
</other>
rather than:
<tag>blah blah</tag><other-tag x="y">mumble mumble</other>
then the above simple line-oriented approach will work fine; the extra EOLs aren't going to cost much space so go for easy implementation to start.
There are Ruby libraries for producing diffs between two arrays (google "ruby algorithm::diff" to start). Combining a diff library with an XML parser will let you produce patches that are tag-based rather than line-based and this might suit you better. The important thing is the choice of patch formats, once you choose the ed format (and realize the wisdom of the patch working from the bottom to the top) then everything else pretty much falls into place with little effort.
I know this question is almost five years old, but I'm going to post an answer anyway. When searching for how to make and apply patches for strings in Ruby, even now, I was unable to find any resources that answer this question satisfactorily. For that reason, I'll show how I solved this problem in my application.
Making Patches
I'm assuming you're using Linux, or else have access to the program diff through Cygwin. In that case, you can use the excellent Diffy gem to create ed script patches:
patch_text = Diffy::Diff.new(old_text, new_text, :diff => "-e").to_s
Applying Patches
Applying patches is not quite as straightforward. I opted to write my own algorithm, ask for improvements in Code Review, and finally settle on using the code below. This code is identical to 200_success's answer except for one change to improve its correctness.
require 'stringio'
def self.apply_patch(old_text, patch)
text = old_text.split("\n")
patch = StringIO.new(patch)
current_line = 1
while patch_line = patch.gets
# Grab the command
m = %r{\A(?:(\d+))?(?:,(\d+))?([acd]|s/\.//)\Z}.match(patch_line)
raise ArgumentError.new("Invalid ed command: #{patch_line.chomp}") if m.nil?
first_line = (m[1] || current_line).to_i
last_line = (m[2] || first_line).to_i
command = m[3]
case command
when "s/.//"
(first_line..last_line).each { |i| text[i - 1].sub!(/./, '') }
else
if ['d', 'c'].include?(command)
text[first_line - 1 .. last_line - 1] = []
end
if ['a', 'c'].include?(command)
current_line = first_line - (command=='a' ? 0 : 1) # Adds are 0-indexed, but Changes and Deletes are 1-indexed
while (patch_line = patch.gets) && (patch_line.chomp! != '.') && (patch_line != '.')
text.insert(current_line, patch_line)
current_line += 1
end
end
end
end
text.join("\n")
end

Parsing text files in Ruby when the content isn't well formed

I'm trying to read files and create a hashmap of the contents, but I'm having trouble at the parsing step. An example of the text file is
put 3
returns 3
between
3
pargraphs 1
4
3
#foo 18
****** 2
The word becomes the key and the number is the value. Notice that the spacing is fairly erratic. The word isn't always a word (which doesn't get picked up by /\w+/) and the number associated with that word isn't always on the same line. This is why I'm calling it not well-formed. If there were one word and one number on one line, I could just split it, but unfortunately, this isn't the case. I'm trying to create a hashmap like this.
{"put"=>3, "#foo"=>18, "returns"=>3, "paragraphs"=>1, "******"=>2, "4"=>3, "between"=>3}
Coming from Java, it's fairly easy. Using Scanner I could just use scanner.next() for the next key and scanner.nextInt() for the number associated with it. I'm not quite sure how to do this in Ruby when it seems I have to use regular expressions for everything.
I'd recommend just using split, as in:
h = Hash[*s.split]
where s is your text (eg s = open('filename').read. Believe it or not, this will give you precisely what you're after.
EDIT: I realized you wanted the values as integers. You can add that as follows:
h.each{|k,v| h[k] = v.to_i}

Resources