Best way to split text into sentences avoiding acronyms clashes [closed] - pseudocode

Closed 8 years ago.
Given the following phrase
Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. and it
was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.
What is the best "pseudo-code" way to split it into sentences? Python or a similar language is also fine, since it reads much like pseudo-code.
My idea is to replace every occurrence of " a-zA-Z." (notice the leading space), ".a-zA-Z" and ".a-zA-Z." with its equivalent without the dot, so for example
" a."
" b."
" c."
" d."
" e."
" f."
...
and
".a."
".b."
".c."
".d."
".e."
".f."
...
and
" ab."
" ac."
" ad."
...
" ba."
" bc."
" bd."
...
The phrase should be nicely converted to the following
Ms Mary got to know her husband Mr Dave in her trip to USA and it
was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.
...or is my logic flawed somewhere?
To head off "what's your question" comments: I need to know the best way to split the example text into correct sentences without clashing with acronyms.
The answer can be explained in pseudo-code, Python, or another language that reads like pseudo-code. I want it to be language-agnostic so that anyone can implement it, regardless of the language they use.

All acronyms in the example are of the pattern Uppercase . or Uppercase lowercase .; none of the other -- regular -- occurrences of the full stop match this particular pattern.
So a simple RegEx can be used to remove the full stops. What's left after that can be split on the regular punctuation marks ., ! and ?. In JavaScript:
str2 = str.replace(/([A-Z][a-z]?)\./g, '$1');
or using a GREP flavor that does understand most common character classes:
str2 = str.replace(/(\u\l?)\./g, '$1');
This results directly in the output as shown.
Using a RegEx is straightforward (and easily expanded!), but the same pattern can be tested in other languages as well. In C, you can copy input to output and test only when seeing the . character:
#include <ctype.h>
#include <stdio.h>

int main (void)
{
    char input[] = "Ms. Mary got to know her husband Mr. Dave in her trip to "
        "U.S.A. and it was cool. Did you know Dave worked for Microsoft? Well "
        "he did. He was even part of Internet Explorer devs.";
    char output[sizeof input], *readptr, *writeptr;

    printf ("in: %s\n", input);
    readptr = input;
    writeptr = output;
    while (*readptr)
    {
        if (*readptr == '.')
        {
            /* drop a full stop that follows "X" or "Xx" (an abbreviation) */
            if ((readptr > input && isupper((unsigned char)readptr[-1])) ||
                (readptr > input+1 && isupper((unsigned char)readptr[-2]) &&
                 islower((unsigned char)readptr[-1])))
            {
                readptr++;
                continue;
            }
        }
        *writeptr = *readptr;
        readptr++;
        writeptr++;
    }
    *writeptr = 0;
    printf ("out: %s\n", output);
    return 0;
}
These solutions remove full stops from the source text. If you want to keep them, you can replace them with a placeholder (for example, a character that does not normally occur in the source text), or do the reverse: when splitting on sentences, test to see whether or not a full stop is a valid breaking point.
Afterthought: it does work on the original sample sentence... but it does not on the one in the comments:
I made a trip to the U.S.A. It was cool.I liked it very much.
where you get the output
I made a trip to the USA It was cool.I liked it very much.
This requires checking for more possible scenarios:
common abbreviations, such as Ms. and Mr.: \u\l\.
in-sentence acronyms; "U.S.A." followed by a lowercase: (\u\.)+ (?=\l), where the full stop needs removing;
end-of-sentence acronyms; "U.S.A." followed by an uppercase: (\u\.)+ (?=\u), where the last full stop should remain.
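The three scenarios above, combined with the placeholder idea, could be sketched roughly as follows in Ruby. This is only an illustration under stated assumptions: `MARK` and `split_sentences` are made-up names, the placeholder is assumed never to occur in the input, and titles are assumed to be exactly two letters (Ms., Mr.). Decimals, longer abbreviations and ellipses are not handled.

```ruby
# Hypothetical placeholder: a control character assumed never to occur in the text.
MARK = "\u0001"

# Sketch only: splits text into sentences while keeping the full stops of
# two-letter titles (Ms., Mr.) and dotted acronyms (U.S.A.).
def split_sentences(text)
  # Titles: uppercase + lowercase + dot before whitespace, e.g. "Ms. "
  guarded = text.gsub(/\b[A-Z][a-z]\.(?=\s)/) { |title| title.tr('.', MARK) }

  # Dotted acronyms: two or more "X." groups in a row.
  guarded = guarded.gsub(/\b(?:[A-Z]\.){2,}/) do |acronym|
    if $~.post_match.match?(/\A\s+[a-z]/)
      acronym.tr('.', MARK)              # mid-sentence: hide every dot
    else
      acronym.chop.tr('.', MARK) + '.'   # sentence end: keep the last dot
    end
  end

  # Cut after each run of . ! ? and put the hidden dots back.
  guarded.scan(/[^.!?]+[.!?]*/)
         .map { |sentence| sentence.strip.tr(MARK, '.') }
         .reject(&:empty?)
end

puts split_sentences('I made a trip to the U.S.A. It was cool.I liked it very much.')
```

Because the dots are hidden rather than deleted, the sentences come back with "Ms." and "U.S.A." intact.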

Related

Intelligent addition of words to make a question from the statement

I have 5000 videos and I want to add words in front of each title to turn it into a question.
For example, given these video titles:
1. 'Historian era' I want a question out of it - What is Historian era
2. 'Solve using Quadratic Equation' - 'How to solve using quadratic equation'
My bet:
I would analyze the first word of the title.
If it is a verb, then add 'How to ' in front of it, for example.
To find out what part of speech the first word is, I would check it against a dictionary API, such as: https://developer.oxforddictionaries.com/
Then you can act accordingly.
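As a rough Ruby sketch of that idea: the `KNOWN_VERBS` list below is a hypothetical stand-in for the dictionary lookup (no real API is called), and `title_to_question` is a made-up name. Only the two cases from the question are covered.

```ruby
# Hypothetical stand-in for a dictionary lookup: a tiny hard-coded verb list.
KNOWN_VERBS = %w[solve build create find install].freeze

# Prefix a title with "How to" when it starts with a known verb,
# otherwise fall back to "What is".
def title_to_question(title)
  first_word = title.split.first.to_s.downcase
  if KNOWN_VERBS.include?(first_word)
    "How to #{title.downcase}"
  else
    "What is #{title}"
  end
end

puts title_to_question('Historian era')
puts title_to_question('Solve using Quadratic Equation')
```

In a real run the verb check would be replaced by whatever the chosen dictionary service returns for the first word.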

ruby placeholders NOT populating in external .txt file

Here is what I currently have. The only problem is that the external file loads without the placeholder text being updated: the placeholder still shows up literally (for example '[NOUN]') instead of the noun the user entered at the earlier prompt.
Update: cleaned up following @tadman's suggestions. It is still not passing user input to the placeholder text in the external .txt file, however.
puts "\n\nOnce upon a time on the internet... \n\n"
puts "Name 1 website:"
print "1. "
loc=gets
puts "\n\Write 5 adjectives: "
print "1. "
adj1=gets
print "\n2. "
adj2=gets
print "\n3. "
adj3=gets
print "\n4. "
adj4=gets
print "\n5. "
adj5=gets
puts "\n\Write 2 nouns: "
print "1. "
noun1=gets
print "\n2. "
noun2=gets
puts "\n\nWrite 1 verb: "
print "1. "
verb=gets
puts "\n\nWrite 1 adverb: "
print "1. "
ptverb=gets
string_story = File.read("dynamicstory.txt")
puts string_story
Currently output is (i.e. placeholders not populated):
\n\nOnce upon a time on the internet...\n\n
One dreary evening while browsing the #{loc} website online, I stumbled accross a #{adj1} Frog creature named Earl. This frog would sit perturbed for hours at a time at the corner of my screen like Malware. One day, the frog appeared with a #{adj2} companion named Waldo that sat on the other corner of my screen. He had a #{adj3} set of ears with sharp #{noun1} inside. As the internet frogs began conversing and becoming close friends in hopes of #{noun2}, they eventually created a generic start-up together. They knew their start-up was #{adj4} but didn't seem to care and pushed through anyway. They would #{verb} on the beach with each other in the evenings after operating with shady ethics by day. They could only dream of a shiny and #{adj5} future full of gold. But then they eventually #{ptverb} and moved to Canada.\n\n
The End\n\n\n
It's important to note that the Ruby string interpolation syntax is only valid within actual Ruby code, and it does not apply in external files. Those are just plain strings.
If you want to do rough interpolation on those you'll need to restructure your program in order to make it easy to do. The last thing you want is to have to eval that string.
When writing code, always think about breaking up your program into methods or functions that each have a specific purpose and can be used in a variety of situations. Ruby generally encourages code reuse and the "DRY principle", or "Don't Repeat Yourself".
For example, your input method boils down to this generic method:
def input(thing, count = 1)
  puts "Name %d %s:" % [ count, thing ]
  count.times.map do |i|
    print '%d. ' % (i + 1)
    gets.chomp
  end
end
That gets input for any kind of thing with an arbitrary count. I'm using sprintf-style formatters here with % but you're free to use regular interpolation if that's how you like it. I just find it leads to a less cluttered string, especially when interpolating complicated chunks of code.
Next you need to organize that data into a proper container so you can access it programmatically. Using a bunch of unrelated variables is problematic. Using a Hash here makes it easy:
puts "\n\nOnce upon a time on the internet... \n\n"
words = { }
words[:website] = input('website')
words[:adjective] = input('adjectives', 5)
words[:noun] = input('nouns', 2)
words[:verb] = input('verb')
words[:adverb] = input('adverb')
Notice how you can now alter the order of these things by re-ordering the lines of code, and you can change how many of something you ask for by adjusting a single number.
The next thing to fix is your interpolation problem. Instead of using Ruby notation #{...}, which is hard to evaluate, go with something simple. In this case %verb1 and %noun2 are used:
def interpolate(string, values)
  string.gsub(/\%(website|adjective|noun|verb|adverb)(\d+)/) do
    values.dig($1.to_sym, $2.to_i - 1)
  end
end
That looks a bit ugly, but the regular expression is used to identify those tags and $1 and $2 pull out the two parts, word and number, separately, based on the capturing done in the regular expression. This might look a bit advanced, but if you take the time to understand this method you can very quickly solve fairly complicated problems with little fuss. It's something you'll use in a lot of situations when parsing or rewriting strings.
Here's a quick way to test it:
string_story = File.read("dynamicstory.txt")
puts interpolate(string_story, words)
Where the content of your file looks like:
One dreary evening while browsing the %website1 website online,
I stumbled accross a %adjective1 Frog creature named Earl.
You could also adjust your interpolate method to pick random words.
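One way the random variant might look, as a sketch only: `interpolate_random` and its `rng:` parameter are made-up names, and the convention that a tag without a number (such as %noun) means "pick any" is an assumption, not something from the original program.

```ruby
# Sketch: like interpolate, but a tag without a number picks a random word.
# `values` is assumed to be a Hash of Symbol => Array of Strings.
def interpolate_random(string, values, rng: Random.new)
  string.gsub(/%(website|adjective|noun|verb|adverb)(\d*)/) do
    kind, index = $1, $2
    list = Array(values[kind.to_sym])
    word = index.empty? ? list.sample(random: rng) : list[index.to_i - 1]
    word.to_s   # a missing word becomes an empty string
  end
end

words = { noun: %w[cat dog], verb: %w[run] }
puts interpolate_random('The %noun2 likes to %verb1 after the %noun.', words)
```

Passing a seeded `Random` as `rng:` makes the picks repeatable, which helps when testing.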

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text from a document to index it for search. The code below mostly works, except that various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.

How can I extract the meaning of a paragraph?

I need to develop a method that extracts the meaning from a string for a record in a database. Here is an example of such a string:
MyString = "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
Given the string, I need to process it in such a way that I can create a race_record:
race_record[:purse] = 75000
race_record[:race_type] = "Maidens"
race_record[:sex] = "Fillies And Mares"
race_record[:age] = "Three Year Old And Upward"
race_record[:distance] = "One And One Eighth Miles"
race_record[:surface] = "inner turf"
I was planning to use Ruby and a series of regular expressions to extract the data. For example:
race_record[:purse] = MyString.scan(/(?<=Purse\s[$])(.*?)(?=\.)/)
race_record[:race_type] = MyString.sub(....)
etc.
My question isn't so much what the correct regular expressions are. Given the objective, is the approach I proposed the right way to go, or is there a better approach or even a gem that can do the heavy lifting?
You could use one regex to extract all the relevant parts into capturing groups at once:
regexp =
  /Purse\s\$            # Leading text
   ([\d,]+)             # Group 1
   .*?For\s             # Intervening text
   (\w+)                # Group 2
   ,\s                  # Intervening text
   (\w+\sAnd\s\w+)      # Group 3, etc. etc.
   \s
   ([^.]*)
   \.[^;]*;[^.]*\.\s
   ([^.]*)
   \.\s\(
   ([^()]*)
   \)/x
Then you can do
irb(main):025:0> match = regexp.match(mystring)
=> #<MatchData "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
1:"75,000" 2:"Maidens" 3:"Fillies And Mares" 4:"Three Years Old And Upward"
5:"One And One Eighth Miles" 6:"Inner turf">
irb(main):026:0> match[1]
=> "75,000"
irb(main):027:0> match[2]
=> "Maidens"
...etc.
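To get from the match groups to the race_record from the question, something like the following sketch could work. `RACE_PATTERN` and `parse_race` are made-up names, the pattern is the same one regrouped, and it assumes every description follows exactly the layout of the sample string.

```ruby
# Same pattern as above, regrouped; whitespace and comments are ignored in x mode.
RACE_PATTERN = /
  Purse\s\$([\d,]+)     # purse amount
  .*?For\s(\w+),\s      # race type
  (\w+\sAnd\s\w+)\s     # sex
  ([^.]*)\.             # age clause up to the full stop
  [^;]*;[^.]*\.\s       # weight clauses, skipped
  ([^.]*)\.\s           # distance
  \(([^()]*)\)          # surface, inside parentheses
/x

# Returns a Hash for a matching description, or nil when the text does not match.
def parse_race(text)
  m = RACE_PATTERN.match(text)
  return nil unless m

  {
    purse: m[1].delete(',').to_i,   # "75,000" becomes 75000
    race_type: m[2],
    sex: m[3],
    age: m[4],
    distance: m[5],
    surface: m[6].downcase
  }
end

sample = 'Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares ' \
         'Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. ' \
         'One And One Eighth Miles. (Inner turf)'
p parse_race(sample)
```

Returning nil for non-matching text makes it easy to log the records that need a different pattern.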
If your input is fairly structured, i.e. it has a specific and known grammar, you could build a 'parser' to parse the grammar.
In the old days, we'd do this with yacc and lex, two old unix tools used to build compilers. Yacc and Lex have Ruby implementations. While the original intent was to output lower level code (such as machine assembly codes when building a real compiler), there is nothing that prevents you from calling any ruby code when a specific grammatical construct has been recognized by your parser.
NOTE: even though there is a Yacc/Lex Ruby gem out there, I wouldn't say it will 'DO THE HEAVY LIFTING': learning yacc and lex has a small learning curve. Still, using something like yacc/lex would make your life easier in the long run, especially if you have a large grammar and must constantly adjust it.

Ruby regular expression for asterisks/underscore to strong/em?

As part of a chat app I'm writing, I need to use regular expressions to match asterisks and underscores in chat messages and turn them into <strong> and <em> tags. Since I'm terrible with regex, I'm really stuck here. Ideally, we would have it set up such that:
One to three words, but not more, can be marked for strong/em.
Patterns such as "un*believ*able" would be matched.
Only one or the other (strong OR em) work within one line.
The above parameters are in order of importance, with only #1 being utterly necessary - the others are just prettiness. The closest I came to anything that worked was:
text = text.sub(/\*([(0-9a-zA-Z).*])\*/,'<b>\1<\/b>')
text = text.sub(/_([(0-9a-zA-Z).*])_/,'<i>\1<\/i>')
But it obviously doesn't work with any of our params.
It's odd that there's not an example of something similar already out there, given the popularity of using asterisks for bold and whatnot. If there is, I couldn't find it outside of plugins/gems (which won't work for this instance, as I really only need it in one place in my model). Any help would be appreciated.
This should help you finish what you are doing:
sub(/\*(.*)\*/,'<b>\1</b>')
sub(/_(.*)_/,'<i>\1</i>')
Firstly, your criteria are a little strange, but, okay...
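It seems that a possible algorithm for this would be to find the marked text in a message, check that it contains fewer than four words, and then try to perform one set of substitutions. (The patterns are constants so the methods can see them, and the replacement strings are single-quoted so that \1 stays a backreference.)
STRONG_REGEXP = /\*([^\*]*)\*/
EM_REGEXP = /_([^_]*)_/

# True when the text captured by regexp is one to three words long.
def short_match?(input, regexp)
  match = input.match(regexp)
  !match.nil? && match[1].split.size.between?(1, 3)
end

def process(input)
  if short_match?(input, STRONG_REGEXP)
    input.sub(STRONG_REGEXP, '<b>\1</b>')
  elsif short_match?(input, EM_REGEXP)
    input.sub(EM_REGEXP, '<i>\1</i>')
  else
    input
  end
end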
Your specifications aren't entirely clear, but if you understand this, you can tweak it yourself.
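The one-to-three-word limit from the question can also be pushed into the pattern itself. The sketch below is just one possible reading of the requirements: `mark_up` and both constants are made-up names, it does not enforce the "strong OR em per line" rule, and it will also convert underscores inside identifiers such as snake_case names.

```ruby
# One word, optionally followed by up to two more words, between the markers.
STRONG_WORDS = /\*([^*\s]+(?:\s+[^*\s]+){0,2})\*/
EM_WORDS     = /_([^_\s]+(?:\s+[^_\s]+){0,2})_/

# Sketch: wrap short *starred* runs in <strong> and short _underscored_ runs in <em>.
def mark_up(text)
  text.gsub(STRONG_WORDS, '<strong>\1</strong>')
      .gsub(EM_WORDS, '<em>\1</em>')
end

puts mark_up('this is *very important* and un*believ*able')
puts mark_up('*one two three four* stays as typed')
```

Because the pattern does not require spaces around the markers, mid-word cases like "un*believ*able" are matched too.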
