Perform sentence segmentation on paragraphs without punctuation?

I have a bunch of badly formatted text with lots of missing punctuation. I want to know if there is any method to segment text into sentences when periods, semicolons, capitalization, etc. are missing.
For example, consider the paragraph: "the lion is called the king of the forest it has a majestic appearance it eats flesh it can run very fast the roar of the lion is very famous".
This text should be segmented as separate sentences:
the lion is called the king of the forest
it has a majestic appearance
it eats flesh
it can run very fast
the roar of the lion is very famous
Can this be done or is it impossible? Any suggestion is much appreciated!

You can try the following Python implementation, which loads the silero_te model through torch.hub to restore punctuation and capitalization:
import torch
model, example_texts, languages, punct, apply_te = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_te')
# Your text goes here; I imagine it is contained in some list
input_text = input('Enter input text\n')
# apply_te returns the text with punctuation and capitalization added
print(apply_te(input_text, lan='en'))

Related

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors, and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?

Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interested in using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

ruby placeholders NOT populating in external .txt file

Here is what I currently have. The only problem is that the external file loads without the placeholder text updated: the placeholder just says '[NOUN]' instead of the actual noun the user entered at an earlier prompt.
Update: cleaned up with @tadman's suggestions; however, it is still not passing user input to the placeholder text in the external .txt file.
puts "\n\nOnce upon a time on the internet... \n\n"
puts "Name 1 website:"
print "1. "
loc=gets
puts "\n\nWrite 5 adjectives: "
print "1. "
adj1=gets
print "\n2. "
adj2=gets
print "\n3. "
adj3=gets
print "\n4. "
adj4=gets
print "\n5. "
adj5=gets
puts "\n\nWrite 2 nouns: "
print "1. "
noun1=gets
print "\n2. "
noun2=gets
puts "\n\nWrite 1 verb: "
print "1. "
verb=gets
puts "\n\nWrite 1 adverb: "
print "1. "
ptverb=gets
string_story = File.read("dynamicstory.txt")
puts string_story
Currently output is (i.e. placeholders not populated):
\n\nOnce upon a time on the internet...\n\n
One dreary evening while browsing the #{loc} website online, I stumbled accross a #{adj1} Frog creature named Earl. This frog would sit perturbed for hours at a time at the corner of my screen like Malware. One day, the frog appeared with a #{adj2} companion named Waldo that sat on the other corner of my screen. He had a #{adj3} set of ears with sharp #{noun1} inside. As the internet frogs began conversing and becoming close friends in hopes of #{noun2}, they eventually created a generic start-up together. They knew their start-up was #{adj4} but didn't seem to care and pushed through anyway. They would #{verb} on the beach with each other in the evenings after operating with shady ethics by day. They could only dream of a shiny and #{adj5} future full of gold. But then they eventually #{ptverb} and moved to Canada.\n\n
The End\n\n\n

It's important to note that the Ruby string interpolation syntax is only valid within actual Ruby code, and it does not apply in external files. Those are just plain strings.
If you want to do rough interpolation on those you'll need to restructure your program in order to make it easy to do. The last thing you want is to have to eval that string.
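As an aside, for simple cases Ruby's built-in format (Kernel#format) can fill named %{name} references from a hash without any eval. A minimal sketch, assuming the placeholders in the file are rewritten from #{loc} to %{loc}; the template string and the answers hash below are stand-ins for illustration, not part of the original program:

```ruby
# Stand-in for File.read("dynamicstory.txt"). Assumes the placeholders in the
# file were rewritten to the %{name} form (a hypothetical change to the file).
template = "While browsing the %{loc} website, I met a %{adj1} frog.\n"

# Values that would normally come from gets.chomp.
answers = { loc: "example.com", adj1: "grumpy" }

# format looks up each %{name} in the hash; an unknown name raises KeyError.
story = format(template, answers)
puts story
```

With this approach a literal percent sign in the story has to be written as %%, and a placeholder with no matching key raises KeyError rather than being left blank.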
When writing code, always think about breaking up your program into methods or functions that have a specific purpose and can be used in a variety of situations. Ruby generally encourages code reuse and promotes the "DRY principle", or "Don't Repeat Yourself".
For example, your input method boils down to this generic method:
def input(thing, count = 1)
  puts "Name %d %s:" % [ count, thing ]
  count.times.map do |i|
    print '%d. ' % (i + 1)
    gets.chomp
  end
end
That method gets input for an arbitrary thing with an arbitrary count and returns the answers as an array. I'm using sprintf-style formatters here with % but you're free to use regular interpolation if that's how you like it. I just find it leads to a less cluttered string, especially when interpolating complicated chunks of code.
Next you need to organize that data into a proper container so you can access it programmatically. Using a bunch of unrelated variables is problematic. Using a Hash here makes it easy:
puts "\n\nOnce upon a time on the internet... \n\n"
words = { }
words[:website] = input('website')
words[:adjective] = input('adjectives', 5)
words[:noun] = input('nouns', 2)
words[:verb] = input('verb')
words[:adverb] = input('adverb')
Notice how you can now alter the order of these things by re-ordering the lines of code, and you can change how many of something you ask for by adjusting a single number, very easy.
The next thing to fix is your interpolation problem. Instead of using Ruby notation #{...}, which is hard to evaluate, go with something simple. In this case %verb1 and %noun2 are used:
def interpolate(string, values)
  string.gsub(/\%(website|adjective|noun|verb|adverb)(\d+)/) do
    values.dig($1.to_sym, $2.to_i - 1)
  end
end
That looks a bit ugly, but the regular expression is used to identify those tags and $1 and $2 pull out the two parts, word and number, separately, based on the capturing done in the regular expression. This might look a bit advanced, but if you take the time to understand this method you can very quickly solve fairly complicated problems with little fuss. It's something you'll use in a lot of situations when parsing or rewriting strings.
Here's a quick way to test it:
string_story = File.read("dynamicstory.txt")
puts interpolate(string_story, words)
Where the content of your file looks like:
One dreary evening while browsing the %website1 website online,
I stumbled accross a %adjective1 Frog creature named Earl.
You could also adjust your interpolate method to pick random words.
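A minimal sketch of that random-word variant, assuming the tags in the file drop their number (so %noun rather than %noun2). The name interpolate_random and the random: parameter are hypothetical, introduced here only for illustration; the words hash has the same shape as the one built above, filled with made-up sample values:

```ruby
# Hypothetical variant of interpolate: tags carry no number (e.g. %noun),
# and each tag is filled with a randomly chosen word of that kind.
TAG_PATTERN = /%(website|adjective|noun|verb|adverb)/

def interpolate_random(string, values, random: Random.new)
  string.gsub(TAG_PATTERN) do
    # $1 is the tag name captured by the pattern, e.g. "noun".
    candidates = values.fetch($1.to_sym, [])
    # sample returns nil for an empty array; to_s turns that into "".
    candidates.sample(random: random).to_s
  end
end

# Made-up sample values, in the same shape as the hash built by input().
words = {
  website:   ['example.com'],
  adjective: ['grumpy', 'shiny'],
  noun:      ['teeth', 'profit'],
  verb:      ['dance'],
  adverb:    ['slowly']
}

story = "While browsing the %website website, I met a %adjective frog with sharp %noun."
puts interpolate_random(story, words)
```

Passing a seeded generator, for example random: Random.new(42), makes the choice repeatable, which can help when testing.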

Rewriting sentences while retaining semantic meaning

Is it possible to use WordNet to rewrite a sentence so that the semantic meaning of the sentence stays the same (or mostly the same)?
Let's say I have this sentence:
Obama met with Putin last week.
Is it possible to use WordNet to rephrase the sentence into alternatives like:
Obama and Putin met the previous week.
Obama and Putin met each other a week ago.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms?
For example:
Obama met Putin the previous week.

If the question is whether WordNet can be used to paraphrase sentences: it is possible, but only together with substantial grammatical/syntactic components. You would need a system that:
First gets the individual semantics of the tokens and parses the sentence for its syntax.
Then understands the overall semantics of the composite sentence (especially if it's metaphorical).
Then regenerates the sentence with some grammatical generator.
So far, the only tool I know of that can do something like that is the ACE parser/generator, but it takes a LOT of hacking to make it work as a paraphrase generator. http://sweaglesw.org/linguistics/ace/
So to answer your questions,
Is it possible to use WordNet to rephrase the sentence into alternatives? Sadly, WordNet isn't a silver bullet. You will need more than semantics for a paraphrase task.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms? Yes, this is possible. BUT figuring out which synonym is replaceable is hard... And you would also need some morphology/syntax component.
First you will run into a problem of multiple senses per word:
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    # All WordNet senses (synsets) recorded for this token
    possible_senses = wn.synsets(i)
    print(i, len(possible_senses), possible_senses)
[out]:
Obama 0 []
met 13 [Synset('meet.v.01'), Synset('meet.v.02'), Synset('converge.v.01'), Synset('meet.v.04'), Synset('meet.v.05'), Synset('meet.v.06'), Synset('meet.v.07'), Synset('meet.v.08'), Synset('meet.v.09'), Synset('meet.v.10'), Synset('meet.v.11'), Synset('suffer.v.10'), Synset('touch.v.05')]
Putin 1 [Synset('putin.n.01')]
the 0 []
previous 3 [Synset('previous.s.01'), Synset('former.s.03'), Synset('previous.s.03')]
week 3 [Synset('week.n.01'), Synset('workweek.n.01'), Synset('week.n.03')]
Then even if you know the sense (let's say the first sense), you get multiple words per sense and not every word can be replaced in the sentence. Moreover, they are in the lemma form not a surface form (e.g. verbs are in their base form (simple present tense) and nouns are in singular):
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    if possible_senses:
        # Lemma names of the first sense only
        print(i, possible_senses[0].lemma_names())
    else:
        print(i)
[out]:
Obama
met ['meet', 'run_into', 'encounter', 'run_across', 'come_across', 'see']
Putin ['Putin', 'Vladimir_Putin', 'Vladimir_Vladimirovich_Putin']
the
previous ['previous', 'old']
week ['week', 'hebdomad']

One approach is grammatical analysis with NLTK: after the analysis, convert your sentence into active voice or passive voice.

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text for a document to index it for search. The below mostly works except various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”

You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")

Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.

Change every word in a paragraph with Ruby

So I'm coding in Ruby and I've got a few sentences:
The sky above the port was the color of television, tuned to a dead channel. "It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency." It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
And I need to modify every word in the paragraph without changing the structure. My original idea was to just split on whitespace and then rejoin it, but the issue with that is you get the punctuation as well. If you split so that you just get the word, it's hard to rejoin because you don't know the proper punctuation.
Are there better ways to do this than the traditional split, map, join combo? Or maybe just a good split regex so it's easy to rejoin?

Use gsub with a block:
str = %q(The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd
around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates;
you could drink there for a week and never hear two words in Japanese.)
puts str.gsub(/\w+/){|word| word.tr('aeiou','uoaei') }
result:
Tho sky ubevo tho pert wus tho celer ef tolovasaen, tinod te u doud chunnol.
"It's net lako I'm isang," Cuso hourd semoeno suy, us ho sheildorod has wuy threigh tho crewd
ureind tho deer ef tho Chut. "It's lako my bedy's dovolepod thas mussavo drig dofacaoncy."
It wus u Spruwl veaco und u Spruwl jeko. Tho Chutsibe wus u bur fer prefossaenul oxputrautos;
yei ceild drank thoro fer u wook und novor hour twe werds an Jupunoso.
Well, this #tr method would work without the regex, but you get the idea.

I would match words between word boundaries with a regex to avoid affecting punctuation or whitespace, e.g.:
s = "This is a test, ok? Yes, fine!"
s.gsub!(/\b(\w+)\b/) {|x| "_#{x}_"}
# s is now "_This_ _is_ _a_ _test_, _ok_? _Yes_, _fine_!"
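One caveat with \w+ is that \w does not match an apostrophe, so contractions such as "It's" are handled as two separate words ("It" and "s"). If that matters, a slightly wider pattern can keep them together. A minimal sketch, where the WORD constant and the map_words helper are hypothetical names introduced for illustration:

```ruby
# Matches a run of word characters, optionally followed by an apostrophe and
# more word characters, so "It's" and "body's" count as single words.
WORD = /\w+(?:'\w+)*/

# Yields every word to the caller's block and substitutes the block's result;
# punctuation and whitespace are never matched, so they stay in place.
def map_words(text)
  text.gsub(WORD) { |word| yield word }
end

s = %q("It's not like I'm using," Case heard someone say.)
puts map_words(s) { |w| w.upcase }
# prints: "IT'S NOT LIKE I'M USING," CASE HEARD SOMEONE SAY.
```

The pattern only covers straight apostrophes inside a word; typographic apostrophes or hyphenated words would need further alternatives in the regex.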
