Change every word in a paragraph with Ruby - ruby

So I'm coding in Ruby and I've got a few sentences:
The sky above the port was the color of television, tuned to a dead channel. "It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency." It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
And I need to modify every word in the paragraph without changing the structure. My original idea was to just split on whitespace and then rejoin it, but the issue with that is you get the punctuation as well. If you split so that you just get the word, it's hard to rejoin because you don't know the proper punctuation.
Are there better ways to do this than the traditional split, map, join combo? Or maybe just a good split regex so it's easy to rejoin?

Use gsub with a block:
str = %q(The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd
around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates;
you could drink there for a week and never hear two words in Japanese.)
puts str.gsub(/\w+/){|word| word.tr('aeiou','uoaei') }
result:
Tho sky ubevo tho pert wus tho celer ef tolovasaen, tinod te u doud chunnol.
"It's net lako I'm isang," Cuso hourd semoeno suy, us ho sheildorod has wuy threigh tho crewd
ureind tho deer ef tho Chut. "It's lako my bedy's dovolepod thas mussavo drig dofacaoncy."
It wus u Spruwl veaco und u Spruwl jeko. Tho Chutsibe wus u bur fer prefossaenul oxputrautos;
yei ceild drank thoro fer u wook und novor hour twe werds an Jupunoso.
Well, this #tr method would work without the regex, but you get the idea.

I would match words between word boundaries with a regex to avoid affecting punctuation or whitespace, e.g.:
s = "This is a test, ok? Yes, fine!"
s.gsub!(/\b(\w+)\b/) {|x| "_#{x}_"}
# s is now "_This_ _is_ _a_ _test_, _ok_? _Yes_, _fine_!"
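One caveat with \w+ is that it treats "It's" as two words ("It" and "s"), because the apostrophe is not a word character. If contractions should stay whole, a pattern along these lines may help. This is only a sketch: the WORD constant and the map_words helper are names made up for illustration, and the pattern deliberately ignores digits and hyphenated words.

```ruby
# Hypothetical pattern: a run of letters, optionally followed by
# apostrophe-plus-letters groups, so "It's" and "body's" stay whole.
WORD = /[[:alpha:]]+(?:'[[:alpha:]]+)*/

# Hypothetical helper: passes each word to the block and keeps all
# punctuation and whitespace exactly where it was.
def map_words(text)
  text.gsub(WORD) { |word| yield word }
end

puts map_words(%q("It's not like I'm using," Case heard.)) { |w| w.reverse }
# prints: "s'tI ton ekil m'I gnisu," esaC draeh.
```

Everything the pattern does not match (quotes, commas, spaces) is left untouched, which is why no rejoin step is needed.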

Related

ruby placeholders NOT populating in external .txt file

Here is what I currently have. The only problem is that the external file loads without the placeholder text being updated: the placeholder just says '[NOUN]' instead of the noun the user entered at an earlier prompt.
Update: cleaned up with @tadman's suggestions; however, it is still not passing user input to the placeholder text in the external .txt file.
puts "\n\nOnce upon a time on the internet... \n\n"
puts "Name 1 website:"
print "1. "
loc=gets
puts "\n\nWrite 5 adjectives: "
print "1. "
adj1=gets
print "\n2. "
adj2=gets
print "\n3. "
adj3=gets
print "\n4. "
adj4=gets
print "\n5. "
adj5=gets
puts "\n\nWrite 2 nouns: "
print "1. "
noun1=gets
print "\n2. "
noun2=gets
puts "\n\nWrite 1 verb: "
print "1. "
verb=gets
puts "\n\nWrite 1 adverb: "
print "1. "
ptverb=gets
string_story = File.read("dynamicstory.txt")
puts string_story
Currently output is (i.e. placeholders not populated):
\n\nOnce upon a time on the internet...\n\n
One dreary evening while browsing the #{loc} website online, I stumbled across a #{adj1} Frog creature named Earl. This frog would sit perturbed for hours at a time at the corner of my screen like Malware. One day, the frog appeared with a #{adj2} companion named Waldo that sat on the other corner of my screen. He had a #{adj3} set of ears with sharp #{noun1} inside. As the internet frogs began conversing and becoming close friends in hopes of #{noun2}, they eventually created a generic start-up together. They knew their start-up was #{adj4} but didn't seem to care and pushed through anyway. They would #{verb} on the beach with each other in the evenings after operating with shady ethics by day. They could only dream of a shiny and #{adj5} future full of gold. But then they eventually #{ptverb} and moved to Canada.\n\n
The End\n\n\n
It's important to note that the Ruby string interpolation syntax is only valid within actual Ruby code, and it does not apply in external files. Those are just plain strings.
If you want to do rough interpolation on those you'll need to restructure your program in order to make it easy to do. The last thing you want is to have to eval that string.
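To make the difference concrete, here is a small sketch. The single-quoted string stands in for text read from a file (the sample sentence and the value of loc are made up). It also shows one built-in alternative, Kernel#format with %{name} references, which works on plain strings but will raise if the text contains stray % characters.

```ruby
loc = 'example.com'

# Single quotes behave like text read from a file: no interpolation happens.
from_file = 'browsing the #{loc} website'
puts from_file                      # the placeholder is printed literally

# Kernel#format fills %{name} references from a hash at run time.
template = 'browsing the %{loc} website'
filled = format(template, loc: loc)
puts filled                         # prints: browsing the example.com website
```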
When writing code, always think about breaking up your program into methods or functions that each have a specific purpose and can be used in a variety of situations. Ruby generally encourages code reuse and promotes the "DRY principle", or "Don't Repeat Yourself".
For example, your input method boils down to this generic method:
def input(thing, count = 1)
  puts "Name %d %s:" % [ count, thing ]
  count.times.map do |i|
    print '%d. ' % (i + 1)
    gets.chomp
  end
end
That gets input for any kind of thing, with an arbitrary count, and returns the answers as an array. I'm using sprintf-style formatters here with % but you're free to use regular interpolation if that's how you like it. I just find it leads to a less cluttered string, especially when interpolating complicated chunks of code.
Next you need to organize that data into a proper container so you can access it programmatically. Using a bunch of unrelated variables is problematic. Using a Hash here makes it easy:
puts "\n\nOnce upon a time on the internet... \n\n"
words = { }
words[:website] = input('website')
words[:adjective] = input('adjectives', 5)
words[:noun] = input('nouns', 2)
words[:verb] = input('verb')
words[:adverb] = input('adverb')
Notice how you can now alter the order of these prompts by re-ordering the lines of code, and change how many of something you ask for by adjusting a single number.
The next thing to fix is your interpolation problem. Instead of using Ruby notation #{...}, which is hard to evaluate, go with something simple. In this case %verb1 and %noun2 are used:
def interpolate(string, values)
  string.gsub(/%(website|adjective|noun|verb|adverb)(\d+)/) do
    values.dig($1.to_sym, $2.to_i - 1)
  end
end
That looks a bit ugly, but the regular expression is used to identify those tags and $1 and $2 pull out the two parts, word and number, separately, based on the capturing done in the regular expression. This might look a bit advanced, but if you take the time to understand this method you can very quickly solve fairly complicated problems with little fuss. It's something you'll use in a lot of situations when parsing or rewriting strings.
Here's a quick way to test it:
string_story = File.read("dynamicstory.txt")
puts interpolate(string_story, words)
Where the content of your file looks like:
One dreary evening while browsing the %website1 website online,
I stumbled across a %adjective1 Frog creature named Earl.
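To try the pieces together without a file or keyboard input, here is a minimal sketch. The words hash and the story text are made-up sample data, and interpolate is repeated so the snippet runs on its own. One behavior worth knowing: a tag whose number has no matching entry (%noun3 here) comes back from dig as nil, which gsub turns into an empty string rather than an error.

```ruby
def interpolate(string, values)
  string.gsub(/%(website|adjective|noun|verb|adverb)(\d+)/) do
    # $1 is the word type, $2 the 1-based position in that list.
    values.dig($1.to_sym, $2.to_i - 1)
  end
end

# Sample data standing in for what the prompts would collect.
words = {
  website: ['example.com'],
  adjective: ['green', 'sleepy'],
  noun: ['teeth', 'fame']
}

story = 'Browsing %website1, I met a %adjective2 frog with %noun1 and [%noun3].'
result = interpolate(story, words)
puts result
# prints: Browsing example.com, I met a sleepy frog with teeth and [].
```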
You could also adjust your interpolate method to pick random words.
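As a rough sketch of that idea, assume the tags in the file carry no number (for example %noun) and each one is filled with a random entry from the matching list. The interpolate_random name and the rng parameter are hypothetical, added only for illustration; passing a seeded Random makes the output repeatable.

```ruby
# Hypothetical variant: un-numbered tags such as %noun pick a random word.
def interpolate_random(string, values, rng: Random.new)
  # \b keeps numbered tags like %noun1 from matching here.
  string.gsub(/%(website|adjective|noun|verb|adverb)\b/) do
    values.fetch($1.to_sym).sample(random: rng)
  end
end

words = { adjective: ['green', 'sleepy', 'loud'], noun: ['teeth'] }
puts interpolate_random('A %adjective frog with %noun.', words)
```

Using fetch instead of dig means a tag with no collected words raises a KeyError, which may be easier to debug than a silently empty spot in the story.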

Perform sentence segmentation on paragraphs without punctuation?

I have a bunch of badly formatted text with lots of missing punctuation. I want to know if there is any method to segment text into sentences when periods, semi-colons, capitalization, etc. are missing.
For example, consider the paragraph: "the lion is called the king of the forest it has a majestic appearance it eats flesh it can run very fast the roar of the lion is very famous".
This text should be segmented as separate sentences:
the lion is called the king of the forest
it has a majestic appearance
it eats flesh
it can run very fast
the roar of the lion is very famous
Can this be done or is it impossible? Any suggestion is much appreciated!
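You can try the following Python implementation, which loads the silero_te text-enhancement model from the snakers4/silero-models repository via torch.hub. It restores punctuation and capitalization, after which the text can be split into sentences as usual.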
import torch
model, example_texts, languages, punct, apply_te = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_te')
# Your text goes here. I imagine it is contained in some list.
input_text = input('Enter input text\n')
print(apply_te(input_text, lan='en'))

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text for a document to index it for search. The below mostly works except various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.

How to limit the number of retrieved characters from a database field in rails?

Consider a passage (~400 characters) stored in a text column of a database table, like:
There is only one more week to Easter. I have already started my
holiday. The idea of visiting my uncle during this Easter is
wonderful. His farm is in this village down in Cornwall. This village
is very peaceful and beautiful. I have asked my aunt if I can bring
Sam, my dog, with me. I promise her I will keep him under control. He
attacked and he ate some animals from her farm in October. But he is
part of the family and I cannot leave him behind.
But I need to retrieve only a limited number of characters from it, about 150:
There is only one more week to Easter. I have already started my
holiday. The idea of visiting my uncle during this Easter is
wonderful. His farm is in this village down in Cornwall. This village
is very peaceful...
Is there a function in Rails that produces this output, or is the truncate(:limit,:option{}) function the only way?
Assuming you have a model Passage with a field text, you can select specific fields (and use SQL functions within the select) like this:
passages = Passage.select("id, LEFT(text,10) as text_short, CHAR_LENGTH(text) as text_length")
# => [#<Passage id: 1>, #<Passage id: 2>, #<Passage id: 3>]
passages.first.id
# => 1
passages.first.text_short
# => "There is o"
passages.first.text_length
# => 453
Why not get the whole string and only use the first 150 characters? I doubt it will slow things down much at all.
somehow_access_string[0...150] + '...'
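Slicing at a fixed index can cut a word in half. Below is a plain-Ruby sketch that backs up to the last whitespace before the limit; the shorten name and its parameters are made up for illustration. In a Rails app, ActiveSupport's String#truncate with the separator: option does a similar job, e.g. text.truncate(150, separator: ' ').

```ruby
# Hypothetical helper: returns at most `limit` characters, including the
# omission marker, and avoids ending in the middle of a word.
def shorten(text, limit = 150, omission = '...')
  return text if text.length <= limit

  room = limit - omission.length
  # Take one extra character so a word ending exactly at the cut is kept.
  head = text[0, room + 1]
  last_space = head.rindex(/\s/)
  # No whitespace at all: fall back to a hard cut.
  head = last_space ? head[0, last_space] : head[0, room]
  head.rstrip + omission
end

puts shorten('aaa bbb ccc ddd', 10)   # prints: aaa bbb...
puts shorten('abcdefghijklmnop', 10)  # prints: abcdefg...
```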

Ruby regular expression for asterisks/underscore to strong/em?

As part of a chat app I'm writing, I need to use regular expressions to match asterisks and underscores in chat messages and turn them into <strong> and <em> tags. Since I'm terrible with regex, I'm really stuck here. Ideally, we would have it set up such that:
One to three words, but not more, can be marked for strong/em.
Patterns such as "un*believ*able" would be matched.
Only one or the other (strong OR em) work within one line.
The above parameters are in order of importance, with only #1 being utterly necessary - the others are just prettiness. The closest I came to anything that worked was:
text = text.sub(/\*([(0-9a-zA-Z).*])\*/,'<b>\1<\/b>')
text = text.sub(/_([(0-9a-zA-Z).*])_/,'<i>\1<\/i>')
But it obviously doesn't work with any of our params.
It's odd that there's not an example of something similar already out there, given the popularity of using asterisks for bold and whatnot. If there is, I couldn't find it outside of plugins/gems (which won't work for this instance, as I really only need it in one place in my model). Any help would be appreciated.
This should help you finish what you are doing:
sub(/\*(.*)\*/,'<b>\1</b>')
sub(/_(.*)_/,'<i>\1</i>')
Firstly, your criteria are a little strange, but, okay...
It seems that a possible algorithm for this would be to find the marked text in a message, count its words to see if there are fewer than 4, and then try to perform one set of substitutions.
STRONG_REGEXP = /\*([^*]*)\*/
EM_REGEXP = /_([^_]*)_/
def process(input)
  # Single-quoted replacements keep \1 as a backreference, not an escape.
  if (m = input.match(STRONG_REGEXP)) && m[1].split.size < 4
    input.sub(STRONG_REGEXP, '<b>\1</b>')
  elsif (m = input.match(EM_REGEXP)) && m[1].split.size < 4
    input.sub(EM_REGEXP, '<i>\1</i>')
  else
    input # nothing to convert
  end
end
Your specifications aren't entirely clear, but if you understand this, you can tweak it yourself.
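Another way to approach the three requirements is to put the one-to-three-word limit into the regular expression itself. This is only a sketch under my reading of the question: STRONG, EM, and markup are illustrative names, "words" are taken to be runs of non-space characters, and strong wins when a line contains both kinds of marker.

```ruby
# One to three whitespace-separated words between the markers.
STRONG = /\*([^\s*]+(?:\s[^\s*]+){0,2})\*/
EM     = /_([^\s_]+(?:\s[^\s_]+){0,2})_/

# Hypothetical helper: applies strong OR em to a line, never both.
def markup(line)
  if line.match?(STRONG)
    line.gsub(STRONG, '<strong>\1</strong>')
  elsif line.match?(EM)
    line.gsub(EM, '<em>\1</em>')
  else
    line
  end
end

puts markup('un*believ*able')          # prints: un<strong>believ</strong>able
puts markup('*one two three four*')    # prints: *one two three four*
```

Be aware that the underscore pattern will also fire inside identifiers such as snake_case_words, so a chat app that sees code snippets may need an extra guard.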

Resources