Remove SAPI4 voices? - windows

The Verbose text reader keeps defaulting to the terrible-sounding SAPI4 "Peter Adult male #1" voice. This doesn't happen on my other computer. How do I remove the SAPI4 voices? I don't use them anyway.

Related

How to use Kettle to handle the linefeed in a field

I want to use Kettle to handle a data set.
The format of the data set is as follows:
product/productId: B00006HAXW
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the
1st ones. Remember once these performers are gone, we'll never get to see them again.
Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE
this DVD !!
I read every line of the data first and then transform every eight rows into a record.
However, some of the data looks like this:
review/profileName: nancy "crzyfnyfrog
I love my purple pigtails."
This field contains a linefeed and I don't know how to handle it.
For now I use script code to get the result I want, but I would still like to know how to solve it with the basic components.
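The underlying logic can be sketched outside Kettle. The Python below is only an illustration, not a Kettle transformation: it assumes that every genuine field starts with a "product/<name>: " or "review/<name>: " prefix, and it treats any line without such a prefix as the continuation of the previous field. The names FIELD_START and merge_continuations are made up for this example.

```python
import re

# Assumption: a real field always begins with "product/<name>: " or "review/<name>: ".
FIELD_START = re.compile(r"^(product|review)/\w+: ")

def merge_continuations(lines):
    """Join lines that do not start a new field onto the previous field."""
    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if FIELD_START.match(line) or not merged:
            merged.append(line)
        else:
            # A stray linefeed split the field; glue the pieces back together.
            merged[-1] += " " + line
    return merged

sample = [
    "review/userId: A1RSDE90N6RSZF",
    'review/profileName: nancy "crzyfnyfrog',
    'I love my purple pigtails."',
    "review/helpfulness: 9/9",
]
fields = merge_continuations(sample)
for field in fields:
    print(field)
```

Once the continuation lines are merged, every record has exactly eight rows again, so the existing "eight rows into one record" step can stay as it is.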

Perform sentence segmentation on paragraphs without punctuation?

I have a bunch of badly formatted text with lots of missing punctuation. I want to know if there is any method to segment text into sentences when periods, semicolons, capitalization, etc. are missing.
For example, consider the paragraph: "the lion is called the king of the forest it has a majestic appearance it eats flesh it can run very fast the roar of the lion is very famous".
This text should be segmented as separate sentences:
the lion is called the king of the forest
it has a majestic appearance
it eats flesh
it can run very fast
the roar of the lion is very famous
Can this be done or is it impossible? Any suggestion is much appreciated!
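You can try the following Python implementation, which loads the silero_te model (a model for restoring punctuation and capitalization) from the snakers4/silero-models repository through torch.hub: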
import torch
model, example_texts, languages, punct, apply_te = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_te')
# Your text goes here; this example reads a single string from standard input.
input_text = input('Enter input text\n')
# Print the text returned by the model.
print(apply_te(input_text, lan='en'))

How to select data in scrapy from html file having both class and id?

<div class="section-body" id="section-2"><p>Most people with aortic stenosis do not develop symptoms until the disease is advanced. The diagnosis may have been made when the health care provider heard a heart murmur and performed tests.</p><p>Symptoms of aortic stenosis include:</p><ul><li>Chest discomfort: The chest pain may get worse with activity and reach into the arm, neck, or jaw. The chest may also feel tight or squeezed.</li><li>Cough, possibly bloody.</li><li>Breathing problems when exercising.</li><li>Becoming easily tired.</li><li>Feeling the heartbeat (palpitations).</li><li>Fainting, weakness, or dizziness with activity.</li></ul><p>In infants and children, symptoms include:</p><ul><li>Becoming easily tired with exertion (in mild cases)</li><li>Failure to gain weight</li><li>Poor feeding</li><li>Serious breathing problems that develop within days or weeks of birth (in severe cases)</li></ul><p>Children with mild or moderate aortic stenosis may get worse as they get older. They are also at risk for a heart infection called bacterial endocarditis.</p></div></div></section>
I have the HTML above and I want to scrape the data in the list, i.e. in
I have tried the following commands in Scrapy, but they do not work; I get '[]' as output.
response.css("article div.section-body p").extract() <-- this is giving all info under section body but I want only under section-2
response.css("article div.section-body.section-2 p::text").extract()
response.xpath("//article/*[contains(#id, 'setion-2')]").extract()
please help me to extract. Thanks
Try
response.css("article div.section-body#section-2 p::text").extract()
div.section-body#section-2 selects the div that has both the class section-body and the ID section-2.
Note that an ID is selected with # and a class with a dot (.), so the selector in your question, div.section-body.section-2, looked for a second class named section-2 instead of the ID.
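To see the "class and ID together" idea run without a Scrapy project, here is a rough sketch that uses only Python's standard-library ElementTree in place of Scrapy's selectors. The markup is a trimmed-down, hypothetical stand-in for the page in the question, and the XPath predicate on @class is an exact match, so it only works when the class attribute is exactly "section-body". In Scrapy itself, the CSS counterpart for the list items would be along the lines of div.section-body#section-2 li::text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the page in the question.
html = """
<article>
  <div class="section-body" id="section-1"><ul><li>Other item</li></ul></div>
  <div class="section-body" id="section-2">
    <p>Symptoms of aortic stenosis include:</p>
    <ul><li>Cough, possibly bloody.</li><li>Becoming easily tired.</li></ul>
  </div>
</article>
""".strip()

root = ET.fromstring(html)

# Chained predicates: the div must carry both the class and the id.
items = [
    li.text
    for li in root.findall(".//div[@class='section-body'][@id='section-2']//li")
]
print(items)
```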

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text for a document to index it for search. The below mostly works except various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.
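The same "put whitespace where the tags were" idea can be shown outside Nokogiri. The sketch below uses Python's standard-library html.parser rather than Nokogiri, so treat it as an illustration of the approach only. TextExtractor and extract_text are names invented for this example. Because it adds a space at every tag boundary, it would also split a word that contains inline markup (for example a word partly wrapped in a b tag), which may or may not be acceptable for search indexing.

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collect visible text, putting a space wherever a tag opens or closes."""

    SKIP = {"script", "style"}  # contents of these tags are dropped entirely

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace created by the inserted spaces.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

sample = (
    "<div><p>Until then, it was Bob Griffin.</p>"
    "<p>After the run</p><script>var x = 1;</script></div>"
)
text = extract_text(sample)
print(text)
```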

Change every word in a paragraph with Ruby

So I'm coding in Ruby and I've got a few sentences:
The sky above the port was the color of television, tuned to a dead channel. "It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency." It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
And I need to modify every word in the paragraph without changing the structure. My original idea was to just split on whitespace and then rejoin it, but the issue with that is you get the punctuation as well. If you split so that you just get the word, it's hard to rejoin because you don't know the proper punctuation.
Are there better ways to do this than the traditional split, map, join combo? Or maybe just a good split regex so it's easy to rejoin?
Use gsub with a block:
str = %q(The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd
around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates;
you could drink there for a week and never hear two words in Japanese.)
puts str.gsub(/\w+/){|word| word.tr('aeiou','uoaei') }
result:
Tho sky ubevo tho pert wus tho celer ef tolovasaen, tinod te u doud chunnol.
"It's net lako I'm isang," Cuso hourd semoeno suy, us ho sheildorod has wuy threigh tho crewd
ureind tho deer ef tho Chut. "It's lako my bedy's dovolepod thas mussavo drig dofacaoncy."
It wus u Spruwl veaco und u Spruwl jeko. Tho Chutsibe wus u bur fer prefossaenul oxputrautos;
yei ceild drank thoro fer u wook und novor hour twe werds an Jupunoso.
Well, this #tr method would work without the regex, but you get the idea.
I would match words between word boundaries with a regex to avoid affecting punctuation or whitespace, e.g.:
s = "This is a test, ok? Yes, fine!"
s.gsub!(/\b(\w+)\b/) {|x| "_#{x}_"}
# s is now "_This_ _is_ _a_ _test_, _ok_? _Yes_, _fine_!"
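For comparison, the same "substitute with a callback" pattern exists in other languages. The sketch below is a Python counterpart to the Ruby answers, using re.sub with a function as the replacement; transform_words and VOWEL_SWAP are names made up for this example. As with the Ruby regex, \w+ treats "It" and "s" in "It's" as two separate words.

```python
import re

# Same vowel mapping as tr('aeiou', 'uoaei') in the Ruby answer.
VOWEL_SWAP = str.maketrans("aeiou", "uoaei")

def transform_words(text, fn):
    """Apply fn to every run of word characters, leaving everything else alone."""
    return re.sub(r"\w+", lambda m: fn(m.group(0)), text)

swapped = transform_words("The sky above the port.", lambda w: w.translate(VOWEL_SWAP))
wrapped = transform_words("This is a test, ok? Yes, fine!", lambda w: f"_{w}_")
print(swapped)
print(wrapped)
```

Punctuation and whitespace never reach the callback, so the structure of the paragraph is preserved without a split and join.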
