Ruby regex and special characters like dash (—) and » - ruby

I'm trying to replace all punctuation and the likes in some text with just a space. So I have the line
text = "—Bonne chance Harry murmura t il »"
How can I remove the dash and »? I tried
text.gsub( /»|—/, ' ')
which gives an error, not surprisingly. I'm new to Ruby and just trying to get the hang of things by writing a script to pull all the words out of a chapter of a book. I figured I'd just remove the punctuation and symbols and then use text.split. Any help would be appreciated. I couldn't find much.

It turns out the problem had to do with the utf-8 encoding. Adding
# encoding: utf-8
solved my issue, and what @Andrewlton said works great.

This should properly substitute in the way you were trying to do it; just add brackets and remove the pipe:
text.gsub(/[»—]/, ' ')
The standard punctuation regexp also works:
text.gsub(/\p{P}/, ' ')
You should be able to use regexp pretty universally, coming from whatever language you know. Hope this helps!
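As a minimal sketch of the whole task (strip punctuation, then split into words), the following may help. The helper name extract_words is hypothetical, and the magic comment is only needed on Ruby 1.9, where the source encoding does not default to UTF-8.

```ruby
# encoding: utf-8

# Hypothetical helper: replace every Unicode punctuation character
# with a space, then split on whitespace to get the words.
def extract_words(text)
  text.gsub(/\p{P}/, ' ').split
end

text = "—Bonne chance Harry murmura t il »"
p extract_words(text)
# => ["Bonne", "chance", "Harry", "murmura", "t", "il"]
```

Note that \p{P} covers punctuation only. Symbols such as + or $ belong to the \p{S} category, so a character class like /[\p{P}\p{S}]/ would be needed to drop those as well.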

Related

Single quote string interpolation to access a file in linux

How do I make the parameter file of the method sound become the file name of the .fifo extension using single quotes? I've searched up and down and tried many different approaches, but I think I need a new set of eyes on this one.
def sound(file)
@cli.stream_audio('audio\file.fifo')
end
Alright so I finally got it working, might not be the correct way but this seemed to do the trick. First thing, there may have been some white space interfering with my file parameter. Then I used the File.join option that I saw posted here by a few different people.
I used a bit of each of the answers really, and this is how it came out:
def sound(file)
file = file.strip
file = File.join('audio/',"#{file}.fifo")
@cli.stream_audio(file) if File.exist? file
end
Works like a charm! :D
Ruby interpolation requires that you use double quotes.
Is there a reason you need to use single quotes?
def sound(file)
@cli.stream_audio("audio/#{file}.fifo")
end
As Charles Caldwell stated in his comment, the best way to get cross-platform file paths to work correctly would be to use File.join. Using that, your method would look like this:
def sound(file)
@cli.stream_audio(File.join("audio", "#{file}.fifo"))
end
Your problem is with your usage of file path separators. You are using a \. While this may not seem like a big deal, it matters when used in Ruby strings.
When you use \ in a single quoted string, nothing happens. It is evaluated as-is:
puts 'Hello\tWorld' #=> Hello\tWorld
Notice what happens when we use double quotes:
puts "Hello\tWorld" #=> Hello   World (the gap is a tab character)
The \t got interpreted as a tab. That's because, much like how Ruby will interpolate #{} code in a double quote, it will also interpret \n or \t into a new line or tab. So when it sees "audio\file.fifo" it is actually seeing "audio" with a \f and "ile.fifo". It then determines that \f means 'form feed' and adds it to your string. Here is a list of escape sequences. It is for C++ but it works across most languages.
As @sawa pointed out, if your escape sequence does not exist (for instance \y) then it will just remove the \ and leave the 'y'.
"audio\yourfile.fifo" #=> audioyourfile.fifo
There are three possible solutions:
Use a forward slash:
"audio/#{file}.fifo"
The forward slash will be interpreted as a file path separator when passed to the system. I do most of my work on Windows, which uses \, but using / in my code is perfectly fine.
Use \\:
"audio\\#{file}.fifo"
Using a double \\ escapes the \ and causes it to be read as you intended it.
Use File.join:
File.join("audio", "#{file}.fifo")
This will join the parameters using whatever file separator is defined in the File::SEPARATOR constant.
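To see the difference between the quoting styles directly, here is a small sketch. The helper name fifo_path and the sample name "beep" are made up for illustration.

```ruby
# Single quotes keep the backslash; double quotes turn \f into a form feed.
single = 'audio\file.fifo'
double = "audio\file.fifo"

puts single.length          # 15 characters: the backslash and the "f" both survive
puts double.length          # 14 characters: "\f" collapsed into one form feed
puts double.include?("\f")  # true

# Hypothetical helper: build the path with File.join and interpolation,
# stripping stray whitespace from the name first.
def fifo_path(name)
  File.join("audio", "#{name.strip}.fifo")
end

puts fifo_path(" beep ")    # audio/beep.fifo when File::SEPARATOR is "/"
```

Checking the string length (or calling inspect on the string) is a quick way to spot an escape sequence that was interpreted when you did not expect it.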

Stripping non-alphanumeric chars but leaving spaces in Ruby

Trying to change this:
"The basketball-player is great! (Kobe Bryant)"
into this:
"the basketball player is great kobe bryant"
Want to downcase and remove all punctuation but leave spaces...
Tried string.downcase.gsub(/[^a-z]/, '') but it removes the spaces
You can simply add \s (whitespace)
string.downcase.gsub(/[^a-z0-9\s]/i, '')
If you want to catch non-latin characters, too:
str = "The basketball-player is great! (Kobe Bryant) (ひらがな)"
str.downcase.gsub(/[^[:word:]\s]/, '')
#=> "the basketballplayer is great kobe bryant ひらがな"
Some fine solutions, but simplest is usually best:
string.downcase.gsub(/\W+/, ' ')
All the other answers strip out numbers as well. That works for the example given but doesn't really answer the question which is how to strip out non-alphanumeric.
string.downcase.gsub(/[^\w\s]/, '')
Note this will not strip out underscores. If you need that then:
string.downcase.gsub(/[^a-zA-Z\s\d]/, '')
a.downcase.gsub(/[^a-z ]/, "")
Note the whitespace I have added after a-z.
Also, if you want to handle all whitespace (not only the space character), use \s as proposed by gmalette.
All the previous answers make basketball-player into basketballplayer or remove numbers entirely, which is not exactly what is required.
The following code does exactly what you asked:
text.downcase
.gsub(/[^[:word:]]+/, ' ') # Replace each run of non-word characters (whitespace included) with a single space
.strip # Remove the space left over from punctuation at either end
Hope this helps someone!
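The two families of answers in this thread behave differently on the hyphen, which a side-by-side sketch makes visible. Both method names are hypothetical, and the character classes are ASCII-only for simplicity.

```ruby
# Hypothetical helper: delete everything except letters, digits and whitespace.
def drop_punctuation(string)
  string.downcase.gsub(/[^a-z0-9\s]/, '')
end

# Hypothetical helper: turn each run of non-alphanumerics into one space.
def space_out_punctuation(string)
  string.downcase.gsub(/[^a-z0-9]+/, ' ').strip
end

sample = "The basketball-player is great! (Kobe Bryant)"

p drop_punctuation(sample)
# => "the basketballplayer is great kobe bryant"
p space_out_punctuation(sample)
# => "the basketball player is great kobe bryant"
```

Deleting punctuation glues hyphenated words together, while replacing runs with a space keeps them apart, so the second form matches the output requested in the question.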

Ruby gsub issues

I have a piece of text that resembled the following:
==EXCLUDE
#lots of lines of text
==EXCLUDE
#this is what I actually want
And so I was trying to remove the unwanted bit by doing:
str.gsub!(/==EX.*?==EXCLUDE/, '')
However, it's not working. When I tried to remove the \n chars first, it worked like a dream. The issue is that I can't actually remove the \n characters. How can I do a substitution like this while leaving newlines in place?
By default, the . does not match line break chars. If you enable the m modifier in Ruby (in other languages, this is the s modifier) it should work:
str.gsub!(/==EX.*?==EXCLUDE/m, '')
Here's a live demo on Rubular: http://rubular.com/r/YxLSB1Iq95
Try str.gsub!(/==EX.*?==EXCLUDE/m, '')
That should make it span new lines.
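A small sketch of the effect of the m modifier, using made-up sample text in the shape described in the question (the method name strip_excluded is hypothetical):

```ruby
# Hypothetical helper: remove everything between (and including) the
# ==EXCLUDE markers. The m flag lets . match newline characters.
def strip_excluded(str)
  str.gsub(/==EX.*?==EXCLUDE/m, '')
end

str = "==EXCLUDE\nlots of lines\nof text\n==EXCLUDE\nthis is what I actually want\n"

# Without m, . stops at each newline, so nothing matches and str is unchanged.
p str.gsub(/==EX.*?==EXCLUDE/, '') == str
# => true

p strip_excluded(str)
# => "\nthis is what I actually want\n"
```

The lazy quantifier .*? matters here too: with several excluded blocks in one string, a greedy .* would swallow everything from the first marker to the last.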

Rubular/Ruby discrepancy in captured text

I've carefully cut and pasted from this Rubular window http://rubular.com/r/YH8Qj2EY9j to my code, yet I get different results. The Rubular match capture is what I want. Yet
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
only gets me the first line, i.e.
<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
I don't think it's my test data, but that's possible. What am I missing?
(ruby 1.9 on Ubuntu 10.10)
Paste your test data into an editor that is able to display control characters and verify your line break characters. Normally it should be only \n on a Linux system as in your regex. (I had unusual linebreaks a few weeks ago and don't know why.)
The other check you can do is to change your brackets and print your capturing groups, so that you can see which part of your regex matches what.
/^<DD>(.*)\n?(.*)\n/
Another way to get this to work is to change the .* parts. Instead of matching any character, explicitly match anything except \n:
^<DD>([^\n]*\n?[^\n]*)\n
I believe you need the multiline modifier in your code:
/m Multiline mode: dot matches newlines, ^ and $ both match line starts and endings.
The following:
#!/usr/bin/env ruby
desc= '<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
<DT>la la this should not be matched oh good'
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
puts description = $1
end
prints
#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
on my system (Linux, Ruby 1.8.7).
Perhaps your line breaks are really \r\n (Windows style)? What if you try:
desc_pattern = /^<DD>(.*\r?\n?.*)\r?\n/
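If the line endings are the suspect, one option is to inspect them and normalize before matching, so the original pattern can stay as it is. This is only a sketch: the sample data is invented and first_description is a hypothetical name.

```ruby
# Hypothetical helper: convert \r\n and bare \r to \n, then apply the
# pattern from the question and return the first capture group.
def first_description(desc)
  normalized = desc.gsub(/\r\n?/, "\n")
  normalized[/^<DD>(.*\n?.*)\n/, 1]
end

desc = "<DD>first line<br />\r\nsecond line\r\n<DT>not matched\r\n"

# inspect (via p) shows control characters, so stray \r becomes visible.
p desc.include?("\r")
# => true

p first_description(desc)
# => "first line<br />\nsecond line"
```

Normalizing first also keeps carriage returns out of the captured text, which can otherwise end up in the capture because . matches \r.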

in Ruby, trying to convert those weird quotes into "regular" quotes

I am trying to parse a text file that has the weird quotes like
“ and ” into "normal" quotes like "
I tried this:
text.gsub!("“",'"')
text.gsub!("”",'"')
but when it's done, they are still there and show up as
\x93 and \x94
so I tried adding that too with no luck:
text.gsub!('\\x93', '"')
text.gsub!('\\x94', '"')
The problem is, when I try to show those weird quotes on a webpage, it makes that weird diamond with a question mark symbol: �
It seems to work:
irb(main):001:0> text = "“foo”"
=> "\342\200\234foo\342\200\235"
irb(main):002:0> text.gsub!("“",'"')
=> "\"foo\342\200\235"
irb(main):003:0> text.gsub!("”",'"')
=> "\"foo\""
You need to use a hex editor to figure out all the character codes involved.
Re: the second question of why the weird quotes show on a web page as the � symbol:
Your problem is that your web page is not in UTF-8 mode. To get it there, see
http://www.w3.org/International/O-HTTP-charset
If you can't change your web server, add a meta line in the head section of your web pages: http://www.utf-8.com/
Larry
Your first gsubs should work. The reason the second set of gsubs doesn't work is that you're using single quotes with a double backslash, so the pattern is the literal four characters \x93 rather than the byte 0x93. Try the other way around:
text.gsub!("\x93", '"')
text.gsub!("\x94", '"')
You can also do this in one line. Chain the non-bang gsub here, because gsub! returns nil when it replaces nothing:
text = text.gsub("\x93", '"').gsub("\x94", '"')
# or
text.gsub!(/(\x93|\x94)/, '"')
Are you sure the encoding of the string is correct?
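The bytes \x93 and \x94 are the curly quotes in Windows-1252, so one possible explanation is that the file is in that encoding rather than UTF-8. Under that assumption (which the question does not confirm), a sketch on Ruby 1.9 or later could transcode first and then replace the quotes. The method name straighten_quotes is hypothetical.

```ruby
# Hypothetical helper: treat the raw bytes as Windows-1252 (an assumption),
# transcode to UTF-8, then replace both curly quotes with a plain one.
def straighten_quotes(raw)
  utf8 = raw.dup.force_encoding("Windows-1252").encode("UTF-8")
  utf8.gsub(/[“”]/, '"')
end

p straighten_quotes("\x93foo\x94")
# => "\"foo\""
```

Once the text is valid UTF-8, any remaining curly quotes will also render correctly on a page that is served as UTF-8, instead of showing the replacement character.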
