How to remove whitespace but not UTF-8 characters in Ruby

I want to prevent users from writing an empty comment (whitespace, non-breaking spaces, etc.), so I apply the following:
var.gsub(/^\s+|\s+\z|\s* \s*/, '')
However, a smart user then found a hole by using the \302 and \240 bytes (the octal form of the UTF-8 encoding of the non-breaking space), so I filtered out these bytes too.
Then I ran into a problem when I introduced support for several languages: a word like Déjà vu became an error, because the UTF-8 encoding of the à character contains the byte \240. Is there any way to remove the whitespace but leave the Latin characters untouched?

A way around this is to use iconv to discard the invalid byte sequences (such as \230 on its own) before using your regexp to remove the whitespace:
require 'iconv'
var1 = "Déjà vu"
var2 = "\240"
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu"
valid2 = ic.iconv(var2) # => ""
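Note that Iconv was later removed from the standard library. From Ruby 2.1 on, String#scrub does the same job of discarding invalid byte sequences; a minimal sketch, reusing var1 and var2 from above:

var1.scrub('') # => "Déjà vu" -- valid text is untouched
var2.scrub('') # => ""        -- the stray \240 byte is discarded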

Related

What to use as a delimiter so I can detect the original inputs? Any good ideas? (Ruby)

I have an Encoder (using openssl) that can encrypt and decrypt strings like so:
new_addresses
=> ["fasfds", "someaddress", "123- this is also a valid address"]
[8] pry(#<Sinatra::Application>)> Encoder.encrypt(new_addresses.join(' '))
=> "55C2FB253468204EA9D3F5CE6D58DC4088BD52731B90B9C0C8EB5FE7FA1CD4E7B41F0A84DC46C69E09A10DC1931C6A976A58E29C"
[9] pry(#<Sinatra::Application>)> enc=_
=> "55C2FB253468204EA9D3F5CE6D58DC4088BD52731B90B9C0C8EB5FE7FA1CD4E7B41F0A84DC46C69E09A10DC1931C6A976A58E29C"
[10] pry(#<Sinatra::Application>)> Encoder.decrypt(enc)
=> "fasfds someaddress 123- this is also a valid address"
The issue I have here is that I have no idea which were the original 3 addresses. The new_addresses, which are merely params that come in from a form, are an array separated by commas. But when I join them together and encrypt the result, I lose the comma delimiter and the array structure when I decrypt it, so I have no idea what the original 3 addresses were. Any ideas on what I can do so that, after I decrypt the string, I can still detect what the original 3 addresses are?
These are valid characters in an address:
' '
-
_
^
%
$
#
...
really any characters.
It looks like your encryption algorithm uses only the characters 0-9 and A-Z. In that case, you can use any character that is not one of those characters to join() your encrypted strings together, for instance "-":
encrypted_str = "55C2FB253-3F5CE6D58DC4-B5FE7FA1CD4E7"
encrypted_pieces = encrypted_str.split '-'
decrypted_pieces = encrypted_pieces.map do |piece|
Encoder.decrypt piece
end
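For completeness, the encrypting side of that scheme could look like this (assuming the same Encoder from the question):

encrypted_str = new_addresses.map { |address| Encoder.encrypt(address) }.join('-')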
On the other hand, if you want to join your strings together first and then encrypt the combined string, you can use the non-printing ASCII character named NUL to glue the pieces together. NUL's ASCII code is 0, which can be represented by the hex escape \x00 inside a String:
decrypted_str = "fasfds\x00someaddress\x00123- this is also a valid address"
puts decrypted_str
pieces = decrypted_str.split "\x00"
p pieces
--output:--
fasfdssomeaddress123- this is also a valid address
["fasfds", "someaddress", "123- this is also a valid address"]
Magic.
Of course, the separator character should be a character that won't appear in the input. If the input can be binary data, e.g. an image, then you can't use \x00 as the separator.
These are valid characters in an address:
' '
-
_
^
%
$
#
...
Note that you didn't list a comma, which would be an obvious choice for the separator.
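Putting that observation to work, a minimal sketch (again assuming the question's Encoder) joins on the comma before encrypting and splits on it after decrypting:

joined = new_addresses.join(',')
encrypted = Encoder.encrypt(joined)
addresses = Encoder.decrypt(encrypted).split(',')
# => ["fasfds", "someaddress", "123- this is also a valid address"]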

How to get "celavita" from "C\u00EAlaV\u00EDta"? [duplicate]

I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (ie. accented letters), and I want to create a copy with the accents removed.
I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.
# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
n = str.chars.downcase.strip.to_s
n.gsub!(/[àáâãäåāă]/u, 'a')
n.gsub!(/æ/u, 'ae')
n.gsub!(/[ďđ]/u, 'd')
n.gsub!(/[çćĉċč]/u, 'c')
n.gsub!(/[èéêëēĕėęě]/u, 'e')
n.gsub!(/ƒ/u, 'f')
n.gsub!(/[ĝğġģ]/u, 'g')
n.gsub!(/[ĥħ]/, 'h')
n.gsub!(/[ììíîïīĩĭ]/u, 'i')
n.gsub!(/[įıijĵ]/u, 'j')
n.gsub!(/[ķĸ]/u, 'k')
n.gsub!(/[ĺļľŀł]/u, 'l')
n.gsub!(/[ñńņňʼnŋ]/u, 'n')
n.gsub!(/[òóôõöøōŏő]/u, 'o')
n.gsub!(/œ/u, 'oe')
n.gsub!(/ą/u, 'q')
n.gsub!(/[ŕŗř]/u, 'r')
n.gsub!(/[śŝşšș]/u, 's')
n.gsub!(/[ťţŧț]/u, 't')
n.gsub!(/[ùúûüūůűŭũų]/u,'u')
n.gsub!(/ŵ/u, 'w')
n.gsub!(/[ýÿŷ]/u, 'y')
n.gsub!(/[žżź]/u, 'z')
n.gsub!(/\s+/, ' ')
n.gsub!(/[^\sa-z0-9_-]/, '')
n
end
Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.
I am not using Rails, nor do I plan on doing so.
I generally use I18n to handle this:
1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"
The parameterize method (from Rails' ActiveSupport) could be a nice and simple solution to remove special characters so the string can be used as a human-readable identifier:
> "Françoise Isaïe".parameterize
=> "francoise-isaie"
So far the following is the only way I've been able to accomplish what I need:
str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")
But using this feels very 'hackish', and I would love to find a better way.
If you are using rails:
"L'Oréal".parameterize(separator: ' ')
Solution:
DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')
def removeaccents(str)
str
.unicode_normalize(:nfd)
.tr(DIACRITICS, '')
.unicode_normalize(:nfc)
end
Example (before/after):
ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa
Explanations:
Decompose each single-codepoint character into its constituent codepoints (where applicable).
Remove the diacritical mark codepoints (Unicode 15.0.0 reference) found in the following blocks:
Combining Diacritical Marks Supplement (U+1DC0 → U+1DFF)
Combining Diacritical Marks (U+0300 → U+036F)
Combining Half Marks (U+FE20 → U+FE2F)
Recompose the characters.
Caveats:
While these diacritics are predominantly used with text, some of them can also be used with symbols; such symbols will have their diacritics removed when they shouldn't be.
Obscure codepoints such as subtending marks are not removed. Despite their naming, they are not treated as combining marks by the Unicode reference but as format characters. An example is the Arabic hamza above ◌ٔ (U+0654), which probably doesn't even get displayed properly in your browser.
Not a caveat per se, but worth noting: diacritics preceded by a space or a no-break space are also removed. Some text-rendering software displays these as standalone characters, so this may be undesired.
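A quick check of removeaccents against strings from elsewhere on this page:

removeaccents("D\u00E9j\u00E0 vu") # => "Deja vu"
removeaccents("C\u00EAlaV\u00EDta").downcase # => "celavita"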

Regexp: non-alphanumeric but not German characters

I would like to remove all non-alphanumeric characters from a string, except spaces, hyphens, and some German characters.
Example
regexp = "mönchengladbach."
regexp.gsub(/[^0-9a-z \-]/i, '')
=> mnchengladbach
I need this:
=> mönchengladbach
It should also not replace other German characters such as:
ä ö ü ß
Thanks!
Edit:
It was just me not testing properly: my IRB session did not accept the special characters. This works for me:
regexp.gsub(/[^0-9a-z \-äüöß]/i, '')
To remove everything that is not a letter or whitespace you can use this:
str.gsub(/[^\p{L}\s]+/, '')
I use a negated character class here: [^\p{L}\s] means everything that is not a letter (in any language) or a whitespace character (space, tab, newline).
\p{L} is a Unicode character property for letters.
You can easily add other characters you want to preserve like -:
str.gsub(/[^\p{L}\s-]+/, '')
example script:
# encoding: UTF-8
str = "mönchengladbach."
str = str.gsub(/[^\p{L}\s]+/, '#')
puts str
I think you want:
/[^[:alnum:] -]/
Note that the /i flag is not necessary, and there is no need to escape - when it is at the end of a character class.
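A quick check of that POSIX-class approach against the example input (assuming a UTF-8 source encoding):

"mönchengladbach.".gsub(/[^[:alnum:] -]/, '') # => "mönchengladbach"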

Regex to remove non letters

I'm trying to remove non-letters from a string. Would this do it:
c = o.replace(o.gsub!(/\W+/, ''))
Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')
Remove anything that is not a letter:
> " sd 190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"
EDIT: as ikegami points out, this doesn't take into account accented characters, umlauts, and other similar characters. The solution to this problem will depend on what exactly you are referring to as "not a letter". Also, what your input will be.
Keep in mind that ruby considers the underscore _ to be a word character. So if you want to keep underscores as well, this should do it
string.gsub!(/\W+/, '')
Otherwise, you need to do this:
string.gsub!(/[^a-zA-Z]/, '')
That will work in most cases, except when o initially does not contain any non-letter characters, in which case gsub! will return nil.
If you just want a replaced string, it can be simpler:
c = o.gsub(/\W+/, '')
Using \W or \w to select or delete only letters won't work. \w means A-Z, a-z, 0-9, and "_":
irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
So, stripping using \W preserves digits and underscores.
If you want to match characters use /[A-Za-z]+/, or the POSIX character class [:alpha:], i.e. /[[:alpha:]]+/, or /\p{ALPHA}/.
The final form is the Unicode property for letters: in ASCII it covers 'A'..'Z' and 'a'..'z', and it extends to the full letter repertoire when dealing with Unicode, so if you have multibyte characters you should probably use it.
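A short comparison of the two (the accented input is just an illustration):

"Déjà vu 42".gsub(/[^A-Za-z]+/, '') # => "Djvu" -- ASCII-only, accents dropped
"Déjà vu 42".gsub(/[^\p{Alpha}]+/, '') # => "Déjàvu" -- Unicode-aware, accents kept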
Use Regexp.union to create a big matching object:
allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")
I don't see what the o.replace is in there for. If you have a string:
string = 't = 4 6 ^'
And you do:
string.gsub!(/\W+/, '')
You get:
t46
If you want to get rid of the number characters too, you can do:
string.gsub!(/\W+|\d+/, '')
And you get:
t

How to remove these kinds of symbols (junk) from a string?

Imagine I have String in C#: "I Don’t see ya.."
I want to remove (i.e. replace with nothing) these "’" symbols.
How do I do this?
That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.
’ is the sequence C3 A2, E2 82 AC, E2 84 A2.
UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™
We then do it again: in Windows 1252 this sequence is E2 80 99, so the character should have been U+2019, RIGHT SINGLE QUOTATION MARK (’)
You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places that UTF-8 data was incorrectly interpreted as Windows-1252.
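In Ruby terms (the main language on this page), one round of that reversal could be sketched as:

garbled = "I Don\u00E2\u20AC\u2122t see ya.." # the mojibake form with â, €, ™
fixed = garbled.encode('Windows-1252').force_encoding('UTF-8')
# => "I Don’t see ya.." -- â€™ maps back to the bytes E2 80 99, which decode as ’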
"I Don’t see ya..".Replace( "’", string.Empty);
How did that junk get in there the first place? That's the real question.
By removing any non-Latin character you'll be intentionally breaking some internationalization support.
Don't forget the poor guy whose name has an "â" in it.
This looks disturbingly similar to a character encoding issue, with Windows character-set data being stored in a database under a different standard encoding. I see someone voted Will down, but he has a point: you may be solving the immediate issue, but the combinations of characters are limitless if this is the root cause.
If you really have to do this, regular expressions are probably the best solution.
I would strongly recommend that you think about why you have to do this, though. At least some of the characters you're listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a Swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.
Consider Regex.Replace(your_string, regex, "") - that's what I use.
Test each character in turn to see if it is a valid alphabetic or numeric character, and if not, remove it from the string. The character test is very simple; just use:
char.IsLetterOrDigit(c)
There are various others as well, such as:
char.IsSymbol(c)
char.IsControl(c)
Regex.Replace("The string", "[^a-zA-Z ]","");
That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.
[Edited: forgot the space in the regex]
The ASCII/integer codes for these characters are outside the normal alphabetic ranges. Seek them out and replace them with empty strings; String has a Replace method, I believe.
Either use a blacklist of stuff you do not want or, preferably, a whitelist (a set). With a whitelist, you iterate over the string and copy only the characters that are in your whitelist to the result string. You said remove, and the way you do that in place is with two pointers: one you read from (R) and one you write to (W):
I Donââ‚
W R
If the comma is in your whitelist, then in this case you would read the comma and write it where the à is, then advance both pointers. UTF-8 is a multi-byte encoding, so advancing a pointer may not be as simple as adding one to the address.
With C, an easy way to get a whitelist is to use one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you end up with a whitelist function instead of a set, of course.
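The same whitelist idea, sketched in Ruby rather than C (the allowed set here is hypothetical):

ALLOWED = /[a-zA-Z0-9 ,.\-]/ # hypothetical ASCII-only whitelist
def whitelist_copy(str)
  # keep only whitelisted characters, like the R/W pointer scheme above
  str.each_char.select { |c| c.match?(ALLOWED) }.join
end
whitelist_copy("I Don\u00E2\u20AC\u2122t see ya..") # => "I Dont see ya.."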
Usually when I see data like you have I look for memory corruption, or evidence to suggest that the encoding I expect is different than the one the data was entered with.
/Allan
I had the same problem with extraneous junk thrown in by Adobe in an EXIF dump. I spent an hour looking for a straight answer and trying numerous half-baked suggestions, none of which worked here.
This thread more than most I have read was replete with deep, probing questions like 'how did it get there?', 'what if somebody has this character in their name?', 'are you sure you want to break internationalization?'.
There were some impressive displays of erudition positing how this junk could have gotten here and explaining the evolution of the various character encoding schemes. The person wanted to know how to remove it, not how it came to be or what the standards orgs are up to, interesting as this trivia may be.
I wrote a tiny program which gave me the right answer. Instead of paraphrasing the main concept, here is the entire, self-contained, working (at least on my system) program and the output I used to nuke the junk:
#!/usr/local/bin/perl -w
# This runs in a dos window and shows the char, integer and hex values
# for the weird chars. Install the HEX values in the REGEXP below until
# the final test line looks normal.
$str = 's: “Brian'; # Nuke the 3 weird chars in front of Brian.
@str = split(//, $str);
printf("len str '$str' = %d, scalar \@str = %d\n",
length $str, scalar @str);
$ii = -1;
foreach $c (@str) {
$ii++;
printf("$ii) char '$c', ord=%03d, hex='%s'\n",
ord($c), unpack("H*", $c));
}
# Take the hex characters shown above, plug them into the below regexp
# until the junk disappears!
($s2 = $str) =~ s/[\xE2\x80\x9C]//g; # << Insert HEX values HERE
print("S2=>$s2<\n"); # Final test
Result:
M:\new\6s-2014.1031-nef.halloween>nuke_junk.pl
len str 's: GÇ£Brian' = 11, scalar @str = 11
0) char 's', ord=115, hex='73'
1) char ':', ord=058, hex='3a'
2) char ' ', ord=032, hex='20'
3) char 'G', ord=226, hex='e2'
4) char 'Ç', ord=128, hex='80'
5) char '£', ord=156, hex='9c'
6) char 'B', ord=066, hex='42'
7) char 'r', ord=114, hex='72'
8) char 'i', ord=105, hex='69'
9) char 'a', ord=097, hex='61'
10) char 'n', ord=110, hex='6e'
S2=>s: Brian<
It's NORMAL!!!
One other actionable, working suggestion I ran across:
iconv -c -t ASCII < 6s-2014.1031-238246.halloween.exf.dif > exf.ascii.dif
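A rough Ruby equivalent of that iconv invocation, assuming text holds the input (drop anything that isn't ASCII):

ascii = text.encode('US-ASCII', invalid: :replace, undef: :replace, replace: '')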
If a string contains junk data like this, here is a good way to remove it:
string InputString = "This is grate kingdom¢Ã‚¬â";
string replace = "’";
string OutputString = Regex.Replace(InputString, replace, "");
// OutputString holds the cleaned result
This works well for me; thanks for reading.
