I need to replace certain UTF8 hex codes to the equivalent character in an irregular text string like:
\C3\A1 á
\C3\A9 é
\C3\AD í
\C3\B3 ó
\C3\BA ú
I'm not sure if I should be using RegEx or a macro to find and replace each entry since they won't appear regularly in the text strings I'm working with.
Thanks.
Just a macro - as there's no "variable" data in your search, you don't need a regex. Record a macro to replace the first character, then the second and so on. Use the "normal mode" to replace with the correct character.
You need a reasonably new version of N++ to be able to record search & replace actions - from 5.8.2/3 I think.
Dave.
Related
Here is the Unicode characters table for the Tibetan language,
https://en.m.wikipedia.org/wiki/Tibetan_(Unicode_block)
How to I use the codes in that chart in a fmt.Printf(mycode) statement, in order to print, say the Tibetan letter ཏ, which is located at line U+0F4x and column F of that unicode chart.
Do I have to write:
Fmt.Printf(“U+0F4xF”)
or something like that, or do I have to drop the “U” or the “U+“ ?
To print ཏ (U+0F4F TIBETAN LETTER TA) (or any other Unicode character), you can put the character directly into your string literal, use a \u0F4F escape, or use the correspoding rune (Unicode codepoint):
fmt.Printf("Direct: ཏ\n")
fmt.Printf("Escape: \u0F4F\n")
fmt.Printf("Rune: %c\n", rune(0x0F4F))
The Go blog has some details...
I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it matches all headings which does not contain chars as Č, Š, Ž(slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I get utf8?
You are right in that when you define the ASCII range A-Z, the match is made literally only for those characters. This is to do with the history of characters on computers, more and more characters have been added over time, and they are not always structured in an encoding in ways that are easy to use.
You could make a larger character class that matches the slovenian characters you need, by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further, for instance it would not match the heading "I AM A HEADING" due to the match insisting each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
You can use unicode upper case letter:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b
RegEx Demo
I have an html file that I need to replace some characters with html entities. Right now I'm trying to replace — with — but when I use the Replace All button, the result is that all of those instances of — are replaced with —mdash;
I thought maybe escaping the "&" will work, so I changed the Replace with value to \— but that just results in \—mdash;
The strange thing is that if I go to each, one by one, i.e., click Next, then click Replace, and so on, then it replaces it correctly.
Is this a bug in MacVim? Or am I missing something?
Enter into command line:
:%s/—/\—/g
Also it's possible to get character code. Place your cursor on the character and press ga. Use decimal, hex or octal code into replacement string:
\%d match specified decimal character
\%x match specified hex character
\%o match specified octal character
\%u match specified multibyte character
\%U match specified large multibyte character
:%s/\%d8212/\$mdash;/g
I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do is, match anything which is not an ascii character (00 - 7F) and not special character I want and replace it with blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about encoding of the string. Also, you want to white-list characters or black list? Which ones?
But you get the idea, decide what you want, and then use proper ranges as colleagues here already proposed. Some examples:
if str = "subscripción gustaría♥" is utf-8
then you can blacklist all char above the range (excl. whitespaces):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
if string is in ISO-8859-1 codepage you can try to match all quirky characters like the "heart" from the beginning of ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.
What is the ASCII number for the double quote? (")
Also, is there a link to a list anywhere?
Finally, how do you enter it in the C family (esp. C#)
The ASCII code for the quotation mark is 34.
There are plenty of ASCII tables on the web. Note that some describe the standard 7-bit ASCII code, while others describe various 8-bit extensions that are super-sets of ASCII.
To put quotation marks in a string, you escape it using a backslash:
string msg = "Let's just call it a \"duck\" and be done with it.";
To put a quotation mark in a character literal, you don't need to escape it:
char quotationMark = '"';
Note: Strings and characters in C# are not ASCII, they are Unicode. As Unicode is a superset of ASCII the codes are still usable though. You would need a Unicode character table to look up some characters, but an ASCII table works fine for the most common characters.
It's 34. And you can find a list on Wikipedia.
yes, the answer the 34
In order to find the ascii value for special character and other alpha character I'm writing here small vbscript. In a note pad, write the below script save as abc.vbs(any name with extention .vbs) double click on the file to execute and you can see double quotes for 34.
For i=0 to 150
msgbox i&"="&char(i)
next
you are never going to need only 1 quote, right?
so I declare a CHAR variable i.e
char DoubleQuote;
then drop in a double quote
Convert.ToChar(34);
so use the variable DoubleQuote where you need it
works in SQL to generate dynamic SQL but there you need
DECLARE #SingleQuote CHAR(1)
and
SET #SingleQuote=CHAR(39)