ruby string convert ascii to unicode - ruby

I have a string which has ascii special characters and i want to convert those to respective unicode characters. For example below is the string
A “razor” is a rule of thumb that simplifies decision.. \nWe’re in a post-content age. In the past,\nhealthier, wealthier life: • Toxic relationships • Comparisons • Inactivity • Complaints • Instant gratification • Overthinking • Crazy “what if” fears
Expect output
A "razor" is a rule of thumb that simplifies decision.. \nWe're in a post-content age. In the past,\nhealthier, wealthier life: • Toxic relationships • Comparisons • Inactivity • Complaints • Instant gratification • Overthinking • Crazy "what if" fears
The best result I could get is using unidecode gem. Which converted the above string to this
"A \"razor\" is a rule of thumb that simplifies decision..\nWe're in a post-content age. In the past,\nhealthier, wealthier life: * Toxic relationships * Comparisons * Inactivity * Complaints * Instant gratification * Overthinking * Crazy \"what if\" fears "
The problem with the approach is unidecode to_ascii method will convert the character if the string is in another language.

So what you are asking about is not ascii but ASNI also known as windows-1252, I would recommend you take a look at the Windows-1252 wiki as it has a table with the Unicode code points marked on the table. Essentually there is no easy and quick way to convert from ansi to unicode and the way it was done with the table in that wiki page is the same glyph was found in unicode and substituted in.
One thing about ansi, asci, and unicode is the first 128 characters are all the same between them.
personally I would just make a look up table and also how ruby seems to handle unicode character strings is using the following: "\u<code point in hex>" where you replace the <code point in hex> with the hexadecimal value of the code point so for say the bullet point: "•" would be converted to: "\u2022" if you need to look up unicode code points I recommend: unicodeplus.com as it even gives you the escape sequence used for each codepoint for a few different programming languages.

Related

What are valid date-time separators in RFC3339 strings?

I'm quite confused as to what's allowed as the time separator/designator in the RFC3339 standard. By time separator I mean the sequence of characters that draw the line between date and time.
The standard states in section 5.6 different things that are unclear or conflicting. First of all, it says that the production rule for a full datetime is this:
date-time = full-date "T" full-time
Meaning that the delimiter between the date and the time is an uppercase T. Right after comes this:
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, but OK, it stills sounds to me within the realm of reasonable. Then the following is stated
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character but anything? which is what this say implies. Or does it by this syntax refer to ISO8601 and unnecessarily describes a detail of that other standard?
In other words, are the following valid RFC3339 strings?
2020-09-07 20:26:03.623359300+02:00
2020-09-07hey johnny20:26:03.623359300+02:00
2020-09-07💩20:26:03.623359300+02:00
Meaning the upper case T may be a lower case t. It conflicts with the ABNF, [...]
It does not. See 2.3 Terminal Values of RFC 2234:
Literal text strings are interpreted as a concatenated set of
printable characters.
NOTE: ABNF strings are case-insensitive and
the character set for these strings is us-ascii.
So it is allowed to use t here.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
Which is very confusing. Does this allow not only a space character
but anything?
This "deviation" is used for readability to the user when displayed. So when the value is displayed to the user in some kind, it can be displayed as:
2020-09-07 20:26:03.623359300+02:00
2020-09-07, 20:26:03.623359300+02:00
That way it might be easier for the user to see the clear space between the date and time, so they don't have to look for the T or t character to find the separation. It is indeed a vague sentence as it basically mean the application can do anything.
To answer your question: These listed date formats are not valid according to RFC 3339.
Short answer: T (or t as discouraged alternative).
After reading on this as much as I could, it turns out the time separator must be a T or t. What has made think this way is first of all this thread in the GNU lists where F. Alexander Njemz contacted the authors of RFC3339 Graham Klyne and Chris Newman asking if T is mandatory and got this response from Mr. Klyne:
In short: "yes"
Per section 5.5, the intent in this draft was to specify a timestamp format using
elements from and compatible with 8601, but eliminating as far as
reasonable any variations that could make timestamp data harder to
process. This includes making the 'T' mandatory in date+time values.
#g
Just for clarity's sake, this is stated in the section 5.5:
Simplicity is achieved by making most fields and punctuation
mandatory.
This clearly clashes with a non-mandatory T and strongly makes me think that the this syntax in that problematic passage refers to ISO8601 and not RFC3339.
For those who want to read more, here are some links regarding the confusion created by this specific point:
https://lists.gnu.org/archive/html/bug-coreutils/2006-05/msg00014.html
http://validator.w3.org/feed/docs/error/InvalidRFC3339Date.html
https://www.rfc-editor.org/errata/eid5783
Plus of course divergent implementations. For instance, the developers of GNU Date chose to use a space character:
$ date --rfc-3339=seconds
2020-09-14 14:53:51+02:00

What characters can des(unix) have?

All lowercase and uppercase, all digits, dot and slash.
Have I missed anything?
This seems like an very easy question found to find at Google but actually I haven't found any information about it :(
Edit, if anybody missunderstod, what characters can the OUTPUT have.
I'm not asking what kind of stuff I can hash, I'm asking what the hash looks like.
DES (and many other encryption algorithms) work on a bit level - it has no concept of what's a valid character and what isn't, the range of the output characters can be anything from 0x00 to 0xFF.
Any output to the contrary is likely just characters not supported by whatever you're trying to display the output with, which are typically replaced by some predefined character.
The output can also be converted to hex characters for cosmetic or storage purposes (I'm not sure whether the des command would do this - it's simple enough to see by just running it), e.g. a single 'a' (0x61) character will be converted to two characters: '61'. The resulting output characters would thus be in the range A-F or a-f and 0-9.
Note that keys require ASCII, but this is not a requirement of DES itself, as can be derived from "Bugs" on the same page, and it doesn't affect the range of output values.
The DES algorithm is considered obsolete and unsafe. The DES standard (FIPS 46-3) has been withdrawn in 2005.
Use at your own risk.
See http://en.wikipedia.org/wiki/Data_Encryption_Standard

Convert unicode codepoint to string character in Ruby

I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to there readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
The unicode symbols like U+2B71F are referred to as a codepoint.
The unicode system defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies etc. This character set is steadily growing.
For example, U+221E is infinity.
The codepoints are hexadecimal numbers. There is always exactly one number defined per character.
There are many ways to arrange this in memory. This is known as an encoding of which the common ones are UTF-8 and UTF-16. The conversion to and fro is well defined.
Here you are most probably looking for converting the unicode codepoint to UTF-8 characters.
codepoint = "U+2B71F"
You need to extract the hex part coming after U+ and get only 2B71F. This will be the first group capture. See this.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And you're UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

How to convert UTF8 combined Characters into single UTF8 characters in ruby?

Some characters such as the Unicode Character 'LATIN SMALL LETTER C WITH CARON' can be encoded as 0xC4 0x8D, but can also be represented with the two code points for 'LATIN SMALL LETTER C' and 'COMBINING CARON', which is 0x63 0xcc 0x8c.
More info here: http://www.fileformat.info/info/unicode/char/10d/index.htm
I wonder if there is a library which can convert a 'LATIN SMALL LETTER C' + 'COMBINING CARON' into 'LATIN SMALL LETTER C WITH CARON'. Or is there a table containing these conversions?
These conversions don't always exist. The combination of U+0063 (c) with U+030C (combining caron) can be represented as a single character, for instance, but there's no precomposed character representing a lowercase 'w' with a caron (w̌).
Nevertheless, there exist libraries which can perform this composition where possible. Look for a Unicode function called "NFC" (Normalization Form: Composition). See, for instance: http://unicode-utils.rubyforge.org/classes/UnicodeUtils.html#M000015
Generally, you use Unicode Normalization to do this.
Using UnicodeUtils.nfkc using the gem unicode_utils (https://github.com/lang/unicode_utils) should get you the specific behavior you're asking for; unicode normalization form kC will use a compatibility decomposition followed by converting the string to a composed form, if available (basically what you asked for by your example). (You may also get close to what you want with normalization form c, sometimes acronymized NFC).
How to replace the Unicode gem on Ruby 1.9? has additional details.
In Ruby 1.8.7, you'd need do gem install Unicode, for which there is a similar function available.
Edited to add: The main reason why you'll probably want normalization form kC instead of just normalization form C is that ligatures (characters that are squeezed together for historical/typographical reasons) will first be decomposed to the individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching).
String#encode can be used since Ruby 1.9. UTF-8-MAC is a variant of NFD. The codepoints in the range between U+2000 and U+2FFF, or U+F900 and U+FAFF, or U+2F800 and U+2FAFF are not decomposed. See https://developer.apple.com/library/mac/qa/qa1173/_index.html for the details. UTF-8-HFS can be also used insted of UTF-8-MAC.
# coding: utf-8
s = "\u010D"
s.encode!('UTF-8-MAC', 'UTF-8')
s.force_encoding('UTF-8')
p "\x63\xcc\x8c" == s
p "\u0063" == s[0]
p "\u030C" == s[1]

Are VHDL character substitutions ever used in real life?

VHDL allows the following substitutions, presumably because some computers might not support the vertical bar (or pipe symbol) (|) or the hash (or pound sign / number sign) (#):
case A|B can be written as case A!B
16#fff# can be written as 16:fff:
Any computer nowadays supports the vertical bar and the hash symbol, so I figured nobody uses these substitutions... Until somebody requested support for the exclamation mark.
My question: is this a lone case or are other people also using the exclamation mark as substitute for the vertical bar? Anybody using the colon?
Data point 1: Not me :)
And I've never seen it as far as I recall in any code - nor was I taught it at any point (in fact, this is the first I knew of those substitutions).
I had a quick look in Ashenden's Designer's Guide to VHDL, and the ! alternative is not even mentioned when the | is introduced for case statements.
These are inherited from Ada (in which they are obsolescent since Ada95). The Ada83 Rationale says "For portability reasons, it is possible to write any program in a 56 character subset of the ISO character set." in which ISO character set must be understood as ISO-646, aka ASCII (well ISO-646 has provision for replacing some characters for national variant, ASCII can be understood as the US national variant of ISO-646)
There is a third replacement: % can be use instead of " as string delimitor (both must be replaced).
I seem to remember that EBCDIC is using | or ! for the same code point depending on the variant.

Resources