Is this a legal quoted-printable encoding? - mime

Is this a legal quoted-printable encoding?
a ==
3D b
How about this one?
a = b
the second line
I wonder whether = can occur unencoded, and whether an encoding such as =3D can be split across two lines. The RFC seems ambiguous to me.

In Quoted-Printable encoding, the = character MUST be encoded as =3D
Here is the relevant excerpt from RFC 2045:
Octets with decimal values of
33 through 60 inclusive, and 62 through 126, inclusive,
MAY be represented as the US-ASCII characters which
correspond to those octets (EXCLAMATION POINT through
LESS THAN, and GREATER THAN through TILDE,
respectively).
The = ASCII character has decimal code 61, which is exactly the value excluded from those ranges. Therefore neither of your examples is a legal Quoted-Printable encoding. The following encoding is legal:
a =3D b
the second line
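For a quick sanity check outside the RFC text, Ruby's pack/unpack "M" directive implements RFC 2045 quoted-printable (a sketch; most languages have an equivalent codec):

```ruby
# Decoding the legal form: "=3D" is the escaped "=" character.
decoded = "a =3D b".unpack1("M")
puts decoded                         # => a = b

# Encoding round-trips it: "=" must come out escaped as "=3D"
# (pack("M") also appends a trailing soft line break).
encoded = ["a = b"].pack("M")
puts encoded.start_with?("a =3D b")  # => true
```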


Are byte slices of utf8 also utf8?

Given a slice of bytes that is valid utf8, is it true that any sub-slice of such slice is also valid utf8?
In other words, given b1: [u8] that is valid utf8, can I assume that
b2 = b1[i..j] is valid utf8 for any i,j : i<j?
If not, what would be the counter-example?
what would be the counter-example?
Any code point that encodes as more than one byte. For example, π encodes in UTF-8 as the two bytes CF 80, and slicing between them produces two (separately) invalid UTF-8 strings.
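The counter-example is easy to reproduce; a quick Ruby illustration (Ruby used here in place of Rust slices):

```ruby
s = "aπb"                 # "π" is U+03C0, encoded as the two bytes CF 80
bytes = s.bytes           # => [97, 207, 128, 98]

# Slice between the two bytes of "π": both halves are invalid UTF-8.
left  = bytes[0..1].pack("C*").force_encoding("UTF-8")   # ends mid-sequence
right = bytes[2..-1].pack("C*").force_encoding("UTF-8")  # starts with a continuation byte
puts left.valid_encoding?   # => false
puts right.valid_encoding?  # => false
```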

How can I convert ASCII code to characters in Verilog language

I've been looking into this but searching seems to lead to nothing.
It might be too simple to be described, but here I am, scratching my head...
Any help would be appreciated.
Verilog knows about "strings".
A single ASCII character requires 8 bits. Thus to store 8 characters you need 64 bits:
wire [63:0] string8;
assign string8 = "12345678";
There are some gotchas:
There is no end-of-string terminator (like C's NUL character).
The rightmost character is stored in bits 7:0.
Thus string8[7:0] will hold 8'h38 ("8").
To walk through a string you have to use an indexed part-select, e.g. string8[8*index +: 8];
As with all Verilog vector assignments, unused bits are set to zero, thus:
assign string8 = "ABCD"; // MS bits 63:32 are zero
You can not assign a string to a two-dimensional (unpacked) array:
wire [7:0] string5 [0:4]; assign string5 = "Wrong"; // illegal
You are probably misled by a misconception about characters. There is no such thing as a character in hardware; there are only sets of bits, or codes. The only thing which converts binary codes to characters is your terminal: it interprets codes in a certain way and forms letters for you to see. So all the printfs in C and $display calls in Verilog only send codes to the terminal (or to a file).
The thing which converts characters to codes is your keyboard, which you also use to type in the program. The compiler then interprets your program. The Verilog (as well as the C) compiler represents the double-quoted strings you typed in directly as a set of bytes. Verilog, like C, uses 8-bit ASCII encoding for such character strings, meaning that the code for 'a' is decimal 97, 'b' is 98, and so on. Every character is 8 bits wide, and a quoted string forms a concatenation of the bytes of the ASCII codes.
So, answering your question: you can convert ASCII codes to characters by sending them to the terminal via $display (or a similar system task), using the %s format specifier.
So, an example:
module A;
  reg [8*5-1:0] hello;
  reg [8*3-1:0] bye;
  initial begin
    hello = "hello";                  // 5 bytes of characters
    bye   = {8'd98, 8'd121, 8'd101};  // 3 bytes: "b" "y" "e"
    $display("hello=%s bye=%s", hello, bye);
  end
endmodule
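The packing rule (leftmost character in the most significant byte, rightmost character in bits 7:0) can be mimicked outside Verilog; a small Ruby sketch of the same arithmetic:

```ruby
# Pack "12345678" the way Verilog packs a string literal into a vector:
# the first character lands in the most significant byte.
word = "12345678".bytes.inject(0) { |acc, b| (acc << 8) | b }

low_byte = word & 0xFF                         # Verilog's string8[7:0]
printf("0x%02X %s\n", low_byte, low_byte.chr)  # => 0x38 8
```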

Convert UTF-16 to code-page and remove unicode text direction control characters?

Short version
Given: 1/16/2006 2∶30∶11 ᴘᴍ
How to get: 1/16/2006 2:30:11 PM
rather than: ?1/?16/?2006 ??2:30:11 ??
Background
I have an example Unicode (UTF-16) encoded string:
U+200e U+0031 U+002f U+200e U+0031 U+0036 U+002f U+200e U+0032 U+0030 U+0030 U+0036 U+0020 U+200f U+200e U+0032 U+2236 U+0033 U+0030 U+2236 U+0031 U+0031 U+0020 U+1d18 U+1d0d
[LTR] 1 / [LTR] 1 6 / [LTR] 2 0 0 6 [RTL] [LTR] 2 ∶ 3 0 ∶ 1 1 ᴘ ᴍ
In a slightly easier-to-read form, that is:
LTR1/LTR16/LTR2006 RTLLTR2∶30∶11 ᴘᴍ
The actual final text, as you're supposed to see it, is: 1/16/2006 2∶30∶11 ᴘᴍ
I currently use the Windows function WideCharToMultiByte to convert the UTF-16 to the local code-page:
WideCharToMultiByte(CP_ACP, 0, text, length, NULL, 0, NULL, NULL);
and when I do, the text comes out as:
?1/?16/?2006 ??2:30:11 ??
I don't control the presence of the Unicode text-direction markers; it's a security thing. But obviously when I'm converting the Unicode to (for example) ISO-8859-1, those characters are irrelevant, make no sense, and I would hope can be dropped.
Is there a Windows function (e.g. FoldString, WideCharToMultiByte) that can be instructed to drop these non-mappable, non-printable characters?
1/16/2006 2∶30∶11 ᴘᴍ
That gets us close
If a function did that (dropped the non-printing characters that don't have a representation in the target code-page), we would get:
1/16/2006 2∶30∶11 ᴘᴍ
When converted to ISO-8859-1, it becomes:
1/16/2006 2?30?11 ??
That's because some of those characters don't map exactly into ISO-8859-1:
1/16/2006 2 U+2236 30 U+2236 11 U+1D18 U+1D0D
1/16/2006 2 [RATIO] 30 [RATIO] 11 [SMALL CAPITAL P][SMALL CAPITAL M]
But when you see them, it doesn't seem unreasonable that they could be best-fit mapped into:
Original: 1/16/2006 2∶30∶11 ᴘᴍ
Mapped: 1/16/2006 2:30:11 PM
Is there a function that can do that?
I'm happy to suffer with:
1/16/2006 2?30?11 ??
But I really need to fix:
?1/?16/?2006 ??2:30:11 ??
Unicode has the notion
Unicode already has the notion of what "fancy" character you can replace with what "normal" character.
U+00BA º → o (masculine ordinal indicator → small Latin letter o)
U+FF0F ／ → / (fullwidth solidus → solidus)
U+00BC ¼ → 1/4 (vulgar fraction one quarter)
U+2033 ″ → ′′ (double prime → two primes)
U+FE64 ﹤ → < (small less-than sign → less-than sign)
I know these are technically for a different purpose. But there is also the general notion of a mapping list (which again is for a different purpose).
Microsoft SQL Server, when asked to insert a Unicode string into a non-Unicode varchar column, does an even better job:
Is there a mapping list for the purpose of unicode best-fit?
Because the reality is that it just makes a mess for users:
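I'm not aware of a WideCharToMultiByte flag that drops the bidi controls, but they can be stripped in a pre-pass before conversion. A minimal Ruby sketch of that idea (Ruby standing in for the Win32 code; the "?" replacement mirrors WideCharToMultiByte's default behavior):

```ruby
# The example string: LRM/RLM marks plus RATIO and small-capital letters.
s = "\u200E1/\u200E16/\u200E2006 \u200F\u200E2\u223630\u223611 \u1D18\u1D0D"

# Drop the direction-control characters (LRM U+200E, RLM U+200F, and the
# embedding controls U+202A..U+202E) before converting.
cleaned = s.delete("\u200E\u200F\u202A-\u202E")

# Convert to the single-byte code page; unmappable characters become "?".
latin1 = cleaned.encode("ISO-8859-1",
                        invalid: :replace, undef: :replace, replace: "?")
puts latin1   # => 1/16/2006 2?30?11 ??
```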

Ruby Cyphering Leads to non Alphanumeric Characters [duplicate]

This question already has answers here:
Rotating letters in a string so that each letter is shifted to another letter by n places
(4 answers)
Closed 5 years ago.
I'm trying to make a basic cipher.
def caesar_crypto_encode(text, shift)
  (text.nil? || text.strip.empty?) ? "" : text.gsub(/[a-zA-Z]/) { |cstr| (cstr.ord + shift).chr }
end
but when the shift is too high I get these kinds of characters:
Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")
Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
What is this format?
The reason you get that output is that Ruby is running with UTF-8 encoding, and your conversion has produced gibberish characters (an invalid byte sequence under UTF-8 encoding).
ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z is 97-122. When you add 127 you push all the characters into 8-bit space, which makes them unrecognizable for proper UTF-8 encoding.
That's why Ruby's inspect outputs the encoded string in quoted form, showing each invalid byte as its hexadecimal number, "\xC7...".
If you want to get some semblance of characters out of this, you could re-encode the gibberish into ISO8859-1, which supports 8-bit characters.
Here's what you get if you do that:
s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>
# Re-encode as ISO8859-1.
# Your terminal (and Ruby) is using UTF-8, so Ruby will refuse to print these yet.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
# In order to be able to print ISO8859-1 on an UTF-8 terminal, you have to
# convert them back to UTF-8 by re-encoding. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"
# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"
Of course the proper way to do this is not to let the numbers overflow into 8-bit space under any circumstances. Your encryption algorithm has a bug: you need to ensure that the output stays in the 7-bit ASCII range.
A better solution
Like #tadman suggested, you could use tr instead:
AZ_SEQUENCE = [*'A'..'Z', *'a'..'z']
"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL TLOIA!"
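For completeness, the overflow can also be avoided by wrapping within each alphabet using modular arithmetic; a sketch (the method name is my own):

```ruby
# A shift that wraps within A-Z / a-z instead of overflowing into
# 8-bit space (helper name is my own, not from the question):
def caesar_encode(text, shift)
  return "" if text.nil? || text.strip.empty?
  text.gsub(/[a-zA-Z]/) do |c|
    base = c.ord < 97 ? 65 : 97                 # "A".ord or "a".ord
    (base + (c.ord - base + shift) % 26).chr
  end
end

puts caesar_encode("Hello world!", 127)  # => Ebiil tloia!
```

Unlike the 52-character rotate, this keeps each letter in its own case, so a shift of 127 (≡ 23 mod 26) gives "Ebiil tloia!" rather than "eBIIL TLOIA!".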
I'm still curious about that format though...
Those characters are Ruby's representation of the bytes produced by taking the ordinal (ord) of each letter and adding 127 to it (i.e. ((cstr.ord)+shift).chr).
Why? Check Integer#chr, from the docs:
Returns a string containing the character represented by the int's
value according to encoding.
So, for example, take your first letter "H":
char_ord = "H".ord
#=> 72
new_char_ord = char_ord + 127
#=> 199
new_char_ord.chr
#=> "\xC7"
So, 199 corresponds to "\xC7". Apply the same shift to every character in "Hello world" and you get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".
To avoid this you need to map only onto ord values that represent a letter (see the answers in the duplicate link).

Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

I am having a very difficult time with this:
# contained within:
"MA\u008EEIKIAI"
# should be
"MAŽEIKIAI"
# nature of string
$ p string3
"MA\u008EEIKIAI"
$ puts string3
MAEIKIAI
$ string3.inspect
"\"MA\\u008EEIKIAI\""
$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>
Any ideas on where to start?
Note: this is not a duplicate of my previous question.
\u008E means that the unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at the codepoint u017d. However it is at position 8e in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.
Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:
string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '') # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode('UTF-8') # convert to the desired encoding
Note that this isn't a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to such a conversion. Some will be two bytes c2 xx where xx is the correct byte (as in this case); others will be c3 yy where yy is a different byte.
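Put together as a runnable sketch (assuming the string really is the UTF-8 mangling described above):

```ruby
string3 = "MA\u008EEIKIAI"        # U+008E where CP-1252's 8E (Ž) was meant

fixed = string3.dup
fixed.force_encoding('BINARY')    # treat the string just as bytes for now
fixed.gsub!(/\xC2/n, '')          # remove the C2 byte of the UTF-8 pair C2 8E
fixed.force_encoding('CP1252')    # byte 8E is Ž in CP-1252
fixed = fixed.encode('UTF-8')     # convert to the desired encoding
puts fixed                        # => MAŽEIKIAI
```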
What about using Regexp & String#pack to convert the Unicode escape?
str = "MA\\u008EEIKIAI"
puts str #=> MA\u008EEIKIAI
str.gsub!(/\\u(.{4})/) do |match|
[$1.to_i(16)].pack('U')
end
puts str #=> MAEIKIAI (U+008E is an invisible control character)
