I just have a question to make sure I understand something correctly.
If I use my computer to sum 10+11, which is 21, it will usually store 21 in a byte, such as 0001 0101. However, when it prints it on the screen, it will actually represent it as the two digit characters 2 (0110010) and 1 (0110001), appended to each other to form "21", using ASCII.
Is that right?
Thank you!
That is correct.
The representation of characters in a simple terminal is ASCII, where each character is represented by a (technically 7-bit) code.
Some terminals support more complex encodings like UTF-8, but since UTF-8 is backwards compatible with ASCII you need not worry about it.
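A quick way to see that distinction (a minimal Python sketch of my own, not part of the original exchange):

    value = 10 + 11                        # the integer 21, stored in binary as 0b10101
    print(bin(value))                      # 0b10101

    text = str(value)                      # the characters '2' and '1' used for display
    for ch in text:
        print(ch, ord(ch), bin(ord(ch)))   # 2 50 0b110010, then 1 49 0b110001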
I was solving the problem below, and in the first line of its solution I read this:
Can anyone help me by explaining "assume char set is ASCII"? **I don't want another solution for this problem; I just want to understand that statement.**
Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?
Thanks in advance for the help.
There is no text but encoded text.
Text is a sequence of "characters", members of a character set. A character set is a one-to-one mapping between a notional character and a non-negative integer, called a codepoint.
An encoding is a mapping between a codepoint and a sequence of bytes.
Examples:
ASCII, 128 codepoints, one encoding
OEM437, 256 codepoints, one encoding
Windows-1252, 251 codepoints, one encoding
ISO-8859-1, 256 codepoints, one encoding
Unicode, 1,114,112 codepoints, many encodings: UTF-8, UTF-16, UTF-32,…
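As a small illustration of the character set / encoding distinction (a Python sketch; the character 'é' is my own example), the same character maps to different byte sequences under different encodings:

    ch = 'é'                              # U+00E9, codepoint 233
    print(ch.encode('utf-8'))             # b'\xc3\xa9'  (two bytes)
    print(ch.encode('windows-1252'))      # b'\xe9'      (one byte)
    print(ch.encode('iso-8859-1'))        # b'\xe9'      (one byte)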
When you receive a byte stream or read a file that represents text, you have to know the character set and encoding. Conversely, when you send a byte stream or write a file that represents text, you have to let the receiver know the character set and encoding. Otherwise, you have a failed communication.
Note: Program source code is almost always text files. So, this communication requirement also applies between you, your editor/IDE and your compiler.
Note: Program console input and output are text streams. So, this communication requirement also applies between the program, its libraries and your console (shell). Run locale (Unix) or chcp (Windows) to find out what the encoding is.
Many character sets are a superset of ASCII, and some encodings map the same characters to the same byte sequences. This causes a lot of confusion, limits learning, promotes the use of poor terminology, and the partial interoperability leads to buggy code. A deliberate approach to specifications and coding eliminates that.
Examples:
Some people say "ASCII" when they mean the common subset of characters between ASCII and the character set they are actually using. In Unicode and elsewhere this is called C0 Controls and Basic Latin.
Some people say "ASCII Code" when they just mean codepoint or the codepoint's encoded bytes (or code units).
The context of your question is unclear but the statement is trying to say that the distinct characters in your data are in the ASCII character set and therefore their number is less than or equal to 128. Due to the similarity between character sets, you can assume that the codepoint range you need to be concerned about is 0 to 127. (Put comments, asserts or exceptions as applicable in your code to make that clear to readers and provide some runtime checking.)
What this means in your programming language depends on the programming language and its libraries. Many modern programming languages use UTF-16 to represent strings and UTF-8 for streams and files. Programs are often built with standard libraries that account for the console's encoding (actual or assumed) when reading or writing from the console.
So, if your data comes from a file, you must read it using the correct encoding. If your data comes from a console, your program's standard libraries will possibly change encodings from the console's encoding to the encoding of the language's or standard library's native character and string datatypes. If your data comes from a source code file, you have to save it in one specific encoding and tell the compiler what that is. (Usually, you would use the default source code encoding assumed by the compiler because that generally doesn't change from system to system or person to person.)
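For instance, in Python (a minimal sketch; the filename and encoding are assumptions for illustration), reading a text file means naming the encoding explicitly or relying on a default you are sure of:

    # Decode the file's bytes as UTF-8; the wrong encoding gives mojibake
    # or a UnicodeDecodeError rather than a warning.
    with open('data.txt', encoding='utf-8') as f:
        text = f.read()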
The "additional" data structures bit probably refers to what a language's standard libraries provide, such as list, map or dictionary. Use what you've been taught so far, like maybe just an array. Of course, you can just ask.
Basically, assume that character codes will be within the range 0-127. You won't need to deal with crazy accented characters.
More than likely, though, they won't use many, if any, codes below 32, since those are mostly non-printable characters.
Characters such as 'a' 'b' '1' or '#' are encoded into a binary number when stored and used by a computer.
e.g.
'a' = 1100001
'b' = 1100010
There are a number of different standards that you could use for this encoding. ASCII is one of those standards. The other most common standard is called UTF-8.
Not all characters can be encoded by all standards. ASCII has a much more limited set of characters than UTF-8. As such, an encoding also defines the set of characters (the "char set") that it supports.
ASCII encodes each character into a single byte. It supports the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, a small number of familiar symbols, and a number of control characters that were used in early communication protocols.
The full set of characters supported by ASCII can be seen here: https://en.wikipedia.org/wiki/ASCII
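A short Python sketch of that single-byte property (my own illustration): the 'ascii' codec yields exactly one byte per character and refuses anything outside the 128-character set.

    print('a'.encode('ascii'))    # b'a'  -> one byte, value 97 (0b1100001)
    print('b'.encode('ascii'))    # b'b'  -> one byte, value 98 (0b1100010)

    try:
        '½'.encode('ascii')       # ½ is not in the ASCII character set
    except UnicodeEncodeError as e:
        print(e)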
I'm still confused about bits and bytes, although I've been searching the internet. Is one ASCII character = 1 byte = 8 bits? So 8 bits have 256 unique patterns, which covers all the ASCII codes; in what form is it stored on our computer?
And if I typed "Hello", does that mean it consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have slightly fewer than 256 characters. And there are others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example, in C, char is simply an integer type whose size matches the character encoding used by that C implementation. (C also calls this a "byte", and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte"; if they want to be exceedingly unambiguous, they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10 or 20, or ??? bytes.
Also, in the source code of a specific language, a literal string like that might be "null-terminated", in which case you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Or, if the language and the libraries used with it are sufficiently high level, you might never need to know such internals.
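A quick illustration of those 5/10/20-byte figures (a Python sketch of my own, not the answerer's):

    s = "Hello"
    print(len(s))                       # 5 characters (codepoints)
    print(len(s.encode('utf-8')))       # 5 bytes
    print(len(s.encode('utf-16-le')))   # 10 bytes (2 per code unit, no BOM)
    print(len(s.encode('utf-32-le')))   # 20 bytes (4 per codepoint, no BOM)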
I want to print the value '½' in a file. I searched for the ASCII value of this, since Alt+(ASCII value) will give you the same symbol. To my surprise, I found two ASCII values for this symbol on various sites: one is 171 and the other is 189.
I tried to write this symbol using both 171 and 189. Again to my surprise, on Windows 171 gives me this symbol, but on UNIX 189 gives me this symbol.
I was aware that there can't be two ASCII values for the same symbol, yet I got two valid codes for it on different operating systems. So can anyone tell me what the real ASCII code for the symbol ½ is?
½ is not a character in the ASCII character set.
The values you're finding online probably differ because they're using different character sets. For example, before Unicode was invented, localized versions of Windows all used different code pages, in which the basic ASCII set was extended with some additional characters.
Now, of course, everything is (or should be) fully Unicode. Detailed Unicode information for that character (vulgar fraction one half) can be found here. Note that there are also multiple representations for the same numerical value (e.g., base 10, hex, binary, etc.).
In Windows if you use the ALT codes,
3 digits will insert the equivalent "Code page 850" character
so ALT + 171 will insert the ½ symbol
189 (0xBD) is the ANSI/Windows-1252/ISO-8859-1 value for the ½ symbol, and also its Unicode code point (U+00BD); in UTF-8 it is encoded as the two bytes 0xC2 0xBD.
To use ALT codes for ANSI you MUST press 0 first,
so ALT + 0189 inserts the ½ symbol
Please read the ASCII wikipedia page. You'll learn that ASCII has no "one half" character.
These days, most systems can be configured to use UTF-8 encoding (which is the "default" or at least the most commonly used encoding on the Web and on Unix systems).
UTF-8 is a variable-length encoding for Unicode, so many characters or glyphs are represented by several bytes. For ½ (officially the VULGAR FRACTION ONE HALF Unicode character), its UTF-8 encoding is the two hex bytes 0xC2 0xBD, so in C notation \302\275.
I am using the Linux Gnome Character Map utility gucharmap to find all that.
You might be interested in UTF-32 (a fixed-length encoding using 32-bit characters, in which ½ is represented by 0x000000BD), or even UTF-16, in which many characters, but not all, are 16 bits (in particular, ½ is 0x00BD, i.e. a single 16-bit code unit in UTF-16). You may also be interested in wide characters, i.e. the wchar_t of recent C and C++ standards (which is UTF-16 on Windows and UTF-32 on many Unix systems).
FWIW, Qt uses QChar as UTF-16 (Java also has a UTF-16 char ...), but Gtk uses UTF-8, i.e. variable-length characters.
Notice that with variable-length character encodings like UTF-8, getting the N-th character (which is not the N-th byte!) in a string requires scanning the string. Also, some byte combinations are not valid UTF-8.
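A small Python check of those byte values (my own sketch, not part of the original answer):

    half = '\u00bd'                   # ½, VULGAR FRACTION ONE HALF
    encoded = half.encode('utf-8')
    print(encoded)                    # b'\xc2\xbd'
    print([oct(b) for b in encoded])  # ['0o302', '0o275'] -> C notation \302\275

    try:
        b'\xbd'.decode('utf-8')       # a lone continuation byte is not valid UTF-8
    except UnicodeDecodeError as e:
        print(e)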
As others have pointed out: it's not in the ASCII table (values 0..127).
But it has a Unicode code of:
½ U+00BD Vulgar Fraction One Half
It can also be put into text using the Unicode character U+2044 FRACTION SLASH,
where your text contains the three code points 1, U+2044, 2,
but it gets rendered as 1⁄2.
This has the virtue of working for any fractions:
1⁄2
3⁄5
22⁄7
355⁄113
355⁄113 - 1⁄3748629
It is part of the extended (8-bit) character set that Windows ALT codes draw on (the OEM code page), even though it is not in the 7-bit ASCII table:
In Windows, ensure 'NumLock' is on then try [ALT + (NumPAD)171] = ½.
For ¼ use [ALT + 172]
What is the maximum number of bytes for a single UTF-8 encoded character?
I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.
Could someone please confirm the maximum number of bytes for a single UTF-8 encoded character?
The maximum number of bytes per character is 4, according to RFC 3629, which limited the character table to U+10FFFF:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
(The original specification allowed for up to six byte character codes for code points past U+10FFFF.)
Characters with a code less than 128 will require 1 byte only, and the next 1920 character codes require 2 bytes only. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.
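As a hedged illustration of those length classes (a Python sketch; the sample characters are my own choice):

    for ch in 'A', '\u00bd', '\u20ac', '\U0001f600':      # A, ½, €, 😀
        print(f'U+{ord(ch):04X} -> {len(ch.encode("utf-8"))} byte(s)')
    # U+0041 -> 1, U+00BD -> 2, U+20AC -> 3, U+1F600 -> 4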
Without further context, I would say that the maximum number of bytes for a character in UTF-8 is
answer: 6 bytes
The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC 2279. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC 3629, which limits UTF-8 to encoding 21 bits, which can be handled with an encoding scheme using four bytes.
answer if covering all unicode: 4 bytes
But in Java <= v7, they talk about a 3-byte maximum for representing Unicode with UTF-8. That's because the original Unicode specification only defined the Basic Multilingual Plane (BMP), i.e. it is an older version of Unicode, or a subset of modern Unicode. So
answer if representing only the original Unicode, the BMP: 3 bytes
But, the OP talks about going the other way. Not from characters to UTF-8 bytes, but from UTF-8 bytes to a "String" of bytes representation. Perhaps the author of the accepted answer got that from the context of the question, but this is not necessarily obvious, so may confuse the casual reader of this question.
Going from UTF-8 to native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3 will represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages like Python <= 2 and Java represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent extended unicode (not BMP). Either way that's still 4 bytes maximum.
answer if going UTF-8 -> native encoding: 4 bytes
So, final conclusion, 4 is the most common right answer, so we got it right. But, mileage could vary.
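A quick check of the surrogate-pair point above (a Python sketch; the emoji is my own example):

    ch = '\U0001f600'                     # 😀, a codepoint outside the BMP
    print(len(ch))                        # 1 codepoint in a Python 3 str
    print(len(ch.encode('utf-16-le')))    # 4 bytes: a UTF-16 surrogate pair
    print(len(ch.encode('utf-8')))        # 4 bytes in UTF-8 as well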
The maximum number of bytes to support US-ASCII, a standard English alphabet encoding, is 1. But limiting text to English is becoming less desirable or practical as time goes by.
Unicode was designed to represent the glyphs of all human languages, as well as many kinds of symbols, with a variety of rendering characteristics. UTF-8 is an efficient encoding for Unicode, although still biased toward English. UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction.
While the maximum number of bytes per UTF-8 character is 3 for supporting just the 2-byte address space of Plane 0, the Basic Multilingual Plane (BMP), which can be accepted as minimal support in some applications, it is 4 for supporting all 17 current planes of Unicode (as of 2019). It should be noted that many popular emoji characters live outside the BMP, in Plane 1 (the Supplementary Multilingual Plane), and they require 4 bytes.
However, this is just for basic character glyphs. There are also various modifiers, such as accents that appear over the previous character, and it is also possible to link together an arbitrary number of code points to construct one complex "grapheme". In real-world programming, therefore, using or assuming a fixed maximum number of bytes per character will likely eventually cause a problem for your application.
These considerations imply that UTF-8 character strings should not be "expanded" into arrays of fixed length prior to processing, as has sometimes been done. Instead, programming should be done directly, using string functions specifically designed for UTF-8.
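A small example of the combining-character point (a Python sketch of my own): the single visible grapheme é can be one codepoint or two, and the UTF-8 byte counts differ accordingly.

    import unicodedata

    precomposed = '\u00e9'        # é as a single codepoint (U+00E9)
    combining   = 'e\u0301'       # 'e' followed by COMBINING ACUTE ACCENT

    print(precomposed == combining)                                 # False: different codepoints
    print(unicodedata.normalize('NFC', combining) == precomposed)   # True after normalization
    print(len(precomposed.encode('utf-8')))                         # 2 bytes
    print(len(combining.encode('utf-8')))                           # 3 bytes, same visible grapheme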
Considering just the technical limitations, it's possible to have up to 7 bytes following the current UTF-8 encoding scheme. According to it, if the first byte is not a self-sufficient ASCII character, then it must follow the pattern 1(n)0X(7-n), where n <= 7.
Theoretically it could even be 8, but then the first byte would have no zero bit at all. While other aspects, like continuation bytes differing from leading bytes, would still be there (allowing error detection), I have heard that the byte 11111111 may be invalid, but I can't be sure about that.
The limitation to a maximum of 4 bytes is most likely there for compatibility with UTF-16, which I tend to consider a legacy format, because the only quality in which it excels is processing speed, and only if the string byte order matches (i.e. we read 0xFEFF in the BOM).
I want to encrypt some info for a licensing system and I want the result to be able to be typed in by the user.
Update: This operation must be reversible (decrypt-able)
E.g.,
Encrypt ( ComputerID+ProductID) -> (any standard ASCII character that can be typed. Ideally maybe even just A-Z).
So far what I did was to convert the encrypted text to HEX (so it's any character from 0-F) but that doubles the number of characters.
I'm using VB6.
I'm thinking I'd do some operation on each pair (Input$(x), Key$(x)) and then do a MOD to keep it within a range of ASCII values (maybe 0-9, A-Z).
Any suggestions of a good algorithm?
Look into Base64 "encryption."
Base64 will convert a number into 64 different ASCII characters, versus hex, which uses only 16 different ASCII characters, making Base64 more compact and what you are looking for.
EDIT:
Code to do this in VB6 is available here: http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Per Fuzzy Lollipop, below, Base32 looks like an even better option. Bonus points if you can find an example of that.
EDIT: I found an example of Base32 for VB6 although I've not tried it yet. -Clay
Encode the encrypted bytes in hex, Base32, or Base64.
Do you want this to be reversible -- to recover the IDs from the encrypted text? If so, then it matters how you combine the key and input strings.
Usually you'd XOR each byte pair (work with byte arrays to avoid Unicode issues), cycling through the key string if it's shorter than the input. You can then use Base-N encoding (Base32, Base64, etc.) to generate the license string.
Both operations are reversible: you can recover the XORed strings from the Base N string, then XOR with the key again to get the original IDs.
If you don't care about reversing the operations, then any convolution of key and ID will do. XOR is just the simplest.
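A minimal sketch of that XOR-plus-Base32 idea, in Python rather than VB6 (the key and IDs are placeholders of my own; this is lightweight obfuscation, not real cryptography):

    import base64
    from itertools import cycle

    def xor_bytes(data: bytes, key: bytes) -> bytes:
        # XOR each data byte with the key, repeating the key as needed.
        return bytes(d ^ k for d, k in zip(data, cycle(key)))

    key = b'secret-key'                 # hypothetical key
    plain = b'COMPUTERID+PRODUCTID'     # hypothetical ComputerID+ProductID

    token = base64.b32encode(xor_bytes(plain, key)).decode('ascii')
    print(token)                        # typeable: only A-Z, 2-7 and '=' padding

    # Reverse: Base32-decode, then XOR with the same key again.
    recovered = xor_bytes(base64.b32decode(token), key)
    print(recovered == plain)           # True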