I searched "Turbo Pascal 6: The Complete Reference" for equivalents of "rset" and "lset" (the QB 4.5 statements that justify strings within the bytes of their variables) and found nothing, as if Turbo Pascal doesn't need these statements or always left-justifies strings. Is this true?
In Turbo Pascal 6.0 the string type reserves 256 bytes of memory.
s: string;
The memory layout is that the first byte (value 0..255) indicates the length of the string. The following bytes hold the characters, always left aligned.
A variation on this is that you can declare string variables with a maximum length, for example:
s: string[10];
This example will reserve 11 bytes of memory. Again the first byte indicates length of the actual string (value in this case 0..10).
The memory content after a statement like
s := 'test';
will be (assuming the unused bytes were zero beforehand; the assignment itself does not clear them)
4, 't', 'e', 's', 't', #0, #0, #0, #0, #0, #0
There is no statement to modify the character allocation to right align the character data (like RSET in QBasic).
However, in theory, you may precede a string with a series of null characters, e.g. s := #0#0#0#0#0#0 + 'test';, to achieve a similar effect, but the length byte will then equal the full declared length of the string variable.
10, #0, #0, #0, #0, #0, #0, 't', 'e', 's', 't'
In this case the null chars are part of the string, which might be problematic. A null char might be taken for the end of the string. Also, a printing procedure might omit the null chars in some cases and not in others, which could lead to misaligned data in printouts or elsewhere.
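The layout described above can be modelled outside Pascal. The following is a minimal Python sketch, not Turbo Pascal code: `short_string_bytes` is a hypothetical helper, the unused tail is shown as zero bytes purely for illustration, and padding with spaces instead of null characters is an assumption about what a right-justified field would normally contain.

```python
def short_string_bytes(text: str, max_len: int) -> bytes:
    """Model a Turbo Pascal string[max_len]: length byte, characters, unused tail."""
    data = text.encode("ascii")[:max_len]  # anything past max_len is cut off
    # The unused tail is shown as zero bytes for illustration only.
    return bytes([len(data)]) + data + bytes(max_len - len(data))


# Left-aligned, as in s := 'test' with s: string[10]
left = short_string_bytes("test", 10)

# Right-justified by padding with spaces before assigning
right = short_string_bytes("test".rjust(10), 10)

print(list(left))
print(list(right))
```

In both cases the variable occupies 11 bytes; only the length byte and the position of the characters differ.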
In Oracle SQL:
When we pass a non-ASCII character to the ASCII function, it returns some number. How can we interpret that number? If the character set is AL32UTF8, why doesn't it return the Unicode code point for the given character?
select * from nls_database_parameters where parameter = 'NLS_CHARACTERSET';--AL32UTF8
select ascii('Á') from dual;--50049
What is the meaning of this value 50049? I was expecting 193.
Here, mostly for my own education, is an explanation of the value 50049 for the accented A (code point 193 in the Unicode coded character set). I just learned how this works from reading https://www.fileformat.info/info/unicode/utf8.htm. Wikipedia has another pretty clear explanation, with an example of a three-byte encoding: https://en.wikipedia.org/wiki/UTF-8#Encoding
The computation of the encoding is derived from the code point 193 and has nothing to do with which specific character is associated with that code point.
UTF-8 uses a relatively simple scheme to encode code points up to 1,114,111 (U+10FFFF, the upper limit of Unicode).
Code points from 0 to 127 (decimal) are encoded as a single byte, equal to the code point (so the byte always has a leading zero when represented in binary).
Code points from 128 to 2047 are encoded as two bytes. The first byte, in binary representation, always begins with 110, and the second with 10. A byte that begins with 110 is always the first byte of a two-byte encoding, and a byte that begins with 10 is always a "continuation" byte (second, third or fourth) in a multi-byte encoding. These mandatory prefixes are part of the encoding scheme (the "rules" of UTF8 encoding); they are hard-coded values in the rules.
So: for code points from 128 to 2047, the encoding is in two bytes, of the exact format 110xxxxx and 10xxxxxx in binary notation. The last five digits (bits) from the first byte, plus the last six digits from the second (total: 11 bits) are the binary representation of the code point (the value from 128 to 2047 that must be encoded).
2047 = 2^11 - 1 (this is why 2047 is relevant here). The code point can be represented as an 11-bit binary number (possibly with leading zeros). Take the first five bits (after left-padding with 0 to a length of 11 bits) and attach that to the mandatory 110 prefix of the first byte, and take the last six bits of the code point and attach them to the mandatory prefix 10 of the second byte. That gives the UTF8 encoding (in two bytes) of the given code point.
Let's do that for code point 193 (decimal). In binary, padded with 0 on the left to 11 bits, that is 00011000001. So far, nothing fancy.
Split this into five bits || six bits: 00011 and 000001.
Attach the mandatory prefixes: 11000011 and 10000001.
Rewrite these in hex: \xC3 and \x81. Put them together; this is hex C381, or decimal 50049.
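The steps above can be checked with a few lines of Python. This is a minimal sketch of the two-byte case only; `encode_two_byte` is a hypothetical helper written for illustration, and Python's built-in encoder is used as the reference.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers code points 128..2047 only")
    first = 0b11000000 | (code_point >> 6)         # prefix 110 + top five bits
    second = 0b10000000 | (code_point & 0b111111)  # prefix 10 + low six bits
    return bytes([first, second])


encoded = encode_two_byte(193)
as_number = int.from_bytes(encoded, "big")  # read the two bytes as one integer

print(encoded.hex())                     # the two bytes in hex
print(as_number)                         # the value ASCII() reports in AL32UTF8
print(encoded == "Á".encode("utf-8"))    # compare with Python's own encoder
```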
See documentation: ASCII
ASCII returns the decimal representation in the database character set of the first character of char.
Binary value of character Á (U+00C1) in UTF-8 is xC381 which is decimal 50049.
193 is the code point. For UTF-8 the code point is equal to the binary representation only for characters U+0000 - U+007F (0-127). For UTF-16BE the code point is equal to the binary representation only for characters U+0000 - U+FFFF (0-65535).
Maybe you are looking for
ASCIISTR('Á')
which returns \00C1; you only have to convert the hex digits to a decimal value.
Some time ago I developed this function, which is more advanced than ASCIISTR; it also works for characters made up of multiple code points.
CREATE OR REPLACE TYPE VARCHAR_TABLE_TYPE AS TABLE OF VARCHAR2(100);
FUNCTION UNICODECHAR(uchar VARCHAR2) RETURN VARCHAR_TABLE_TYPE IS
UTF16 VARCHAR2(32000) := ASCIISTR(uchar);
UTF16_Table VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
sg1 VARCHAR2(4);
sg2 VARCHAR2(4);
codepoint INTEGER;
res VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
i INTEGER;
BEGIN
IF uchar IS NULL THEN
RETURN VARCHAR_TABLE_TYPE();
END IF;
SELECT REGEXP_SUBSTR(UTF16, '(\\[[:xdigit:]]{4})|.', 1, LEVEL)
BULK COLLECT INTO UTF16_Table
FROM dual
CONNECT BY REGEXP_SUBSTR(UTF16, '\\[[:xdigit:]]{4}|.', 1, LEVEL) IS NOT NULL;
i := UTF16_Table.FIRST;
WHILE i IS NOT NULL LOOP
res.EXTEND;
IF REGEXP_LIKE(UTF16_Table(i), '^\\') THEN
IF REGEXP_LIKE(UTF16_Table(i), '^\\D(8|9|A|B)') THEN
sg1 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
i := UTF16_Table.NEXT(i);
sg2 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
codepoint := 2**10 * (TO_NUMBER(sg1, 'XXXX') - TO_NUMBER('D800', 'XXXX')) + TO_NUMBER(sg2, 'XXXX') - TO_NUMBER('DC00', 'XXXX') + 2**16;
res(res.LAST) := 'U+'||TO_CHAR(codepoint, 'fmXXXXXX');
ELSE
res(res.LAST) := 'U+'||REGEXP_REPLACE(UTF16_Table(i), '^\\');
END IF;
ELSE
res(res.LAST) := 'U+'||LPAD(TO_CHAR(ASCII(UTF16_Table(i)), 'fmXX'), 4, '0');
END IF;
i := UTF16_Table.NEXT(i);
END LOOP;
RETURN res;
END UNICODECHAR;
Try some examples from https://unicode.org/emoji/charts/full-emoji-list.html#1f3f3_fe0f_200d_1f308
UNICODECHAR('🏴☠️')
should return
U+1F3F4
U+200D
U+2620
U+FE0F
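The surrogate-pair arithmetic used in the function above can be checked outside the database. The following Python sketch is an illustration only: `combine_surrogates` is a hypothetical helper that applies the same formula, and the UTF-16 code units are obtained from Python's own encoder rather than from ASCIISTR.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# UTF-16BE code units of U+1F3F4, the first code point in the example above
units = "\U0001F3F4".encode("utf-16-be")
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")

code_point = combine_surrogates(high, low)
print("U+%04X" % code_point)
```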
I came across this line in the book 'The Go Programming Language' on page 112.
fmt.Printf("#%-5d %9.9s %.55s\n", item.Number, item.User.Login, item.Title)
What do %9.9s and %.55s mean?
From go doc fmt:
Width is specified by an optional decimal number immediately preceding the verb. If absent, the width is whatever is necessary to represent the value. ....
For strings, byte slices and byte arrays, however, precision limits the length of the input to be formatted (not the size of the output), truncating if necessary.
Thus, %9.9s means a minimum width of 9 runes with the input truncated at 9, so the output is always exactly 9 wide. Similarly, %.55s means no minimum width but input truncated at 55, so the output is at most 55 runes.
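Python's printf-style `%` operator follows the same width and precision rules for `%s`, so the behaviour can be tried there as a rough illustration. This is a sketch in Python, not Go, and the sample values are made up.

```python
short_login = "%9.9s" % "bob"                      # padded on the left to width 9
long_login = "%9.9s" % "averyverylongusername"     # truncated to 9 characters
title = "%.55s" % ("x" * 80)                       # truncated to 55, never padded

line = "#%-5d %9.9s %.55s" % (42, "bob", "Fix typo")

print(repr(short_login))
print(repr(long_login))
print(len(title))
print(repr(line))
```

The `-` flag in `%-5d` left-justifies the number within its five columns, which is why the padding follows the digits instead of preceding them.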
I am using Ruby to parse a datastream, some parts of which are IEEE-754 floats, but I am not sure how to print these as floats. For example:
f = 0xbe80fd31 # -0.2519317
puts "%f" % f
3196124465.000000
How do I get -0.2519317?
Any time you're converting a binary byte stream to something else, you usually end up using String#unpack (and Array#pack if you're going the other way).
If you have these bytes:
bytes = [0xbe, 0x80, 0xfd, 0x31]
then you could say:
bytes.map(&:chr).join.unpack('g')
# [-0.25193169713020325]
and then unwrap the array. This:
bytes.map(&:chr).join
packs the bytes into the string:
"\xbe\x80\xfd\x31"
which is suitable for #unpack. You could also (thanks Stefan) say:
# Variations on getting the bytes into a string for `#unpack`
bytes.pack('C4').unpack('g').first
[0xbe80fd31].pack('L>').unpack('g').first
# Variations using `#unpack1`
bytes.map(&:chr).join.unpack1('g')
bytes.pack('C4').unpack1('g')
[0xbe80fd31].pack('L>').unpack1('g')
If you already have the string then you can go straight to #unpack or #unpack1.
You'll want to use 'e' instead of 'g' if your bytes are in little-endian order, and 'E' or 'G' if you actually have an eight-byte double rather than a four-byte float.
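For comparison, the same reinterpretation of bytes can be sketched with Python's `struct` module. This is an illustration, not part of the Ruby answer; `">f"` and `"<f"` are assumed here to play the roles of Ruby's `'g'` and `'e'` directives (big-endian and little-endian single-precision floats).

```python
import struct

raw = bytes([0xBE, 0x80, 0xFD, 0x31])

# Big-endian single-precision float, like Ruby's 'g'
big_endian = struct.unpack(">f", raw)[0]

# Starting from the integer instead of the individual bytes
from_int = struct.unpack(">f", (0xBE80FD31).to_bytes(4, "big"))[0]

# The same value stored in little-endian byte order, like Ruby's 'e'
little_endian = struct.unpack("<f", raw[::-1])[0]

print(big_endian)
```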
I've been looking into this but searching seems to lead to nothing.
It might be too simple to be described, but here I am, scratching my head...
Any help would be appreciated.
Verilog knows about "strings".
A single ASCII character requires 8 bits. Thus to store 8 characters you need 64 bits:
wire [63:0] string8;
assign string8 = "12345678";
There are some gotchas:
There is no End-Of-String character (like the C null-character)
The most RHS character is in bits 7:0.
Thus string8[7:0] will hold 8'h38 ("8").
To walk through a string you have to use e.g.: string[ index +: 8];
As with all Verilog vector assignments: unused bits are set to zero thus
assign string8 = "ABCD"; // the most significant bits, 63:32, are zero
You can not use two dimensional arrays:
wire [7:0] string5 [0:4]; assign string5 = "Wrong";
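The packing rules in the list above can be modelled with plain integers. The following Python sketch is only a model of the bit layout, not Verilog; `to_vector` and `part_select` are hypothetical helpers, with `part_select(v, i)` standing in for the indexed part-select `v[i +: 8]`.

```python
def to_vector(text: str, width_bits: int) -> int:
    """Pack an ASCII string into an integer, last character in bits 7:0."""
    value = int.from_bytes(text.encode("ascii"), "big")
    return value & ((1 << width_bits) - 1)  # keep only the low width_bits bits


def part_select(vector: int, index: int, width: int = 8) -> int:
    """Model of vector[index +: width]."""
    return (vector >> index) & ((1 << width) - 1)


string8 = to_vector("12345678", 64)
low_byte = part_select(string8, 0)          # the rightmost character, "8"

abcd = to_vector("ABCD", 64)
upper_half = abcd >> 32                      # bits 63:32 of the shorter string

# Walk the vector from the leftmost character to the rightmost
chars = [chr(part_select(string8, i)) for i in range(56, -1, -8)]

print(hex(low_byte), upper_half, "".join(chars))
```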
You are probably misled by a misconception about characters. There is no such thing as a character in hardware. There are only sets of bits, or codes. The only thing that converts binary codes to characters is your terminal. It interprets the codes in a certain way and forms letters for you to see. So all the printfs in C and $display calls in Verilog only send the codes to the terminal (or to a file).
The thing that converts characters to codes is your keyboard, which you also use to type in the program. The compiler then interprets your program. A Verilog compiler (like a C compiler) represents a string in double quotes as a sequence of bytes directly. Verilog, like C, uses 8-bit ASCII codes for such character strings, meaning that the code for 'a' is decimal 97, 'b' is 98, and so on. Every character is 8 bits wide, and the quoted string forms a concatenation of the bytes of the ASCII codes.
So, to answer your question, you can convert ASCII codes to characters by sending them to the terminal via $display (or a similar function) using the %s format specifier.
So, an example:
module A;
reg[8*5-1:0] hello;
reg[8*3 - 1: 0] bye;
initial begin
hello = "hello"; // 5 bytes of characters
bye = {8'd98, 8'd121, 8'd101}; // 3 bytes 'b' 'y' 'e'
$display("hello=%s bye=%s", hello, bye);
end
endmodule
My goal is to export a file with fixed-width columns. I have the following HQL:
insert overwrite table destination_table
select concat(rpad(p.artist_name,40," "),rpad(p.release_name,40," "))
from source_table p;
"destination_table" is an external table which writes to a file. When artist_name and release_name contains normal English characters, no problem, the result is the following:
paulo kuong[29 space characters]I am terribly stuck album
I got 40-character fixed-width columns. However, when the strings are not English, I got:
장재인[31 space characters]다른 누구도 아닌 너에게
That was supposed to be 37 space characters. rpad does not seem able to pad the spaces correctly. When I run "length(장재인)" it returns 3 characters, so there is something weird going on with lpad and rpad in Hive.
Any idea?
I think rpad works as expected. According to the documentation,
rpad(string str, int len, string pad)
#Returns str, right-padded with pad to a length of len
So, in your case the length of 장재인[31 space characters] should be 40.
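In short, the length of 장재인 should be 9, which matches its size in bytes when encoded as UTF-8 (three bytes per character).
I checked in Python 2, where a plain string literal is a byte string, and the length of 장재인 is indeed 9 (Python 3 counts characters and would report 3):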
>>> a = '장재인'
>>> len(a)
9
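The difference between counting characters and counting bytes can be made explicit. The following is a minimal sketch assuming Python 3 and UTF-8; `rpad_bytes` is a hypothetical helper that pads to a byte width, which reproduces the 31 spaces seen in the question, while `str.ljust` pads to a character width and gives the expected 37.

```python
name = "장재인"

char_count = len(name)                    # characters (code points)
byte_count = len(name.encode("utf-8"))    # bytes in UTF-8


def rpad_bytes(text: str, width: int, pad: bytes = b" ") -> bytes:
    """Right-pad the UTF-8 encoding of text to a total of width bytes."""
    raw = text.encode("utf-8")
    return raw + pad * max(0, width - len(raw))


padded_by_bytes = rpad_bytes(name, 40)    # 40 bytes in total
padded_by_chars = name.ljust(40)          # 40 characters in total

print(char_count, byte_count)
print(padded_by_bytes.count(b" "), padded_by_chars.count(" "))
```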