File size in UTF-8 encoding?

I have created a file with UTF-8 encoding, but I don't understand the rules for the size it takes up on disk. Here is my complete research:
First I created the file with a single Hindi letter 'क', and the file size on Windows 7 was 8 bytes.
With two letters, 'कक', the file size was 11 bytes.
With three letters, 'ककक', the file size was 14 bytes.
Can someone please explain why it shows these sizes?

The first three bytes are the BOM (Byte Order Mark), EF BB BF.
Then the bytes E0 A4 95 encode the letter क.
Then the bytes 0D 0A encode the carriage return and line feed (the Windows line ending).
Total: 8 bytes. Each additional letter क adds three more bytes.
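The arithmetic is easy to check with `wc -c`, which counts raw bytes (a quick sketch; this assumes bash's builtin printf, which understands \x escapes):

```shell
# क is the three-byte UTF-8 sequence E0 A4 95
printf 'क' | wc -c                                # 3
# the Windows line ending CR LF is two more bytes
printf '\r\n' | wc -c                             # 2
# BOM (EF BB BF) + क (E0 A4 95) + CR LF = 3 + 3 + 2
printf '\xEF\xBB\xBF\xE0\xA4\x95\r\n' | wc -c     # 8
```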

On Linux-based systems, you can use hexdump to get a hexadecimal dump (as Tim did in his answer) and see how many bytes each character occupies:
echo -n a | hexdump -C
echo -n क | hexdump -C
Here's the output of the above two commands.
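The output was not captured above; on a typical Linux system (util-linux hexdump, with the column spacing trimmed here) it shows that 'a' takes one byte while 'क' takes three:

```shell
echo -n a | hexdump -C     # 00000000  61        |a|
echo -n क | hexdump -C     # 00000000  e0 a4 95  |...|
```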

Related

why does hexdump reverse the input when given "\xFF\x00"? [duplicate]

Why does hexdump print 00ff here? I expected it to print ff00, matching the byte order on stdin, but
$ printf "\xFF\x00" | hexdump
0000000 00ff
0000002
hexdump reversed it. Why?
This is because hexdump is dumping 16-bit words (two bytes at a time), and x86 processors store words in little-endian format (you are probably on an x86 machine).
From Wikipedia, Endianness:
A big-endian system stores the most significant byte of a word at the
smallest memory address and the least significant byte at the largest.
A little-endian system, in contrast, stores the least-significant byte
at the smallest address.
Notice that when you use hexdump without specifying any format option, the output is similar to -x.
From hexdump, man page:
If no format strings are specified, the default
display is very similar to the -x output format (the
-x option causes more space to be used between format
units than in the default output).
...
-x, --two-bytes-hex
Two-byte hexadecimal display. Display the input offset in
hexadecimal, followed by eight space-separated, four-column,
zero-filled, two-byte quantities of input data, in
hexadecimal, per line.
If you want to dump single bytes in input order, use the -C option or specify a custom format with -e.
$ printf "\xFF\x00" | hexdump -C
00000000  ff 00                                             |..|
00000002
$ printf "\xFF\x00" | hexdump -e '"%07.7_ax " 8/1 "%02x " "\n"'
0000000 ff 00
From the hexdump(1) man page:
If no format strings are specified, the default display is very similar to the -x output format
-x, --two-bytes-hex
Two-byte hexadecimal display. Display the input offset in hexadecimal, followed by eight space-separated, four-column, zero-filled, two-byte quantities of input data, in hexadecimal, per line.
On a little-endian host the most significant byte (here 0xFF) is listed last.

How to convert hex to ASCII while preserving non-printable characters

I've been experiencing some weird issues today while debugging, and I've managed to trace this to something I overlooked at first.
Take a look at the outputs of these two commands:
root@test:~# printf '%X' 10 | xxd -r -p | xxd -p
root@test:~# printf '%X' 43 | xxd -r -p | xxd -p
2b
root@test:~#
The first xxd command converts hex to ASCII. The second converts ASCII back to hex. (43 decimal = 2b hex).
Unfortunately, it seems that converting hex to ASCII does not preserve non-printable characters. For example, the raw hex value A (10 decimal = A hex) somehow gets eaten by xxd -r -p, so when I perform the inverse operation I get an empty result.
What I am trying to do is feed some data into minimodem. I need to generate Call Waiting Caller ID (FSK), effectively via bit banging. My bash script has the right bits, but if I do a hexdump, the non-printable characters are missing. minimodem seems to accept only ASCII characters, while I need to feed it raw hex, which gets eaten in the conversion. Is it possible to preserve these characters somehow? I don't see any option for it, so I'm wondering if there's a better way.
xxd expects two hex characters per byte. A lone A is invalid. Do:
printf '%02X' 10 | xxd -r -p | xxd -p
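With the %02X padding in place, both values round-trip cleanly (a quick check; xxd ships with vim):

```shell
printf '%02X' 10 | xxd -r -p | xxd -p    # 0a
printf '%02X' 43 | xxd -r -p | xxd -p    # 2b
```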
How to convert hex to ASCII while preserving non-printable characters
Use xxd. If your input has an odd number of hex digits, pad it with a leading 0.
ASCII does not preserve non-printable characters
It does preserve all bytes; xxd is the standard tool for working with binary data in the shell.
Is it possible to preserve these characters somehow?
Yes: feed xxd two hex characters per byte.

Is it possible to create a space character without using space? What is the character for this hexadecimal: 342 200 211

I came across an unusual space character the other day:
[user@server] ~ $ echo AB583 923 | od -c
0000000 A B 5 8 3 342 200 211 9 2 3 \n
0000014
[user@server] ~ $ echo AB583 923 | od -c
0000000 A B 5 8 3 9 2 3 \n
0000012
I tried to decipher it with od, but I don't understand enough about low-level data to know what this character really is. Can anyone help me find out?
Well according to this, 342\200\211 is a thin space in Unicode.
What do you mean by "create a space character without using space"?
The value shown by od -c is in octal. To identify the character, convert the three numbers from octal to hex:
342 200 211 = 0xE2 0x80 0x89
Searching for "utf8 0xE2 0x80 0x89" turns up a site showing that the UTF-8 byte sequence 0xE2 0x80 0x89 corresponds to the Unicode code point U+2009.
That code point is named THIN SPACE, which is indeed a character similar to U+0020, the ordinary space.
So yes, there are several Unicode characters that look like a space, all of them valid and all of them similar to a plain space.
I just wonder: why are you asking?
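You can reproduce the thin space from the very numbers od printed, since bash's printf accepts octal escapes (a small sketch):

```shell
# octal 342 200 211 = hex E2 80 89 = U+2009 THIN SPACE
printf '\342\200\211' | od -An -tx1    # e2 80 89
```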

Counting characters in a UTF-8 file

wc -c
appears to do only a dumb byte count, rather than interpreting characters according to the file's encoding.
How can I get the actual character count?
Use the -m or --chars option.
For example (text file contains two Korean characters and newline):
falsetru@jmlee12:~$ cat text
안녕
falsetru@jmlee12:~$ wc -c text
7 text
falsetru@jmlee12:~$ wc -m text
3 text
According to wc(1):
-c, --bytes
print the byte counts
-m, --chars
print the character counts
Don't confuse characters and bytes. A byte is 8 bits, and -c counts the bytes in your file whatever you put in it. A char in many programming languages is also 8 bits long, which is why counting bytes uses -c. If you want to count how many characters of a given alphabet a file contains, you need to specify somehow which character encoding was used, and some encodings use more than one byte per character. The wc manual tells you that -m uses your current locale (roughly, your language/charset preferences) to decode the file and count characters.
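The difference shows up with any multi-byte character (a sketch; wc -m reports the true character count only when your locale decodes the file as UTF-8):

```shell
# 'क' is one character but three UTF-8 bytes
printf 'क\n' | wc -c    # 4  (3 bytes for क + 1 for the newline)
printf 'क\n' | wc -m    # 2  (in a UTF-8 locale)
```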

Why od is not ordering bytes correctly?

Here's the output dumped from od -cx (on Linux you can reproduce it with echo -ne "\r\n\n" | od -cx):
0000000 \r \n \n \0
0a0d 000a
0000003
The first two bytes should be 0d0a, but it outputs 0a0d. Why?
Because you're on a little-endian system: a 16-bit integer is printed high byte first, and on a little-endian machine the high byte is the second byte read, so you see the second byte followed by the first.
Because your computer uses the so-called "little-endian" method to represent words in memory (the x86 processor architecture is a common example of a little-endian system).
Because od -x reads the input as 2-byte shorts rather than bytes, and on a little-endian machine the two bytes of each short appear reversed.
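You can see both views side by side with od's type options (a quick sketch; -tx1 keeps input order, while -tx2 groups bytes into 16-bit words):

```shell
printf '\r\n\n' | od -An -tx1    # 0d 0a 0a   (single bytes, input order)
printf '\r\n\n' | od -An -tx2    # 0a0d 000a  (16-bit words, byte-swapped on little-endian)
```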
