Why is my code behaving this way when I dereference *str + i? I know that, if I were going to try to print each character of the string one-by-one, I would have done str[i] rather than *str + i, but I wanted to see what happened here.
Is the computer recognizing that 'A' is the first letter, finding the memory location of 'A', and then just going up through the ASCII table? It almost seems as if there is just one place in the computer where the letter 'A' is stored, I found it, and then, because a char is one byte, the for loop just went through the rest of the ASCII table.
Thank you!
input:
char *str1 = "Abc";
for (int i = 0; i < 30; i++)
{
printf("letter: %c - ", *str1 + i);
printf("memory address: %p", &str1 + i);
printf("\n");
}
output:
letter: A - memory address: 0x7ffea8ea9510
letter: B - memory address: 0x7ffea8ea9518
letter: C - memory address: 0x7ffea8ea9520
letter: D - memory address: 0x7ffea8ea9528
letter: E - memory address: 0x7ffea8ea9530
letter: F - memory address: 0x7ffea8ea9538
letter: G - memory address: 0x7ffea8ea9540
letter: H - memory address: 0x7ffea8ea9548
letter: I - memory address: 0x7ffea8ea9550
letter: J - memory address: 0x7ffea8ea9558
letter: K - memory address: 0x7ffea8ea9560
letter: L - memory address: 0x7ffea8ea9568
letter: M - memory address: 0x7ffea8ea9570
letter: N - memory address: 0x7ffea8ea9578
letter: O - memory address: 0x7ffea8ea9580
etc. etc. etc.
When you write
char *str1 = "ABC";
memory looks like this:
+---+---+---+---+
| A | B | C | \0|
+---+---+---+---+
^
|
+---+
| | str1
+---+
With that in mind, what does
*str1 + i
do? Well, C interprets this to mean "give me the character pointed at by str1, then add i to it." Since str1 points to the first character of the string, the value of *str1 is 'A'. Adding i to 'A' then advances through the alphabet one letter per iteration, which is why you see A, then B, then C, etc.
On the other hand, when you write
&str1 + i
C interprets this to mean "give me the address of the variable str1, then shift forward i positions." So, for example, &str1 + 0 is the address of the str1 pointer, &str1 + 1 is the address where a char * one position past str1 would live, etc. Since &str1 has type char **, each step moves by sizeof(char *), which is 8 bytes on your system; that is why the printed addresses go up by 8 each iteration. But none of those addresses, other than &str1 + 0, actually refers to anything.
If you want to see the addresses of the array elements, just write str1 + i. That means "go where str1 points, then advance forward i positions."
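Putting it together, a minimal corrected version of your loop (just a sketch, stopping at the terminating '\0') prints each character together with its own address:

#include <stdio.h>

int main(void)
{
    char *str1 = "Abc";

    /* str1[i] (equivalently *(str1 + i)) reads the i-th character,
     * and str1 + i is that character's address. */
    for (int i = 0; str1[i] != '\0'; i++)
    {
        printf("letter: %c - ", str1[i]);
        printf("memory address: %p", (void *)(str1 + i));
        printf("\n");
    }
    return 0;
}

Note that consecutive addresses now differ by 1, the size of a char, rather than by 8.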
This is the first time I need to work with raw data (different endianness, two's complement, ...), so I have finally figured out how to work with the bytes type.
I need to implement the following checksum algorithm. I understand the C code, but wonder how to do this gracefully in Python 3.
I'm sure I could come up with something that works, but it would probably be terribly inefficient or unreliable.
The checksum algorithm used is the 8-bit Fletcher algorithm. This algorithm works as follows:
Buffer[N] is an array of bytes that contains the data over which the checksum is to be calculated.
The two values CK_A and CK_B are 8-bit unsigned integers, only! If implementing with larger-sized integer values, make sure to mask both
CK_A and CK_B with the value 0xff after both operations in the loop.
After the loop, the two U1 values contain the checksum, transmitted after the message payload, which concludes the frame.
CK_A = 0, CK_B = 0
For (I = 0; I < N; I++)
{
    CK_A = CK_A + Buffer[I]
    CK_B = CK_B + CK_A
}
My data structure is as follows:
source = b'\xb5b\x01<#\x00\x01\x00\x00\x00hUX\x17\xdd\xff\xff\xff^\xff\xff\xff\xff\xff\xff\xff\xa6\x00\x00\x00F\xee\x88\x01\x00\x00\x00\x00\xa5\xf5\xd1\x05d\x00\x00\x00d\x00\x00\x00j\x00\x00\x00d\x00\x00\x00\xcb\x86\x00\x00\x00\x00\x00\x007\x01\x00\x00\xcd\xa2'
I came up with a couple of ideas on how to do this but ran into issues.
The following is where I am now; I've added comments on how I thought it would work (but it doesn't).
for b in source[5:-2]:
    # The following results in "TypeError("can't concat int to bytes")"
    # So I take one element of a byte, then I would expect to get a single byte.
    # However, I get an int.
    # Should I convert the left part of the operation to an int first?
    # I suppose I could get this done in a couple of steps but it seems this can't be the "correct" way...
    CK_A[-1:] += b
    # I hoped the following would work as a bitmask,
    # (by keeping only the last byte) thus "emulating" an uint8_t
    # Might not be the correct/best assumption...
    CK_A = CK_A[-1:]
    CK_B[-1:] += CK_A
    CK_B = CK_B[-1:]
ret = CK_A + CK_B
Clearly, I do not completely grasp how the bytes type works or how it should be used.
Seems I was making things too difficult...
CK_A = 0
CK_B = 0
for b in source:
    CK_A += b
    CK_B += CK_A
    CK_A %= 0x100
    CK_B %= 0x100
ret = int.to_bytes(CK_A, 1, 'big') + int.to_bytes(CK_B, 1, 'big')
The %= 0x100 works like a bit mask, keeping only the 8 least significant bits (for non-negative values it is equivalent to & 0xFF), which emulates the 8-bit unsigned wrap-around.
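For comparison, here is a minimal C sketch of the same loop (the function name and parameters are my own, not the original C you were translating from); with uint8_t accumulators the additions wrap at 256, which is exactly the "mask with 0xff" step from the description:

#include <stdint.h>
#include <stddef.h>

/* 8-bit Fletcher checksum over buf[0..len-1]. */
static void fletcher8(const uint8_t *buf, size_t len,
                      uint8_t *ck_a, uint8_t *ck_b)
{
    uint8_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (uint8_t)(a + buf[i]);
        b = (uint8_t)(b + a);
    }
    *ck_a = a;
    *ck_b = b;
}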
Say we have the number 0x1234.
In big-endian:
low address -----------------> high address
0x12 | 0x34
In little-endian:
low address -----------------> high address
0x34 | 0x12
We can see the following function in Go's binary.go:
func (bigEndian) PutUint16(b []byte, v uint16) {
_ = b[1] // early bounds check to guarantee safety of writes below
b[0] = byte(v >> 8)
b[1] = byte(v)
}
I downloaded the Go source for both the x86 and PowerPC architectures from https://golang.org/dl/ (for example go1.12.7.linux-ppc64le.tar.gz for Linux ppc64le) and found the same definition.
Now let's see what happens in this function.
If cpu is little endian, we store 0x1234 in memory like this:
low address -----------------> high address
0x34 | 0x12
v >> 8 shifts right by 8 bits (i.e. divides by 2^8), so we get this in memory:
low address -----------------> high address
0x12 | 0x00
byte(v >> 8) gives us the byte 0x12, which is at the low address -> b[0]
byte(v) gives us the byte 0x34 -> b[1]
so we get the result, which I think is right:
[0x12,0x34]
=====================================
If cpu is big endian, we store 0x1234 in memory like this:
low address -----------------> high address
0x12 | 0x34
v >> 8 shifts right by 8 bits (i.e. divides by 2^8), so we get this in memory:
low address -----------------> high address
0x00 | 0x12
byte(v >> 8) gives us the byte 0x00, which is at the low address -> b[0]
byte(v) gives us the byte 0x12 -> b[1]
so we get the result, which I think is not right:
[0x00,0x12]
I searched the web for how to check whether your CPU is big-endian or little-endian, and wrote the function below:
func IsBigEndian() bool {
    test16 := uint16(0x1234)
    test8 := *(*uint8)(unsafe.Pointer(&test16))
    if test8 == 0x12 {
        return true
    } else {
        fmt.Printf("little")
        return false
    }
}
According to this function, I think byte() means "take the byte at the low address"; am I right?
If so, why do I get the wrong result in my analysis of the "if cpu is big endian" case?
Thanks a lot @Volker. I found the post Does bit-shift depend on endianness? and now understand that byte(xxx) operates on a value in a processor register, which does not depend on the endianness of memory, so byte(0x1234) always yields 0x34.
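To make that concrete, here is a small C sketch (my own, not from the Go sources) contrasting the value-level shift, which is endianness-independent, with the byte order you only see when inspecting memory:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t v = 0x1234;

    /* Shifts and narrowing casts work on the value in a register,
     * so these results are identical on big- and little-endian CPUs. */
    uint8_t hi = (uint8_t)(v >> 8); /* always 0x12 */
    uint8_t lo = (uint8_t)v;        /* always 0x34 */

    /* Endianness only shows up when we look at the bytes in memory. */
    uint8_t mem[2];
    memcpy(mem, &v, sizeof v);

    printf("hi=0x%02x lo=0x%02x\n", hi, lo);
    printf("in memory: 0x%02x 0x%02x (%s-endian)\n",
           mem[0], mem[1], mem[0] == 0x12 ? "big" : "little");
    return 0;
}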
I found the following statement in a Verilog module:
localparam str2=" Display Demo ", str2len=16;
It seems to me that str2 is a string value, but I wonder how it is processed in the following code snippet.
always @(write_base_addr)
    case (write_base_addr[8:7]) //select string as [y]
        0: write_ascii_data <= 8'hff & (str1 >> ({3'b0, (str1len - 1 - write_base_addr[6:3])} << 3)); //index string parameters as str[x]
        1: write_ascii_data <= 8'hff & (str2 >> ({3'b0, (str2len - 1 - write_base_addr[6:3])} << 3));
        2: write_ascii_data <= 8'hff & (str3 >> ({3'b0, (str3len - 1 - write_base_addr[6:3])} << 3));
        3: write_ascii_data <= 8'hff & (str4 >> ({3'b0, (str4len - 1 - write_base_addr[6:3])} << 3));
    endcase
Will the string value be converted into a bit vector first? write_ascii_data is only 8 bits long; it seems to me that this is too short to fully store the end result of the case expression. Is there any VHDL equivalent of a localparam string?
Verilog has no string data type. A string literal gets converted to the equivalent ASCII bit vector, 8 bits per character, so str2 is a 128-bit vector parameter. The RHS expressions shift str2 to the right by some multiple of 8 bits and mask with 8'hff, selecting a single ASCII character; after the mask only 8 bits remain, which is why the 8-bit write_ascii_data is wide enough to hold the result.
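The selection trick is the same one you would use in software: shift the packed word right so the wanted character lands in the low 8 bits, then mask. A small C sketch of the idea (the names and the 4-character word are my own, not from the module):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Four characters packed into one word, leftmost character in the
     * most significant byte, just like a Verilog string literal. */
    uint32_t str = ('D' << 24) | ('e' << 16) | ('m' << 8) | 'o';
    int len = 4;

    for (int x = 0; x < len; x++) {
        /* Shift so character x lands in bits [7:0], then mask with 0xff. */
        char c = (char)((str >> ((len - 1 - x) * 8)) & 0xff);
        printf("%c", c);
    }
    printf("\n"); /* prints: Demo */
    return 0;
}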
I have just started learning C++ and have come across its various data types. I also learnt how the computer stores values once the data type is specified. One doubt that occurred to me while learning about the char data type was how the computer differentiates between integers and characters.
I learnt that the char data type uses 8 bits to store a character and that the computer can store a character in a memory location by following ASCII encoding rules. However, I don't see how the computer knows whether the byte 01000001 represents the letter 'A' or the integer 65. Is there any special bit assigned for this purpose?
When we do
int a = 65
or
char ch = 'A'
and check the memory location, we will see the same bit pattern, 01000001, in both cases, as expected.
It is at the application level that we choose to interpret it as a character or as an integer:
printf("%d", ch)
will print 65.
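A complete little program makes this visible (just a sketch, printing the same byte both ways):

#include <stdio.h>

int main(void)
{
    char ch = 'A'; /* stored as the byte 01000001, i.e. 65 */
    int  a  = 65;  /* the same numeric value in an int     */

    printf("%c %d\n", ch, ch); /* prints: A 65 */
    printf("%c %d\n", a, a);   /* prints: A 65 */
    return 0;
}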
Characters are represented as integers inside the computer. Hence the data type "char" is essentially just a small integer type.
Refer to the following page, which should clear up the ambiguities:
Data Types Detail
The computer itself does not remember or set any bits to distinguish chars from ints. Instead it's the compiler which maintains that information and generates proper machine code which operates on data appropriately.
You can even override and 'mislead' the compiler if you want. For example, you can cast a char pointer to a void pointer, then to an int pointer, and then try to read the referenced location as an int. I think 'dynamic casts' are also possible. If there were an actual type bit stored in memory, such operations would not be possible.
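For example, here is a short C sketch going in the other (well-defined) direction, reading the bytes of an int back out as characters; nothing in memory records that they were 'really' an int:

#include <stdio.h>

int main(void)
{
    int n = 0x41424344;                     /* the bytes 'A','B','C','D' in some order */
    unsigned char *p = (unsigned char *)&n; /* reinterpret the same bytes as chars */

    /* Prints "DCBA" on a little-endian machine and "ABCD" on a big-endian
     * one; the bytes themselves carry no type information at all. */
    for (unsigned i = 0; i < sizeof n; i++)
        printf("%c", p[i]);
    printf("\n");
    return 0;
}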
Adding more details in response to comment:
Really, what you should ask is: who will retrieve the values? Imagine that you write the contents of memory to a file and send them over the Internet. If the receiver "knows" that it is receiving chars, then there is no need to encode the identity of chars. But if the receiver could receive either chars or ints, then it would need identifying bits. In the same way, when you compile a program, the compiler knows what is stored where, so there is no need to 'figure out' anything at run time. Now, how a char vs. an int vs. a float is encoded as bits is decided by conventions and standards, such as ASCII for characters or IEEE 754 for floating point.
You have asked a simple yet profound question. :-)
Answers and an example or two are below.
(see edit2, at bottom, for a longer example that tries to illustrate what happens when you interpret a single memory location's bit patterns in different ways).
The "profound" aspect of it lies in the astounding variety of character encodings that exist. There are many - I wager more than you believe there could possibly be. :-)
This is a worthwhile read: http://www.joelonsoftware.com/articles/Unicode.html
full title: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
As for your first question: "how did the computer differentiate between integers and characters":
The computer doesn't (for better or worse).
The meaning of bit patterns is interpreted by whatever reads them.
Consider this example bit pattern (8 bits, or one byte):
01000001b = 41x = 65d (binary, hex & decimal respectively).
If that bit pattern is based on ASCII it will represent an uppercase A.
If that bit pattern is EBCDIC it will represent a "non-breaking space" character (at least according to the EBCDIC chart at Wikipedia; most of the others I looked at don't say what 65d means in EBCDIC).
(Just for trivia's sake, in EBCDIC, 'A' would be represented with a different bit pattern entirely: C1x or 193d.)
If you read that bit pattern as an integer (perhaps a short), it may indicate you have 65 dollars in a bank account (or euros, or something else; just like with character sets, the bit pattern has nothing in it to tell you what currency it is).
If that bit pattern is part of a 24-bit pixel encoding for your display (3 bytes for RGB), perhaps the 'blue' byte, it may indicate your pixel is roughly 25% blue (65/255 is about 25.4%); 0 would be no blue at all, 255 would be as blue as possible.
So, yeah, there are lots of variations on how bits can be interpreted. It is up to your program to keep track of that.
edit: it is common to add metadata to track that, so if you are dealing with currencies you may have one byte for the currency type and other bytes for the quantity of a given currency. The currency type would have to be encoded as well; there are different ways to do that, which is something a "C++ enum" attempts to solve in a space-efficient way: http://www.cprogramming.com/tutorial/enum.html
As for 8 bits (one byte) per character, that is a fair assumption when you're starting out, but it isn't always true. Lots of languages will use 2+ bytes for each character when you get into Unicode.
However... ASCII is very common and it fits into a single byte (8 bits).
If you are handling simple English text (A-Z, 0-9 and so on), that may be enough for you.
Spend some time browsing here and look at ASCII, EBCDIC and others:
http://www.lookuptables.com/
If you're running on Linux or something similar, hexdump can be your friend.
Try the following
$ hexdump -C myfile.dat
Whatever operating system you're using, you will want to find a hexdump utility you can use to see what is really in your data files.
You mentioned C++; I think it would be an interesting exercise to write a byte-dumper utility, just a short program that takes a void * pointer and a byte count and prints out that many bytes' worth of values.
Good luck with your studies! :-)
Edit 2: I added a small research program... I don't know how to illustrate the idea more concisely (it seems easier in C than C++).
Anyway...
In this example program, I have two character pointers that are referencing memory used by an integer.
The actual code (see 'example program', way below) is messier with casting, but this illustrates the basic idea:
unsigned short a; // reserve 2 bytes of memory to store our 'unsigned short' integer.
char *c1 = &a; // point to first byte at a's memory location.
char *c2 = c1 + 1; // point to next byte at a's memory location.
Note how 'c1' and 'c2' both share the memory that is also used by 'a'.
Walking through the output...
The sizeofs basically tell you how many bytes something uses.
The ===== Message Here ===== lines are like a comment printed out by the dump() function.
The important thing about the dump() function is that it is using the bit patterns in the memory location for 'a'.
dump() doesn't change those bit patterns, it just retrieves them and displays them via cout.
In the first run, before calling dump I assign the following bit pattern to a:
a = (0x41<<8) + 0x42;
This left-shifts 0x41 8 bits and adds 0x42 to it.
The resulting bit pattern is 0x4142 (which is 16706 decimal, or 01000001 01000010 binary).
One of the bytes will be 0x41, the other will hold 0x42.
Next it calls the dump() method:
dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
Note the output for this run on my virtual box Ubuntu found the address of a was 0x6021b8.
Which nicely matches the expected addresses pointed to by both c1 & c2.
Then I modify the bit pattern in 'a'...
a += 1; dump(); // why did this find a 'C' instead of 'B'?
a += 5; dump(); // why did this find an 'H' instead of 'C' ?
As you dig deeper into C++ (and maybe C ) you will want to be able to draw memory maps like this (more or less):
=== begin memory map ===
+-------+-------+
unsigned short a : byte0 : byte1 : holds 2 bytes worth of bit patterns.
+-------+-------+-------+-------+
char * c1 : byte0 : byte1 : byte2 : byte3 : holds address of a
+-------+-------+-------+-------+
char * c2 : byte0 : byte1 : byte2 : byte3 : holds address of a + 1
+-------+-------+-------+-------+
=== end memory map ===
Here is what it looks like when it runs; I encourage you to walk through the C++ code
in one window and tie each piece of output back to the C++ expression that generated it.
Note how sometimes we do simple math to add a number to a (e.g. "a +=1" followed by "a += 5").
Note the impact that has on the characters that dump() extracts from memory location 'a'.
=== begin run ===
$ clear; g++ memfun.cpp
$ ./a.out
sizeof char =1, unsigned char =1
sizeof short=2, unsigned short=2
sizeof int =4, unsigned int =4
sizeof long =8, unsigned long =8
===== In ASCII, 0x41 is 'A' and 0x42 is 'B' =====
a=16706(dec), 0x4142 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=B
c2=A
in hex, c1=42
in hex, c2=41
===== after a+= 1 =====
a=16707(dec), 0x4143 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=C
c2=A
in hex, c1=43
in hex, c2=41
===== after a+= 5 =====
a=16712(dec), 0x4148 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=H
c2=A
in hex, c1=48
in hex, c2=41
===== In ASCII, 0x58 is 'X' and 0x59 is 'Y' =====
a=22617(dec), 0x5859 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Y
c2=X
in hex, c1=59
in hex, c2=58
===== In ASCII, 0x59 is 'Y' and 0x5A is 'Z' =====
a=22874(dec), 0x595a (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Z
c2=Y
in hex, c1=5a
in hex, c2=59
Done.
$
=== end run ===
=== begin example program ===
#include <iostream>
#include <string>
using namespace std;
// define some global variables
unsigned short a; // declare 2 bytes in memory, as per sizeof()s below.
char *c1 = (char *)&a; // point c1 to start of memory belonging to a (1st byte).
char * c2 = c1 + 1; // point c2 to next piece of memory belonging to a (2nd byte).
void dump(const char *msg) {
    // so the important thing about dump() is that
    // we are working with bit patterns in memory we
    // do not own, and it is memory we did not set (at least
    // not here in dump(), the caller is manipulating the bit
    // patterns for the 2 bytes in location 'a').
    cout << "===== " << msg << " =====\n";
    cout << "a=" << dec << a << "(dec), 0x" << hex << a << dec << " (address of a: " << &a << ")\n";
    cout << "c1=" << (void *)c1 << " (should be the same as 'address of a')\n";
    cout << "c2=" << (void *)c2 << " (should be just 1 more than 'address of a')\n";
    cout << "c1=" << (char)(*c1) << "\n";
    cout << "c2=" << (char)(*c2) << "\n";
    cout << "in hex, c1=" << hex << ((int)(*c1)) << dec << "\n";
    cout << "in hex, c2=" << hex << (int)(*c2) << dec << "\n";
}
int main() {
    cout << "sizeof char =" << sizeof( char ) << ", unsigned char =" << sizeof( unsigned char ) << "\n";
    cout << "sizeof short=" << sizeof( short ) << ", unsigned short=" << sizeof( unsigned short ) << "\n";
    cout << "sizeof int =" << sizeof( int ) << ", unsigned int =" << sizeof( unsigned int ) << "\n";
    cout << "sizeof long =" << sizeof( long ) << ", unsigned long =" << sizeof( unsigned long ) << "\n";
    // this logic changes the bit pattern in a then calls dump() to interpret that bit pattern.
    a = (0x41<<8) + 0x42; dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
    a+= 1; dump( "after a+= 1" );
    a+= 5; dump( "after a+= 5" );
    a = (0x58<<8) + 0x59; dump( "In ASCII, 0x58 is 'X' and 0x59 is 'Y'" );
    a = (0x59<<8) + 0x5A; dump( "In ASCII, 0x59 is 'Y' and 0x5A is 'Z'" );
    cout << "Done.\n";
}
=== end example program ===
int is an integer, a number that has no digits after the decimal point. It can be positive or negative. Internally, integers are stored as binary numbers. On most computers, integers are 32-bit binary numbers, but this size can vary from one computer to another. When calculations are done with integers, anything after the decimal point is lost. So if you divide 2 by 3, the result is 0, not 0.6666.
char is a data type intended for holding characters, as in alphanumeric strings. It can be signed or unsigned (whether plain char is signed is implementation-defined), even though most character data it holds is effectively unsigned. The typical size of char is one byte (eight bits). The plot thickens considerably on machines that support wide characters (e.g., Unicode) or multi-byte encoding schemes for strings, but in general char is one byte.
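A tiny program demonstrates both points above, the truncating integer division and the typical sizes (exact ranges depend on the platform):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("sizeof(int)  = %zu bytes, range %d..%d\n",
           sizeof(int), INT_MIN, INT_MAX);
    printf("sizeof(char) = %zu byte,  range %d..%d\n",
           sizeof(char), CHAR_MIN, CHAR_MAX);
    printf("2 / 3 = %d (the fractional part is discarded)\n", 2 / 3);
    return 0;
}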
To clarify the question, please observe the following C/C++ code fragment:
int a = 10, b = 20, c = 30, d = 40; //consecutive 4 int data values.
int* p = &d; //address of variable d.
Now, in Visual Studio (tested on 2013), if the value of p is some hex_value (which can be viewed in the debugger's memory window), then you can observe that the addresses of the variables a, b, c, and d each differ by 12 bytes!
So, if p == hex_value, then it follows:
&c == hex_value + 0xC (note hex C is 12 in decimal)
&b == &c + 0xC
&a == &b + 0xC
So, why is there a 12-byte offset instead of 4 bytes, when an int is just 4 bytes?
Now, if we declared an array:
int array[] = {10,20,30,40};
The values 10, 20, 30, 40 are each located 4 bytes apart, as expected!
Can anyone please explain this behavior?
The C++ standard states in section 8.3.4 (Arrays) that "An object of array type contains a contiguously allocated non-empty set of N subobjects of type T."
This is why array[] will be a set of contiguous ints, and the difference between one element and the next will be exactly sizeof(int).
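You can verify the array case directly; a minimal sketch (plain C for brevity) that prints consecutive element addresses exactly sizeof(int) apart:

#include <stdio.h>

int main(void)
{
    int array[] = {10, 20, 30, 40};

    /* The standard guarantees the elements are contiguous, so the
     * printed addresses differ by exactly sizeof(int). */
    for (int i = 0; i < 4; i++)
        printf("&array[%d] = %p\n", i, (void *)&array[i]);
    return 0;
}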
For local/block variables (automatic storage), no such guarantee is given. The only relevant statements are in section 1.7, The C++ memory model: "Every byte has a unique address.", and section 1.8, The C++ object model: "the address of that object is the address of the first byte it occupies. Two objects (...) shall have distinct addresses".
So anything you do that assumes such objects are contiguous is undefined behaviour and non-portable. You cannot even be sure of the order of the addresses at which these objects are created.
Now I have played with a modified version of your code:
int a = 10, b = 20, c = 30, d = 40; //consecutive 4 int data values.
int* p = &d; //address of variable d.
int array[] = { 10, 20, 30, 40 };
char *pa = reinterpret_cast<char*>(&a),
*pb = reinterpret_cast<char*>(&b),
*pc = reinterpret_cast<char*>(&c),
*pd = reinterpret_cast<char*>(&d);
cout << "sizeof(int)=" << sizeof(int) << "\n &a=" << &a << \
" +" << pa - pb << "char\n &b=" << &b << \
" +" << pb - pc << "char\n &c=" << &c << \
" +" << pc - pd << "char\n &d=" << &d;
memset(&d, 0, (&a - &d)*sizeof(int));
// ATTENTION: undefined behaviour:
// will trigger core dump on leaving
// "Runtime check #2, stack arround the variable b was corrupted".
When running this code I get:
debug release comment on release
sizeof(int)=4 sizeof(int)=4
&a=0052F884 +12char &a=009EF9AC +4char
&b=0052F878 +12char &b=009EF9A8 +-8char // is before a
&c=0052F86C +12char &c=009EF9B0 +12char // is just after a !!
&d=0052F860 &d=009EF9A4
So you see that the order of the addresses may even change on the same compiler, depending on the build options! In fact, in release mode the variables are contiguous, just not in the same order.
The extra space in the debug version comes from the /RTCs option. I have deliberately overwritten the variables with a harsh memset() that assumes they are contiguous. On exit, I immediately get the message "Runtime check #2, stack around the variable b was corrupted", which clearly demonstrates the purpose of those extra bytes.
If you remove that option, you will get contiguous 4-byte variables with MSVC 2013, as you expected. But there will also be no more error message about stack corruption.