Ensuring variable size for bitwise operations in Ruby

I have a group of hex values that describe an object in Ruby, and I want to string them all together into a single bit bucket. In C++ I would do the following:
int descriptor = 0; // or uint64_t to be safe
descriptor += (firstHexValue << 60);
descriptor += (secondHex << 56);
descriptor += (thirdHex << 52);
// ... etc
descriptor += (sixteenthHex << 0);
I want to do the same thing in Ruby, but as Ruby is dynamically typed, I am worried about overflow. If I try to do the same thing in Ruby, is there a way to ensure that descriptor contains 64 bits? Once the descriptors are set, I don't want to suddenly find that only 32 bits are represented and I've lost half of it! How can I safely achieve the same result as above?
Note: working on 64-bit OS X, if that is relevant.

Ruby has arbitrary-precision integers, so don't worry about overflow: you won't lose a single bit.
a = 0
a |= (1 << 200)
a # => 1606938044258990275541962092341162602522202993782792835301376
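To tie that back to the question, here is a sketch of the descriptor-building code in Ruby, with made-up nibble values (each assumed to fit in 4 bits). The final mask is only needed if you want to guarantee that the result fits in exactly 64 bits; Ruby itself will happily keep more.
# Hypothetical 4-bit values; the first one lands in bits 63..60.
nibbles = [0xA, 0x3, 0xF, 0x0, 0x1, 0x2, 0x3, 0x4,
           0x5, 0x6, 0x7, 0x8, 0x9, 0xB, 0xC, 0xD]
descriptor = 0
nibbles.each_with_index do |value, i|
  descriptor |= value << (60 - 4 * i)   # same shifts as the C++ version
end
# Optional: force the value into 64 bits before handing it to anything
# that expects exactly that width.
descriptor &= 0xFFFFFFFFFFFFFFFF
printf("%016x\n", descriptor)           # => a3f0123456789bcd
If you later need the raw 8 bytes (for a file or a socket), [descriptor].pack('Q') gives them to you in native byte order.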

Related

Performing checksum calculation on python bytes type

This is the first time I need to work with raw data (different endianness, 2's complement, ...), so I finally figured out how to work with the bytes type.
I need to implement the following checksum algorithm. I understand the C code, but wonder how to do this gracefully in Python 3.
I'm sure I could come up with something that works, but it would probably be terribly inefficient or unreliable.
The checksum algorithm used is the 8-bit Fletcher algorithm. This algorithm works as follows:
Buffer[N] is an array of bytes that contains the data over which the checksum is to be calculated.
The two CK_A and CK_B values are 8-bit unsigned integers, only! If implementing with larger-sized integer values, make sure to mask both
CK_A and CK_B with the value 0xFF after both operations in the loop.
After the loop, the two U1 values contain the checksum, transmitted after the message payload, which concludes the frame.
CK_A = 0, CK_B = 0
For (I = 0; I < N; I++)
{
    CK_A = CK_A + Buffer[I]
    CK_B = CK_B + CK_A
}
My data structure is as follows:
source = b'\xb5b\x01<#\x00\x01\x00\x00\x00hUX\x17\xdd\xff\xff\xff^\xff\xff\xff\xff\xff\xff\xff\xa6\x00\x00\x00F\xee\x88\x01\x00\x00\x00\x00\xa5\xf5\xd1\x05d\x00\x00\x00d\x00\x00\x00j\x00\x00\x00d\x00\x00\x00\xcb\x86\x00\x00\x00\x00\x00\x007\x01\x00\x00\xcd\xa2'
I came up with a couple of ideas on how to do this, but ran into issues.
The following is where I am now; I've added comments on how I thought it would work (but it doesn't).
for b in source[5:-2]:
# The following results in "TypeError("can't concat int to bytes")"
# So I take one element of a byte, then I would expect to get a single byte.
# However, I get an int.
# Should I convert the left part of the operation to an int first?
# I suppose I could get this done in a couple of steps but it seems this can't be the "correct" way...
CK_A[-1:] += b
# I hoped the following would work as a bitmask,
# (by keeping only the last byte) thus "emulating" an uint8_t
# Might not be the correct/best assumption...
CK_A = CK_A[-1:]
CK_B[-1:] += CK_A
CK_B = CK_B[-1:]
ret = CK_A + CK_B
Clearly, I do not completely grasp how this bytes type works or how it should be used.
Seems I was making things too difficult...
CK_A = 0
CK_B = 0
for b in source:
    CK_A += b
    CK_B += CK_A
    CK_A %= 0x100
    CK_B %= 0x100
ret = int.to_bytes(CK_A, 1, 'big') + int.to_bytes(CK_B, 1, 'big')
The %= 0x100 works as a bit mask, keeping only the 8 least significant bits.
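Since most of the other questions on this page are about Ruby, here is the same Fletcher-8 loop as a sketch in Ruby; the & 0xFF mask plays exactly the same role as the %= 0x100 above, and which byte range you feed in (with or without header and trailing checksum) depends on your frame format.
def fletcher8(bytes)
  ck_a = 0
  ck_b = 0
  bytes.each_byte do |b|
    ck_a = (ck_a + b) & 0xFF    # emulate an 8-bit unsigned accumulator
    ck_b = (ck_b + ck_a) & 0xFF
  end
  [ck_a, ck_b].pack('CC')       # CK_A first, then CK_B
end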

How to properly convert dicom image to opencv

I have problems converting a .dcm image loaded with DCMTK to OpenCV.
My code:
DicomImage dcmImage(in_file.c_str());
int depth = dcmImage.getDepth();
std::cout << "bit-depth: " << depth << "\n"; //this outputs 10
Uint8* imgData = (uchar*)dcmImage.getOutputData(depth);
std::cout << "size: " << dcmImage.getOutputDataSize() << "\n"; //this outputs 226100
cv::Mat image(int(dcmImage.getWidth()), int(dcmImage.getHeight()), CV_32S, imgData);
std::cout << dcmImage.getWidth() << " " << dcmImage.getHeight() << "\n"; //this outputs 266 and 425
imshow("image view", image); //this shows malformed image
So I am not sure about CV_32S and the getOutputData parameter. What should I put there? Also, 226100/(266*425) == 2, so it should be 2 bytes per pixel(?)
When getDepth() returns 10, that means you have 10 bits (most probably grayscale) per pixel.
Depending on the pixel representation of the DICOM image (0x0028,0x0103), you have to specify a signed or unsigned 16-bit integer for the matrix type:
CV_16UC1 or CV_16SC1.
Caution: as only 10 of the 16 bits are used, you might find garbage in the upper 6 bits, which should be masked out before passing the buffer to the Mat.
Update:
About your comments and your source code:
DicomImage::getInterData()::getPixelRepresentation() does not return the pixel representation as found in the DICOM header, but an internal enumeration expressing bit depth and signed/unsigned at the same time. To obtain the value from the header, use DcmDataset or DcmFileFormat.
I am not an OpenCV expert, but I think you are applying an 8-bit bitmask to the 16-bit image, which cannot work properly.
The bitmask should read (1 << 10) - 1, i.e. keep only the 10 significant bits.
The question is whether you really need rendered pixel data as returned by DicomImage::getOutputData(), or if you need the original pixel data from the DICOM image (also see answer from #kritzel_sw). When using getOutputData() you should pass the requested bit depth as a parameter (e.g. 8 bits per sample) and not the value returned by getDepth().
When working with CT images, you probably want to use pixel data in Hounsfield Units (which is a signed integer value that is the result of the Modality LUT transformation).

Differentiation between integer and character

I have just started learning C++ and have come across its various data types. I have also learnt how the computer stores values once the data type is specified. One doubt that occurred to me while learning about the char data type was how the computer differentiates between integers and characters.
I learnt that the char data type uses 8 bits to store a character, and that the computer stores a character in memory by following the ASCII encoding rules. However, I don't understand how the computer knows whether the byte 01000001 represents the letter 'A' or the integer 65. Is there any special bit assigned for this purpose?
When we do
int a = 65;
or
char ch = 'A';
and check that memory location, we will see the bit pattern 01000001 in both cases. It is at the application layer that we choose to interpret it as a character or as an integer:
printf("%d", ch);
will print 65.
Characters are represented as integers inside the computer, so the char data type is essentially just a small integer type.
Refer to the following page; it will clear up the ambiguities in your mind:
Data Types Detail
The computer itself does not remember or set any bits to distinguish chars from ints. Instead it's the compiler which maintains that information and generates proper machine code which operates on data appropriately.
You can even override and 'mislead' the compiler if you want. For example you can cast a char pointer to a void pointer and then to an int pointer and then try to read the location referred to as an int. I think 'dynamic casts' are also possible. If there was an actual bit used then such operations would not be possible.
Adding more details in response to comment:
Really, what you should ask is: who will retrieve the values? Imagine that you write the contents of memory to a file and send it over the Internet. If the receiver "knows" that it is receiving chars, then there is no need to encode the identity of each char. But if the receiver could receive either chars or ints, then it would need identifying bits. In the same way, when you compile a program and the compiler knows what is stored where, there is no need to 'figure out' anything, since you already know it. How a char is encoded as bits, versus a float, versus an int, is decided by conventions and standards (ASCII or Unicode for characters, IEEE 754 for floating point).
You have asked a simple yet profound question. :-)
Answers and an example or two are below.
(see edit2, at bottom, for a longer example that tries to illustrate what happens when you interpret a single memory location's bit patterns in different ways).
The "profound" aspect of it lies in the astounding variety of character encodings that exist. There are many - I wager more than you believe there could possibly be. :-)
This is a worthwhile read: http://www.joelonsoftware.com/articles/Unicode.html
full title: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
As for your first question: "how did the computer differentiate between integers and characters":
The computer doesn't (for better or worse).
The meaning of bit patterns is interpreted by whatever reads them.
Consider this example bit pattern (8 bits, or one byte):
01000001b = 41x = 65d (binary, hex & decimal respectively).
If that bit pattern is based on ASCII it will represent an uppercase A.
If that bit pattern is EBCDIC it will represent a "non-breaking space" character (at least according to the EBCDIC chart on Wikipedia; most of the others I looked at don't say what 65d means in EBCDIC).
(Just for trivia's sake, in EBCDIC, 'A' would be represented with a different bit pattern entirely: C1x or 193d.)
If you read that bit pattern as an integer (perhaps a short), it may indicate you have 65 dollars in a bank account (or euros, or something else; just like with the character set, your bit pattern won't have anything in it to tell you what currency it is).
If that bit pattern is part of a 24-bit pixel encoding for your display (3 bytes for RGB), perhaps the 'blue' byte, it may indicate your pixel is roughly 25% blue (65/255 is about 25.4%); 0 would be no blue at all, 255 would be as blue as possible.
So, yeah, there are lots of variations on how bits can be interpreted. It is up to your program to keep track of that.
edit: it is common to add metadata to track that, so if you are dealing with currencies you may have one byte for the currency type and other bytes for the quantity of a given currency. The currency type would have to be encoded as well; there are different ways to do that... something that the C++ enum attempts to solve in a space-efficient way: http://www.cprogramming.com/tutorial/enum.html
As for 8 bits (one byte) per character, that is a fair assumption when you're starting out. But it isn't always true: lots of languages use 2+ bytes per character once you get into Unicode.
However... ASCII is very common and it fits into a single byte (8 bits).
If you are handling simple English text (A-Z, 0-9 and so on), that may be enough for you.
Spend some time browsing here and look at ASCII, EBCDIC and others:
http://www.lookuptables.com/
If you're running on Linux or something similar, hexdump can be your friend.
Try the following
$ hexdump -C myfile.dat
Whatever operating system you're using, you will want to find a hexdump utility you can use to see what is really in your data files.
You mentioned C++; I think it would be an interesting exercise to write a small byte-dumper utility, just a short program that takes a void* pointer and a byte count and then prints out that many bytes' worth of values.
Good luck with your studies! :-)
Edit 2: I added a small research program... I don't know how to illustrate the idea more concisely (it seems easier in C than C++).
Anyway...
In this example program, I have two character pointers that are referencing memory used by an integer.
The actual code (see 'example program', way below) is messier with casting, but this illustrates the basic idea:
unsigned short a; // reserve 2 bytes of memory to store our 'unsigned short' integer.
char *c1 = &a; // point to first byte at a's memory location.
char *c2 = c1 + 1; // point to next byte at a's memory location.
Note how 'c1' and 'c2' both share the memory that is also used by 'a'.
Walking through the output...
The sizeof lines basically tell you how many bytes each type uses.
The ===== Message Here ===== lines are like a comment printed out by the dump() function.
The important thing about the dump() function is that it is using the bit patterns in the memory location for 'a'.
dump() doesn't change those bit patterns, it just retrieves them and displays them via cout.
In the first run, before calling dump I assign the following bit pattern to a:
a = (0x41<<8) + 0x42;
This left-shifts 0x41 by 8 bits and adds 0x42 to it.
The resulting bit pattern is 0x4142 (which is 16706 decimal, or 01000001 01000010 binary).
One of the bytes will be 0x41, the other will hold 0x42.
Next it calls the dump() method:
dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
Note the output for this run on my virtual box Ubuntu found the address of a was 0x6021b8.
Which nicely matches the expected addresses pointed to by both c1 & c2.
Then I modify the bit pattern in 'a'...
a += 1; dump(); // why did this find a 'C' instead of 'B'?
a += 5; dump(); // why did this find an 'H' instead of 'C' ?
As you dig deeper into C++ (and maybe C ) you will want to be able to draw memory maps like this (more or less):
=== begin memory map ===
+-------+-------+
unsigned short a : byte0 : byte1 : holds 2 bytes worth of bit patterns.
+-------+-------+-------+-------+
char * c1 : byte0 : byte1 : byte2 : byte3 : holds address of a
+-------+-------+-------+-------+
char * c2 : byte0 : byte1 : byte2 : byte3 : holds address of a + 1
+-------+-------+-------+-------+
=== end memory map ===
Here is what it looks like when it runs; I encourage you to walk through the C++ code
in one window and tie each piece of output back to the C++ expression that generated it.
Note how sometimes we do simple math to add a number to a (e.g. "a +=1" followed by "a += 5").
Note the impact that has on the characters that dump() extracts from memory location 'a'.
=== begin run ===
$ clear; g++ memfun.cpp
$ ./a.out
sizeof char =1, unsigned char =1
sizeof short=2, unsigned short=2
sizeof int =4, unsigned int =4
sizeof long =8, unsigned long =8
===== In ASCII, 0x41 is 'A' and 0x42 is 'B' =====
a=16706(dec), 0x4142 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=B
c2=A
in hex, c1=42
in hex, c2=41
===== after a+= 1 =====
a=16707(dec), 0x4143 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=C
c2=A
in hex, c1=43
in hex, c2=41
===== after a+= 5 =====
a=16712(dec), 0x4148 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=H
c2=A
in hex, c1=48
in hex, c2=41
===== In ASCII, 0x58 is 'X' and 0x59 is 'Y' =====
a=22617(dec), 0x5859 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Y
c2=X
in hex, c1=59
in hex, c2=58
===== In ASCII, 0x59 is 'Y' and 0x5A is 'Z' =====
a=22874(dec), 0x595a (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Z
c2=Y
in hex, c1=5a
in hex, c2=59
Done.
$
=== end run ===
=== begin example program ===
#include <iostream>
#include <string>
using namespace std;
// define some global variables
unsigned short a; // declare 2 bytes in memory, as per sizeof()s below.
char *c1 = (char *)&a; // point c1 to start of memory belonging to a (1st byte).
char * c2 = c1 + 1; // point c2 to next piece of memory belonging to a (2nd byte).
void dump(const char *msg) {
    // so the important thing about dump() is that
    // we are working with bit patterns in memory we
    // do not own, and it is memory we did not set (at least
    // not here in dump(), the caller is manipulating the bit
    // patterns for the 2 bytes in location 'a').
    cout << "===== " << msg << " =====\n";
    cout << "a=" << dec << a << "(dec), 0x" << hex << a << dec << " (address of a: " << &a << ")\n";
    cout << "c1=" << (void *)c1 << " (should be the same as 'address of a')\n";
    cout << "c2=" << (void *)c2 << " (should be just 1 more than 'address of a')\n";
    cout << "c1=" << (char)(*c1) << "\n";
    cout << "c2=" << (char)(*c2) << "\n";
    cout << "in hex, c1=" << hex << ((int)(*c1)) << dec << "\n";
    cout << "in hex, c2=" << hex << (int)(*c2) << dec << "\n";
}
int main() {
    cout << "sizeof char =" << sizeof( char ) << ", unsigned char =" << sizeof( unsigned char ) << "\n";
    cout << "sizeof short=" << sizeof( short ) << ", unsigned short=" << sizeof( unsigned short ) << "\n";
    cout << "sizeof int =" << sizeof( int ) << ", unsigned int =" << sizeof( unsigned int ) << "\n";
    cout << "sizeof long =" << sizeof( long ) << ", unsigned long =" << sizeof( unsigned long ) << "\n";
    // this logic changes the bit pattern in a, then calls dump() to interpret that bit pattern.
    a = (0x41<<8) + 0x42; dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
    a += 1; dump( "after a+= 1" );
    a += 5; dump( "after a+= 5" );
    a = (0x58<<8) + 0x59; dump( "In ASCII, 0x58 is 'X' and 0x59 is 'Y'" );
    a = (0x59<<8) + 0x5A; dump( "In ASCII, 0x59 is 'Y' and 0x5A is 'Z'" );
    cout << "Done.\n";
}
=== end example program ===
int is an integer, a number that has no digits after the decimal point. It can be positive or negative. Internally, integers are stored as binary numbers. On most computers, integers are 32-bit binary numbers, but this size can vary from one computer to another. When calculations are done with integers, anything after the decimal point is lost, so if you divide 2 by 3 the result is 0, not 0.6666.
char is a data type that is intended for holding characters, as in alphanumeric strings. This data type can be positive or negative, even though most character data for which it is used is unsigned. The typical size of char is one byte (eight bits), but this varies from one machine to another. The plot thickens considerably on machines that support wide characters (e.g., Unicode) or multiple-byte encoding schemes for strings. But in general char is one byte.

Why are consecutive int data type variables located at 12 bytes offset in visual studio?

To clarify the question, please observe the C/C++ code fragment:
int a = 10, b = 20, c = 30, d = 40; //consecutive 4 int data values.
int* p = &d; //address of variable d.
Now, in Visual Studio (tested on 2013), if the value of p == hex_value (which can be viewed in the debugger memory window), then you can observe that the addresses of the variables a, b, c, and d are each 12 bytes apart!
So, if p == hex_value, then it follows:
&c == hex_value + 0xC (note hex C is 12 in decimal)
&b == &c + 0xC
&a == &b + 0xC
So why is there a 12-byte offset instead of 4 bytes -- an int is just 4 bytes?
Now, if we declared an array:
int array[] = {10,20,30,40};
The values 10, 20, 30, 40 are each located 4 bytes apart, as expected!
Can anyone please explain this behavior?
The C++ standard states in section 8.3.4 [Arrays] that "An object of array type contains a contiguously allocated non-empty set of N subobjects of type T."
This is why array[] will be a set of contiguous ints, and the difference between one element and the next will be exactly sizeof(int).
For local/block variables (automatic storage), no such guarantee is given. The only statements are in section 1.7. The C++ memory model: "Every byte has a unique address." and 1.8. The C++ object model: "the address of that object is the address of the first byte it occupies. Two objects (...) shall have distinct addresses".
So anything you do that assumes contiguity of such objects is undefined behaviour and non-portable. You cannot even be sure of the order of the addresses at which these objects are created.
Now I have played with a modified version of your code:
int a = 10, b = 20, c = 30, d = 40; //consecutive 4 int data values.
int* p = &d; //address of variable d.
int array[] = { 10, 20, 30, 40 };
char *pa = reinterpret_cast<char*>(&a),
*pb = reinterpret_cast<char*>(&b),
*pc = reinterpret_cast<char*>(&c),
*pd = reinterpret_cast<char*>(&d);
cout << "sizeof(int)=" << sizeof(int) << "\n &a=" << &a
     << " +" << pa - pb << "char\n &b=" << &b
     << " +" << pb - pc << "char\n &c=" << &c
     << " +" << pc - pd << "char\n &d=" << &d;
memset(&d, 0, (&a - &d)*sizeof(int));
// ATTENTION: undefined behaviour:
// will trigger, on leaving the function, the error
// "Runtime check #2, stack around the variable b was corrupted".
When running this code I get:
                 debug                        release            comment on release
sizeof(int)=4                  sizeof(int)=4
&a=0052F884  +12char           &a=009EF9AC  +4char
&b=0052F878  +12char           &b=009EF9A8  +-8char             // is before a
&c=0052F86C  +12char           &c=009EF9B0  +12char             // is just after a !!
&d=0052F860                    &d=009EF9A4
So you see that the order of the addresses may even be altered by the same compiler, depending on the build options! In fact, in release mode the variables are contiguous, but not in the same order.
The extra space in the debug version comes from the /RTCs option. I have deliberately overwritten the variables with a harsh memset() that assumes they are contiguous. Upon exit I immediately get the message "Runtime check #2, stack around the variable b was corrupted", which clearly demonstrates the purpose of these extra bytes.
If you remove that option, MSVC 2013 will give you contiguous variables of 4 bytes each, as you expected; but then there is also no more error message about stack corruption.

Packing a long binary integer in Ruby

I'm trying to send a very long binary integer over UDP (on the order of 200 bits). When I try to use Array's pack method, it complains that the number I'm trying to convert is too large.
Am I going about this the wrong way?
ruby-1.8.7-p352 :003 > [0b1101001010101101111010100101010011010101010110010101010101010010010101001010101010101011101010101010101111010101010101010101].pack('i')
RangeError: bignum too big to convert into `unsigned long'
from (irb):3:in `pack'
from (irb):3
This number is supposed to represent a DNS query packet (this is for a homework assignment; we're not allowed to use any DNS libraries).
You need to break your number apart into smaller pieces. Probably the best approach is to encode 32 bits at a time:
> num = 0b1101001010101101111010100101010011010101010110010101010101010010010101001010101010101011101010101010101111010101010101010101
=> 17502556204775004286774747314501014869
> low_1 = num & 0xFFFFFFFF
=> 2864534869
> low_2 = (num >> 32) & 0xFFFFFFFF
=> 625650362
> low_3 = (num >> 64) & 0xFFFFFFFF
=> 1297454421
> low_4 = (num >> 96) & 0xFFFFFFFF
=> 220913317
> (low_4 << 96) + (low_3 << 64) + (low_2 << 32) + low_1
=> 17502556204775004286774747314501014869
> msg = [low_4, low_3, low_2, low_1].pack("NNNN")
=> "\r*\336\245MU\225U%J\252\272\252\275UU"
> msg.unpack("NNNN").inject {|sum, elem| (sum << 32) + elem}
=> 17502556204775004286774747314501014869
I prefer 32 bits here because you pack these in network byte order, which makes interoperation with other platforms much easier. The pack() method doesn't provide a network-byte-order 64-bit integer. (Which isn't too surprising, since POSIX doesn't provide a 64-bit routine.)
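If the value is not a convenient multiple of 32 bits, the same idea generalises: peel the number apart one byte at a time. A rough sketch (the caller has to supply the byte count, since leading zero bits cannot be recovered from the integer value alone):
# Serialise an arbitrary-precision integer into `length` big-endian bytes.
def int_to_bytes(num, length)
  bytes = []
  length.times do
    bytes.unshift(num & 0xFF)   # take the least significant byte...
    num >>= 8                   # ...then shift it away
  end
  bytes.pack('C*')
end
# And back again, for checking.
def bytes_to_int(str)
  str.each_byte.inject(0) { |acc, b| (acc << 8) | b }
end
msg = int_to_bytes(num, 16)     # 16 bytes (128 bits) comfortably hold the value above
bytes_to_int(msg) == num        # => true
More recent Ruby versions also understand explicit endianness modifiers such as pack('Q>') for a single big-endian 64-bit field, but a value on the order of 200 bits still has to be split up by hand.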
Ruby 1.9.3 no longer raises an error here, but note that pack('i') keeps only the lowest 32 bits:
irb(main):001:0> [0b1101001010101101111010100101010011010101010110010101010101010010010101001010101010101011101010101010101111010101010101010101].pack('i')
=> "UU\xBD\xAA"

Resources