When to use hton/ntoh and when to convert data myself? (endianness)

To convert a byte array received from another machine that is big-endian, we can use:

long long convert(unsigned char data[]) {
    long long res = 0;
    for (int i = 0; i < DATA_SIZE; ++i)
        res = (res << 8) + data[i];
    return res;
}

If the other machine is little-endian, we can use:

long long convert(unsigned char data[]) {
    long long res = 0;
    for (int i = DATA_SIZE - 1; i >= 0; --i)
        res = (res << 8) + data[i];
    return res;
}
Why do we need the above functions? Shouldn't we just use hton at the sender and ntoh when receiving? Is it because hton/ntoh converts integers while this convert() works on a char array?

The hton/ntoh functions convert between network order and host order. If these two are the same (i.e., on big-endian machines), these functions do nothing, so they cannot be portably relied upon to swap endianness. Also, as you pointed out, they are only defined for 16-bit (htons) and 32-bit (htonl) integers; your code can handle up to sizeof(long long) bytes, depending on how DATA_SIZE is set.
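For completeness, here is the matching sender-side routine as a minimal sketch (not from the original answer; the function name is illustrative): it writes a 64-bit value as big-endian bytes using only shifts, so it works regardless of the host's byte order and covers the width that htons/htonl do not.

#include <stdint.h>

/* Sketch: write a 64-bit value as 8 big-endian bytes, independent of the
   host's endianness. Mirrors the shift-based convert() above. */
void put_u64_be(unsigned char out[8], uint64_t value) {
    for (int i = 7; i >= 0; --i) {
        out[i] = (unsigned char)(value & 0xFF);  /* least significant byte goes last */
        value >>= 8;
    }
}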

Through the network you always receive a series of bytes (octets), which you can't directly pass to ntohs or ntohl. Supposing the incoming bytes are buffered in the (unsigned) char array buf, you could do
short x = ntohs(*(short *)(buf+offset));
but this is not portable unless buf+offset is always even, so that you read with correct alignment. Similarly, to do
long y = ntohl(*(long *)(buf+offset));
you have to make sure that 4 divides buf + offset. Your convert() functions, however, don't have this limitation; they can process a byte sequence at an arbitrary (unaligned) memory address.
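A common portable alternative, shown here as a sketch (the helper name and the POSIX <arpa/inet.h> header are my assumptions, not part of the answer above), is to memcpy the unaligned bytes into a properly aligned integer first and only then byte-swap:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl; on Windows this would come from winsock2.h */

/* Read a big-endian 32-bit value from an arbitrary (possibly unaligned) offset. */
uint32_t read_u32_be(const unsigned char *buf, size_t offset) {
    uint32_t tmp;
    memcpy(&tmp, buf + offset, sizeof tmp);  /* memcpy has no alignment requirement */
    return ntohl(tmp);                       /* network (big-endian) to host order */
}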

Related

Unsigned char out of range

I am trying to figure out how to use an unsigned char variable inside a for loop without "breaking" out of the range of unsigned char, which can vary from 0 to 255.
main(void) {
    TRISC = 0;
    LATC = 0;
    unsigned char j;
    for (j = 0; j <= 255; j++) {
        LATC = j;
        __delay_ms(1000);
    }
    return;
}
This is C code for a PIC microcontroller. "TRISC = 0" sets port C as an output, and "LATC" refers to port C itself. Basically, I want to assign the values 0 through 255 (inclusive) to this port. But if I try to compile this, the compiler (XC8) returns the following two warnings:
I cannot quite understand what these two are saying, but I assume it has something to do with the variable j exceeding the limit value of unsigned char, which is 255 (in the last iteration j would be 256, which is not representable).
However, this code compiles and works as intended. I would still like to write and understand code that assigns port C the value 255 without entering the "prohibited" range of values.
P.S. I would use a different variable type than unsigned char or char; however, on PICs only these two types can be assigned directly to a port (without conversion).
j <= 255 is always true if j is only 8 bits wide.
This version should work:
main(void) {
    TRISC = 0;
    LATC = 0;
    int j;
    for (j = 0; j <= 255; j++) {
        LATC = (unsigned char)j;
        __delay_ms(1000);
    }
    return;
}
First, in microcontroller firmware, you should not return from main(). Your main() should include some kind of endless loop.
j <= 255 is always true for a uint8_t variable, because j can never be 256: adding 1 to j when it is 255 makes it 0, not 256.
As others have suggested, using a 16-bit integer, signed or unsigned, is the easiest and cleanest way. However, in performance-sensitive loops you may prefer to stick with 8-bit loop counters, as these are the fastest on an 8-bit PIC microcontroller.
This particular one-time loop can be written as:
uint8_t j = 0;
do {
    LATC = j++;
    __delay_ms(1000);
} while (j != 0);
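Another way to cover all 256 values with an 8-bit counter, shown as a sketch in the same setting (PIC registers and __delay_ms as above), is to test before incrementing and break out explicitly after the last value:

uint8_t j = 0;
for (;;) {
    LATC = j;               // write the current value to port C
    __delay_ms(1000);
    if (j == 255)           // last value has been written; leave the loop
        break;
    j++;
}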

Binary to decimal (on huge numbers)

I am building a big-integer library in C. Basically, I'm looking for a fast algorithm to convert any integer from its binary representation to a decimal one.
I looked at the JDK's BigInteger.toString() implementation, but it looks quite heavy to me, as it was written to convert the number to any radix (it uses one division per digit, which should be pretty slow when dealing with thousands of digits).
So if you have any documentations / knowledge to share about it, I would be glad to read it.
EDIT: to be more precise about my question:
Let P be a memory address
Let N be the number of bytes allocated (and set) at P
How do I convert the integer represented by the N bytes at address P (say, in little-endian to keep things simple) to a C string?
Example:
N = 1
P = some random memory address storing '00101010'
out string = "42"
Thanks for your answers.
The reason the BigInteger.toString method looks heavy is that it does the conversion in chunks.
A trivial algorithm would take the last digit (the number modulo the radix) and then divide the whole big integer by the radix, repeating until nothing is left.
One problem with this is that every big-integer division is quite expensive, so the number is instead subdivided into chunks that can be processed with regular integer division (as opposed to BigInteger division):
static String toDecimal(BigInteger bigInt) {
    BigInteger chunker = BigInteger.valueOf(1000000000); // 10^9 fits in an int
    StringBuilder sb = new StringBuilder();
    do {
        // split off the lowest 9 decimal digits with one BigInteger division
        int current = bigInt.mod(chunker).intValue();
        bigInt = bigInt.divide(chunker);
        // emit those 9 digits with cheap int arithmetic, least significant first
        for (int i = 0; i < 9; i++) {
            sb.append((char) ('0' + current % 10));
            current /= 10;
            if (current == 0 && bigInt.signum() == 0) {
                break;
            }
        }
    } while (bigInt.signum() != 0);
    return sb.reverse().toString();
}
That said, for a fixed radix, you are probably even better off porting the "double dabble" algorithm to your needs, as suggested in the comments: https://en.wikipedia.org/wiki/Double_dabble
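For illustration, here is a minimal double-dabble sketch in C for a single 32-bit value (the names and the fixed 10-digit output are my own choices, not from the linked article). Each step adds 3 to every BCD digit that is 5 or more and then shifts the whole register left by one bit, pulling in the next input bit.

#include <stdio.h>
#include <stdint.h>

/* Convert a 32-bit value to decimal digits with double dabble (shift-and-add-3).
   10 BCD digits are enough for 2^32-1; smaller values get leading zeros here. */
static void double_dabble32(uint32_t value, char out[11]) {
    uint8_t bcd[10] = {0};                    /* one decimal digit per byte, MSD first */
    for (int bit = 31; bit >= 0; bit--) {
        /* add 3 to every digit that is 5 or more, so the shift below carries
           correctly into the next decimal position */
        for (int i = 0; i < 10; i++)
            if (bcd[i] >= 5)
                bcd[i] += 3;
        /* shift the whole BCD register left by one bit, bringing in the next
           bit of the binary input at the low end */
        for (int i = 0; i < 9; i++)
            bcd[i] = (uint8_t)(((bcd[i] << 1) | (bcd[i + 1] >> 3)) & 0x0F);
        bcd[9] = (uint8_t)(((bcd[9] << 1) | ((value >> bit) & 1)) & 0x0F);
    }
    for (int i = 0; i < 10; i++)
        out[i] = (char)('0' + bcd[i]);
    out[10] = '\0';
}

int main(void) {
    char buf[11];
    double_dabble32(4294967295u, buf);
    printf("%s\n", buf);                      /* prints 4294967295 */
    return 0;
}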
I recently got the challenge to print a big mersenne prime: 2**82589933-1. On my CPU that takes ~40 minutes with apcalc and ~120 minutes with python 2.7. It's a number with 24 million digits and a bit.
Here is my own little C code for the conversion:
// print 2**82589933-1
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>

const uint32_t exponent = 82589933;
//const uint32_t exponent = 100;
//outputs 1267650600228229401496703205375
const uint32_t blocks = (exponent + 31) / 32;
const uint32_t digits = (int)(exponent * log(2.0) / log(10.0)) + 10;

uint32_t num[2][blocks];
char out[digits + 1];

// blocks : number of uint32_t in num1 and num2
// num1   : number to convert
// num2   : free space
// out    : end of output buffer
void conv(uint32_t blocks, uint32_t *num1, uint32_t *num2, char *out) {
    if (blocks == 0) return;
    const uint32_t div = 1000000000;
    uint64_t t = 0;
    for (uint32_t i = 0; i < blocks; ++i) {
        t = (t << 32) + num1[i];
        num2[i] = t / div;
        t = t % div;
    }
    for (int i = 0; i < 9; ++i) {
        *out-- = '0' + (t % 10);
        t /= 10;
    }
    if (num2[0] == 0) {
        --blocks;
        num2++;
    }
    conv(blocks, num2, num1, out);
}

int main() {
    // prepare number
    uint32_t t = exponent % 32;
    num[0][0] = (1LLU << t) - 1;
    memset(&num[0][1], 0xFF, (blocks - 1) * 4);
    // prepare output
    memset(out, '0', digits);
    out[digits] = 0;
    // convert to decimal
    conv(blocks, num[0], num[1], &out[digits - 1]);
    // output number
    char *res = out;
    while (*res == '0') ++res;
    printf("%s\n", res);
    return 0;
}
The conversion is destructive and tail recursive. In each step it divides num1 by 1_000_000_000 and stores the quotient in num2. The nine decimal digits of the remainder are written into out. Then it calls itself with num1 and num2 swapped, often shortened by one block (blocks is decremented). out is filled from back to front. You have to allocate it large enough and then strip the leading zeroes.
Python seems to be using a similar mechanism for converting big integers to decimal.
Want to do better?
For a large number like in my case, each division by 1_000_000_000 takes rather long. Above a certain size a divide-and-conquer algorithm does better. In my case the first step would be to divide by 10 ^ 16777216 to split the number into dividend and remainder, then convert each part separately. Each part is still big, so split again at 10 ^ 8388608. Keep splitting recursively until the pieces are small enough, say maybe 1024 digits each, and convert those with the simple algorithm above. The right definition of "small enough" would have to be tested; 1024 is just a guess.
While the long division of two big integers is expensive, much more so than a division by 1_000_000_000, the time spent there is repaid because each separate chunk then needs far fewer divisions by 1_000_000_000 to convert to decimal.
And if you have split the problem into separate and independent chunks it's only a tiny step away from spreading the chunks out among multiple cores. That would really speed up the conversion another step. It looks like apcalc uses divide&conquer but not multi-threading.

Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression

It might help to start with a real-world example. Say I'm writing a web app that's backed by MongoDB, so my records have a long hex primary key, making my URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long, and URLs can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24-digit hex number, if you limit yourself to the 62 characters matched by the regex class [a-zA-Z0-9] (a YouTube video id uses even more), you can get past 8 x 10^28 in only 17 characters.
I want an algorithm that will convert any string that is limited to a specific alphabet of characters to any other string with another alphabet of characters, where the value of each character c could be thought of as alphabet.indexOf(c).
Something of the form:
convert(value, sourceAlphabet, destinationAlphabet)
Assumptions
all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and destinationAlphabet is unique
Simplest example
var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";
But I also want it to work to convert War & Peace from the Russian alphabet plus some punctuation to the entire unicode charset and back again losslessly.
Is this possible?
The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base ten integer by summing digit * base^position and then doing the reverse to convert to the target base. Such a method is insufficient for the conversion of very long strings, because the integers get too big.
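For concreteness, the CS-101 method just described looks roughly like this as a sketch in C (illustrative names; the plain unsigned long long accumulator is exactly what overflows for long strings):

#include <stdio.h>
#include <string.h>

/* Accumulate digit * base^position into one machine integer, then peel digits
   off in the target base. Only works while the value fits in an unsigned
   long long, which is the limitation being discussed. */
void convert_small(const char *value, const char *srcAlpha, const char *dstAlpha,
                   char *out /* assumed large enough */) {
    unsigned long long srcBase = strlen(srcAlpha), dstBase = strlen(dstAlpha);

    unsigned long long n = 0;                 /* overflows for long inputs */
    for (const char *p = value; *p; ++p)
        n = n * srcBase + (unsigned long long)(strchr(srcAlpha, *p) - srcAlpha);

    size_t len = 0;
    do {                                      /* least significant digit first */
        out[len++] = dstAlpha[n % dstBase];
        n /= dstBase;
    } while (n > 0);
    out[len] = '\0';

    for (size_t i = 0; i < len / 2; ++i) {    /* reverse to most-significant-first */
        char t = out[i]; out[i] = out[len - 1 - i]; out[len - 1 - i] = t;
    }
}

int main(void) {
    char buf[64];
    convert_small("12245589", "0123456789", "0123456789abcdef", buf);
    printf("%s\n", buf);                      /* prints bada55, as in the example */
    return 0;
}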
It certainly feels intuitively that a base conversion could be done in place, as you step through the string (probably backwards to maintain standard significant digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.
That's where you come in, StackOverflow. Are you smart enough?
Perhaps this is a solved problem, done on paper by some 18th century mathematician, implemented in LISP on punch cards in 1970 and the first homework assignment in Cryptography 101, but my searches have borne no fruit.
I'd prefer a solution in javascript with a functional style, but any language or style will do, as long as you're not cheating with some big integer library. Bonus points for efficiency, of course.
Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.
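The "step through the string, keeping track of a remainder" intuition above can be made concrete as schoolbook long division performed directly on the source digit string: repeatedly divide the digit string by the destination base and collect the remainders. Here is a hedged C sketch (illustrative names; it assumes modest alphabet sizes so that srcBase * dstBase fits in an unsigned, and it runs in time quadratic in the number of digits):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert value from srcAlpha to dstAlpha using only a small remainder,
   no big-integer type. Returns a malloc'd string; the caller frees it. */
char *convert_string(const char *value, const char *srcAlpha, const char *dstAlpha) {
    unsigned srcBase = (unsigned)strlen(srcAlpha);
    unsigned dstBase = (unsigned)strlen(dstAlpha);
    size_t n = strlen(value);

    /* working copy of the number as digit values 0..srcBase-1 */
    unsigned *digits = malloc(n * sizeof *digits);
    for (size_t i = 0; i < n; ++i)
        digits[i] = (unsigned)(strchr(srcAlpha, value[i]) - srcAlpha);

    char *out = malloc(n * 32 + 2);           /* crude upper bound on the result length */
    size_t olen = 0;
    size_t start = 0;                         /* index of the first nonzero digit */

    while (start < n) {
        /* one pass of long division by dstBase, most significant digit first */
        unsigned rem = 0;
        for (size_t i = start; i < n; ++i) {
            unsigned cur = rem * srcBase + digits[i];
            digits[i] = cur / dstBase;
            rem = cur % dstBase;
        }
        out[olen++] = dstAlpha[rem];          /* next output digit, least significant first */
        while (start < n && digits[start] == 0)
            ++start;                          /* the quotient shrank; skip new leading zeros */
    }
    if (olen == 0)                            /* the input was zero (or empty) */
        out[olen++] = dstAlpha[0];

    for (size_t i = 0; i < olen / 2; ++i) {   /* reverse into most-significant-first order */
        char t = out[i]; out[i] = out[olen - 1 - i]; out[olen - 1 - i] = t;
    }
    out[olen] = '\0';
    free(digits);
    return out;
}

int main(void) {
    char *s = convert_string("12245589", "0123456789", "0123456789abcdef");
    printf("%s\n", s);                        /* prints bada55 */
    free(s);
    return 0;
}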
Here is a solution in C that is very fast, using bit shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.
/*
recode version 1.0, 22 August 2015
Copyright (C) 2015 Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler@alumni.caltech.edu
*/
/* Recode a vector from one alphabet to another using intermediate
variable-length bit codes. */
/* The approach is to use a Huffman code over equiprobable alphabets in two
directions. First to encode the source alphabet to a string of bits, and
second to encode the string of bits to the destination alphabet. This will
be reasonably close to the efficiency of base-encoding with arbitrary
precision arithmetic. */
#include <stddef.h> // size_t
#include <limits.h> // UINT_MAX, ULLONG_MAX
#if UINT_MAX == ULLONG_MAX
# error recode() assumes that long long has more bits than int
#endif
/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
*dlen returns the length of the result, which will not exceed the value of
*dlen when called. If the original *dlen is not large enough to hold the
full result, then recode() will return non-zero to indicate failure.
Otherwise recode() will return 0. recode() will also return non-zero if
either of the smax or dmax parameters are less than one. The non-zero
return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
and 3 if any of the elements of source are greater than smax.
Using this same operation on the result with smax and dmax reversed reverses
the operation, restoring the original vector. However there may be more
symbols returned than the original, so the number of symbols expected needs
to be known for decoding. (An end symbol could be appended to the source
alphabet to include the length in the coding, but then encoding and decoding
would no longer be symmetric, and the coding efficiency would be reduced.
This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0; // bit buffer
    unsigned have = 0;          // number of bits in bit buffer
    size_t i = 0, n = 0;        // source and dest indices
    unsigned sym;               // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}
/* Test recode(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}
// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)
/* "bet n m len count" generates count test vectors of length len, where each
entry is in the range 0..n. Each vector is recoded to another vector using
only symbols in the range 0..m. That vector is recoded back to a vector
using only symbols in 0..n, and that result is compared with the original
random vector. Report on the average ratio of input and output symbols, as
compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        if (blen < slen ||
            memcmp(source, back, slen * sizeof(*source)))   // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}
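For the URL-shortening example from the question, a small driver along these lines could sit next to the test harness above (a sketch: the alphabets, the buffer sizes, and the assumption that the recode() above is linked in are mine; decoding would additionally need the expected symbol count, as noted in the comments above recode()):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax);

int main(void) {
    const char *hex = "0123456789abcdef";
    const char *b62 = "0123456789"
                      "abcdefghijklmnopqrstuvwxyz"
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *id = "55c460d8e2d6e59da89d08d0";    /* 24-digit id from the question */

    /* map the hex string to symbol indices 0..15 */
    unsigned src[24];
    size_t slen = strlen(id);
    for (size_t i = 0; i < slen; i++)
        src[i] = (unsigned)(strchr(hex, id[i]) - hex);

    /* recode to base-62 symbols; 20 symbols are enough for the 96 input bits */
    unsigned dst[20];
    size_t dlen = sizeof dst / sizeof dst[0];
    if (recode(dst, &dlen, 61, src, slen, 15) != 0)
        return 1;

    /* map the symbol indices to characters and print the shorter id */
    for (size_t i = 0; i < dlen; i++)
        putchar(b62[dst[i]]);
    putchar('\n');
    return 0;
}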
As has been pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting it to base ten; rather, think of it as directing the computer to generate a representation of the quantity represented by the number in its own terms (for most computers probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.
By rejecting "big integer" implementations and asking for letter-by-letter conversion you are at the same time arguing that the numerical/alphabetical representation of quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer at some point will need to generate a representation for Д * 33^9000000.
I don't think any solution can work in general, because if n^e != m for every integer e, then for some MAX_INT there's no way to calculate the value of the target base at a certain place p, since n^p > MAX_INT.
You can get away with this in the case where n^e == m for some e, because the problem is recursively doable: the first e digits in base n can be summed and converted into the first digit in base m, then chopped off, and the process repeated.
If you don't have this useful property, then eventually you're going to have to take some part of the original number and perform a modulus involving n^p, and n^p is going to be greater than MAX_INT, which means it's impossible.
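The n^e == m case described above is easy to see with n = 2, m = 16, e = 4: each group of four source digits becomes exactly one destination digit, so nothing ever grows beyond a machine word. A small sketch (illustrative names; it pads the front so the length is a multiple of four):

#include <stdio.h>
#include <string.h>

/* Convert a binary digit string to hex by grouping e = 4 source digits per
   destination digit -- purely local, no big integers needed. */
void bin_to_hex(const char *bits, char *out) {
    static const char hex[] = "0123456789abcdef";
    size_t n = strlen(bits);
    size_t pad = (4 - n % 4) % 4;           /* pretend there are leading zero bits */
    size_t o = 0;
    unsigned digit = 0, have = (unsigned)pad;
    for (size_t i = 0; i < n; ++i) {
        digit = (digit << 1) | (unsigned)(bits[i] - '0');
        if (++have == 4) {                  /* a full group of four digits */
            out[o++] = hex[digit];
            digit = 0;
            have = 0;
        }
    }
    out[o] = '\0';
}

int main(void) {
    char buf[16];
    bin_to_hex("00101010", buf);
    printf("%s\n", buf);                    /* prints 2a (decimal 42, as in the earlier example) */
    return 0;
}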

How does capitalize work?

I'm trying to understand how String#capitalize! works internally. I could imagine it with a hash: given the string foo = "the", foo[0] is "t"; look up the lowercase "t" and map it to its uppercase value "T". In fact, the Ruby source shows:
static VALUE
rb_str_capitalize_bang(VALUE str)
{
    rb_encoding *enc;
    char *s, *send;
    int modify = 0;
    unsigned int c;
    int n;

    str_modify_keep_cr(str);
    enc = STR_ENC_GET(str);
    rb_str_check_dummy_enc(enc);
    if (RSTRING_LEN(str) == 0 || !RSTRING_PTR(str)) return Qnil;
    s = RSTRING_PTR(str); send = RSTRING_END(str);

    c = rb_enc_codepoint_len(s, send, &n, enc);
    if (rb_enc_islower(c, enc)) {
        rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
        modify = 1;
    }
    s += n;
    while (s < send) {
        c = rb_enc_codepoint_len(s, send, &n, enc);
        if (rb_enc_isupper(c, enc)) {
            rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
            modify = 1;
        }
        s += n;
    }

    if (modify) return str;
    return Qnil;
}
The relevant function is toupper (rb_enc_toupper above). How does it know that toupper("t") equals "T"?
You're wondering how it knows what the uppercase version of the character is? Like most real-world implementations of this kind of function, it uses a lookup table.
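As a minimal sketch of that idea for plain 8-bit characters (real C libraries and Ruby's encoding layer use larger, encoding-aware tables; the names here are illustrative), the table is built once and every call is then just an array index:

#include <stdio.h>

static unsigned char upper_table[256];

/* Build the mapping once: identity for everything, except a-z map to A-Z
   (assumes an ASCII-contiguous a..z, which holds on common platforms). */
static void init_upper_table(void) {
    for (int c = 0; c < 256; ++c)
        upper_table[c] = (unsigned char)c;
    for (int c = 'a'; c <= 'z'; ++c)
        upper_table[c] = (unsigned char)(c - 'a' + 'A');
}

int main(void) {
    init_upper_table();
    printf("%c\n", upper_table[(unsigned char)'t']);   /* prints T */
    return 0;
}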
toupper is an ANSI C function. This means that the exact implementation depends on the provider of your C library, which most of the time is your compiler vendor.
Chances are that it follows the ASCII table, because no lookup is faster than a simple sum of integers; even a table lookup involves an addition to calculate the address.
So, in GCC's standard library we find this implementation (the C++ ctype facet):
char
ctype<char>::do_toupper(char __c) const
{
    int __x = __c;
    return (this->is(ctype_base::lower, __c) ? (__x - 'a' + 'A') : __x);
}
This checks whether the character is lowercase. If it is, it subtracts 97 ('a') and adds 65 ('A'), which is the same as subtracting 32, yielding the uppercase letter; otherwise it returns the character unchanged. Remember that characters and numbers are the same thing to a computer, just binary data. Also note how character literals are used instead of raw numbers for better readability (well, at least for C folks).
Without looking at any source code, I would guess that one way would be to convert a character to its respective ASCII value, subtract 32 from it, and convert the result back to a char.

how to convert long int to char

#include <iostream>
#include <Windows.h>
#include <string>
using namespace std;

HANDLE hPort = CreateFile("COM2",
    GENERIC_WRITE|GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
DCB dcb;

bool writebyte(char *data)
{
    DWORD byteswritten;
    if (!GetCommState(hPort, &dcb))
    {
        printf("\nSerial port can't be open\n");
        return false;
    }
    dcb.BaudRate = CBR_9600;
    dcb.ByteSize = 8;
    dcb.Parity = NOPARITY;
    dcb.StopBits = ONESTOPBIT;
    if (!SetCommState(hPort, &dcb))
        return false;
    bool retVal = WriteFile(hPort, data, 1, &byteswritten, NULL);
    return retVal;
}

int ReadByte()
{
    int Val;
    BYTE Byte;
    DWORD dwBytesTransferred;
    DWORD dwCommModemStatus;
    if (!GetCommState(hPort, &dcb))
        return 0;
    SetCommMask(hPort, EV_RXCHAR | EV_ERR);
    WaitCommEvent(hPort, &dwCommModemStatus, 0);
    if (dwCommModemStatus & EV_RXCHAR)
        ReadFile(hPort, &Byte, 1, &dwBytesTransferred, 0);
    Val = Byte;
    return Val;
}

int main() {
    POINT p;
    int x;
    int y;
    int z;
    while (0 == 0) {
        GetCursorPos(&p);
        x = p.x;
        y = p.y;
        HDC hDC;
        hDC = GetDC(NULL);
        cin >> z;
        cout << GetPixel(hDC, x, y) << endl;
        Sleep(z);
        ReleaseDC(NULL, hDC);
        char data = GetPixel(hDC, x, y);
        if (writebyte(&data))
            cout << " DATA SENT.. " << (int)data << "\n";
    }
}
In the part that sends data over the serial connection, instead of sending the value of GetPixel(hDC, x, y), it only sends the value -1. I was thinking it is because char is only for small integers and the value I am passing is a very large number. I tried changing it to long int but I still get the same result: it only sends -1. I thought the solution might be converting the char to a long int, or the long int to a char, before sending the data, but I don't know how. Can someone help me?
Why do you use hDC after releasing it?
ReleaseDC(NULL, hDC);
char data = GetPixel(hDC, x, y);
GetPixel will return -1 (CLR_INVALID) in case of an error (see MSDN).
And, by the way, a COLORREF is not a char, so you lose information when storing the return value of GetPixel in char data. You should store the complete COLORREF and send/receive all of its bytes (i.e., send/receive sizeof(COLORREF) bytes).
Also be aware of byte order. If you are transferring multi-byte data between two machines, then you must ensure that both agree on the order of the bytes. If, for example, one machine is little-endian and the other big-endian, they will store a COLORREF with different byte order in memory: one stores the COLORREF 0x00BBGGRR in memory as { 0, 0xbb, 0xgg, 0xrr } whereas the other stores it as { 0xrr, 0xgg, 0xbb, 0 }. So you need to define a transmit byte order which both sides use independent of their host byte order. If you don't want to reinvent the wheel, take a look at network byte order and reuse that. The socket API gives you functions like ntohl and htonl which help you convert from host byte order to network byte order and vice versa.
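A minimal sketch of that idea (the helper names and the placeholder value are mine; on the sending side each byte would go through writebyte() in turn, and the receiver would call ReadByte() four times): serialize the 32-bit COLORREF into a fixed, most-significant-byte-first order and reassemble it the same way, so neither side's native endianness matters.

#include <stdio.h>
#include <stdint.h>

/* Split a 32-bit value into 4 bytes in a fixed (big-endian) transmit order. */
static void pack_u32_be(uint32_t value, unsigned char out[4]) {
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)(value);
}

/* Reassemble the 4 received bytes in the same order on the other side. */
static uint32_t unpack_u32_be(const unsigned char in[4]) {
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

int main(void) {
    unsigned char buf[4];
    pack_u32_be(0x00BB8844u, buf);            /* placeholder COLORREF-style value */
    /* each buf[i] would be passed to writebyte() in turn; the receiver feeds
       its four ReadByte() results to unpack_u32_be() */
    printf("0x%08lX\n", (unsigned long)unpack_u32_be(buf));  /* prints 0x00BB8844 */
    return 0;
}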
