I am running C++14 on macOS High Sierra.
I have a uint16_t returned from a method, and the value can range from 100 to roughly 8000.
I want to convert it to a float, so if it is 289 then the float should be 289.0. I have tried different ways of casting the uint16_t, but my float variable always ends up as zero.
uint16_t i_value = 289;
Tried this:
float f_value = static_cast<float>(i_value);
And tried this:
float f_value = (float)i_value;
But nothing works.
Question:
How can I cast a uint16_t to a float?
It is an implicit conversion (in both directions); no cast is required:
uint16_t i_value = 289;
float f = i_value;
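For completeness, a minimal compilable sketch (the variable names are just for illustration) showing that the converted value is not zero:

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint16_t i_value = 289;

    float implicit_f = i_value;                     // implicit conversion
    float explicit_f = static_cast<float>(i_value); // equivalent explicit cast

    std::printf("%f %f\n", implicit_f, explicit_f); // prints 289.000000 289.000000
    return 0;
}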
Related
I'm wondering why the integer ii can be initialized at compile time, but not the float ff here:
int main() {
    const int i = 1;
    constexpr int ii = i;
    const float f = 1.0;
    constexpr float ff = f;
}
This is what happens when I try to compile:
> g++ -std=c++11 test.cc
test.cc: In function ‘int main()’:
test.cc:6:24: error: the value of ‘f’ is not usable in a constant expression
constexpr float ff = f;
^
test.cc:5:15: note: ‘f’ was not declared ‘constexpr’
const float f = 1.0;
Constant variables of integral types with constant initializers are integral constant expressions (de facto implicitly constexpr; see expr.const in ISO C++). float is not an integral type and does not meet the requirements for a constant expression without the use of constexpr. (A similar case is why an int can be a template parameter but a float cannot.)
In C++, constant integers are treated differently from other constant types. If they are initialized with a compile-time constant expression, they can themselves be used in compile-time expressions. This was done so that an array size could be a const int instead of a #define (as you were forced to do in C):
(Assume no VLA extensions)
const int s = 10;
int a[s]; // OK in C++
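By contrast, to get the float case through the compiler you have to mark the initializer itself constexpr. A minimal sketch of the fix:

int main() {
    const int i = 1;
    constexpr int ii = i;     // OK: a const int with a constant initializer
                              // is itself a constant expression

    constexpr float f = 1.0f; // declare f constexpr rather than just const...
    constexpr float ff = f;   // ...and this line now compiles

    (void)ii; (void)ff;       // silence unused-variable warnings
    return 0;
}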
I tried to compute a float multiplication and observed that the value was getting saturated at 65536 and was not updating. The issue is only with the code below.
I tried this with an online GCC compiler and the issue was still the same.
Does this have anything to do with float precision? Is the compiler optimizing away my float precision during the operation?
Are there any compiler flags that I can add to overcome this issue?
Can anyone please guide me on how to solve it?
Attaching the code for reference:
#include <stdio.h>

int main()
{
    float dummy1, dummy2;
    unsigned int i = 0;

    printf("Hello World");
    printf("size of float = %zu\n", sizeof(dummy1));

    dummy2 = 0.0;
    dummy1 = 65535.5;
    dummy2 = 60.00 * 0.00005;

    for (i = 0; i < 300; i++)
    {
        dummy1 = dummy1 + dummy2;
        printf("dummy1 = %f %f\n", dummy1, dummy2);
    }
    return 0;
}
(This answer presumes IEEE-754 single- and double-precision binary formats are used for float and double.)
60.00 * 0.00005 is computed with double arithmetic and produces 0.003000000000000000062450045135165055398829281330108642578125. When this is stored in dummy2, it is converted to 0.0030000000260770320892333984375.
In the loop, dummy1 eventually reaches the value 65535.99609375. Then, when dummy1 and dummy2 are added, the result computed with real-number arithmetic would be 65535.9990937500260770320892333984375. This value is not representable in the float format, so it is rounded to the nearest value representable in the float format, and that is the result that the + operator produces.
The nearest representable values in the float format are 65535.99609375 and 65536. Since 65536 is closer to 65535.9990937500260770320892333984375, it is the result.
In the next iteration, 65536 and 0.0030000000260770320892333984375 are added. The real-arithmetic result would be 65536.0030000000260770320892333984375. This is also not representable in float. The nearest representable values are 65536 and 65536.0078125. Again 65536 is closer, so it is the computed result.
From then on, the loop always produces 65536 as a result.
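A short sketch (assuming the same IEEE-754 float format) that makes the neighbors of 65536 and the rounding at the tipping point visible:

#include <cmath>
#include <cstdio>

int main()
{
    // The representable floats on either side of 65536:
    float below = std::nextafterf(65536.0f, 0.0f);    // 65535.99609375
    float above = std::nextafterf(65536.0f, 1.0e30f); // 65536.0078125
    std::printf("%.10f\n%.10f\n", below, above);

    // The sum at the tipping point of the loop: the real-number result lies
    // between 65535.99609375 and 65536, and 65536 is the closer float.
    float step = 0.003f;                 // stored as 0.0030000000260770320892...
    float sum  = 65535.99609375f + step;
    std::printf("%.10f\n", sum);         // prints 65536.0000000000
    return 0;
}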
You can get better results either by using double arithmetic or by computing dummy1 afresh in each iteration instead of accumulating rounding errors from iteration to iteration:
for (i = 0; i < 300; ++i)
{
    dummy1 = 65535.5 + i * 60. * .00005;
    printf("%.99g\n", dummy1);
}
Note that because dummy1 is a float, it does not have the precision required to distinguish some successive values of the sequence. For example, output of the above includes:
65535.9921875
65535.99609375
65535.99609375
65536
65536.0078125
65536.0078125
65536.0078125
65536.015625
65536.015625
65536.015625
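And a self-contained sketch of the other suggested fix, accumulating in double instead of float:

#include <cstdio>

int main()
{
    double acc  = 65535.5;           // accumulate in double instead of float
    double step = 60.00 * 0.00005;   // 0.003 at double precision
    for (int i = 0; i < 300; i++)
    {
        acc += step;
        printf("acc = %f\n", acc);   // keeps increasing past 65536
    }
    return 0;
}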
Problem
I am working on flash memory optimization for an STM32F051. It turned out that conversion between float and int types consumes a lot of flash.
Digging into this, I found that the conversion to int takes around 200 bytes of flash memory, while the conversion to unsigned int takes around 1500 bytes!
It's known that int and unsigned int differ only in the interpretation of the sign bit, so this behavior is a great mystery to me.
Note: Performing the 2-stage conversion float -> int -> unsigned int also consumes only around 200 bytes.
Questions
Analyzing this, I have the following questions:
1) What is the mechanism of the conversion from float to unsigned int? Why does it take so much memory, while the conversion float -> int -> unsigned int takes so little? Maybe it's connected with the IEEE 754 standard?
2) Are there any problems to be expected when the conversion float -> int -> unsigned int is used instead of the direct float -> unsigned int?
3) Are there any ways to wrap the float -> unsigned int conversion while keeping a low memory footprint?
Note: A similar question has already been asked here (Trying to understand how the casting/conversion is done by compiler, e.g., when cast from float to int), but there is still no clear answer, and my question is about the memory usage.
Technical data
Compiler: ARM-NONE-EABI-GCC (gcc version 4.9.3 20141119 (release))
MCU: STM32F051
MCU's core: 32 bit ARM CORTEX-M0
Code example
float -> int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    i = f;
    return 0;
}
float -> unsigned int (~1500 bytes! of flash)
int main() {
    volatile float f;
    volatile unsigned int ui;
    ui = f;
    return 0;
}
float -> int -> unsigned int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    volatile unsigned int ui;
    i = f;
    ui = i;
    return 0;
}
There is no fundamental reason why the conversion from float to unsigned int should be larger than the conversion from float to signed int; in practice, the float to unsigned int conversion can even be made smaller than the float to signed int conversion.
I did some investigation using the GNU Arm Embedded Toolchain (Version 7-2018-q2), and as far as I can see the size problem is due to a flaw in the gcc runtime library. For some reason this library does not provide a specialized version of the __aeabi_f2uiz function for Arm v6-M; instead, it falls back on a much larger general version.
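Regarding question 3, one workaround you could measure is to wrap the conversion yourself so that only the (small) signed helper is pulled in. A hedged sketch, assuming 32-bit int/unsigned and non-negative, finite, in-range input; the name float_to_uint is just illustrative, and whether it actually saves flash depends on which other soft-float routines your image already links in:

#include <stdint.h>

/* Convert a float to unsigned int by way of the signed conversion, which
 * (per the measurements above) pulls in the much smaller __aeabi_f2iz
 * helper instead of the generic __aeabi_f2uiz. */
static inline uint32_t float_to_uint(float f)
{
    if (f >= 2147483648.0f) {
        /* Values >= 2^31 would overflow a signed int, so shift them down
         * by 2^31 first and add the offset back after the conversion. */
        return (uint32_t)(int32_t)(f - 2147483648.0f) + 0x80000000u;
    }
    return (uint32_t)(int32_t)f;  /* 0 <= f < 2^31: the cheap path */
}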
I want to define a Point message in Protocol Buffers which represents an RGB Colored Point in 3-dimensional space.
message Point {
    float x = 1;
    float y = 2;
    float z = 3;
    uint8_t r = 4;
    uint8_t g = 5;
    uint8_t b = 6;
}
Here, the x, y, z variables define the position of the Point, and r, g, b define its color in RGB space.
Since uint8_t is not defined in Protocol Buffers, I am looking for a workaround to define it. At present, I am using uint32 in place of uint8_t.
There isn't anything in protobuf that represents a single byte - it simply isn't a thing that the wire-format worries about. The options are:
varint (up to 64 bits input, up to 10 bytes on the wire depending on the highest set bit)
fixed 32 bit
fixed 64 bit
length-prefixed (strings, sub-objects, packed arrays)
(group tokens; a rare implementation detail)
A single byte isn't a good fit for any of those. Frankly, I'd use a single fixed32 for all 3, and combine/decompose the 3 bytes manually (via shifting etc). The advantage here is that it would only have one field header for the 3 bytes, and wouldn't be artificially stretched via having high bits (I'm not sure that a composed RGB value is a good candidate for varint). You'd also have a spare byte if you want to add something else at a later date (alpha, maybe).
So:
message Point {
    float x = 1;
    float y = 2;
    float z = 3;
    fixed32 rgb = 4;
}
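A hedged sketch of the manual combine/decompose described above (helper names are illustrative; this assumes a 0x00RRGGBB layout, with the top byte kept spare for e.g. alpha later):

#include <cstdint>

// Pack three 8-bit channels into the single fixed32 rgb field.
inline std::uint32_t pack_rgb(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return (std::uint32_t(r) << 16) | (std::uint32_t(g) << 8) | std::uint32_t(b);
}

// Decompose the fixed32 rgb field back into its three channels.
inline void unpack_rgb(std::uint32_t rgb,
                       std::uint8_t& r, std::uint8_t& g, std::uint8_t& b)
{
    r = static_cast<std::uint8_t>((rgb >> 16) & 0xFF);
    g = static_cast<std::uint8_t>((rgb >> 8) & 0xFF);
    b = static_cast<std::uint8_t>(rgb & 0xFF);
}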
IMHO this is the correct approach. You should use the nearest data type capable of holding all values to be sent between the systems. The source and destination systems should validate that the data is in the correct range. For uint8_t, this is indeed int32.
Some Protocol Buffers implementations actually allow this. In particular, nanopb lets you either have an .options file alongside the .proto file or use its extension directly in the .proto file to fine-tune the interpretation of individual fields.
Specifying int_size = IS_8 will turn a uint32 field in the message into a uint8_t in the generated structure.
import "nanopb.proto";
message Point {
    float x = 1;
    float y = 2;
    float z = 3;
    uint32 r = 4 [(nanopb).int_size = IS_8];
    uint32 g = 5 [(nanopb).int_size = IS_8];
    uint32 b = 6 [(nanopb).int_size = IS_8];
}
I am using the following code to cast from float to int. I always have a float with at most one decimal place. First I multiply it by 10 and then cast it to int:
float temp1 = float.Parse(textBox.Text);
int temp = (int)(temp1*10);
For 25.3 I get 252 after the cast, but for 25.2 and 25.4 I get the correct output, 252 and 254 respectively.
Performing the same operation in a slightly different way gives the correct output:
float temp1 = float.Parse(textBox.Text);
temp1 = temp1*10;
int temp = (int)temp1;
Now for 25.3 I get 253. What is the reason for this, given that logically the first method is also correct? I am using Visual Studio 2010.
It's all because of the precision of float and double and how they are rounded off to integer.
By default, the arithmetic operation is performed in double precision.
Here is your first code:
float temp1 = float.Parse(textBox.Text);
int temp = (int)(temp1*10);
which effectively gets executed as:
float temp1 = float.Parse(textBox.Text);
double xVariable = temp1*10
int temp = (int)xVariable;
Here is your second code, where the multiplication result is stored as a float before the conversion:
float temp1 = float.Parse(textBox.Text);
float xVariable = temp1*10;
int temp = (int)xVariable;
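For what it's worth, the underlying rounding difference can be reproduced outside C# as well; a small C++ sketch (illustrating only the float-vs-double intermediate, not C# evaluation rules):

#include <cstdio>

int main()
{
    float temp1 = 25.3f;  // stored as 25.299999237060546875

    // Intermediate product kept in double: 252.99999237060546875 -> truncates to 252.
    int viaDouble = (int)(static_cast<double>(temp1) * 10.0);

    // Intermediate product rounded to float: 253.0f -> 253.
    float asFloat = temp1 * 10.0f;
    int viaFloat  = (int)asFloat;

    std::printf("%d %d\n", viaDouble, viaFloat);  // prints 252 253
    return 0;
}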
More Information about precision
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
What range of numbers can be represented in a 16-, 32- and 64-bit IEEE-754 systems?
Try:
decimal temp1 = decimal.Parse(textBox.Text);
int temp = (int)(temp1*10);
Use this:
float temp1 = 25.3f;
int temp = Convert.ToInt32(temp1 * 10);