Why do I not get zero when taking ADC readings on a Raspberry Pi Pico? Even though I ground the ADC pin, the analog reading always fluctuates between 10 and 20. How can the analog reading be reduced to zero?
Question
How can I make the Rpi Pico ADC produce an output range starting from zero, hopefully without the fluctuation between 10 and 20?
Answer
I googled and found that How2Electronics has a newbie-friendly tutorial on the Pico ADC, with a short demo program (References 1, 2; Appendices B, C).
The demo code does not seem to show the problem reported by the OP. So I decided to try it in my own Pico setup, to see if I can reproduce the OP's situation.
I read the Pico datasheet, which says that the Pico ADC has an intrinsic offset of about 30mV, and that this offset can be reduced by using an external 3.0V voltage reference (Appendix D).
Now the OP says his ADC reading with the pin grounded always fluctuates between 10 and 20. So let me check whether this 10~20 offset is within the Pico datasheet's ~30mV intrinsic zero offset. The Pico ADC resolution is 12 bits, so if the 3V3 power rail is used as the analog reference, the intrinsic offset in counts is about 30mV / 3300mV * 4096 ~= 37. In other words, the OP's zero offset of 10~20 counts actually seems to be within spec.
I have not verified my always dodgy calculation, so let me check it by working backwards: what is the offset in mV if the offset count is 20 out of 4096? With a 3V3 analog reference, offset = 3300mV * 20 / 4096 ~= 16mV, which is below the datasheet's ~30mV, so the numbers look consistent.
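To keep my dodgy arithmetic honest, below is a minimal plain-Python sketch of the two conversions above (spec millivolts to counts, and counts back to millivolts), assuming a 12-bit result scale and the 3V3 rail as the analog reference:

# Minimal sketch: convert between 12-bit ADC counts and millivolts,
# assuming the 3V3 rail (3300 mV) is the analog reference.
V_REF_MV = 3300
FULL_SCALE = 4096  # 12-bit resolution

def countsToMv(counts):
    return V_REF_MV * counts / FULL_SCALE

def mvToCounts(mv):
    return mv / V_REF_MV * FULL_SCALE

print(mvToCounts(30))   # datasheet's ~30mV offset ~= 37 counts
print(countsToMv(20))   # the OP's worst-case reading of 20 ~= 16mV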
Anyway, perhaps I can repeat the OP's test and compare with his zero offsets.
I modified and expanded How2Electronics's demo program to do the following:
(a) read all three ADC pins, ADC0, ADC1, ADC2; (b) find the max, min, and average values and print out the results (Appendix F).
I then used the program to read the ADC0, 1, 2 pins connected to (a) analog ground, and (b) the enabled 3V3 output. The ADC results are summarized below:
--- Sample Output ---
%Run -c $EDITOR_CONTENT
Name = testPicoAdcV01()
Function = test Pico ADC pins GP26, 27, 28
Date = 2022apr25hkt1111
ADC Results = [368, 368, 336] (ADC0, 1, 2 connected to Analog Ground)
Max = 368
Min = 336
Avg = 357
ADC Results = [65391, 65519, 65519] (ADC0, 1, 2 connected to 3V3)
Max = 65519
Min = 65391
Avg = 65476
The average ADC output is 65476 for the 3V3 input and 357 for analog ground, i.e.
an ADC zero offset of 357 / 65535 ~= 0.54%, or 3300mV * 0.54% ~= 18mV.
This measured zero offset of about 18mV is within the datasheet's ~30mV intrinsic offset (not yet verified independently), but it is still not zero. So I am thinking of using an external 3.0V voltage reference on the Pico's ADC_VREF (Analog Reference) pin, hoping that accuracy can be improved. I will be trying the voltage reference IC TL431 to check if it is good (Appendix H, Ref 5).
Using a 2.5V external reference, the zero offset improved to 0.4%, but that is still not very good. I would prefer to use the AD7705, with two true differential input channels, a built-in PGA, and ready-assembled modules with an on-board LM285-2.5 voltage reference.
/ to continue, ...
References
(1) How to use ADC in Raspberry Pi Pico - How2Electronics, 2021apr21
(2) Raspberry Pi Pico Complete Guide [Pinout + Features + ADC (08:56) + I2C + OLED + Internal Temperature Sensor + DHT11] - How2Electronics
(3) Rpi Pico Datasheet (4.3. Using the ADC) - Rpi
(4) Raspberry Pi Pico 3.3V_EN pin control voltage inquiry - EE.SE, Asked 2021jun22, Viewed 1k times
(5) TL431 / TL432 Precision Programmable Reference IC - TI
(6) LM385B-2.5 2.5V Micropower Voltage References - TI
(7) Zonri ADS1256 24-Bit Sampling Module, ADC Module, Single/Differential Input - AliExpress US$26
(8) AD7705 SPI 2 fully differential input channel 16-bit Σ-Δ ADC, PGA Datasheet - Analog Devices
(9) AD7705/TM7705 16-Bit ADC Module, Input Gain, Programmable SPI Interface (LM285-2.5 external voltage reference) - AliExpress US$1.2
/ to continue, ...
Appendices
Appendix A - Rpi Pico ADC Pinout
Appendix B - Rpi Pico wiring for testing ADC program v0.1
-------------------------------------------------------------------------------------
Pin name          Pin #   Connected to
-------------------------------------------------------------------------------------
Rpi 3V3 Output    36      -
Rpi 3V3 Enable    37      -
Analog Gnd        33      Rpi Pico Ground
Analog Ref        35      Rpi 3V3
ADC0              26      Rpi Ground
ADC1              27      Rpi 3V3
ADC2              28      2V5
ADC3              -       Not available, connected to the Pico's internal temperature sensor
-------------------------------------------------------------------------------------
Appendix C - How2Electronics's ADC Demo Code
Appendix D - How to improve ADC Performance
Rpi Pico Datasheet (4.3. Using the ADC) - Rpi
4.3. Using the ADC
The RP2040 ADC does not have an on-board reference and therefore uses its own power supply as a reference.
On Pico the ADC_AVDD pin (the ADC supply) is generated from the SMPS 3.3V by using an R-C filter (201 ohms into 2.2μF). This is a simple solution but does have the following drawbacks:
We are relying on the 3.3V SMPS output accuracy which isn’t great
We can only do so much filtering and therefore ADC_AVDD will be somewhat noisy
The ADC draws current (about 150μA if the temperature sense diode is disabled, but it varies from chip to chip) and therefore there will be an inherent offset of about 150μA × 200Ω = ~30mV. There is a small difference in current draw when the ADC is sampling (about +20μA) so that offset will also vary with sampling as well as operating temperature.
Changing the resistance between the ADC_VREF and 3V3 pin can reduce the offset at the expense of more noise - which may be OK especially if the use case can support averaging over multiple samples.
Driving high the SMPS mode pin (GPIO23), to force the power supply into PWM mode, can greatly reduce the inherent ripple of the SMPS at light load, and therefore the ripple on the ADC supply. This does reduce the power efficiency of the board at light load, so the low-power PFM mode can be re-enabled between infrequent ADC measurements by driving GPIO23 low once more. See Section 4.4.
The ADC offset can be reduced by tying a second channel of the ADC to ground, and using this zero-measurement as an approximation to the offset.
For much improved ADC performance, an external 3.0V shunt reference, such as LM4040, can be connected from the ADC_VREF pin to ground.
Note that if doing this the ADC range is limited to 0-3.0V signals (rather than 0-3.3V), and the shunt reference will draw continuous current through the 200R filter resistor (3.3V-3.0V)/200 = ~1.5mA.
Note that the 1R resistor on Pico (R9) is designed to (maybe) help with shunt references that would otherwise become unstable when directly connected to 2.2μF. It also makes sure there is a little filtering even in the case that 3.3V and ADC_VREF are shorted together (which is a valid thing to do if you don’t care about noise and want to reduce the inherent offset).
Finally, R7 is a physically large 1608 metric (0603) package resistor, so can be relatively easily removed if a user wants to isolate ADC_VREF and do their own thing with the ADC voltage, for example powering it from an entirely separate voltage (e.g. 2.5V). Note that the ADC on RP2040 has only been qualified at 3.0/3.3V but should work down to about 2V.
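As a quick illustration of the datasheet's suggestion of tying a second ADC channel to ground and using that zero-measurement as an approximation of the offset (combined with averaging over multiple samples), here is a minimal MicroPython sketch; the wiring is my assumption (GP26 carries the signal, GP27 tied to analog ground), not something from the datasheet:

# Minimal MicroPython sketch of the datasheet's zero-measurement trick.
# Assumed wiring: GP26 = signal under test, GP27 tied to analog ground.
import machine

adcSignal = machine.ADC(26)   # signal channel
adcZero   = machine.ADC(27)   # grounded channel, approximates the offset

def readAvg(adc, samples=64):
    total = 0
    for _ in range(samples):
        total += adc.read_u16()
    return total / samples

def readCorrected():
    offset  = readAvg(adcZero)       # zero-measurement ~ intrinsic offset
    reading = readAvg(adcSignal)
    return max(reading - offset, 0)  # clamp so a grounded input reads 0

print(readCorrected())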
Appendix E - Pico 3V3 Enable pin to enable 3V3 Power
Raspberry Pi Pico 3.3V_EN pin control voltage inquiry - Asked 2021jun22, Viewed 1k times
Appendix F - Testing Pico's 3 ADC Pins V0.1
programName = 'testPicoAdcV01()'
programFunction = 'test Pico ADC pins GP26, 27, 28'
programDate = '2022apr25hkt1111'
programAuthor = 'tlfong01'
systemInfo = 'Chinese Windows 10, Thonny IDE 3.3.13, Python 3.7.9, Rpi Pico'
import machine
import utime
# *** Configuration ***
adcPinNum0 = 26
adcPinNum1 = 27
adcPinNum2 = 28
adcPin0 = machine.ADC(adcPinNum0)
adcPin1 = machine.ADC(adcPinNum1)
adcPin2 = machine.ADC(adcPinNum2)
adcPinDict = {
    '0': {
        'AdcPinNum': 0,
        'AdcPin': adcPin0,
    },
    '1': {
        'AdcPinNum': 1,
        'AdcPin': adcPin1,
    },
    '2': {
        'AdcPinNum': 2,
        'AdcPin': adcPin2,
    },
}
# *** Adc Functions ***
def getAdcResults(adcPinNum):
    # Read one ADC pin; read_u16() returns a 16-bit scaled result (0..65535)
    adcPin = adcPinDict[str(adcPinNum)]['AdcPin']
    adcResults = adcPin.read_u16()
    return adcResults
# *** Sample Test ***
#adcResults = getAdcResults(adcPinNum = 0)
#print(adcResults)
def getAdcResultsList(adcPinNumList):
    # Read every ADC pin in the list; results are indexed by pin number
    adcResultsList = [0] * len(adcPinNumList)
    for adcPinNum in adcPinNumList:
        adcResults = getAdcResults(adcPinNum)
        adcResultsList[adcPinNum] = adcResults
    return adcResultsList
# *** Sample Tests ***
#adcResultsList = getAdcResultsList([0, 1, 2])
#print(adcResultsList)
def printAdcResultsList(adcResultsList):
    print('ADC Results =', adcResultsList)
    print('Max =', max(adcResultsList))
    print('Min =', min(adcResultsList))
    print('Avg =', sum(adcResultsList) / len(adcResultsList))
    return
# *** Sample Tests ***
#adcResultsList = getAdcResultsList([0, 1, 2])
#printAdcResultsList(adcResultsList)
def testPicoAdcV01():
    print('Name =', programName)
    print('Function =', programFunction)
    print('Date =', programDate)
    adcResultsList = getAdcResultsList([0, 1, 2])
    printAdcResultsList(adcResultsList)
    return
# *** Main ***
testPicoAdcV01()
# *** End of Program ***
# *** Sample Output ***
'''
>>> %Run -c $EDITOR_CONTENT
Name = testPicoAdcV01()
Function = test Pico ADC pins GP26, 27, 28
Date = 2022apr25hkt1102
ADC Results = [208, 20645, 17828]
Max = 20645
Min = 208
Avg = 12893.67
>>>
'''
# *** End of sample output ***
Appendix G - ADC results with ADC0, 1, 2 connected to (a) Analog Ground, (b) Enabled 3V3 Output
Complete program listing with sample output
# *** Sample Output ***
'''
>>> %Run -c $EDITOR_CONTENT
Name = testPicoAdcV01()
Function = test Pico ADC pins GP26, 27, 28
Date = 2022apr25hkt1111
ADC Results = [368, 368, 336] (ADC0, 1, 2 connected to Analog Ground)
Max = 368
Min = 336
Avg = 357
ADC Results = [65391, 65519, 65519] (ADC0, 1, 2 connected to 3V3)
Max = 65519
Min = 65391
Avg = 65476
>>>
'''
Appendix H - Rpi Pico ADC External Analog Voltage Reference Using TL431
Appendix I - Rpi Pico ADC External Analog Voltage Reference Using TI LM385B-2.5 2V5 Voltage Reference
LM385B-2.5 Micropower Voltage References - TI
Appendix J - Zonri ADS1256 24-bit ADC Module to calibrate Rpi Pico ADC Pins
Appendix K - AD7705 16-bit ADC to Calibrate Rpi Pico ADC
(8) AD7705 SPI 2 fully differential input channel 16-bit Σ-Δ ADC Datasheet - Analog Devices
(9) AD7705 16-Bit ADC Module, Input Gain, Programmable SPI Interface, TM7705 - AliExpress US$1.2
/ to continue in TEAMS, ...
.END
Related
I am accelerating an MPI program using cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage, and other statistics of the GPU after the program has run, especially the FLOPS.
I have read the relevant question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:
The model count of an operation divided by the cost time of the operation
Using NVIDIA's profiling tools
The first solution doesn't depend on any tools. But I'm not sure about the meaning of model count. Is it O(f(N))? Like, is the model count of GEMM O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the cost time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?
The second solution uses nvprof, which is deprecated now and replaced by Nsight System and Nsight Compute. But those two tools only work for CUDA program, instead of MPI program launching CUDA function. So I am wondering whether there is a tool to profile the program launching CUDA function.
I have been searching for this question for two days but still can't find an acceptable solution.
But I'm not sure the meaning of model count. It's O(f(N))? Like the model count of GEMM is O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the cost time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?
The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C. For A (m by k), B (k by n), and C (m by n), each inner product of a row of A and a column of B, multiplied by alpha, is 2 * k + 1 flop; there are m * n inner products in A dot B, and another 2 * m * n flop for adding beta * C to that product. So the total model flop count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.
For your example, assuming alpha = 1 and beta = 0, and that the implementation is smart enough to skip the extra operations (most are), the GEMM flop count is (2 * 5) * (4 * 6) = 240. If the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.
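For what it is worth, here is a small Python sketch of that model count (just the formulas above, not a profiler; the function name is mine):

# Model GEMM flop count for C <- alpha * (A dot B) + beta * C,
# with A (m x k), B (k x n), C (m x n), per the convention above.
def gemmFlops(m, k, n, alphaIsOne=True, betaIsZero=True):
    if alphaIsOne and betaIsZero:
        return (2 * k) * (m * n)    # extra operations skipped
    return (2 * k + 3) * (m * n)    # alpha and beta both non-zero

flops = gemmFlops(4, 5, 6)          # the 4 x 5 times 5 x 6 example
print(flops, flops / 0.5)           # 240 flop, 480 flop/s at 0.5 s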
I would recommend using that approach if you really need to calculate performance of GEMM (or other BLAS/LAPACK operations). This is the way that most of the computer linear algebra literature and benchmarking has worked since the 1970’s and how most reported results you will find are calculated, including the HPC LINPACK benchmark.
The Nsight Systems documentation section Using the CLI to Analyze MPI Codes states clearly how to use nsys to collect MPI program runtime information.
And the GitLab project Roofline Model on NVIDIA GPUs uses ncu to collect the run-time FLOPS and memory usage of the program. The methodology to compute these metrics is:
Time: sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
  DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
  SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
  HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
  Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
  DRAM: dram__bytes.sum
  L2: lts__t_bytes.sum
  L1: l1tex__t_bytes.sum
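Once ncu has produced those metric sums, combining them is plain arithmetic; here is a small Python sketch using the double-precision formulas above (the metric values are made-up placeholders, not real measurements):

# Combine Nsight Compute metric sums into achieved flop/s, per the
# methodology above. All numbers below are placeholders for illustration.
metrics = {
    "sm__cycles_elapsed.avg": 1.0e9,
    "sm__cycles_elapsed.avg.per_second": 1.0e9,
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e7,
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 4.0e7,
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 1.0e7,
}

time_s = (metrics["sm__cycles_elapsed.avg"]
          / metrics["sm__cycles_elapsed.avg.per_second"])

dp_flop = (metrics["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
           + 2 * metrics["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"]
           + metrics["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"])

print("DP throughput:", dp_flop / time_s, "flop/s")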
Does anybody know the binary format of a PeakTech 1330 oscilloscope?
What I do know:
The first 32 bytes seem to be a header describing the instrument.
The last 94 bytes seem to describe the settings (gain, time scale, channels used, ...) - but I have no clue about the coding.
In the middle, it looks like a dump of the ADC samples (1 byte per sample).
What I need:
I want to read the scaling from the last 94 bytes to give the data a physical meaning in volts and seconds (multiplying ADC values by gain factors, and sample numbers by the time scale).
byte 0..9: header holding the device name
byte 23..26: record length (total), MS byte at 23
byte 28..31: data field length (MSB at byte 28)
byte 32..end_data: ADC sample values (-128..+127)
end_data+1+x
x=6..9: number of sample points per channel
x=17: time scale (2ns/div=x00 .. 100s/div=x20)
x=18..21: trigger offset, MSB first, 1 LSB = 0.2ns
x=26: length of channel description (n*67 bytes, n = number of channels)
x=27..29: channel name (CH1, CH2, CH3, ...)
x=38..41: trigger delay
x=42..45: samples in visible range
x=46..49: number of samples outside visible range
x=62..65: total number of samples
x=73: vertical offset, 1 LSB = 0.04div
x=77: sensitivity (20mV/div=x00 .. 50V/div=x0A)
x=82..85: measured frequency, 32-bit float, big endian (sign is in byte 82, mantissa LSB is in byte 85)
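Based purely on the layout sketched above (so treat every offset as provisional), a Python parser for the fixed header fields might start like this; the file name is hypothetical:

# Provisional parser for the layout described above; all offsets come
# from these notes and may need adjusting against a real capture.
import struct

def parsePeaktech(path):
    with open(path, "rb") as f:
        data = f.read()
    deviceName = data[0:10].decode("ascii", errors="replace")
    recordLen, = struct.unpack(">I", data[23:27])  # bytes 23..26, MSB first
    dataLen,   = struct.unpack(">I", data[28:32])  # bytes 28..31, MSB first
    # bytes 32 .. 32+dataLen: signed 8-bit ADC samples (-128..+127)
    samples = struct.unpack("%db" % dataLen, data[32:32 + dataLen])
    return deviceName, recordLen, dataLen, samples

name, recLen, datLen, samples = parsePeaktech("capture.bin")  # hypothetical file
print(name, recLen, datLen, samples[:10])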
I have an absolute encoder which is outputting a 10 bit value (0 to 1023) in Gray code. The problem I am trying to solve is how to figure out if the encoder is moving forwards or backwards.
I decided that the “best” algorithm is as follows:
First, I convert the Gray code to regular binary (full credit to the last answer in: https://www.daniweb.com/programming/software-development/code/216355/gray-code-conversion):
int grayCodeToBinaryConversion(int bits)
{
    bits ^= bits >> 16; // remove if word is 16 bits or less
    bits ^= bits >> 8;  // remove if word is 8 bits or less
    bits ^= bits >> 4;
    bits ^= bits >> 2;
    bits ^= bits >> 1;
    return bits;
}
Second I compare two values that were sampled apart by 250 milliseconds. I thought that comparing two values will let me know if I am moving forwards or backwards. For example:
if ((SampleTwo - SampleOne) > 0)
{
    //forward motion actions
}
if ((SampleTwo - SampleOne) < 0)
{
    //reverse motion actions
}
if (SampleTwo == SampleOne)
{
    //no motion action
}
Right as I started to feel smart, to my disappointment I realized this algorithm has a fatal flaw. This solution works great when I am comparing a binary value of, say, 824 to 1015. At that point I know which way the encoder is moving. However, at some point the encoder will roll over from 1023 to 0 and climb again, and when I then compare a first sampled value of, say, 1015 to a second sampled value of, say, 44, even though I am physically moving in the same direction, the logic I have written does not capture this correctly. Another non-starter is taking the Gray code values as ints and comparing the two ints directly.
How do I compare two Gray code values that were taken 250 milliseconds apart and determine the direction of rotation, while taking into account the rollover of the encoder? If you would be so kind as to help, could you please provide a simple code example?
Suppose A is your initial reading, and B is the reading after 250ms.
Let's take an example of A = 950 and B = 250 here.
Let's assume the encoder is moving forwards (its value is increasing with time).
Then, the distance covered is (B - A + 1024) % 1024. Let's call this d_forward.
For this example, d_forward comes out to be (250 - 950 + 1024) % 1024 = 324.
The distance covered going backwards (d_backward) would be 1024 - d_forward; which is 700.
The minimum of d_forward and d_backward would give the direction the encoder is travelling.
This will not work if the encoder is going to travel more than 1023/2 units in 250ms. In such a case, you should decrease the intervals between taking readings.
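A minimal Python sketch of this approach (the function name is mine, matching the question's 10-bit encoder):

# d_forward / d_backward comparison for a 10-bit (0..1023) encoder;
# returns +1 for forward, -1 for backward, 0 for no motion.
def direction(a, b, counts=1024):
    dForward = (b - a + counts) % counts
    if dForward == 0:
        return 0
    dBackward = counts - dForward
    return 1 if dForward < dBackward else -1

print(direction(950, 250))  # d_forward 324 < d_backward 700 -> 1 (forward)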
Rishav's answer is correct, but it can be more easily calculated.
Let A and B be two readings made 250ms apart and converted from gray code to binary.
The difference in encoder position is just diff = ((1536 + B - A) & 1023) - 512. If you'd prefer not to use bitwise math, then diff = ((1536 + B - A) % 1024) - 512.
Note that 1536 is 1024 + 512, and the answer diff is determined by two constraints:
(1) diff = (B - A) mod 1024
(2) diff is in the range [-512, 511], which is the normal range for a 10-bit signed number.
If your encoder is allowed/expected to go faster in one direction than the other, then you can adjust the range in (2).
To allow answers in the range [MIN,MIN+1023], use diff = ((1024 - MIN + B - A) % 1024) + MIN
If MIN is positive, add a large enough multiple of 1024 to the dividend to make sure it stays positive before you take the modulus, since the modulus operator in most languages behaves oddly with negative numbers.
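As a quick Python check that the single formula agrees with the worked example above (again just a sketch):

# Signed encoder difference in [-512, 511], per the formula above.
def encoderDiff(a, b):
    return ((1536 + b - a) & 1023) - 512

print(encoderDiff(950, 250))   # 324, matching d_forward above
print(encoderDiff(1015, 44))   # 53, the question's rollover case: forward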
I gave the two GPUs on my machine a try and I expected the Titan Xp to be faster than the Quadro P400. However, both gave almost the same execution time.
I need to know whether PyTorch will dynamically choose one GPU over the other, or whether I have to specify which one PyTorch uses at run-time.
Here is the code snippet used in the test:
import torch
import time
def do_something(gpu_device):
    torch.cuda.set_device(gpu_device)  # select which GPU to use by device number
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    a = torch.randn(100000000).cuda()
    xx = time.time() - strt
    print("execution time, to create 1E8 random numbers, is ", xx)
    # print(a)
    # print(a + 2)
no_of_GPUs = torch.cuda.device_count()
print("how many GPUs are there:", no_of_GPUs)

for i in range(0, no_of_GPUs):
    print(i, "th GPU is", torch.cuda.get_device_name(i))
    do_something(i)
Sample output:
how many GPUs are there: 2
0 th GPU is TITAN Xp COLLECTORS EDITION
current GPU device 0
execution time, to create 1E8 random numbers, is 5.527713775634766
1 th GPU is Quadro P400
current GPU device 1
execution time, to create 1E8 random numbers, is 5.511776685714722
Despite what you might believe, the lack of a performance difference which you see is because the random number generation is being run on your host CPU, not the GPU. If I modify your do_something routine like this:
def do_something(gpu_device, ongpu=False, N=100000000):
    torch.cuda.set_device(gpu_device)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    if ongpu:
        a = torch.cuda.FloatTensor(N).normal_()
    else:
        a = torch.randn(N).cuda()
    print("execution time, to create 1E8 random no, is ", time.time() - strt)
    return a
and run it two ways, I get very different execution times:
In [4]: do_something(0)
current GPU device 0
execution time, to create 1E8 random no, is 7.736972808837891
Out[4]:
-9.3955e-01
-1.9721e-01
-1.1502e+00
......
-1.2428e+00
3.1547e-01
-2.1870e+00
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
In [5]: do_something(0,True)
current GPU device 0
execution time, to create 1E8 random no, is 0.001735687255859375
Out[5]:
4.1403e+06
5.7016e+06
1.2710e+07
......
8.9790e+06
1.3779e+07
8.0731e+06
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
i.e. your version takes 7 seconds and mine takes 1.7ms. I think it is obvious which one ran on the GPU....
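One caveat to hedge that timing with: CUDA calls are asynchronous, so if you want the measured time to include kernel completion rather than just the launch, you need a torch.cuda.synchronize() before reading the clock. A minimal sketch along the lines of the code above (function name mine):

import time
import torch

def timedGpuRandn(n=100000000, device=0):
    torch.cuda.set_device(device)
    torch.cuda.synchronize()            # finish any prior GPU work first
    strt = time.time()
    a = torch.cuda.FloatTensor(n).normal_()
    torch.cuda.synchronize()            # wait for the kernel to complete
    return a, time.time() - strt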
I came across very strange pointer arithmetic behaviour. I am developing a program to access an SD card from an LPC2148, using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like the following (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
When printing individual bytes, it prints perfectly:
char *ch = buffer; /* char buffer[512]; */
for (i = 0; i < 12; i++)
    debug("%x ", *ch++);
Here the debug function sends its output to the UART.
However, pointer arithmetic, especially adding an offset that is not a multiple of 4, gives very strange results:
uint32_t *p; // uint32_t is a typedef for unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -0 and -s, but there was no change.
I thought of little/big endian issues, but here I am getting unexpected data (from previous bytes), not byte-order reversal.
Your CPU architecture must support unaligned load and store operations for this to work. To the best of my knowledge, it doesn't (and I've been using the STM32, which is an ARM Cortex-based controller).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what the address of your buffer is, but at least one of the three uint32_t read attempts that you describe in your question requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint08_t x = *(uint08_t*)y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint08_t b; // offset 2, divisible by sizeof(uint08_t), which is 1
    uint08_t c; // offset 3, divisible by sizeof(uint08_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
};
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
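A quick way to sanity-check such a layout without firing up the cross-compiler is Python's ctypes, which applies the platform's native C alignment rules; this small sketch mirrors the example structure above:

# Check field offsets and padding of the example layout with ctypes.
import ctypes

class Sample(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_uint16),  # expected offset 0
        ("b", ctypes.c_uint8),   # expected offset 2
        ("c", ctypes.c_uint8),   # expected offset 3
        ("d", ctypes.c_uint32),  # expected offset 4
        ("e", ctypes.c_uint64),  # expected offset 8
    ]

for fieldName, _ in Sample._fields_:
    print(fieldName, "at offset", getattr(Sample, fieldName).offset)
print("sizeof =", ctypes.sizeof(Sample), "alignment =", ctypes.alignment(Sample))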
LDR is the ARM instruction to load data. You have lied to the compiler, telling it that the pointer points to a properly aligned 32-bit value. It is not aligned properly. You pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically, your code behaves like

uint32_t v = *(uint32_t *)((char *)buffer + 0); /* aligned load of the enclosing word */
v = (v >> 16) | (v << 16);                      /* rotate by 8 * (address bits [1:0]) */
debug("%x ", v);                                /* prints 0134fe2a */

but it is encoded as a single instruction on the ARM. This behavior depends on the ARM CPU type and possibly on co-processor settings. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined, which means pretty much anything can happen.
In this case, the reason two of your examples behaved as you expected is that you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it had been, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which appears to be what you were hoping for), or in an aligned read (using the most significant 30 bits) and a rotation (the word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your third example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (That is, it should work on pretty much any machine with 8-bit bytes.)