How does a CPU know how to interpret an instruction? - cpu

The MIPS ISA has 3 instruction formats: R, I, J.
Each format is decoded in a different way.
(R type - opcode | rs | rt | rd | shamt | funct) etc.
Suppose we have some instruction - 10111010010000110011010101101110
How does the CPU know which way to decode the instruction?
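The key is that the opcode field (bits 31-26) occupies the same position in all three formats, so the decoder always reads it first, and the opcode alone selects the format: opcode 0 means R-type (the funct field then picks the operation), opcodes 2 and 3 are the J-type jumps (j/jal), and everything else is I-type. A minimal C sketch of that decision (my illustration, using the standard MIPS32 opcode map):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t insn = 0xBA43356Eu;   /* 10111010010000110011010101101110 */
    uint32_t opcode = insn >> 26;  /* bits 31..26 - same place in every format */
    if (opcode == 0x00)
        printf("R-type: funct (bits 5..0) selects the operation\n");
    else if (opcode == 0x02 || opcode == 0x03)
        printf("J-type: bits 25..0 are the jump target\n");
    else
        printf("I-type: opcode 0x%02X selects the operation\n", opcode);
    return 0;
}

For the example instruction above, the opcode is 0b101110, so it decodes as I-type.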


Why does the same OpenCL code have different outputs from an Intel Xeon CPU and an NVIDIA GTX 1080 Ti GPU?

I am trying to parallelize a Monte Carlo simulation by using OpenCL. I use MWC64X as a uniform random number generator. The code runs well on different Intel CPUs: the output of the parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40 and CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, thus the parallel influence only composes of 1 and 0, which is incorrect (in GPU runs) and doesn't happen when parallelizing on a Intel CPU.
When I check the output of the random number generator with if(flag==0) printf("randint=%u",randint);, the outputs all appear to be zero on the GPU. Below are the clinfo output and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };
void MWC64X_Step(mwc64x_state_t *s)
{
    uint X=s->x, C=s->c;
    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);
    s->x=Xn;
    s->c=Cn;
}

//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals){
        mwc64x_state_t rng;
        for(int i=0; i<count; i++){
            for(int j=0; j<size; j++){
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //    printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;
            }
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuet=1;
                    break;
                }
            }
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuef=1;
                    break;
                }
            }
            sum += valuet-valuef;
        }
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);
    }
}
What might be the problem when running the code on the GPU? Is it MWC64X? According to its author, it can perform well on NVIDIA GPUs. If so, how can I fix it; if not, what might the problem be?
(This started out as a comment, it turns out this was the source of the problem so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
    for(int j=0; j<size; j++){
        uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
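A minimal per-work-item seeding sketch (the constants here are arbitrary assumptions; the MWC64X distribution also ships a MWC64X_SeedStreams() helper that partitions one logical stream across work-items, which is the cleaner fix):

mwc64x_state_t rng;
uint gid = get_global_id(0);
rng.x = 0x12345678u ^ (gid * 0x9E3779B9u); // arbitrary, distinct per work-item
rng.c = 1234u + gid;                       // keep the carry small (well below MWC64X_A)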
All use-cases of pseudo-random numbers are a next-level challenge on truly [PARALLEL] computing platforms (not languages, platforms).
Either there is some source of true randomness, which gets us into trouble once massively-parallel requests have to be served fairly in a truly [PARALLEL] fashion (hardware resources may help here, yet at the cost of not being able to reproduce the same behaviour outside of this very platform and moment in time, unless such a source is software-operated with some seed-injection feature that sets up the "just"-pseudo-random algorithm producing a purely [SERIAL] sequence of "just"-pseudo-random numbers).
Or there is some "shared" generator of pseudo-random numbers, which enjoys a higher system-wide level of entropy (good for the resulting "quality" of pseudo-randomness), but at the cost of a purely serial dependence (no parallel execution is possible; the serial sequence gets served one number after another) and a near-zero chance of repeatable runs producing the same sequences, which are a must for reproducible science and are needed for testing and method-validation cases.
RESUME :
The code may employ work-item-"private" pseudo-random generating functions (privacy is a must, both for parallel code execution and for the mutual independence, i.e. non-interference, of the generated streams), yet each instance must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code-runs, and b) initialised in a repeatably reproducible manner, for the sake of re-running the tests at different times, often on different OpenCL target computing platforms.
For __kernel-s that do not rely on hardware-specific sources of randomness, meeting conditions a) and b) will suffice for receiving repeatably reproducible (same) results when testing "in vitro", while still providing reasonably random results during generic production-level code-runs "in vivo".
The comparison of net run-times (benchmarked above) seems to show that Amdahl's-law overhead costs, plus a tail-end effect of the atomicity of work, decided the outcome: the net run-time was ~3.6x faster on the XEON than on the GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
index1 = 17
size = 51
dim1_size = 6
sum0: 10000
sum1: 10000
sum2: 10000
sum3: 10000
sum4: 10000
sum5: 0
sum6: 0
sum7: 0
sum8: 10000
sum9: 0
sum10: 0
sum11: 0
sum12: 0
sum13: 0
sum14: 0
sum15: 0
sum16: 0
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL

A question about byte order and the Go standard library

We have the number 0x1234.
In big-endian:
low address -----------------> high address
0x12 | 0x34
In little-endian:
low address -----------------> high address
0x34 | 0x12
We can see the function below in binary.go:
func (bigEndian) PutUint16(b []byte, v uint16) {
    _ = b[1] // early bounds check to guarantee safety of writes below
    b[0] = byte(v >> 8)
    b[1] = byte(v)
}
I downloaded the Go source for the x86 and PowerPC architectures and found the same definition.
https://golang.org/dl/
go1.12.7.linux-ppc64le.tar.gz Archive Linux ppc64le 99MB 8eda20600d90247efbfa70d116d80056e11192d62592240975b2a8c53caa5bf3
Now let's see what happens in this function.
If the CPU is little-endian, we store 0x1234 in memory like this:
low address -----------------> high address
0x34 | 0x12
v >> 8 shifts right by 8 bits (i.e. divides by 2^8), so we get this in memory:
low address -----------------> high address
0x12 | 0x00
byte(v>>8) gives us byte 0x12, which is at the low address -> b[0]
byte(v) gives us byte 0x34 -> b[1]
So we get the result, which I think is right:
[0x12,0x34]
=====================================
If the CPU is big-endian, we store 0x1234 in memory like this:
low address -----------------> high address
0x12 | 0x34
v >> 8 shifts right by 8 bits (i.e. divides by 2^8), so we get this in memory:
low address -----------------> high address
0x00 | 0x12
byte(v>>8) gives us byte 0x00, which is at the low address -> b[0]
byte(v) gives us byte 0x12 -> b[1]
So we get the result, which I think is not right:
[0x00,0x12]
I found on the web how to check whether your CPU is big-endian or little-endian, and I wrote the function below:
// Note: needs import ("fmt"; "unsafe") at the top of the file.
func IsBigEndian() bool {
    test16 := uint16(0x1234)
    test8 := *(*uint8)(unsafe.Pointer(&test16))
    if test8 == 0x12 {
        return true
    }
    fmt.Printf("little")
    return false
}
According to this function, I think byte() takes the byte at the low address, am I right?
If so, why do I get the wrong result in my analysis of the "if the CPU is big-endian ..." case?
Thanks a lot @Volker, I found this post: Does bit-shift depend on endianness?. Now I know that "byte(xxx)" operates on a value in a processor register, which does not depend on the endianness in memory, so byte(0x1234) always yields 0x34.
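The same distinction can be shown in C, independently of Go (a small demonstration I'm adding, not part of the original post): the shift always yields 0x12 because it operates on the value, while dereferencing a byte pointer exposes the memory layout:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t v = 0x1234;
    uint8_t hi = (uint8_t)(v >> 8);  /* always 0x12: value semantics */
    uint8_t lo = (uint8_t)v;         /* always 0x34 */
    uint8_t first = *(uint8_t *)&v;  /* 0x12 on big-endian, 0x34 on little-endian */
    printf("hi=0x%02X lo=0x%02X first_byte_in_memory=0x%02X\n", hi, lo, first);
    return 0;
}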

ATTiny85 PWM for 4 LEDs

I need to control 4 individual LEDs via PWM on an ATTiny85. I have found lots of info on how to control 3 LEDs. But apparently to control 4 with PWM, you have to really twist the 85 into knots. Is there an easier way to handle 4 LEDs on the 85, or would it be better to step over to the 84? If I went with the 84, would I be likely to run into the same brick walls as with the 85?
I found this code for controlling 4 on the 85, but it's above my skill level. Anyone see any issues with it?
/* Four PWM Outputs */
// ATtiny85 outputs
const int Red = 0;
const int Green = 1;
const int Blue = 4;
const int White = 3;
volatile uint8_t* Port[] = {&OCR0A, &OCR0B, &OCR1A, &OCR1B};

void setup() {
    pinMode(Red, OUTPUT);
    pinMode(Green, OUTPUT);
    pinMode(Blue, OUTPUT);
    pinMode(White, OUTPUT);
    // Configure counter/timer0 for fast PWM on PB0 and PB1
    TCCR0A = 3<<COM0A0 | 3<<COM0B0 | 3<<WGM00;
    TCCR0B = 0<<WGM02 | 3<<CS00; // Optional; already set
    // Configure counter/timer1 for fast PWM on PB4
    TCCR1 = 1<<CTC1 | 1<<PWM1A | 3<<COM1A0 | 7<<CS10;
    GTCCR = 1<<PWM1B | 3<<COM1B0;
    // Interrupts on OC1A match and overflow
    TIMSK = TIMSK | 1<<OCIE1A | 1<<TOIE1;
}

ISR(TIMER1_COMPA_vect) {
    if (!bitRead(TIFR,TOV1)) bitSet(PORTB, White);
}

ISR(TIMER1_OVF_vect) {
    bitClear(PORTB, White);
}

// Sets colour Red=0 Green=1 Blue=2 White=3
// to specified intensity 0 (off) to 255 (max)
void SetColour (int colour, int intensity) {
    *Port[colour] = 255 - intensity;
}

void loop() {
    for (int i=-255; i <= 254; i++) {
        OCR0A = abs(i);
        OCR0B = 255-abs(i);
        OCR1A = abs(i);
        OCR1B = 255-abs(i);
        delay(10);
    }
}
If you want to save pins at the expense of a more complicated strategy, you can get away with only 3 pins by connecting the LEDs as two sets of two, driven as shown in the table below.
Instead of using the built-in PWM, you will need to do the PWM manually, by setting a timer and then changing the INPUT/OUTPUT state and ON/OFF level of each of the pins each time the timer expires.
+-----+----------+----------+----------+
| LED | A | B | C |
+-----+----------+----------+----------+
| 1 | OUTPUT 1 | INPUT | OUTPUT 0 |
| 2 | OUTPUT 0 | INPUT | OUTPUT 1 |
| 3 | INPUT | OUTPUT 1 | OUTPUT 0 |
| 4 | INPUT | OUTPUT 0 | OUTPUT 1 |
+-----+----------+----------+----------+
Update or comment if you want more details on this strategy.
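As a rough Arduino-style sketch of driving that table (my illustration; the pin assignments are placeholders), each row maps directly onto pinMode/digitalWrite calls, and soft PWM falls out of how long each row is held per timer cycle:

const int pins[3] = {0, 1, 2};  // A, B, C (assumed ATtiny85 pins)
// One row per LED: 1 = OUTPUT, 0 = INPUT (floating); levels only matter for outputs
const int mode[4][3]  = {{1,0,1}, {1,0,1}, {0,1,1}, {0,1,1}};
const int level[4][3] = {{1,0,0}, {0,0,1}, {0,1,0}, {0,0,1}};

void lightLed(int led) {  // led = 0..3, i.e. rows 1..4 of the table
    for (int p = 0; p < 3; p++) {
        pinMode(pins[p], mode[led][p] ? OUTPUT : INPUT);
        if (mode[led][p]) digitalWrite(pins[p], level[led][p] ? HIGH : LOW);
    }
}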
An easy strategy is to multiplex the 4 LEDs onto a single PWM pin. This will let you independently control the brightness of each LED on the ATtiny using 5 pins total.
So, for example, you could connect all 4 of the cathodes together and connect those to a single PWM pin. Then you connect each of the 4 anodes to a different IO pin.
At any given moment, only one of the anodes is in output mode - the others are left floating. This means that at most one LED is active, and its brightness is controlled by the PWM duty cycle.
You can then use the overflow ISR for the PWM timer to activate the next LED in the sequence after each PWM cycle. You also update the PWM match to reflect the brightness of the next LED.
If you rotate through the LEDs quickly (faster than, say, 60 times per second), then visually they all just look like they are on at the desired brightness. PWM, after all, is just blinking an LED too quickly to see, so we are just adding a second dimension on to it.
One downside: since only a single LED is on at any moment, the maximum total brightness will theoretically be 1/4 of what it would be if you drove all the LEDs independently. In practice this is likely not an issue, since the ATtiny is limited in how much current it can pass through all of its pins at once if you tried to light all the LEDs at the same time.
One hint: when setting up the PWM timer, make it so that the LED is OFF at the beginning of the cycle and turns on in the middle. This will give the ISR time to step to the next LED while all LEDs are off. This is better because it is easy to see an LED that is on when it should not be, but not so easy to see an LED that is off when it should be on.
One suggestion: I will get flamed for this, but you can leave out any current-limiting resistors when doing this, since each LED is only on for at most 1/4 of the time. This will give you more brightness and also let you dial down the PWM duty cycle, so you have more off time at the beginning of each cycle to step to the next LED.
I have used this technique successfully many times, and even have been able to multiplex 6 RGB LEDs (three channels each) onto one chip and it works great.
Update the question if you have any questions about the details!
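To make the multiplexing strategy concrete, here is a minimal avr-gcc sketch (my own illustration, not the answerer's code; the wiring, prescaler, and brightness values are all assumptions): common cathodes on the OC0A/PB0 PWM pin, anodes on PB1..PB4, non-inverting fast PWM so each LED is off at the start of the cycle, and the overflow ISR steps to the next LED once per PWM cycle:

#include <avr/io.h>
#include <avr/interrupt.h>

volatile uint8_t brightness[4] = {60, 120, 180, 240}; // per-LED duty, 0..255
static const uint8_t anode[4] = {PB1, PB2, PB3, PB4};

ISR(TIMER0_OVF_vect) {
    static uint8_t i = 0;
    DDRB &= ~(1 << anode[i]);     // float the anode of the LED just served
    i = (i + 1) & 3;              // advance to the next LED
    OCR0A = 255 - brightness[i];  // cathode goes low at the compare match, so
                                  // the LED is on for ~brightness counts at cycle end
    DDRB |= (1 << anode[i]);      // drive only the new anode
    PORTB |= (1 << anode[i]);
}

int main(void) {
    TCCR0A = (1 << COM0A1) | (1 << WGM01) | (1 << WGM00); // fast PWM on OC0A
    TCCR0B = (1 << CS01);                                 // clk/8 prescaler
    DDRB |= (1 << PB0);                                   // shared cathode/PWM pin
    TIMSK |= (1 << TOIE0);                                // step LEDs each cycle
    sei();
    for (;;) {}
}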

Extending SRecord to handle crc32_mpeg2?

statement of problem:
I'm working with a Kinetis L series (ARM Cortex M0+) that has a dedicated CRC hardware module. Through trial and error and using this excellent online CRC calculator, I determined that the CRC hardware is configured to compute CRC32_MPEG2.
I'd like to use srec_input (a part of SRecord 1.64) to generate a CRC for a .srec file whose results must match the CRC_MPEG2 computed by the hardware. However, srec's built-in CRC algos (CRC32 and STM32) don't generate the same results as the CRC_MPEG2.
the question:
Is there a straightforward way to extend srec to handle CRC32_MPEG2? My current thought is to fork the srec source tree and extend it, but it seems likely that someone's already been down this path.
Alternatively, is there a way for srec to call an external program? (I didn't see one after a quick scan.) That might do the trick as well.
some details
The parameters of the hardware CRC32 algorithm are:
Input Reflected: No
Output Reflected: No
Polynomial: 0x4C11DB7
Initial Seed: 0xFFFFFFFF
Final XOR: 0x0
To test it, an input string of:
0x10 0xB5 0x06 0x4C 0x23 0x78 0x00 0x2B
0x07 0xD1 0x05 0x4B 0x00 0x2B 0x02 0xD0
should result in a CRC32 value of:
0x938F979A
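For cross-checking outside the hardware, here is a bitwise C implementation of exactly those parameters (my own sketch; it is not part of SRecord): fed the 16 test bytes above, it should print the same value:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* CRC-32/MPEG-2: poly 0x04C11DB7, seed 0xFFFFFFFF, no reflection, no final XOR */
uint32_t crc32_mpeg2(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;  /* feed MSB-first */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u
                                      : (crc << 1);
    }
    return crc;
}

int main(void) {
    const uint8_t test[] = {0x10, 0xB5, 0x06, 0x4C, 0x23, 0x78, 0x00, 0x2B,
                            0x07, 0xD1, 0x05, 0x4B, 0x00, 0x2B, 0x02, 0xD0};
    printf("0x%08X\n", crc32_mpeg2(test, sizeof test)); /* expect 0x938F979A */
    return 0;
}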
what generated the CRC value in the first place?
In response to Mark Adler's well-posed question, the firmware uses the Freescale fsl_crc library to compute the CRC. The relevant code and parameters (mildly edited) follow:
void crc32_update(crc32_data_t *crc32Config, const uint8_t *src, uint32_t lengthInBytes)
{
    crc_config_t crcUserConfigPtr;
    CRC_GetDefaultConfig(&crcUserConfigPtr);
    crcUserConfigPtr.crcBits = kCrcBits32;
    crcUserConfigPtr.seed = 0xffffffff;
    crcUserConfigPtr.polynomial = 0x04c11db7U;
    crcUserConfigPtr.complementChecksum = false;
    crcUserConfigPtr.reflectIn = false;
    crcUserConfigPtr.reflectOut = false;
    CRC_Init(g_crcBase[0], &crcUserConfigPtr);
    CRC_WriteData(g_crcBase[0], src, lengthInBytes);
    crcUserConfigPtr.seed = CRC_Get32bitResult(g_crcBase[0]);
    crc32Config->currentCrc = crcUserConfigPtr.seed;
    crc32Config->byteCountCrc += lengthInBytes;
}
Peter Miller be praised...
It turns out that if you supply enough filters to srec_cat, you can make it do anything! :) In fact, the following arguments produce the correct checksum:
$ srec_cat test.srec -Bit_Reverse -CRC32LE 0x1000 -Bit_Reverse -XOR 0xff -crop 0x1000 0x1004 -Output -HEX_DUMP
00001000: 93 8F 97 9A #....
In other words, bit-reverse the bits going into the CRC32 algorithm, bit-reverse them on the way out, and one's-complement them.

FFI code segfaults in jRuby but not MRI Ruby

I'm porting a Ruby gem written in C to Ruby with FFI.
When I run the tests using MRI Ruby there aren't any seg-faults.
When running in jRuby, I get a seg-fault.
This is the code in the test that I think is responsible:
if type == Date or type == DateTime then
  assert_nil param.set_value(value.strftime("%F %T"));
else
  assert_nil param.set_value(value);
end
#api.sqlany_bind_param(stmt, 0, param)
puts "\n#{param.inspect}"
#return if String === value or Date === value or DateTime === value
assert_succeeded #api.sqlany_execute(stmt)
The segmentation fault happens when running sqlany_execute, but only when the object passed to set_value is of the class String.
sqlany_execute just uses FFI's attach_function method.
param.set_value is more complicated. I'll focus just on the String-specific part. Here is the original C code:
case T_STRING:
s_bind->value.length = malloc(sizeof(size_t));
length = RSTRING_LEN(val);
*s_bind->value.length = length;
s_bind->value.buffer = malloc(length);
memcpy(s_bind->value.buffer, RSTRING_PTR(val), length);
s_bind->value.type = A_STRING;
break;
https://github.com/in4systems/sqlanywhere/blob/db25e7c7a2d5c855ab3899eacbc7a86b91114f53/ext/sqlanywhere.c#L1461
In my port, this became:
when String
  self[:value][:length] = SQLAnywhere::LibC.malloc(FFI::Type::ULONG.size)
  length = value.bytesize
  self[:value][:length].write_int(length)
  self[:value][:buffer] = SQLAnywhere::LibC.malloc(length + 1)
  self[:value][:buffer_size] = length + 1
  ## Don't use put_string as that includes the terminating null
  # value.each_byte.each_with_index do |byte, index|
  #   self[:value][:buffer].put_uchar(index, byte)
  # end
  self[:value][:buffer].put_string(0, value)
  self[:value][:type] = :string
https://github.com/in4systems/sqlanywhere/blob/e49099a4e6514169395523391f57d2333fbf7d78/lib/bind_param.rb#L31
My question is: what's causing jRuby to seg fault and what can I do about it?
This answer is possibly overly detailed, but I thought it would be good to go into a bit of depth for those who run across similar problems in the future.
It looks like this was your problem:
self[:value][:length].write_int(length)
when it should have been:
self[:value][:length].write_ulong(length)
On a 64 bit system, bytes 4..7 of the memory self[:value][:length] points to could have contained garbage (since malloc does not clear the memory it returns), and when the native code reads a size_t quantity at that address, it will be garbage, potentially indicating a buffer larger than 4 gigabytes.
e.g. if the string length is really 15 bytes, the lower 4 bits will be set, and the upper 60 should be all zero.
bit 0 1 2 3 4 32 63
+---+---+---+---+---+ ~ +---+ ~ +---+
| 1 | 1 | 1 | 1 | 0 | ~ | 0 | ~ | 0 |
+---+---+---+---+---+ ~ +---+ ~ +---+
If just one bit in those upper 32 bits is set, then you get a > 4-gigabyte value:
bit 0 1 2 3 4 32 63
+---+---+---+---+---+ ~ +---+ ~ +---+
| 1 | 1 | 1 | 1 | 0 | ~ | 1 | ~ | 0 |
+---+---+---+---+---+ ~ +---+ ~ +---+
which would be a length of 4294967311 bytes.
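As a small C analogue of that failure mode (my illustration, not code from the gem): write only 32 bits into malloc'd storage, then read the full size_t back:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    void *p = malloc(sizeof(size_t));  /* contents are indeterminate */
    unsigned int len32 = 15;
    memcpy(p, &len32, sizeof len32);   /* only bytes 0..3 are written */
    size_t len;
    memcpy(&len, p, sizeof len);       /* bytes 4..7 may still be garbage */
    printf("len = %zu (correct only if the upper bytes happened to be zero)\n", len);
    free(p);
    return 0;
}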
One way to fix it is to define a SizeT struct and use that for the length.
e.g.
class SizeT < FFI::Struct
  layout :value, :size_t
end

self[:value][:length] = SQLAnywhere::LibC.malloc(SizeT.size)
length = value.bytesize
SizeT.new(self[:value][:length])[:value] = length
or you could monkey patch FFI::Pointer:
class FFI::Pointer
  if FFI.type_size(:size_t) == 4
    def write_size_t(val)
      write_int(val)
    end
  else
    def write_size_t(val)
      write_long_long(val)
    end
  end
end
Why was it only segfaulting on JRuby, not on MRI? Maybe MRI was a 32-bit executable (printing the value of FFI.type_size(:size_t) will tell you).
