How to achieve a high sampling speed using an ADC with Raspberry Pi? - raspberry-pi3

So, I am using the LTC 2366, a 3 MSPS ADC, and with the code given below I was able to achieve a sampling rate of about 380 KSPS.
#include <stdio.h>
#include <time.h>
#include <bcm2835.h>

int main(int argc, char **argv) {
    FILE *f_0 = fopen("adc_test.dat", "w");
    clock_t start, end;
    double time_taken;

    if (!bcm2835_init()) {
        return 1;
    }
    bcm2835_spi_begin();
    bcm2835_spi_setBitOrder(BCM2835_SPI_BIT_ORDER_MSBFIRST);
    bcm2835_spi_setDataMode(BCM2835_SPI_MODE0);
    bcm2835_spi_setClockDivider(32);
    bcm2835_spi_chipSelect(BCM2835_SPI_CS0);
    bcm2835_spi_setChipSelectPolarity(BCM2835_SPI_CS0, LOW);

    int i;
    char buf_0[3] = {0x01, (0x08|0)<<4, 0x00}; // transmit data; really doesn't matter what this is
    char readBuf_0[2];

    start = clock();
    for (i = 0; i < 380000; i++) {
        bcm2835_spi_transfernb(buf_0, readBuf_0, 2);
        fprintf(f_0, "%d\n", (readBuf_0[0]<<6) + (readBuf_0[1]>>2));
    }
    end = clock();

    time_taken = ((double)(end - start) / CLOCKS_PER_SEC);
    printf("%f", (double)time_taken);
    printf(" seconds \n");

    bcm2835_spi_end();
    bcm2835_close();
    return 0;
}
This returns about 1 second every time.
When I used the exact same code with the LTC 2315, I still got a sampling rate of about 380 KSPS. How come? First, why is the 3 MSPS ADC giving me only 380 KSPS and not something like 2 MSPS? Second, when I change the ADC to one that's about 70% faster, why do I get the same sampling rate? Is that the limit of the Pi? Is there any way of improving this to get at least 1 MSPS?
Thank you

I have tested the Raspberry Pi SPI a bit and found that it has some overhead. In my case, using pyspi, one byte seems to take at least 15 µs, with 75 µs between two words (see these captures). That's slower than what you measure, so good for you!
Increasing the SPI clock shortens the exchange itself, but not the overhead. Hence the critical time doesn't change, as the overhead is the limiting factor. 380 ksps means about 2.6 µs per sample, which may well be close to your overhead.
The easiest way to improve the ADC speed would be to use a parallel ADC instead of a serial one - that has the potential to increase overall speed to 20 Msps+.
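One cheap improvement that doesn't touch the SPI overhead itself is to keep file I/O out of the sampling loop: buffer the raw readings in RAM and write them to the file after the capture. A rough sketch (untested), reusing buf_0, readBuf_0 and f_0 from the question's code; the buffer size simply matches the 380000-sample loop:

#define N_SAMPLES 380000
static int samples[N_SAMPLES];            // sample buffer in RAM

start = clock();
for (i = 0; i < N_SAMPLES; i++) {
    bcm2835_spi_transfernb(buf_0, readBuf_0, 2);
    samples[i] = (readBuf_0[0] << 6) + (readBuf_0[1] >> 2);  // assemble the reading as in the original loop
}
end = clock();                            // time only the sampling loop

for (i = 0; i < N_SAMPLES; i++) {
    fprintf(f_0, "%d\n", samples[i]);     // file I/O after the capture
}

This frees a little CPU time per sample and makes the timed loop measure only the SPI transfers, but the per-transfer overhead discussed above still sets the ceiling.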

Related

In an NVIDIA GPU, why does the elapsed time stay the same when the number of threads increases to 3 times the GPU core count?

This is my cuda code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>   // rand(), atoi()
#include <chrono>
#include <cuda.h>

__global__ void test(int base, int* out)
{
    int curTh = threadIdx.x + blockIdx.x * blockDim.x;
    {
        int tmp = base * curTh;
        #pragma unroll
        for (int i = 0; i < 1000*1000*100; ++i) {
            tmp *= tmp;
        }
        out[curTh] = tmp;
    }
}

typedef std::chrono::high_resolution_clock Clock;

int main(int argc, char *argv[])
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    int data = rand();
    int* d_out;
    void* va_args[10] = {&data, &d_out};
    int nth = 10;
    if (argc > 1) {
        nth = atoi(argv[1]);
    }
    int NTHREADS = 128;
    printf("nth: %d\n", nth);
    cudaMalloc(&d_out, nth*sizeof(int));
    for (int i = 0; i < 10; ++i) {
        auto start = Clock::now();
        cudaLaunchKernel((const void*) test,
                         nth > NTHREADS ? nth/NTHREADS : 1,
                         nth > NTHREADS ? NTHREADS : nth, va_args, 0, stream);
        cudaStreamSynchronize(stream);
        // elapsed milliseconds
        printf("use :%lldms\n",
               (long long)std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count());
    }
    cudaDeviceReset();
    printf("host Hello World from CPU!\n");
    return 0;
}
I compile my code and run it on a 2080 Ti. I found the elapsed time stays around 214 ms even when the thread count is 3 times the number of GPU cores (on the 2080 Ti, that's 4352):
root@d114:~# ./cutest 1
nth: 1
use :255ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
root@d114:~# ./cutest 13056
nth: 13056
use :272ms
use :223ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
root@d114:~# ./cutest 21760
nth: 21760
use :472ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :428ms
So my question is: why does the elapsed time stay the same when the number of threads increases to 3 times the number of GPU cores?
Does that mean the NVIDIA GPU's computing power is 3 times its core count?
Even though a GPU pipeline can only issue a new instruction at a rate of one per cycle, it can overlap the in-flight instructions of multiple threads, at least 3-4 deep for simple math operations, so an increased number of threads only adds a few cycles of extra latency per thread. But, as is visible at thr=21760, issuing more of the same instruction eventually fills the pipeline completely and threads start waiting.
21760/13056=1.667
424ms/214ms=1.98
This difference of ratios could originate from the tail effect. Once the pipelines are fully filled, adding even a small amount of work doubles the latency, because the new work is computed as a second wave of computation that starts only after all the others have completed, since all threads run exactly the same instructions. You could add some more threads and the time should stay at 424 ms until you get a third wave of waiting threads; again, because the instructions are exactly the same for all threads and there is no branching between them, from the outside they behave like blocks of waiting work.
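As a rough sanity check using only the numbers above (my own arithmetic, treating ~13056 threads as roughly one full wave on this card):
ceil(21760 / 13056) = 2 waves
2 x 214ms = 428ms ≈ 424ms measured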
The loop iterating 100 million times with a complete dependency chain also limits memory accesses: only 1 memory operation per 100M iterations means very low bandwidth consumption on the card's memory.
The kernel is neither compute nor memory bottlenecked (if you don't count the integer multiplication, which has no latency hiding within its own thread, as computation). Because of this, all SM units of the GPU must be running with nearly the same timings (plus some thread-launch latency that is not visible next to the 100M-iteration loop and increases linearly with more threads).
When the algorithm is a real-world one that uses multiple parts of the pipeline (not just integer multiplication), an SM unit can find more threads to overlap in the pipeline. For example, if an SM unit supports 1024 threads per block (with a maximum of 2 blocks in flight) and has only 128 pipelines, then there are 2048/128 = 16 slots in which to overlap operations like reading main memory, floating-point multiplication/addition, reading the constant cache, shuffling registers, etc., and this lets it complete a task quicker.

How to convert CUDA clock cycles to milliseconds?

I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments so that my kernel looks something like this:
__global__ void kernel(..., long long int *runtime)
{
    long long int start = 0;
    long long int stop = 0;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));
    /* Some code here */
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));
    runtime[threadIdx.x] = stop - start;
    ...
}
The answer says to do a conversion as follows:
The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
For which I do:
for(long i = 0; i < size; i++)
{
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
}
Where 1.62 is the GPU Max Clock rate of my device. But the time I get in milliseconds does not look right because it suggests that each thread took minutes to complete. This cannot be correct as execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect or am I making a mistake somewhere? Thanks.
The correct conversion in your case is not GHz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
^^^^
but hertz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1620000000.0f)*1000.0);
^^^^^^^^^^^^^
In the dimensional analysis:
                 clock cycles
clock cycles  /  ------------  =  seconds
                    second
the first term is the clock cycle measurement. The second term is the frequency of the GPU (in hertz, not GHz), the third term is the desired measurement (seconds). You can convert to milliseconds by multiplying seconds by 1000.
Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):
$ cat t1306.cu
#include <stdio.h>

const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;

__global__ void kernel(long long *clocks){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    long long start = clock64();
    while (clock64() < start+delay_time);
    if (idx < nthr) clocks[idx] = clock64()-start;
}

int main(){
    int peak_clk = 1;
    int device = 0;
    long long *clock_data;
    long long *host_data;
    host_data = (long long *)malloc(nthr*sizeof(long long));
    cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
    if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
    err = cudaMalloc(&clock_data, nthr*sizeof(long long));
    if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
    kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
    err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
    printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms\n", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
    return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$
This uses cudaDeviceGetAttribute to get the clock rate, which returns a result in kHz, which allows us to easily compute milliseconds in this case.
In my experience, the above method generally works well on datacenter GPUs that have the clock rate running at the reported rate (this may be affected by settings you make in nvidia-smi). Other GPUs, such as GeForce GPUs, may be running at (unpredictable) boost clocks that will make this method inaccurate.
Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.
clock64 returns a value in graphics clock cycles. The graphics clock is dynamic so I would not recommend using a constant to try to convert to seconds. If you want to convert to wall time then the better option is to use globaltimer, which is a 64-bit clock register accessible as:
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
The unit is in nanoseconds.
The default resolution is 32 ns, updated every µs. The NVIDIA performance tools force the update to every 32 ns (or 31.25 MHz). This clock is used by CUPTI for the start time when capturing a concurrent kernel trace.
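For illustration, a minimal sketch (mine, untested) of the question's kernel timed with %globaltimer instead of %clock64; the elapsed value is already in nanoseconds, so no clock-rate conversion is needed:
__global__ void kernel(long long int *runtime)
{
    unsigned long long start, stop;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
    /* Some code here */
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(stop));
    runtime[threadIdx.x] = (long long int)(stop - start);   // elapsed wall time in ns
}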

Generate random number within a function with cuRAND without preallocation

Is it possible to generate random numbers within a device function without preallocating all the states? I would like to generate and use them in "real time". I need to use them for Monte Carlo simulations: which generators are the most suitable for this purpose? The numbers generated below are single precision; is it possible to have them in double precision?
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>

__global__ void cudaRand(float *d_out, unsigned long seed)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init(seed, i, 0, &state);
    d_out[i] = curand_uniform(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    float *v = new float[N];
    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));

    // generate random numbers
    cudaRand<<<1, N>>>(d_out, time(NULL));
    cudaMemcpy(v, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (size_t i = 0; i < N; i++)
    {
        printf("out: %f\n", v[i]);
    }

    cudaFree(d_out);
    delete[] v;
    return 0;
}
UPDATE
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <ctime>

__global__ void cudaRand(double *d_out)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock() + i, 0, 0, &state);
    d_out[i] = curand_uniform_double(&state);
}

int main(int argc, char** argv)
{
    size_t N = 1 << 4;
    double *h_v = new double[N];
    double *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(double));

    // generate random numbers
    cudaRand<<<1, N>>>(d_out);
    cudaMemcpy(h_v, d_out, N * sizeof(double), cudaMemcpyDeviceToHost);

    for (size_t i = 0; i < N; i++)
        printf("out: %f\n", h_v[i]);

    cudaFree(d_out);
    delete[] h_v;
    return 0;
}
Here is how I dealt with a similar situation in the past, within a __device__/__global__ function:
int tId = threadIdx.x + (blockIdx.x * blockDim.x);
curandState state;
curand_init((unsigned long long)clock() + tId, 0, 0, &state);
double rand1 = curand_uniform_double(&state);
double rand2 = curand_uniform_double(&state);
So just use curand_uniform_double for generating random doubles. Also, I believe you don't want the same seed for all of the threads; that's what I am trying to avoid by using clock() + tId as the seed. This way the odds of having the same rand1/rand2 in any two threads are close to nil.
EDIT:
However, based on the comments below, the proposed approach may lead to biased results:
JackOLantern pointed me to this part of curand documentation:
Sequences generated with different seeds usually do not have statistically correlated values, but some choices of seeds may give statistically correlated sequences.
Also there is a devtalk thread devoted to how to improve performance of curand_init in which the proposed solution to speed up the curand initialization is:
One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0.
But the same poster is later stating:
The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don't know of any problems like this, and even if they do exist they will most likely be rare.
So it is basically up to you whether you want better performance (as I did) or completely unbiased results. If the latter is what you desire, then the solution proposed by JackOLantern is the correct one, i.e. initialize curand as:
curand_init((unsigned long long)clock(), tId, 0, &state)
Using non-zero values for the offset and subsequence parameters does, however, decrease performance. For more info on these parameters you may review this SO thread and also the curand documentation.
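For concreteness (my own sketch, not part of the original answer), that recommendation dropped into the kernel from the UPDATE section looks like this; only the curand_init line changes:
__global__ void cudaRand(double *d_out)
{
    int tId = blockDim.x * blockIdx.x + threadIdx.x;
    curandState state;
    curand_init((unsigned long long)clock(), tId, 0, &state);  // same seed for all threads, per-thread subsequence
    d_out[tId] = curand_uniform_double(&state);
}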
I see that JackOLantern stated in a comment that:
I would say it is not recommandable to call curand_init and curand_uniform_double from within the same kernel from two reasons ........ Second, curand_init initializes the pseudorandom number generator and sets all of its parameters, so I'm afraid your approach will be somewhat slow.
I dealt with this over several pages in my thesis and tried various approaches to get different random numbers in each thread; creating a curandState in each of the threads turned out to be the most viable solution for me. I needed to generate ~10 random numbers in each thread and, among others, I tried:
developing my own simple random number generator (a Linear Congruential Generator) whose initialization was basically free; however, the performance suffered greatly when generating numbers, so in the end having a curandState in each thread turned out to be superior,
pre-allocating curandStates and reusing them - this was memory heavy, and when I decreased the number of preallocated states I had to use non-zero values for the offset/subsequence parameters of curand_uniform_double to get rid of bias, which decreased performance when generating numbers.
So after making a thorough analysis I decided to indeed call curand_init and curand_uniform_double in each thread. The only problem was the number of registers these states occupy, so I had to be careful with block sizes not to exceed the maximum number of registers available to each block.
That's what I have to say about the provided solution, which I was finally able to test, and it works just fine on my machine/GPU. I ran the code from the UPDATE section of the above question and 16 different random numbers were displayed in the console correctly. Therefore I advise you to properly perform error checking after executing the kernel to see what went wrong inside. This topic is very well covered in this SO thread.
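A minimal sketch of the error checking suggested above (my own addition, using the standard CUDA runtime calls), placed right after the kernel launch in the UPDATE code:
cudaRand<<<1, N>>>(d_out);
cudaError_t errSync  = cudaGetLastError();       // errors from the launch configuration itself
cudaError_t errAsync = cudaDeviceSynchronize();  // errors raised while the kernel was running
if (errSync  != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(errSync));
if (errAsync != cudaSuccess) printf("kernel error: %s\n", cudaGetErrorString(errAsync));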

Multiplexed 7 segments using PIC16F877A using C

This is only the second time I've asked a question on here. Last time was quite helpful, so I thought I'd revisit since I'm stuck on another C project!
I'll just add that I'm more or less a total n00b at C but know almost enough to attempt this with minimal help (until now!), and I'm not asking for someone to do this for me, just for a few pointers (no pun intended) in the right direction.
I've done quite a bit of googling on this topic, but I'm trying my best not to just copy and paste code from some online source, as I want to learn from this one, so I'm trying to develop the code by myself.
What I'm trying to do, then:
I've built myself a PIC development board with two common-cathode 7-segment displays connected to PORTD of the '877A. I've connected RB0 and RB1 to the transistors that switch on the 7-segs and have tested everything with simple code; it works fine, so the circuit has no issues at all. I've managed to create a very basic program that counts from 0-9, and I've now decided to try multiplexing and counting from 0-99. I've written some code, posted below, and I'd like to ask someone to kindly point out what I'm doing wrong with it. So far I've got the units digit counting 0-9, but the tens digit just seems to stay at 0.
I've a feeling I'm leaving something out, but I don't know what. I'm probably also overcomplicating it a little.
I'm working my way up to a program that acts as a temperature sensor, feeding a thermistor potential-divider circuit into the ADC of the PIC (which is my actual project) and displaying the value on a multiplexed display. The multiplexed display isn't actually part of the project (we're only supposed to use one digit that alternates between '2', '5' and 'C' for '25C', etc.), but I want to take it a bit further, so I'm trying to develop this for an improved version.
Anyway, that's enough of me rambling on; I'll paste the code in and hopefully someone can help.
#include <stdio.h>
#include <stdlib.h>
#include <xc.h>

#pragma config CP = OFF, DEBUG = OFF, PWRTE = OFF
#pragma config CPD = OFF, LVP = OFF
#pragma config BOREN = OFF, WRT = OFF
#pragma config WDTE = OFF, FOSC = HS

#define _XTAL_FREQ 8000000

void segments (int digits);

int main(int argc, char** argv) {
    TRISD = 0x00;                //creates an output
    TRISB = 0x00;
    PORTD = 0x00;                //sends zeros to all bits of port D
    PORTB = 0x00;
    int i, j, num, tens, units, digits;

    do {
        for (i = 0; i < 100; i++)
        {
            units = i % 10;          //extract units digit
            num = i - units;         //takes units away leaving multiple of 10
            tens = num % 10;         //extract tens digit
            for (j = 0; j < 20; j++) //should display each ten and unit for 200ms
            {
                RB0 = 1;             //switch on units segment
                RB1 = 0;
                digits = units;
                segments(digits);
                __delay_ms(5);

                RB0 = 0;
                RB1 = 1;
                digits = tens;
                segments(digits);
                __delay_ms(5);
            }
        }
    } while (1);                     //do while runs forever
    return (EXIT_SUCCESS);
}

void segments (int digits)
{
    switch (digits)
    {
        case 0:
            PORTD = 0x3F;  //zero
            break;
        case 1:
            PORTD = 0x06;  //one
            break;
        case 2:
            PORTD = 0x5B;  //two
            break;
        case 3:
            PORTD = 0x4F;  //three
            break;
        case 4:
            PORTD = 0x66;  //four
            break;
        case 5:
            PORTD = 0x6D;  //five
            break;
        case 6:
            PORTD = 0x7D;  //six
            break;
        case 7:
            PORTD = 0x07;  //seven
            break;
        case 8:
            PORTD = 0x7F;  //eight
            break;
        case 9:
            PORTD = 0x6F;  //nine
            break;
    }
}
If I've forgotten to add anything, please do let me know. Thanks very much in advance for any help!
You should use tens = num / 10; instead of %.
For example, if i is 52: when you calculate units you take the remainder of the division by 10, which is 2. Then you subtract 2 from 52 to get 50, and do the same modulo operation when calculating tens, which will always give you 0. Dividing num by 10 instead gives the 5 you expect.
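A minimal sketch of the corrected digit extraction (my own illustration, using the variable names from the question):
units = i % 10;     // remainder:        52 % 10 = 2
tens  = i / 10;     // integer division: 52 / 10 = 5

/* or, keeping the intermediate num from the original code: */
num  = i - units;   // 50
tens = num / 10;    // 5  (num % 10 would always give 0)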

Using both GPU device of CUDA and zero copy pinned memory

I am using the CUSP library for sparse matrix multiplication on a CUDA machine. My current code is
#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/transpose.h>
#include <stdio.h>

#define CATAGORY_PER_SCAN 1000
#define TOTAL_CATAGORY 100000
#define MAX_SIZE 1000000
#define ELEMENTS_PER_CATAGORY 10000
#define ELEMENTS_PER_TEST_CATAGORY 1000
#define INPUT_VECTOR 1000
#define TOTAL_ELEMENTS ELEMENTS_PER_CATAGORY * CATAGORY_PER_SCAN
#define TOTAL_TEST_ELEMENTS ELEMENTS_PER_TEST_CATAGORY * INPUT_VECTOR

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    cusp::coo_matrix<long long int, double, cusp::host_memory> A(CATAGORY_PER_SCAN, MAX_SIZE, TOTAL_ELEMENTS);
    cusp::coo_matrix<long long int, double, cusp::host_memory> B(MAX_SIZE, INPUT_VECTOR, TOTAL_TEST_ELEMENTS);

    for(int i = 0; i < ELEMENTS_PER_TEST_CATAGORY; i++){
        for(int j = 0; j < INPUT_VECTOR; j++){
            int index = i * INPUT_VECTOR + j;
            B.row_indices[index] = i; B.column_indices[index] = j; B.values[index] = i;
        }
    }
    for(int i = 0; i < CATAGORY_PER_SCAN; i++){
        for(int j = 0; j < ELEMENTS_PER_CATAGORY; j++){
            int index = i * ELEMENTS_PER_CATAGORY + j;
            A.row_indices[index] = i; A.column_indices[index] = j; A.values[index] = i;
        }
    }
    /* cusp::print(A);
       cusp::print(B); */

    // test vector
    cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
    cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;

    // allocate output vector
    cusp::coo_matrix<int, double, cusp::device_memory> y_d(CATAGORY_PER_SCAN, INPUT_VECTOR, CATAGORY_PER_SCAN * INPUT_VECTOR);

    cusp::multiply(A_d, B_d, y_d);
    cusp::coo_matrix<int, double, cusp::host_memory> y = y_d;

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    printf("time elapsed %f ms\n", elapsedTime);
    return 0;
}
The cusp::multiply function uses only 1 GPU (as far as I understand). My questions are:
1. How can I use setDevice() to run the same program on both GPUs (one cusp::multiply per GPU)?
2. How can I measure the total time accurately?
3. How can I use zero-copy pinned memory with this library, given that I can do the malloc myself?
1 How can I use setDevice() to run same program on both the GPU
If you mean "How can I perform a single cusp::multiply operation using two GPUs", the answer is you can't.
EDIT:
For the case where you want to run two separate CUSP sparse matrix-matrix products on different GPUs, it is possible to simply wrap the operation in a loop and call cudaSetDevice before the transfers and the cusp::multiply call (see the sketch below). You will probably not, however, get any speedup by doing so. I think I am correct in saying that both the memory transfers and the cusp::multiply operations are blocking calls, so the host CPU will stall until they are finished. Because of this, the calls for different GPUs cannot overlap, and there will be no speedup over performing the same operation on a single GPU twice. If you were willing to use a multithreaded application and have a host CPU with multiple cores, you could probably still run them in parallel, but it won't be as straightforward host code as you seem to be hoping for.
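For concreteness, a sketch (mine, untested) of what that loop might look like with the matrices from the question; as noted above, don't expect it to be faster than a single GPU:
for (int dev = 0; dev < 2; dev++) {
    cudaSetDevice(dev);                  // select the GPU before the transfers and the multiply
    // per-device copies, built after cudaSetDevice so they live on that GPU
    cusp::coo_matrix<long int, double, cusp::device_memory> A_d = A;
    cusp::coo_matrix<long int, double, cusp::device_memory> B_d = B;
    cusp::coo_matrix<int, double, cusp::device_memory> y_d(CATAGORY_PER_SCAN, INPUT_VECTOR,
                                                           CATAGORY_PER_SCAN * INPUT_VECTOR);
    cusp::multiply(A_d, B_d, y_d);       // blocking, so the two iterations will not overlap
    cusp::coo_matrix<int, double, cusp::host_memory> y = y_d;
}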
2 Measure the total time accurately
The cuda event approach you have now is the most accurate way of measuring the execution time of a single kernel. If you had a hypothetical multi-GPU scheme, then the sum of the events from each GPU context would be the total execution time of the kernels. If, by total time, you mean the "wallclock" time to complete the operation, then you would need to use a host timer around the whole multi-GPU segment of your code. I vaguely recall that it might be possible in the latest versions of CUDA to synchronize between events in streams from different contexts in some circumstances, so a CUDA event based timer might still be usable in such a scenario.
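A small sketch of the host "wallclock" timer option (my own example using std::chrono, which requires including <chrono>):
auto t0 = std::chrono::high_resolution_clock::now();
// ... cudaSetDevice / transfers / cusp::multiply for each GPU ...
auto t1 = std::chrono::high_resolution_clock::now();
double wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("wallclock: %f ms\n", wall_ms);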
3 How can I use zero-copy pinned memory with this library as I can use malloc myself.
To the best of my knowledge that isn't possible. The underlying thrust library that CUSP uses can support containers in zero-copy memory, but CUSP doesn't expose the necessary mechanisms in the standard matrix constructors to allocate a CUSP sparse matrix in zero-copy memory.
