Producer/consumer using boost::interprocess_condition with boost::interprocess shared memory. Consumer dominates 100% - boost

Just making a simple example because I am having issues with a more complex use case and want to understand the base case before spending too much time on trial and error.
Scenario:
I have two binaries that are supposed to take turns incrementing a number (stored in shared memory). What happens in practice is that the "consumer" app takes over, using 100% of the CPU and never letting the "creator" run.
If I add a small delay in the consumer, I obtain the intended behaviour.
Simple POD struct
#pragma once

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

namespace bip = boost::interprocess;

namespace my_namespace {

static const char *name = "MySharedMemory";

struct MyStruct {
    bip::interprocess_mutex mutex;
    bip::interprocess_condition cond;
    unsigned long counter;

    MyStruct(): mutex(), cond(), counter(0) {
    }
};

} // namespace my_namespace
"Creator/producer"
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#include <iostream>

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/thread/locks.hpp>

#include "my_struct.h"

bool exit_flag = false;

void my_handler(int) {
    exit_flag = true;
}

namespace bip = boost::interprocess;

int main() {
    struct sigaction sigIntHandler;
    sigIntHandler.sa_handler = my_handler;
    sigemptyset(&sigIntHandler.sa_mask);
    sigIntHandler.sa_flags = 0;
    sigaction(SIGINT, &sigIntHandler, NULL);

    bip::shared_memory_object::remove(my_namespace::name);
    auto memory = bip::managed_shared_memory(bip::create_only, my_namespace::name, 65536);
    auto *data = memory.construct<my_namespace::MyStruct>(my_namespace::name)();

    long unsigned iterations = 0;
    while (!exit_flag) {
        boost::interprocess::scoped_lock lock(data->mutex);
        data->counter++;
        std::cout << "iteration:" << iterations << "Counter: " << data->counter << std::endl;
        ++iterations;
        auto start = boost::posix_time::microsec_clock::universal_time();
        auto wait_time = start + boost::posix_time::milliseconds(1000);
        auto ret = data->cond.timed_wait(lock, wait_time);
        if (!ret) {
            std::cout << "Timeout" << std::endl;
        }
    }
    return 0;
}
Consumer
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>

#include "my_struct.h"

bool exit_flag = false;

void my_handler(int) {
    exit_flag = true;
}

namespace bip = boost::interprocess;

int fib(int x) {
    if ((x == 1) || (x == 0)) {
        return (x);
    } else {
        return (fib(x - 1) + fib(x - 2));
    }
}

int main() {
    struct sigaction sigIntHandler;
    sigIntHandler.sa_handler = my_handler;
    sigemptyset(&sigIntHandler.sa_mask);
    sigIntHandler.sa_flags = 0;
    sigaction(SIGINT, &sigIntHandler, nullptr);

    auto memory = bip::managed_shared_memory(bip::open_only, my_namespace::name);
    auto *data = memory.find<my_namespace::MyStruct>(my_namespace::name).first;

    long unsigned iterations = 0;
    while (!exit_flag) {
        {
            boost::interprocess::scoped_lock lock(data->mutex);
            std::this_thread::sleep_for(std::chrono::milliseconds(200));
            data->counter += 1;
            std::cout << "iteration:" << iterations << "Counter: " << data->counter << std::endl;
            ++iterations;
            std::cout << "notify_one" << std::endl;
            data->cond.notify_one();
        }
        // usleep(1); // If I add this it works
    }
    return 0;
}
If someone can shed some light, I would be grateful.

You're doing sleeps while holding the lock. This maximizes lock contention. E.g. in your consumer
boost::interprocess::scoped_lock lock(data->mutex);
std::this_thread::sleep_for(200ms);
Could be
std::this_thread::sleep_for(200ms);
boost::interprocess::scoped_lock lock(data->mutex);
Mutexes are supposed to synchronize access to shared resources. As long as you do not require exclusive access to the shared resource, don't hold the lock. In general, make access atomic and as short as possible in any locking scenario.
Side Notes
You don't need the complicated posix_time manipulation:
auto ret = data->cond.wait_for(lock, 1000ms);
if (bip::cv_status::timeout == ret) {
    std::cout << "Timeout" << std::endl;
}
Just for sharing a single POD struct, managed_shared_memory is a lot of overkill. Consider mapped_region.
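A minimal sketch of that idea, reusing the question's MyStruct and name (open_or_create and error handling are simplified here, so treat it as illustrative rather than drop-in):
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <new>
#include "my_struct.h"

namespace bip = boost::interprocess;

int main() {
    bip::shared_memory_object shm(bip::open_or_create, my_namespace::name, bip::read_write);
    shm.truncate(sizeof(my_namespace::MyStruct));    // size the segment to exactly one struct
    bip::mapped_region region(shm, bip::read_write); // map it into this process

    // The creating side placement-news the struct once; the opening side would
    // instead just cast region.get_address() to MyStruct* after creation.
    auto* data = new (region.get_address()) my_namespace::MyStruct();

    data->counter++; // use data->mutex / data->cond exactly as before
}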
Consider Asio for signal handling. In any case, make the exit_flag atomic so you don't suffer a data race:
static std::atomic_bool exit_flag{false};
{
    struct sigaction sigIntHandler;
    sigIntHandler.sa_handler = [](int) { exit_flag = true; };
    sigemptyset(&sigIntHandler.sa_mask);
    sigIntHandler.sa_flags = 0;
    sigaction(SIGINT, &sigIntHandler, NULL);
}
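If you do take the Asio suggestion, a sketch could look like this (signal_set is the real Asio API; the exit_flag wiring and the background thread are just one possible arrangement, not taken from the code above):
#include <boost/asio.hpp>
#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

int main() {
    std::atomic_bool exit_flag{false};

    boost::asio::io_context io;
    boost::asio::signal_set signals(io, SIGINT, SIGTERM);
    signals.async_wait([&](boost::system::error_code const& ec, int /*signo*/) {
        if (!ec) exit_flag = true;
    });

    // Run the io_context on a background thread so the main loop stays free
    std::thread io_thread([&] { io.run(); });

    while (!exit_flag) {
        // ... produce/consume as before ...
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    io.stop();
    io_thread.join();
}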
Since your application is symmetrical, I'd expect the signaling to be symmetrical. If not, I'd expect the producing side to do signaling (after all, presumably there is nothing to consume when nothing was produced. Why be "busy" when you know nothing was produced?).
Live Demo
Live On Coliru
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <signal.h>
#include <unistd.h>

#ifdef COLIRU // coliru doesn't support shared memory
#include <boost/interprocess/managed_mapped_file.hpp>
#define managed_shared_memory managed_mapped_file
#endif

namespace bip = boost::interprocess;
using namespace std::chrono_literals;

namespace my_namespace {
    static char const* name = "MySharedMemory";

    struct MyStruct {
        bip::interprocess_mutex mutex;
        bip::interprocess_condition cond;
        unsigned long counter = 0;
    };
} // namespace my_namespace

namespace producer {
    void run() {
        static std::atomic_bool exit_flag{false};
        {
            struct sigaction sigIntHandler;
            sigIntHandler.sa_handler = [](int) { exit_flag = true; };
            sigemptyset(&sigIntHandler.sa_mask);
            sigIntHandler.sa_flags = 0;
            sigaction(SIGINT, &sigIntHandler, NULL);
        }

        bip::shared_memory_object::remove(my_namespace::name);
        auto memory = bip::managed_shared_memory(bip::create_only, my_namespace::name, 65536);
        auto& data = *memory.construct<my_namespace::MyStruct>(my_namespace::name)();

        for (size_t iterations = 0; !exit_flag;) {
            std::unique_lock lock(data.mutex);
            data.counter++;
            std::cout << "iteration:" << iterations << " Counter: " << data.counter << std::endl;
            ++iterations;

            auto ret = data.cond.wait_for(lock, 1000ms);
            if (bip::cv_status::timeout == ret) {
                std::cout << "Timeout" << std::endl;
            }
        }
    }
} // namespace producer

namespace consumer {
    namespace bip = boost::interprocess;

    void run() {
        static std::atomic_bool exit_flag{false};
        {
            struct sigaction sigIntHandler;
            sigIntHandler.sa_handler = [](int) { exit_flag = true; };
            sigemptyset(&sigIntHandler.sa_mask);
            sigIntHandler.sa_flags = 0;
            sigaction(SIGINT, &sigIntHandler, nullptr);
        }

        bip::managed_shared_memory memory(bip::open_only, my_namespace::name);
        auto& data = *memory.find<my_namespace::MyStruct>(my_namespace::name).first;

        for (size_t iterations = 0; !exit_flag;) {
            std::this_thread::sleep_for(200ms);

            std::unique_lock lock(data.mutex);
            data.counter += 1;
            std::cout << "iteration:" << iterations << " Counter: " << data.counter << std::endl;
            ++iterations;
            std::cout << "notify_one" << std::endl;
            data.cond.notify_one();
        }
    }
} // namespace consumer

int main(int argc, char**) {
    if (argc > 1)
        producer::run();
    else
        consumer::run();
}
Testing with
g++ -std=c++20 -O2 -pthread main.cpp -lrt -DCOLIRU
./a.out producer&
sleep 1;
./a.out&
sleep 4; kill -INT %2; sleep 3;
./a.out&
sleep 4; kill -INT %1 %2 %3
Prints e.g.
PRODUCER iteration:0 Counter: 1
PRODUCER Timeout
PRODUCER iteration:1 Counter: 2
CONSUMER iteration:0 Counter: 3
CONSUMER notify_one
PRODUCER iteration:2 Counter: 4
CONSUMER iteration:1 Counter: 5
CONSUMER notify_one
PRODUCER iteration:3 Counter: 6
CONSUMER iteration:2 Counter: 7
CONSUMER notify_one
PRODUCER iteration:4 Counter: 8
CONSUMER iteration:3 Counter: 9
CONSUMER notify_one
PRODUCER iteration:5 Counter: 10
CONSUMER iteration:4 Counter: 11
CONSUMER notify_one
PRODUCER iteration:6 Counter: 12
CONSUMER iteration:5 Counter: 13
CONSUMER notify_one
PRODUCER iteration:7 Counter: 14
CONSUMER iteration:6 Counter: 15
CONSUMER notify_one
PRODUCER iteration:8 Counter: 16
CONSUMER iteration:7 Counter: 17
CONSUMER notify_one
PRODUCER iteration:9 Counter: 18
CONSUMER iteration:8 Counter: 19
CONSUMER notify_one
PRODUCER iteration:10 Counter: 20
CONSUMER iteration:9 Counter: 21
CONSUMER notify_one
PRODUCER iteration:11 Counter: 22
CONSUMER iteration:10 Counter: 23
CONSUMER notify_one
PRODUCER iteration:12 Counter: 24
CONSUMER iteration:11 Counter: 25
CONSUMER notify_one
PRODUCER iteration:13 Counter: 26
CONSUMER iteration:12 Counter: 27
CONSUMER notify_one
PRODUCER iteration:14 Counter: 28
CONSUMER iteration:13 Counter: 29
CONSUMER notify_one
PRODUCER iteration:15 Counter: 30
CONSUMER iteration:14 Counter: 31
CONSUMER notify_one
PRODUCER iteration:16 Counter: 32
CONSUMER iteration:15 Counter: 33
CONSUMER notify_one
PRODUCER iteration:17 Counter: 34
CONSUMER iteration:16 Counter: 35
CONSUMER notify_one
PRODUCER iteration:18 Counter: 36
CONSUMER iteration:17 Counter: 37
CONSUMER notify_one
PRODUCER iteration:19 Counter: 38
CONSUMER iteration:18 Counter: 39
CONSUMER notify_one
PRODUCER iteration:20 Counter: 40
CONSUMER iteration:19 Counter: 41
CONSUMER notify_one
PRODUCER iteration:21 Counter: 42
PRODUCER Timeout
PRODUCER iteration:22 Counter: 43
PRODUCER Timeout
PRODUCER iteration:23 Counter: 44
PRODUCER Timeout
PRODUCER iteration:24 Counter: 45
CONSUMER iteration:0 Counter: 46
CONSUMER notify_one
PRODUCER iteration:25 Counter: 47
CONSUMER iteration:1 Counter: 48
CONSUMER notify_one
PRODUCER iteration:26 Counter: 49
CONSUMER iteration:2 Counter: 50
CONSUMER notify_one
PRODUCER iteration:27 Counter: 51
CONSUMER iteration:3 Counter: 52
CONSUMER notify_one
PRODUCER iteration:28 Counter: 53
CONSUMER iteration:4 Counter: 54
CONSUMER notify_one
PRODUCER iteration:29 Counter: 55
CONSUMER iteration:5 Counter: 56
CONSUMER notify_one
PRODUCER iteration:30 Counter: 57
CONSUMER iteration:6 Counter: 58
CONSUMER notify_one
PRODUCER iteration:31 Counter: 59
CONSUMER iteration:7 Counter: 60
CONSUMER notify_one
PRODUCER iteration:32 Counter: 61
CONSUMER iteration:8 Counter: 62
CONSUMER notify_one
PRODUCER iteration:33 Counter: 63
CONSUMER iteration:9 Counter: 64
CONSUMER notify_one
PRODUCER iteration:34 Counter: 65
CONSUMER iteration:10 Counter: 66
CONSUMER notify_one
PRODUCER iteration:35 Counter: 67
CONSUMER iteration:11 Counter: 68
CONSUMER notify_one
PRODUCER iteration:36 Counter: 69

Related

problem in synchronization of timer0 and CPU clock cycles

I am a beginner with AVR. I need to sample an input at odd intervals of 8 ms. I have used CTC mode to generate an 8 ms timer, with a compare interrupt so that a counter (timer_count) is incremented at every compare match, i.e. every 8 ms. The 8 ms timer starts on an external interrupt at pin D0.
When I check the input conditions in the main loop, due to the large difference between the main controller frequency (18.432 MHz) and the 8 ms timer, I am unable to sample the inputs correctly. Can anyone suggest another method to do this? The code is pasted here for reference.
#include <mega128.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_BIT(ADDRESS,BIT) (ADDRESS & (1<<BIT))
#define SET_BIT(ADDRESS,BIT) (ADDRESS |= (1<<BIT))
#define CLEAR_BIT(ADDRESS,BIT) (ADDRESS &= (~(1<<BIT)))
#define TGL_BIT(ADDRESS, BIT) (ADDRESS ^= (1<<BIT))

volatile unsigned int flag;
volatile unsigned int timer_count=0;
volatile unsigned int frequency_979_sense;
volatile unsigned int frequency_885_sense;
volatile unsigned int frequency_933_sense;
volatile unsigned int flag_979_received;
volatile unsigned int flag_885_received;
volatile unsigned int data;
volatile unsigned int i;
volatile unsigned char SOP_valid;
volatile unsigned int previous_state=0;
volatile unsigned int current_state=0;

// Timer 0 output compare interrupt service routine
interrupt [TIM0_COMP] void timer0_comp_isr(void)
{
    timer_count++;
}

void init_timer0()
{
    // Timer/Counter 0 initialization
    // Clock source: System Clock
    // Clock value: 18.000 kHz
    // Mode: CTC top=OCR0
    // OC0 output: toggle output on compare match
    ASSR=0x00;
    TCCR0=0x1F;
    TCNT0=0x00;
    OCR0=0x90;
    // Timer(s)/Counter(s) Interrupt(s) initialization
    TIMSK=0x02;
}

// External Interrupt 0 service routine
interrupt [EXT_INT0] void ext_int0_isr(void)
{
    if ((CHECK_BIT(PIND,4)==0) & (CHECK_BIT(PIND,5)==0))
    {
        init_timer0();
        flag=1;
        CLEAR_BIT(EIMSK,0);
        CLEAR_BIT(EIFR,0);
    }
}

void main(void)
{
    // Port D initialization
    PORTD=0xFF;
    DDRD=0x00;

    // External Interrupt(s) initialization
    // INT0: On
    // INT0 Mode: Rising Edge
    EICRA=0x03;
    EICRB=0x00;
    EIMSK=0x01;
    EIFR=0x01;

    // Global enable interrupts
    #asm("sei")

    while (1)
    {
        while (timer_count > 0 & timer_count<32)
        {
            if (timer_count%2==1)
            {
                frequency_979_sense= CHECK_BIT(PIND,0);
                frequency_885_sense= CHECK_BIT(PIND,4);
                frequency_933_sense= CHECK_BIT(PIND,5);
                if ((frequency_979_sense != 0) && (frequency_885_sense == 0) && (frequency_933_sense == 0) && (flag_885_received==0 || flag_885_received== 8))
                {
                    flag_979_received++;
                    SET_BIT(data,i);
                }
                if ((frequency_979_sense == 0) && (frequency_885_sense != 0) && (frequency_933_sense == 0) && (flag_979_received==6))
                {
                    flag_885_received++;
                    SET_BIT(data,i);
                }
                else
                {
                    flag_979_received=0;
                    flag_885_received=0;
                    frequency_979_sense=0;
                    frequency_885_sense=0;
                    frequency_933_sense=0;
                    data=0;
                    TCCR0=0x00;
                    TIMSK=0x00;
                    timer_count=0;
                    i=0;
                    SET_BIT(EIMSK,0);
                    SET_BIT(EIFR,0);
                }
            }
            i++;
        }
        if (data==65535)
        {
            SOP_valid=1;
        }
    }
}

How to check if stdout was closed - without writing data to it?

I wrote a program that reads data, filters and processes it and writes it to stdout. If stdout is piped to another process, and the piped process terminates, I get SIGPIPEd, which is great, because the program terminates, and the pipeline comes to a timely end.
Depending on the filter parameters however, there may not be a single write for tens of seconds, and during that time there won't be a SIGPIPE, although the downstream process has long finished. How can I detect this, without actually writing something to stdout? Currently, the pipeline is just hanging, until my program terminates of natural causes.
I tried writing a zero-length slice
if _, err := os.Stdout.Write([]byte{}); err != nil
but unfortunately that does not result in an error.
N.B. Ideally, this should work regardless of the platform, but if it works on Linux only, that's already an improvement.
This doesn't answer it in Go, but you can likely find a way to adapt it.
If you can apply poll(2) to the write end of your pipe, you will get a notification when it becomes un-writable. How to integrate that into your Go code depends on your program; hopefully it is useful:
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void sp(int sno) {
    write(2, "sigpipe!\n", 9);
    _exit(1);
}

int waitfd(int fd) {
    int n;
    struct pollfd p;
    p.fd = fd;
    p.events = POLLOUT | POLLRDBAND;
    /* RDBAND is for what looks like a bug in illumos fifovnops.c */
    p.revents = 0;
    if ((n=poll(&p, 1, -1)) == 1) {
        if (p.revents & POLLOUT) {
            return fd;
        }
        if (p.revents & (POLLERR|POLLHUP)) {
            return -1;
        }
    }
    fprintf(stderr, "poll=%d (%d:%s), r=%#x\n",
            n, errno, strerror(errno), p.revents);
    return -1;
}

int main() {
    int count = 0;
    char c;
    signal(SIGPIPE, sp);
    while (read(0, &c, 1) > 0) {
        int w;
        while ((w=waitfd(1)) != -1 &&
               write(1, &c, 1) != 1) {
        }
        if (w == -1) {
            break;
        }
        count++;
    }
    fprintf(stderr, "wrote %d\n", count);
    return 0;
}
In Linux, you can run this program as: ./a.out < /dev/zero | sleep 1 and it will print something like: wrote 61441. You can change it to sleep for 3s, and it will print the same thing. That is pretty good evidence that it has filled the pipe and is waiting for space.
Sleep will never read from the pipe, so when its time is up, it closes the read side, which wakes up poll(2) with a POLLERR event.
If you change the poll event to not include POLLOUT, you get the simpler program:
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int waitfd(int fd) {
    int n;
    struct pollfd p;
    p.fd = fd;
    p.events = POLLRDBAND;
    p.revents = 0;
    if ((n=poll(&p, 1, -1)) == 1) {
        if (p.revents & (POLLERR|POLLHUP)) {
            return -1;
        }
    }
    fprintf(stderr, "poll=%d (%d:%s), r=%#x\n",
            n, errno, strerror(errno), p.revents);
    return -1;
}

int main() {
    if (waitfd(1) == -1) {
        fprintf(stderr, "Got an error!\n");
    }
    return 0;
}
where "Got an error!" indicates the pipe was closed. I don't know how portable this is, as poll(2) documentation is kinda sketchy.
Without the POLLRDBAND (so events is 0), this works on Linux, but wouldn't on UNIX (at least Solaris and macOS). Again, the docs were useless, but having the kernel source answers many questions :)
This example, using threads, can be directly mapped to go:
#include <pthread.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int Events = POLLRDBAND;

void sp(int sno) {
    char buf[64];
    write(2, buf, snprintf(buf, sizeof buf, "%d: sig%s(%d)\n", getpid(), sys_siglist[sno], sno));
    _exit(1);
}

int waitfd(int fd) {
    int n;
    struct pollfd p;
    p.fd = fd;
    p.events = Events;
    /* RDBAND is for what looks like a bug in illumos fifovnops.c */
    p.revents = 0;
    if ((n=poll(&p, 1, -1)) == 1) {
        if (p.revents & (POLLERR|POLLHUP)) {
            return -1;
        }
        return fd;
    }
    return -1;
}

void *waitpipe(void *t) {
    int x = (int)(intptr_t)t; /*gcc braindead*/
    waitfd(x);
    kill(getpid(), SIGUSR1);
    return NULL;
}

int main(int ac) {
    pthread_t killer;
    int count = 0;
    char c;
    Events |= (ac > 1) ? POLLOUT : 0;
    signal(SIGPIPE, sp);
    signal(SIGUSR1, sp);
    pthread_create(&killer, 0, waitpipe, (int *)1);
    while (read(0, &c, 1) > 0) {
        write(1, &c, 1);
        count++;
    }
    fprintf(stderr, "wrote %d\n", count);
    return 0;
}
Note that it parks a thread on poll, and it generates a SIGUSR1. Here is a sample run:
mcloud:pipe $ ./spthr < /dev/zero | hexdump -n80
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000050
185965: sigUser defined signal 1(10)
mcloud:pipe $ ./spthr < /dev/zero | sleep 1
185969: sigUser defined signal 1(10)
mcloud:pipe $ ./spthr | sleep 1
185972: sigUser defined signal 1(10)
mcloud:pipe $ ./spthr < /dev/zero | hexdump -n800000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00c3500
185976: sigBroken pipe(13)
In the first command, hexdump quits after 80 bytes; the poll is fundamentally racing with the read+write loop, so it could have generated either a sigpipe or a sigusr1.
The second two demonstrate that sleep will cause a sigusr1 (poll returned an exception event) whether or not the write side of the pipe is full when the pipe reader exits.
The fourth uses hexdump to read a lot of data, way more than the pipe capacity, which more deterministically causes a sigpipe.
You can generate test programs which model it more exactly, but the point is that the program gets notified as soon as the pipe is closed, rather than having to wait until its next write.
Not a real solution to the problem - namely, detecting if the process down the pipe has terminated without writing to it - but here is a workaround, suggested in a comment by Daniel Farrell: (Define and) use a heartbeat signal that will get ignored downstream.
As this workaround is not transparent, it may not be possible if you don't control all processes involved.
Here's an example that uses the NUL byte as heartbeat signal for text based data:
my-cmd | head -1 | tr -d '\000' > file
my-cmd would send NUL bytes in times of inactivity to get a timely EPIPE / SIGPIPE.
Note the use of tr to strip off the heartbeats again once they have served their purpose - otherwise they would end up in file.
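For illustration, a C++/POSIX sketch of the my-cmd side (hypothetical; it assumes the downstream pipeline strips the NUL bytes as above) could emit a heartbeat while idle and treat EPIPE as "downstream is gone":
// Idle-heartbeat writer sketch. SIGPIPE is ignored so a closed pipe surfaces as
// an EPIPE error from write() instead of killing the process.
#include <cerrno>
#include <chrono>
#include <csignal>
#include <cstdio>
#include <thread>
#include <unistd.h>

int main() {
    std::signal(SIGPIPE, SIG_IGN);
    for (;;) {
        // ... write real (non-NUL) output here whenever it becomes available ...
        std::this_thread::sleep_for(std::chrono::seconds(1));          // idle interval
        if (write(STDOUT_FILENO, "", 1) == -1 && errno == EPIPE) {     // one NUL heartbeat byte
            std::fprintf(stderr, "downstream closed, exiting\n");
            return 1;
        }
    }
}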

How to find median value in 2d array for each column with CUDA? [duplicate]

I found the 'vectorized/batch sort' and 'nested sort' methods at the link below: How to use Thrust to sort the rows of a matrix?
When I tried these methods for 500 rows of 1000 elements each, the results were:
vectorized/batch sort: 66 ms
nested sort: 3290 ms
I am using a 1080ti HOF to do this operation, but it takes too long compared to your case.
In the link below, however, it takes less than 10 ms, even approaching 100 microseconds.
(How to find median value in 2d array for each column with CUDA?)
Could you recommend how to optimize this method to reduce operation time?
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>

#define NSORTS 500
#define DSIZE 1000

int my_mod_start = 0;
int my_mod() {
    return (my_mod_start++) / DSIZE;
}

bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2) {
    return thrust::equal(d1.begin(), d1.end(), d2.begin());
}

struct sort_functor
{
    thrust::device_ptr<int> data;
    int dsize;
    __host__ __device__
    void operator()(int start_idx)
    {
        thrust::sort(thrust::device, data + (dsize*start_idx), data + (dsize*(start_idx + 1)));
    }
};

#include <time.h>
#include <windows.h>

unsigned long long dtime_usec(LONG start) {
    SYSTEMTIME timer2;
    GetSystemTime(&timer2);
    LONG end = (timer2.wSecond * 1000) + timer2.wMilliseconds;
    return (end-start);
}

int main() {
    for (int i = 0; i < 3; i++) {
        SYSTEMTIME timer1;
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16 * DSIZE*NSORTS));
        thrust::host_vector<int> h_data(DSIZE*NSORTS);
        thrust::generate(h_data.begin(), h_data.end(), rand);
        thrust::device_vector<int> d_data = h_data;

        // first time a loop
        thrust::device_vector<int> d_result1 = d_data;
        thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
        GetSystemTime(&timer1);
        LONG time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        for (int i = 0; i < NSORTS; i++)
            thrust::sort(r1ptr + (i*DSIZE), r1ptr + ((i + 1)*DSIZE));
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;

        //vectorized sort
        thrust::device_vector<int> d_result2 = d_data;
        thrust::host_vector<int> h_segments(DSIZE*NSORTS);
        thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
        thrust::device_vector<int> d_segments = h_segments;
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
        thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;

        //nested sort
        thrust::device_vector<int> d_result3 = d_data;
        sort_functor f = { d_result3.data(), DSIZE };
        thrust::device_vector<int> idxs(NSORTS);
        thrust::sequence(idxs.begin(), idxs.end());
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::for_each(idxs.begin(), idxs.end(), f);
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
    }
    return 0;
}
The main takeaway from your Thrust experience is that you should never compile a debug project or use the device debug switch (-G) when you are interested in performance. Compiling device debug code causes the compiler to omit many performance optimizations. The difference in your case was quite dramatic: about a 30x improvement going from debug to release code.
Here is a segmented CUB sort, where we launch 500 blocks and each block handles a separate 1024-element array. The CUB code is lifted from here.
$ cat t1761.cu
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
#include <iostream>

const int ipt=8;
const int tpb=128;

__global__ void ExampleKernel(int *data)
{
    // Specialize BlockRadixSort for a 1D block of 128 threads owning 8 integer items each
    typedef cub::BlockRadixSort<int, tpb, ipt> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[ipt];
    // just create some synthetic data in descending order 1023 1022 1021 1020 ...
    for (int i = 0; i < ipt; i++) thread_keys[i] = (tpb-1-threadIdx.x)*ipt+i;
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    __syncthreads();
    // write results to output array
    for (int i = 0; i < ipt; i++) data[blockIdx.x*ipt*tpb + threadIdx.x*ipt+i] = thread_keys[i];
}

int main(){
    const int blks = 500;
    int *data;
    cudaMalloc(&data, blks*ipt*tpb*sizeof(int));
    ExampleKernel<<<blks,tpb>>>(data);
    int *h_data = new int[blks*ipt*tpb];
    cudaMemcpy(h_data, data, blks*ipt*tpb*sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) std::cout << h_data[i] << " ";
    std::cout << std::endl;
}
$ nvcc -o t1761 t1761.cu -I/path/to/cub/cub-1.8.0
$ CUDA_VISIBLE_DEVICES="2" nvprof ./t1761
==13713== NVPROF is profiling process 13713, command: ./t1761
==13713== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
0 1 2 3 4 5 6 7 8 9
==13713== Profiling application: ./t1761
==13713== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 60.35% 308.66us 1 308.66us 308.66us 308.66us [CUDA memcpy DtoH]
39.65% 202.79us 1 202.79us 202.79us 202.79us ExampleKernel(int*)
API calls: 98.39% 210.79ms 1 210.79ms 210.79ms 210.79ms cudaMalloc
0.72% 1.5364ms 1 1.5364ms 1.5364ms 1.5364ms cudaMemcpy
0.32% 691.15us 1 691.15us 691.15us 691.15us cudaLaunchKernel
0.28% 603.26us 97 6.2190us 400ns 212.71us cuDeviceGetAttribute
0.24% 516.56us 1 516.56us 516.56us 516.56us cuDeviceTotalMem
0.04% 79.374us 1 79.374us 79.374us 79.374us cuDeviceGetName
0.01% 13.373us 1 13.373us 13.373us 13.373us cuDeviceGetPCIBusId
0.00% 5.0810us 3 1.6930us 729ns 2.9600us cuDeviceGetCount
0.00% 2.3120us 2 1.1560us 609ns 1.7030us cuDeviceGet
0.00% 748ns 1 748ns 748ns 748ns cuDeviceGetUuid
$
(CUDA 10.2.89, RHEL 7)
Above I am running on a Tesla K20x, which has performance that is "closer" to your 1080ti than a Tesla V100. We see that the kernel execution time is ~200us. If I run the exact same code on a Tesla V100, the kernel execution time drops to ~35us:
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t1761
==13814== NVPROF is profiling process 13814, command: ./t1761
0 1 2 3 4 5 6 7 8 9
==13814== Profiling application: ./t1761
==13814== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 82.33% 163.43us 1 163.43us 163.43us 163.43us [CUDA memcpy DtoH]
17.67% 35.073us 1 35.073us 35.073us 35.073us ExampleKernel(int*)
API calls: 98.70% 316.92ms 1 316.92ms 316.92ms 316.92ms cudaMalloc
0.87% 2.7879ms 1 2.7879ms 2.7879ms 2.7879ms cuDeviceTotalMem
0.19% 613.75us 97 6.3270us 389ns 205.37us cuDeviceGetAttribute
0.19% 601.61us 1 601.61us 601.61us 601.61us cudaMemcpy
0.02% 72.718us 1 72.718us 72.718us 72.718us cudaLaunchKernel
0.02% 59.905us 1 59.905us 59.905us 59.905us cuDeviceGetName
0.01% 37.886us 1 37.886us 37.886us 37.886us cuDeviceGetPCIBusId
0.00% 4.6830us 3 1.5610us 546ns 2.7850us cuDeviceGetCount
0.00% 1.9900us 2 995ns 587ns 1.4030us cuDeviceGet
0.00% 677ns 1 677ns 677ns 677ns cuDeviceGetUuid
$
You'll note there is no "input" array; I'm just synthesizing data in the kernel, since we are primarily interested in performance. If you need to handle an array size like 1000, you should probably just pad each array to 1024 (e.g. pad with a very large number, then ignore the last numbers in the sorted result).
This code is largely lifted from external documentation. It is offered for instructional purposes. I'm not suggesting it is defect-free or suitable for any particular purpose. Use it at your own risk.
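As an aside, the padding suggested above could be done on the host with a small helper along these lines (illustrative only; pad_rows is not part of the answer's code):
// Pad each dsize-element row out to `padded` entries (e.g. 1024) with INT_MAX so it
// fits the 128-thread x 8-items-per-thread block sort; the padding sorts to the end
// and can be ignored when reading back the first dsize values of each row.
#include <climits>
#include <vector>

std::vector<int> pad_rows(const std::vector<int>& in, int dsize, int padded) {
    const size_t nrows = in.size() / dsize;
    std::vector<int> out(nrows * padded, INT_MAX);
    for (size_t r = 0; r < nrows; ++r)
        for (int c = 0; c < dsize; ++c)
            out[r * padded + c] = in[r * dsize + c];
    return out;
}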

How to implement this script about concurrency in c++ 11

I am implementing a multi-threaded concurrency exercise in C++11, but I am stuck.
int product_val = 0;
- thread 1: increments product_val for thread 2, notifies thread 2, and waits for thread 2 to print product_val;
- thread 2: waits, decrements product_val, and prints product_val
#include <iostream>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <chrono>
#include <queue>

using namespace std;

int product_val = 0;
std::condition_variable cond;
std::mutex sync;

int main() {
    // thread 2
    std::thread con = std::thread([&](){
        while (1)
        {
            std::unique_lock<std::mutex> l(sync);
            cond.wait(l);
            product_val--;
            printf("Consumer product_val = %d \n", product_val);
            l.unlock();
        }
    });
    // thread 1 (main thread) process
    for (int i = 0; i < 5; i++)
    {
        std::unique_lock<std::mutex> l(sync);
        product_val++;
        std::cout << "producer product val " << product_val;
        cond.notify_one();
        l.unlock();
        l.lock();
        while (product_val)
        {

        }
        std::cout << "producer product val " << product_val;
        l.unlock();
    }
    return 0;
}
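For comparison, a minimal self-contained sketch of the handshake described above (illustrative only, not the original poster's code): pairing the condition variable with a bool predicate lets each side block instead of spinning.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int product_val = 0;
std::condition_variable cond;
std::mutex mtx;
bool produced = false;

int main() {
    std::thread consumer([] {
        for (int i = 0; i < 5; ++i) {
            std::unique_lock<std::mutex> l(mtx);
            cond.wait(l, [] { return produced; }); // wait until something was produced
            --product_val;
            printf("Consumer product_val = %d\n", product_val);
            produced = false;
            cond.notify_one(); // hand the turn back to the producer
        }
    });
    for (int i = 0; i < 5; ++i) {
        std::unique_lock<std::mutex> l(mtx);
        ++product_val;
        printf("Producer product_val = %d\n", product_val);
        produced = true;
        cond.notify_one();
        cond.wait(l, [] { return !produced; }); // wait until the consumer printed
    }
    consumer.join();
    return 0;
}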

C++11 atomic operations: concurrent atomic_fetch_sub_explicit operations

I have a question about the atomic_fetch_sub_explicit operation in C++11. I have run the following code thousands of times and only observed two possible outputs: "data: 0 1" or "data: 1 0". I want to know: is it possible for it to produce the output "data: 1 1"?
#include <atomic>
#include <cstdio>
#include <thread>

using namespace std;

std::atomic<int> x;
int data1, data2;

void a() {
    data1 = atomic_fetch_sub_explicit(&x, 1, memory_order_relaxed);
}

void b() {
    data2 = atomic_fetch_sub_explicit(&x, 1, memory_order_relaxed);
}

int main() {
    x = 1;
    std::thread t1(a);
    std::thread t2(b);
    t1.join(), t2.join();
    printf("data: %d %d\n", data1, data2);
}
