shmget with IPC_EXCL - shared-memory

I'm having issues using shmget() to manage memory segments. According to the man page, if both IPC_CREAT and IPC_EXCL flags are set, shmget should fail if it is asked to create a segment for a key that already exists.
What I'm seeing is that shmget creates a new segment for the same key (with a new shmid) regardless. The code below illustrates this issue. I'm running two instances of it, one in 'creator' mode, one in 'client' mode.
./test 10 0
./test 10 1
The creator allocates a memory segment for key=10, then attaches to it. The client also attaches to the segment. Running ipcs -m I can see that the segment exists and that two processes are attached to it.
Then I make the creator destroy the segment, and as expected ipcs shows that it is marked for destruction, with 1 process still attached. What's strange is that if I start the creator again with the same key, it creates a new segment instead of failing, even though a segment for that key still exists.
Thanks for your help!
#include <sys/shm.h>
#include <sys/stat.h>
#include <errno.h>
#include <stdlib.h>
#include <vector>
#include <iostream>
#include <stdexcept>
using namespace std;
int main( int argc, char** argv )
{
    cout << "usage: " << argv[0] << " <key> <mode (0=creator 1=client)>" << endl;
    if ( argc < 3 ) return 0;
    int key = atoi( argv[1] );
    int mode = atoi( argv[2] );
    cout << "key=" << key << endl;
    cout << "mode=" << mode << endl;
    char c;
    int shmid=-1;
    int size = 100; // bytes
    try
    {
        if ( mode == 0 ) // creator
        {
            cout << "creating segment" << endl;
            // IPC_EXCL should make shmget fail if the key already exists
            int flags = ( IPC_CREAT | IPC_EXCL | 0666 );
            shmid = shmget( key, size, flags );
            if ( shmid== -1 )
                throw runtime_error("failed to create segment");
            cout << "created: shmid=" << shmid << endl;
        }
        else if ( mode == 1 ) // client
        {
            // Look up an existing segment by key without creating one
            shmid = shmget( key, 0, 0 );
            if ( shmid== -1 )
                throw runtime_error("failed to load");
            cout << "loaded: shmid=" << shmid << endl;
        }
        cout << "attach? (press key to continue)" << endl;
        cin >> c;
        void* data = shmat( shmid, NULL, 0 );
        if ( data == (void *) -1 )
            throw runtime_error("failed to attach");
        cout << "attached to id=" << shmid << endl;
        cout << "destroy? (press key to continue)" << endl;
        cin >> c;
        // Marks the segment for destruction; it is freed after the last detach
        if ( shmctl( shmid, IPC_RMID, NULL ) == -1 )
            throw runtime_error("failed to destroy");
        cout << "destroyed" << endl;
    }
    catch( const exception& e )
    {
        cout << e.what() << " errno=" << errno << endl;
    }
    return 0;
}

Look more closely at the output of ipcs. Here is what I get using your example code with a key of 10.
The server has created the segment:
$ ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0000000a 1470791680 hristo     666        100        1
The client is attached, the server has marked the segment for destruction:
$ ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 1470791680 hristo     666        100        1          dest
Once a segment is marked for destruction, no new process is expected to find and attach it by its key, so the key is zeroed out (note the 0x00000000 in the key column of the second listing). That's why you can create a new segment with the same key.
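The behaviour described above can be reproduced inside a single process. The following is a minimal sketch, assuming Linux semantics for IPC_RMID on a segment that is still attached; the helper name recreate_after_rmid and the key used in the usage note are made up for illustration, and other systems may behave differently.

```cpp
#include <sys/ipc.h>
#include <sys/shm.h>

// Hypothetical helper (not from the original post).
// Returns  1 if an exclusive create with the same key succeeds (with a new
//            shmid) after the first segment was marked for destruction
//            while still attached,
//          0 if that second create fails,
//         -1 if the setup itself (first create or attach) failed.
int recreate_after_rmid(key_t key) {
    int first = shmget(key, 100, IPC_CREAT | IPC_EXCL | 0600);
    if (first == -1) return -1;

    void* p = shmat(first, nullptr, 0);
    if (p == (void*)-1) {
        shmctl(first, IPC_RMID, nullptr);
        return -1;
    }

    // The segment is now marked "dest" but stays alive while we are attached.
    shmctl(first, IPC_RMID, nullptr);

    // Same key, exclusive create: on Linux the old segment no longer owns the key.
    int second = shmget(key, 100, IPC_CREAT | IPC_EXCL | 0600);
    int result = (second != -1 && second != first) ? 1 : 0;

    // Clean up both segments.
    if (second != -1) shmctl(second, IPC_RMID, nullptr);
    shmdt(p);
    return result;
}
```

Calling recreate_after_rmid(0x5eed1234) on a Linux machine where that key is unused should return 1, matching what ipcs shows in the two listings.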


How to use OpenMP to deal with two for loops with

I am new to OpenMP... Please help me with this dumb question. Thank you :)
Basically, I want to use OpenMP to speed up two for loops. But I do not know why it keeps saying: invalid controlling predicate for the for loop.
By the way, my GCC version is gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005, and the OS I am using is Ubuntu 16.10.
I generate toy data in a typical key-value style, like this:
Data = {
"0": ["100","99","98","97",..."1"];
"1": ["100","99","98","97",..."1"];
}
Then, for each key, I want to compare its value with the values of all the other keys. Here, I simply sum their sizes with "user1_list.size()+user2_list.size();". Because the work for each key is completely independent of the other keys, it should be suitable for parallelization.
Here is my toy example code.
#include <map>
#include <vector>
#include <string>
#include <iostream>
#include "omp.h"
using namespace std;
int main(){
    // Create Data
    map<string, vector<string>> data;
    for(int i=0; i != 1000; i++){
        vector<string> list;
        for (int j=100; j!=0; j--){
            list.push_back(to_string(j));
        }
        data[to_string(i)] = list;
    }
    cout << "Data Total size: " << data.size() << endl;
    int count = 1;
    #pragma omp parallel for private(count)
    for (auto it=data.begin(); it!=data.end(); it++){
        //cout << "Evoke Thread: " << omp_get_thread_num();
        cout << " count: " << count << " / " << data.size() << endl;
        count ++;
        string user1 = it->first;
        vector<string> user1_list = it->second;
        for (auto it2=data.begin(); it2!=data.end(); it2++){
            string user2 = it2->first;
            vector<string> user2_list = it2->second;
            cout << "u1:" << user1 << " u2:" << user2;
            int total_size = user1_list.size()+user2_list.size();
            cout << " total size: " << total_size << endl;
        }
    }
    return 0;
}
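One likely cause of the "invalid controlling predicate" error is that the OpenMP version implemented by GCC 6 expects the loop test to use <, <=, > or >=, and the loop variable to be an integer, pointer, or random-access iterator; std::map iterators are only bidirectional. A possible workaround, sketched here with a hypothetical helper called pairwise_size_sum, is to snapshot the map's values into a vector and loop over an integer index. The printing from the original is replaced by a reduction so that the result is deterministic.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: sums user1_list.size() + user2_list.size()
// over every ordered pair of keys in the map.
long long pairwise_size_sum(
        const std::map<std::string, std::vector<std::string>>& data) {
    // Collect pointers to the value vectors so they can be indexed by an int.
    std::vector<const std::vector<std::string>*> lists;
    lists.reserve(data.size());
    for (const auto& kv : data) lists.push_back(&kv.second);

    const int n = static_cast<int>(lists.size());
    long long total = 0;
    // Integer index and "<" test: the canonical form OpenMP accepts.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            total += static_cast<long long>(lists[i]->size() + lists[j]->size());
        }
    }
    return total;
}
```

With three keys of two elements each, every pair contributes 4 and there are 9 ordered pairs, so the function returns 36.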

C++: My for loop doesn't work when I run it in the terminal. Any ideas?

When I run it in the terminal, everything works fine except the loop. The for loop just doesn't do anything at all. I'm learning C++, so I don't know much.
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
int main( int argc, char *argv[] ) {
    if (argc == 2) {
        cout << "The first argument is " << argv[0] << endl;
        cout << "The second argument is " << argv[1] << endl;
    } else if (argc > 2) {
        cout << "Too many arguments" << endl;
    } else {
        cout << "Only one argument" << endl;
        cout << "The argument is " << argv[0] << endl;
    }
    if (atoi(argv[1]) < 0) {
        cout << "Error negative number" << endl;
    }
    // this loop does not work, everything else does.
    for (int i = 1; i >= atoi(argv[1]); i++){
        int count = atoi(argv[1]--);
        cout << count << endl;
        int sum = sum + i;
    }
    cout << "The sum is: " << endl;
}
I think the if statements might be what is interfering with the loop.
I think you made a mistake in the for loop.
You should use "<=" instead of ">=" in the loop condition.
Hope this helps.
I guess your code is not reaching the for loop because you have exit() calls in every branch of the if. Your code only reaches the loop if you pass 2 arguments in the terminal when you run it.
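The loop fix suggested above can be sketched as follows. This is only an illustration: the helper name sum_up_to is made up, and it assumes the intent was to add up the integers from 1 to the number given on the command line. Note that sum is declared once, before the loop, so it keeps its value between iterations.

```cpp
#include <cassert>

// Hypothetical helper: adds up 1 + 2 + ... + n.
int sum_up_to(int n) {
    assert(n >= 0);                  // negative input is rejected, as in the question
    int sum = 0;                     // initialised once, outside the loop
    for (int i = 1; i <= n; i++) {   // "<=": keep going until i has passed n
        sum += i;
    }
    return sum;
}
```

In main this could be called as sum_up_to(atoi(argv[1])) after checking that argc == 2, and the result printed after "The sum is: ".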

Calculating GPU's maximum flops using OpenCL

I am writing a simple OpenCL application that measures the maximum experimental FLOPS of a target GPU device. I have decided to keep my OpenCL kernel as simple as possible. Here are my kernel and my host code. The kernel code is:
__kernel void flops(__global float *data) {
    int gid = get_global_id(0);
    double s = data[gid];
    data[gid] = s * 0.35;
}
And the host code is:
#include <iostream>
#include <fstream>
#include <sstream>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "support.h"
#include "Event.h"
#include "ResultDatabase.h"
#include "OptionParser.h"
#include "ProgressBar.h"
using namespace std;
std::string kernels_folder = "/home/users/saman/shoc/src/opencl/level3/FlopsFolder/";
std::string kernel_file = "";
static const char *opts = "-cl-mad-enable -cl-no-signed-zeros "
"-cl-unsafe-math-optimizations -cl-finite-math-only";
cl_program createProgram (cl_context context,
                          cl_device_id device,
                          const char* fileName) {
    cl_int errNum;
    cl_program program;
    // Read the whole kernel source file into a string
    std::ifstream kernelFile (fileName, std::ios::in);
    if (!kernelFile.is_open()) {
        std::cerr << "Failed to open file for reading: " << fileName << std::endl;
        return NULL;
    }
    std::ostringstream oss;
    oss << kernelFile.rdbuf();
    std::string srcStdStr = oss.str();
    const char *srcStr = srcStdStr.c_str();
    program = clCreateProgramWithSource (context, 1, (const char **)&srcStr,
                                         NULL, &errNum);
    errNum = clBuildProgram (program, 0, NULL, NULL, NULL, NULL);
    return program;
}
bool createMemObjects (cl_context context, cl_command_queue queue,
                       cl_mem* memObject,
                       const int memFloatsSize, float *a) {
    cl_int err;
    *memObject = clCreateBuffer (context, CL_MEM_READ_WRITE,
                                 memFloatsSize * sizeof(float), NULL, &err);
    if (*memObject == NULL) {
        std::cerr << "Error creating memory objects. " << std::endl;
        return false;
    }
    // Copy the host buffer to the device and wait for the transfer to finish
    Event evWrite("write");
    err = clEnqueueWriteBuffer (queue, *memObject, CL_FALSE, 0, memFloatsSize * sizeof(float),
                                a, 0, NULL, &evWrite.CLEvent());
    err = clWaitForEvents (1, &evWrite.CLEvent());
    return true;
}
void cleanup (cl_context context, cl_command_queue commandQueue,
              cl_program program, cl_kernel kernel, cl_mem memObject) {
    if (memObject != NULL)
        clReleaseMemObject (memObject);
    if (kernel != NULL)
        clReleaseKernel (kernel);
    if (program != NULL)
        clReleaseProgram (program);
}
void addBenchmarkSpecOptions(OptionParser &op) {
}
void RunBenchmark(cl_device_id id,
                  cl_context ctx,
                  cl_command_queue queue,
                  ResultDatabase &resultDB,
                  OptionParser &op)
{
    for (float i = 0.1; i <= 0.2; i+=0.1 ) {
        std::cout << "Deploying " << 100*i << "%" << std::endl;
        bool verbose = false;
        cl_int errNum;
        cl_program program = 0;
        cl_kernel kernel;
        cl_mem memObject = 0;
        char maxFloatsStr[128];
        char testStr[128];
        program = createProgram (ctx, id, (kernels_folder + kernel_file).c_str());
        if (program == NULL) {
            exit (0);
        }
        if (verbose) std::cout << "Program created successfully!" << std::endl;
        kernel = clCreateKernel (program, "flops", &errNum);
        if (verbose) std::cout << "Kernel created successfully!" << std::endl;
        // Identify maximum size of the global memory on the device side
        cl_long maxAllocSizeBytes = 0;
        cl_long maxComputeUnits = 0;
        cl_long maxWorkGroupSize = 0;
        // The queried parameter names below are inferred from the variable names
        clGetDeviceInfo (id, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                         sizeof(cl_long), &maxAllocSizeBytes, NULL);
        clGetDeviceInfo (id, CL_DEVICE_MAX_COMPUTE_UNITS,
                         sizeof(cl_long), &maxComputeUnits, NULL);
        clGetDeviceInfo (id, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                         sizeof(cl_long), &maxWorkGroupSize, NULL);
        // Let's use 80% of this memory for transferring data
        cl_long maxFloatsUsageSize = ((maxAllocSizeBytes / 4) * 0.8);
        if (verbose) std::cout << "Max floats usage size is " << maxFloatsUsageSize << std::endl;
        if (verbose) std::cout << "Max compute unit is " << maxComputeUnits << std::endl;
        if (verbose) std::cout << "Max Work Group size is " << maxWorkGroupSize << std::endl;
        // Prepare buffer on the host side
        float *a = new float[maxFloatsUsageSize];
        for (int j = 0; j < maxFloatsUsageSize; j++) {
            a[j] = (float) (j % 77);
        }
        if (verbose) std::cout << "Host buffer been prepared!" << std::endl;
        // Creating buffer on the device side
        if (!createMemObjects(ctx, queue, &memObject, maxFloatsUsageSize, a)) {
            exit (0);
        }
        errNum = clSetKernelArg (kernel, 0, sizeof(cl_mem), &memObject);
        size_t wg_size, wg_multiple;
        cl_ulong local_mem, private_usage, local_usage;
        // The queried parameter names below are inferred from the variable names
        errNum = clGetKernelWorkGroupInfo (kernel, id, CL_KERNEL_WORK_GROUP_SIZE,
                                           sizeof (wg_size), &wg_size, NULL);
        errNum = clGetKernelWorkGroupInfo (kernel, id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                           sizeof (wg_multiple), &wg_multiple, NULL);
        errNum = clGetKernelWorkGroupInfo (kernel, id, CL_KERNEL_LOCAL_MEM_SIZE,
                                           sizeof (local_usage), &local_usage, NULL);
        errNum = clGetKernelWorkGroupInfo (kernel, id, CL_KERNEL_PRIVATE_MEM_SIZE,
                                           sizeof (private_usage), &private_usage, NULL);
        if (verbose) std::cout << "Work Group size is " << wg_size << std::endl;
        if (verbose) std::cout << "Preferred Work Group size is " << wg_multiple << std::endl;
        if (verbose) std::cout << "Local memory size is " << local_usage << std::endl;
        if (verbose) std::cout << "Private memory size is " << private_usage << std::endl;
        size_t globalWorkSize[1] = {static_cast<size_t>(maxFloatsUsageSize)};
        size_t localWorkSize[1] = {1};
        Event evKernel("flops");
        errNum = clEnqueueNDRangeKernel (queue, kernel, 1, NULL,
                                         globalWorkSize, localWorkSize,
                                         0, NULL, &evKernel.CLEvent());
        if (verbose) cout << "Waiting for execution to finish ";
        errNum = clWaitForEvents(1, &evKernel.CLEvent());
        if (verbose) cout << "Kernel execution terminated successfully!" << std::endl;
        delete[] a;
        // cl_long is 64 bits wide, so print it as long long; "%%" prints a literal '%'
        sprintf (maxFloatsStr, "Size: %lld", (long long) maxFloatsUsageSize);
        sprintf (testStr, "Flops: %f%% Memory", 100*i);
        double flopCount = maxFloatsUsageSize * 16000;
        double gflop = flopCount / (double)(evKernel.SubmitEndRuntime());
        resultDB.AddResult (testStr, maxFloatsStr, "GFLOPS", gflop);
        // Now it's time to read back the data
        a = new float[maxFloatsUsageSize];
        errNum = clEnqueueReadBuffer(queue, memObject, CL_TRUE, 0, maxFloatsUsageSize*sizeof(float), a, 0, NULL, NULL);
        if (verbose) {
            for (int j = 0; j < 10; j++) {
                std::cout << a[j] << " ";
            }
        }
        delete[] a;
        if (memObject != NULL)
            clReleaseMemObject (memObject);
        if (program != NULL)
            clReleaseProgram (program);
        if (kernel != NULL)
            clReleaseKernel (kernel);
    }
    std::cout << "Program executed successfully!" << std::endl;
}
To explain the code: the kernel performs a single floating-point operation, which means each work-item does one FLOP. In the host code, I first retrieve the maximum global memory size of the GPU, allocate a portion of it (the for loop defines how much), then push the data to the device and execute the kernel. I measure the execution time of clEnqueueNDRangeKernel and then calculate the GFLOPS of the application. In my current implementation, no matter what the size of the cl_mem buffer is, I get around 0.28 GFLOPS, which is much less than the advertised performance. I assume I am doing specific things inefficiently here, or my method for calculating GPU performance is wrong in general. Can anyone tell me what changes I should make to the code?
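The GFLOPS arithmetic described above can be written out as a small sketch. It assumes the timer reports elapsed time in nanoseconds (an assumption about SubmitEndRuntime(), not something stated in the post), and the helper name gflops is made up. One FLOP per nanosecond equals 1e9 FLOP per second, so dividing the FLOP count by nanoseconds gives GFLOPS directly.

```cpp
#include <cassert>

// Hypothetical helper: converts a FLOP count and an elapsed time in
// nanoseconds into GFLOPS (1 FLOP/ns == 1e9 FLOP/s == 1 GFLOPS).
double gflops(double flop_count, double elapsed_ns) {
    assert(elapsed_ns > 0.0);
    return flop_count / elapsed_ns;
}
```

For a kernel that does one multiply per element, flop_count would be the number of elements, so the figure passed in has to match what the kernel actually computes.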
With a local group size of 1, you are wasting 31/32 of the resources, so you can reach at most 1/32 of the peak performance. You need a local group size of at least 32 (and a multiple of 32) to fully utilize the compute resources, and 64 to achieve 100% occupancy (although 100% occupancy is not necessary).
Memory access has high latency and low bandwidth. Even if everything else is right, your kernel will always be waiting for the memory controllers. You need to do more arithmetic operations per memory access to keep the ALUs busy.
You need to read the documentation first and make use of the Visual Profiler. With the previous two points I just wanted to show that things are stranger than you thought, and more strange things are waiting.
You can achieve peak performance easily on a CPU with assembly language (by doing only independent arithmetic operations; if you write such code in C, the compiler will simply drop it). NVIDIA only provides an IL interface called PTX, and I'm not sure whether the compiler will optimize it. And I think you can only use PTX in CUDA.
Edit: It seems that the compiler will optimize unused PTX code away, at least in inline assembly.
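Following the advice about the local group size, the global work size passed to clEnqueueNDRangeKernel has to be a multiple of the local work size when a local size is given explicitly (at least in OpenCL 1.x). A minimal sketch of the rounding, with a made-up helper name round_up_to_multiple:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds n up to the next multiple of group.
std::size_t round_up_to_multiple(std::size_t n, std::size_t group) {
    assert(group > 0);
    return ((n + group - 1) / group) * group;
}
```

For example, a buffer of 1000 floats with a local size of 64 would be launched with a global size of 1024. The kernel would then need a bounds check (for instance an extra element-count argument, which is an assumption and not part of the original kernel) so that the padded work-items do not write past the end of the buffer.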

Strange behaviour of for_each and push_back()

I was doing some testing with for_each and the use of lambda functions and I'm stuck on this (compiled with g++ -std=c++11, gcc version 5.3.1)
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main() {
    vector<int> vi = {1,1,1,1};
    int end =0;
    cout << "vi contains: ";
    for_each(vi.begin(), vi.end(),[](int i){
        cout << i << " ";
    });
    cout << endl;
    for_each(vi.begin(),vi.end(),[&](int i){
        cout << "i="<<i<<" ";
    });
    cout << endl;
    cout << "end=" << end << endl;
    cout << "now vi contains: ";
    for_each(vi.begin(), vi.end(),[](int i){
        cout << i << " ";
    });
    cout << endl;
    return 0;
}
And this is the output of this code:
vi contains: 1 1 1 1
i=1 **i=0** i=1 i=1
now vi contains: 1 1 1 1 1 1 1
Why is i equal to 0 at the first iteration of the loop?
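Assuming the second lambda calls vi.push_back() while for_each is still walking the vector (as the title suggests), a probable explanation is iterator invalidation: push_back may reallocate the vector's storage, after which the iterators held by for_each point into freed memory and the behaviour is undefined. A minimal sketch of a safe alternative, using a made-up helper named duplicate_elements that loops over indices and fixes the element count before growing the vector:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: appends a copy of each original element.
std::vector<int> duplicate_elements(std::vector<int> v) {
    const std::size_t original_size = v.size();  // fixed before the vector grows
    v.reserve(original_size * 2);                // one allocation up front
    for (std::size_t k = 0; k < original_size; ++k) {
        int value = v[k];   // copy first, then append
        v.push_back(value); // indices stay valid even if storage moves
    }
    return v;
}
```

Called with {1,1,1,1}, this returns a vector of eight 1s. Indices remain meaningful across a reallocation, whereas iterators, pointers, and references into the vector do not.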

How to use SetConsoleTextAttribute C++

I have searched countless forums and websites but I can't seem to find the answer. I'm trying to use SetConsoleTextAttribute but it only affects the text. How can I affect the whole screen like the command color 1f would? My code is:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <wincon.h>
using namespace std;
int main()
{
    SetConsoleTitle("C++ CALCULATOR"); // Title of window
    int x; // Decision
    int a; // First Number
    int b; // Second Number
    int c; // Answer
    HANDLE Con = GetStdHandle(STD_OUTPUT_HANDLE); // Console output handle
    cout << "CALCULATOR" << endl << endl;
    cout << "1:ADDITION" << endl << "2:SUBTRACTION" << endl << "3:MULTIPLICATION";
    cout << endl << "4:DIVISION" << endl << "5:EXIT" << endl;
    cin >> x;
    switch (x)
    {
    case 1: // Addition code
        cout << endl << "ADDITION" << endl << "FIRST NUMBER:";
        cin >> a;
        cout << endl << "SECOND NUMBER:";
        cin >> b;
        c = a + b;
        cout << endl << "ANSWER:" << c;
        break;
    case 2: // Subtraction code
        cout << endl << "SUBTRACTION" << endl << "FIRST NUMBER:";
        cin >> a;
        cout << endl << "SECOND NUMBER:";
        cin >> b;
        c = a - b;
        cout << endl << "ANSWER:" << c;
        break;
    case 3: // Multiplication code
        cout << endl << "MULTIPLICATION" << endl << "FIRST NUMBER:";
        cin >> a;
        cout << endl << "SECOND NUMBER:";
        cin >> b;
        c = a * b;
        cout << endl << "ANSWER:" << c;
        break;
    case 4: // Division code
        cout << endl << "DIVISION" << endl << "FIRST NUMBER:";
        cin >> a;
        cout << endl << "SECOND NUMBER:";
        cin >> b;
        c = a / b;
        cout << endl << "ANSWER:" << c;
        break;
    case 5: // Exit code
        return 0;
    }
    return 0;
}
This solution relies on these WinAPI functions and structures:
GetConsoleScreenBufferInfo to get screen dimensions
FillConsoleOutputAttribute to fill screen with an attribute
CONSOLE_SCREEN_BUFFER_INFO structure to store screen information
The code is as follows:
// wAttributes (a WORD) is assumed to hold the desired attribute, e.g. 0x1F
COORD coordStart = { 0, 0 };   // Screen coordinate for upper left
DWORD dwNumWritten = 0;        // Holds # of cells written to
                               // by FillConsoleOutputAttribute
DWORD dwScrSize;
CONSOLE_SCREEN_BUFFER_INFO csbiScreenInfo;  // Receives the screen buffer information
HANDLE hCon = GetStdHandle(STD_OUTPUT_HANDLE);
// Get the screen buffer information including size and position of window
if (!GetConsoleScreenBufferInfo(hCon, &csbiScreenInfo))
{
    // Put error handling here
    return 1;
}
// Calculate number of cells on screen from screen size
dwScrSize = csbiScreenInfo.dwMaximumWindowSize.X * csbiScreenInfo.dwMaximumWindowSize.Y;
// Fill the screen with the specified attribute
FillConsoleOutputAttribute(hCon, wAttributes, dwScrSize, coordStart, &dwNumWritten);
// Set attribute for newly written text
SetConsoleTextAttribute(hCon, wAttributes);
The inline comments, together with the supplied documentation links, should be enough to understand the basics of what is going on. We get the screen size with GetConsoleScreenBufferInfo and use that to determine the number of cells on the screen to update with a new attribute using FillConsoleOutputAttribute. We then use SetConsoleTextAttribute to ensure that all new text that gets printed matches the attribute we used to color the entire console screen.
For brevity I have left out the error checks for the calls to FillConsoleOutputAttribute and SetConsoleTextAttribute. I put in a stub for the error handling of GetConsoleScreenBufferInfo. I leave it as an exercise for the original poster to add appropriate error handling if they so choose.
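To get the same look as the command color 1f, the attribute value passed to the calls above has to combine a background and a foreground color. In a console attribute the low four bits select the foreground and the next four bits select the background, so the two hex digits of the color command map directly onto the value. A platform-independent sketch of that arithmetic, with a made-up helper name make_console_attribute:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs two 4-bit color indices into a console
// attribute value (background in bits 4-7, foreground in bits 0-3).
std::uint16_t make_console_attribute(unsigned background, unsigned foreground) {
    assert(background < 16 && foreground < 16);
    return static_cast<std::uint16_t>((background << 4) | foreground);
}
```

make_console_attribute(0x1, 0xF) yields 0x1F, i.e. bright white text on a blue background, which could then be used as the wAttributes value.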
SetConsoleTextAttribute changes the attribute for new characters that you write to the console, but doesn't affect existing contents of the console.
If you want to change the attributes for existing characters already being displayed on the console, use WriteConsoleOutputAttribute instead.
