Windows' AllocationPreference-like setting in macOS to test 32-bit -> 64-bit code?

As has been mentioned in previous answers here, Windows has a registry key that forces memory allocations top-down, which helps catch pointer truncation issues when moving code from 32-bit to 64-bit:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference
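For reference, the same top-down placement that this key forces machine-wide can also be requested per allocation with VirtualAlloc's MEM_TOP_DOWN flag; a minimal Win32 sketch:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    // MEM_TOP_DOWN asks for the highest available address; in a 64-bit
    // process the returned pointer is typically above 4 GB, which makes
    // truncation bugs surface.
    void *p = VirtualAlloc(NULL, 4096,
                           MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN,
                           PAGE_READWRITE);
    printf("Pointer = %p\n", p);
    if (p)
        VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}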
However, I don't see anything similar in macOS. I wrote a quick program to check, and it seems that in a 64-bit build, every allocation needs at least 33 bits to represent its address.
I'm wondering whether a 64-bit Mac OS X program will ever allocate memory at an address that fits in 32 bits or less. If I compile and run the same program as 32-bit, it obviously uses addresses in the lower 32-bit space.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, const char * argv[])
{
    void * ptr = NULL;
    for (int i = 0; i < 20; i++)
    {
        ptr = malloc(10);
        // Does the returned address fit in 32 bits?
        if ((uintptr_t)ptr > UINT32_MAX)
            printf("Pointer = %p > 32 bits required!\n", ptr);
        else
            printf("Pointer = %p <= 32 bits ok\n", ptr);
    }
    return 0;
}
Thanks for any help!
Matt

Related

Same binary but different assembly instructions at runtime - Windows 7

For a bit of background, I was playing around with anti-debug techniques. To prevent software breakpoints, one can search at runtime for 0xCC inside a memory segment. Code example here -> https://github.com/LordNoteworthy/al-khaser/blob/master/al-khaser/AntiDebug/SoftwareBreakpoints.cpp
Instead of checking only one function, I wanted to scan the whole .text section at runtime and compute a checksum of the section. After some research I ended up with something like this.
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int main(int argc, TCHAR* argv[])
{
    HMODULE imageBase = GetModuleHandle(NULL);
    PIMAGE_DOS_HEADER imageDosHeader = (PIMAGE_DOS_HEADER)imageBase;
    PIMAGE_NT_HEADERS imageHeader = (PIMAGE_NT_HEADERS)((SIZE_T)imageBase + imageDosHeader->e_lfanew);
    // Walk the code section from BaseOfCode to BaseOfCode + SizeOfCode.
    SIZE_T base = (SIZE_T)imageBase + (SIZE_T)imageHeader->OptionalHeader.BaseOfCode;
    SIZE_T end = base + (SIZE_T)imageHeader->OptionalHeader.SizeOfCode;
    int i = 0;
    int total = 0;
    while (base < end)
    {
        if (*((unsigned char*)base) != 0)
        {
            printf("ASM HEX: 0x%x \n", *((unsigned char*)base));
            total += *((unsigned char*)base);
        }
        base++;
        i++;
    }
    printf("i=%d\nTotal: %d\n", i, total);
    return 0;
}
The code basically adds up every non-zero byte of the code section and prints the total (it also prints each byte, but I left that output out below for clarity). There is only one .text/code section.
PS C:\Users\buildman\Source\Repos\Test\Debug> For ($i=1; $i -lt 10; $i++) {.\Test.exe | findstr "Total"}
Total: 124027
Total: 141202
Total: 123952
Total: 141502
Total: 125677
Total: 125677
Total: 122602
Total: 140302
Total: 140302
The question is: why is the total different each time?
A difference from one compilation to another I would understand, but I ran the same code 10 times in a row and some instructions are different... What are those changed instructions, and why?
I'm running it on Windows 7 (inside a VM), compiled for v110_XP.
Thank you.
Edit 1: Maybe it's because of ASLR, but isn't that supposed to be random? If the addresses were random, the sum should be different every time, yet some totals repeat.
@PeterCordes is right (see the comments). It's because of ASLR: I just tested the code with ASLR off and the sum is always the same.
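For anyone who wants to confirm which case they are in, the same NT headers the code above already walks also record whether the image opted in to ASLR; a rough sketch checking the standard IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE flag:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HMODULE imageBase = GetModuleHandle(NULL);
    PIMAGE_DOS_HEADER dosHeader = (PIMAGE_DOS_HEADER)imageBase;
    PIMAGE_NT_HEADERS ntHeaders = (PIMAGE_NT_HEADERS)((SIZE_T)imageBase + dosHeader->e_lfanew);
    // With dynamic base enabled the loader may relocate the image, and the
    // fixed-up absolute addresses inside .text change the byte sum per run.
    if (ntHeaders->OptionalHeader.DllCharacteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)
        printf("ASLR (dynamic base) enabled for this image\n");
    else
        printf("ASLR (dynamic base) disabled for this image\n");
    return 0;
}
Linking with /DYNAMICBASE:NO (or clearing the flag) keeps the load address fixed, which matches the observation that the sum becomes constant with ASLR off.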

Slow parallel memory access in OpenCL / nVidia, what did I miss?

I'm playing around with OpenCL on a GeForce GTX 550 with driver version 331.38 on Ubuntu 14.04. What stumps me is the speed of copying from global to local memory. As far as I know, the following code should do coalesced access to global memory:
void toLocal(__local float* target, const __global float* source, int count) {
    const int iterations = (count + get_local_size(0) - 1) / get_local_size(0);
    for (int i = 0; i < iterations; i++) {
        int idx = i * get_local_size(0) + get_local_id(0);
        if (idx < count)
            target[idx] = source[idx];
    }
}
In practice, the following code (in which every work-item copies every float, so the same element is copied over and over) is measurably faster:
void toLocal(__local float* target, const __global float* source, int count) {
    for (int i = 0; i < count; i++)
        target[i] = source[i];
}
Both source and target point directly at the beginning of a buffer, so I would guess they are correctly aligned. The work-group size is 16 by 16; trying to use all threads makes the code more complex but doesn't affect speed. The optimal coalescing width would be 128 bytes, or 32 floats, but as far as I know, on compute capability 2.x cards (which the GTX 550 is) the penalty for using only part of a transaction, or even permuting the elements, shouldn't be that bad. Adding a local memory fence to the first version only makes it slower. Is there anything else I missed?
EDIT: Changing the group size to 32 by 32 made the parallel version roughly as fast as the sequential one was at 16 by 16, and made the sequential version slightly slower. Still not the speed improvement I was expecting.
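One more thing that may be worth measuring: OpenCL has a built-in for exactly this global-to-local copy, async_work_group_copy, which lets the implementation choose the access pattern itself. A minimal sketch of the helper rewritten around it, just as a point of comparison rather than a guaranteed win on this hardware:
// Sketch: delegate the global->local copy to the built-in. All work-items
// in the group must reach this call with identical arguments.
void toLocalAsync(__local float* target, const __global float* source, int count) {
    event_t e = async_work_group_copy(target, source, (size_t)count, 0);
    wait_group_events(1, &e);
}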

Speed of memcpy() greatly influenced by different ways of malloc()

I wrote a program to test the speed of memcpy(). However, how the memory is allocated greatly influences the speed.
CODE
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char *argv[]){
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;
    unsigned long iters = 1000*1000;
    int type = atoi(argv[1]);
    int buff_size = atoi(argv[2])*1024;
    if(type == 1){
        /* one allocation split into two halves */
        pbuff_1 = (void *)malloc(2*buff_size);
        pbuff_2 = pbuff_1+buff_size;
    }else{
        /* two separate allocations */
        pbuff_1 = (void *)malloc(buff_size);
        pbuff_2 = (void *)malloc(buff_size);
    }
    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buff_size);
    }
    if(type == 1){
        free(pbuff_1);
    }else{
        free(pbuff_1);
        free(pbuff_2);
    }
    return 0;
}
The OS is linux-2.6.35 and the compiler is GCC-4.4.5 with options "-std=c99 -O3".
Results on my computer (memcpy 4 KB, iterated 1 million times):
time ./test.test 1 4
real 0m0.128s
user 0m0.120s
sys 0m0.000s
time ./test.test 0 4
real 0m0.422s
user 0m0.420s
sys 0m0.000s
This question is related to a previous question:
Why does the speed of memcpy() drop dramatically every 4KB?
UPDATE
The reason is related to the GCC compiler; I compiled and ran this program with different versions of GCC:
GCC version     4.1.3       4.4.5       4.6.3
Time Used(1)    0m0.183s    0m0.128s    0m0.110s
Time Used(0)    0m1.788s    0m0.422s    0m0.108s
It seems GCC is getting smarter.
The specific addresses returned by malloc are selected by the implementation and are not always optimal for the code that uses them. You already know that the speed of moving memory around depends greatly on cache and page effects.
Here, the specific pointers returned by malloc are not known; you could print them with printf("%p", ptr). What is known, however, is that using just one malloc for the two blocks avoids page and cache waste between the blocks. That may already be the reason for the speed difference.
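For example, a quick way to see the addresses each strategy produces (just the allocation part of the test program above, with the suggested printf added):
#include <stdlib.h>
#include <stdio.h>

int main(void){
    int buff_size = 4*1024;
    /* strategy 1: one allocation split into two halves */
    unsigned char *a1 = malloc(2*buff_size);
    unsigned char *a2 = a1 + buff_size;
    /* strategy 0: two separate allocations */
    unsigned char *b1 = malloc(buff_size);
    unsigned char *b2 = malloc(buff_size);
    printf("one malloc : %p %p\n", (void*)a1, (void*)a2);
    printf("two mallocs: %p %p\n", (void*)b1, (void*)b2);
    /* comparing the low bits of the addresses shows how source and
       destination line up with pages and cache sets */
    free(a1);
    free(b1);
    free(b2);
    return 0;
}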

OpenCL in Xcode/OSX - Can't assign zero in kernel loop

I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (size_t i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int)0, etc., with no success. Passing zero in as a kernel argument worked, though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <stdlib.h>
#include <OpenCL/OpenCL.h>
#include "zero.cl.h"

int main(int argc, const char* argv[]) {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
    size_t num_buckets = 64;
    size_t positions_per_bucket = 4;
    cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
    cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket, NULL, CL_MEM_WRITE_ONLY);
    dispatch_sync(queue, ^{
        cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
        zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
        gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
    });
    for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
        printf("%d ", h_array[i]);
    printf("\n");
    return 0;
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that... but if that doesn't fix it, then it looks like a compiler bug, plain and simple (in CLC, that is, the OpenCL compiler, not your host code). There is no reason a kernel should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?
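A minimal sketch of that fix, with the disallowed size_t parameters replaced by uint (the host side would then pass cl_uint values to zero_kernel):
// Same kernel as above, but with uint arguments, which 6.8 k permits.
kernel void zero(uint num_buckets, uint positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (uint i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}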

Memory problems with a multi-threaded Win32 service that uses STL on VS2010

I have a multi-threaded Win32 service written in C++ (VS2010) that makes extensive use of the standard template library. The business logic of the program operates properly, but when looking at the task manager (or resource manager) the program leaks memory like a sieve.
I have a test set that averages about 16 simultaneous requests per second. When the program is first started up it consumes somewhere in the neighborhood of 1.5 MB of RAM. After a full test run (which takes 12-15 minutes) the memory consumption ends up somewhere near 12 MB. Normally this would not be a problem for a program that runs once and then terminates, but this program is intended to run continuously. Very bad, indeed.
To try and narrow down the problem, I created a very small test application that spins off worker threads at a rate of once every 250ms. The worker thread creates a map and populates it with pseudo-random data, empties the map, and then exits. This program, too, leaks memory in like fashion, so I'm thinking that the problem is with the STL not releasing the memory as expected.
I have tried VLD to search for leaks and it has found a couple, which I have remedied, but still the problem remains. I have tried integrating Hoard, but that has actually made the problem worse (I'm probably not integrating it properly, but I can't see how).
So I would like to pose the following question: is it possible to create a program that uses the STL in a multi-threaded environment that will not leak memory? Over the course of the last week I have made no less than 200 changes to this program. I have plotted the results of the changes and they all have the same basic profile. I don't want to have to remove all of the STL goodness that has made developing this application so much easier. I would earnestly appreciate any suggestions on how I can get this app working without leaking memory like it's going out of style.
Thanks again for any help!
P.S. I'm posting a copy of the memory test for inspection/personal edification.
#include <string>
#include <iostream>
#include <Windows.h>
#include <map>
using namespace std;

#define MAX_THD_COUNT 1000

DWORD WINAPI ClientThread(LPVOID param)
{
    unsigned int thdCount = (unsigned int)param;
    map<int, string> m;
    for (unsigned int x = 0; x < 1000; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            unsigned int zs = z.size();
            s += z[(y % zs)];
        }
        m[x] = s;
    }
    m.erase(m.begin(), m.end());
    ExitThread(0);
    return 0;
}

int main(int argc, char ** argv)
{
    // wait for start
    string inputWait;
    cout << "type g and press enter to go: ";
    cin >> inputWait;
    // spawn many memory-consuming threads
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        CreateThread(NULL, 0, ClientThread, (LPVOID)thdCount, NULL, NULL);
        cout
            << (int)(MAX_THD_COUNT - thdCount)
            << endl;
        Sleep(250);
    }
    // wait for end
    cout << "type e and press enter to end: ";
    cin >> inputWait;
    return 0;
}
Use _beginthreadex() rather than CreateThread() when using the standard library (which, as far as MS is concerned, includes the C runtime). Also, you're going to see a certain amount of fragmentation in the runtime's sub-allocator, especially with code like this that is designed to keep making larger and larger requests.
The MS runtime library has functions that let you debug memory requests and determine whether there is a genuine leak, once you have a sound algorithm and are confident there is nothing glaringly obvious. See the CRT debug routines for more information.
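For example, a minimal sketch of checkpointing the debug heap around a suspect piece of work with the <crtdbg.h> routines (Debug build required; SuspectWork is just a placeholder for your own code):
#include <crtdbg.h>
#include <map>
#include <string>

void SuspectWork()
{
    std::map<int, std::string> m;
    for (int i = 0; i < 1000; ++i)
        m[i] = "some data";
}

int main()
{
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);       // snapshot the debug heap
    SuspectWork();                    // the code under suspicion
    _CrtMemCheckpoint(&after);        // snapshot again
    if (_CrtMemDifference(&diff, &before, &after))
        _CrtMemDumpStatistics(&diff); // report what was not released
    return 0;
}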
Finally, I made the following modifications to the test jig you wrote:
Set up the proper _Crt report mode for spamming the debug window with any memory leaks after shutdown.
Modified the thread-startup loop to keep the maximum number of threads running constantly at MAXIMUM_WAIT_OBJECTS (WIN32-defined currently as 64 handles)
Threw in a purposeful leaked char array allocation to show the CRT will, in fact, catch it when dumping at program termination.
Eliminated console keyboard interaction. Just run it.
Hopefully this will make sense when you see the output log. Note: you must compile in Debug mode for this to produce a proper dump.
#include <windows.h>
#include <dbghelp.h>
#include <process.h>
#include <crtdbg.h>
#include <string>
#include <iostream>
#include <map>
#include <vector>
using namespace std;

#define MAX_THD_COUNT 250
#define MAX_THD_LOOPS 250

unsigned int _stdcall ClientThread(void *param)
{
    unsigned int thdCount = (unsigned int)param;
    map<int, string> m;
    for (unsigned int x = 0; x < MAX_THD_LOOPS; ++x)
    {
        string s;
        for (unsigned int y = 0; y < (x % (thdCount + 1)); ++y)
        {
            string z = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            size_t zs = z.size();
            s += z[(y % zs)];
        }
        m[x].assign(s);
    }
    return 0;
}

int main(int argc, char ** argv)
{
    // setup reporting mode for the debug heap. when the program
    // finishes watch the debug output window for any potential
    // leaked objects. We're leaking one on purpose to show this
    // will catch the leaks.
    int flg = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    flg |= _CRTDBG_LEAK_CHECK_DF;
    _CrtSetDbgFlag(flg);
    static char msg[] = "Leaked memory.";
    new std::string(msg);

    // will hold our vector of thread handles. we keep this fully populated
    // with running threads until we finish the startup list, then wait for
    // the last set of threads to expire.
    std::vector<HANDLE> thrds;
    for (unsigned int thdCount = 0; thdCount < MAX_THD_COUNT; ++thdCount)
    {
        cout << (int)(MAX_THD_COUNT - thdCount) << endl;
        thrds.push_back((HANDLE)_beginthreadex(NULL, 0, ClientThread, (void*)thdCount, 0, NULL));
        if (thrds.size() == MAXIMUM_WAIT_OBJECTS)
        {
            // wait for any single thread to terminate. we'll start another one after,
            // cleaning up as we detected terminated threads
            DWORD dwRes = WaitForMultipleObjects(thrds.size(), &thrds[0], FALSE, INFINITE);
            if (dwRes >= WAIT_OBJECT_0 && dwRes < (WAIT_OBJECT_0 + thrds.size()))
            {
                DWORD idx = (dwRes - WAIT_OBJECT_0);
                CloseHandle(thrds[idx]);
                thrds.erase(thrds.begin()+idx, thrds.begin()+idx+1);
            }
        }
    }

    // there will be threads left over. need to wait on those too.
    if (thrds.size() > 0)
    {
        WaitForMultipleObjects(thrds.size(), &thrds[0], TRUE, INFINITE);
        for (std::vector<HANDLE>::iterator it=thrds.begin(); it != thrds.end(); ++it)
            CloseHandle(*it);
    }
    return 0;
}
Output Debug Window
Note: there are two leaks reported. One is the std::string allocation, the other is the buffer within the std::string that held our message copy.
Detected memory leaks!
Dumping objects ->
{80} normal block at 0x008B1CE8, 8 bytes long.
Data: <09 > 30 39 8B 00 00 00 00 00
{79} normal block at 0x008B3930, 32 bytes long.
Data: < Leaked memor> E8 1C 8B 00 4C 65 61 6B 65 64 20 6D 65 6D 6F 72
Object dump complete.
Debugging large applications is not an easy task.
Your sample is not the best way to show what is happening; a fragment of your real code would tell us more.
Of course that is not always possible, so my suggestion is: log as much as you can, including insertion and deletion controls in all of your structures, and keep counters for this information.
When you suspect something, dump all of that data to understand what is happening.
Try to write the information out asynchronously so there is less impact on your application. This is not an easy task, but for anyone who enjoys a challenge and loves programming in C/C++ it will be quite a ride.
Persistence and simplicity should be the goal.
Good luck.
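If you want to try the counter idea, a minimal sketch (the helper names are made up for this example) that tracks inserts and erases with interlocked counters you can dump whenever a leak is suspected:
#include <windows.h>
#include <stdio.h>
#include <map>
#include <string>

// Live-allocation counters bumped around container use.
volatile LONG g_inserts = 0;
volatile LONG g_erases  = 0;

void InsertTracked(std::map<int, std::string> &m, int key, const std::string &value)
{
    m[key] = value;
    InterlockedIncrement(&g_inserts);
}

void EraseTracked(std::map<int, std::string> &m, int key)
{
    if (m.erase(key) > 0)
        InterlockedIncrement(&g_erases);
}

// Dump the counters periodically or on demand; a growing gap between
// inserts and erases points at the structure that is holding memory.
void DumpCounters()
{
    printf("inserts=%ld erases=%ld live=%ld\n",
           g_inserts, g_erases, g_inserts - g_erases);
}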
