I'm trying to get total system memory using GlobalMemoryStatusEx():
MEMORYSTATUSEX memory;
GlobalMemoryStatusEx(&memory);
#define PRINT(v) {printf("%s ~%.3fGB\n", (#v), ((double)v)/(1024.*1024.*1024.));}
PRINT(memory.ullAvailPhys);
PRINT(memory.ullTotalPhys);
PRINT(memory.ullTotalVirtual);
PRINT(memory.ullAvailPageFile);
PRINT(memory.ullTotalPageFile);
#undef PRINT
fflush(stdout);
But the results are very weird and make no sense:
memory.ullAvailPhys ~1.002GB
memory.ullTotalPhys ~1.002GB
memory.ullTotalVirtual ~0.154GB
memory.ullAvailPageFile ~0.002GB
memory.ullTotalPageFile ~1.002GB
My total physical memory is 8 GB, but none of the results comes close to it; all the values are much smaller.
Also, the 'total' values change on every run. For instance, here is another result:
memory.ullAvailPhys ~0.979GB
memory.ullTotalPhys ~0.979GB
memory.ullTotalVirtual ~0.154GB
memory.ullAvailPageFile ~0.002GB
memory.ullTotalPageFile ~0.979GB
What am I doing wrong?
This is the part you are missing:
MEMORYSTATUSEX memory = { sizeof memory };
MSDN:
dwLength
The size of the structure, in bytes. You must set this member before calling GlobalMemoryStatusEx.
If you had checked the value returned by GlobalMemoryStatusEx, you would have seen the error indication there.
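Putting it together, a minimal corrected version of the snippet, with the return value checked:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORYSTATUSEX memory = { sizeof memory };   /* dwLength must be set first */

    if (!GlobalMemoryStatusEx(&memory)) {
        fprintf(stderr, "GlobalMemoryStatusEx failed: %lu\n", GetLastError());
        return 1;
    }
    printf("memory.ullTotalPhys ~%.3fGB\n",
           (double)memory.ullTotalPhys / (1024. * 1024. * 1024.));
    return 0;
}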
I am trying to implement a cache-based covert channel in C, but I noticed something weird. The sender and the receiver share a physical address by mmap()ing the same file with the MAP_SHARED option. Below is the code for the sender process, which flushes an address from the cache to transmit a 1 and loads an address into the cache to transmit a 0. It also measures the latency of a load in both cases:
// computes latency of a load operation
static inline CYCLES load_latency(volatile void *p) {
    CYCLES t1 = rdtscp();
    load = *((int *)p);
    CYCLES t2 = rdtscp();
    return (t2 - t1);
}

void send_bit(int one, void *addr) {
    if (one) {
        clflush((void *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
        clflush((void *)addr);
    } else {
        x = *((int *)addr);
        load__latency = load_latency((void *)addr);
        printf("load latency = %d.\n", load__latency);
    }
}

int main(int argc, char **argv) {
    if (argc == 2) {
        bit = atoi(argv[1]);
    }
    // transmit bit
    init_address(DEFAULT_FILE_NAME);
    send_bit(bit, address);
    return 0;
}
int main(int argc, char **argv) {
if(argc == 2)
{
bit = atoi(argv[1]);
}
// transmit bit
init_address(DEFAULT_FILE_NAME);
send_bit(bit, address);
return 0;
}
The load operation takes around 0–1000 cycles (spanning both a cache hit and a cache miss) when issued by the same process.
The receiver program loads the same shared physical address and measures the latency on a cache hit or a cache miss; its code is shown below:
int main(int argc, char **argv) {
    init_address(DEFAULT_FILE_NAME);
    rdtscp();
    load__latency = load_latency((void *)address);
    printf("load latency = %d\n", load__latency);
    return 0;
}
(I ran the receiver manually after the sender process terminated)
However, the latency observed in this scenario is very different from the first case: the load operation takes around 5000–10000 cycles.
Both processes have been pinned to the same core with the taskset command. So if I'm not wrong, both processes should see the L1-cache latency on a cache hit and the DRAM latency on a cache miss. Yet the two processes experience very different latencies. What could be the reason for this observation, and how can I get both processes to experience the same latency?
The initial access to an mmaped region will page-fault (lazy mapping/allocation by the kernel), unless you use mmap(MAP_POPULATE), or mlock, or touch some other cache line of the page first.
You're probably timing a page fault if you only do one time measurement per mmap, or per run of a whole program.
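For example, a sketch of such a mapping (the file name and length used by init_address aren't shown in the question, so path and len here are placeholders):

#define _GNU_SOURCE       /* for MAP_POPULATE */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Pre-fault the shared mapping so the first timed load doesn't page-fault. */
void *map_shared(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);            /* the mapping remains valid after close */
    return p;
}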
(Also, you don't seem to be doing anything to warm up the CPU frequency, so one core cycle could be many reference cycles. Some of the time for an L3 miss is fixed in terms of memory clock cycles, but another part of it scales with the core/uncore clock.)
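A crude warm-up sketch (assumption: a short busy loop before the timed section is enough to get the core out of its low-frequency idle state):

/* Busy-wait so the core ramps up to its normal clock before timing anything. */
for (volatile unsigned long spin = 0; spin < 200000000UL; spin++) { }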
Also note that unless you run the 2nd process immediately (e.g. from the same shell command), the OS will get a chance to put that core into a deep sleep. On Intel CPUs at least, that empties L1d and L2 so it can power them down in the deeper C states. Probably also the TLBs.
It's also strange that you cast away volatile in load = *((int *)p);
Assigning the load result to a global(?) variable inside the timed region is also pretty questionable; that could also soft page fault. If so, RDTSCP will have to wait for it, because the store can't retire.
(But on a TLB hit for the store, rdtscp doesn't have to wait for the store to commit to cache, since there's no mfence before it. A store can retire while the data is still in the store buffer. In fact it must retire before the store-buffer entry is known to be non-speculative, and only then can it commit.)
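For what it's worth, a sketch of a timing helper that keeps the result in a local instead of storing to a global inside the timed region (this is a variation, not the question's code; __rdtscp and _mm_lfence come from x86intrin.h):

#include <x86intrin.h>
#include <stdint.h>

/* Time a single load; lfence keeps surrounding work out of the timed window,
 * and the result goes into a local, so no store has to retire before rdtscp. */
static inline uint64_t time_load(volatile int *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t1 = __rdtscp(&aux);
    int sink = *p;               /* the timed load */
    uint64_t t2 = __rdtscp(&aux);
    _mm_lfence();
    (void)sink;
    return t2 - t1;
}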
The RtlQueryRegistryValues function can allocate memory for a returned REG_SZ result.
Everything works nicely, but I cannot figure out how to release that memory.
Code fragment:
UNICODE_STRING result;
RtlZeroMemory(&result, sizeof(UNICODE_STRING));
RTL_QUERY_REGISTRY_TABLE queryTable[2];
queryTable[0].Flags = RTL_QUERY_REGISTRY_DIRECT | . . .;
queryTable[0].EntryContent = &result;
. . .
RtlQueryRegistryValues(..., queryTable, NULL, NULL);
All fields of result are populated correctly. I am trying to find out how to deallocate result.Buffer.
As a workaround, I can allocate sufficient buffer myself before calling the function, but I am looking for clean solution.
Thanks!
You can use ExFreePool() to free the memory, but since it is a Unicode string you should also try RtlFreeUnicodeString().
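A minimal sketch (assuming the buffer really was allocated by RtlQueryRegistryValues rather than supplied by the caller):

if (result.Buffer != NULL) {
    RtlFreeUnicodeString(&result);   /* frees result.Buffer */
}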
I am trying to debug a custom Linux serial driver that is missing some receive data. It has one interrupt for 4 serial ports, and the baud rate is 115200. First, I would like to know how to measure how long the interrupt handler takes; I have used perf, but it only reports percentages, not seconds. Second, does anyone see any issues with the code below that could be improved to speed things up?
void serial_interrupt(int irq, void *dev_id)
{
    ...
    // Need to loop through each port to see which port caused the interrupt.
    list_for_each(lpNode, &serial_ports)
    {
        struct serial_port_module *ser_dev =
            list_entry(lpNode, struct serial_port_module, port_list);
        lnIsr = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_ISR);
        if (lnIsr & IPM512_RX_INT)
        {
            while (serialdata_is_data_available(ser_dev)) // equals an ioread8()
            {
                lcIn = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_RBR);
                kfifo_in(&ser_dev->rx_fifo, &lcIn, sizeof(lcIn));
                // Notify if anyone is doing a blocking read.
                wake_up_interruptible(&ser_dev->read_queue);
            }
        }
    }
}
Use the ftrace API to try to track down your latency issues. It's worth the time to get to know: https://www.kernel.org/doc/Documentation/trace/ftrace.txt
If this is too heavy-weight, what about adding some simple instrumentation yourself? getnstimeofday(struct timespec *ts) is relatively lightweight. With a little code you could output to a sysfs debug file the worst-case execution time, some stats on the latency of calls to this function, and the worst-case number of bytes available per interrupt. If that number gets near your hardware FIFO size, you're in trouble.
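A rough sketch of that kind of instrumentation (isr_worst is a made-up name; exporting it through a sysfs/debugfs file is left out):

#include <linux/time.h>

static struct timespec isr_worst;   /* worst-case handler duration seen so far */

void serial_interrupt(int irq, void *dev_id)
{
    struct timespec start, end, delta;

    getnstimeofday(&start);
    /* ... existing handler body ... */
    getnstimeofday(&end);

    delta = timespec_sub(end, start);
    if (timespec_compare(&delta, &isr_worst) > 0)
        isr_worst = delta;
}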
One optimization would be to read the data in batches into a buffer while data is available, then insert the entire buffer into the kfifo, and only then wake up any readers:
while (data_available(dev))
{
    buf[cnt++] = ioread8();
}
kfifo_in(fifo, buf, cnt);
wake_up_interruptible();
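In terms of the original handler's identifiers, that could look roughly like this (FIFO_BATCH is a made-up bound; size it to at least the hardware FIFO depth):

unsigned char buf[FIFO_BATCH];
int cnt = 0;

while (serialdata_is_data_available(ser_dev) && cnt < (int)sizeof(buf))
    buf[cnt++] = ioread8(ser_dev->membase + ser_dev->chan_num * PORT_OFFSET + SERIAL_RBR);

if (cnt) {
    kfifo_in(&ser_dev->rx_fifo, buf, cnt);
    wake_up_interruptible(&ser_dev->read_queue);   /* one wakeup per batch, not per byte */
}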
But execution time of code this simple is not likely to be an issue. You're probably suffering from missed interrupts or unexpected latency of the interrupt handling.
I am trying to make asynchronous kernel calls to my GPGPU using CUDAfy .NET.
When I pass values to the kernel and copy them back to the host, I do not always get the value I expect.
I have a structure Foo with a byte Bar:
[Cudafy]
public struct Foo {
public byte Bar;
}
And I have a kernel I want to call:
[Cudafy]
public static void simulation(GThread thread, Foo[] f)
{
f[0].Bar = 3;
thread.SyncThreads();
}
I have a single thread with streamID = 1. (I tried using multiple threads and noticed the issue; reducing to a single thread didn't seem to fix it, though.)
//allocate
streamID = 1;
count = 1;
gpu.CreateStream(streamID);
Foo[] sF = new Foo[count];
IntPtr hF = gpu.HostAllocate<Foo>(count);
Foo[] dF = gpu.Allocate<Foo>(sF);

while (true)
{
    //set value
    sF[0].Bar = 1;
    byte begin = sF[0].Bar;

    //host -> pinned
    GPGPU.CopyOnHost<Foo>(sF, 0, hF, 0, count);
    sF[0].Bar = 2;

    lock (gpu)
    {
        //pinned -> device
        gpu.CopyToDeviceAsync<Foo>(hF, 0, dF, 0, count, streamID);
        //run
        gpu.Launch().simulation(dF);
        //device -> pinned
        gpu.CopyFromDeviceAsync<Foo>(dF, 0, hF, 0, count, streamID);
    }

    //WAIT
    gpu.SynchronizeStream(streamID);

    //pinned -> host
    GPGPU.CopyOnHost<Foo>(hF, 0, sF, 0, count);
    byte end = sF[0].Bar;
}

//de-allocate
gpu.Free(dF);
gpu.HostFree(hF);
gpu.DestroyStream(streamID);
First I create a stream on the GPU.
I create a regular Foo structure array of size 1 (sF) and set its Bar value to 1. Then I create pinned memory on the host (hF) for Foo as well, and I also allocate memory on the device for Foo (dF).
I initialize the structure's Bar value to 1 and then copy it to the pinned memory. (As a check, I set the structure's value to 2 after copying to pinned; you'll see why later.) Then I use a lock to ensure I have full access to the GPU, and I queue a copy to dF, a kernel launch, and a copy from dF. At this point I don't know when this will all actually run on the GPU, so I call SynchronizeStream to wait on the host until the device is done.
When it's done, I copy the pinned memory (hF) back to the regular array (sF). The value I get is usually 3 (which was set on the device) or 1 (which means either the value wasn't set in the kernel, or the new value wasn't copied to the pinned memory). I do know that the pinned memory is copied to the structure, because the structure never has the value 2.
Over many runs, a small percentage of runs results in something other than begin=1 and end=3. It is always begin=1, end=1, and it happens about 5-10% of the time.
I have no idea why this happens. I know it generally highlights a race condition, but by calling the sync calls, I would expect the async calls to work in a predictable fashion.
Why would I be encountering this kind of issue with this code?
Thank you so much!
-Phil
I just figured out the issue. While the launch was being done asynchronously... I didn't include the stream for the launch.
Changing my launch to be:
gpu.Launch(gridsize,blocksize,streamID).simulation(dF);
resolved the problem. It seems that the launches were occurring on stream 0 while streams 1 and 2 were being synced, so sometimes the data gets set and sometimes it doesn't: a race condition.
Hi, I'm new to this and I need help. The code is supposed to just show the 'S' in RealTerm, but instead it gives 'null'. What would be the problem? Could it be the registers, or the code itself?
#include <avr/io.h>
#include <util/delay.h>

void UART_Init(unsigned int ubrr)
{
    UBRRH = (unsigned int)(ubrr >> 8);
    UBRRL = (unsigned int)ubrr;
    UCSRA = 0x00;
    UCSRB = (1 << TXEN) | (1 << RXEN);
    UCSRC = (0 << USBS) | (1 << UCSZ0) | (1 << UCSZ1);
}

void UART_Tx(unsigned char chr)
{
    while (bit_is_clear(UCSRA, UDRE)) {}
    UDR = chr;
}

int main(void)
{
    UART_Init(95);
    DDRD |= 0B11111111;
    PORTD |= 0B11111111;
    while (1) {
        _delay_ms(10);
        UART_Tx('S');
    }
}
The system is running on a crystal at 14745600 Hz. The speed on the host is 9600 baud, and all settings should be 8N1.
You need to set the URSEL bit when writing to the UCSRC register.
Change
UCSRC=(0<<USBS)|(1<<UCSZ0)|(1<<UCSZ1);
to
UCSRC=(1<<URSEL)|(0<<USBS)|(1<<UCSZ0)|(1<<UCSZ1);
From the data sheet:
The UBRRH Register shares the same I/O location as the UCSRC Register. Therefore some special consideration must be taken when accessing this I/O location. When doing a write access of this I/O location, the high bit of the value written, the USART Register Select (URSEL) bit, controls which one of the two registers that will be written. If URSEL is zero during a write operation, the UBRRH value will be updated. If URSEL is one, the UCSRC setting will be updated.
The rest of the code looks fine to me.
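For reference, the fixed init function in full (same register names as the question, for an ATmega part where UBRRH and UCSRC share an I/O address):

void UART_Init(unsigned int ubrr)
{
    UBRRH = (unsigned char)(ubrr >> 8);   /* high bit (URSEL) is 0, so this writes UBRRH */
    UBRRL = (unsigned char)ubrr;
    UCSRA = 0x00;
    UCSRB = (1 << TXEN) | (1 << RXEN);
    /* URSEL = 1 routes this write to UCSRC; UCSZ1|UCSZ0 selects 8 data bits (8N1) */
    UCSRC = (1 << URSEL) | (1 << UCSZ0) | (1 << UCSZ1);
}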
Change UART_Tx('S'); to UART_Tx("S");