PowerShell RAM high water mark - Windows

Is there a way to get the high water mark of RAM usage in PowerShell?
Running on a 2008 R2 system.
I was thinking I would have to write a script to turn on RAM counters, then use PowerShell to query. Does that make sense? Any examples?
-Ken

There is no Peak Memory Used counter for the entire box, but you can query \Paging File(_Total)\% Usage Peak. You may find this gives an indication of whatever event you're trying to monitor.
$categoryName = 'Paging File'
$counterName = '% Usage Peak'
$instanceName = '_Total'
$counter = "\$categoryName($instanceName)\$counterName"
Get-Counter $counter
There is also a set of Peak counters for each running process. If there's a particular service you're interested in (like inetinfo), then query \Process(inetinfo)\Virtual Bytes Peak, \Process(inetinfo)\Working Set Peak, and \Process(inetinfo)\Page File Bytes Peak. The odds are this will get you closer to your real goal anyway: finding out where your memory is being consumed.
If neither of those does what you want, then you'll need to set up an actual counter log. This can be done programmatically via PowerShell (if you're trying to do this in bulk against a number of machines), but it's usually easier just to use the PerfMon Control Panel Applet. If you save that data to a .csv, you can then use PowerShell to analyze for maximum values of significant columns.
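If you do end up exporting a counter log to .csv, extracting the high-water mark is just a max over a column. Here is a minimal sketch in Python (the host and counter names in the sample data are hypothetical; PerfMon writes one quoted header per counter path):

```python
import csv
import io

# Hypothetical PerfMon-style CSV: first column is the timestamp,
# remaining columns are counter samples.
data = """"(PDH-CSV 4.0)","\\\\HOST\\Memory\\Committed Bytes"
"03/01/2024 10:00:00","1200000000"
"03/01/2024 10:00:15","1450000000"
"03/01/2024 10:00:30","1300000000"
"""

def column_max(csv_text, column):
    # Scan every row and keep the largest value seen in the given column.
    reader = csv.DictReader(io.StringIO(csv_text))
    return max(float(row[column]) for row in reader)

peak = column_max(data, "\\\\HOST\\Memory\\Committed Bytes")
print(peak)  # 1450000000.0 -> the high-water mark of the logged counter
```

The same one-liner logic works in PowerShell with Import-Csv and Measure-Object -Maximum.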

Related

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following Perf commands and their outputs are shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter and reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the resulting value is around 300 million (EDITED), which is 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for the READ, WRITE and IO columns are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one-second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB of READ and WRITE requests belong to IO. Based on this explanation, the three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The only processes on the scheduler are swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there exist two memory accesses for each retired micro-operation. This seems impossible, especially considering the fact that there are multiple levels of caches. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
Actually, it was mostly caused by the GPU device. This was the reason for exclusion from performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with resolution 3840x2160 and refresh rate 60 using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).

HDD access + search time calculation algorithm based on read/write speed and HDD buffer size

I want to write an application that will calculate the average (access + search) HDD time by performing a benchmark. I know how to test file read/write speed from HDD to memory, and I can check on the manufacturer's page what the HDD's internal buffer size is. The test would be performed on a defragmented partition, so I think the result isn't a bad approximation of the real values. If read speed were equal to write speed, then I could do
average_value = (copy_time - (file_size / read_speed * 2)) / (file_size / HDD_buffer_size * 2)
but read speed usually differs from write speed, so this formula doesn't apply.
Can someone please tell me what the formula should be?
First, I would use sector access instead of file access to avoid the FAT and fragmentation influencing the benchmark. Sector access is platform dependent.
But any file access routine you have at your disposal should work; just change the filename to "\\\\.\\A:" or "\\\\.\\PhysicalDrive0". To access removable media like floppies and USB keys use "\\\\.\\A:", but for hard drives use "\\\\.\\PhysicalDrive0", as the first method will not work for those. Also change the drive letter A or HDD number 0 to whatever you need...
In Windows VCL C++ I do something like this:
int hnd;
BYTE dat[512];
hnd=FileOpen("\\\\.\\PhysicalDrive0",fmOpenRead);
if (hnd>=0)
    {
    FileSeek(hnd,0*512,0); // position to the sector you want ...
    FileRead(hnd,dat,512); // read it
    FileClose(hnd);
    hnd=FileCreate("bootsector.dat");
    if (hnd>=0)
        {
        FileWrite(hnd,dat,512);
        FileClose(hnd);
        }
    }
It will open the first HDD drive as a device (that is why the file name is so weird). From the programming side, you can assume it is a file containing all the sectors of your HDD.
Beware: you should not overwrite the file system!!! So if you are writing, do not forget to restore the original sectors. It is also safer to avoid write access to the first sectors (where the boot sector and FAT are usually stored), so that in case of a bug or shutdown you do not lose the FS.
The code above reads the boot sector of HDD0 (if accessible) and saves it to a file.
In case you do not have VCL, use a WinAPI C++ sector access example or whatever OS API you have at your disposal.
For the HDD buffers estimation/measurement I would use this:
Cache size estimation on your system?
As it is the same thing; just do disc access instead of a memory transfer.
I would use QueryPerformanceCounter from winapi for the time measurement (should be more than enough).
You are going a bit backwards with the equations, I think (but I might be wrong). I would:
measure read and write speeds (separately)
transfer_read_rate = transfer_read_size / transfer_read_time
transfer_write_rate = transfer_write_size / transfer_write_time
to mix both rates (the average)
transfer_avg_rate = (transfer_read_size + transfer_write_size) / (transfer_read_time + transfer_write_time)
where (transfer_read_time + transfer_write_time) can be directly measured.
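To see why the combined rate is total size over total time, rather than the arithmetic mean of the two rates, here is a quick numeric sketch (all numbers are made up):

```python
# Made-up measurements: separate read and write passes over the same data.
transfer_read_size = 100e6   # bytes
transfer_read_time = 0.50    # seconds -> 200 MB/s read
transfer_write_size = 100e6  # bytes
transfer_write_time = 1.00   # seconds -> 100 MB/s write

transfer_read_rate = transfer_read_size / transfer_read_time
transfer_write_rate = transfer_write_size / transfer_write_time

# Correct combination: total bytes over total time ...
transfer_avg_rate = (transfer_read_size + transfer_write_size) / (
    transfer_read_time + transfer_write_time)

# ... which is NOT the arithmetic mean of the two rates.
naive_mean = (transfer_read_rate + transfer_write_rate) / 2

print(transfer_avg_rate)  # ~133 MB/s (200e6 / 1.5)
print(naive_mean)         # 150 MB/s
```

The slower pass dominates because it occupies more of the total time, which is why the naive mean overestimates.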
When I change my memory benchmark to HDD (just by replacing the STOSD transfer with a continuous sector read), the result looks like this (for my setup):
You can add seek time measurement by simply measuring and averaging the seek time to random positions ... And if you also add HDD geometry (sectors per track), then you can measure the rates more precisely, as you can add seek times between tracks to the equations ...
Try this:
double read_time = file_size / read_speed;
double write_time = file_size / write_speed;
double total_time_if_no_delays = read_time + write_time;
double avg_value = (copy_time - total_time_if_no_delays) / (file_size / HDD_buffer_size * 2);
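Plugging illustrative numbers into that formula shows the kind of value it produces; this sketch assumes a hypothetical 100 MB file, a 2.0 s copy, and an 8 MB drive buffer:

```python
# Illustrative inputs: a 100 MB file copied in 2.0 s,
# 200 MB/s read, 100 MB/s write, 8 MB drive buffer.
file_size = 100e6
read_speed = 200e6
write_speed = 100e6
copy_time = 2.0
HDD_buffer_size = 8e6

read_time = file_size / read_speed    # 0.5 s spent purely reading
write_time = file_size / write_speed  # 1.0 s spent purely writing
total_time_if_no_delays = read_time + write_time

# Remaining time, spread over the number of buffer fills (read + write).
avg_value = (copy_time - total_time_if_no_delays) / (file_size / HDD_buffer_size * 2)
print(avg_value)  # 0.02 -> 20 ms of access+search overhead per buffer fill
```

With these numbers there are 25 buffer-sized transfers in each direction, and the 0.5 s of unexplained copy time divides out to 20 ms apiece.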

TensorFlow object detection limiting memory and cpu usage

I managed to run the TensorFlow pet example from the tutorial. I decided to use the slowest model (because I want to use it for my own data). However, when I start the training it gets killed after running for a bit. It used all my CPUs (4) and all my memory (8GB). Do you know any way I can limit the number of CPUs to 2 and limit the amount of memory used? Should I reduce the batch size? My batch size is already 1.
I managed to run it by reducing the resize dimensions:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 300
    max_dimension: 612
  }
}
Thanks in advance.
Another idea for reducing memory usage is to reduce the queue sizes for input data. Specifically, in the object_detection/protos/train.proto file, you will see entries for batch_queue_capacity and prefetch_queue_capacity; consider setting these fields explicitly in your config file to smaller numbers.
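For reference, those two fields belong under train_config in the pipeline .config file; a sketch with illustrative values (smaller capacities trade input throughput for memory):

```
train_config {
  batch_size: 1
  batch_queue_capacity: 2     # queued batches held in RAM; default is larger
  prefetch_queue_capacity: 2  # likewise for the prefetch stage
}
```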

Collecting Windows CPU Utilization from WMI Raw Counters

I want to send a counter (incrementing number) of CPU utilization to a monitoring system. The monitoring system handles deltas for me, so to avoid gaps between observations I want to preserve the counter and not send the delta value itself. I am currently doing the following, which generally works, but there are occasional random CPU spikes that don't make sense:
In a loop over each core:
used += v.Timestamp_Sys100NS - v.PercentIdleTime
num++ //To count the cores
And then:
cpu := used / 1e5 / num
As I said, the above formula seems to be accurate judging from the monitoring system's derived deltas, except for the crazy spikes:
Derived:
Raw Counter:
Can anyone explain these spikes and/or suggest a way to avoid them?
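For comparison, the textbook two-sample computation for these raw counters (both PercentIdleTime and Timestamp_Sys100NS advance in 100-ns units) can be sketched as follows; the sample values below are made up:

```python
# Two raw samples of the Win32_PerfRawData_PerfOS_Processor counters
# for one core (illustrative values, both in 100-ns units).
sample1 = {"PercentIdleTime": 50_000_000, "Timestamp_Sys100NS": 100_000_000}
sample2 = {"PercentIdleTime": 58_000_000, "Timestamp_Sys100NS": 110_000_000}

def cpu_percent(s1, s2):
    # Idle ticks accumulated between the samples, over elapsed ticks;
    # utilization is whatever fraction of the interval was NOT idle.
    idle_delta = s2["PercentIdleTime"] - s1["PercentIdleTime"]
    time_delta = s2["Timestamp_Sys100NS"] - s1["Timestamp_Sys100NS"]
    return (1 - idle_delta / time_delta) * 100

print(cpu_percent(sample1, sample2))  # ~20 -> the core was ~80% idle
```

Summing per-sample differences of the two counters across cores, as in the question, accumulates the same quantity in counter form.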

Parallel text processing in Julia

I'm trying to write a simple function that reads a series of files, performs some regex search (or just a word count) on them, and then returns the number of matches. I'm trying to make this run in parallel to speed it up, but so far I have been unable to achieve this.
If I do a simple loop with a math operation, I do get significant performance increases. However, a similar idea for the grep function doesn't provide speed increases:
function open_count(file)
    fh = open(file)
    text = readall(fh)
    length(split(text))
end
tic()
total = 0
for name in files
    total += open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.474181026 seconds
tic()
total = 0
total = @parallel (+) for name in files
open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.086511895 seconds
I tried different versions but also got no significant speed increases. Am I doing something wrong?
I've had similar problems with R and Python. As others pointed out in the comments, you should start with the profiler.
If the read is taking up the majority of the time, then there's not much you can do. You can try moving the files to different hard drives and reading them in from there.
You can also try a RAM disk kind of solution, which basically makes your RAM look like permanent storage (reducing available RAM), but then you get very fast reads and writes.
However, if the time is spent doing the regex, then consider the following:
Create a function that reads in one file as a whole and splits out the separate lines. That should be one continuous read and hence as fast as possible. Then create a parallel version of your regex which processes each line in parallel. This way the whole file is in memory and your computing cores can munge the data at a faster rate. That way you might see some increase in performance.
This is a technique I used when trying to process large text files.
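The read-once-then-map structure described above can be sketched as follows (in Python for illustration; the pattern and sample text are made up, and note that Python threads won't actually parallelize CPU-bound regex work because of the GIL, so in Julia the map step would be pmap or a @parallel reduction):

```python
import re
from concurrent.futures import ThreadPoolExecutor

PATTERN = re.compile(r"\bjulia\b")

def count_matches(line):
    # The per-line regex work that gets farmed out to workers.
    return len(PATTERN.findall(line))

def count_in_text(text, workers=4):
    # One sequential read of the whole file, then map over lines.
    lines = text.splitlines()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_matches, lines))

sample = "julia is fast\nnothing here\njulia julia\n" * 3
print(count_in_text(sample))  # 9
```

The key point is that the disk sees one large sequential read, and only the in-memory regex work is divided among workers.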