I am trying to obtain the CPU utilization of each of the (up to 200) threads in my (Delphi XE) application. To prepare for this I pass PdhExpandWildCardPath the string '\Thread(myappname/*)\% Processor Time'. However (on Win7/64), the buffer returned from this function contains a string for every thread running in the system; in other words, it seems to have treated the input as if it were '\Thread(*/*)\% Processor Time'. This was unexpected. The same happens when I subsequently expand a string to get 'ID Thread'.
Obviously I can filter the resulting strings on the application name and only add the counters I need, but this requires many hundreds of substring scans. Have I misinterpreted how the wildcards work?
Late answer, but I've hit the same wall; maybe someone else needs it:
Here it is: '\Thread(myappname*)\% Processor Time'
Especially useful with ProcessNameFormat set to 2 and ThreadNameFormat set to 2 in 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance'
For ProcessNameFormat = 2 see the link; the same applies to ThreadNameFormat, although I couldn't find any documentation for it.
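If you want to sanity-check the expansion outside Delphi, here is a minimal sketch using pywin32 (an assumption; win32pdh wraps the PDH expansion call, and myappname is a placeholder for your process instance name):

import win32pdh

# Wildcard only the thread index, not the process name:
for path in win32pdh.ExpandCounterPath(r'\Thread(myappname*)\% Processor Time'):
    print(path)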
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
  set in.state1_2017
      in.state1_2018
      in.state1_2019
      in.state1_2020
      in.state2_2017
      in.state2_2018
      in.state2_2019
      in.state2_2020;
  array prcode{*} I10_PR1-I10_PR25;
  do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
  end;
  if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) options for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed - and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables, but it's too much to store in memory, so that didn't seem to solve the issue. This also isn't a MERGE problem, which is what hash tables seem to excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing thru the same datasets over and over but not benefitting from that. Thanks for any help!
I happened to have a 1 GB dataset on my computer; I tried several times, and it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to improve its efficiency.
I think the issue may be located in the DO loop. Your program runs the DO loop 25 times for each record and may assign to cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job, one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing; it basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions: add a state parameter to the SAS program, run 18 of them, have each output its results to a known location tagged with the state name or number, and then have your final program collect them.
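If MP Connect isn't an option, the fan-out can also be driven from outside SAS. A rough sketch in Python (the pull.sas program name and the sas executable being on PATH are assumptions, not part of your setup):

import subprocess

# Launch one batch SAS session per state (assumed: `sas` on PATH and a
# pull.sas that reads &sysparm to pick its state), then wait for all.
states = [f"state{i}" for i in range(1, 19)]   # 18 states
jobs = [subprocess.Popen(["sas", "pull.sas",
                          "-sysparm", state,
                          "-log", f"pull_{state}.log"])
        for state in states]
for job in jobs:
    job.wait()
# Afterwards, a single short SAS step can concatenate the per-state outputs.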
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see "From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools" by Paul Dorfman for more details. On page 10 he shows the method I describe here: you use PEEKC to get the concatenated values and then INDEXW to find the thing you want.
data want;
  set have;
  array prcode{*} $8 I10_PR1-I10_PR25;
  found = (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ0ZZ')) or
          (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit it once you run into an empty procedure code. Usually these don't go all 25, at least in my experience: they're left-filled, so I10_PR1 is always filled, then maybe 5 or 10 of them are filled, and I10_PR11 onward are empty. If you hit an empty one, you're all done for that row. So leaving not just when you hit what you're looking for, but also when you hit an empty value, saves a lot of processing time.
You should probably consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to leave the loop as soon as the criterion is met, to avoid wasting resources:
do i=1 to 25;
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
    output;  * cohort criterion met, so output the row;
    leave;   * exit the loop immediately;
  end;
end;
For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively uses only one core, I want to use multiprocessing, which is currently working. However, it does not seem very fast: even with 64 processes started, it uses roughly 10% of the CPU's capacity (according to Task Manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which, every 10 seconds, gets all items from the result queue and writes them to the result file. This function runs in a single multiprocessing.Process, like so:
fileWriteProcess = Process(target=fileWriter, args=(resultQueue, resultFileName))
fileWriteProcess.start()  # start() returns None, so keep the Process object first
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from the returned dictionaries and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
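Condensed, the structure looks like this (the bifacial_radiance calls are collapsed into a comment so only the queue plumbing shows; rayTraceWorker is a stand-in name for the ray trace function above):

import multiprocessing as mp
from queue import Empty

def fileWriter(resultQueue, resultFileName):
    with open(resultFileName, 'a') as f:
        while True:
            line = resultQueue.get()      # block until a result arrives
            if line is None:              # sentinel: all workers are done
                return
            f.write(line + '\n')
            f.flush()

def rayTraceWorker(name, indexQueue, resultQueue):
    while True:
        try:
            idx = indexQueue.get_nowait()
        except Empty:
            return
        # ... gendaylit(idx), makeOct(prefix=name), moduleAnalysis,
        #     analysis(...) go here; then push one csv line:
        resultQueue.put(f"{name},{idx}")

if __name__ == '__main__':
    indexQueue, resultQueue = mp.Queue(), mp.Queue()
    for idx in range(1000):               # really: dates missing from the csv
        indexQueue.put(idx)
    writer = mp.Process(target=fileWriter, args=(resultQueue, 'results.csv'))
    writer.start()
    workers = [mp.Process(target=rayTraceWorker,
                          args=(str(i), indexQueue, resultQueue))
               for i in range(64)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    resultQueue.put(None)                 # stop the writer
    writer.join()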
This speeds up the whole simulation quite a bit (10 days down to 1½ days), but as said earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing, as the results are written slightly out of order while the input EPW file is sorted.
My question is whether there is a better way to do this that might make it run faster. I can see in the source code that a boolean hpc is used when initializing some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.
TL;DR: I need something way faster than FSO.write OR another way to share a variable in memory between different script instances.
Hello, I am running CCPulse (on Windows 7), which is a call center monitoring tool. Agents are represented as "Objects" and can have various statistics (like calls taken, total talk duration, etc.). CCPulse allows applying thresholds and actions to any statistic. These are basically VBScripts and, as far as I can tell, there are no restrictions.
This allows me to take the "Threshold StatValue" and do things with it, i.e. writing it to a file. The issue is that if I apply a threshold to a statistic for all agents, the script executes for each agent object separately (in sequence, not in parallel). However, I want to export all the agent stats to a single csv file.
I already got it working by creating the file if it doesn't exist, then opening it and reading it (ReadAll) into a string. If an agent has not been written to the file yet, his stat values get appended as a new line in the string; if he already exists in the file, I search and replace his line using a regex pattern. I then write the entire multiline string back to the file:
Set objFile = objFSO.OpenTextFile(inFile, 2)  ' 2 = ForWriting
objFile.Write strMemoryBuffer
objFile.Close
Set objFile = Nothing
strMemoryBuffer contains the file's original content, with either a new line added or one line modified. This string (and subsequently the export file) is around 30 KB in size after all agents have been exported. It looks like this (simplified):
LoginID;Calls;TotalTalkTime
2243;08;9403
2132;12;8439
As I said, since the script runs separately for each agent, only one line is ever added/modified per pass (CCPulse will execute the script one object at a time, until all are finished).
The write process is very slow, however: using Timer() it says it needs between 0.10 and 0.15 seconds! That is way too slow, as I need to run the script on almost 500 agents (ideally in no more than 30-second intervals), but all the writing alone would take over a minute (CCPulse would create a backlog of threshold operations which could never be finished; I can decrease the recalculation frequency, but that is detrimental in other ways).
If I comment out only the above block, execution time dramatically decreases to ~0.02 seconds. So reading the file and manipulating the string takes almost no time at all, just the write process is slow.
I am writing the file locally to a hard drive (no SSD though). I cannot use a RAM Disk.
I also already tried writing to the volatile environment, but somehow this is even slower (it does work, but for some reason the explorer process goes crazy with up to 50% CPU usage and CCPulse locks up, although the export file is still being updated).
The ideal solution would be to have the string manipulated only in memory and then written to the file just once every 30 seconds or so, but I don't know how to make the strMemoryBuffer variable available to the "next" agent. Any ideas?
I have searched through all the documentation and still haven't found why there is a prefix and what c000 is in the file naming convention below:
file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
"Talk is cheap, show me the code" applies here: not everything is documented, and one way to find out is to read the source.
Consider part-1-2_3-4.parquet:
1. Split/partition number.
2. Random UUID to prevent collisions between different (appending) write jobs.
3. Unique job/task ID (sometimes it is not included).
4. The "c" stands for count: a file counter recording how many files have been written in the past for this specific partition. It is used to limit the maximum number of records written to a single file. The value starts from 0.
I found it based on this code and this code.
Does anyone know an elegant way to determine "System Load", preferably using Windows performance counters? In this case I mean "System Load" in the classical (UNIX) sense of the term, and not the commonly confused "CPU Utilization" percentage.
Based on my reading, "System Load" is typically represented as a float: the number of processes in a runnable state (i.e. not including processes that are currently blocked for one reason or another) that could be run at a given time. Wikipedia gives a good explanation here - http://en.wikipedia.org/wiki/Load_(computing).
By the way, I'm working in C#, so any examples in that language would be greatly appreciated.
System load, in the UNIX sense (and if I recall correctly), is the number of processes which are able to be run that aren't actually running on a CPU (averaged over a time period). Utilities like top show this load over, for example, the last 1, 5 and 15 minutes.
I don't believe this is possible with the standard Win32 WMI process classes. The process state field (ExecutionState) in the WMI Win32_Process objects is documented as not being used.
However, the thread class does provide that information (and it's probably a better indicator as modern operating systems tend to schedule threads rather than processes). The Win32_Thread class has an ExecutionState field which is set to one of:
0 Unknown
1 Other
2 Ready
3 Running
4 Blocked
5 Suspended Blocked
6 Suspended Ready
If you were to do a query of that class and count up the number of type 2 (and possibly type 6; I think suspended means swapped out in this context), that should give you your load snapshot. You would then have to average them yourself if you wanted averages.
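Not C# as asked, but as a minimal illustration of the counting, a sketch with Python's wmi package (an assumption; any WMI client can run the same query):

import wmi

conn = wmi.WMI()
# Count threads that are Ready (2); add Suspended Ready (6) if you
# treat suspended-but-runnable threads as part of the load.
ready = sum(1 for t in conn.Win32_Thread() if t.ExecutionState == 2)
print("load snapshot (runnable threads):", ready)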
Alternatively, there's a ThreadState in that class as well:
0 Initialized (recognized by the microkernel).
1 Ready (prepared to run on the next available processor).
2 Running (executing).
3 Standby (about to run, only one thread may be in this state at a time).
4 Terminated (finished executing).
5 Waiting (not ready for the processor, when ready, it will be rescheduled).
6 Transition (waiting for resources other than the processor).
7 Unknown (state is unknown).
so you could look into counting those in state 1 or 3.
Don't ask me why there are two fields showing similar information or what the difference is. I've long since stopped second-guessing Microsoft with their WMI info; I just have to convince the powers that be that my choice is a viable one :-)
Having just finished developing a Windows client for our own monitoring application, I'd just suggest going for 1-second snapshots and averaging these over whatever time frame you need to report on. VBScript and WMI seem remarkably resilient even at one query per second - it doesn't seem to suck up too much CPU and, as long as you free everything you use, you can run for extended periods of time.
So, every second, you'd do something like (in VBScript, and from memory since I don't have ready access to the code from here):
Set objWmi = GetObject("winmgmts:\\.\root\cimv2")
Set threadList = objWmi.ExecQuery("select * from Win32_Thread", , 48)
sysLoad = 0
For Each objThread In threadList
    If objThread.ThreadState = 1 Or objThread.ThreadState = 3 Then
        sysLoad = sysLoad + 1
    End If
Next
' sysLoad now holds the number of threads ready to run or running
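To turn those snapshots into averages, the same counting can be wrapped in a rolling window; a hedged sketch in Python with the wmi package (an assumption) rather than VBScript:

import time
from collections import deque

import wmi

conn = wmi.WMI()
window = deque(maxlen=60)                       # the last 60 one-second snapshots
while True:
    snapshot = sum(1 for t in conn.Win32_Thread()
                   if t.ThreadState in (1, 3))  # Ready or Running
    window.append(snapshot)
    print(f"load(1m) ~ {sum(window) / len(window):.2f}")
    time.sleep(1)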