How to measure execution time of Vulkan pipeline - time

Summary
I wish to be able to measure time elapsed in milliseconds, on the GPU, of running the entire graphics pipeline. The goal: To be able to save benchmarks before/after optimizing the code (next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan, and could use some help.
I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.
Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:
1: Creating query pool
void CreateQueryPool()
{
VkQueryPoolCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
createInfo.pNext = nullptr; // Optional
createInfo.flags = 0; // Reserved for future use, must be 0!
createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW
VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
if (result != VK_SUCCESS)
{
throw std::runtime_error("Failed to create time query pool!");
}
}
I had the idea of queryCount = mCommandBuffers.size() * 2 to have space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct or not.
2: Recording command buffers
// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);
vkCmdCopyQueryPoolResults(/* many parameters here */);
I'm looking for a couple of clarifications:
What is the concequence of writing to the same query index? Do I need two separate query pools - one for before render time and one for after render time?
How should I handle synchronization? I assume having a separate query for each command buffer.
For the destination buffer containing the query result, is it good enough to store somewhere with "host visible bit", or do I need staging memory for "device visible only"? I'm a bit lost on this one as well.
I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.

So, thanks to #karlschultz, I managed to get something working. So in case other people will be looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: Please let me know if I make obvious mistakes, and I will correct them here!
Query Pool Creation
I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, to store space for a query before and after rendering.
Important here is to reset all entries in the query pool before using the queries, and to reset a query after writing to it. This necessitates a few changes:
1) Asking graphics queue if timestamps are supported
When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits which must be greater than 0, otherwise the queue family cannot be used for timestamp queries!
2) Determining the timestamp period
The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.
3) Asking for query reset support
During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature:
VkDeviceCreateInfo createInfo{};
VkPhysicalDeviceHostQueryResetFeatures resetFeatures;
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;
createInfo.pNext = &resetFeatures;
4) Recording command buffers
Timestamp queries should be outside the scope of the render pass, as seen below. It is not possible to measure running time of a single shader (e.g. fragment shader), only the entire pipeline or whatever is outside the scope of the render pass, due to (potential) temporal overlap of pipeline stages.
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);
vkCmdBeginRenderPass(/* ... */);
// render here...
vkCmdEndRenderPass(mCommandBuffers[i]);
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);
5) Retrieving query result
We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter since is greatly simplifies the setup and does not require synchronization with GPU buffers.
Given that I have a swapchain index (in my scenario same is command buffer index!), I have a setup like this:
void FetchRenderTimeResults(uint32_t swapchainIndex)
{
uint64_t buffer[2];
VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2, sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
VK_QUERY_RESULT_64_BIT);
if (result == VK_NOT_READY)
{
return;
}
else if (result == VK_SUCCESS)
{
mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
}
else
{
throw std::runtime_error("Failed to receive query results!");
}
// Queries must be reset after each individual use.
vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}
The variable mTimeQueryResults refers to an std::vector<uint64_t> which contains a result for each swapchain. I use it to calculate an average rendering time each second by using the timestamp period determined in step 2).
And one must not forget to cleanup to query pool by calling vkDestroyQueryPool.
There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.
More info in documentation.

Writing to the same query index is bad because you are overwriting your "before" timestamp with the "after" timestamp at the same query index. You might want to change the last parameter in your write timestamp calls to i * 2 for the "before" call and to i * 2 + 1 for the "after". You are already allocating 2 timestamps for each command buffer, but only using half of them. This scheme ends up producing a pair of before/after timestamps for each command buffer i.
I don't have any experience using vkCmdCopyQueryPoolResults(). If you can idle your queue, then after idle, call vkGetQueryPoolResults() which will probably be much easier for what you are doing here. It copies the query results back into host memory and you don't have to mess with synchronizing writes to another buffer and then mapping/reading it back.

Related

How do we improve a MongoDB MapReduce function that takes too long to retrieve data and gives out of memory errors?

Retrieving data from mongo takes too long, even for small datasets. For bigger datasets we get out of memory errors of the javascript engine. We've tried several schema designs and several ways to retrieve data. How do we optimize mongoDB/mapReduce function/MongoWire to retrieve more data quicker?
We're not very experienced with MongoDB yet and are therefore not sure whether we're missing optimization steps or if we're just using the wrong tools.
1. Background
For graphing and playback purposes we want to store changes for several objects over time. Currently we have tens of objects per project, but expectations are we need to store thousands of objects. The objects may change every second or not change for long periods of time. A Delphi backend writes to and reads from MongoDB through MongoWire and SuperObjects, the data is displayed in a web frontend.
2. Schema design
We're storing the object changes in minute-second-millisecond objects in a record per hour. The schema design is like described here. Sample:
o: object1,
dt: $date,
v: {0: {0:{0: {speed: 8, rate: 0.8}}}, 1: {0:{0: {speed: 9}}}, …}
We've put indexes on {dt: -1, o: 1} and {o:1}.
3. Retrieving data
We use a mapReduce to construct a new date based on the minute-second-millisecond objects and to put the object back in v:
o: object1,
dt: $date,
v: {speed: 8, rate:0.8}
An average document is about 525 kB before the mapReduce function and has had ~29000 updates. After mapReduce of such a document, the result is about 746 kB.
3.1 Retrieving data from through mongo shell with mapReduce
We're using the following map function:
function mapF(){
for (var i = 0; i < 3600; i++){
var imin = Math.floor(i / 60);
var isec = (i % 60);
var min = ''+imin;
var sec = ''+isec;
if (this.v.hasOwnProperty(min) && this.v[min].hasOwnProperty(sec)) {
for (var ms in this.v[min][sec]) {
if (imin !== 0 && isec !== 0 && ms !== '0' && this.v[min][sec].hasOwnProperty(ms)) {// is our keyframe
var currentV = this.v[min][sec][ms];
//newT is new date computed by the min, sec, ms above
if (toDate > newT && newT > fromDate) {
if (fields && fields.length > 0) {
for (var p = 0, length = fields.length; p < length; p++){
//check if field is present and put it in newV
}
if (newV) {
emit(this.o, {vs: [{o: this.o, dt: newT, v: newV}]});
}
} else {
emit(this.o, {vs: [{o: this.o, dt: newT, v: currentV}]});
}
}
}
}
}
}
};
The reduce function basically just passes the data on. The call to mapReduce:
db.collection.mapReduce( mapF,reduceF,
{out: {inline: 1},
query: {o: {$in: objectNames]}, dt: {$gte: keyframeFromDate, $lt: keyframeToDate}},
sort: {dt: 1},
scope: {toDate: toDateWithinKeyframe, fromDate: fromDateWithinKeyframe, fields: []},
jsMode: true});
Retrieving 2 objects over 1 hour: 2,4 seconds.
Retrieving 2 objects over 5 hour: 8,3 seconds.
For this method we would have to write js and bat files runtime and read the json data back in. We have not measured times fort his yet, because frankly, we don’t like the idea very much.
Another problem with this method is that we get out of memory errors of the v8 javascript engine when we try to retrieve data for longer periods and/or more objects. Using a pc with more RAM works to some extend in preventing out of memory, but it doesn't make retrieving data faster.
This article mentions splitVector, which we might use to devide the workload. But we're not sure on how to use the keyPattern and maxChunkSizeBytes options. Can we use a keyPattern for both o and dt?
We might use multiple collections, but our dataset isn’t that big to start with at the moment, so we’re worried about how much collections we’d need.
3.2 Retrieving data through mongoWire with mapReduce
For retrieving data through mongoWire with mapReduce, we use the same mapReduce functions as above. We use the following Delphi code to start te query:
FMongoWire.Get('$cmd',BSON([
'mapreduce', ‘collection’,
'map', bsonJavaScriptCodePrefix + FMapVCRFunction.Text,
'reduce', bsonJavaScriptCodePrefix + FReduceVCRFunction.Text,
'out', BSON(['inline', 1]),
'query', mapquery,
'sort', BSON(['dt', -1]),
'scope', scope
]));
Retrieving data with this method is about 3-4 times (!) slower. And then the data has to be translated from BSON (IBSONDocument to JSON (SuperObject), which is a major time consuming part in this method. For retrieving raw data we use TMongoWireQuery which translates the BSONdocument in parts, while this mapReduce function uses TMongoWire directly and tries to translate the complete result. This might explain why this takes so long, while normally it's quite fast. If we can reduce the time it takes for the mapReduce to return results, this might be a next step for us to focus on.
3.3 Retrieving raw data and parsing in Delphi
Retrieving raw data to Delphi takes a bit longer then the previous method, but probably because of the use of TMongoWireQuery, the translation from BSON to JSON is much quicker.
4. Questions
Can we do further optimizations on our schema design?
How can we make the mapReduce function faster?
How can we prevent the out of
memory errors of the v8 engine? Can someone give more information on
the splitVector function?
How can we best use of mapReduce from Delphi? Can we use
MongoWireQuery in stead of MongoWire?
5. Specs
MongoDB 3.0.3
MongoWire from 2015 (recently updated)
Delphi 2010 (got XE5 as well)
4GB RAM (tried on 8GB RAM as well, less out of memory, but reading times are about the same)
Phew what a question! First up: I'm not an expert at MongoDB. I wrote TMongoWire as a way to get to know MongoDB a little. Also I really (really) dislike when wrappers have a plethora of overloads to do the same thing but for all kinds of specific types. A long time ago programmers didn't have generics, but we did have Variant. So I built a MongoDB wrapper (and IBSONDocument) based around variants. That said, I apparently made something people like to use, and by keeping it simple performs quite well. (I haven't been putting much time in it lately, but on the top of the list is catering for the new authentication schemes since version 3.)
Now, about your specific setup. You say you use mapreduce to get from 500KB to 700KB? I think there's a hint there you're using the wrong tool for the job. I'm not sure what the default mongo shell does differently than when you do the same over TMongoWire.Get, but if I assume mapReduce assembles the response first before sending it over the wire, that's where the performance gets lost.
So here's my advice: you're right with thinking about using TMongoWireQuery. It offers a way to process data faster as the server will be streaming it in, but there's more.
I strongly suggest to use an array to store the list of seconds. Even if not all seconds have data, store null on the seconds without data so each minute array has 60 items. This is why:
One nicety that turned up in designing TMongoWireQuery, is the assumption you'll be processing a single (BSON) document at a time, and that the contents of the documents will be roughly similar, at least in the value names. So by using the same IBSONDocument instance when enumerating the response, you actually save a lot of time by not having to de-allocate and re-allocate all those variants.
That goes for simple documents, but would actually be nice to have on arrays as well. That's why I created IBSONDocumentEnumerator. You need to pre-load an IBSONDocument instance with an IBSONDocumentEnumerator in the place where you're expecting the array of documents, and you need to process the array in roughly the same way as with TMongoWireQuery: enumerate it using the same IBSONDocument instance, so when subsequent documents have the same keys, time is saved not having to re-allocate them.
In your case though, you would still need to pull the data of an entire hour through the wire just to select the seconds you need. As I said before, I'm not a MongoDB expert, but I suspect there could be a better way to store data like this. Either with a separate document per second (I guess this would let the indexes do more of the work, and MongoDB can take that insert-rate), or with a specific query construction so that MongoDB knows to shorten the seconds array into just that data you're requesting (is that what $splice does?)
Here's an example of how to use IBSONDocumentEnumerator on documents like {name:"fruit",items:[{name:"apple"},{name:"pear"}]}
q:=TMongoWireQuery.Create(db);
try
q.Query('test',BSON([]));
e:=BSONEnum;
d:=BSON(['items',e]);
d1:=BSON;
while q.Next(d) do
begin
i:=0;
while e.Next(d1) do
begin
Memo1.Lines.Add(d['name']+'#'+IntToStr(i)+d1['name']);
inc(i);
end;
end;
finally
q.Free;
end;

Hard Disk scheduling simulator algorithm (track to track timing) Perl

I am trying to get to grips with perl. I am trying to write a few scripts as a scheduling simulator. FCFS, SSTF and Scan and Look
I have one array with a list of block requests and another to act as the buffer. First I will copy over the first request, then I need to work out the time it takes to get from the first to the second block.
the buffer reads in blocks at 1 per ms, seek, search and access time are all 1ms to make the calculations a bit easier, the simulator always starts on block 1 track 1.
http://postimg.org/image/d9osb8tkj/
so if the first block is 5, the search time will be 3ms to traverse to the start of the 5th block, the seek time will be zero as its on the same track and the access time to read the block will always be 1ms. This means that the time for this request will be 4ms so the simulator will read in the next 4 requests into the buffer. In first come first served this will just be the order that the requests are served.
So if the next request to serve is 12 the arm is on the end of the 5th block so will take 2ms to get to the right track then 1ms to get to the start of the 12th block and another 1ms to access it.
I was just wondering if anyone could give me some idea how I could express this as an algorithm. Just some pointers would be much appreciated.
write a class HardDiskSim::Abstract, 3 subs seek_time(), spin_time(), and read_time()
Write a subclass of AbstractDisk for each different set of values/logic for the three methods.
Fir example:
package HardDiskSim::Simple;
use base qw(HardDiskSim::Abstract);
our $SECTORS_PER_TRACK = 5;
our $SEEK_TTIM_PER_TRACK = 1;
sub read_time { return 1 }
sub seek_time {
my $block = #_;
my $tracks_to_seek = int($block / $SECTORS_PER_TRACK);
return $tracks_to_seek * $SEEK_TTIM_PER_TRACK;
}
sub spin_time {
# compute head position at end of seek using seek time and RPM of disk
# compute number of sectors to spin past using computed head position
# return number_of_sectors_to_spin_past * time_per_sector
}
I had the fun of writing this kind of code in Fortran, for a class, back in 1985.

How to Spam Filter Gmail Messages by Recipient Address?

I use the dot feature (m.yemail#gmail.com instead of myemail#gmail.com) to give emails for questionable sites so that I can easily spot spam from my address being sold.
I made this function and set it to trigger every 30 minutes to automatically filter these.
function moveSpamByAddress(){
var addresses = ["m.yemail#gmail.com"]
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++){
var messages = threads[i].getMessages();
for (var ii = 0; ii<messages.length; ii++){
for (var iii = 0; iii<addresses.length; iii++){
if (messages[ii].getTo().indexOf(addresses[iii]) > -1){
threads[i].moveToSpam()
}
}
}
}
}
This works, but I noticed that this runs slower than I would expect it to (but my expectation may be unreasonable) given that my inbox only contains 50 messages and I am only currently filtering one address. Is there a way to increase execution speed?
Also are there any penalties for running scripts too often? I see that I have the option to trigger a script every minute, and that would increase the likelihood of filtering a message before I see it, but it would also run the scripts uselessly significantly more times.
You can do this using native gmail filters plus apps script.
Script time quotas varies from 1 to 6 hours depending on account type.
To improve performance, first check getInboxUnreadCount and return inmediately if zero.
If you use a 1minute trigger, make sure to use a lock to avoid one timer starting while the other runs. If the lock is in use simply return.
First, make a gmail filter so when "to" matches your special address, apply a special label like "mySpam"
Second, make an apps script with my suggestions above, plus your code no longer needs to search so much, now you just need to find emails with that label (a single api call) and .moveToSpam
There shouldnt be that many at any time in the label if the script runs often.

TimeLimitingCollector Lucene, how to use effectively?

I have a Lucene (4.1) index of about 500M documents. I try to build a search interface on it, but I run into some performance issues.
Initially, I show all the hits (paginated) by using a MatchAllDocumentsQuery. This search takes long (about 10 seconds). I think this is because of the collector I use, it is one that tries to find the total number of hits TotalHitCountCollector.
I would like to be able to time-limit the query, so I found the TimeLimitingCollector. Unfortunatly the API docs are a bit shady. It uses a Counter that is not much documented.
Does anyone have experience using the TimeLimitingCollector in Lucene 4.x? And if so, are there approaches to get a guesstimate on the total number of hits?
I read the: https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/search/TimeLimitingCollector.html and the example, but it is not clear on setting the Counter and how to use that in combination with the numTicks
Counter can either be thread safe or not - just use the static Counter.newCounter(boolean threadSafe) method to instantiate one that fits you.
Then, let's say we allow 10 ticks and we update ticks in a separate thread. Code should look like this:
Counter clock = Counter.newCounter(true);
TimeLimitingCollector collector = new TimeLimitingCollector(c, clock, 10);
collector.setBaseline(0);
new Thread() {
public void run() {
clock.addAndGet(1); // will kill the indexSearcher.search(...) after 10 ticks (10 seconds)
Thread.sleep(1000); // try-catch is necessary here, yes
}
}.start();
indexSearcher.search(query, collector);
I, however, find the above a bit cumbersome. Guava's TimeLimiter.callWithTimeout(...) looks much cleaner even though not native to Lucene.

Slow Update/insert into SQL Server CE using LinqToDatasets

I have a mobile app that is using LinqToDatasets to update/insert into a SQL Server CE 3.5 File.
My Code looks like this:
// All the MyClass Updates
MyTableAdapter myTableAdapter = new MyTableAdapter();
foreach (MyClassToInsert myClass in updates.MyClassChanges)
{
// Update the row if it is already there
int result = myTableAdapter.Update(myClass.FirstColumn,
myClass.SecondColumn,
myClass.FirstColumn);
// If the row was not there then insert it.
if (result == 0)
{
myTableAdapter.Insert(myClass.FirstColumn, myClass.SecondColumn);
}
}
This code is used to keep the hand held database in sync with the server database. Problem is if it is a full update (first time for example) there are a lot of updates (about 125). That makes this code (and more loops like it take a very long time (I have three such loops that take over 30 seconds each).
Is there a faster or better way to do updates/inserts like this?
(I did see this Codeplex Project, but I could not see how to make it work with both updates and inserts.)
You should always use SqlCeResultSet for data access on mobile devices for maximum performance and memory usage. You must identify the data to be inserted and then use code like the SqlCeBulkCopy sample, and use similar code by using the Seek and Update methods of the SqlCeResultSet.

Resources