I am building an application where I want to replay the movement of several users (up to 20).
Each user has a list of X,Y positions (ranging from 20 to 400 positions). A replay lasts from 1 to 10 minutes.
The replay is drawn at 8 FPS, which is all I require. At each frame I delete the layer showing the users' movement and redraw everything up to the next point in time.
This application uses a lot of memory, and if I re-run a replay, the memory consumption keeps increasing (up to 8 GB). I have tried using the profiler in Google Chrome (version 27), and it seems there is a build-up of layers in memory, even though I constantly remove() the old layers.
The following code shows a quick mockup of what the application does.
function draw()
{
    stage.removeChildren();
    var userLayer = new Kinetic.Layer();
    /*
        iterate all data and create lines to signify the movement of a user,
        and add it to userLayer
    */
    stage.add(userLayer);
}
setInterval(draw, 125); // 8 FPS
My question is: do stage.removeChildren() and Kinetic.Layer().remove() not remove the layers from memory? Or do I need to handle this in an entirely different manner?
Yes. "Remove" is removing from parent container. But object still exists. You have to use "destroy" instead.
I am working with an Ettus N310 that is being controlled by some 3rd-party software. I don't have much insight into how they set up and control the device; I just tell it what center frequency to tune to and when to grab IQ. If I receive a signal, let's say a tone, at or very near the center frequency, I end up with a large DC offset that jumps around every few hundred microseconds. If I offset the signal well away from the center frequency, the DC offset is negligible. From what I see in Ettus' documentation, DC offset compensation is something that's set once when the device starts receiving, but it looks to me like here it is being done periodically while the USRP is acquiring data. If I receive a signal near the center frequency, the DC offset compensator gets messed up and creates a worse bias. Is this a feature of the N310 that I am not aware of, or is this probably something that the 3rd-party controller is doing?
Yes, there's a DC offset compensation in the N310. The N310 uses an Analog Devices RFIC (the AD9371), which has these calibrations built-in. Both the AD9371 and the AD9361 (used in the USRP E3xx and B2xx series) don't like narrow-band signals close to DC due to their calibration algorithms (those chips are optimized for telecoms signals).
Like you said, the RX DC offset compensation happens at initialization. At runtime, the quadrature error correction (QEC) kicks in. The manual has a table of those calibrations: https://uhd.readthedocs.io/en/latest/page_usrp_n3xx.html#n3xx_mg_calibrations. You can try turning off the QEC tracking and see if it improves your system's performance.
I'm experimenting with cache blocking. To do that, I implemented two convolution-based smoothing algorithms using a 5x5 Gaussian kernel.
The first algorithm is just a simple double for loop, sweeping the image from left to right, top to bottom, as shown below.
[Image illustrating the simple row-by-row loop; source: https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html]
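Since the code only appears as an image, here is a minimal sketch of what such a simple row-by-row 5x5 smoothing loop might look like (not the poster's actual code; the names, the flat 8-bit grayscale layout, and the skipped borders are assumptions for illustration):

void smooth_simple(const unsigned char *src, unsigned char *dst,
                   int width, int height, const float kernel[5][5])
{
    // Sweep the whole image left to right, top to bottom (borders skipped for brevity).
    for (int y = 2; y < height - 2; ++y) {
        for (int x = 2; x < width - 2; ++x) {
            float acc = 0.0f;
            for (int ky = -2; ky <= 2; ++ky)
                for (int kx = -2; kx <= 2; ++kx)
                    acc += kernel[ky + 2][kx + 2] * src[(y + ky) * width + (x + kx)];
            dst[y * width + x] = (unsigned char)(acc + 0.5f);
        }
    }
}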
In the second algorithm I tried to play with cache blocking by splitting the loops into chunks, which became something like the following. I used a BLOCK size of 512x512.
[Image illustrating the blocked loops; source: https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html]
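Again as a sketch under the same assumptions as above (not the actual code from the image), the blocked variant tiles the two outer loops into BLOCK x BLOCK chunks so that each tile, plus its 2-pixel halo, can stay resident in cache while it is processed:

#define BLOCK 512

void smooth_blocked(const unsigned char *src, unsigned char *dst,
                    int width, int height, const float kernel[5][5])
{
    // Outer loops walk over BLOCK x BLOCK tiles; inner loops stay inside one tile.
    for (int by = 2; by < height - 2; by += BLOCK) {
        for (int bx = 2; bx < width - 2; bx += BLOCK) {
            const int y_end = (by + BLOCK < height - 2) ? by + BLOCK : height - 2;
            const int x_end = (bx + BLOCK < width - 2) ? bx + BLOCK : width - 2;
            for (int y = by; y < y_end; ++y) {
                for (int x = bx; x < x_end; ++x) {
                    float acc = 0.0f;
                    for (int ky = -2; ky <= 2; ++ky)
                        for (int kx = -2; kx <= 2; ++kx)
                            acc += kernel[ky + 2][kx + 2] * src[(y + ky) * width + (x + kx)];
                    dst[y * width + x] = (unsigned char)(acc + 0.5f);
                }
            }
        }
    }
}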
I'm running the code on a Raspberry Pi 3B+, which has a Cortex-A53 with 32 KB of L1 and 256 KB of L2 cache, I believe. I ran the two algorithms with different image sizes (2048x1536, 6000x4000, 12000x8000, and 16000x12000; 8-bit grayscale images). But for every image size, the run times of the two algorithms were very similar.
The question is: shouldn't the first algorithm suffer memory access latency that the second does not, especially when using a large image (like 12000x8000)? Based on the description of cache blocking in this link, when processing data at the end of the image rows using the first algorithm, the data at the beginning of the rows should have been evicted from the L1 cache. Using the 12000x8000 image as an example: since we are using a 5x5 kernel, 5 rows of data are needed, which is 12000x5 = 60 KB, already larger than the 32 KB L1 size. When we start processing a new row, 4 rows of the previous data are still needed, but they are likely gone from L1 and need to be re-fetched. The second algorithm shouldn't have this problem, because the block size is small. Can anyone please tell me what I am missing?
I also profiled both algorithms using oprofile and got the following counts:

Algorithm 1:
  L1D_CACHE_REFILL     13,933,254
  PREFETCH_LINEFILL    13,281,559

Algorithm 2:
  L1D_CACHE_REFILL      9,456,369
  PREFETCH_LINEFILL     8,725,250
So it looks like the first algorithm does have more cache misses than the second, as reflected by the L1D_CACHE_REFILL counts. But it also has a higher data prefetch rate, which may be due to the simple access pattern of the loop. So does the usual story about cache blocking simply not take data prefetching into account?
Conceptually, you're right: blocking will reduce cache misses by keeping the input window in cache.
I suspect the main reason you're not seeing a speedup is that the hardware prefetcher is streaming in all 5 input rows. Your performance counters show more prefetch loads in the unblocked implementation. I suspect many textbook examples are out of date, since cache prefetching has kept getting better; Intel's L2 cache could already detect and prefetch up to 16 linear streams about 10 years ago, I think.
Assume the filter takes 5 * 5 = 25 cycles per output pixel, which is 25 / 1.2 GHz ≈ 20.8 ns on the RPi 3. The I/O cost per output pixel is reading a 5-pixel-high column of new input, so the amortized I/O rate is 5 bytes / 20.8 ns ≈ 229 MiB/s, which is much less than the ~2 GiB/s of DRAM bandwidth. So in theory, the relatively slow computation combined with prefetching (I'm not certain how effective it is) means that memory access isn't a bottleneck.
Try increasing the filter height; the cache can only detect and prefetch from a limited number of streams. Or try vectorizing the computation so that memory access becomes the bottleneck.
I’m seeing some strange results when trying to time some WebGL using the disjoint timer query extension.
I have written some simple WebGL code to reproduce the issue we are seeing here: https://jsfiddle.net/d79q3mag/2/.
const texImage2DQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, texImage2DQuery);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, 512, 512, 0, gl.RGBA, gl.FLOAT, buffer);
gl.endQuery(ext.TIME_ELAPSED_EXT);
tex2dQuerys.push(texImage2DQuery);
const drawQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, drawQuery);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
gl.endQuery(ext.TIME_ELAPSED_EXT);
drawQuerys.push(drawQuery);
tryGetNextQueryResult(tex2dQuerys, tex2dQueryResults);
tryGetNextQueryResult(drawQuerys, drawQueryResults);
The render function uses the timer extension to individually time the texImage2D call and the drawArrays call. If I graph the results of the draw call, I see some pretty large spikes (some as high as 12 ms, but the majority of spikes in the 2 ms to 4 ms range).
However, if I increase the frame rate from 30 FPS to 60 FPS, the results improve (largest spike 1.8 ms, most spikes between 0.4 ms and 1 ms).
I have also noticed that if I don't time the texImage2D call (https://jsfiddle.net/q01aejnv/2/), then the spikes in the times for the drawArrays call also disappear at 30 FPS (spikes between 0.2 ms and 1.6 ms).
I'm using a Quadro P4000 with Chrome 81.
In the Nvidia control panel, Low latency mode is set to Ultra and the power management mode is set to Prefer maximum performance.
These results were gathered using D3D11 as the ANGLE graphics backend in Chrome.
There seem to be two confusing things here. First, a higher frame rate seems to improve the draw times. Second, timing the texImage2D call seems to affect the drawArrays times.
When timing things you probably need to call gl.flush after gl.endQuery.
Explanation: WebGL functions add commands to a command buffer. Some other process reads that buffer, but there is overhead in telling the other process "hey! I added some commands for you!". So, in general, WebGL commands don't always do the "Hey!" part; they just insert the commands. WebGL will do an auto-flush (a "Hey! Execute this!") at various points automatically, but when doing timing you may need to add a gl.flush yourself.
Note: as I said above, there is overhead to flushing. In a normal program it's rarely important to flush manually.
We have been porting some of our CPU pipeline to Metal to speed up some of the slowest parts, with success. However, since it is only parts of it, we are transferring data back and forth to the GPU, and I want to know how much time this actually takes.
Using frame capture in Xcode, I'm told that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same command buffer).
Using Instruments I see some quite different numbers:
The entire operation takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area), and then the last 800 ms is used up in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I'm getting a few more measurements: 46 ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I'm not seeing how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away, since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: XCode 6 - OpenGL Driver Monitor Parameters
This makes me think that "Page Off" could be the part that copies the memory off the GPU, but then there wouldn't be a reason for getBytes to take that long. And I can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those measurements?
Does the GPU code actually take only 149.5 ms to execute, or is Xcode lying to me? If it does, then where is the remaining 660 - 149.5 ms being spent?
I know the concepts of block and grid in CUDA, and I'm wondering whether there is any well-written helper function that can help me determine the best block and grid size for any given 2D image.
For example, for the 512x512 image mentioned in this thread, the grid is 64x64 and the block is 8x8.
However, sometimes my input image may not be a power of 2; it may be 317x217 or something like that. In this case, maybe the grid should be 317x1 and the block should be 1x217.
So if I have an application that accepts an image from the user and uses CUDA to process it, how can it automatically determine the size and dimensions of the block and grid, when the user can input an image of any size?
Is there any existing helper function or class that handles this problem?
Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the Streaming Multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM, and up to 8 blocks per SM, but each block can only have up to 1024 threads in each dimension. So if I were to launch a 1d kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would be on the SM (66% occupancy). If I instead chose a smaller number, like 192 threads or 256 threads per block, then I could have 100% occupancy with 6 and 8 blocks respectively on the SM.
Another thing to consider is the amount of memory that must be accessed versus the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel; you need the surrounding pixels as well. CUDA groups its threads into warps, which step through every instruction simultaneously (currently there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded relative to the amount of computation that can be done, making the GPU more efficient. Likewise, blocks whose dimensions are powers of 2 load memory more efficiently (if properly aligned with memory addresses), since CUDA loads whole memory lines at a time instead of single values.
So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14 as it will lead to better computation/memory ratio and SM occupancy. This does mean, though, that you will have to check within the kernel to make sure the thread is not out of the picture before trying to access memory, something like
const int thread_id_x = blockIdx.x*blockDim.x+threadIdx.x;
const int thread_id_y = blockIdx.y*blockDim.y+threadIdx.y;
if(thread_id_x < pic_width && thread_id_y < pic_height)
{
//Do stuff
}
Lastly, you can determine the lowest number of blocks you need in each grid dimension so that the grid completely covers your image as (N+M-1)/M, where N is the total number of threads needed in that dimension and M is the number of threads per block in that dimension.
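As a concrete sketch of that calculation (the kernel name process_image and the launch wrapper are made up for illustration; the bounds check and the pic_width/pic_height names come from the snippet above):

// Hypothetical kernel containing the bounds check from above.
__global__ void process_image(unsigned char *pic, int pic_width, int pic_height)
{
    const int thread_id_x = blockIdx.x*blockDim.x+threadIdx.x;
    const int thread_id_y = blockIdx.y*blockDim.y+threadIdx.y;
    if(thread_id_x < pic_width && thread_id_y < pic_height)
    {
        //Do stuff
    }
}

void launch(unsigned char *d_pic, int pic_width, int pic_height)
{
    // 16x16 = 256 threads per block: square and a power of 2, as discussed above.
    const dim3 block(16, 16);
    // Round up so the grid covers the whole image, e.g. 317x217 -> a 20x14 grid.
    const dim3 grid((pic_width + block.x - 1) / block.x,
                    (pic_height + block.y - 1) / block.y);
    process_image<<<grid, block>>>(d_pic, pic_width, pic_height);
}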
It depends on how you process the image. If each thread only processes its own pixel separately, for example adding 3 to each pixel value, you can just assign one image dimension to your block size and the other to your grid size (just don't go out of range). But if you want to do something like a filter or an erode, this kind of operation often needs to access the pixels around a center pixel, e.g. a 3x3 or 9x9 neighborhood. Then the block should be 8x8 as you mentioned, or some other value, and you'd better use texture memory, because when threads access global memory, the 32 threads of a warp access it together at one time.
So there isn't a function like the one you describe. The number of threads and blocks depends on how you process the data; it is not universal.
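To illustrate the texture-memory suggestion above, here is a minimal sketch (not anyone's actual code; the helper name, the 8-bit grayscale layout, and the 3x3 erode are assumptions) that wraps an image in a CUDA texture object and reads a neighborhood through the texture cache:

#include <cuda_runtime.h>

// Hypothetical helper: copy an 8-bit grayscale image into a cudaArray and
// wrap it in a texture object so neighborhood reads go through the texture cache.
cudaTextureObject_t make_gray_texture(const unsigned char *host_pixels,
                                      int width, int height, cudaArray_t *out_array)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaMallocArray(out_array, &desc, width, height);
    cudaMemcpy2DToArray(*out_array, 0, 0, host_pixels, width,
                        width, height, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *out_array;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;  // clamp reads at the image border
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;       // no interpolation, raw pixel values
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex_obj = 0;
    cudaCreateTextureObject(&tex_obj, &res, &tex, nullptr);
    return tex_obj;
}

// Each thread reads a 3x3 neighborhood through the texture cache.
__global__ void erode3x3(cudaTextureObject_t src, unsigned char *dst,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char m = 255;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            m = min(m, tex2D<unsigned char>(src, x + dx, y + dy));
    dst[y * width + x] = m;
}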