Performance: Offline surface for hit testing vs Triangle Intersection

Performance: Offline surface for hit testing vs Triangle Intersection - performance

First, a disclaimer. I'm well aware of the std answer for X vs Y - "it depends". However, I'm working on a very general purpose product, and I'm trying to figure out "it depends on what". I'm also not really able to test the wide variety of hardware, so try-and-see is an imperfect measure at best.
I've been doing some googling, and I've found very little reference to using an offline render target/surface for hittesting. I'm not sure of the nomenclature, what I'm talking about though is using very simple shaders to render a geometry ID (for example) to a buffer, then reading the pixel value under the mouse to see what geometry is directly under the mouse pointer.
I have however, found 101 different tutorials on doing triangle intersection, a la D3DXIntersect & DirectX sample "Pick".
I'm a little curious on this - I would have thought using HW was the standard method. By all rights, it should be many orders of magnitude faster, and should scale far better.
I'm relatively new to graphics programming, so here are my assumptions, for you to disabuse.
1) A simple shader that does geometry transform & writes a Node + UV value should be nearly free.
2) The main cost in the HW pick method would be the buffer fetch, when getting the rendered surface back off the GPU for the CPU to read over. I have no idea how costly this is. us? ms? seconds? minutes?
3) This may be obvious, but I am assuming that Triangle Intersection (D3DXIntersect) is only possible on the CPU.
4) A possible cost people want to avoid is the cost of the extra render target(s) (zbuffer+surface). I'm a'guessing about 10 megs for 1024x1280 (std screen size?). This is acceptable to me, although if I could render a smaller surface (trade accuracy for memory) I would do so (is that possible?).
This all leads to a few thoughts.
1) For very simple scenes, triangle intersection may be faster. Quite what is simple/complex is hard to guess at this point. I'm looking at possible 100s of tris to 10000s. Probably not much more than that.
2) The HW buffer needs to be rendered regardless of whether or not its used (in my case). However, it can be reused without cost (ie, click-drag, where mouse tracks across a static scene)
2a) Possibly, triangle intersection may be preferable if my scene updates every frame, or if I have limited mouse interaction.
Now I've finished writing, I see a similar question has been asked: (3D Graphics Picking - What is the best approach for this scenario). My problem with this is (a) why would you need to re-render your picking surface for click-drag as your scene hasn't actually changed, and (b) wouldn't it still be faster than triangle intersection?
I welcome thoughts, criticism, and any manner of side-tracking :-)

Related

Three.js: What's the upper limit for holding 60 FPS on an average desktop?

I'm currently working on a game using Three.js. I've been studying software engineering for four years and have been working professionally on backends for two, but I've barely touched on graphics aside from some simple Unity experimenting.
I currently have ~22,000 vertices and ~8,000 faces according to renderstats.js, and my desktop (above average) can't run it above 20 FPS. I'm using Lambert material as well as a single ambient light, so I feel like this isn't too much to ask.
With these figures in mind, is this the expected behavior for three.js rendering?

I would be pretty sure that is not end of the line and you are probably missing some possibilities for massive performance-improvements.
But just to give you some numbers first,
if you leave everything fancy away (including three.js) and just render an ultra-simple point-cloud with one fragment rendered per point, you can easily get to rendering 10-20 million (yes, million) points/vertices on an average GPU.
just with simple shapes and material, I already got three.js to render something in the range of 500k triangles (at 1080p-resolution) at 60FPS without problem. You can probably take those numbers times 10 for latest high-end GPUs.
However, these kinds of numbers are not really helpful.
Some hints:
if you want to debug your rendering-performance, you should first add some metrics. Renderstats is good, but I'd recommend integrating http://spite.github.io/rstats/ for this (see the example).
generally the choice of material shouldn't matter too much, the GPU is way more capable than most people think. It's more likely a problem somewhere else in the pipeline. EDIT from comment: In some cases, like hi-resolution displays with slow GPUs (think mobile-devices) this might be less true and complicated shader-code can slow down your site, but it might worth be looking at the other points first. As the rendering itself happens off-thread (so you can't measure it's duration using regular tools like the devtools-profiler), you can use the EXT_disjoint_timer_query-extension to get some information about what is going on on the GPU.
the number of drawcalls shouldn't be too high: three.js needs to do a single drawcall for every Mesh and Points-object rendered in the scene and too many objects are generally a far bigger problem than objects with lots of vertices. Reducing the number of drawcalls can be done by merging multiple geometries into one and making use of multi-materials, vertex-colors and things like that.
if you are doing postprocessing, the GPU needs to render every pixel on screen several times. This might as well massively limit your performance. This can be optimized by merging multiple postprocessing-passes into one (I admit, that'd be a lot of hard work..)
another problem could be on the JS side: you should use the profiler or timeline-view from the chrome devtools to see if maybe it's the javascript that is taking too much time per frame (shouldn't be more than 8-12ms per frame). I've been told there are ways to optimize the javascript-performance as well :)

OpenGL ES, Z-Buffer, 2D sprites, discard, performance

I have a retro-looking 2D game with a lot of sprites (reminiscent of Sega's Super Scaler arcades) which do not use semi-transparency. I have thought about using the Z-Buffer over sorting to simplify things. Ok, but by default writes are done to the Z-buffer even though alpha is zero, giving the effect illustrated here:
http://i.stack.imgur.com/ubLlp.png
Now, since I'm in OpenGL ES 2, I don't have alpha testing, so from what I understand my only possibility is to discard the pixel from the fragment shader if alpha is 0 so that it doesn't get written to the Z-Buffer. But in terms of performance this is SO wrong: not only the if is slow, but the discard basically kills the purpose since it disables early depth testing and the result is way worse than doing it in software.
if (val.a < 0.5) {
discard;
}
Is there any other solution I could use which would not kill the performance? Do all 2D games sort sprites themselves and not use depth buffer?

It's a tradeoff really. If you let the z-buffer do the sorting and use discard in your shaders then it's more expensive on the GPU because of branching and late depth testing as you say.
If you do the depth sorting yourself, then you'll find it's harder to issue your draw calls in an optimal order (e.g. you'll keep having to change texture). Draw calls on GLES2 have a very significant CPU hit on lower end devices and the count will probably go up.
If performance is a big concern, then probably the second option is better if you do it in conjunction with a big effort on the texture atlasing front to minimize your draw call count, this might be particularly effective if your sprites are low resolution retro sprites because you'll be able to get a lot of sprites per texture atlas. It isn't a clear winner by any stretch and I can imagine that different games take different approaches.
Also, you should take into account that the vast majority of target hardware is going to perform just fine whichever path you choose, and maybe you should just choose the one that is faster to implement and makes your code simpler (which is probably letting the z-buffer do the sorting).
If you fancy a technical challenge, I've often thought the best approach might be divide up your sprites into fully opaque sections and sections with transparency and render the two parts as separate meshes (they won't be quads any more). You'd have to do a lot of preprocessing and draw a lot more triangles, but by being able to do some rendering with fully-opaque parts then you can take advantage of the hidden-surface-removal tech in all iOS devices and lots of Android devices. Certainly by doing this you should be able to reduce your fill rate burden, but at a cost of increased draw calls, and there might be an unnecessarily high amount of added complexity to your code and your tools.

j2me: what is effect of 'g.clipRect()' on speed?

I'm developing in j2me and using canvas to drawing some images.
Now, my question is : what is difference of below sample codes in speed of drawing?
drawing after clipping area rectangle:
g.clipRect(x, y, myImage.getWidth(), myImage.getHeight());
g.drawImage(myImage, x , y, Graphics.TOP | Graphics.LEFT);
g.setClip(0, 0, screenWidth, screenHeight);
drawing without clip:
g.drawImage(myImage, x, y, Graphics.TOP | Graphics.LEFT);
is the first one is faster? I'm drawing on screen a lot.

Well the direct answer to your question would be Mu I'm afraid - because you appear to approach the issue from the wrong direction.
Thing is, clipping API is not intended for performance considerations / optimizations. You can find full coverage of its purpose in API documentation (available online), it does not state anything related to performance impact:
Clipping
The clip is the set of pixels in the destination of the Graphics object that may be modified by graphics rendering operations.
There is a single clip per Graphics object. The only pixels modified by graphics operations are those that lie within the clip. Pixels outside the clip are not modified by any graphics operations.
Operations are provided for intersecting the current clip with a given rectangle and for setting the current clip outright...
Attempting to use clipping API for imaginary performance considerations will make your code a nightmare to understand for future maintainers. Note this future maintainer may be you yourself, just few weeks / months / years later - I for one had my nose broken on my own code written some time ago without clearly understandable intent - trust me, it hurts the same as messing with poor code written by anyone else.
Don't get me wrong - there is a chance that clipping may have substantial performance impact in some particular case on specific device - why not, everything is possible given the variety of MIDP implementations. Know what? there is even a chance of it having an opposite impact on some other device, why not.
If (if) that happens, if (if) you'll somehow get a clear, solid, tested and proven justification of specific performance impact - then (then), go ahead, implement whatever tricks necessary to reach required performance, no matter how perverse they may be (BTDTGTTS). Until then, though, drop any baseless assumptions that just may come to your mind.
Until then... Just. Drop. It.
Developers love to optimize code and with good reason. It is so satisfying and fun. But knowing when to optimize is far more important. Unfortunately, developers generally have horrible intuition about where the performance problems in an application will actually be... Most performance tuning reminds me of the old joke about the guy who's looking for his keys in the kitchen even though he lost them in the street, because the light's better in the kitchen... (Brian Goetz)

This will almost certainly vary between platforms, and will depend on how much you're actually drawing.
I suggest you measure performance yourself by logging the number of paints per second, or the average duration of a paint method, and painting this on screen.

Drawing without clip should be faster on any platform for the simple reason that you are not calling two clip methods. But I might ask, why are you using clip to begin with?
You usually use clipping when you have an animation sprite or an icon variation in the same file. In this case you can create a file for each frame/icon. It will increase your jar file size and will use more heap space to hold these images on memory, but will be drawn faster.

OpenGL - Will using multiple VBO's slow down rendering?

I am rendering some meshes (sometimes upwards of 500) and I wanted to know the best way to approach this. Would it be pointless to create 500 VBOs and then if they pass the frustum and visibility tests, render them. Is there a more efficient way to do this? I am looking to maximize performance.

To answer your question, yes, many VBOs will slow things down. More polys will usually slow down the render, but more draw calls has a much greater hit. You want to minimize state changes and draws, as well as the number of buffers you have (and memory use).
I would suggest first looking at the buffers and figuring out how many you need. If you can batch/instance geometry, merge static geometry into a single buffer, reuse buffer more efficiently, etc.
Once you've cut the buffers down to the minimum possible, you'll want to use culling of multiple sorts. Visibility, both by frustrum (perhaps in an octree) and occlusion, can provide a significant performance boost. The main idea is to disqualify the geometry as fast and simply as possible, so you start with rough tests (octree), then somewhat more detailed (perhaps an AABB and/or simplified hull), then occlusion, then actually draw.
Here's a good article on frustrum culling, which touches a bit on quadtrees (and by extension, octrees). Diagrams, explanations and some sample code.
OpenGL occlusion culling articles seem a bit less common, although this one from GPU Gems might be a good starting place.

What is the fastest way of edge detection?

I am thinking of implement a image processing based solution for industrial problem.
The image is consists of a Red rectangle. Inside that I will see a matrix of circles. The requirement is to count the number of circles under following constraints. (Real application : Count the number of bottles in a bottle casing. Any missing bottles???)
The time taken for the operation should be very low.
I need to detect the red rectangle as well. My objective is to count the
items in package and there are no
mechanism (sensors) to trigger the
camera. So camera will need to capture
the photos continuously but the
program should have a way to discard
the unnecessary images.
Processing should be realtime.
There may be a "noise" in image capturing. You may see ovals instead of circles.
My questions are as follows,
What is the best edge detection algorithm that matches with the given
scenario?
Are there any other mechanisms that I can use other than the edge
detection?
Is there a big impact between the language I use and the performance of
the system?

AHH - YOU HAVE NOW TOLD US THE BOTTLES ARE IN FIXED LOCATIONS!
IT IS AN INCREDIBLY EASIER PROBLEM.
All you have to do is look at each of the 12 spots and see if there is a black area there or not. Nothing could be easier.
You do not have to do any edge or shape detection AT ALL.
It's that easy.
You then pointed out that the box might be rotatated, things could be jiggled. That the box might be rotated a little (or even a lot, 0 to 360 each time) is very easily dealt with. The fact that the bottles are in "slots" (even if jiggled) massively changes the nature of the problem. You're main problem (which is easy) is waiting until each new red square (crate) is centered under the camera. I just realised you meant "matrix" literally and specifically in the sentence in your original questions. That changes everything totally, compared to finding a disordered jumble of circles. Finding whether or not a blob is "on" at one of 12 points, is a wildly different problem to "identifying circles in an image". Perhaps you could post an image to wrap up the question.
Finally I believe Kenny below has identified the best solution: blob analysis.
"Count the number of bottles in a bottle casing"...
Do the individual bottles sit in "slots"? ie, there are 4x3 = 12 holes, one for each bottle.
In other words, you "only" have to determine if there is, or is not, a bottle in each of the 12 holes.
Is that correct?
If so, your problem is incredibly easier than the more general problem of a pile of bottles "anywhere".
Quite simply, where do we see the bottles from? The top, sides, bottom, or? Do we always see the tops/bottoms, or are they mixed (ie, packed top-to-tail). These issues make huge, huge differences.

Surf/Sift = overkill in this case you certainly don't need it.
If you want real time speed (about 20fps+ on a 800x600 image) I recommend using Cuda to implement edge detection using a standard filter scheme like sobel, then implement binarization + image closure to make sure the edges of circles are not segmented apart.
The hardest part will be fitting circles. This is assuming you already got to the step where you have taken edges and made sure they are connected using image closure (morphology.) At this point I would proceed as follows:
run blob analysis/connected components to segment out circles that do not touch. If circles can touch the next step will be trickier
for each connected componet/blob fit a circle or rectangle using RANSAC which can run in realtime (as opposed to Hough Transform which I believe is very hard to run in real time.)
Step 2 will be much harder if you can not segment the connected components that form circles seperately, so some additional thought should be invested on how to guarantee that condition.
Good luck.
Edit
Having thought about it some more, I feel like RANSAC is ideal for the case where the circle connected components do touch. RANSAC should hypothetically fit the circle to only a part of the connected component (due to its ability to perform well in the case of mostly outlier points.) This means that you could add an extra check to see if the fitted circle encompasses the entire connected component and if it does not then rerun RANSAC on the portion of the connected component that was left out. Rinse and repeat as many times as necessary.
Also I realize that I say circle but you could just as easily fit an ellipse instead of circles using RANSAC.
Also, I'd like to comment that when I say CUDA is a good choice I mean CUDA is a good choice to implement the sobel filter + binirization + image closing on. Connected components and RANSAC are probably best left to the CPU, but you can try pushing them onto CUDA though I don't know how much of an advantage a GPU will give you for those 2 over a CPU.

For the circles, try the Hough transform.
other mechanisms: dunno
Compiled languages will possibly be faster.

SIFT should have a very good response to circular objects - it is patented, though. GLOHis a similar algorithm, but I do not know if there are any implementations readily available.
Actually, doing some more research, SURF is an improved version of SIFT with quite a few implementations available, check out the links on the wikipedia page.

Sum of colors + convex hull to detect boundary. You need, mostly, 4 corners of a rectangle, and not it's sides?
No motion, no second camera, a little choice - lot of math methods against a little input (color histograms, color distribution matrix). Dunno.
Java == high memory consumption, Lisp == high brain consumption, C++ == memory/cpu/speed/brain use optimum.

If the contrast is good, blob analysis is the algorithm for the job.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio