How can I debug a GLSL shader? - debugging

How can I debug an OpenGL shader? For example:
void main(void)
{
    vec2 uv = gl_FragCoord.xy;
    gl_FragColor = vec4(uv, 222, 1);
}
Is there a way that I can find out what the uv value is?

Debugging a GLSL Shader
There are many ways to debug a shader; most of them are visual rather than dumping the full pixel data of the whole image.
Since shaders run in a highly parallel way, you can output a lot of visual data at once.
There are also external applications that can help you a lot in debugging GLSL shaders, and the whole rendering pipeline along with them.
Visual debugging
This is the simplest form of debugging, although the results can be the hardest to interpret. You just output the data you want to inspect to the screen; for example, to see the value of uv as you asked, you could do it like this:
void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    // Normalized pixel coordinates (from 0 to 1)
    vec2 uv = fragCoord/iResolution.xy;

    // Output to screen
    fragColor = vec4(uv, 0.0, 1.0);
}
In this picture you can see the normalized uv range from 0 to 1: yellow means uv is vec2(1.0, 1.0) and black means vec2(0.0, 0.0).
Another example of visual debugging:
(source : https://www.shadertoy.com/view/ts2yzK)
This one is from a raymarching project. There are two things going on in this image: I debug the depth of the ray, and I also debug whether there is a hit or not.
The image is good for seeing whether you hit something, since white means hit and black means miss.
On the other hand, this image is a poor way to display depth. Images like this can be hard to read because of two things:
negative values, which are rendered black
values greater than 1.0, which are rendered white
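One common trick is to remap the value you want to inspect into the [0, 1] range before writing it out, so that negative or out-of-range values stay visible. A minimal Shadertoy-style sketch; the sceneDepth() function and the MAX_DEPTH constant are placeholders for whatever value you actually want to look at:
const float MAX_DEPTH = 100.0;

// Hypothetical stand-in for the value you want to inspect.
float sceneDepth(vec2 uv)
{
    return uv.x * 200.0 - 50.0;   // deliberately produces values outside [0, 1]
}

void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    vec2 uv = fragCoord / iResolution.xy;
    float depth = sceneDepth(uv);

    // Remap [0, MAX_DEPTH] to [0, 1]; clamp so anything outside still shows
    // up as pure black or pure white instead of wrapping or vanishing.
    float debugValue = clamp(depth / MAX_DEPTH, 0.0, 1.0);
    fragColor = vec4(vec3(debugValue), 1.0);
}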
Values like these also lead us to the other type of debugging:
External applications
https://renderdoc.org/
RenderDoc is one of the best debuggers out there. Yes, there are many others, but this one is free to use and has a very wide range of uses.
You can debug GLSL vertex shaders by inspecting the geometry before and after the shader runs.
You can see every pixel produced by a fragment shader and its value.
You can also see how your data is stored on the GPU, and many more things.
A final note
While debugging, visually or not, you have to know what you are looking for; you need at least a hint. Shaders are among the hardest things to debug, since their high parallelism usually means there is no way to step through your code.
Many bugs also come from forgetting small things: normalizing a value, a missing minus sign, and other little details.
To spot errors and debug properly, you have to be thorough and patient.

Related

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a way bigger variable average area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried a (completely incorrect) test with just something like this for every fragment:
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;

    // compute value
    float pixValue = center + a + b - c;

    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there's multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed here on this thread or here), but isn't there a problem, as there is a data dependency of each cell on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this :
2 1 5
0 3 2
4 4 7
The two passes should give (first columns, then rows):
2 1 5          2  3  8
2 4 7    ->    2  6 13
6 8 14         6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet) ?
Also, as I understand that fragments are not pixels (if I'm not mistaken), will the values I store back into one of my textures in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
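As a rough sketch of what the comparison pass itself can look like once you have that blurred, downsampled copy, here is a minimal GLSL ES fragment shader. The uniform names (u_Image, u_Blurred, u_Offset) are assumptions for illustration, not anything from GPUImage; linear filtering on the smaller blurred texture performs the upsampling implicitly:
precision mediump float;

uniform sampler2D u_Image;      // full-resolution source (luminance in .r)
uniform sampler2D u_Blurred;    // blurred, downsampled copy of the source
uniform float u_Offset;         // threshold bias, e.g. 0.05
varying vec2 v_TexCoordinate;

void main()
{
    float pixel = texture2D(u_Image, v_TexCoordinate).r;

    // Sampling the smaller texture with the same normalized coordinates
    // lets the hardware's linear filtering do the upsampling for us.
    float localMean = texture2D(u_Blurred, v_TexCoordinate).r;

    // White where the pixel is above its local mean (minus a small bias), black otherwise.
    float result = step(localMean - u_Offset, pixel);
    gl_FragColor = vec4(vec3(result), 1.0);
}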
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image, then, e.g. for the horizontal step on a bitmap of width n, issue a call that requests the quad be drawn n times, the 0th time at x = 0 and the mth time at x = m. Then ping-pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical one.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief), so it's a fairly poor solution. You could improve it with divide and conquer, doing the same thing in bands: e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below a band's final row is that it misses whatever the sums above it are. So perform a second pass to propagate those.
However, an issue with accumulating in the frame buffer is clamping to avoid overflow: if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck, because the additive blending will clamp, and GL_RG32I et al. don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
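A minimal sketch of that recombination pass, assuming the accumulated sums end up in a texture bound as u_Accumulated (the names are illustrative, not from any particular codebase); in practice you would also scale the result into whatever range your output target can represent:
precision highp float;

uniform sampler2D u_Accumulated;   // result of the additive passes, one source bit per channel
varying vec2 v_TexCoordinate;

void main()
{
    vec4 s = texture2D(u_Accumulated, v_TexCoordinate);

    // Weight the channels back together: A held bit 0, B bit 1, G bit 2, R bit 3.
    float value = s.a + s.b * 2.0 + s.g * 4.0 + s.r * 8.0;
    gl_FragColor = vec4(vec3(value), 1.0);
}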
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs: they execute their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If one triangle ends up in front of another, the previously calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.width:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising to a beginner, but prefix sum (SAT) calculation is well suited to parallelization. While the Hensley algorithm is the most intuitive to understand (and has been implemented in OpenGL), more work-efficient parallel methods are available; see the CUDA scan literature. The paper from Sengupta discusses a parallel method with reduce and down-sweep phases, which seems to be the state-of-the-art efficient approach. These are valuable materials, but they do not go into OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. This job is doable entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need to have a high-precision internal format; GL_RGB32F would be best, but I am not sure it is supported in OpenGL ES 2.0.
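For a flavor of what one Hensley-style pass can look like, here is a minimal GLSL ES fragment shader sketch for the horizontal direction; the uniform names are assumptions. You run log2(width) such passes, doubling u_Stride each time and ping-ponging between two FBOs, then repeat the whole thing vertically:
precision highp float;

uniform sampler2D u_Texture;     // output of the previous pass (or the source image)
uniform vec2 u_PixelDelta;       // 1.0 / texture dimensions
uniform float u_Stride;          // 1, 2, 4, ... texels, doubled every pass
varying vec2 v_TexCoordinate;

void main()
{
    float sum = texture2D(u_Texture, v_TexCoordinate).r;

    // Add the running sum from u_Stride texels to the left, if that texel exists.
    vec2 neighbour = v_TexCoordinate - vec2(u_Stride * u_PixelDelta.x, 0.0);
    if (neighbour.x >= 0.0) {
        sum += texture2D(u_Texture, neighbour).r;
    }
    gl_FragColor = vec4(sum, 0.0, 0.0, 1.0);
}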

cg: Vertex output struct corrupted by different member order? Profile violation or cg bug?

I've been tinkering with cg shaders for Retroarch, and I've encountered what appears to be a strange bug in the Cg Toolkit's compiler or code generator...or something. Consider the three-pass shader found here which simulates a CRT TV: https://github.com/libretro/common-shaders/tree/master/crt/crt-interlaced-halation
In particular, consider the final pass:
https://github.com/libretro/common-shaders/blob/master/crt/crt-interlaced-halation/crt-interlaced-halation-pass2.cg
As it stands, the shader output works as expected. If you comment out the "#define CURVATURE" at the top of this file (which simulates the curvature of a CRT TV), the shader output also works as expected. However, it's very particular to the member order of the vertex shader output struct here:
struct out_vertex {
float4 position : POSITION;
float4 color : COLOR;
float2 texCoord : TEXCOORD0;
float2 one;
float mod_factor;
float2 ilfac;
float3 stretch;
float2 sinangle;
float2 cosangle;
};
If you rearrange the order to the following, you will get corrupted output:
struct out_vertex {
float4 position : POSITION;
float4 color : COLOR;
float2 texCoord : TEXCOORD0;
float2 cosangle;
float2 one;
float mod_factor;
float2 ilfac;
float3 stretch;
float2 sinangle;
};
My desktop's nvidia card gives me a black screen with that order, and my laptop's ATI card gives me bizarre artifacts where the texture coordinates seem to be broken (perhaps). The exact nature of the error therefore depends on the GPU or drivers, but the presence of the error is vendor/driver-agnostic...so it appears to be a bug in the cg compiler that causes the varying attributes to become corrupt. There's pretty much no end to the kinds of corruption you can get. For instance, other member rearrangements screw up things like the "mod_factor" variable (storing the x pixel coordinate of the output), which causes the alternate magenta/green pixel tints to get stuck on one or the other, blanketing the entire image with the same tint. Still others cause a black screen except for halation/bloom contribution, etc.
The issue does not occur in this particular shader if you reenable "#define CURVATURE", but it doesn't have anything to do with errors in the "flat" codepath: In fact, in the part of the fragment shader within the "#ifdef CURVATURE" block, you can actually replace the final value with "xy = VAR.texCoord;" (the same value used by the uncurved version), and you'll get flat output without any errors. (EDIT: Oops, this isn't actually true with this particular shader, but it was in my own version. I should have checked that first before making the same assessment about this "simplified" example.) In reality, the fact that the flat codepath triggers the corruption but the curved codepath doesn't seems to indicate it has something to do with the curved codepath reading more of the varying attributes in the fragment shader (and maybe the read order or usage matters too...?), but I have not yet found a rhyme or reason to it. I have my own drastically different forked WIP where the same bizarre issues affect a curved codepath as well, but I'd rather keep it to myself until it's ready anyway.
So, I guess I have a few questions:
Has anyone else seen anything like this?
Is this nondeterminism simply expected with output struct members that aren't explicitly associated with any semantics?
Could this corruption be coming from cg shader profile limits I'm unaware of? I have no idea what shader profile Retroarch compiles for, but I can see this kind of corruption occurring if the size of the vertex output struct exceeds some maximum allowed size.
Are there any other possibilities I might be overlooking? I considered driver errors, but that went out the window once I realized it affected both nvidia and ATI hardware. Still, I want to do my homework before informing nvidia the Cg Toolkit seems to have a bug...
Thanks for any insights! :)
It turns out the problem has everything to do with relying on cg's auto-assigned semantics. I'll copy/paste my comment from above:
I'm starting to think the problem might have something to do with relying on cg to auto-assign semantics: If for instance cg associates a value with a full float range to a semantic that clamps to [0.0, 1.0], that would obviously cause issues. mod_factor, ilfac, and stretch would all fall into that category, and sinangle and cosangle could be in [-1, 1], so the same probably applies to them. The assignment of semantics is likely to be affected by dead code elimination, which would explain the differences with and without "#define CURVATURE." I'll have to test this hypothesis though...
There are only a limited number of semantics available depending on the profile (see this specification), and (I may be mistaken) Retroarch appears to use a lower profile, where only the following are available:
POSITION: must be set to the clipspace vertex position, not only because it informs the rasterizer, but also because it apparently can't even be read from the fragment shader.
COLOR0 and COLOR1: values are clamped to the [0, 1] range.
TEXCOORD0-7: safe for any scalar or vector float value
FOG: safe for any scalar float value
The BCOL0/BCOL1 semantics probably clamp too in profiles that support them, and PSIZE and CLP0-5 probably don't. The overall lesson seems to be that letting the cg compiler auto-assign semantics for values outside of the [0, 1] range is like playing Russian roulette, because you never know if they'll end up being associated with the clamped semantics or not, and the auto-assignment will change depending on the specifics of the shader code. For that reason, you need to carefully manage semantics so values potentially outside of [0, 1] get paired up with something like TEXCOORD0-7 or FOG (for a scalar float).

Debugging DirectX on VS2013 - Drawing Call not appearing, But Vertex Shader is executed

I am trying to start working on DirectX 11, but I have so far been unable to draw a single polygon.
I started with sample codes from Beginning DirectX 11 Game Development (Chapter 2, simple, untextured triangle) and this sample from Microsoft's Dev Center.
So far I have created my Device, Context, Swap Chain, Viewport, Vertex and Pixel Shaders, Input Layout and Vertex Buffer for this effect. However, nothing is displayed on screen.
Using VS2013's Graphics Debugging tools, I managed to find that my geometry is being sent to the Vertex Shader, apparently in the correct position (Single Triangle in the middle of the screen). However, when I switch to the Graphics Pixel History tool, it appears my drawing call is never executed; Yet, the Graphics Event list says it was.
My vertices are in CW order, no blending/alpha has been enabled, and the same debugging tools let me know that all the objects listed above were properly created. Yet nothing appears. Does anybody have a pointer in the right direction?
Without shader code and vertex data we can only guess.
My guess (and it's a frequent mistake) is that the fragments are not passing the depth test due to incorrect Z or W values output by the VS (because of incorrect vertex data, incorrect matrices in constant buffers, incorrect calculations, etc.) or incorrect depth buffer settings.
To debug it, try the following steps:
Run the graphics debugger and capture a frame (Print Screen)
In the "Graphics Event List", choose a draw call (the one with a brush icon next to it)
Click on a pixel in the captured frame (preferably one that belongs to your geometry)
Look at "Graphics Pipeline Stages" to see whether the IA, VS, PS and OM stages are functional for that pixel
Check the "Pixel History" (unfold all lines). Does anything happen after the buffer clear? Do your fragments pass the depth test? Does your geometry have the wrong color?
On "Graphics Pipeline Stages" or "Pixel History", click the green arrow near a shader stage to debug its HLSL code. Go line by line and watch the variables. Check that the shaders output valid data to the next stages.
Repeat the same for some other pixels of your frame if needed.
The goal is to find at which stage of the pipeline things go wrong.
If you cannot find some of these windows, check the menu "Debug" - "Graphics".
Hope it helps.
Happy debugging!

Optimizing vertices for skeletal animation in OpenGL ES

So I'm working with a 2D skeletal animation system.
There are X number of bones, each bone has at least 1 part (a quad, two triangles). On average, I have maybe 20 bones, and 30 parts. Most bones depend on a parent, the bones will move every frame. There are up to 1000 frames in total per animation, and I'm using about 50 animations. A total of around 50,000 frames loaded in memory at any one time. The parts differ between instances of the skeleton.
The first approach I took was to calculate the position/rotation of each bone, and build up a vertex array, which consisted of this, for each part:
[x1,y1,u1,v1],[x2,y2,u2,v2],[x3,y3,u3,v3],[x4,y4,u4,v4]
And pass this through to glDrawElements each frame.
Which looks fine, covers all the scenarios I need, and doesn't use much memory, but it performs like a dog. On an iPod 4, I could get maybe 15fps with 10 of these skeletons being rendered.
I worked out that most of the performance was being eaten up by copying so much vertex data each frame. I decided to go to another extreme, and "pre-calculated" the animations, building up a vertex buffer at the start for each character, that contained the xyuv coordinates for every frame, for every part, in a single character. Then, I calculate the index of the frame that should be used for a particular time, and calculate a delta value, which is passed through to the shader used to interpolate between the current and the next frames XY positions.
The vertices looked like this, per frame
[--------------------- Frame 1 ---------------------],[------- Frame 2 ------]
[x1,y1,u1,v1,boneIndex],[x2, ...],[x3, ...],[x4, ...],[x1, ...][x2, ...][....]
The vertex shader looks like this:
attribute vec4 a_position;
attribute vec4 a_nextPosition;
attribute vec2 a_texCoords;
attribute float a_boneIndex;
uniform mat4 u_projectionViewMatrix;
uniform float u_boneAlpha[255];
varying vec2 v_texCoords;
void main() {
    float alpha = u_boneAlpha[int(a_boneIndex)];
    vec4 position = mix(a_position, a_nextPosition, alpha);
    gl_Position = u_projectionViewMatrix * position;
    v_texCoords = a_texCoords;
}
Now, performance is great, with 10 of these on screen, it sits comfortably at 50fps. But now, it uses a metric ton of memory. I've optimized that by losing some precision on xyuv, which are now ushorts.
There's also the problem that the bone-dependencies are lost. If there are two bones, a parent and child, and the child has a keyframe at 0s and 2s, the parent has a keyframe at 0s, 0.5s, 1.5s, 2s, then the child won't be changed between 0.5s and 1.5s as it should.
I came up with a solution to fix this bone problem -- by forcing the child to have keyframes at the same points as the parents. But this uses even more memory, and basically kills the point of the bone hierarchy.
This is where I'm at now. I'm trying to find a balance between performance and memory usage. I know there is a lot of redundant information here (UV coordinates are identical for all the frames of a particular part, so repeated ~30 times). And a new buffer has to be created for every set of parts (which have unique XYUV coordinates -- positions change because different parts are different sizes)
Right now I'm going to try setting up one vertex array per character, which has the xyuv for all parts, and calculating the matrices for each parts, and repositioning them in the shader. I know this will work, but I'm worried that the performance won't be any better than just uploading the XYUV's for each frame that I was doing at the start.
Is there a better way to do this without losing the performance I've gained?
Are there any wild ideas I could try?
The better way to do this is to transform your 30 parts on the fly, not make thousands of copies of your parts in different positions. Your vertex buffer will contain one copy of your vertex data, saving tons of memory. Then each frame can be represented by a set of transformations passed as a uniform to your vertex shader for each bone you draw with a call to glDrawElements(). Each dependent bone's transformation is built relative to the parent bone. Then, depending on where on the continuum between hand crafted and procedurally generated you want your animations, your sets of transforms can take more or less space and CPU computing time.
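A minimal vertex shader sketch of that idea, assuming the per-bone transforms (already composed with their parents on the CPU and interpolated for the current frame) are uploaded as a uniform matrix array; the names and the bone count of 20 are illustrative:
attribute vec4 a_position;      // rest-pose position of the part's vertex
attribute vec2 a_texCoords;
attribute float a_boneIndex;

uniform mat4 u_projectionViewMatrix;
uniform mat4 u_boneMatrix[20];  // one transform per bone, updated each frame

varying vec2 v_texCoords;

void main() {
    mat4 bone = u_boneMatrix[int(a_boneIndex)];
    gl_Position = u_projectionViewMatrix * (bone * a_position);
    v_texCoords = a_texCoords;
}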
Jason L. McKesson's free book, Learning Modern 3D Graphics Programming, gives a good explanation on how to accomplish this in chapter 6. The example program at the end of this chapter shows how to use a matrix stack to implement a hierarchical model. I have an OpenGL ES 2.0 on iOS port of this program available.

How WebGL works?

I'm looking for a deep understanding of how WebGL works. I want to gain knowledge at a level that most people care less about, because it isn't necessarily useful to the average WebGL programmer. For instance, what role does each part (browser, graphics driver, etc.) of the total rendering system play in getting an image on the screen?
Does each browser have to create a JavaScript/HTML engine/environment in order to run WebGL in the browser? Why is Chrome ahead of everyone else in terms of being WebGL compatible?
So, what are some good resources to get started? The Khronos specification is kind of lacking (from what I saw browsing it for a few minutes) for what I want. I mostly want to know how this is accomplished/implemented in browsers and what else needs to change on your system to make it possible.
Hopefully this little write-up is helpful to you. It overviews a big chunk of what I've learned about WebGL and 3D in general. BTW, if I've gotten anything wrong, somebody please correct me -- because I'm still learning, too!
Architecture
The browser is just that, a Web browser. All it does is expose the WebGL API (via JavaScript), which the programmer does everything else with.
As near as I can tell, the WebGL API is essentially just a set of (browser-supplied) JavaScript functions which wrap around the OpenGL ES specification. So if you know OpenGL ES, you can adopt WebGL pretty quickly. Don't confuse this with pure OpenGL, though. The "ES" is important.
The WebGL spec was intentionally left very low-level, leaving a lot to be re-implemented from one application to the next. It is up to the community to write frameworks for automation, and up to the developer to choose which framework to use (if any). It's not entirely difficult to roll your own, but it does mean a lot of overhead spent on reinventing the wheel. (FWIW, I've been working on my own WebGL framework called Jax for a while now.)
The graphics driver supplies the implementation of OpenGL ES that actually runs your code. At this point, it's running on the machine hardware, below even the C code. While this is what makes WebGL possible in the first place, it's also a double edged sword because bugs in the OpenGL ES driver (which I've noted quite a number of already) will show up in your Web application, and you won't necessarily know it unless you can count on your user base to file coherent bug reports including OS, video hardware and driver versions. Here's what the debug process for such issues ends up looking like.
On Windows, there's an extra layer which exists between the WebGL API and the hardware: ANGLE, or "Almost Native Graphics Layer Engine". Because the OpenGL ES drivers on Windows generally suck, ANGLE receives those calls and translates them into DirectX 9 calls instead.
Drawing in 3D
Now that you know how the pieces fit together, let's look at a lower-level explanation of how everything combines to produce a 3D image.
JavaScript
First, the JavaScript code gets a 3D context from an HTML5 canvas element. Then it registers a set of shaders, which are written in GLSL ([Open] GL Shading Language) and essentially resemble C code.
The rest of the process is very modular. You need to get vertex data and any other information you intend to use (such as vertex colors, texture coordinates, and so forth) down to the graphics pipeline using uniforms and attributes which are defined in the shader, but the exact layout and naming of this information is very much up to the developer.
JavaScript sets up the initial data structures and sends them to the WebGL API, which sends them to either ANGLE or OpenGL ES, which ultimately sends it off to the graphics hardware.
Vertex Shaders
Once the information is available to the shader, the shader must transform the information in 2 phases to produce 3D objects. The first phase is the vertex shader, which sets up the mesh coordinates. (This stage runs entirely on the video card, below all of the APIs discussed above.) Most usually, the process performed on the vertex shader looks something like this:
gl_Position = PROJECTION_MATRIX * VIEW_MATRIX * MODEL_MATRIX * VERTEX_POSITION
where VERTEX_POSITION is a 4D vector (x, y, z, and w which is usually set to 1); VIEW_MATRIX is a 4x4 matrix representing the camera's view into the world; MODEL_MATRIX is a 4x4 matrix which transforms object-space coordinates (that is, coords local to the object before rotation or translation have been applied) into world-space coordinates; and PROJECTION_MATRIX which represents the camera's lens.
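Written out as an actual GLSL vertex shader, that step might look like the sketch below; the attribute and uniform names are illustrative, not anything WebGL mandates:
attribute vec4 a_vertexPosition;   // object-space position, w = 1.0

uniform mat4 u_projectionMatrix;   // the camera's lens
uniform mat4 u_viewMatrix;         // the camera's view into the world
uniform mat4 u_modelMatrix;        // object space -> world space

void main() {
    gl_Position = u_projectionMatrix * u_viewMatrix * u_modelMatrix * a_vertexPosition;
}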
Most often, the VIEW_MATRIX and MODEL_MATRIX are precomputed and called MODELVIEW_MATRIX. Occasionally, all 3 are precomputed into MODELVIEW_PROJECTION_MATRIX or just MVP. These are generally meant as optimizations, though I'd like to find time to do some benchmarks. It's possible that precomputing is actually slower in JavaScript if it's done every frame, because JavaScript itself isn't all that fast. In this case, the hardware acceleration afforded by doing the math on the GPU might well be faster than doing it on the CPU in JavaScript. We can of course hope that future JS implementations will resolve this potential gotcha by simply being faster.
Clip Coordinates
When all of these have been applied, the gl_Position variable will have a set of XYZ coordinates ranging within [-1, 1], and a W component. These are called clip coordinates.
It's worth noting that clip coordinates are the only thing the vertex shader really needs to produce. You can completely skip the matrix transformations performed above, as long as you produce a clip coordinate result. (I have even experimented with swapping out matrices for quaternions; it worked just fine but I scrapped the project because I didn't get the performance improvements I'd hoped for.)
After you supply clip coordinates to gl_Position, WebGL divides the result by gl_Position.w, producing what are called normalized device coordinates.
From there, projecting a pixel onto the screen is a simple matter of multiplying by 1/2 the screen dimensions and then adding 1/2 the screen dimensions.[1] Here are some examples of clip coordinates translated into 2D coordinates on an 800x600 display:
clip = [0, 0]
x = (0 * 800/2) + 800/2 = 400
y = (0 * 600/2) + 600/2 = 300
clip = [0.5, 0.5]
x = (0.5 * 800/2) + 800/2 = 200 + 400 = 600
y = (0.5 * 600/2) + 600/2 = 150 + 300 = 450
clip = [-0.5, -0.25]
x = (-0.5 * 800/2) + 800/2 = -200 + 400 = 200
y = (-0.25 * 600/2) + 600/2 = -75 + 300 = 225
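As a compact way of writing the same conversion, here is a small GLSL-style helper (purely illustrative; in WebGL this step is performed by the fixed-function viewport stage, not by code you write yourself):
// Perspective divide followed by the viewport transform described above.
vec2 clipToWindow(vec4 clipPosition, vec2 viewportSize)
{
    vec2 ndc = clipPosition.xy / clipPosition.w;    // normalized device coordinates in [-1, 1]
    return (ndc * 0.5 + 0.5) * viewportSize;        // window coordinates in pixels
}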
Pixel Shaders
Once it's been determined where a pixel should be drawn, the pixel is handed off to the pixel shader, which chooses the actual color the pixel will be. This can be done in a myriad of ways, ranging from simply hard-coding a specific color to texture lookups to more advanced normal and parallax mapping (which are essentially ways of "cheating" texture lookups to produce different effects).
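For the simplest of those cases, a texture lookup, a fragment ("pixel") shader can be as short as the sketch below; the uniform and varying names are illustrative:
precision mediump float;

uniform sampler2D u_texture;
varying vec2 v_texCoords;      // interpolated from the vertex shader

void main() {
    // Choose the pixel's color straight from the texture.
    gl_FragColor = texture2D(u_texture, v_texCoords);
}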
Depth and the Depth Buffer
Now, so far we've ignored the Z component of the clip coordinates. Here's how that works out. When we multiplied by the projection matrix, the third clip component resulted in some number. If that number is greater than 1.0 or less than -1.0, then the number is beyond the view range of the projection matrix, corresponding to the matrix zFar and zNear values, respectively.
So if it's not in the range [-1, 1] then it's clipped entirely. If it is in that range, then the Z value is scaled to 0 to 1[2] and is compared to the depth buffer[3]. The depth buffer has the same dimensions as the screen, so that if a resolution of 800x600 is used, the depth buffer is 800 pixels wide and 600 pixels high. We already have the pixel's X and Y coordinates, so they are plugged into the depth buffer to get the currently stored Z value. If the stored Z value is greater than the new Z value, then the new Z value is closer than whatever was previously drawn, and replaces it[4]. At this point it's safe to light up the pixel in question (or in the case of WebGL, draw the pixel to the canvas), and store the Z value as the new depth value.
If the new Z value is greater than the stored depth value, then it is deemed to be "behind" whatever has already been drawn, and the pixel is discarded.
[1]The actual conversion uses the gl.viewport settings to convert from normalized device coordinates to pixels.
[2]It's actually scaled to the gl.depthRange settings. They default to 0 and 1.
[3]Assuming you have a depth buffer and you've turned on depth testing with gl.enable(gl.DEPTH_TEST).
[4]You can set how Z values are compared with gl.depthFunc.
I would read these articles
http://webglfundamentals.org/webgl/lessons/webgl-how-it-works.html
Assuming those articles are helpful, the rest of the picture is that WebGL runs in a browser. It renders to a canvas tag. You can think of a canvas tag like an img tag, except you use the WebGL API to generate an image instead of downloading one.
Like other HTML5 tags, the canvas tag can be styled with CSS, sit under or over other parts of the page, be composited (blended) with other parts of the page, and be transformed, rotated, and scaled by CSS along with other parts of the page. That's a big difference from OpenGL or OpenGL ES.
