Hi! I wanted to give a progress update on my efforts to rid GL3 of deprecated functions. As many of you might already know, the GL3 version was still mostly FFP (fixed-function pipeline). This means four things:
1) It doesn't use the newest GL and hardware capabilities.
2) It makes using the newest GL capabilities a pain (just like writing shaders on top of GL1.1 is a pain).
3) It's not actually GL3 - even though we call it that.
4) And probably most importantly - it makes ports harder. GLES 2.0 is essentially a subset of GL3, so a fully working GL3 renderer without deprecated functions allows A LOT easier porting to embedded devices like Android phones.
Note: In this topic I will say GL3 everywhere, when in fact it's a GL3.1 full context or even GL3.2 core (though no GL3.2-only functions are used).
So what I did first was write a matrix library to replace the GL matrix madness, as GL3 core doesn't support the GL matrix stack anymore - it all has to be done by the developer.
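To give an idea of what such a library has to cover, here is a minimal sketch of a column-major 4x4 matrix type. The names and layout are illustrative only, not ENIGMA's actual API:

// Minimal column-major 4x4 matrix sketch (column-major matches what GL expects).
struct Matrix4 {
    float m[16];

    static Matrix4 identity() {
        Matrix4 r = {};
        r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
        return r;
    }

    static Matrix4 translation(float x, float y, float z) {
        Matrix4 r = identity();
        r.m[12] = x; r.m[13] = y; r.m[14] = z;
        return r;
    }

    // Standard matrix product: result = this * b.
    Matrix4 operator*(const Matrix4& b) const {
        Matrix4 r = {};
        for (int col = 0; col < 4; ++col)
            for (int row = 0; row < 4; ++row)
                for (int k = 0; k < 4; ++k)
                    r.m[col*4 + row] += m[k*4 + row] * b.m[col*4 + k];
        return r;
    }
};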
Performance
GL1.1
While I was at it, I noticed that these changes also help GL1.1, as there were many redundant operations in the _transform_ functions that only slowed things down, but were necessary for the old GL way of doing things. In GL1.1, matrices are used only in transformation functions, so the speed difference depends on how many d3d_transform_ functions are called. For all comparisons I used two 3D examples which use a lot of 3D functionality - the Minecraft example and the Project Mario game. The FPS figures are averages taken after the game converges (in both examples I don't move at all - after the game starts I wait until it hits a steady state, like Mario going to sleep in Project Mario).
The performance changes for GL1.1 were as follows: in the Minecraft example, which doesn't use that many transform functions, there was no difference - no gain or loss. But in Project Mario there was a 90 FPS increase, which is about 9%. In games that use A LOT of transform functions the improvement will be greater. So the worst case is no difference and the best case is an improvement, and as this doesn't create any compatibility issues, I don't see why a free performance gain should be discarded. Plus, it leaves room to gain even more performance once the matrix code itself is optimized (like using MMX/SSE instructions on PCs).
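For example, an SSE version of the 4x4 multiply could look roughly like this. This is only a sketch of the idea - nothing like it is in my fork yet, and the function name is made up:

#include <xmmintrin.h>

// Column-major 4x4 multiply, out = a * b, using SSE intrinsics.
// Each output column is a linear combination of the columns of a.
void mat4_mul_sse(const float* a, const float* b, float* out) {
    __m128 acol0 = _mm_loadu_ps(a + 0);
    __m128 acol1 = _mm_loadu_ps(a + 4);
    __m128 acol2 = _mm_loadu_ps(a + 8);
    __m128 acol3 = _mm_loadu_ps(a + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 x = _mm_set1_ps(b[i*4 + 0]);
        __m128 y = _mm_set1_ps(b[i*4 + 1]);
        __m128 z = _mm_set1_ps(b[i*4 + 2]);
        __m128 w = _mm_set1_ps(b[i*4 + 3]);
        __m128 col = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, acol0), _mm_mul_ps(y, acol1)),
            _mm_add_ps(_mm_mul_ps(z, acol2), _mm_mul_ps(w, acol3)));
        _mm_storeu_ps(out + i*4, col);
    }
}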
GL3
For GL3 I will also show the progress by iteration (or Git commits, if you will) and tell what went wrong and what went right.
Here OLD is the version currently in Git master. V1 is the version where only the matrices were changed. It uses basically the same code as GL1.1; the only difference is that the matrix multiplication and the upload to GL (via glLoadMatrix) happen not during the _transform_ call, but during the render call (and only if an update was needed). This meant that examples like Minecraft got a 70 FPS boost without problem, as they have relatively few render calls. Project Mario on the other hand has a lot more (my debugger even shows Mario tries to render vertex buffers with only 1 vertex, which Robert and I should investigate), so there the FPS decreased massively.
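The idea looks roughly like this, reusing the hypothetical Matrix4 from above (again, names are illustrative, not the actual code):

#include <GL/gl.h> // or whatever GL loader the engine already uses

// Transforms only mark state dirty; the multiply and GL upload are deferred
// until a render call actually needs the result.
struct MatrixState {
    Matrix4 model_view = Matrix4::identity();
    Matrix4 projection = Matrix4::identity();
    Matrix4 mvp        = Matrix4::identity(); // cached product
    bool dirty = true;                        // set by every d3d_transform_* call
};

void transform_add_translation(MatrixState& s, float x, float y, float z) {
    s.model_view = s.model_view * Matrix4::translation(x, y, z);
    s.dirty = true; // cheap - nothing is sent to GL here
}

void before_render(MatrixState& s) {
    if (s.dirty) { // only recompute and upload when something actually changed
        glMatrixMode(GL_MODELVIEW);
        glLoadMatrixf(s.model_view.m);        // GL1.1 path
        s.mvp = s.projection * s.model_view;  // GL3 path: uploaded as a uniform instead
        s.dirty = false;
    }
}

This is why render-call count matters: the cost moved from each _transform_ call to each render call, which helps Minecraft but hurts Project Mario in V1.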
In V2 I finally removed the FFP and added shaders. This meant that FPS mostly increased, but the method was still sub-optimal (I queried uniform and attribute locations every draw call, which of course is stupid). To be honest, I cannot remember why Minecraft had a 265 FPS reduction in this one.
And V3 is the current one. Shader locations are now loaded when linking the shader program, and I fixed problems like texture coordinates being passed to shaders when there is in fact no texture (like when drawing d3d_model_block with texture == -1). This shows an overall improvement, and so I finally got it working at least as fast as - and even faster than - the older implementation.
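The fix for the location querying is simple - look everything up once at link time and cache it. A sketch (the struct and the attribute name are made up; the uniform names come from the prefix described below):

// Cache uniform/attribute locations once after linking, instead of doing
// string lookups in the driver on every single draw call.
struct CachedProgram {
    GLuint program;
    GLint uni_mvp;       // "modelViewProjectionMatrix" from the prefix
    GLint uni_texturing; // "en_Texturing" from the prefix
    GLint attr_position; // vertex position attribute (name assumed here)
};

void link_and_cache(CachedProgram& p) {
    glLinkProgram(p.program);
    p.uni_mvp       = glGetUniformLocation(p.program, "modelViewProjectionMatrix");
    p.uni_texturing = glGetUniformLocation(p.program, "en_Texturing");
    p.attr_position = glGetAttribLocation(p.program, "in_Position");
}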
But this is not the end - it can still go both ways. The shaders I wrote are quite primitive and don't have things like lights or fog. This means that when those are implemented, the speed might decrease. On the other hand there are more optimizations to be done, like maybe using vertex array objects (VAOs).
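For reference, a VAO records the vertex attribute setup once, so a draw call later reduces to a bind and a draw. A sketch with assumed attribute locations:

GLuint vao = 0, vbo = 0;           // vbo assumed to be created and filled elsewhere
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(0);      // assuming position is bound to location 0
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glBindVertexArray(0);

// Later, per draw call - no attribute re-specification needed:
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);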
Shaders
Prefixes
I chose to go with the method used in ThreeJS and make prefixes for both fragment and vertex shaders. This means that when a user codes a shader in ENIGMA, a prefix will be prepended to the top of their code. This prefix defines things like matrices (projectionMatrix, viewMatrix, modelViewProjectionMatrix etc., as well as gm_Matrices[] for compatibility with GM:S), default attributes (vertex positions, texture coordinates, vertex colors and so on) and uniforms (like whether a texture is bound, or the bound color - which works like draw_get_color() inside a shader). This makes writing shaders a lot easier, as most of the needed stuff is already there. Shader compilers remove unused uniforms and attributes, so if the user writes a shader that doesn't use some of them (say, the shader doesn't need vertex color), then the compiler will optimize out the color attribute. That is why these prefixes are so good - they don't have any real penalty. The prefixes could get quite large in the end, and we also need to check what GM:S prepends so we stay compatible, but I don't think they are going to make a difference performance-wise. They are mostly declarations, #define's and things like that.
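Purely as an illustration (this is not ENIGMA's actual prefix), such a vertex prefix could live in the engine as a string constant that gets prepended to the user's source:

#include <string>

// Illustrative vertex prefix; the real one would also carry gm_Matrices[],
// the remaining default attributes and the en_* uniforms.
const char* const vertex_prefix = R"(#version 140
uniform mat4 projectionMatrix;
uniform mat4 viewMatrix;
uniform mat4 modelViewProjectionMatrix;
in vec4 in_Position;     // default vertex position attribute (name assumed)
in vec4 in_Color;        // default vertex color attribute (name assumed)
in vec2 in_TextureCoord; // default texture coordinate attribute (name assumed)
)";

std::string build_vertex_source(const std::string& user_source) {
    return std::string(vertex_prefix) + user_source;
}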
Default shader
The default shader (bound by default on startup or when calling glsl_program_reset()) will have to replicate all the FFP stuff ENIGMA and GM allow you to use. This includes flat and smooth shading for lights (up to 8 lights, if I am not mistaken), fog (colored and with different falloff functions) and a few other things. The best way (or the "proper" way) to do this is to make several shaders - one with lights enabled, one with them disabled, one with texturing, one without, and so on. The number of combinations is going to be large though, so we either have to make our own shader generator (quite often done, actually) or just make an "ubershader". An "ubershader" is a single shader that does all of these things, and you control everything via uniforms. I personally think we might as well do it this way for three reasons:
1) The shaders shouldn't be that complicated even after all these FFP things have been implemented, as the FFP didn't support normal mapping or anything else more advanced.
2) This will be a lot easier and shorter. Writing all of the possible shaders by hand would be madness, even with the limited rendering possibilities, and I don't plan to write a shader generator.
3) The performance impact should be negligible, though I cannot be sure. The slowest things in shaders are branches - if/else statements. If the branch depends on a per-fragment value (like "if (v_TextureCoord.t == 0.0)"), then performance can be severely impacted (I saw an example where a basic change like that took 60 FPS down to 20 FPS on an iOS device). This happens because the shader units then can't work in parallel. Nvidia, for example, uses something called warps - basically batches that work on several pixels at once, like 32. If all of them run the same instructions (so branching doesn't change per pixel), then they work very fast. If even one of the pixels in the warp branches differently, then the performance impact is significant. The reason I bring this up is that an "ubershader" will require branches, but the speed shouldn't suffer because they branch on uniform constants. For example, if we don't use coloring, then none of the pixels in that draw call will take the coloring path. This is currently the basic default fragment shader:
// The en_* uniforms and TexSampler are declared by the prefix;
// they are shown here so the snippet is self-contained.
uniform bool en_Texturing;
uniform bool en_Color;
uniform vec4 en_bound_color;
uniform sampler2D TexSampler;

in vec2 v_TextureCoord;
in vec4 v_Color;
out vec4 out_FragColor;

void main()
{
    if (en_Texturing && en_Color){
        out_FragColor = texture( TexSampler, v_TextureCoord.st ) * v_Color;
    }else if (en_Color){
        out_FragColor = v_Color;
    }else if (en_Texturing){
        out_FragColor = texture( TexSampler, v_TextureCoord.st );
    }else{
        out_FragColor = en_bound_color;
    }
}
Here you can see that the branching is static. And on my GTX 660 Ti this branching doesn't decrease FPS at all (when I remove the branching I get exactly the same FPS). I have read a theory (without technical evidence, though) that some shader compilers may generate several shader variants - one for every static branch. That would mean that if I pass en_Color == true and en_Texturing == false, the first branch is never even considered. It does seem that my GPU might do this, but I am not sure.
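On the CPU side, the branch selectors are just uniforms set once per draw call, so every fragment in that call takes the same path. Roughly like this (the location variables are the cached ones from earlier; all names here are assumed):

glUseProgram(default_program);
glUniform1i(uni_en_texturing, texture_bound ? 1 : 0); // bool uniforms are set via glUniform1i
glUniform1i(uni_en_color, use_vertex_color ? 1 : 0);
glUniform4f(uni_en_bound_color, red, green, blue, alpha); // draw_set_color/draw_set_alpha state
// ... then the draw call, during which the branches never diverge per pixel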
Coloring
There have been like 3 or 4 topics about this already on this forum, with no real consensus on whether we should blend with the bound color or not. In the shader above you can see that the current method replicates the GL1.1 implementation. So if a texture is used together with per-vertex color, then we blend the texture and the color. If only color is used (like when using draw_rectangle_color), then only the vertex color is used. When drawing with a texture only, then only the texture is used (so "draw_set_color(c_red); draw_model(model,texture);" will NOT draw a red-tinted model). And when neither per-vertex color nor a texture is bound, it uses the bound color. So "draw_set_color(c_red); draw_set_alpha(.5); d3d_draw_block(..., texture = -1, ..);" will draw a transparent red block.
Deprecated functions
All these changes were mostly done with one idea - to remove all deprecated functions. Right now these two examples run without calling any deprecated functions per frame. There are still some deprecated GL functions here and there (like inside d3d_start() and d3d_end()), but I will remove them soon enough.
So, any suggestions and ideas? There wasn't much of a point to this topic, but I just wanted to share things that might come in the future. Merging my fork will probably be a pain, as I am like 50 commits behind master, but the conflicts should be minimal. This will not be merged until I also replicate lighting and do more tests. Then I would want others to try it as well (other hardware and OSes). These changes will probably break GL3 for some people here (like Poly), but that is because their hardware just doesn't support GL3 in the first place. They can currently run it because the implementation in master is more like GL2.