ENIGMA Forums

Contributing to ENIGMA => Proposals => Topic started by: TheExDeus on July 30, 2013, 11:44:34 am

Title: GL3 changes from immediate to retained mode
Post by: TheExDeus on July 30, 2013, 11:44:34 am
Hi! I think I made a topic about this a long time ago, but let's do it again. We have had GL3 for some time now, but most things are still rendered in immediate mode. So I propose changing all the drawing functions not to use it. The problem is with caching, if we even decide to use it. Right now we render a sprite like so:
Code: [Select]
void draw_sprite(int spr, int subimg, gs_scalar x, gs_scalar y)
{
    get_spritev(spr2d,spr);
    const int usi = subimg >= 0 ? (subimg % spr2d->subcount) : int(((enigma::object_graphics*)enigma::instance_event_iterator->inst)->image_index) % spr2d->subcount;
    texture_use(GmTextures[spr2d->texturearray[usi]]->gltex);

    glPushAttrib(GL_CURRENT_BIT);
    glColor4f(1,1,1,1);

    const float tbx = spr2d->texbordxarray[usi], tby = spr2d->texbordyarray[usi],
                xvert1 = x-spr2d->xoffset, xvert2 = xvert1 + spr2d->width,
                yvert1 = y-spr2d->yoffset, yvert2 = yvert1 + spr2d->height;

    glBegin(GL_QUADS);
    glTexCoord2f(0,0);
    glVertex2f(xvert1,yvert1);
    glTexCoord2f(tbx,0);
    glVertex2f(xvert2,yvert1);
    glTexCoord2f(tbx,tby);
    glVertex2f(xvert2,yvert2);
    glTexCoord2f(0,tby);
    glVertex2f(xvert1,yvert2);
    glEnd();

    glPopAttrib();
}
This means that in immediate mode it sends vertices one by one, which is bad, slow and deprecated. The change would be to use VAOs or VBOs, which are sadly meant for more static geometry. Dynamic geometry requires rebuilding the buffer all the time before drawing. So if we just do this:
Code: [Select]
void draw_sprite(int spr, int subimg, gs_scalar x, gs_scalar y)
{
    get_spritev(spr2d,spr);
    const int usi = subimg >= 0 ? (subimg % spr2d->subcount) : int(((enigma::object_graphics*)enigma::instance_event_iterator->inst)->image_index) % spr2d->subcount;
    texture_use(GmTextures[spr2d->texturearray[usi]]->gltex);

    const float tbx = spr2d->texbordxarray[usi], tby = spr2d->texbordyarray[usi],
        xvert1 = x-spr2d->xoffset, xvert2 = xvert1 + spr2d->width,
        yvert1 = y-spr2d->yoffset, yvert2 = yvert1 + spr2d->height;

    float data[][7] = {
       {  xvert1, yvert1, 0.0, 0.0, 1.0, 1.0, 1.0  },
       {  xvert2, yvert1, tbx, 0.0, 1.0, 1.0, 1.0  },
       {  xvert2, yvert2, tbx, tby, 1.0, 1.0, 1.0  },

       {  xvert2, yvert2, tbx, tby, 1.0, 1.0, 1.0  },
       {  xvert1, yvert2, 0.0, tby, 1.0, 1.0, 1.0  },
       {  xvert1, yvert1, 0.0, 0.0, 1.0, 1.0, 1.0  }
    };

    GLuint spriteVBO;
    glGenBuffers(1, &spriteVBO);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glBindBuffer(GL_ARRAY_BUFFER, spriteVBO);
    glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_DYNAMIC_DRAW);
    glVertexPointer( 2, GL_FLOAT, sizeof(float) * 7, NULL );
    glTexCoordPointer( 2, GL_FLOAT, sizeof(float) * 7, (void*)(sizeof(float) * 2) );
    glColorPointer( 3, GL_FLOAT, sizeof(float) * 7, (void*)(sizeof(float) * 4) );

    glDrawArrays( GL_TRIANGLES, 0, 6);

    glDisableClientState( GL_COLOR_ARRAY );
    glDisableClientState( GL_TEXTURE_COORD_ARRAY );
    glDisableClientState( GL_VERTEX_ARRAY );

    glDeleteBuffers(1, &spriteVBO);
}
Then we will be using VBOs, but because we rebuild both the VBO and the data on every call, it ends up A LOT slower. So does anyone have any ideas on how we could buffer this? Originally I wondered whether we could assign some ID to each draw call which could be used across frames, so the same sprite could be redrawn if things like sprite index or position have not changed. Or maybe we could create 1 VBO per sprite and just rebuild the data per render, which might be a lot faster? But I just did some tests and even if I only call:
Code: [Select]
glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_DYNAMIC_DRAW);
glDrawArrays( GL_TRIANGLES, 0, 6);
Then it is still a lot slower than immediate mode. The problem, I guess, is that we need to batch sprites into one VBO. But that cannot be done painlessly with all the dynamic things we have. I tried a quick hack by using a global VBO and populating it in draw_sprite() just before drawing, and I got a 3 times boost over immediate mode (though I seemed to get capped at exactly 100-101FPS, which means it was maybe more, but for some reason I got vsynced). That is, I drew 25k objects together with their logic (simple "bounce against walls" logic) and I got 30FPS with immediate mode and 100FPS with the global VBO. 100k objects were 9FPS for immediate and a stable 25FPS with VBO. But I tested without drawing and found that I was actually capped at 33FPS by my use of a vector (I pushed 6*7 values for each sprite and there were 100k of them). Dunno how to improve that much though. After reserving and using a manual counter (so no need for clear, just overwrite) I got to 44FPS (which was 30FPS with drawing).
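The "reserve once, then overwrite with a manual counter" trick from the last sentence could look roughly like the sketch below. The names `VertexStaging` and `kFloatsPerSprite` are hypothetical, and the 6*7 layout is taken from the VBO test earlier in this post; nothing here touches GL.

```cpp
#include <cstddef>
#include <vector>

// 6 vertices per sprite, 7 floats per vertex (x, y, tx, ty, r, g, b).
constexpr std::size_t kFloatsPerSprite = 6 * 7;

// Hypothetical CPU-side staging area for the global VBO.
struct VertexStaging {
    std::vector<float> data;   // allocated once, never cleared
    std::size_t used = 0;      // manual write cursor, counted in floats

    explicit VertexStaging(std::size_t max_sprites)
        : data(max_sprites * kFloatsPerSprite) {}

    // Start a new frame: old contents are simply overwritten later.
    void reset() { used = 0; }

    // Copy one sprite's 42 floats; returns false if the staging area is full.
    bool push_sprite(const float (&verts)[kFloatsPerSprite]) {
        if (used + kFloatsPerSprite > data.size()) return false;
        for (std::size_t i = 0; i < kFloatsPerSprite; ++i)
            data[used + i] = verts[i];
        used += kFloatsPerSprite;
        return true;
    }
};
```

Only the first `used` floats would be uploaded each frame; the vector itself is never resized or cleared.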

So basically what I propose is this:
1) Have 1 global VBO.
2) In all drawing functions we populate this VBO with x,y,tx,ty,r,g,b,a and do that until texture_use() fails (e.g. when the currently used texture is not the same as the requested one), at which point we draw the VBO and clear it.
3) Bind the new texture and repeat.
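
A minimal sketch of the three steps above, assuming a hypothetical `SpriteBatcher` class. A real version would upload the pending vertices to the global VBO and issue a draw call inside `flush()`; here `flush()` only records what it would have drawn, so the control flow can be followed without a GL context.

```cpp
#include <cstddef>
#include <vector>

// One vertex in the proposed layout: position, texture coordinate, color.
struct Vertex { float x, y, tx, ty, r, g, b, a; };

// Hypothetical batcher: accumulates vertices until the texture changes.
class SpriteBatcher {
public:
    struct Flush { int texture; std::size_t vertex_count; };

    // Called by every drawing function before it pushes vertices.
    void use_texture(int texture) {
        if (texture != bound_texture_) {
            flush();                  // draw what was batched for the old texture
            bound_texture_ = texture; // then "bind" the new one
        }
    }

    void push(const Vertex& v) { pending_.push_back(v); }

    // Draw and clear the batch; must also be called at the end of the frame.
    void flush() {
        if (pending_.empty()) return;
        flushes_.push_back({bound_texture_, pending_.size()});
        pending_.clear();
    }

    const std::vector<Flush>& flushes() const { return flushes_; }

private:
    int bound_texture_ = -1;          // -1: nothing bound yet
    std::vector<Vertex> pending_;
    std::vector<Flush> flushes_;
};
```

Two sprites with the same texture followed by one with a different texture end up as two flushes instead of three draw calls.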

Advantages:
1) This way we will batch as much as we can before drawing and yet have the possibility to use different drawing functions (even sprite and background) interchangeably.
2) When we add sprite packing (or more precisely texture packing), then we will have a massive speed boost without changing any drawing functions. This is because we push the texture coordinates and render only when texture changes. So less texture changes means more batching.
3) Tiles would automatically be batched (usually), because calling draw_background_ext_transformed like previously would automatically add them to the same VBO (if the same tilestrip is used, which it often is). Right now it seems some GLLists are made and populated, but I think that is slower (especially when many glBegin and glEnd functions are used per tile). Of course remaking the tile system to use 1 VBO per layer could maybe be better and speed the whole thing up (but will take more memory).
4) Port to GLES (Android and such) would be a lot simpler, as it doesn't support immediate mode and requires the app to basically be GL3 (so no gl transformation functions either). So we must push towards that for easier maintenance and compatibility.

Disadvantages:
1) If a lot of texture switching happens (like having two objects with the same depth that are created interleaved with one another, so the draw event is called interleaved as well), then there will be a performance impact. On a game with a few hundred sprites it will probably not be seen, but with thousands of sprites the impact could be noticeable. The good thing is that things like depth changes would reduce the impact, as would texture packing.

note: Functions like glEnableClientState and such are actually also deprecated. Now all of that has to happen on a vertex shader. I plan to test that too and maybe implement it that way. But this global VBO thing is a lot simpler and could potentially give a lot of speed.

So, any ideas?

edit: By replacing glBufferData with glBufferSubData I got to 36FPS with 100k objects, but this won't be possible in the implementation mentioned here (as the size will change all the time depending on how many sprites are drawn and how many texture swaps happen). But with a much smaller VBO the impact of that function will not be so great. It is even recommended to use several smaller VBO's than one big one anyway.
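
One possible way around the changing size, sketched under assumptions that are not part of the post: allocate the buffer store up to a high-water mark and only reallocate when a batch outgrows it, so most frames need just a sub-update. `GrowOnlyBuffer` is a hypothetical name; the two GL calls it stands for (`glBufferData` and `glBufferSubData`) appear only in comments so the decision logic can run without a context.

```cpp
#include <cstddef>

// Hypothetical wrapper that decides between reallocating the buffer store
// and updating part of the existing one.
class GrowOnlyBuffer {
public:
    // Returns true if the store had to be reallocated for this upload.
    bool upload(std::size_t bytes) {
        if (bytes > capacity_) {
            // Grow geometrically so reallocation stays rare.
            const std::size_t grown = capacity_ ? capacity_ * 2 : 4096;
            capacity_ = bytes > grown ? bytes : grown;
            // glBufferData(GL_ARRAY_BUFFER, capacity_, NULL, GL_DYNAMIC_DRAW);
            // glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
            ++reallocations_;
            return true;
        }
        // glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
        return false;
    }

    std::size_t capacity() const { return capacity_; }
    int reallocations() const { return reallocations_; }

private:
    std::size_t capacity_ = 0;
    int reallocations_ = 0;
};
```

Whether this beats a plain glBufferData per flush would have to be measured; the numbers in this post only cover the fixed-size case.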
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on July 31, 2013, 07:25:33 am
The point of the GL3 graphics system is to assume that the hardware supports VBOs and shaders, as opposed to call lists and matrices.

This has been planned for a loooong time, but has only recently begun being implemented.

Purging immediate mode from GL3 is certainly a goal. Texture batching is also an interest, if possible.
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on July 31, 2013, 07:34:13 am
Yes, these were all the things I was originally planning to do with OpenGL 3. But Harri, for that I was thinking of a common interface for vertex formats of all the basic shapes like a plane and what not, and include them from a common header, that's what those GLshapes.cpp and GL3shapes.cpp files are about. I just didn't have the energy to do it. Also, that is why there's a shaders folder, we need to rewrite all the expected behavior as GPU programs as well. Especially for particle effects.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on July 31, 2013, 08:46:52 am
Quote
This has been planned for a loooong time, but has only recently begun being implemented.
Well I have been thinking about this for a long time as well. The problem is that it is not that straight forward. If you create your own GL project then it is easy to batch things as you do all the logic quite differently and you can decide what is going to be static and what dynamic. With GM way of doing things (and now ENIGMA of course) it is a lot harder, because you can do this:
Code: [Select]
draw_sprite(spr_0,0,x,y);
draw_background(back_0,10,10);
draw_line(10,10,250,300);
draw_sprite(spr_0,0,60,10);
Which cannot be straightforwardly batched. Even if background and sprite functions draw a static image (so the x and y are static), they are still considered dynamic and so must be either redrawn every time via immediate mode (like now) or batched to a dynamic VBO (like it is proposed here). If we had sprite packing then at least this simple code wouldn't cause a texture rebind (and in turn a VBO regen), but it still wouldn't be as good as it could be. If we had a way to figure out whether the drawn images are static or not (like if none of the arguments are variables), then it could be possible to batch them together in a different VBO which would be reused and never regenerated. The problem though is that the static image can be inside an if(){} or that draw ordering would break. Some of these things could be fixed by some extreme analysis of the code at compile time, but I can't even imagine how much thinking that would require. I think JDI already returns many things about the functions, so it could be possible. And the draw_line() breaks the whole thing even further. I guess we can have a separate vector which would hold the drawing mode or even textures. Then just push everything into a single VBO, bind the different textures and draw only part of the VBO at a time via glDrawElements().
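
The "separate vector which would hold the drawing mode or even textures" idea could be sketched as below. `RangeRecorder` and `DrawRange` are hypothetical names; the vertices themselves would live in one shared buffer elsewhere, and each recorded range would later be drawn with a single call after binding its texture.

```cpp
#include <cstddef>
#include <vector>

enum class Primitive { Triangles, Lines };

// One contiguous run of vertices in the shared buffer that needs the same
// state and can therefore be drawn with one call.
struct DrawRange {
    Primitive mode;
    int texture;           // -1 for untextured primitives such as lines
    std::size_t first;     // index of the first vertex of the run
    std::size_t count;     // number of vertices in the run
};

// Hypothetical recorder: tracks which state each run of vertices needs.
class RangeRecorder {
public:
    void add(Primitive mode, int texture, std::size_t vertex_count) {
        if (!ranges_.empty() && ranges_.back().mode == mode &&
            ranges_.back().texture == texture) {
            ranges_.back().count += vertex_count;   // extend the current run
        } else {
            ranges_.push_back({mode, texture, next_, vertex_count});
        }
        next_ += vertex_count;
    }

    const std::vector<DrawRange>& ranges() const { return ranges_; }

private:
    std::vector<DrawRange> ranges_;
    std::size_t next_ = 0;   // index of the next free vertex in the buffer
};
```

For the four calls above this records four ranges when the sprite and the background use different textures, and three when they share one packed texture, since the first two calls then merge into one run.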

Quote
But Harri, for that I was thinking of a common interface for vertex formats of all the basic shapes like a plane and what not, and include them from a common header, that's what those GLshapes.cpp and GL3shapes.cpp files are about.
Can you explain in more detail? Did you mean common shape functions that return vertices or something? Like vert_plane(x,y,w,h,r), which would push a rotated plane to a common vertex array?
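A rough sketch of what such a vert_plane() helper might look like. The signature, the `Vec2` type, the use of radians and the rotation around the top-left corner are all assumptions made for illustration; the thread does not define any of them.

```cpp
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical helper: appends the six vertices (two triangles) of a w-by-h
// rectangle whose top-left corner is at (x, y), rotated by `radians` around
// that corner.
inline void vert_plane(std::vector<Vec2>& out, float x, float y,
                       float w, float h, float radians) {
    const float c = std::cos(radians), s = std::sin(radians);
    // Rotate a local offset (dx, dy), then translate it to (x, y).
    auto corner = [&](float dx, float dy) {
        return Vec2{x + dx * c - dy * s, y + dx * s + dy * c};
    };
    const Vec2 tl = corner(0, 0), tr = corner(w, 0),
               br = corner(w, h), bl = corner(0, h);
    // Same vertex order as the two-triangle layout in the first post.
    out.insert(out.end(), {tl, tr, br, br, bl, tl});
}
```

Drawing functions that need no rotation would pass 0 and get an axis-aligned quad, so the vertex format would only have to change in one place.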
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on July 31, 2013, 09:21:42 am
Harri, yup, exactly what I was thinking.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on July 31, 2013, 10:03:16 am
Then I don't get that at all. Taking into account that some things need to be rotated and some not, I think it will be better if I just populate the thing inside the drawing functions. Though I guess if it is done this way, then it will be easier to change formats later on. I will investigate.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on July 31, 2013, 10:08:22 am
That's why I added that "if possible" clause to batching. It's hard to do batching when people layer sprites at the current depth, and intermix texture calls with untextured calls.

Fortunately, we can do some hackery in the compiler to help with that. We have a few options, which I propose we support as options in full:

None of these options, alone, will solve the problem, but you can imagine that together these are extremely powerful options. Let me elaborate on points (2) and (3).

2. Parallel batch barriers
Say we have three events which are run during the game.
Code: (EDL) [Select]
draw_sprite(spr_wing_bottom, 0, x, y);
draw_sprite(spr_bird, 0, x, y);
draw_sprite(spr_wing_top, 0, x, y);
Code: (EDL) [Select]
draw_sprite(spr_fire, -1, x, y);
draw_sprite(spr_wing_bottom, 0, x, y);
draw_sprite(spr_firebird, 0, x, y);
draw_sprite(spr_wing_top, 0, x, y);
Code: (EDL) [Select]
draw_circle_color(x, y, 64, c_white, c_red, false);
draw_sprite(spr_wing_bottom, 0, x, y);
draw_rectangle_color(x-16, y-16, x+16, y+16, c1, c2, c3, c4);
draw_sprite(spr_wing_top, 0, x, y);

Our batch mechanism would work by keeping a list, in order, of each type of sprite, line, ...whatever needs drawn. At the beginning of each draw event would be batch_chunk_start(), at the end would be batch_chunk_end().

0. A list of batch jobs is created, and is initially empty.
1. The batch_chunk_start() method moves the head position to the beginning of the list.
2. Each time the user tries to draw something, the head advances until a batch job of that type is encountered.
3. If no batch job of that type is encountered, the head is moved back where it was, and a new batch job is inserted there.
4. The batch_chunk_end() method doesn't do anything except maybe a check in debug mode.

By the above process, the batch jobs generated for the above codes, in order, will be as follows (assuming the codes are first encountered in the order given above and then in any sequence for repetition):

The worst case for this batch algorithm is when every object draws everything uniquely or in reverse order of another object which already drew it. The issue is that in this system, everything must be batchable, or have a batch node. Every. Single. Draw. Function.
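
One reading of steps 0 to 4 above, as a minimal sketch. `BatchList` is a hypothetical name, a job's type is reduced to a string key (a texture or primitive name), and "inserted there" is interpreted as "inserted right after the job the previous draw landed in, or at the front if this is the first draw of the chunk". That interpretation is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class BatchList {
public:
    struct Job { std::string key; int draws; };

    // Step 1: move the head back to the beginning of the list.
    void chunk_start() { head_ = 0; at_start_ = true; }

    // Steps 2 and 3: look forward from the head for a job of this type;
    // if there is none, leave the head where it was and insert a new job.
    void draw(const std::string& key) {
        for (std::size_t i = head_; i < jobs_.size(); ++i) {
            if (jobs_[i].key == key) {
                ++jobs_[i].draws;
                head_ = i;
                at_start_ = false;
                return;
            }
        }
        const std::size_t pos = at_start_ ? 0 : head_ + 1;
        jobs_.insert(jobs_.begin() + static_cast<std::ptrdiff_t>(pos),
                     Job{key, 1});
        head_ = pos;
        at_start_ = false;
    }

    // Step 4: nothing to do outside of debug checks.
    void chunk_end() {}

    const std::vector<Job>& jobs() const { return jobs_; }

private:
    std::vector<Job> jobs_;     // step 0: initially empty
    std::size_t head_ = 0;      // index of the job the last draw landed in
    bool at_start_ = true;      // true until the first draw of a chunk
};
```

Fed the three events above in the order given, this sketch ends up with seven jobs: circle, spr_fire, spr_wing_bottom, rectangle, spr_firebird, spr_bird, spr_wing_top. That is the output of this particular reading, not necessarily the exact list the post had in mind, and repeating any of the events afterwards adds no new jobs.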

3. Profiling
To improve further on the above, code profiling can be done by creating texture pairs as described. With our batch class in place, the pairs generated will be (spr_fire, spr_wing_bottom), (spr_wing_bottom, spr_firebird), (spr_firebird, spr_bird), (spr_bird, spr_wing_top). A very complicated (relatively speaking—I mean in terms of runtime complexity rather than in difficulty) algorithm would then decide the best arrangement for these sprites. An obvious answer (aside from put them all on the same sheet) is to arrange them so that spr_fire and spr_wing_bottom are on one atlas, and spr_firebird, spr_bird, and spr_wing_top are on another. The point is to minimize the number of texture switches in a batched or unbatched environment; for more complicated games, where these transitions will not be made 1:1 by the batch tool, the profiling will come in handy to a much higher degree.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on July 31, 2013, 12:35:01 pm
I don't think taking away drawing order management other than depth is such a good thing. I always rely on the drawing order for hud drawing and now it would either require shit ton of objects or changing depth mid draw, which I don't think would work in your case (or it would just call the buffer to be drawn and reset?), like so:
Code: [Select]
depth = 0;
draw_sprite(spr_wing_bottom, 0, x, y);
draw_sprite(spr_bird, 0, x, y);
depth = 1;
draw_sprite(spr_wing_top, 0, x, y);
Also note how I changed the depth in reverse order.

I now implemented a system which could be something like the system in the final version. These are the results. In the best case (0 texture switching) and drawing 50k sprites I get this:
Code: [Select]
int i = 0;
var inst;
repeat (50000){
inst = instance_create(random(room_width),50+random(room_height-50),obj_0);
inst.spr = (i%1==0?spr_0:spr_1);
++i;
}
(http://img.inperil.net/VBO_best_case.png)
Without batching (like GL1) gives 18FPS, so it's a speed increase of 400%. Now the worst case:
Code: [Select]
int i = 0;
var inst;
repeat (50000){
inst = instance_create(random(room_width),50+random(room_height-50),obj_0);
inst.spr = (i%2==0?spr_0:spr_1);
++i;
}
(http://img.inperil.net/VBO_worst_case.png)
Without batching this gives 14FPS (so a slight decrease because of texture switching), but the VBO is 230% slower here. In this case I have 50k sprites, but two different ones are drawn (25k each), and as they are created interleaved, they are rendered interleaved as well. This means a texture switch happens for every draw_sprite, and thus a VBO flush as well.

So some thoughts and questions:
1) Does this seem acceptable to be committed? For the worst case this could be a step back performance-wise, but some points to consider:
    * Normally you don't render this many sprites like this. If you render thousands of sprites then they are for things like particles, which in this case would batch perfectly.
    * This runs fast enough for 500 and even 5000 (60 fps) worst case sprites, so it shouldn't impact any current game.
    * Worst case almost never happens (tm).
2) All sprite functions (like draw_sprite, draw_sprite_ext, _transformed etc.) are rendered together.
3) This clearly shows we need to use a texture atlas. Some thoughts:
    * Do we make it runtime or compile time? At runtime it would be better because we could pack also when using sprite_add() functions, but that would require a loading screen (as startup will get slower). We could also have a middle ground where all the compile time resources are packed at compile time (like fonts are now) and runtime sprite_add() packs at runtime. I would love some help implementing this.
    * Do we pack sprite and background resources together? As texture wise there is no difference, then I suggest we do.
    * How to select packing size? At runtime we could provide a function which allows the user to choose size (like 1024x1024, 4096x4096 etc.), but at compile time it might require either a macro or an option in ENIGMA settings.
    * If we do it at compile time, then do we make it universal or tied to a graphics system? I think it should work if it is universal.
4) When this is drawn using shaders instead of glDrawElements(), then it could be faster.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on July 31, 2013, 12:39:54 pm
Quote
I don't think taking away drawing order management other than depth is such a good thing.
The proposal I gave above in (2) doesn't do that. It emulates it perfectly, with speedup in your "worst case" comparable to speedup in the best case. The only thing it takes away is instance ID behaving as a secondary depth. The order of those drawings is deterministic, but different, and should not differ in any meaningful way.

In my method, an object that draws spr1, then spr2, then spr3 will behave like this:

Code: (EDL) [Select]
with (obj_0)
  draw_sprite(spr_1);
with (obj_0)
  draw_sprite(spr_2);
with (obj_0)
  draw_sprite(spr_3);

Instead of like this:
Code: (EDL) [Select]
with (obj_0)
  draw_sprite(spr_1),
  draw_sprite(spr_2),
  draw_sprite(spr_3);
And dynamically changing to that behavior is basically trivial.


That said, go ahead and commit what you have for now, as improvement is improvement, and your solution is much less involved than mine.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on July 31, 2013, 02:43:11 pm
Quote
The proposal I gave above in (2) doesn't do that.
Well your example ordered 7 batches and if they are rendered in that order, then the output will differ. And if you draw a hud then that can be a difference between having text on the background or the background on the text (or even a player on the text).

Quote
In my method, an object that draws spr1, then spr2, then spr3 will behave like this:
I don't see how that would change much. The slowdown now happens only when switching textures. That means I can draw 20 objects with different depth and ID order and still get only 1 VBO if they all draw the same thing. If the thing differs, then you must switch texture and do the same thing again. So in that spr_1, spr_2, spr_3 example it would take the same amount of time whatever you do. Even if you use 1 VBO for each or 1 global one (of course it will work faster with 1 global one). By my testing it seems that you must render about 10 things in a batch to have any speed gain over immediate mode. So what we really need is a texture atlas. Any comments on that (the compile time vs runtime)?
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on July 31, 2013, 05:41:18 pm
Code: [Select]
d3d_transform_set_identity();
d3d_transform_add_rotation_x(90);
draw_sprite(spr_wall, 0, 0, 0);

:P

That is possible in Game Maker and using the DX batching class. DX can also outperform this with different textured sprites, I presume because of batching, and it must be mixing it with a shader. If we add all that to the OpenGL one Harri committed, we could make it a LOT faster than what he has right now.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on July 31, 2013, 06:15:19 pm
Quote
Well your example ordered 7 batches and if they are rendered in that order, then the output will differ. And if you draw a hud then that can be a difference between having text on the background or the background on the text (or even a player on the text).
The output will only differ if individual objects overlap. The correct draw order for everything drawn inside each object's draw event is preserved. So if you draw a square sprite, then a circle sprite over it, the corner of the square sprite for one object will not be able to overlap the circle sprite of another object, because all square sprites are drawn at the same time, and all circle sprites are drawn after.

This problem in fact manifests anywhere multiple sprite draws are used to create one sprite, such as for attaching a hat, or armor, or other equipables to a character sprite. In this case, other objects on that depth which are drawn in the same way would not overlap correctly. Consider two characters with these equipables standing such that they overlap each other. This would cause one character to appear to have all equipables drawn on him, while the body of the other character remains behind him. This is an unfortunate, but rare, side effect of the conversion. Proper texture atlasing would fix that.

In your example above, Harri, the bombs would all be drawn under the nuclear signs. Assuming spr_0 is the bomb. The disaster case for my algorithm is doing all that drawing in a single loop instead of at one depth.

Quote
I don't see how that would change much. The slowdown now happens only when switching textures.
This is exactly what my method avoids, Harri. It does this by batching sprites of the same texture together under strict conditions. How are you not noticing a difference between those two codes? I think you missed the point of what I was saying. The bottom code demonstrates the original, un-batched behavior, when ENIGMA asks each object to perform its draw event. It requires 3n texture binds, where n is the number of instances. The top code shows how the batching algorithm refactors the code to look; it requires only three texture binds, regardless of how many instances there are. All sprites with spr_1's texture are drawn in batch. Then all sprites for spr_2, then all sprites for spr_3. The order is determined from the order they are drawn in the code, so it will look identical to the original except in cases of one-sprite overlap (described above).
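
The 3n versus 3 claim can be checked with a small count, assuming (as in the two EDL snippets above) that every instance draws spr_1, spr_2 and spr_3 and that a texture is only bound when it differs from the one currently bound. The function names are hypothetical.

```cpp
#include <vector>

// Counts texture binds for a sequence of draws, binding only on change.
inline int count_binds(const std::vector<int>& textures) {
    int binds = 0, bound = -1;
    for (int t : textures) {
        if (t != bound) { ++binds; bound = t; }
    }
    return binds;
}

// Unbatched: each instance runs its whole draw event (spr_1, spr_2, spr_3).
inline std::vector<int> per_instance_order(int instances) {
    std::vector<int> seq;
    for (int i = 0; i < instances; ++i)
        for (int spr = 1; spr <= 3; ++spr) seq.push_back(spr);
    return seq;
}

// Batched: all spr_1 draws, then all spr_2 draws, then all spr_3 draws.
inline std::vector<int> per_sprite_order(int instances) {
    std::vector<int> seq;
    for (int spr = 1; spr <= 3; ++spr)
        for (int i = 0; i < instances; ++i) seq.push_back(spr);
    return seq;
}
```

With 1000 instances the first order needs 3000 binds and the second needs 3.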

As for texture atlasing, I cannot think of an efficient way to do this at run time. I think our best option is to allow the user to specify groups of sprites for texture atlases, then atlas those atlases together according to profiler data from a special compilation mode.

Again in your example, method (3) from my post would return the tuples (spr_0, spr_1) with 25,000 hits, and (spr_1, spr_0) with 24,999 hits. The result would be that the profiler would strongly recommend (to the IDE/Compiler) placing spr_0 and spr_1 on the same atlas. The user could also manually fix the glitch in appearance from my method (2) by atlasing them together manually in the interface.
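
A minimal sketch of how such tuples could be collected, assuming a hypothetical `TransitionProfiler` that is notified on every texture bind request and counts each (previous texture, next texture) pair.

```cpp
#include <map>
#include <utility>

// Hypothetical profiler: counts how often texture `to` is requested right
// after texture `from`. Repeated requests for the same texture are ignored.
class TransitionProfiler {
public:
    void on_bind(int texture) {
        if (last_ != -1 && last_ != texture) ++hits_[{last_, texture}];
        last_ = texture;
    }

    int hits(int from, int to) const {
        const auto it = hits_.find({from, to});
        return it == hits_.end() ? 0 : it->second;
    }

    const std::map<std::pair<int, int>, int>& all() const { return hits_; }

private:
    int last_ = -1;   // -1: no texture requested yet
    std::map<std::pair<int, int>, int> hits_;
};
```

Feeding it 50,000 binds that alternate between two textures, starting with the first, gives 25,000 hits for one direction and 24,999 for the other, matching the numbers above.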


@Robert
That's a legitimate concern. Unfortunately, the options are to either stash matrix data in the batch operation, or treat transform calls as another barrier, which can be devastating for the performance of that batch algorithm.


One extra consideration:
Perhaps it would be a good idea to allow placing sprites in multiple texture atlases, and making it simple to check if the current atlas contains a sprite. This would further improve batching.
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on July 31, 2013, 08:31:27 pm
Well I would also like to mention that contrary to Josh's belief the D3D9 sprite batcher can also render tiled sprites by simply setting the source rectangle larger than the bounds and enabling texture repetition render state...

The only thing I don't know is whether I should force texture repetition on and leave it on in the sampler when these functions get called, or implement a complicated system for checking whether it's enabled and then disabling it again.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 01, 2013, 02:14:07 am
Quote
So if you draw a square sprite, then a circle sprite over it, the corner of the square sprite for one object will not be able to overlap the circle sprite of another object, because all square sprites are drawn at the same time, and all circle sprites are drawn after.
Ok, I get what you meant. I think this is a compile thing instead of runtime thing no? So if there is not much to change in the drawing code then I guess it could be provided as an option. But I really do want it as an option, because I like how deterministic the drawing is now. Because if we do this change, then at some point some users will come asking about this overlap problem and the only thing we could offer them would be changing depth (which is often hard, as people don't do 0 for player, 10000 for background, -10000 for foreground etc., it is usually 0, 1, -1).

Quote
This is exactly what my method avoids, Harri. It does this by batching sprites of the same texture together under strict conditions. How are you not noticing a difference between those two codes? I think you missed the point of what I was saying. The bottom code demonstrates the original, un-batched behavior, when ENIGMA asks each object to perform its draw event. It requires 3n texture binds, where n is the number of instances. The top code shows how the batching algorithm refactors the code to look; it requires only three texture binds, regardless of how many instances there are. All sprites with spr_1's texture are drawn in batch. Then all sprites for spr_2, then all sprites for spr_3. The order is determined from the order they are drawn in the code, so it will look identical to the original except in cases of one-sprite overlap (described above).
Yeah, sorry, it was late and I understood that code only when I was lying in bed.

Quote
As for texture atlasing, I cannot think of an efficient way to do this at run time. I think our best option is to allow the user to specify groups of sprites for texture atlases, then atlas those atlases together according to profiler data from a special compilation mode.
I didn't mean using some magical heuristic in real-time though. I just thought that we pack all sprites (as many as possible) into an nxn texture at runtime (or when sprite_add() is called) without taking usage into account. Usually the texture size can be quite massive, some even suggest 16kx16k for a modern PC (which GL3 is meant for). And in that size we could pack sprites for most 2d games (that texture can fit 65k 64x64 sprites, or 16k 128x128 sprites.. I think you get the point). At run-time it would also be possible to pack into GL_MAX_TEXTURE_SIZE and so work no matter what. The larger the maximum texture size the better it would go. Giving users the ability to set this would also be good of course.
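
A minimal sketch of usage-agnostic packing into an nxn texture. The "shelf" strategy used here (fill rows left to right, open a new row when one is full) is an assumption chosen for brevity, not something proposed in the thread, and `ShelfPacker` is a hypothetical name.

```cpp
struct Rect { int x, y, w, h; };

// Minimal shelf packer for a square atlas of `size` x `size` pixels.
class ShelfPacker {
public:
    explicit ShelfPacker(int size) : size_(size) {}

    // Returns true and fills `out` if a w-by-h sprite fits in the atlas.
    bool place(int w, int h, Rect& out) {
        if (w > size_ || h > size_) return false;
        if (cursor_x_ + w > size_) {                // row full: open a new shelf
            cursor_y_ += shelf_h_;
            cursor_x_ = 0;
            shelf_h_ = 0;
        }
        if (cursor_y_ + h > size_) return false;    // atlas full
        out = Rect{cursor_x_, cursor_y_, w, h};
        cursor_x_ += w;
        if (h > shelf_h_) shelf_h_ = h;             // shelf is as tall as its tallest sprite
        return true;
    }

private:
    int size_;
    int cursor_x_ = 0, cursor_y_ = 0, shelf_h_ = 0;
};
```

The capacity figures above follow from the same arithmetic: a 16384x16384 texture holds (16384/64)^2 = 65,536 sprites of 64x64, or (16384/128)^2 = 16,384 sprites of 128x128, when all sprites have the same size. A caller would open another atlas once place() returns false.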

Quote
Perhaps it would be a good idea to allow placing sprites in multiple texture atlases, and making it simple to check if the current atlas contains a sprite. This would further improve batching.
Well we will have to do this anyway. If the person doesn't have enough VRAM (or we just choose a conservative size when packing at compile time), then we must use multiple atlases. And I was thinking not about a way to check if a sprite is in an atlas, but that sprite returns in which atlas it is in. So basically nothing in the drawing functions would really have to change (only a little bit of texture coords). The texture_use() would automatically work.
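
The "only a little bit of texture coords" change could look roughly like this, assuming a hypothetical `AtlasRegion` record per sub-image that names its atlas texture and its pixel rectangle inside a square atlas.

```cpp
// Where a sub-image lives: which atlas texture, and its pixel rectangle there.
struct AtlasRegion { int atlas_texture; int x, y, w, h; };

struct TexCoords { float u0, v0, u1, v1; };

// Converts a pixel rectangle inside a square atlas of `atlas_size` pixels
// into normalized texture coordinates.
inline TexCoords region_uv(const AtlasRegion& r, int atlas_size) {
    const float inv = 1.0f / static_cast<float>(atlas_size);
    return TexCoords{r.x * inv, r.y * inv,
                     (r.x + r.w) * inv, (r.y + r.h) * inv};
}
```

A drawing function would pass `atlas_texture` to texture_use() and use (u0, v0) to (u1, v1) where it currently uses (0, 0) to (tbx, tby).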

Quote
That is possible in Game Maker and using the DX batching class. DX can also outperform this with different textured sprites, I presume because of batching, and it must be mixing it with a shader. If we add all that to the OpenGL one Harri committed, we could make it a LOT faster than what he has right now.
And it also worked in immediate mode. In GL3 transformations themselves are a massive beast, as we must rewrite all those functions to use our own matrix math. The problem is that it would probably break batching, as I would need to call glDrawElements as many times as there are transformations. Only vertex shaders could help there.
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on August 01, 2013, 02:21:56 am
Quote
Only vertex shaders could help there.
Exactly, it is OpenGL 3, the goal was to rewrite all of it to use shaders. In fact, just Google, I found all the basic immediate mode functions recreated into shaders yesterday somewhere, lost the link :/, but it was open source.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 01, 2013, 06:18:54 am
That still means that there would be a lot of batch creation/destruction going on. Though I thought how to batch primitive drawing functions (lines, curves etc.) and that would probably require another VBO. So the system you are proposing could be investigated (I am still not convinced it would give a massive speed boost and it would also foobar the rendering order), but for now I will look into packing textures at run-time, as well as maybe creating several VBO's where one would be for sprites/backgrounds and the other for lines and such. I will also test if putting everything in one VBO and then drawing subelements is faster than resetting the batch every time textures get changed.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 01, 2013, 06:51:20 am
Quote
I think this is a compile thing instead of runtime thing no?
No, this is exclusively a run-time thing. Instead of doing anything in the drawing functions themselves, the draw functions just look for an appropriate batch job to include their operations in. Sprite functions look for a triangle batch job with the same texture, then for a sub-job with color information if needed. Filled shape operations look for a batch job of the same color or of arbitrary color. Curve, line, and outline operations look for a line batch job of the same color/arbitrary color.

Quote
I didn't mean on using some magical heuristic in real-time though. I just though that we pack all sprites (as much as possible) in an nxn texture at runtime (or when sprite_add() is called) without taking into account usage. Usually the texture size can be quite massive, some even suggest 16kx16k for a modern PC (which GL3 is meant for). And in that size we could pack sprites for most 2d games (that texture can fit 65k 64x64 sprites, or 16k 128x128 sprites.. I think you get the point). At run-time it would also be possible to pack into GL_MAX_TEXTURE_SIZE and so work no matter what. The larger the maximum texture size the better it would go. Giving users the ability to set this would also be good of course.
Yes, this would be a great thing, except we need to keep input from the user in mind at all times. For larger games with shitloads of sprites, the user might want to swap them in and out of memory frequently enough to where reconstructing these huge atlases could be a problem. We need some useful heuristic; not a magical one, but one based on the user's intended use. I suggested earlier using the resource tree as a method to group sprites for mass load/unload. We might instead want to just consider a new resource type for general logical grouping. One of these grouping types could be for use in generating atlases. Another could be for use in mass load/unload. A third could just be for generic categorization, so it's easy to check if a resource is in a certain category. Other heuristic data comes from profile mode, which converts the ((spr_0, spr_1), 25000) and ((spr_1, spr_0), 24999) tuples into a more useful {49,999 => (spr_0, spr_1)} in a map or priority queue. So when generating these atlases on, say, embedded systems, where MAX_TEXTURE_SIZE is on the order of 32x32 (or, more practically, 512x512), precedence will be given to grouping spr_0 with spr_1, as otherwise, we will be saddled with roughly 50,000 texture misses. If other pairs have higher weights, they will, of course, be given precedence; I use that 50,000 as a good example of an extreme case. I think typical values will be anywhere from the 5s and 10s to the 50s-100s in a typical game.
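
A rough sketch of that tuple-folding step, assuming the profiler emits ordered (from, to, count) records. The Transition struct and both function names are made up for illustration; the real profile-mode output format is not specified in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical profiler record: the texture switched from `from` to `to` `count` times.
struct Transition { std::string from, to; long count; };

using SpritePair = std::pair<std::string, std::string>;

// Fold (a,b) and (b,a) into one unordered pair and sum their counts.
std::map<SpritePair, long> merge_transitions(const std::vector<Transition> &log) {
    std::map<SpritePair, long> weights;
    for (const Transition &t : log) {
        assert(t.count >= 0);
        SpritePair key = std::minmax(t.from, t.to);  // canonical order
        weights[key] += t.count;
    }
    return weights;
}

// Heaviest pairs first, so a packer could try to co-locate them before others.
std::vector<std::pair<long, SpritePair>> by_weight(const std::map<SpritePair, long> &w) {
    std::vector<std::pair<long, SpritePair>> out;
    for (const auto &kv : w) out.emplace_back(kv.second, kv.first);
    std::sort(out.rbegin(), out.rend());  // reverse iterators give descending order
    return out;
}
```

With the two tuples from the post, the merged weight for (spr_0, spr_1) is 25000 + 24999 = 49,999.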

Quote
Well we will have to do this anyway. If the person doesn't have enough VRAM (or we just choose a conservative size when packing at compile time), then we must use multiple atlases. And I was thinking not about a way to check if a sprite is in an atlas, but that sprite returns in which atlas it is in. So basically nothing in the drawing functions would really have to change (only a little bit of texture coords). The texture_use() would automatically work.
Not what I'm saying. I mean (spr_0, subimage 0) isn't in one atlas. It's in two or three, because otherwise, we'll have misses out the wazoo for this one sprite, because we don't have room for all the sprites in our profiler tuples.


Robert's code will not break batching, especially if we applied the matrices software-side for sprites (which is what we do now). This isn't necessary, though. The only way those matrix operations can hurt us is if they're being applied randomly in every single draw event. Otherwise, the batching mechanism can treat them as a separate barrier to move between. When a matrix has the same result as a previous matrix, it becomes the new head position moved to by batch_chunk_start(). That is, after a call to a matrix operation, the batch_chunk_start() method can only move the head back as far as the first matrix node matching the current matrix configuration.

Since it's a run-time mechanism, there is no guess and check here. It will simply work.

Quote
That still means that there would be a lot of batch creation/destruction going on. ... I am still not convinced it would give a massive speed boost and it would also foobar the rendering order
Think of it this way. Batch or no batch, this information needs to make it to the card. Moving vertices to the card is expensive, too; a batch operation allows us to amortize this cost over a much larger fraction of the drawing work. So as long as the savings from that are greater than the cost of maintaining the batch, we're fine.

Still, we should be sure it's relatively easy to enable and disable the batch algorithm. A simple way to do that while minimizing the number of data transfers is to just nullify the behavior of batch_chunk_start(). If it no longer moves the head backward, batching will only be done where possible without re-ordering anything. A new batch job will be created for every draw call, and the behavior will be identical to what we have now.
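
One way such a switch might look, as a sketch only. The Batcher type, its allow_reorder flag and chunk_start() are stand-ins invented here; chunk_start() plays the role of batch_chunk_start(), whose real signature is not shown in this thread. With the flag off, this version still merges consecutive calls that share a texture, since that needs no re-ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Job { int texture; int quads; };

struct Batcher {
    bool allow_reorder = true;  // hypothetical on/off switch for the batch algorithm
    std::vector<Job> jobs;

    // Stand-in for batch_chunk_start(): pick the job a draw call joins.
    std::size_t chunk_start(int texture) {
        if (allow_reorder) {
            // Head may move backward to any earlier job with the same texture.
            for (std::size_t i = 0; i < jobs.size(); ++i)
                if (jobs[i].texture == texture) return i;
        } else if (!jobs.empty() && jobs.back().texture == texture) {
            // Head never moves backward: only the most recent job can be joined.
            return jobs.size() - 1;
        }
        jobs.push_back(Job{texture, 0});
        return jobs.size() - 1;
    }

    void draw_quad(int texture) {
        assert(texture >= 0);
        ++jobs[chunk_start(texture)].quads;
    }
};
```

Drawing textures 1, 2, 1 gives two jobs with re-ordering enabled and three with it disabled, so draw order is preserved in the second case.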

As for implementing texture atlases... You can do that at load time relatively easily. Use the existing rectangle packer used for font glyphs. But you might want to figure out what's causing the font off-by-one error in the engine, first, as it will probably affect you. You'll have to change the load API a bit. Right now, each sprite load is done through graphics_create_texture. You'll need to replace it with some kind of graphics_place_texture() function which returns not just the texture ID, but the top-left and bottom-right float coordinates, as well. You'll also need some kind of finalize method so you can convert from your rect packer + empty texture + pxdata state to just a texture in memory. Best I can say is, good luck.
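
To make the shape of that API concrete, here is a sketch of a place/finalize pair. It uses a naive shelf packer written for this example, not the engine's existing rectangle packer, and AtlasBuilder, Placement, place() and finalize() are all hypothetical names. The GL upload that a real finalize step would do is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// What a graphics_place_texture()-style call could hand back (hypothetical).
struct Placement {
    bool ok;
    int texture_id;
    float u0, v0, u1, v1;  // top-left and bottom-right, normalized to [0,1]
};

class AtlasBuilder {
public:
    AtlasBuilder(int texture_id, int size)
        : id_(texture_id), size_(size),
          pixels_(static_cast<std::size_t>(size) * size, 0u) {
        assert(size > 0);
    }

    // Copy a w*h RGBA image into the next free spot on the current shelf.
    Placement place(const std::uint32_t *px, int w, int h) {
        assert(px != nullptr && w > 0 && h > 0);
        int x = x_, y = y_, shelf = shelf_h_;
        if (x + w > size_) { x = 0; y += shelf; shelf = 0; }  // start a new shelf
        if (w > size_ || y + h > size_)
            return Placement{false, id_, 0.f, 0.f, 0.f, 0.f};  // atlas is full
        for (int row = 0; row < h; ++row)
            std::copy(px + row * w, px + (row + 1) * w,
                      pixels_.begin() + (y + row) * size_ + x);
        x_ = x + w; y_ = y; shelf_h_ = std::max(shelf, h);
        const float s = static_cast<float>(size_);
        return Placement{true, id_, x / s, y / s, (x + w) / s, (y + h) / s};
    }

    // The finished pixel data, ready to be uploaded as a single texture.
    const std::vector<std::uint32_t> &finalize() const { return pixels_; }

private:
    int id_, size_;
    int x_ = 0, y_ = 0, shelf_h_ = 0;
    std::vector<std::uint32_t> pixels_;
};
```

A failed placement (ok == false) would be the cue to open another atlas.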
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 01, 2013, 08:55:37 am
Quote
(spr_0, subimage 0)
Oh, so you mean having the same sprite in several atlases, so if spr_player is used a lot it would be in almost every texture page? Because now there could be a situation where, in a game with many level styles, every level would have its own atlas (which is a good thing), but spr_player is put in the first atlas (with all the level1 sprites), and so it would do a lot of texture switching just because of that. I also like the manual grouping of sprites in the atlas. Right now, during compile, do I get the sprite folder (like, can I iterate through folders)? Then I guess I could make the texture packing use sprite folders for grouping (but not be limited by it; there is no performance benefit I know of from packing fewer sprites in a texture).

The problem with compile-time packing, though, is that I don't know how big the texture can be. At the start I will just try setting it to some constant (like 2048x2048, which is reasonable on even crap hardware, like Android phones). I would love to pack them at run-time just for that reason. Maybe have a 2048x2048 packing at compile time, but when running it checks the max size, and if it's something like 4096x4096 then it just packs 4 of the already packed textures in there. That should be a lot faster as no packing algorithm would be used.
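
A sketch of the bookkeeping that idea implies: pre-packed square pages are tiled into a bigger texture in row-major order, and each sprite's old texture coordinates are scaled and shifted into its page's tile. The names (PageSlot, slot_for_page, remap) are invented for this example, and it assumes the big size is an exact multiple of the page size.

```cpp
#include <cassert>

struct UV { float u, v; };

// Which big texture a pre-packed page lands in, and which tile inside it.
struct PageSlot { int big_texture; int col, row; };

PageSlot slot_for_page(int page_index, int page_size, int big_size) {
    assert(page_index >= 0 && page_size > 0);
    assert(big_size >= page_size && big_size % page_size == 0);
    const int per_side = big_size / page_size;  // 2 for 2048 pages in 4096
    const int per_big = per_side * per_side;    // 4 pages per big texture
    const int local = page_index % per_big;
    return PageSlot{page_index / per_big, local % per_side, local / per_side};
}

// Convert coordinates that were valid on the small page to the big texture.
UV remap(UV old_uv, PageSlot slot, int page_size, int big_size) {
    const float scale = static_cast<float>(page_size) / static_cast<float>(big_size);
    return UV{(slot.col + old_uv.u) * scale, (slot.row + old_uv.v) * scale};
}
```

For example, the center of page 3 (0.5, 0.5) ends up at (0.75, 0.75) in the first 4096x4096 texture, and page 4 starts a second big texture.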
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 01, 2013, 09:12:28 am
Precisely.

The packing isn't done at compile time. Only the heuristic is computed. It's still up to the load functions to do the packing; they can just do so under advisement that there are 49,999 occasions wherein a transfer is needed between spr_0's texture and spr_1's. The atlas picker can do no planning, but if the packing algorithm is aware that not having spr_0 and spr_1 on the same sheet costs around 50K binds per step, it can use that info to the game's advantage.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 01, 2013, 10:29:19 am
That makes sense.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 02, 2013, 02:04:44 pm
So a funny thing happened. I understood that if I use a vector (or any data type, basically), at some point I will just run out of memory and segfault. I tried drawing 5000000 objects and it did indeed segfault. It went up to 1000000, then I thought I could fix that by having something like enigma::globalVBO_data.size()>enigma::globalVBO_data.max_size()-100 when binding the texture. The funny thing is that it didn't actually segfault because of sprites, but because of objects. I apparently cannot create more than 1mil. or it segfaults in some lua map. Normally you wouldn't create that many objects, but I still think we need at least a graceful death and not just a segfault. When I just used draw_sprite() I could draw 5mil. until the segfault. Of course you also cannot allocate enigma::globalVBO_data.max_size()-100 either, as it usually segfaults way before that (especially in a 32bit game, although it could take that into account).

Some thoughts: Is checking for a bad_alloc and then clearing the buffer before continuing (thus basically allowing an unlimited number of sprites to be drawn) worth it? The problem is that it would require an if check. And if you do 5mil. if checks, that itself impacts performance. Or just don't do it and call anyone who wants to draw more than 5mil. sprites at once an idiot (because it can't be done at a reasonable framerate anyway; for me it's just 2-3 fps, and in GL1 it's about 3 sec per frame). Right now it would stop when changing textures, and so I can draw more than 5mil. when doing that (drawing many different sprites). But when we have texture atlases it is more likely this could happen, as a lot more would be batched.
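
For comparison, a sketch of the alternative to catching bad_alloc: a fixed upper limit with a flush when the next sprite would not fit. This costs one size comparison per sprite. CappedBatch and its members are hypothetical, and flush() only counts here; a real one would upload the data and issue the draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 4 vertices * 8 floats (x,y,tx,ty,r,g,b,a) per sprite.
constexpr std::size_t kFloatsPerSprite = 32;

class CappedBatch {
public:
    explicit CappedBatch(std::size_t max_floats) : max_floats_(max_floats) {
        assert(max_floats >= kFloatsPerSprite);
        data_.reserve(max_floats);  // allocate once, never grow past the cap
    }

    void add_sprite(const float (&sprite)[kFloatsPerSprite]) {
        // One comparison per sprite: draw what we have if the next one won't fit.
        if (data_.size() + kFloatsPerSprite > max_floats_) flush();
        data_.insert(data_.end(), sprite, sprite + kFloatsPerSprite);
    }

    // Placeholder for "upload and draw"; here it only counts and empties the buffer.
    void flush() {
        if (data_.empty()) return;
        ++flush_count_;
        data_.clear();
    }

    std::size_t pending_floats() const { return data_.size(); }
    std::size_t flush_count() const { return flush_count_; }

private:
    std::size_t max_floats_;
    std::size_t flush_count_ = 0;
    std::vector<float> data_;
};
```

Whether that comparison is measurable next to the cost of building the vertices is something only profiling would show.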
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on August 03, 2013, 12:57:38 am
Harri, another thing about this is that DX has FVF or Flexible Vertex Format.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 03, 2013, 06:22:47 am
Well DX is basically an engine, so it has a lot of high level functions. GL is very low-level, so it asks the programmer to make everything. But the principle is the same, and so it also allows the same stuff. For example, we already use FVF in GL3 now, because we create the vector ourselves (I just chose x,y,tx,ty,r,g,b,a; it might as well have been r,g,b,a,x,y,nx,ny,nz,customAttribute1,customAttribute2,tx,ty etc.) and then bind the array with sizes and offsets. When shaders are used to render, it will be even more apparent.
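
As an illustration of "sizes and offsets" for that x,y,tx,ty,r,g,b,a layout, here is a sketch that only computes the numbers. The Vertex struct, Attrib enum and layout_of() are assumptions made for this example; the three values per attribute are the ones that would typically be passed as the size, stride and pointer arguments of glVertexAttribPointer, which is deliberately not called here so the snippet builds without GL headers.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved vertex in the order named in the post: x,y,tx,ty,r,g,b,a.
struct Vertex {
    float x, y;
    float tx, ty;
    float r, g, b, a;
};
static_assert(sizeof(Vertex) == 8 * sizeof(float), "no padding expected");

enum class Attrib { Position, TexCoord, Color };

struct AttribLayout {
    int components;      // number of floats in this attribute
    std::size_t offset;  // bytes from the start of a vertex
    std::size_t stride;  // bytes between consecutive vertices
};

inline AttribLayout layout_of(Attrib a) {
    switch (a) {
        case Attrib::Position: return {2, offsetof(Vertex, x), sizeof(Vertex)};
        case Attrib::TexCoord: return {2, offsetof(Vertex, tx), sizeof(Vertex)};
        case Attrib::Color:    return {4, offsetof(Vertex, r), sizeof(Vertex)};
    }
    assert(false && "unknown attribute");
    return {0, 0, 0};
}
```

Reordering the struct members changes the offsets but nothing else, which is the flexibility being compared to FVF.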
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on August 03, 2013, 07:39:41 am
Well see now we are going to have to implement those FVF functions from Studio and wrap the old ones like Josh said. Because, I was writing the models into DX and got them working, but I had to make each vertex take color and alpha and shit even when not passed. At least it would allow us to do multitexturing, and then we could deprecate those immediate mode style functions I added. While we're at it, we also need to deprecate quads as a primitive type for models.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 03, 2013, 11:40:49 am
Quote
but I had to make each vertex take color and alpha and shit even when not passed.
That is what this (http://enigma-dev.org/forums/index.php?topic=1375.0) topic was all about.

Quote
While we're at it, we also need to deprecate quads as a primitive type for models.
I don't really plan to work on model functions all that much, so if you are willing then you can do that. I don't use them. My goal is to make ENIGMA not use any deprecated functions (when run through gDEBugger) when drawing a simple 2D game. This would make a GLES port a lot easier to make.
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on August 03, 2013, 06:59:18 pm
Quote
That is what this topic was all about.
I know :(
Quote
I don't really plan to work on model functions all that much, so if you are willing then you can do that. I don't use them. My goal is to make ENIGMA not use any deprecated functions (when run through gDEBugger) when drawing a simple 2D game. This would make a GLES port a lot easier to make.
Yes, agreed. In Direct3D9 I can't even find the constant for a triangle fan; it is really best to just use triangle lists for everything, with multitexturing and stuff. When I get time I will sit down and do it.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 06, 2013, 11:31:52 am
Okay, what? A few million instances segfaults ENIGMA? Or did you actually mean objects? The difference is huge. IDs 0-1000000 are used for objects; IDs 1000001+ are used for instances. So if you have more than one million objects, GM and ENIGMA break.

This is one of Undermars' biggest oversights. I'm open to suggestions, but I promise they'll all involve the IDE.
Title: Re: GL3 changes from immediate to retained mode
Post by: polygone on August 06, 2013, 11:34:03 am
Who the fuck is going to use a million objects?
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 06, 2013, 11:48:02 am
I meant instances.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 06, 2013, 12:04:02 pm
Ah, all right. Anyway, I don't store objects in a Lua map. I'm not sure what happens when you have a million instances; I'll have to investigate later.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 06, 2013, 01:09:26 pm
Million works fine. 1.2mill and over broke on my machine.

Also, why are Robert's git commits showing in the future? It showed me it is the 7th of August, and all my newer commits are behind his on the 6th. Is it because his clock is wrong or something? Because I have problems with 1 hour offsets in almost every program while system time is correct. Cannot figure that shit out.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 08, 2013, 05:41:31 pm
It may be, or it may be a timezone thing. I'm not certain. I know GitHub has had problems with time zones in the past, but I'm not sure as to why.

Anyway, 1.2 million instances seems like relatively few to be giving you such severe problems. How much memory is ENIGMA using before the collapse? If it's using like 2GB, then I have good news and bad news. The good news is, that's the maximum addressable space for a 32-bit program (which all ENIGMA games are on Windows). The bad news is that 1.2 million objects are using 2GB of RAM. What the fuck?

If it's not using 2GB of memory, then I have good news and bad news. The bad news is, something's terribly wrong with ENIGMA's instance system (which cheesewad has been hinting at in his other topic). The good news is, it's not using that much RAM.

Let me know which scenario we're looking at.
Title: Re: GL3 changes from immediate to retained mode
Post by: Goombert on August 08, 2013, 07:51:13 pm
Quote
Also, why are Robert's git commits showing in the future?
I live in Pennsylvania, today is August 9th 2013 at 8:51 P.M.
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 09, 2013, 01:13:31 am
Quote
Let me know which scenario we're looking at.
Right now I cannot test on my home PC (on which the original number was based), but on my laptop I get about 960mb of RAM filled with 1mil. objects. If I do drawing (especially GL3, which has batching) I get up to 1.1gig. With 1.2mil. it takes about 1.15gig, and with drawing it goes up to 1.4gb. But my previous test also used some simple logic, which could increase the size. Also, I don't think it crashes only when going past the 2gb mark, as reallocation can fail way before that.

So basically right now 1 instance takes quite an amount. On VBO side (which also takes like 300mb with 1.2mil. quads) I could just set an upper limit and then make it draw before batching is done. Normally you wouldn't draw 1mil. things on the screen at once and even if you do, then normally you wouldn't be able to batch all of them. The same with instances though. So while it would be nice to show an error or something (at least in debug), I don't think any real change is needed.

Quote
I live in Pennsylvania, today is August 9th 2013 at 8:51 P.M.
If only the ordering is changed, then I am alright (although it still could be a pain in the ass when needing to revert). I just feared for a minute that the changes are also applied in reverse, and that would fubar everything.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 09, 2013, 12:07:47 pm
Pause. Are you storing the batching info in each instance?
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 09, 2013, 02:06:34 pm
No, I store it in a global VBO. I don't do anything with instances at all. The drawing functions just populate the global vector, which is then rendered.
Title: Re: GL3 changes from immediate to retained mode
Post by: Josh @ Dreamland on August 09, 2013, 03:15:07 pm
Then how's that memory usage getting so big? You don't use one giant fixed size, do you?
Title: Re: GL3 changes from immediate to retained mode
Post by: TheExDeus on August 10, 2013, 04:44:40 am
If you are talking about the 1.15gb, then that is without drawing. That size is just instances, so the VBO isn't the problem. The VBO takes 250mb for 1.2mil. sprites, which is ok considering it takes 4 vertices per sprite and 8 attributes per vertex (all of them floats), as well as 6 indices per sprite (ints).
And no, it is not fixed size. It grows depending on what you draw (every time it's too small and has to resize, I resize it so it fits 5 more sprites), but I don't downsize it right now (which doesn't matter in the batching tests).
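
As a sanity check on those numbers, a small sketch of the arithmetic, assuming 4-byte floats and 4-byte indices. The function names are made up for this example. It gives 4 * 8 * 4 + 6 * 4 = 152 bytes per sprite, or about 182.4 million bytes for 1.2 million sprites. That is the raw payload only; the gap to the observed 250mb is not explained by this arithmetic, and spare vector capacity or driver-side copies are only guesses at the cause.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout described in the post: 4 vertices per sprite, 8 float attributes
// per vertex, 6 integer indices per sprite (two triangles).
constexpr std::size_t kVerticesPerSprite = 4;
constexpr std::size_t kAttribsPerVertex = 8;
constexpr std::size_t kIndicesPerSprite = 6;

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Vertex bytes plus index bytes for a single sprite.
constexpr std::size_t bytes_per_sprite() {
    return kVerticesPerSprite * kAttribsPerVertex * sizeof(float)
         + kIndicesPerSprite * sizeof(std::uint32_t);
}

// Raw payload for `sprites` batched sprites, ignoring spare vector capacity.
inline std::uint64_t payload_bytes(std::uint64_t sprites) {
    assert(sprites <= UINT64_MAX / bytes_per_sprite());
    return sprites * bytes_per_sprite();
}
```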