Wijitmaker, on 05 April 2012 - 04:35 PM, said:
Old mesh:
* Triangles: 390
* Vertexes: 302
* Model triangles drawn: 798,720
* Vertex buffers allocated: 10,372,852 bytes
New mesh:
* Triangles: 6656
* Vertexes: 3402
* Model triangles drawn: 13,631,488
* Vertex buffers allocated: 112,016,048 bytes
Old mesh / GeForce 560 Ti:
* Total frame time: 12.5 msec/frame
* Time in "prepare models": 3.5 msec/frame
* Total frame time when paused: 2.5 msec/frame
New mesh / GeForce 560 Ti:
* Total frame time: 45.5 msec/frame
* Time in "prepare models": 38.0 msec/frame
* Total frame time when paused: 24.0 msec/frame
Old mesh / Intel HD Graphics 3000:
* Total frame time: 26 msec/frame
* Time in "prepare models": 3.5 msec/frame
* Total frame time when paused: 17 msec/frame
New mesh / Intel HD Graphics 3000:
* Total frame time: 145 msec/frame
* Time in "prepare models": 100 msec/frame
* Total frame time when paused: 130 msec/frame
The new mesh has 17x as many triangles and 11x as many vertexes as the old one. Vertex buffers take 32 bytes per vertex, for each instance of the mesh.
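As a rough sanity check on those ratios and buffer sizes, here is a small sketch. The instance count of 1024 is taken from the unit count in the test scene; the function name is just for illustration.

```python
# Mesh statistics from the measurements above.
OLD = {"tris": 390, "verts": 302}
NEW = {"tris": 6656, "verts": 3402}

BYTES_PER_VERTEX = 32  # per vertex, per instance of the mesh
INSTANCES = 1024       # assumption: the 1024 units in the test scene


def vertex_buffer_bytes(verts, instances=INSTANCES):
    """Estimated vertex data for all instances of one mesh."""
    return verts * BYTES_PER_VERTEX * instances


tri_ratio = NEW["tris"] / OLD["tris"]
vert_ratio = NEW["verts"] / OLD["verts"]

print(f"triangles: {tri_ratio:.1f}x, vertexes: {vert_ratio:.1f}x")
print(f"old mesh: {vertex_buffer_bytes(OLD['verts']):,} bytes")
print(f"new mesh: {vertex_buffer_bytes(NEW['verts']):,} bytes")
```

The estimates come out slightly below the measured allocations (10,372,852 and 112,016,048 bytes). The remainder is presumably other geometry in the scene, though that part is a guess.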
"Total frame time" is limited by the CPU or GPU, whichever is slower (since they run in parallel).
The time in "prepare models" is the CPU cost of the skinning computation and vertex data upload. In the "New mesh / GeForce 560 Ti" case, "prepare models" is about 60% skinning and 40% upload. (Skinning should have the same cost in the Intel HD 3000 case, but the upload is much slower.)
"Total frame time when paused" means the meshes aren't animating, so there's no skinning or vertex data upload - it's basically just the GPU cost of rendering all the triangles.
Based on the paused times (13.6M triangles in 24 msec and in 130 msec), the GF560Ti can render about 600M tri/sec and the HD3000 about 100M tri/sec. Those figures sound vaguely plausible so I'll assume they're right. If we want 30fps on HD3000, that means at most 3M tri/frame. With the new 6656-tri mesh (keeping shadows enabled, so each unit appears to be drawn twice, and ignoring props, buildings and trees, which will eat into the polygon count), we could have ~200 units on screen at once before hitting the triangle count limit. Half as many triangles would allow twice as many units.
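The arithmetic behind those estimates can be sketched like this. The factor of 2 for shadows is inferred from the numbers (13,631,488 / 6656 = 2048 = 2 x 1024 units), not measured directly, and the helper names are made up for the example.

```python
TRIS_DRAWN = 13_631_488  # model triangles drawn per frame, new mesh
TRIS_PER_UNIT = 6656
PASSES = 2  # inferred: main pass + shadow pass for each unit


def tris_per_second(tris_drawn, paused_ms):
    """GPU throughput estimated from the paused (render-only) frame time."""
    return tris_drawn / (paused_ms / 1000.0)


def max_units(tris_per_frame, tris_per_unit, passes=PASSES):
    """How many units fit in a per-frame triangle budget."""
    return tris_per_frame // (tris_per_unit * passes)


gf560_rate = tris_per_second(TRIS_DRAWN, 24.0)
hd3000_rate = tris_per_second(TRIS_DRAWN, 130.0)
print(f"GF560Ti: {gf560_rate / 1e6:.0f}M tri/sec")
print(f"HD3000:  {hd3000_rate / 1e6:.0f}M tri/sec")

BUDGET = 3_000_000  # ~100M tri/sec at 30fps, rounded down
print("units at 6656 tris:", max_units(BUDGET, TRIS_PER_UNIT))
print("units at 3328 tris:", max_units(BUDGET, TRIS_PER_UNIT // 2))
```

This only counts unit triangles, so anything else in the scene reduces the real limit.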
Independent of the triangle rendering, the CPU skinning takes about 25 msec/frame for these 1024 units, so 200 units should be ~5 msec/frame. This is a fairly fast CPU, so multiply by perhaps 2 for a reasonable lower-end CPU. Running at 60fps means we only have 16 msec/frame in total, and 5 msec (or 10 msec) is a big chunk of that. So I think we'd be primarily limited by CPU skinning cost before being limited by triangle rendering cost, except on especially slow GPUs and fast CPUs.
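To put numbers on that, here is a minimal sketch that assumes skinning cost scales linearly with unit count. The 2x slowdown for a lower-end CPU is the rough guess from above, not a measurement.

```python
SKINNING_MS_1024_UNITS = 25.0      # approximate measured CPU skinning cost
FRAME_BUDGET_60FPS_MS = 1000 / 60  # ~16.7 msec


def skinning_ms(units, cpu_slowdown=1.0):
    """Skinning time per frame, assuming linear scaling with unit count."""
    return SKINNING_MS_1024_UNITS * units / 1024 * cpu_slowdown


for slowdown in (1.0, 2.0):
    ms = skinning_ms(200, slowdown)
    share = ms / FRAME_BUDGET_60FPS_MS
    print(f"slowdown {slowdown:.0f}x: {ms:.1f} msec/frame "
          f"({share:.0%} of a 60fps frame)")
```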
Vertex data upload seems unpleasantly expensive. 100MB of vertex data per frame at 60fps is approaching the PCIe x16 bandwidth limit, so that'll never work especially well, and with smaller numbers of units it's still a lot of bandwidth. I think our current vertex data upload code is somewhat inefficient (it updates lots of tiny chunks instead of throwing out the entire vertex buffer each frame, which probably prevents some driver optimisations) and could be improved, but that wouldn't solve the fundamental bandwidth problem.
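A quick estimate of the bandwidth involved, as a sketch. It assumes a PCIe 2.0 x16 link (theoretical 8 GB/s in each direction), which is a guess about the test machine, and it assumes upload volume scales linearly with unit count.

```python
VERTEX_BYTES_PER_FRAME = 112_016_048  # new mesh, 1024 units
FPS = 60
# Assumption: PCIe 2.0 x16, 500 MB/s per lane x 16 lanes (theoretical peak).
PCIE2_X16_BYTES_PER_SEC = 8e9

required_bytes_per_sec = VERTEX_BYTES_PER_FRAME * FPS
fraction_of_link = required_bytes_per_sec / PCIE2_X16_BYTES_PER_SEC
print(f"1024 units: {required_bytes_per_sec / 1e9:.2f} GB/s "
      f"({fraction_of_link:.0%} of the theoretical link bandwidth)")

# Scaled down to 200 units.
bytes_per_sec_200 = VERTEX_BYTES_PER_FRAME * 200 / 1024 * FPS
print(f"200 units:  {bytes_per_sec_200 / 1e9:.2f} GB/s")
```

Real-world transfer rates are lower than the theoretical peak, so the actual headroom would be smaller than this suggests.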
So... I don't think the 6656-tri mesh is obscenely high resolution, but it's a bit too much if we want 200 units on screen at once (and much too much if we want more). But what we should really try is to do skinning on the GPU instead of on the CPU - that wouldn't increase the GPU's maximum renderable tris/sec, but it would eliminate the CPU skinning cost (at the expense of putting more load on the GPU vertex shaders) and would also eliminate the vertex data upload. That shouldn't be technically complex (I hope), so I suppose I'll experiment with that to see how it influences performance. With that data it should be possible to make a more informed tradeoff between gameplay design (number of units) and art design (number of triangles per unit).
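For reference, this is roughly the per-vertex work that would move from the CPU into a vertex shader. It's a minimal sketch assuming the usual linear blend skinning; the engine's actual skinning code may differ, and all the names and the two example bones are made up for illustration.

```python
def transform(bone, p):
    """Apply a 3x4 affine bone matrix (rows of [r0, r1, r2, t]) to point p."""
    x, y, z = p
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in bone)


def skin_vertex(bind_pos, influences, bone_matrices):
    """Linear blend skinning: weighted sum of the position under each bone.

    influences is a list of (bone_index, weight) pairs whose weights sum to 1.
    """
    out = [0.0, 0.0, 0.0]
    for bone_index, weight in influences:
        moved = transform(bone_matrices[bone_index], bind_pos)
        for i in range(3):
            out[i] += weight * moved[i]
    return tuple(out)


# Two hypothetical bones: one at rest, one translated by 2 along x.
IDENTITY = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0))
SHIFT_X = ((1, 0, 0, 2), (0, 1, 0, 0), (0, 0, 1, 0))
bones = [IDENTITY, SHIFT_X]

# A vertex influenced equally by both bones ends up halfway between them.
print(skin_vertex((1.0, 0.0, 0.0), [(0, 0.5), (1, 0.5)], bones))
```

On the CPU this loop runs for every vertex of every animated instance each frame and the results get uploaded. With GPU skinning the bind-pose vertexes and weights would stay in a static buffer, and only the bone matrices would need to be sent each frame.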












