Name: John Carmack
Project: Quake 3 Arena
This is something I have been preaching for a couple years, but I
finally got around to setting all the issues down in writing.
First, the statement:
Virtualized video card local memory is The Right Thing.
Now, the argument (and a whole bunch of tertiary information):
If you had all the texture density in the world, how much texture
memory would be needed on each frame?
For directly viewed textures, mip mapping keeps the amount of
referenced texels between one and one quarter of the drawn pixels.
When anisotropic viewing angles and upper level clamping are taken into
account, the number gets smaller. Take 1/3 as a conservative estimate.
Given a fairly aggressive six texture passes over the entire screen,
that equates to needing twice as many texels as pixels. At 1024x768
resolution, well under two million texels will be referenced, no matter
what the finest level of detail is. This is the worst case, assuming
completely unique texturing with no repeating. More commonly, less
than one million texels are actually needed.
As anyone who has tried to run certain Quake 3 levels in high quality
texture mode on an eight or sixteen meg card knows, it doesnít work out
that way in practice. There is a fixable part and some more
fundamental parts to the fall-over-dead-with-too-many-textures problem.
The fixable part is that almost all drivers perform pure LRU (least
recently used) memory management. This works correctly as long as the
total amount of textures needed for a given frame fits in the cardís
memory after they have been loaded. As soon as you need a tiny bit
more memory than fits on the card, you fall off of a performance cliff.
If you need 14 megs of textures to render a frame, and your graphics
card has 12 megs available after its frame buffers, you wind up loading
14 megs of texture data over the bus every frame, instead of just the 2
megs that donít fit. Having the cpu generate 14 megs of command
traffic can drop you way into the single digit frame rates on most
If an application makes reasonable effort to group rendering by
texture, and there is some degree of coherence in the order of texture
references between frames, much better performance can be gotten with a
swapping algorithm that changes its behavior instead of going into a
While ( memory allocation for new texture fails )
Find the least recently used texture.
If the LRU texture was not needed in the previous frame,
Free the most recently used texture that isnít bound to an
active texture unit
Freeing the MRU texture seems counterintuitive, but what it does is
cause the driver to use the last bit of memory as a sort of scratchpad
that gets constantly overwritten when there isnít enough space. Pure
LRU plows over all the other textures that are very likely going to be
needed at the beginning of the next frame, which will then plow over
all the textures that were loaded on top of them.
If an application uses textures in a completely random order, any given
replacement policy has the some effectÖ
Texture priority for swapping is a non-feature. There is NO benefit to
attempting to statically prioritize textures for swapping. Either a
texture is going to be referenced in the next frame, or it isnít.
There arenít any useful gradations in between. The only hint that
would be useful would be a notice that a given texture is not going to
be in the next frame, and that just doesnít come up very often or cover
very many texels.
With the MRU-on-thrash texture swapping policy, things degrade
gracefully as the total amount of textures increase but due to several
issues, the total amount of textures calculated and swapped is far
larger than the actual amount of texels referenced to draw pixels.
The primary problem is that textures are loaded as a complete unit,
from the smallest mip map level all the way up to potentially a 2048 by
2048 top level image. Even if you are only seeing 16 pixels of it off
in the distance, the entire 12 meg stack might need to be loaded.
Packing can also cause some amount of wasted texture memory. When you
want to load a two meg texture, it is likely going to require a lot
more than just two megs of free texture memory, because a lot of it is
going to be scattered around in 8k to 64k blocks. At the pathological
limit, this can waste half your texture memory, but more reasonably it
is only going to be 10% or so, and cause a few extra texture swap outs.
On a frame at a time basis, there are often significant amounts of
texels even in referenced mip levels that are not seen. The back sides
of characters, and large textures on floors can often have less than
50% of their texels used during a frame. This is only an issue as they
are being swapped in, because they will very likely be needed within
the next few frames. The result is one big hitch instead of a steady
There are schemes that can help with these problems, but they have
Packing losses can be addressed with compaction, but that has rarely
proven to be worthwhile in the history of memory management. A 128-bit
graphics accelerator could compact and sort 10 megs of texture memory
in about 10 msec if desired.
The problems with large textures can be solved by just not using large
textures. Both packing losses, and non- referenced texels can be
reduced by chopping everything up into 64x64 or 128x128 textures. This
requires preprocessing, adds geometry, and requires messy overlap of
the textures to avoid seaming problems.
It is possible to estimate which mip levels will actually be needed and
only swap those in. An application canít calculate exactly the mip
map levels that will be referenced by the hardware, because there are
slight variations between chips and the slope calculation would add
significant processing overhead. A conservative upper bound can be
taken by looking at the minimum normal distance of any vertex
referencing a given texture in a frame. This will overestimate the
required textures by 2x or so and still leave a big hit when the top
mip level loads for big textures, but it can allow giant cathedral
style scenes to render without swapping.
Clever programmers can always work harder to overcome obstacles, but in
this case, there is a clear hardware solution that gives better
performance than anything possible with software and just makes
everyoneís lives easier: virtualize the cardís view of its local
With page tables, address fragmentation isnít an issue, and with the
graphics rasterizer only causing a page load when something from that
exact 4k block is needed, the mip level problems and hidden texture
problems just go away. Nothing sneaky has to be done by the
application or driver, you just manage page indexes.
The hardware requirements are not very heavy. You need translation
lookaside buffers (TLB) on the graphics chip, the ability to
automatically load the TLB from a page table set up in local memory,
and the ability to move a page from AGP or PCI into graphics memory and
update the page tables and reference counts. You donít even need that
many TLB, because graphics access patterns donít hop all over the place
like CPU access can. Even with only a single TLB for each texture
bilerp unit, reloads would only account for about 1/32 of the memory
access if the textures were 4k blocked. All you would really want at
the upper limit would be enough TLB for each texture unit to cover the
texels referenced on a typical rasterization scan line.
Some programmers will say ďI donít want the system to manage the
textures, I want full control!Ē There are a couple responses to that.
First, a page level management scheme has flexibility that you just
canít get with a software only scheme, so it is a set of brand new
capabilities. Second, you can still just choose to treat it as a fixed
size texture buffer and manage everything yourself with updates.
Third, even if it WAS slower than the craftiest possible software
scheme (and I seriously doubt it), so much of development is about
willingly trading theoretical efficiency for quicker, more robust
development. We donít code overlays in assembly language any moreÖ
Some hardware designers will say something along the lines of ďBut
the graphics engine goes idle when you are pulling the page over from
AGP!Ē Sure, you are always better off to just have enough texture
memory and never swap, and this feature wouldnít let you claim any more
megapixels or megatris, but every card winds up not having enough
memory at some point. Ignoring those real world cases isnít helping
your customers. In any case, it goes idle a hell of a lot less than if
you were loading the entire texture over the command fifo.
3Dlabs is supposed to have some form of virtual memory management in
the permedia 3, but I am not familiar with the details (if anyone from
3dlabs wants to send me the latest register specs, I would appreciate
A mouse controlled first person shooter is fairly unique in how quickly
it can change the texture composition of a scene. A 180-degree snap
turn can conceivably bring in a completely different set of textures on
a subsequent frame. Almost all other graphics applications bring
textures in at a much steadier pace.
So, given that 180-degree snap turn to a completely different and
uniquely textured scene, what would be the worst case performance? An
AGP 2x bus is theoretically supposed to have over 500 mb/sec of
bandwidth. It doesnít get that high in practice, but linear 4k block
reads would give it the best possible conditions, and even at 300
mb/sec, reloading the entire texture working set would only take 10
Rendering is not likely to be buffered sufficiently to overlap
appreciably with page loading, and the command transport for a complex
scene will take significant time by itself, so it shows that a worst
case scene will often not be able to be rendered in 1/60th of a second.
This is roughly the same lower bound that you get from a chip texturing
directly from AGP memory. A direct AGP texture gains the benefit of
fine-grained rendering overlap, but loses the benefit of subsequent
references being in faster memory (outside of small on-chip caches).
A direct AGP texture engine doesnít have the higher upper bounds of a
cached texture engine, though. Itís best and worst case are similar
(generally a good thing), but the cached system can bring several times
more bandwidth to bear when it isnít forced to swap anything in.
The important point is that the lower performance bound is almost an
order of magnitude faster than swapping in the textures as a unit by
If you just positively couldnít deal with the chance of that much worst
case delay, some form of mip level biasing could be made to kick in, or
you could try and do pre-touching, but I donít think it would ever be
worth it. The worst imaginable case is acceptable, and you just wonít
hit that case very often.
Unless a truly large number of TLB are provided, the textures would
need to be blocked. The reason is that with a linear texture, a 4k
page maps to only a couple scan lines on very large textures. If you
are going with the grain you get great reuse, but if you go across it,
you wind up referencing a new page every couple texel accesses. What
is wanted is an addressing mechanism that converts a 4k page into a
square area in the texture, so the page access is roughly constant for
all orientations. There is also a benefit from having a 128 bit access
also map to a square block of pixels, which several existing cards
already do. The same interleaving-of-low-order-bits approach can just
be extended a few more bits.
Dealing with blocked texture patterns is a hassle for a driver writer,
but most graphics chips have a host blit capability that should let the
chip deal with changing a linear blit into blocked writes. Application
developers should never know about it, in any case.
There are some other interesting things that could be done if the page
tables could trigger a cpu interrupt in addition to being automatically
backed by AGP or PCI memory. Textures could be paged in directly from
disk for truly huge settings, or decompressed from jpeg blocks, or even
procedurally generated. Even the size limits of the AGP aperture could
usefully be avoided if the driver wanted to manage each pageís
Aside from all the basic swapping issue, there are a couple of other
hardware trends that push things this way.
Embedded dram should be a driving force. It is possible to put several
megs of extremely high bandwidth dram on a chip or die with a video
controller, but wonít be possible (for a while) to cram a 64 meg
geforce in. With virtualized texturing, the major pressure on memory
is drastically reduced. Even an 8mb card would be sufficient for 16
bit 1024x768 or 32 bit 800x600 gaming, no matter what the texture load.
The only thing that prevents a geometry processor based card from
turning almost any set of commands in a display list into a single
static dma buffer is the fact that textures may be swapped in and out,
causing the register programming in the buffer to be wrong. With
virtual texture addressing, a textureís address never changes, and an
arbitrarily complex model can be described in a static dma buffer.
Copyright © Square
Eight 1998-2015. All Rights Reserved.
The BlueTracker is provided by Webdog.
We are not responsible for the content of the .plans displayed here.