you guys are missing something...
register address fields are 5 bits wide in (MIPS-style) 32 bit systems.
31-26  25-21  20-16  15-0
31-26 = opcode
25-21 = source register (rs)
20-16 = source/target register (rt)
15-0 = immediate value (or, in register-type instructions, the target register plus shift/function fields)
that means you have at most 2^5 = 32 registers.
some of those go away for the OS, some are FP registers, some are function-call registers, some are return registers, etc. so the number of registers software can actually work with is smaller than 32.
when you go 64 bit, you can expand the max register count. along with this, the actual register count usually goes up too (x86-64 went from 8 to 16 general-purpose registers, for example), though not necessarily as far as the encoding allows.
so when software runs on the CPU, its loops can be unrolled so that there is more independent work available to fill the pipeline.
for example, this code swaps the contents of 2 arrays of length 100
(a0) = address of array 1
(a1) = address of array 2
t0 - t2 = temporary registers
add t0 <- zero, 100 // sets reg. t0 to 100
loop:
lw t1 <- (a0)
lw t2 <- (a1)
sw t1 -> (a1)
sw t2 -> (a0)
add a0 <- a0, 4
add a1 <- a1, 4
sub t0 <- t0, 1
bne t0, zero -> loop // loop back until the counter hits zero
the loop gets unrolled (by the compiler, or by hand) so that each pass through it executes 2 iterations, using MORE REGISTERS, but ending up executing FEWER instructions.
(a0) = address of array 1
(a1) = address of array 2
t0 - t4 = temporary registers, 2 more than before
add t0 <- zero, 100 // sets reg. t0 to 100
loop:
lw t1 <- (a0)
lw t2 <- (a1)
sw t1 -> (a1)
sw t2 -> (a0)
lw t3 <- 4(a0)
lw t4 <- 4(a1)
sw t3 -> 4(a1)
sw t4 -> 4(a0)
add a0 <- a0, 8
add a1 <- a1, 8
sub t0 <- t0, 2
bne t0, zero -> loop // loop back until the counter hits zero
-------
the first loop will go 100 times
the second loop will go 50 times
the first will execute 8 commands per pass
the second will execute 12 commands per pass
100 * 8 = 800 commands
50 * 12 = 600 commands
600 < 800
so the second loop, the one that uses MORE registers, executes fewer instructions and finishes in less time. meaning it's FASTER.
now this example is simplified, but in the same spirit, more registers can be used to speed up computation.
so a 64 bit CPU, having more registers than a 32 bit CPU, will execute its code in less time, because it has more resources to work with. the code can exploit more parallelism in execution, with less 're-writing temp data to main memory' (register spilling) to make room for the temp data that is needed.
this means that when you compile a program for 32 bit or for 64 bit, the compiler generates DIFFERENT ASSEMBLY CODE for each. the 64 bit assembly code will nearly always be faster than the 32 bit, and in turn, the resulting set of machine instructions for 64 bit will run faster than the 32 bit instructions.
even though the C++ code may be the same, the REAL code that the CPU reads will be very different.
so when UT2003 is compiled for 64 bit, its code is faster than before (32 bit).
THOUGH for identical code, the 64 bit version could be slower. the thing is, though, the code will not be identical. that's the benefit of wider systems. (that and more range and precision in the data)
sure the cache may need to be bigger, but then it'll just be bigger. problem solved. even if it didn't increase, things would still be faster.
-scheherazade
This comment was edited on Nov 21, 18:37.