Hammer Debuts
-
As I understand it ... only at 800-1000 MHz!!
Yes, I know it's the A0 revision, but this still makes me a bit sceptical about the rumoured 2 GHz.
Last edited by Kosh Naranek; 27 February 2002, 02:56.
Fear, Makes Wise Men Foolish !
-
Go check out the other thread (http://forums.murc.ws/showthread.php?s=&threadid=31531), rubank. They've moved the memory controller on-die.
Gigabyte P35-DS3L with a Q6600, 2GB Kingston HyperX (after *3* bad pairs of Crucial Ballistix 1066), Galaxy 8800GT 512MB, SB X-Fi, some drives, and a Dell 2005fpw. Running WinXP.
-
My view on VIA memory controllers has changed since the release of the KT266A. Now that's how a controller is supposed to be designed, almost as good as the old BX. And the KT333 improves even further on the KT266A controller.
Their chipsets do have a whole lot of flaws, but in my opinion the memory controller is not one of them anymore.
-
Some thoughts on latency, prompted by the Ace's Hardware article:
http://www.aceshardware.com/read.jsp?id=45000308
By: elazardo $$$$$
08 Mar 2002, 08:03 AM EST Msg. 80343 of 80352
(This msg. is a reply to 80341 by q_azure_q.)
milo, I think that the effect is likely even more pronounced than you infer.
Just for anyone who might care:
To really see the effects, calculate net throughput. Let's say that the time to refill a cache line is
approximately 100 ns (core to northbridge to memory to northbridge back to core). If a cache hit returns
data in 2 clock cycles, and we have a 96% hit rate, then the average cycle count is:
( 0.96 X 2 ) + ((100E-9)*2E9 X 0.04) = 9.92 cycles ave.
Useable work = 2E9/9.92 = 2.0E8.
Now, change the numbers to 3GHz:
( 0.96 X 2 ) + ((100E-9)*3E9 X 0.04) = 13.92 cycles ave.
Useable work = 3E9/13.92 = 2.2E8
A 50% increase in processor speed results in only about a 7% gain in net throughput (10% if you use the rounded figures).
Improving the cache turn-around to 1 clock cycle makes little difference:
2E9/8.96 = 2.2E8, and
3E9/12.96 = 2.3E8
We only improved by 5%. The main memory latency totally dominates.
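The arithmetic above can be re-run with a small script. This is only a sketch of the model as the post describes it (one cache level, a fixed miss latency); the function and parameter names are made up here for illustration and do not come from the post.

```python
def avg_cycles(clock_hz, hit_rate, hit_cycles, miss_latency_s):
    """Average core cycles per access: hit term plus miss term."""
    # Memory latency is fixed in seconds, so it costs more cycles at higher clocks.
    miss_cycles = miss_latency_s * clock_hz
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def throughput(clock_hz, hit_rate, hit_cycles, miss_latency_s):
    """'Useable work' per second: clock rate divided by average cycles."""
    return clock_hz / avg_cycles(clock_hz, hit_rate, hit_cycles, miss_latency_s)


# Numbers from the post: 96% hit rate, 2-cycle hit, 100 ns miss.
base = throughput(2e9, 0.96, 2, 100e-9)
fast = throughput(3e9, 0.96, 2, 100e-9)

print(f"2 GHz: {avg_cycles(2e9, 0.96, 2, 100e-9):.2f} cycles, {base:.3e}/s")
print(f"3 GHz: {avg_cycles(3e9, 0.96, 2, 100e-9):.2f} cycles, {fast:.3e}/s")
print(f"gain from +50% clock: {100 * (fast / base - 1):.1f}%")
```

Changing `hit_cycles` to 1 reproduces the 8.96 and 12.96 cycle figures for the faster cache.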
Now, if you can reduce the memory delay to 60ns total from the core to returned data, and you can
operate at lower frequencies, the ABSOLUTE throughput improves assuming equal work for an equal
number of unblocked instructions:
( 0.96 X 2 ) + ( 60E-9 * 1.5E9 * 0.04 ) = 5.52
1.5E9 / 5.52 = 2.7E8
vs
( 0.96 X 2 ) + ( 60E-9 * 2.3E9 * 0.04 ) = 7.44
2.3E9 / 7.44 = 3.1E8
Here, we see a 1.5 GHz device with a 2-clock cache (2.7E8) outperforming a 3 GHz device with a 1-clock
cache (2.3E8) because of the latency issue, even with high cache hit rates. In real life the situation is
a little better, but not a lot. When the core can chew up more than 1 instruction per ns, waiting dozens
of ns for a cache miss makes higher clock rates almost completely futile.
Getting the cache hit delay down to 1 cycle now gains roughly 15-20% over what we
saw above:
1.5E9 / 4.56 = 3.3E8
or
2.3E9 / 6.48 = 3.5E8
It is the right hand terms that dominate the performance:
Absolute latency, clock rate, and cache miss rate. The only way to significantly improve performance
is to work the absolute latency and cache miss rate, as the clock rate term appears in both the
numerator and denominator at almost equal weight, and so almost cancels itself out. This is where I
think INTC must have lost its mind by going for a combination of high clock rate and a modest cache.
I know they have sophisticated modelling, but something went terribly wrong. It is almost as though
they hired Fleischmann and Pons to evaluate the models.
Regards,
Maybe that's why Jerry Sanders is so confident with Hammer's integrated memory controller.
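The quoted point that the clock term nearly cancels can be checked numerically. In that model, throughput is clock / (hit_rate * hit_cycles + miss_rate * latency * clock), which approaches 1 / (miss_rate * latency) as the clock grows. The sketch below assumes the post's 96% hit rate and 2-cycle hit; the `ceiling` helper and the 10 GHz data point are additions for illustration only.

```python
def throughput(clock_hz, miss_latency_s, hit_rate=0.96, hit_cycles=2):
    """Useable work per second in the quoted single-level cache model."""
    miss_cycles = miss_latency_s * clock_hz
    return clock_hz / (hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles)


def ceiling(miss_latency_s, hit_rate=0.96):
    """Limit of throughput() as the clock goes to infinity."""
    return 1.0 / ((1.0 - hit_rate) * miss_latency_s)


for latency in (100e-9, 60e-9):
    print(f"latency {latency * 1e9:.0f} ns, ceiling {ceiling(latency):.2e}")
    for clock in (1.5e9, 2.3e9, 3e9, 10e9):  # 10 GHz is a hypothetical extreme
        print(f"  {clock / 1e9:4.1f} GHz -> {throughput(clock, latency):.2e}")
```

Under these assumptions the 100 ns configuration tops out near 2.5E8 no matter how fast the core runs, while the 60 ns configuration already exceeds that at 1.5 GHz.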
-
His math is good, but his estimates are bad. It looks like he copied the equations out of an *old* copy of Hennessy & Patterson, but didn't update the variables. A 96% cache hit rate is awful; I'd expect a 98, 99, or 99.5% hit rate. The odd thing is that he talks about Intel going with a modest cache, but then doesn't explore the cache at all in his "demonstrative" equations. Partial credit to him for touching on part of the answer, but not exploring it.
Now, just for kicks, let's work with that 96% rate, and those latency speeds. I'll even give him the 40% improvement he guesstimates on latency. Suppose that instead of putting the memory controller on the die, we put in more cache, and the hit rate improves to 98%. Remember, die space is at a premium, so doing both is almost certainly prohibitively cost ineffective.
The memory controller (100 ns -> 60 ns) decreases the right-hand term (RHT) by 40%, while going from a 96% to a 98% hit rate decreases it by 50% (with a negligible increase in the LHT). Now, factor in the little intangibles: cache bugs are often repairable on the manufacturing floor, and ditching the MMC lowers pin count, therefore your yield goes up significantly.
Granted, you now have a second chip, increasing the cost of the rest of the system, but when a CPU costs hundreds or thousands of dollars, and the motherboard is much less, it's likely to work out better for you.
Of course, this is just an example, and all this depends on the technology used. Also, I understand that he was trying to simplify the demonstration, but there are generally a couple more levels of cache in the way.
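The two options compared above can be put side by side in the same simple model. This is a rough sketch, not a real design study: the 2 GHz clock is an assumption (no clock is fixed in the comparison), the 2-cycle hit time is carried over from the quoted post, and the `terms` helper is a made-up name.

```python
def terms(clock_hz, hit_rate, miss_latency_s, hit_cycles=2):
    """Return (left-hand term, right-hand term) of the average-cycles sum."""
    lht = hit_rate * hit_cycles
    rht = (1.0 - hit_rate) * miss_latency_s * clock_hz
    return lht, rht


CLOCK = 2e9  # assumed clock rate, chosen only for illustration

configs = {
    "baseline (96%, 100 ns)": (0.96, 100e-9),
    "on-die controller (96%, 60 ns)": (0.96, 60e-9),
    "bigger cache (98%, 100 ns)": (0.98, 100e-9),
}

base_rht = terms(CLOCK, 0.96, 100e-9)[1]
for name, (hit_rate, latency) in configs.items():
    lht, rht = terms(CLOCK, hit_rate, latency)
    cut = 100 * (1 - rht / base_rht)  # reduction of the RHT vs. baseline
    print(f"{name}: LHT {lht:.2f}, RHT {rht:.2f} ({cut:.0f}% cut), "
          f"throughput {CLOCK / (lht + rht):.2e}")
```

The percentage cuts in the RHT do not depend on the chosen clock, since the clock scales the baseline and both alternatives equally; only the absolute throughput figures do.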
I'm really not trying to slam the guy; it's just that the answer isn't always clear. Everybody in the business is playing with 3rd and 4th order terms these days, since we pretty much had the easy stuff nailed a long time ago.
Last edited by Wombat; 9 March 2002, 16:00.