So I guess it's time for me to chime in. There are some things discussed here that I know the answers to, but they aren't public yet. Sorry if I seem to ignore some points because of that.
Compilers: Yes, IA-64 depends a lot on its compilers. But it isn't stupid without them. For example, branch prediction is very good in McKinley. One stat I remember is that McKinley predicts and fetches indirect branches about 10% of the time, while everything else out there is closer to 1%. That's huge, since it means McKinley prefetches that code from main memory way ahead of time. That's a pretty good job of "turning corners."
Predication is also really cool. It's kind of like tagging assembly code with (if A then B else C). It sounds kind of like branch prediction, but predication works a lot like a conditional branch without the fetching overhead. McKinley can start executing both B and C, leaving things incomplete where it has to, and as soon as it can figure out A, it discards B or C accordingly. IA-64 supports this directly, and a good compiler will do a better job of tuning the predicates to boost performance.
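To make the (if A then B else C) idea concrete, here is a rough sketch in Python. It is only a model of the data flow, not IA-64 code: the function names and the arithmetic for B and C are made up for illustration. The predicated version computes both arms and uses a pair of complementary predicates to decide which result is kept, so there is no branch in the selection.

```python
def branchy(a, x):
    """Ordinary control flow: only one arm is ever executed."""
    if a:
        return x + 1   # arm B (hypothetical work)
    return x * 2       # arm C (hypothetical work)


def predicated(a, x):
    """Toy model of predication: both arms run, predicates pick the survivor."""
    p_b = int(bool(a))  # predicate guarding arm B (1 if A is true)
    p_c = 1 - p_b       # complementary predicate guarding arm C
    b = x + 1           # arm B is computed regardless of A
    c = x * 2           # arm C is computed regardless of A
    # The result whose predicate is 0 contributes nothing, i.e. it is discarded.
    return p_b * b + p_c * c


if __name__ == "__main__":
    for a in (True, False):
        print(a, branchy(a, 3), predicated(a, 3))
```

Both functions return the same values; the difference is that the second one never has to wait on A before starting B and C.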
Also, IA-64 has loop flattening (parallelism). So if you wrote for (i = 0; i < 10; i++) etc., most processors will have to branch back 10 times and iterate over the loop 10 times. IA-64 will run the cases in parallel, even though it might be a recursive loop, and merge the answer at the end. That might take about as long as 2-3 iterations of the loop would, instead of 10. When you think about how much of your processor's intensive time is spent in loops, that's a big plus.
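Here is a small sketch of the "run the cases in parallel and merge at the end" idea, written in Python purely to show the dependency structure. Python itself won't run these in parallel, and the lane count of 4 is an assumption picked for illustration, not a McKinley figure. Splitting a 10-iteration sum across independent accumulators shortens the longest chain of dependent additions from 10 to 3.

```python
def serial_sum(values):
    """One accumulator: every addition depends on the previous one."""
    total = 0
    for v in values:
        total += v
    return total


def interleaved_sum(values, lanes=4):
    """Independent accumulators, merged once at the end.

    Each lane only depends on its own earlier additions, so hardware with
    enough execution units could advance all lanes at the same time.
    """
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum(partial)  # the final merge step


def longest_chain(n_iterations, lanes=4):
    """Dependent additions per lane before the merge (ceiling division)."""
    return -(-n_iterations // lanes)


if __name__ == "__main__":
    data = list(range(10))
    print(serial_sum(data), interleaved_sum(data), longest_chain(len(data)))
```

This only works so simply because addition can be reordered freely; a loop where each pass truly needs the previous result takes more care.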
IA-64 is really built with speculative processing in mind, but I can't get into what the future holds there too much. IA-64 does have advanced loads though, tracked in the ALAT (Advanced Load Address Table), which help a lot if the compiler can use them well. When advanced loads are in the code, it's like saying "get this if you have time." So, if McKinley is doing a bunch of loads and stores for the code it's dealing with at the moment, it may see the advanced load and ignore it, but when it's got some free memory bandwidth, it will remember the ALAT entry and prefetch something from memory before it is needed. You can throw a bunch of advanced loads into the code, and their ALAT entries can be invalidated easily, so that if you don't need them, no harm, no foul.
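For the "can be invalidated easily" part, here is a toy model in Python of how I understand the table to behave: an advanced load records its address, a later store to the same address knocks the entry out, and a check either keeps the early value or reloads. The class and method names are made up for this sketch, and real hardware tracks more than this (sizes, partial address tags, limited capacity).

```python
class ToyALAT:
    """Very simplified model of an advanced load address table."""

    def __init__(self):
        self.entries = {}  # target register name -> address it was loaded from

    def advanced_load(self, reg, addr, memory):
        """Load early and remember which address the value came from."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """Write memory and invalidate any entry that matches the address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, reg, addr, early_value, memory):
        """Keep the early value if the entry survived, otherwise reload."""
        if self.entries.get(reg) == addr:
            return early_value
        return memory[addr]


if __name__ == "__main__":
    memory = {0x10: 5, 0x20: 7}
    alat = ToyALAT()
    early = alat.advanced_load("r1", 0x10, memory)
    alat.store(0x20, 9, memory)   # unrelated store, entry survives
    print(alat.check_load("r1", 0x10, early, memory))
    alat.store(0x10, 42, memory)  # conflicting store, entry is dropped
    print(alat.check_load("r1", 0x10, early, memory))
```

The point is that a wrong guess costs one reload instead of a wrong answer, which is why the compiler can afford to be aggressive about hoisting loads.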
A lot of this stuff is covered lightly here: http://cpus.hp.com/technical_referen...64_arch_wp.pdf
McKinley isn't that much hotter because of things like parallel execution. It's hotter because it's frigging huge, and on a 180nm process. 3MB of single-cycle cache burns a hell of a lot of power, and having the world's fastest, most aggressive FPU doesn't help either.
On the other hand, it does mean that the McKinley SETI@Home client is ****loads faster than anybody else's. And Madison is coming, along with later IA-64 implementations. Power should get better, while you'll see everyone else's power consumption rise over time.


