'Fusion' cards

  • Originally posted by superfly:
    > The main reason why there isn't a big difference in performance as you move up from AGP 1x/2x, etc., is because game developers WILLINGLY limit themselves in such a way as to make their games playable on the average machine (and therefore not hurt potential sales), but the bottleneck exists nonetheless if you want the much higher quality graphics which current T&L-enabled cards and GHz+ CPUs are quite capable of.

    Yes, and developers are praising the one system that actually does have excess bandwidth - the PS2 - for how great it is to create 3d games on.[/sarcasm]

    Tell me again about the one where developers limit themselves to speeds under 266 MB/s, when they have over 1 GB/s of bandwidth, because 1 GB/s supposedly isn't enough.

    > But just like those two developers have mentioned bus limitations, so have many others, like Dave Baranec, lead Freespace 2 developer, who I personally had a talk with in the Freespace 2 forums several months ago regarding that very issue, and he stated the exact same thing as the others.

    If you think Sweeney said the bus was a limitation, then I can see you thinking this FS2 developer (and I hope he's one of the lead programmers) is saying there's a bus limitation currently. So, got a thread link to this intimate discussion you were having?

    I did a little quick research on Dave Baranec. Here's what he said:

    > Well, I came into Freespace 1 about halfway through the project. I did a little bit of everything on it: interface (esp. multiplayer interface), some PXO work, the main hall screen, the realtime voice system (network layer), and assorted misc items. On FS2, I'm the lead programmer.

    So I think "oh, so he must've done extensive testing with FS2 while developing that engine."

    Well, uhh :

    > What improvements have been made to the Freespace engine in FS2?
    >
    > Several things have been changed. We've revamped the core of the bitmap and texture manager so that everything internal to us is 16 bits. This keeps things nice and flat and easy to manage. It also speeds up interface screens considerably and makes texture uploads a wee bit faster. We've also shaken out a lot of D3D-related bugs which were in FS1. Adding 32-bit support helped the nebula effect really sparkle (the alpha blending is so much nicer). And finally, we added the hardware fogging for the nebula effect.

    So, he uhh.. tacked 32-bit colour onto the FS1 engine and fixed some bugs. Google didn't come up with any other games he'd previously worked on. Not exactly a developer whose word I'd take as gospel.. BTW, I had to turn the nebula effect off on my G400 because it was so slow.. heheh, must've been that awful Matrox engineering.

    > Even nVidia has stated that with careful optimization, using an engine built with the GeForce in mind (currently there are none, BTW), and assuming 60 fps (for smooth gameplay), regardless of resolution, color depth or depth complexity (overdraw), you'd be limited to 50,000 polys per frame max, and it isn't because it's pushing the T&L capabilities of the card, it's a bus speed limitation.

    again... link?

    > And you still ignored the fact that DX8 now has vertex compression routines built in.
    > Now, if there's enough bandwidth to go around, then why go to the trouble of developing it, since its only reason for existing is bandwidth savings, nothing more.

    I didn't comment because I don't know anything about what it's for. Maybe it is to reduce the bandwidth taken up by sending vertices over the bus; it's quite possible Microsoft foresees T&L becoming so good that future generations will have bandwidth problems. I can't explain why video card manufacturers rushed to get unstable AGP 4x cards on the market either, but that doesn't mean it alleviated a bottleneck...

    > You mentioned that at 1024x768 the performance of recent video cards starts leveling out because of insufficient bus bandwidth, and I never disagreed with you on that point in the first place; push a card hard enough (resolution, color depth, depth complexity, etc.) and you'll reach the limits of any video card (for the time being anyway) in terms of either fill rate or local bus bandwidth.

    Tell me, how does turning on 32-bit colour or increasing the screen resolution force the computer to send more data to the video card?
    I can see how some video cards (older ones, mostly) can be fill-rate limited, but there's no extra geometry needed to build a scene at a higher resolution/colour depth, just a helluva lot more framebuffer/z-buffer bandwidth (did I mention the framebuffer and z-buffer are in video memory? probably).
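
    To put rough numbers on that, here's a quick back-of-the-envelope sketch. The per-pixel costs are simplified assumptions (one colour write plus one z read and one z write per pixel, 60 fps, no overdraw or texture traffic counted) and the function name is just for illustration - the point is that all of this traffic stays in local video memory and none of it crosses the AGP bus:

    ```python
    # Local video-memory traffic for the framebuffer/z-buffer alone.
    # Assumes 32-bit colour, 32-bit z, one colour write + one z read +
    # one z write per pixel, 60 fps, no overdraw or texturing counted.
    def framebuffer_traffic(width, height, fps=60, bytes_per_pixel=4):
        per_frame = width * height * bytes_per_pixel * 3   # colour write + z read + z write
        return per_frame * fps                             # bytes per second

    for w, h in [(640, 480), (1024, 768), (1600, 1200)]:
        print(f"{w}x{h} @ 60 fps: ~{framebuffer_traffic(w, h) / 1e6:.0f} MB/s of local memory traffic")
    ```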

    [This message has been edited by Rob M. (edited 06 February 2001).]



    • Originally posted by frankymail:
      > Take a look at this schematic I made to explain what I'm talking about below...

      Franky, it's a good idea - that's the whole reasoning behind Rambus. I'll tell you why it doesn't work with (DDR or regular) SDRAM:

      It gets expensive to build a chip with a lot of leads; adding a 2nd channel is simply too hard to do. Also, in order to get the extra bandwidth, you have to have the same information in both banks, so if you want 64 MB of video memory, you'd need to put 128 MB on the card... I'd be a little surprised to see any design waste memory like that at today's prices.



      • Originally posted by Rob M.:
        > Franky, it's a good idea - that's the whole reasoning behind Rambus. I'll tell you why it doesn't work with (DDR or regular) SDRAM:
        >
        > It gets expensive to build a chip with a lot of leads; adding a 2nd channel is simply too hard to do. Also, in order to get the extra bandwidth, you have to have the same information in both banks, so if you want 64 MB of video memory, you'd need to put 128 MB on the card... I'd be a little surprised to see any design waste memory like that at today's prices.

        Not true. 2- and 4-way memory interleaving proves this.

        You have two banks of SDRAM, and all you do is write half of the data to one and half to the other, all at the same time. Just like striping in RAID.
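
        A minimal sketch of that striping idea (the function name and word-level granularity are just for illustration; real controllers stripe on burst boundaries). Note that the two banks add up to the normal total capacity - nothing gets duplicated:

        ```python
        WORD = 4  # bytes per word

        def interleaved_write(banks, addr, data):
            """Stripe a write across N banks, word by word (2-way, 4-way, ...)."""
            n = len(banks)
            for i in range(0, len(data), WORD):
                word = (addr + i) // WORD
                bank = word % n                      # even words -> bank 0, odd words -> bank 1, ...
                offset = (word // n) * WORD          # where that word lands inside its bank
                banks[bank][offset:offset + WORD] = data[i:i + WORD]

        banks = [bytearray(64), bytearray(64)]       # two banks, 128 usable bytes in total
        interleaved_write(banks, 0, bytes(range(16)))
        print(banks[0][:8])   # words 0 and 2 of the write
        print(banks[1][:8])   # words 1 and 3 of the write
        ```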

        Oh, and the 'twice the memory, double the performance, same effective size' idea sounds a bit like 3dfx's logic on the Voodoo 5....
        "And yet, after spending 20+ years trying to evolve the user interface into something better, what's the most powerful improvement Apple was able to make? They finally put a god damned shell back in." -jwz



        • Originally posted by DGhost:
          > OK... implementing it to feed the triangles back to front... how do you do it without a depth buffer? And how does feeding it from back to front perform HSR on a scene?

          If you send triangles from back to front, and all the video card does is rasterize those triangles, then you'll have a properly drawn scene, and the video card would not need a depth buffer. You can implement HSR by storing the vertices for the current frame in main memory, having the CPU calculate intersections, and then sending only the visible triangles to the card (back to front). No depth buffer needed.
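
          Roughly speaking, that's the classic painter's algorithm. A minimal sketch with made-up triangle data (the function names are just for illustration); intersecting or cyclically overlapping triangles would have to be split first, which is where the CPU-side intersection work described above comes in:

          ```python
          def average_depth(tri):
              # tri = ((x, y, z), (x, y, z), (x, y, z)) in view space; larger z = farther away
              return sum(v[2] for v in tri) / 3.0

          def draw_scene(triangles, rasterize):
              # Farthest first, so nearer triangles simply overdraw farther ones:
              # no per-pixel depth buffer is needed anywhere.
              for tri in sorted(triangles, key=average_depth, reverse=True):
                  rasterize(tri)

          scene = [
              ((0, 0, 5.0), (1, 0, 5.0), (0, 1, 5.0)),   # far triangle, drawn first
              ((0, 0, 2.0), (1, 0, 2.0), (0, 1, 2.0)),   # near triangle, drawn last
          ]
          draw_scene(scene, rasterize=lambda tri: print("rasterize", tri))
          ```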

          > The whole idea of HSR is that it *doesn't* render the triangles that are behind others. You keep a depth buffer, and when you issue the command that causes it to render, it goes through, eliminates the triangles that are not visible from the depth buffer (or it does that as you add them - you do the same math, it's just the difference between doing it all at once or as you go), and *then* sends the remaining data in the depth buffer to the renderer. That way it's not having to render triangles that are not visible, or even do anything to them.
          >
          > The way the Kyro renders, it stores the poly data on the card and then takes it into the core in chunks, providing FSAA and HSR capabilities to each of these chunks as it processes them, reducing the amount of memory required to implement a depth buffer and making it easier to perform. Thus the name 'tile-based rendering'. It still has a true depth buffer in the core - but it's designed to handle a relatively small chunk of polys. If nothing else, HSR can be done as it's moving data between the external 'emulated' depth buffer and the internal depth buffer.

          As I figured, we did get a new definition of a depth buffer. Traditional depth buffers store the entire scene, and as I said, what you're describing is a 'mini depth buffer'. I guess technically it stores depth and it is a buffer, so in that sense it is a depth buffer... It's useless to game developers, though, as you can't extract meaningful data from it. It may as well not exist.
          As far as what it does, you're saying... the depth buffer stores the depth of the triangles? OK, I was wrong: I said it only stores the depth of rendered triangles, when in fact every depth buffer except the Kyro's does this. The Kyro's only stores the depth of triangles before it renders them. Again, the Kyro's implementation isn't really a depth buffer in the traditional sense.
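
          For what it's worth, here is a grossly simplified sketch of the tile idea being argued about - not how the actual Kyro hardware works in detail, just the principle: each tile keeps a tiny depth array only while it is being resolved, hidden fragments are rejected before any texturing would be done, and the array is thrown away once the tile is finished. The fragments arrive pre-rasterized here, and the names are purely illustrative:

          ```python
          TILE = 4  # real tiles are bigger (e.g. 32x16 pixels); 4x4 keeps the sketch small

          def resolve_tile(fragments):
              """fragments: list of (x, y, depth, colour) covering this tile."""
              depth  = [[float("inf")] * TILE for _ in range(TILE)]   # tiny, fits on-chip
              colour = [[None] * TILE for _ in range(TILE)]
              for x, y, z, c in fragments:
                  if z < depth[y][x]:        # depth test happens BEFORE texturing/shading
                      depth[y][x] = z
                      colour[y][x] = c       # only the front-most fragment per pixel survives
              return colour                  # visible pixels get written out; the depth
                                             # array is discarded once the tile is done

          tile = resolve_tile([(1, 1, 5.0, "far"), (1, 1, 2.0, "near"), (2, 2, 3.0, "mid")])
          print(tile[1][1], tile[2][2])      # -> near mid
          ```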

          > You mention DDR memory... DDR is a marketing ploy. It does not deliver anywhere near the peak bandwidth that it claims to. There are numerous places elsewhere on the net that will back up this statement...

          Which is kinda what I said when I was talking about DDR memory (remember 'at best a 50% increase in bandwidth'?)...

          > 3dfx actually had computers design their core. Maybe that's why their cards actually worked on all platforms. Maybe that's why they don't require heatsinks as big as your processor and use less space per chip - using an older manufacturing technology. Maybe that's why they don't have to release drivers every week and have a list of known bugs longer than I am tall. Maybe, despite the fact they were outdated, they had more than hype and a marketing department backing them.
          >
          > I agree that the VSA-100 arch was too little too late, but a Voodoo4 will outperform a GF2 MX on an Athlon. Or a P3. Especially at high resolutions. If nVidia is the god that you claim them to be, why would they ever release a card that is crippled and gutted, castrated for all purposes?

          I'm comparing chip to chip. Paraphrasing, you said 'nVidia can't design a chip worth beans; the 3dfx Voodoo4 is a much more elegant design'. I said 'the VSA-100 is so bad they had to use 2 on a card and were STILL slower than 1 nVidia chip'.

          I'm not even going to bother comparing a crippled nVidia chip to a full-blown 3dfx chip.

          OK, I lied. I will compare them (and you thought you got away with something there, didn't you?). Look here. The 3dfx chip is beaten out by the crippled nVidia chip in every benchmark thrown at it. Hell, that crippled chip beat out the V5 in some of those benchmarks, even at 1024x768!

          Oh, and BTW, I'm not saying nVidia is god either; I just use them because they are the current market leader when it comes to fast consumer 3D. I have never spent any money on an nVidia product myself. As for whose drivers are best, I'm growing tired of arguing, and that's a whole new can of worms. I will say that anyone who spends any time in a Matrox forum must know what bad drivers really are, and shouldn't be throwing stones..

          > And, why are you going to complain when you are talking above what you can tell the difference of?

          Uh, en anglais, s'il vous plaît (in English, please).



          • Originally posted by DGhost:
            > Not true. 2- and 4-way memory interleaving proves this.
            >
            > You have two banks of SDRAM, and all you do is write half of the data to one and half to the other, all at the same time. Just like striping in RAID.
            >
            > Oh, and the 'twice the memory, double the performance, same effective size' idea sounds a bit like 3dfx's logic on the Voodoo 5....

            Yep, it was a lot like 3dfx's logic... I stand by my statement that this logic is too costly. 3dfx went out of business after releasing the V5. Coincidence? I think not.

            I haven't seen a good enough explanation of what 2- and 4-way memory interleaving is to even begin to say whether it could be implemented on a video card (who's to say it isn't already?), but I can say it gives at best a 5% increase in speed. Pretty marginal compared to having a 256-bit wide bus all the way to the memory.

            [edit: it occurred to me that you're talking about interleaving such as on the i840, i850 and Pentium (1 and 2fx) chipsets. I thought you were referring to the 'memory interleaving' option many P3 and Athlon BIOSes currently offer. It's true that you wouldn't need to double the amount of memory, but it's still too difficult/expensive to build a dual-channel SDRAM GPU]

            [This message has been edited by Rob M. (edited 06 February 2001).]



            • A 256-bit bus from the video core to the memory would bring the price of the total product through the roof. That's exactly why 3Dfx chose to use 2 chips, where it is practical to use this approach.

              3Dfx's design is much more efficient than nVidia's, which doesn't mean their cards are faster or that 3Dfx is making a profit - saying that is just flawed logic. The question is whether it's cheaper to make a single large core with very expensive and fast DDR SDRAM, or to make 2 cores on a card with cheap SDR SDRAM.

              Fact is that the Voodoo4 4500 is as fast as an nVidia MX. And no, that Anandtech bench is flawed - I've seen about 5 other respectable sites that all came up with roughly identical results, except in T&L benchmarks, which the Voodoo4 4500 doesn't support.

              3Dfx was killed by its own crappy management, and nVidia's hype machine.

              [This message has been edited by dZeus (edited 06 February 2001).]



              • Hey, a dual-channel architecture is not the same as the dual-chip, dual-memory subsystem that was used in the V5 5500... With the Voodoo 5/6, each VSA-100 chip had its own 32 MB memory subsystem, so each texture had to be loaded into each of the memory subsystems to allow both chips to render the same 3D image together...

                A dual-channel memory subsystem is also not exactly the same as the memory interleaving used by some VIA chipsets... With the VIA chipsets, the interleaving was nothing more than an optimization, as the path from the chipset to the memory was still only a 64-bit bus... Chunks of raw data would be halved, one half being put in one memory bank, the other in another memory bank. The result was increased efficiency, yet the theoretical maximum bandwidth was still the same... A dual-channel architecture DOUBLES the number of paths between the memory and the host.

                E.g.: let's take two Pentium IIIs on two BX motherboards with 128 MB per motherboard. Now, let's remove one P3 and link both motherboards directly to the single remaining Pentium III CPU. That would double the memory bandwidth... OK, now let's take Prof. Einstein chatting with two other scientists at the same time on mIRC; if Einstein were fast enough at typing (an analogy to current graphics chips - those chips have an enormous untapped fillrate...), he would be able to gather twice the knowledge/data he could gather by chatting with only one scientist, yet, by chatting with the two scientists at the same time, he would still have access to the knowledge of both of them... (integral memory capacity...)

                My concept is the same one Intel used with the P4: a conventional single-channel RDRAM subsystem achieves a peak bandwidth of 1.6 GB/s (800 MHz x 16 bits/cycle / 8 bits per byte = 1.6 GBytes/second). With the dual-channel architecture, the EFFECTIVE bandwidth DOUBLES to 3.2 GBytes/second. That's why RIMMs used on i850-based motherboards have to be installed in pairs, as the chipset accesses both RIMMs each clock cycle, and, to the best of my knowledge, when you put 128 MB of Rambus on an i850 motherboard, it still is 128 MB of RAM accessible by the CPU...

                Now, we have three options to double memory bandwidth to 12.8 GBytes/s: 1) you could use 400 MHz DDR SDRAM (800 MHz effective...), which is absolutely NOWHERE near availability, and the cost would be totally prohibitive; 2) you could use 256-bit, 200 MHz DDR SDRAM (400 MHz effective), whose cost would be titanic, and which would also require the addition of pins to the chip and traces to the board; 3) you could use 128-bit, 200 MHz DDR SDRAM configured in a dual-channel architecture... I'll leave the answer to you (but the least expensive and least complicated is the third possibility...!)
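
                To sanity-check those numbers, here is the arithmetic spelled out, using the clocks quoted above (peak theoretical figures only - sustained bandwidth is lower; the function name is just for illustration):

                ```python
                # Peak bandwidth = bus width (bytes) x transfers per second x channels.
                def bandwidth_gb_s(bus_bits, mhz, transfers_per_clock=1, channels=1):
                    return (bus_bits / 8) * (mhz * 1e6) * transfers_per_clock * channels / 1e9

                print(bandwidth_gb_s(16, 800))                  # single-channel PC800 RDRAM:  1.6 GB/s
                print(bandwidth_gb_s(16, 800, channels=2))      # i850 dual-channel RDRAM:     3.2 GB/s
                print(bandwidth_gb_s(128, 400, 2))              # option 1: 128-bit 400 MHz DDR:  12.8 GB/s
                print(bandwidth_gb_s(256, 200, 2))              # option 2: 256-bit 200 MHz DDR:  12.8 GB/s
                print(bandwidth_gb_s(128, 200, 2, channels=2))  # option 3: dual 128-bit 200 MHz DDR: 12.8 GB/s
                ```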

                But I do acknowledge that bandwidth optimizations like texture compression and hidden surface removal have to be integrated into future chips; their effect is limited, though, and past that limit you WILL need more memory bandwidth, which is where dual-channeling will be the sensible choice...

                (BTW, I think I'm growing impotent because of all that G800-induced stress; last night, my gal and I only danced the Grand Mambo once and then both fell asleep watching "The Late Late Show with Craig Kilborn"...)

                Francis Beausejour
                Frankymail@yahoo.com

                ----------------------------------------
                - Intel P3 850 MHz Retail O/C 1.133 GHz
                - Alpha FC-PAL6030
                - Asus CUSL2-M
                - 256 MB TinyBGA PC150 (2 DIMMs)
                - Matrox Millennium G400 DH 32 MB O/C 175/233
                - Sony E400 + Sony 210GS (the E400 is a beauty, and very cheap too!!!)
                - Quantum Fireball LM 10 GB + AS 40 GB
                - Teac W54E CD-RW + Panasonic 8X DVD-ROM
                - Sound Blaster Live! Value
                - 3Dfx VoodooTV 200
                Francis: 19/Male/5'8"/155Lbs/light blue eyes/dark auburn hair/loves SKA+punk+Hard/Software... just joking, I'm not THAT desperate (again, Marie-Odile, if you ever happen to read this (like THAT could happen...), I'm just joking and I REALLY, REEEAAALLLLLYYY love you!))
                What was necessary was done yesterday;
                We're currently working on the impossible;
                For miracles, we ask for a 24 hours notice ...

                (Workstation)
                - Intel - Xeon X3210 @ 3.2 GHz on Asus P5E
                - 2x OCZ Gold DDR2-800 1 GB
                - ATI Radeon HD2900PRO & Matrox Millennium G550 PCIe
                - 2x Seagate B.11 500 GB SATA
                - ATI TV-Wonder 550 PCI-E
                (Server)
                - Intel Core 2 Duo E6400 @ 2.66 GHz on Asus P5L-MX
                - 2x Crucial DDR2-667 1GB
                - ATI X1900 XTX 512 MB
                - 2x Maxtor D.10 200 GB SATA



                • Originally posted by frankymail:
                  > Now, we have three options to double memory bandwidth to 12.8 GBytes/s: 1) you could use 400 MHz DDR SDRAM (800 MHz effective...), which is absolutely NOWHERE near availability, and the cost would be totally prohibitive; 2) you could use 256-bit, 200 MHz DDR SDRAM (400 MHz effective), whose cost would be titanic, and which would also require the addition of pins to the chip and traces to the board; 3) you could use 128-bit, 200 MHz DDR SDRAM configured in a dual-channel architecture... I'll leave the answer to you (but the least expensive and least complicated is the third possibility...!)

                  The problem still remains: a 256-bit wide data path to the core requires too many pins and traces. That rules out options 2 and 3. The best way, like I said, is HSR or new memory technology: QDR - 200 MHz x 4 - is certainly possible, and I wouldn't be surprised at all to see it. My favourite is using eDRAM to store the framebuffer/z-buffer; inside the core you can have 256-bit wide data paths.
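
                  To spell out the pin-count objection, here is a rough tally (data lines only, ignoring address/command/power pins - illustrative, not datasheet figures):

                  ```python
                  # Options 2 and 3 both roughly double the data pins/traces on the graphics
                  # chip, while QDR (4 transfers per clock) or on-chip eDRAM adds bandwidth
                  # without widening the external bus.
                  configs = {
                      "128-bit single channel (today)":    128 * 1,
                      "256-bit single channel (option 2)": 256 * 1,
                      "dual 128-bit channels (option 3)":  128 * 2,   # plus a 2nd set of address/command lines
                      "128-bit QDR, or on-chip eDRAM":     128 * 1,
                  }
                  for name, data_pins in configs.items():
                      print(f"{name}: {data_pins} data pins")
                  ```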



                  • Originally posted by dZeus:
                    > Fact is that the Voodoo4 4500 is as fast as an nVidia MX. And no, that Anandtech bench is flawed - I've seen about 5 other respectable sites that all came up with roughly identical results, except in T&L benchmarks, which the Voodoo4 4500 doesn't support.

                    You're right, it's a little difficult to find reviews of the Voodoo4, as it was killed basically right after its release, but over at FiringSquad they ran it through more extensive benchmarks that show it to be on par with the MX.

                    As I said before though, I don't understand how people can compare a 3dfx chip to a crippled nVidia chip and come out saying the 3dfx chip is better - it only wins some of the benchmarks because nVidia didn't design an entire chip to compete in the value segment.

                    The VSA-100 uses a 128-bit path with SDRAM; the GF2 MX uses DDR, but to make it a 'value' card they crippled half the bus - down to 64 bits. We all know that DDR isn't 2x as fast as SDR, and that's the reason it loses out in the tests that are bandwidth limited.

                    Of course, it does win when T&L comes into play, and that can't be attributed to anything except nVidia's superior design.

                    Had nVidia redesigned the chip to work with SDR on a 128-bit bus, you'd see the fillrate tests give the same result for both cards (as they're both memory bandwidth dependent).

                    Face it, the VSA-100 was meant to compete with the GeForce 1. I'd consider benchmarks between those two valid measures of design capability. However, being a full year late, the Voodoo4 would have to beat the GF1 by a fairly large margin before you'd convince me that it was 3dfx's poor chip design that did them in. Poor management isn't what kept the VSA-100 in beta silicon for that extra year.



                    • Rob: the method you describe does the same thing as a 'true' depth buffer does, except it is achieved in software, in the drivers. Like I said earlier, a software trick. And, assuming you don't have a depth or z buffer, you would have so many problems with software it's not even funny. If you didn't buffer the data, any time someone sent a poly out of order (or went front to back), your display would be screwed up. This is why you *have* to have a buffer like that. You can manipulate the data while it's in there, or as it's being added, or whatnot, but the end result is the same - you *have* to have a buffer like that. In the engine, in the drivers, in the card - it doesn't matter where. But you cannot assume all polys will arrive in the correct order.

                      And the Kyro still has a real depth buffer - the 'emulated' one that you referred to earlier. That does all that you describe it doing. The on-chip depth buffer is a small one, designed for smaller scenes, which is consistent with its tile rendering architecture.

                      And rendering/rasterization is the *very* last step in the GL/DX/software-whatever pipeline. You have to know where the polys are to render them. And there is a lot of other math that's done before they get rendered - texturing, for example; lighting, for example. The difference between the Kyro and others is that the Kyro removes polys that are not visible before it processes textures and lighting and whatnot, so it doesn't have to. nVidia's cards do not do this, and there would be little to no advantage to performing HSR on a scene *after* textures have been applied and lighting is in place.

                      And memory interleaving is the same on all the VIA and Intel chipsets; the only difference is that on some, one bank of memory is on one bus and the other is on a completely different one, which gives them the ability to do it in parallel without hitting the same bottleneck. The implementation is still the same: when something wants to write to a memory location, it takes that address and divides it by how many banks it is interleaving across, does the same for the length of the data it's writing, and writes the first xx bits to one bank, the next to another bank, and so on. On a P3 or Athlon, because they only have one memory bus, it turns into more of an optimization of memory timings - 'while this bank is busy writing the data I just sent it, I'll send data to another bank'. When you have dual memory channels, like the i840 and i850, it allows you to write the memory in parallel, so it does use the extra capability. Just like IDE RAID: you can do RAID on a single hard drive with several partitions. Does it improve performance? No. But if the other partition is on a different drive, it does.

                      HSR is not the key to fixing everything. All it does is optimize the data being processed. It will not fix the problem; it will just make games run faster on older computers/3D accelerators/etc.
                      "And yet, after spending 20+ years trying to evolve the user interface into something better, what's the most powerful improvement Apple was able to make? They finally put a god damned shell back in." -jwz



                      • Well Frank, keep going like this for another 11 pages and you might break the record.

                        I wonder if Rob and DGhost wouldn't be better off with a chatbox, going on like this. Tell me, anyone other than Rob & DGhost: did you read any of their stuff?

                        For the oldies out here: do you know at what point the other big thread broke down? Let's see if this new server can take this one.

                        Go for it Frank! Thumbs up.

                        Jord.
                        Jord™



                        • You know, Rob and DGhost, not that I want to stop you from having a discussion about 3D hardware architecture and stuff, but on the forums at www.beyond3d.com there are several experts on this, including former 3Dfx employees... I bet you would find some interesting discussion partners over there. (This absolutely is not, by any means, my way of asking you to bugger off out of this thread! So if you feel like it, just go on!)



                          • hehe... point taken Jorden

                            Rob, if you wish to discuss/argue/squabble/etc. further, email me instead. My addy is in my profile...
                            "And yet, after spending 20+ years trying to evolve the user interface into something better, what's the most powerful improvement Apple was able to make? They finally put a god damned shell back in." -jwz



                            • Originally posted by DGhost:
                              > Rob: the method you describe does the same thing as a 'true' depth buffer does, except it is achieved in software, in the drivers. Like I said earlier, a software trick. And, assuming you don't have a depth or z buffer, you would have so many problems with software it's not even funny. If you didn't buffer the data, any time someone sent a poly out of order (or went front to back), your display would be screwed up.

                              Just like the shadow/glare problem shown in Sharky Extreme's Kyro interview when you disable the emulated framebuffer, yes.
                              You asked me to describe a method that would implement HSR without a depth buffer. I gave you one. I never said it was optimal..

                              > And the Kyro still has a real depth buffer - the 'emulated' one that you referred to earlier. That does all that you describe it doing. The on-chip depth buffer is a small one, designed for smaller scenes, which is consistent with its tile rendering architecture.

                              Do you have a problem admitting you're wrong? Before, you were arguing that the Kyro has an on-chip depth buffer. Now you're saying you're right because it has an external depth buffer...
                              At least you're properly describing the Kyro now, although I'd still caution against referring to the Kyro's on-chip mini depth buffer as if it were a standard depth buffer. Future products will probably have a true on-chip depth buffer, and there'll be a big difference in speed and features.

                              > And rendering/rasterization is the *very* last step in the GL/DX/software-whatever pipeline. You have to know where the polys are to render them. And there is a lot of other math that's done before they get rendered - texturing, for example; lighting, for example. The difference between the Kyro and others is that the Kyro removes polys that are not visible before it processes textures and lighting and whatnot, so it doesn't have to. nVidia's cards do not do this, and there would be little to no advantage to performing HSR on a scene *after* textures have been applied and lighting is in place.

                              I don't think I've ever argued you're wrong here, but I would like to point out that all cards except the Kyro are required to process all textures and lighting. I wouldn't single out nVidia as being dumb here; I'd instead refer to the Kyro implementation as being unique. nVidia, S3 and ATI have all integrated T&L into their cards to remove that burden from the CPU at least, so it's not like those companies are twiddling their thumbs ignoring the problem either.

                              > When something wants to write to a memory location, it takes that address and divides it by how many banks it is interleaving across, does the same for the length of the data it's writing, and writes the first xx bits to one bank, the next to another bank, and so on.

                              That makes sense; I think I have read that before. Although I'm still unsure what the definition of a 'bank' of SDRAM is - I've seen BIOSes that allow 4-way interleaving on a single stick of SDRAM, and on other motherboards you boot up with a single stick and it says bank 0/1 (SDRAM slot 1) filled, banks 2/3 and 4/5 empty.



                              • Errrm... DGhost, I wasn't saying I wanted you to move on. Shit, me and me big mouth

                                If you want to continue it here, it's fine with me. I got nothing to say, remember?

                                Although it would make good reading after this whole Fusion thing is finally brought out into the open and we can look back on what was said here.

                                Please, stay!!

                                Jord.

                                ------------------
                                This cat, is cat, a cat, very cat, good cat, way cat, to cat, keep cat, a cat, fool cat, like cat, you cat, busy cat! Now read the previous sentence without the word cat...
                                Jord™

