Of the GPU and Shading

This is my favorite part, really. After the CPU has started sending draw calls to the graphics card, the GPU can begin work on actually rendering the frame containing the input that was generated somewhere in the vicinity of 3ms to 21ms ago depending on the software (and it would be an additional 1ms to 7ms for a slower mouse). Modern, complex, games will tend push up to the long end of that spectrum, while older games (or games that aren't designed to do a lot of realistic simulation like twitch shooters) will have a lower latency.

Again, the actual latency during this stage depends greatly on the complexity of the scene and the techniques used in the game.

These days, geometry processing and vertex shading tend to be pretty fast (geometry shading is slower but less frequently used). With features like instancing and the fact that the majority of detail is introduced via the pixel shader (which is really a fragment shader, but we'll dispense with the nit picking for now). If the use of tessellation catches on after the introduction of DX11, we could see even less actual time spent on geometry as the current level of detail could be achieved with fewer triangles (or we could improve quality with the same load). This step could still take a millisecond or two with modern techniques.

 

When it comes to actual fragment generation from the geometry data (called rasterization), the fixed function hardware and early z / z culling techniques used make this step pretty fast (yet this can be the limiting factor in how much geometry a GPU can realistically handle per frame).

Most of our time will likely be spent processing pixel shader programs. This is the step where every pin point spot on every triangle that falls behind the area of a screen space pixel (these pin point spots are called fragments) is processed and its color determined. During this step, texture maps are filtered and applied, work is done on those textures based on things like the fragments location, the angle of the underlying triangle to the screen, and constants set for the fragment. Lighting is also part of the pixel shading process.

Lighting tends to be one of the heaviest loads in a heavily loaded portion of the pipeline. Realistic lighting can be very GPU intensive. Getting into the specifics is beyond the scope of this article, but this lighting alone can take a good handful of milliseconds for an entire frame. The rest of the pixel shading process will likely also take multiple milliseconds.

After it's all said and done, with the pixel shader as the bottleneck in modern games, we're looking at something like 6ms to 25ms. In fact, the latency of the pixel shaders can hide a lot of the processing time of other parts of the GPU. For instance, pixel shaders can start executing before all the geometry is processed (pixel shaders are kicked off as fragments start coming out of the rasterizer). The color/z hardware (render outputs, render backends or ROPs depending on what you want to call them) can start processing final pixels in the framebuffer while the pixel shader hardware is still working on the majority of the scene. The only real latency that is added by the geometry/vertex processing portion of the pipeline is the latency that happens before the first pixels begin processing (which isn't huge). The only real latency added by the ROPs is the processing time for the last batch of pixels coming out of the pixel shaders (which is usually not huge unless really complicated blending and stencil technique are used).

With the pixel shader as the bottleneck, we can expect that the entire GPU pipeline will add somewhere between 10ms and 30ms. This is if we consider that most modern games, at the resolutions people run them, produce something between 33 FPS and 100 FPS.

But wait, you might say, how can our framerate be 33 to 100 FPS if our graphics card latency is between 10ms and 30ms: don't the input and CPU time latencies add to the GPU time to lower framerate?

The answer is no. When we are talking about the total input lag, then yes we do have to add these latencies together to find out how long it has been since our input was gathered. After the GPU, we are up to something between 13ms and 58ms of input lag. But the cool thing is that human response happens in parallel to input gathering which happens in parallel to CPU time spent processing game logic and draw calls (which can happen in parallel to each other on multicore CPUs) which happens in parallel to the GPU rendering frames. There is a sequential path from input to the screen, but we can almost look at this like a heavily pipelined path where each stage operates in parallel on a different upcoming frame.

So we have the GPU rendering the previous frame while simulation and game logic are executing and input is being gathered for the next frame. In this way, the CPU can be ready to send more draw calls to the GPU as soon as the GPU is ready (provided only that we are not CPU limited).

So what happens after the frame is finished? The easy answer is a buffer swap and scanout. The subtle answer is mounds of potential input lag.

Parsing Input in Software and the CPU Limit Scanout and the Display
Comments Locked

85 Comments

View All Comments

  • psilencer - Tuesday, August 18, 2009 - link

    First time poster, so be gentle!

    For each of the cases you analyze the bandwidth and take the lag to be the inverse of the bandwidth. This is incorrect. Lag and bandwidth not related as such. Consider a road with a constant speed limit. Lag would be related to the length of the road (the time it takes for some signal starting at A to reach it's destination B). Bandwidth is related to number of lanes (how many signals you can send from A to B within some time). Although there is some relationship between the two, it is not the inverse.

    With this in mind, everything analyzed by this article is incorrect.

    Consider a mouse that has 500 reports/second. Taking the inverse gives 2ms, which is the average time between completed reports. However, you don't consider that multiple "reports" may be pipelined in the mouse. Say for example, your mouse has a camera, some simple processing logic to decipher the data from the camera, and then the usb interface. For simplicity, assume that these units process one and only report at a time (and bandwidth/latency would have the inverse relationship). In that case, each section works at 500 reports/second, and would have a latency of 2ms. However the total latency of the mouse would be at 6ms, since each report needs to go through each section.


    This also applies to the CPU and GPU.

    Sorry, if I'm completely wrong, just ignore this =P

  • siberx - Thursday, July 30, 2009 - link

    Fantastic article - I smile each time AnandTech posts one of these groundbreaking articles that just cuts straight through the BS and gets to the truth behind issues that have been muddled in hearsay and rumours for years.

    I am personally particularly sensitive to input lag, and with my current LCD even in a fast game like TF2 or UT I find the lag intolerable if vsync is enabled - I have to run with it disabled in just about any game demanding fast response.

    My question, however, is the effect that multi-gpu solutions have on input lag. I have never seen something describing exactly how both ATI and nVidia's multi-gpu solutions affect lag, as well as how different multi-gpu rendering modes (AFR, SFR, etc...) affect lag. I would assume that using a multi-gpu solution would, in most cases incur at least an extra frame of delay to mix or move frames between cards, etc... but an actual analysis of this would be very useful. It may, in fact, be worthwhile to disable multi-gpu when running an older twitch game to improve latency...

    Additionally, testing with a couple other LCDs to see how they compare latency-wise would be interesting - I get the feeling your Dell panel is a fair step faster than your standard-issue modern panel doing overdriving to reduce switching times...
  • race2 - Saturday, August 1, 2009 - link

    When you say that all non-Nvidia driver Triple Buffering for OpenGL programs are simply one frame flip queues, do you mean that D3DOverrider's forced Triple Buffering is a one frame flip queue as well?
  • race2 - Saturday, August 1, 2009 - link

    Sorry, first time posting here. Previous comment was not meant to be a reply.
  • arcsign - Sunday, July 26, 2009 - link

    It's nice to know that the whole input lag issue is finally getting some attention. I've been trying to find ways to improve it, without buying new hardware, for a little while now, and came across some options that might be of interest for future articles. (I don't have access to much in terms of equipment to measure these things, so my testing hasn't been so much empirical as it has "well, that seems a bit better... maybe.")

    -- The two that stick out in my mind as far as software options go are (at least for WinXP) the boot.ini options "/INTAFFINITY," and "/TIMERES= xxxxx." The former assigns all interrupts to the highest numbered core, and the latter changes the resolution of the Windows kernel timer.

    -- It would also be interesting to see what sort of effects overclocking might have on various latencies, as I've noticed that Windows doesn't always agree with the BIOS/CPU-Z as to the processor's speed, and in cases where a game uses Windows Performance Counters to calculate time deltas for networking/inputs/etc, if there are any counters that depend on an accurate cpu speed, this could present a problem. (Although this isn't directly related to input lag, it is related to the interaction between the game and the player...)

    -- AHCI multimedia timers versus TSC's (more of an issue in XP than more recent OS's, as I believe Vista and 7 both require the use of the AHCI timers) may also have a significant effect on gameplay.

    Anyways, nice article, and keep up the good work.
  • William Gaatjes - Saturday, July 25, 2009 - link

    Hello, you might find something interesting on the website of Avago .

    Avago technologies manufactures optical mouse chips.
    Another manufacturer is SGS thomson or st electronics.

    Here is a link to avago chips.

    http://www.avagotech.com/pages/en/navigation_inter...">http://www.avagotech.com/pages/en/navig.../navigat...

    You might find some information you seek there.




    I noticed you where writing about 3 keynumbers but you mention 4 on the page : "Reflexes and Input Generation".
  • William Gaatjes - Saturday, July 25, 2009 - link

    And a very nice article i forgot to add.

  • camylarde - Tuesday, July 21, 2009 - link

    Now all that remains is to incorporate a multiplayer fps game and dissect how network comunication affects it, and how that knowledge can be used to clearly select wallhackers and aimbotters from the regular pack, just by watching a demo of them, and doing basic math counts of their reported network lag.
  • DerekWilson - Monday, July 20, 2009 - link

    This is something we would love to do, and while it is on the table we may not have the time in the near term to get something like that up right now.

    But trust me, we've been thinking of many cool ways to use high speed footage :-)
  • JimboMahoney - Monday, July 20, 2009 - link

    I also found Fallout 3 extremely laggy until I edited the Fallout.ini file from this

    iPresentInterval=1

    to this:

    iPresentInterval=0

    (Thanks to TweakGuides.com for this tip).

    It seems that Fallout 3 has VSync enabled at all times, even if you disable it in the menu, unless you make this change. The game was pretty unpleasant to play before I did this (I never use VSync).

Log in

Don't have an account? Sign up now