The Intel XeSS Interview with Principal Engineer Karthik Vaidyanathan

Usman Pirzada

Today we are joined by Intel Principal Engineer Karthik Vaidyanathan to talk extensively about Intel's upcoming neural supersampling technology, XeSS, and that very impressive demo. Karthik was instrumental in the development of XeSS and will be fielding the community's questions about the upscaling tech. The interview begins with some basic background on the classification of upscaling technologies and then moves on to more pointed questions.

While you might be tempted to skip ahead to the juicier parts of the interview, we would encourage everyone to read through it all, because there are nuggets of insight in practically every one of Karthik's responses (and we especially loved the examples that help gamers understand just what is going on behind the scenes).

A screenshot from Intel's XeSS demo showing the impressive upscaling technology at work.

Karthik was also able to reveal some truly exciting details about XeSS - we won't spoil them here, so read on. This writeup is roughly a 36-minute read for the average reader, so grab a cup of coffee, some snacks, and enjoy the interview!

Nicolas Mijuskovic [Intel Moderator]: Maybe you want to kick it off, Karthik? I mean, this can be super conversational, but in terms of maybe giving a little bit of background on upscaling in general and how it works and the different approaches, and then we can kind of dig in. I'm sure Usman will have plenty of questions once we start getting to the neural part and our challenges and our approaches.

Karthik: That's good. The ultimate goal for us with a technology like XeSS is to produce the highest quality rendering with the most accurate lighting, the most detailed shadows and reflections - and deliver that at a smooth frame rate. Such high quality, accurate rendering has its complexity, and it can be challenging to produce all the pixels that you need, especially for a 4K display, at a smooth frame rate. And that's where all these technologies come in. The idea is that we can render a smaller set of pixels every frame and then use that information to generate all the pixels that you need for your target display, which could be a very high resolution display. But the key is to be able to do this without losing any of the goodness that you have in your highest quality render, without losing all of the details, because that just defeats the purpose of having such a high quality render in the first place. You don't want to have all this photorealism and just have it blurred out in your final display [once an upscaling technology is applied]. And that brings us to the different kinds of technologies that you will see, and they broadly fall into two camps.

There is spatial upscaling. You've probably heard about that quite a bit. These are also commonly referred to as super resolution techniques. And then on the other hand you have super sampling techniques, and a lot of modern games already use super sampling techniques like checkerboard rendering. Now, diving deeper into some of these spatial upscaling techniques, they tend to be simpler. They typically engage after anti-aliasing. After TAA, for example, they look at a single frame at a lower resolution and they just upscale it to the target resolution. There are lots of smart techniques that they use to upscale the image, but when you're working with such a small set of pixels to begin with, often the information is just not there. Moreover, when you're working after anti-aliasing - you can think of anti-aliasing as a low-pass filter - that also ends up removing a lot of the detail and a lot of the information.

So you're working with a very small set of information, and then you're left with two choices to try and produce all the missing pixels: either approximate or hallucinate. Most of the real-time techniques that you will see are approximators. The way it works is, if you have a well-defined edge or some well-defined features, you can detect them and then either sharpen them or produce those details, because you have detected them in the first place. Now you can imagine a scenario where you have very fine details - the most common one being thin wires, but there are so many more: fine reflections, highlights, all of these things. If you look at a low resolution render of a wire, you would have maybe one pixel over here and one pixel over there, and there's no way to infer that there's a wire, so it's impossible to produce those details. And that's where it becomes quite challenging for these kinds of single-frame spatial upscaling techniques to produce that detail. Now there are state-of-the-art techniques like GANs which can hallucinate, but these are really not fast enough for the real-time domain and they also don't generalize very well, at least not yet.
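
To make the "detect, then sharpen" idea concrete, here is a minimal sketch of a single-frame spatial upscaler in the spirit Karthik describes: upscale, find edges, amplify them. It is purely illustrative (not FSR's or any shipping algorithm), and it shows why detail that was never captured at low resolution cannot be recovered:

```python
# Minimal sketch of the "detect, then sharpen" approach behind spatial upscalers.
# Illustrative only - not FSR's algorithm or any vendor's implementation.
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def spatial_upscale(low_res: np.ndarray, factor: float = 2.0, sharpen: float = 0.5) -> np.ndarray:
    """low_res: HxWx3 float image in [0, 1]."""
    # Step 1: naive resampling to the target resolution (bilinear).
    upscaled = zoom(low_res, (factor, factor, 1), order=1)
    # Step 2: detect edges/fine features as the difference from a blurred copy.
    blurred = gaussian_filter(upscaled, sigma=(1.0, 1.0, 0.0))
    detail = upscaled - blurred
    # Step 3: amplify the detected detail (unsharp masking). Features that were
    # never captured at low resolution (e.g. a sub-pixel wire) cannot be recovered.
    return np.clip(upscaled + sharpen * detail, 0.0, 1.0)
```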

Super sampling is a completely different kind of approach, and it's good to point out the distinction between the two. The way super sampling works is, first of all, it treats anti-aliasing and upscaling as a single problem - these are not separate problems - and it works directly off of the unfiltered pixels coming from the render. So you have all the information coming from the render, and not only that, you also look at pixels from previous frames, and there's a lot of information there, and you combine that information to produce all the pixels that you need for your target resolution. So first of all, you're not limited as much as you would normally be in the amount of information that you have. For example, if you have a wire, in one frame you might only see two pixels on that wire, but over, say, eight frames you begin to see that this is something that looks like a wire. And the interesting thing is that super sampling has been around for a while. Checkerboard rendering is already there. State-of-the-art game engines already employ these kinds of techniques, and often you might find that they already perform better than spatial upscaling.
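
For readers who want to see the core mechanism, here is a minimal sketch of temporal accumulation - reproject history with motion vectors and blend it with the new frame's samples. The function names, the nearest-neighbor reprojection and the blend factor are illustrative assumptions, not XeSS's actual implementation:

```python
# Minimal sketch of temporal accumulation, the core of super-sampling approaches.
import numpy as np

def temporal_accumulate(current: np.ndarray,   # HxWx3, this frame's (jittered) colors
                        history: np.ndarray,   # HxWx3, accumulated result of prior frames
                        motion: np.ndarray,    # HxWx2, per-pixel offset since the previous frame
                        alpha: float = 0.1) -> np.ndarray:
    h, w, _ = current.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Fetch the history pixel each current pixel came from (nearest-neighbor reprojection).
    prev_y = np.clip((ys - motion[..., 1]).round().astype(int), 0, h - 1)
    prev_x = np.clip((xs - motion[..., 0]).round().astype(int), 0, w - 1)
    reprojected = history[prev_y, prev_x]
    # Exponential blend: over many frames each pixel effectively integrates many
    # samples, which is how thin features (the "wire") get resolved over time.
    return (1.0 - alpha) * reprojected + alpha * current
```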

[Super sampling] is not an easy problem, especially with current state-of-the-art super sampling techniques, which are often based on heuristics. The problem is that it's challenging to find all the information, all the pixels from your previous frames, that can actually be utilized to reconstruct your current frame, because there are lots of scenarios where it might not be feasible - for example, disocclusion, where something is visible in the current frame but was not visible in the previous frame. Imagine some big object in the foreground which was visible in the previous frame but in the current frame has moved away: the pixels it reveals cannot be taken from the previous frame, because they were hidden behind it. So that's one scenario. There are many such scenarios where, because the scene is dynamic and things are moving, you cannot have a one-to-one correspondence between the pixels in your previous frame and your current frame.

You really need some smartness to try and detect which pixels are usable. And in the event that you are not able to use those pixels, you still need a good approximation, like spatial techniques have. Most state-of-the-art game [techniques] use a lot of heuristics, a lot of hand-designed approaches, to try and use as many pixels as they can, but in a way where you don't end up integrating false or invalid information. But they don't work all the time, and therefore you will often see artifacts like ghosting and blurring - issues commonly associated with techniques like TAA and checkerboard rendering. And that's where neural networks come in, because this is almost an ideal problem for neural networks: they are very good at detecting complex features, and we can use them to integrate just the right amount of information, and when that information is not there, to detect these complex features and reconstruct them. So that sort of summarizes the technology.
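
As an example of the kind of hand-designed heuristic Karthik is contrasting the neural approach with, here is a sketch of neighborhood color clamping, a classic TAA-style validity check. It is a generic textbook heuristic, not code from any specific product, and it is exactly the sort of rule that causes ghosting or blurring when it misfires:

```python
# Classic hand-designed validity heuristic: clamp the reprojected history to the
# color range of the current frame's local neighborhood so stale/disoccluded
# history cannot leak in. Works on HxWx3 float images.
import numpy as np

def clamp_history(reprojected: np.ndarray, current: np.ndarray, k: int = 1) -> np.ndarray:
    # Per-pixel min/max over a (2k+1)x(2k+1) neighborhood of the current frame.
    lo = np.full_like(current, np.inf)
    hi = np.full_like(current, -np.inf)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            shifted = np.roll(np.roll(current, dy, axis=0), dx, axis=1)
            lo = np.minimum(lo, shifted)
            hi = np.maximum(hi, shifted)
    # History that falls outside the local range is pulled back toward it.
    return np.clip(reprojected, lo, hi)
```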

Usman: Thanks for that, Karthik. So a little context here. In terms of competitors, we have FSR, DLSS 1.0, DLSS 2.0 and even Unreal Engine's temporal upscaling, right? It's my understanding that XeSS is more like DLSS 1.0 and 2.0 because it's also based on neural super sampling, while FSR does not use neural networks - it's not ML or AI based and it also doesn't use motion vectors. So how would you classify FSR, DLSS 1 and 2, and XeSS, just so readers can place you on a map in terms of rendering technique?

Karthik: Yes, so super sampling techniques need motion vectors, because you're trying to use information from previous frames and you need to know how pixels and objects move between frames.

So FSR, to my knowledge, is spatial upscaling, and we already discussed spatial upscaling and some of its limitations. DLSS 1.0 - again, I am not aware of the internals of DLSS because it's not open - but from my understanding, it was not something that generalized across games. DLSS 2.0 and beyond was neural network based and it generalized very well. With XeSS, from day one our objective has been to build a generalized technique.

Usman: So it's most comparable to DLSS 2.0?

Karthik: You could say that. To the extent that yes, it's neural network based; yes, it is a super sampling technique; and yes, it generalizes - to that extent it's similar to the other system [DLSS 2.0]. But the underlying technology is likely very different, because when you have two independent groups trying to solve a problem in their own way, they will likely end up with very creative solutions to the problem.

You mentioned Unreal Engine 5, and it produces some of the highest quality geometry, shadow, and lighting fidelity. When you're investing so much in your render, you really don't want to lose any of that quality when you scale from those pixels to your target resolution - and that's been our objective from day one. You also don't want a solution that's fragile, one that requires training for every game that someone ships - avoiding that has also been our objective from day one.

The XeSS demo showcased by Intel was not used in the neural network's training set - making it all the more impressive.

Nicolas: I think we can also share on the demo that was shown Karthik, that that was not-

Karthik:  Yeah, I guess Usman, you've seen the demo and I can say that XeSS has never seen that demo. It was never trained on that demo. All the content in that scene that you saw was not used as a part of our training process.

Usman: That is very impressive. So, you can roll this out to a lot of games very fast because you don't need to train it on a per-game basis. It sort of generalizes.

Karthik: Exactly

Usman: And I actually heard yesterday that you guys are planning to support XeSS on older hardware as well. So can you elaborate on how that would work? To my understanding, the reason FSR is not machine learning based is because AMD wanted to cast a wider net over older GPUs as well, and we already know DLSS requires some form of inference capability in the hardware. So how would you plan on rolling this out to older hardware, if you are planning for that?

Karthik: Yes, we require inference capabilities, but matrix acceleration is not the only form of inference capability that is available on GPUs. If you go all the way back to - I think Skylake - we had dot product acceleration, which is DP4a; there are various names for it. NVIDIA has had this, I think, since Turing, and AMD has this now on RDNA2. So even without matrix acceleration you can go quite far. It might not be as fast as matrix acceleration, but it certainly meets the objective. And as I said, the objective is to maintain the fidelity of your render and achieve smooth frame rates. So, when it comes to older hardware, on older Intel GPUs we've had dot product acceleration (DP4a) for a while now.

Microsoft has exposed this through Shader Model 6.4 and above, and XeSS will work on all these platforms.
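
For readers unfamiliar with DP4a, here is a small reference sketch of what the instruction computes: a dot product of four packed signed 8-bit values accumulated into a 32-bit integer. Quantized int8 inference maps a network's matrix multiplies onto many such dot products, which is why DP4a is enough to run a network even without dedicated matrix hardware (the snippet is a conceptual model, not shader code):

```python
# Reference model of a DP4a instruction: four int8 x int8 products + int32 accumulate.
import numpy as np

def dp4a(a_bytes, b_bytes, acc: int) -> int:
    """a_bytes, b_bytes: four signed 8-bit values each; acc: 32-bit accumulator."""
    a = np.asarray(a_bytes, dtype=np.int8).astype(np.int32)
    b = np.asarray(b_bytes, dtype=np.int8).astype(np.int32)
    return int(acc + np.dot(a, b))

# One output of an int8 network layer is effectively many dp4a calls chained together.
print(dp4a([1, -2, 3, 4], [5, 6, -7, 8], acc=0))  # 1*5 - 2*6 - 3*7 + 4*8 = 4
```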

Usman: We have seen spatial techniques suffer when upscaling from low resolution input frames, but since you guys are using neural super sampling techniques, how would you say XeSS fares when upscaling lower resolutions? Think 1080p and below.

Karthik: We should scale well to lower resolutions. And that's also the point for our own laptop devices, right? Those are not rendering to, say, 4K; they may be rendering to 1080p or 720p. So it's key - especially as you go down the range of GPUs towards mid-range and lower-end GPUs - to have a technology that scales down, and that puts more burden on the upscaling technology because it has to work with a smaller set of information. And 8K is in some cases an easy problem, because there's already a lot of redundancy in the render that you can produce, but as you go down that chain towards 2K or 1080p, it becomes harder and harder, and you might also see this with other approaches. But our objective is to try and maintain that image quality as you go down to lower resolutions.

Usman: What is the current state of, and how does XeSS handle, motion and other temporal artifacts like we have seen in DLSS 1.0 – which NVIDIA has by and large fixed with DLSS 2.0? What kind of expectation can users have in terms of artifacts on day 1?

Karthik: So, I think some of the issues that you might have seen with DLSS 1.0 - and again, I am basing this on publicly available information - all I can say is that the only thing I know at this point is that it's an approach that didn't generalize, and we can make some inferences based on that. And here's one way to look at it: when it comes to neural networks, one part of the key to generalization is to have robust data sets and robust training, but that's only one part of the problem. I would argue the bigger problem is to define the problem in a simpler way - for example, in a lower dimensional space - so it's a lot easier for the network to generalize.

I can give you a very simple example. You can have a network try to predict all three color components, R, G and B. Or you can have a network that only predicts a filter that is applied to all three color components, right? And this is a very, very simple example that I'm coming up with just to convey the idea that the way you define the problem gives you a path to better generalization. That is the key, and all the issues that you might have seen with DLSS 1.0 - where, you know, it might have been blurry in some places, or had issues with motion - inherently come down to defining the problem in a way that is easy to generalize, and to treating anti-aliasing and upscaling as a single problem where you use motion vectors for both. Treating it as one, that's the key.
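
To make Karthik's "predict a filter, not the color" example concrete, here is a tiny sketch of the second option: the network outputs a small set of filter taps that are applied identically to all three color channels, so the final color always comes from real rendered samples. The patch size, tap count and shapes are illustrative assumptions only:

```python
# Option A: network output = 3 values (R, G, B) -> the net must synthesize color itself.
# Option B: network output = 9 values (filter taps) -> the output is constrained to be
# a weighted combination of real rendered samples, which is easier to generalize.
import numpy as np

def apply_predicted_filter(patch: np.ndarray, taps: np.ndarray) -> np.ndarray:
    """patch: 3x3x3 neighborhood (H x W x RGB); taps: 9 predicted filter weights."""
    w = taps.reshape(3, 3, 1)               # the same taps are reused for R, G and B
    return (patch * w).sum(axis=(0, 1))     # one filtered RGB output pixel

patch = np.random.rand(3, 3, 3)             # stand-in for rendered samples
taps = np.full(9, 1.0 / 9.0)                # e.g. the network predicted a box filter
print(apply_predicted_filter(patch, taps))  # average of the 9 input pixels, per channel
```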

Usman: So to summarize: XeSS generalizes, it supports motion vectors, and it is, to your knowledge, most comparable to DLSS 2.0.

Karthik: Yes

Usman: Great, let's talk a bit about the roadmap as well. So NVIDIA split their DLSS tech into DLSS 1.0 and DLSS 2.0. Are you guys going to split it up into multiple iterations as well, or is it just going to be a single release? And I've also heard - I think Raja mentioned this at the event yesterday - that you guys are planning to open source it as well at some point.

Karthik: Let's talk first about the roadmap. With a technology like XeSS, I believe there is so much more we can do, and it would be naive of me to say that we have solved all problems, that XeSS is just perfect and that we're done. It's going to improve more. Maybe you could get even more aggressive scaling, maybe go down to lower resolutions. There are so many interesting problems in this space that we will continue to improve, evolve and lead. So yes, there will be an XeSS 2.0 at some point, an XeSS 3.0 at some point. You know, at some point maybe graphics just completely changes and it's all neural networks.

And yes, we do plan to open source it. It's important for us; it's part of our vision for a technology like this. First of all, ISVs actually know what they're integrating when we share the source: they understand the technology, they can build upon it, and we get a bigger mindshare as a result.

We have a certain perspective on this. We have some of the best researchers solving this problem, but I imagine sharing it with the larger community allows us to leverage so much more, and there's so much more that we could do as a result. So that's one part of it. And of course, having something that is cross-vendor and open source is much better, because there's a lower barrier to adoption, right? If you have a technology that's open source and runs on multiple platforms, it's something that you can integrate into your game engine without having to differentiate for every single platform that you're running on. So yes, it's also been our objective from day one to have a solution that works on other GPUs, is open source, and can establish a path to wider adoption across the industry.

And that is something you need for wider application of technologies like this. You can come up with the most disruptive technology, but if it's a black box, there's always a challenge.

Usman: And it's my understanding that it will initially be released as closed source but eventually move to open source. Is that understanding correct?

Karthik: I am not fully aware of the timelines involved, but it will eventually be open sourced - that I can confirm.

Usman: Fair enough. Are you guys working with any major game developers right now to integrate XeSS, or any ecosystem partners for XeSS specifically? I know you guys are working with a lot of people as far as Xe HPG goes, but what about XeSS? Or has it not reached that stage yet where you would partner up with ecosystem partners or game developers?

Karthik: There are several partners that we are working with. I cannot comment more at this point.

Usman: But these are game developers, right, or other partners?

Karthik: This includes game developers. I wish I could share more haha.

Bioshock Infinite and Half Life are Karthik's favorite games.

Usman: So let's take a breather and talk about something lighter for a while. You have been here for 10 years, right? What would you say is your favorite game?

Karthik: Maybe 11 - I lost count at some point. Oh, that's a very interesting question. The one game that comes to mind is BioShock Infinite. Haha. I think that's one that stuck with me for a while. And then, going back to my college days, there was Half-Life. Of course, everyone's waiting for Half-Life 3, and you can include me in that list. It will probably not happen in my lifetime, ha.

But yeah, BioShock Infinite was probably the last game that I recall very distinctly, for one reason: for me personally, I am more into games for the storytelling aspect. For me, games are another form of artistic expression, something like a movie but with more degrees of freedom and more potential. And so I tend to be biased towards games that have a very strong storytelling aspect, right? I'm not really into multiplayer gaming, because that's not what has interested me all along. It's about having an immersive experience and just admiring the visuals, the creative input, the artistic aspect of it. And that's what draws me to this field, this technology. And yeah, I guess BioShock Infinite was one of the games that had a very strong plotline, in my opinion.

Usman: Sounds good. And I know this might seem like a biased question, but out of all of the Xe HPG ecosystem, what is your favorite feature? It could be anything - ray tracing, XeSS. Feel free to pick whatever fancies you.

Karthik: I also worked on ray tracing, so I would put ray tracing and XeSS at the same level. They're both disruptive, forward-looking technologies that are setting the trend for next generation graphics.

Usman: So if you were to play a game, you would definitely play it with ray tracing on?

Karthik: As long as I can get 60 FPS, I must admit. I played Cyberpunk with ray tracing off because it couldn't hit 60 FPS on my NVIDIA RTX 2070. But yes, smooth frame rates are very important, and that's the problem you have to go and solve. It pretty much goes back to what I said initially: you want to have the highest quality pixels. I want to have this immersive experience, to have the highest fidelity from the artist's creative input to the visual experience, right? And not lose any information along the way - and that's what excites us. And yeah, I would like to play Cyberpunk with ray tracing turned on. We want to work on technologies that unlock the next generation of rendering, more photorealism, without paying any frames for it.

Usman: And actually, that sort of reminds me of a small follow-up. Other rendering tech like DLSS and FSR have a performance mode and a quality mode. Are you guys trying to do something like that as well, or is it just going to be a standard XeSS? What should a user expect in terms of mechanism? Is it going to be a slider? Is it going to be a performance mode and a quality mode, or just XeSS on/off?

Karthik: We will have quality modes. Both FSR and DLSS have those at this point and users are used to them, so we will support the same. But I also wanted to point out that the one thing that sort of gets lost in these different modes - performance, quality, ultra quality - is that what you really want is something like the performance mode producing an image quality that is so close to ultra quality that it doesn't take away from the visual experience.

And the way these settings work in all games, across FSR and DLSS, is that they just control your input resolution. So ultra quality runs the input at a higher resolution - you are producing more pixels. Quality produces slightly fewer pixels, balanced produces even fewer pixels, and performance produces the smallest number of pixels. And in some ways that reflects on the capabilities of your upscaling technology. When you're running at quality or ultra quality, there's not much for the upscaling technique to do - you already have a majority of the pixels, and you might be able to get away with a cheaper upscaling technique. It's when you start going down to performance, the kind of mode which unlocks 60 frames per second with the highest quality rendering, that the capabilities of the upscaling technology really start coming out.
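
As an illustration of how these modes typically map to render resolution, the scale factors below mirror publicly documented FSR/DLSS-style presets - they are not confirmed XeSS values. The point is simply that "performance" mode asks the upscaler to reconstruct the largest fraction of missing pixels:

```python
# Illustrative quality-mode scale factors (FSR/DLSS-style, NOT confirmed XeSS values).
TARGET = (3840, 2160)  # 4K output

modes = {
    "ultra_quality": 1.3,
    "quality":       1.5,
    "balanced":      1.7,
    "performance":   2.0,
}

for name, factor in modes.items():
    w, h = int(TARGET[0] / factor), int(TARGET[1] / factor)
    rendered = (w * h) / (TARGET[0] * TARGET[1])
    print(f"{name:14s} renders {w}x{h} ({rendered:.0%} of the output pixels)")
```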

Usman: For XeSS development, how does Intel quantitatively measure the quality of the upscaled graphics? Or is it all subjective?

Karthik: We do both. We do user testing, and we have a set of quantitative metrics that we use. At the very least, you have basic metrics like PSNR (Peak Signal-to-Noise Ratio), but that's only scratching the surface when it comes to image quality assessment; there are more advanced metrics available to us these days, like perceptual metrics, which we also use for quantitative analysis. But that is not sufficient. No metric is perfect when it comes to user perception, especially with gaming. So we always have to rely a fair amount on user testing too, and we do whatever we can within our capabilities to test this and get another level of image quality validation.
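
For reference, here is a minimal sketch of the kind of quantitative check mentioned above: PSNR between an upscaled frame and a high-sample-count reference. A single number like this is easy to track automatically but, as Karthik notes, it only scratches the surface of perceived quality:

```python
# Peak Signal-to-Noise Ratio between a reference image and an upscaled image.
import numpy as np

def psnr(reference: np.ndarray, upscaled: np.ndarray, peak: float = 1.0) -> float:
    """PSNR in dB for images with values in [0, peak]."""
    mse = np.mean((reference.astype(np.float64) - upscaled.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return 10.0 * np.log10((peak ** 2) / mse)
```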

Usman: Will XeSS be available in titles/game engines on day 1 when Intel's ARC Alchemist GPU launches or will it be made available later?

Karthik: We haven’t detailed that yet so I can’t comment on that.

Usman: Does the XeSS XMX implementation take advantage of competitor hardware like NVIDIA's Tensor Cores?

Karthik: Ah, no. Until there is standardization around matrix acceleration that is cross-platform, it's not easy for us to build something that runs on all kinds of matrix acceleration hardware. DP4a has reached a stage where it is supported on all platforms - certainly on all modern platforms - so that makes it much easier for us. But matrix acceleration is not at that same stage. So our matrix implementation that targets XMX is Intel specific.

Usman: Does XeSS work on hardware with no XMX or DP4a support with an FP16 or FP32 fallback?

Karthik: No, not at the moment. We will look into it, but I cannot commit to anything at this point. Even if you were able to do it, there's the big question of performance and whether it's justified.

Usman: What resolution was the XeSS model trained at? For context, NVIDIA DLSS was trained at 16K.

Karthik: That's a very interesting question. Let me put it differently: we train with 64-samples-per-pixel reference images, and I think that makes more sense, because what we are trying to match - the kind of quality that we are trying to train the network towards - is 64x SSAA. That's what we use to train the network. Another way of looking at it is how many samples it ends up being overall. So when NVIDIA says 16K images, I am assuming that translates to the number of samples inside a pixel.

So from our standpoint, that's what I can talk about. We train with reference images that have 64 samples per pixel.

Now, if you want to derive a resolution from that, you can do the math. 64 samples would be 8 samples in X and 8 in Y, so that would come out to roughly 32K. But I wouldn't call it 32K, because effectively all of those samples are contributing to the same pixel - the 64 samples all contribute to one pixel. But yeah, effectively a 32K grid of samples is what we use to create the reference for one image.
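
Here is the arithmetic worked out, assuming a 4K target frame (the target resolution is our assumption; Karthik only states 64 samples per pixel):

```python
# 64 samples per pixel = an 8x8 sample grid inside each output pixel.
samples_per_pixel = 64                       # reference quality: 64x SSAA
per_axis = int(samples_per_pixel ** 0.5)     # 8 samples in X and 8 in Y
target = (3840, 2160)                        # assumed 4K target frame
effective = (target[0] * per_axis, target[1] * per_axis)
print(effective)                             # (30720, 17280) - roughly a "32K" sample grid,
                                             # but all 64 samples feed one output pixel
```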

Usman: Will XeSS work at resolutions higher than 4K? Say 8K?

Karthik: That is certainly on our roadmap.

Usman: Does DP4a work directly over DX12 with SM6.4, and how is the XMX version exposed - over DirectML with metacommands?

Karthik: That's an interesting question. First of all, I wanted to point out that both the DP4a version and the XMX version are exposed through the same API. So as far as the integration is concerned, it's actually the same: what the game engine sees is the same interface, and underneath that interface you can select the DP4a or the XMX version depending on the platform. So I wanted to clarify that it's not two different interfaces. It's the same interface and the same library, just with two different paths inside of it, which makes it a lot easier for game developers.

Now, coming to your question: for DP4a, yes, SM 6.4 and beyond support it. SM 6.6, for example, supports DP4a and also supports packing intrinsics for extracting and packing 8-bit data, so we recommend SM 6.6.

We don't use DirectML. With the kind of performance that you're looking at when it comes to real time, even a hundred microseconds is very significant. So for our implementation we need to really push the boundaries of optimization, and we need a lot of custom capabilities - custom layers, custom fusion, things like that - to extract that level of performance. In its current form, DirectML doesn't meet those requirements, but we are certainly looking forward to the evolution of the standards around matrix acceleration, we're definitely keeping an eye on it, and we hope that our approach to XeSS sets the stage for a standardization effort around real-time neural networks.

Usman: This should be an easy one, when did the work on XeSS start?

Karthik: From the point at which we started working on the research, it's been more than a couple of years - let's just say that. So it's certainly not something we put together in the last year or the last couple of months; it has been going on for a while.

Usman: When can we expect the software development toolkit to go live?

Nicolas: It will be later this month for ISVs (XMX) and later this year for DP4a, but it will not be a public release. As XeSS matures, we'll open up the tools and SDK for everyone.

Usman: Will XeSS be applied at the driver level in all games, or will it need to be implemented natively in game engines? We are hearing of a tool which allows FSR to be applied to pretty much all Steam games.

Karthik: Just like DLSS, it will have to be integrated into the game engine. It's not something that can be hidden from the game engine, and that's why we need to work with ISVs to get this into games, and we are trying our best to make it as easy as possible to integrate.

It requires developer support. Having said that, generally, super sampling technologies that are implemented at the tail end of the pipeline, closer to the display, will always have more challenges. I can give you a clear example: let's say you had film grain noise that was introduced as a post-process - trying to apply an upscaling or super sampling solution after the fact becomes very challenging.

So even if one were to implement something like this as an upscaling solution close to the display, there are always going to be scenarios like this, where the game engine does some kind of post-processing that just breaks it. Being closer to the render gives you, as we discussed, the highest fidelity information, with the amount of controllability that you need to produce the best result.
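
To illustrate the ordering point, here is a tiny, self-contained sketch of a frame pipeline in which the upscaler consumes the raw render plus motion vectors and a stochastic post effect such as film grain is added only afterwards, at the target resolution. Every function here is a toy stand-in, not an engine or XeSS API:

```python
# Toy frame pipeline: upscale first, then apply stochastic post-processing.
import numpy as np

rng = np.random.default_rng(0)

def render_low_res(h=540, w=960):
    """Stand-in for the renderer: unfiltered color plus motion vectors."""
    color = rng.random((h, w, 3))
    motion = np.zeros((h, w, 2))            # static scene keeps the sketch simple
    return color, motion

def super_sample(color, motion, factor=2):
    """Stand-in for the temporal upscaler: naive 2x nearest-neighbor repeat."""
    return color.repeat(factor, axis=0).repeat(factor, axis=1)

def film_grain(image, strength=0.03):
    """Stochastic post effect that should never feed the temporal accumulator."""
    return np.clip(image + strength * rng.standard_normal(image.shape), 0.0, 1.0)

color, motion = render_low_res()
final = film_grain(super_sample(color, motion))   # grain is added AFTER upscaling
```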

Usman: How hard will XeSS be to implement compared to DLSS?

Karthik: It should be similar, and there's another way to look at it. For a game that already implements TAA, integrating something like XeSS should only be a small amount of effort, because you already have all the pieces that we need with any TAA implementation: you have the motion vectors, you have the jitter. So you have all the pieces for any kind of super sampling technique to be integrated if the game already has a TAA implementation, and that's a pretty large set of games right now. TAA has almost become the de facto standard for anti-aliasing. So any game that already has TAA already has the pieces you would need to integrate XeSS or any super sampling technique - with a few modifications, of course, but those are small modifications.
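
One of those "pieces" is the per-frame sub-pixel jitter that TAA already applies to the camera projection. A Halton sequence is a common choice for it; the sketch below shows how such offsets are typically generated (the sequence length and bases are typical examples, not XeSS requirements):

```python
# Halton-based sub-pixel jitter offsets, as commonly used by TAA-ready engines.
def halton(index: int, base: int) -> float:
    """Low-discrepancy Halton sequence value in [0, 1)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def jitter_offsets(n: int = 8):
    """Per-frame sub-pixel offsets in [-0.5, 0.5), using bases 2 and 3."""
    return [(halton(i + 1, 2) - 0.5, halton(i + 1, 3) - 0.5) for i in range(n)]

print(jitter_offsets())  # cycle through these, one offset per rendered frame
```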

Usman: Thank you so much, Karthik and Nicolas. That's a wrap from my side.

Special shout out to Patrick Moorhead (@PatrickMoorhead), Sebastian Moore (@Sebasti90655465), Locuza (@Locuza_), MelodicWarrior (@MelodicWarrior1), Albert Thomas (@ultrawide219), and WildCracks (@wild_cracks) for contributing to the interview questions.
