[GRUEDORF] mgambrell's making other people's games
-
So a game was running slow on the switch. After a while I realized it was because of some text I’d added which was getting rendered every frame. Like… one sentence of text. Using a spritefont and drawn in a batch. What the heck? This shouldn’t be so slow.
After digging in, I realized it was due to the kerning support I’d added to imgui which necessitates lookups of kerning pairs between letters. On top of this, imgui’s basic glyph lookup was slow. (Using imgui was a mistake, but that’s a story for another day and I’m stuck with it for now.)
Both these things are ABSURDLY slow because they use a binary search through a rather sizable list. Binary searches are slow because by definition they hop around memory without paying any heed to where they are in a cache. Every single hop is to a fresh cache line. This is 100% the worst possible case for a memory subsystem. And that’s the weakest point of mobile platforms, which the switch is. Turns out this just cost me a few FPS but it was enough to drop it below 60 and thus to 30.
I’ll have to redo this some day. But not now… I just optimized a different thing entirely to earn back some FPS from a different budget and so the game regained 60fps. However, I do know how to do it. And as luck would have it… I had to do basically the same thing for another text renderer just the same week!
In this other case, I’m high-level emulating the Turbo CD “system card” bios. This is so we can release games that are emulated without having to distribute the copyrighted bios, which there’s no way we could ever get the rights to. This proceeded pretty well. But at the very end of the game we were porting, I finally discovered that it uses a built-in SJIS font from the bios to render the credits. Now, I really don’t want to include that copyrighted font verbatim. On the other hand, it takes a LOT of work to make a Japanese font. So I wondered… had they just used someone else’s public domain font? I spent hours scouring the web and could find no proof that they had. Fortunately, while doing so I found 100s of BDF-format dot fonts, a few of which were 16x16 (which is what I needed), and 3 of which had the needed attribute of leaving one column empty (so that the glyphs would be spaced at least a little if put on a grid).
So it was time to make a BDF-processing tool, and a necessary part of that is creating the glyph lookup table so that I could get from an input SJIS code to a specific offset in the dot font bitmap data. And now I will tell you the proper way to do it.
SJIS is a two-byte code. UTF-16 and even UTF-8 can be handled the same way (albeit 3-byte codes are common in UTF-8 when it comes to Japanese). You can make a 64k-entry table if you want (each entry is either the offset to the glyph or a missing-glyph sentinel). This involves just one random-access cache hit per lookup. The table entries really need to be 4 bytes, so it ends up being 256KB. Not bad, but we could do better.
So generally what you do is you simply make a 256-entry “1st-level” table. The first byte of the code is the index into it, and the value is the number of a 2nd-level table. These can be bytes, so long as you know how to calculate the offset of a 2nd-level table from the number [it’s 256+number*sizeof(2nd-level table)]. So the first level table is tiny, although it’s still basically a random access each time. The 2nd level table can be 2 byte entries (to index another table of offsets to the final glyphs) or 4 byte entries (to directly index the offset of the final glyph).
My glyphs were all fixed-size (32 bytes apiece) so I could calculate them from a number, so I ended up with 2 byte entries. Note that in MANY cases your glyphs will be fixed-size: insofar as they are records containing e.g. texcoords. If you’re careful, a spritefont renderer can operate completely in-cache, and really only the final access of the glyph record is random (the gpu batch data you’re producing is written consecutively, which is actually the ideal case for any CPU and is basically negligible compared to all the rest, at least here where it’s not much data).
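To make it concrete, here’s a minimal sketch of the lookup (names and exact layout are illustrative, not my actual code; it assumes the packed layout described above):

#include <cstdint>
// blob layout: 256 bytes of 1st-level table, then the 2nd-level tables packed after it,
// each one 256 entries of 2 bytes (a glyph number; 0xFFFF = missing glyph)
static const uint32_t NO_GLYPH = 0xFFFFFFFF;

uint32_t glyphOffset(const uint8_t* blob, uint8_t hi, uint8_t lo)
{
    uint8_t table = blob[hi];                          // 1st level: which 2nd-level table handles this lead byte?
    if (table == 0xFF) return NO_GLYPH;
    const uint8_t* t2 = blob + 256 + table * 256 * 2;  // it's 256 + number*sizeof(2nd-level table)
    uint16_t g = (uint16_t)(t2[lo * 2] | (t2[lo * 2 + 1] << 8));
    if (g == 0xFFFF) return NO_GLYPH;
    return (uint32_t)g * 32;                           // fixed 32-byte glyphs: number -> offset into the bitmap data
}

Two dependent loads (plus the final glyph fetch) instead of a pile of binary-search hops, and the 256-byte 1st-level table basically lives in cache.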
But anyway that’s more optimization than I really care to think about now. By using the 2-level index you change a dozen or more random accesses to 2 or 3 and due to the way these memory bandwidth-constrained devices work, that can directly map to a 4x speedup. I did it with a 2-level index in this case because I actually put the data in my fake bios and I didn’t want to worry about what would happen if it grew too large.
One day I’ll improve that imgui text renderer some more so I can actually use it seriously; in that case I will probably make a direct 64k-entry map from 2byte codes to glyph data. But wait–what about the kerning! We’re not through yet. I brought all this up for a reason! The kerning data is kind of a multi-level index, isn’t it? 1st index is the 1st character and 2nd index is the 2nd character. Now in this case you can end up with some giant tables (if any 2-byte character can be kerned with any other 2-byte character, you can end up with 64k*64k=4GB entries in the worst case…) but still it’s only a small number of random accesses. In practice the table won’t be that big, and if it is too big you can work it into 3 levels somehow. Not a big deal. I’ll write about how that goes when I get around to doing it.
-
Some platforms have really weird patching systems. If you do something like frameshift all your data by 1 byte in an update, the patch engine will fall apart and the patch will be 100% the size of the new file. And there are limits to how large patches can be, too. This sounds like it might be amateurish, but there’s a reason for it: making a fully-powered patching system has performance implications, specifically on the disk IO system. And consoles are all about the bargain of the platform guaranteeing resources while the developer guarantees he’ll use them. So the patching apparatus cannot be infinitely powered, like to the extent where it’s a fully optimal difference which requires unbounded time to uncrunch patch data while referencing sectors from all over the version0 datafile. Especially that last part: in the worst case, the individual seek and IO times will get out of control, as will the wear and tear on the flash media. I call it read amplification because that’s basically what it is.
If something in the above gave you a creepy, puzzled feeling, here’s the piece of the puzzle you’re missing: the premise for a console platform’s patching engine is that the version0 datafile is always present. It’s in the physical media! The patch must be applied on the fly. There’s no separate storage for it. Even on modern consoles, this same principle may be in effect.
On the developer’s side of that guarantee, then: we are supposed to read the docs about how the platform’s patching apparatus works, and then design disk file formats which don’t cause problems for it. This always catches Unity developers by surprise–the default way for doing things there makes a complete mess out of the data every time you touch anything. It’s always sad to see the horror slowly dawn on them. Well, it catches everyone by surprise at least a little the first time, but at least the developer of a proprietary engine would have had this kind of thing on their radar already (and it’s not as likely to be their first time doing anything.)
But I’m personally in a bit of a pickle. See, I’m porting other people’s engines, and they’ve unknowingly made poor choices about the structure of the packfiles. I don’t want to hack the engine (part of my business plan is to do everything I can to touch engines and games as little as possible, to keep costs low.) But what do I do when I need to patch one of these games? I ran into this scenario last week. So here’s how I handled it.
I have a file format which consists of an 8 byte record per ‘sector’, where the ‘sector’ size can be tuned somewhat depending on the nature of the data. This sector consists of what I call a ‘literal prefix’, a ‘chunk’, and a ‘literal suffix’. Tracking that requires 8 bytes per sector which can be directly-indexed by sector. The literal prefix and suffix are stored in the patch file at the end together. The ‘chunk’ comes from the version0 datafile. The sector may naturally be 100% literal prefix or 100% chunk–and commonly is. After all, we’re dealing with packfiles, and unless internal data has been re-indexed somehow so a bunch of IDs and references are changed (we’ll hope that doesn’t happen *), most of the data is the SAME–only moved around. And most importantly, the implications on read amplification are contained: reading one sector of data from the patched file results in reading no more than two sectors of data from the version0 file.
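For illustration, the per-sector record could look something like this (the field sizes are my guess at a layout that fits in 8 bytes; the real format differs in details):

#include <cstdint>
struct PatchSector          // one of these per sector of the patched (version1) file
{
    uint32_t chunkOffset;   // where in the version0 file the 'chunk' is read from
    uint16_t prefixLen;     // literal bytes (stored in the patch file) placed before the chunk
    uint16_t suffixLen;     // literal bytes placed after it; chunk length = sectorSize - prefixLen - suffixLen
};
// All the literal prefix/suffix bytes are stored together at the end of the patch file, and the
// records are directly indexable by sector, so serving one patched sector touches at most two
// sectors of the version0 file.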
I made an effort to find the best ‘chunk’ that I possibly could. I do exhaustively search for those, using an acceleration data structure that consists of a list of offsets of strings in the version0 file beginning with each 16bit value. That cuts the number of strings in the version0 file I have to search down to roughly 1/65536th of the total. Using that filter, I just look for the longest matching string I can. Even this would be excruciatingly slow (hours?) but I found another technique: it so happens that the best string in version0 for version1’s sector 124 is most commonly the string immediately following the chunk that was chosen for version1’s sector 123 (as long as the literal suffix for sector 123 was empty–the sector is concluded with a chunk). In other words, I can find the next piece of data from the packfile’s inner-packed file without even needing to know anything about the structure of the packfile. With THAT (and a 32 thread processor) I can handle a 1GB packfile in 30 seconds or so. The whole algorithm is trivially parallelizable. Even the acceleration table is parallelizable (each of 256 threads indexes the strings starting with that byte). That table is about 10GB for a 1GB packfile, when it’s all said and done.
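The search, sketched (this is the idea, not the real code; it leaves out the threading and the “try the offset right after the previous sector’s chunk first” shortcut):

#include <cstdint>
#include <cstddef>
#include <vector>

// acceleration table: every offset in version0, bucketed by the 16-bit value that starts there
static std::vector<uint32_t> index16[65536];

void buildIndex(const uint8_t* v0, size_t v0len)
{
    for (size_t i = 0; i + 1 < v0len; i++)
        index16[(v0[i] << 8) | v0[i + 1]].push_back((uint32_t)i);
}

// longest run of version0 data matching version1's data at 'pos'
size_t longestMatch(const uint8_t* v0, size_t v0len, const uint8_t* v1, size_t v1len, size_t pos, uint32_t* matchOffset)
{
    size_t best = 0;
    if (pos + 1 >= v1len) return 0;
    for (uint32_t off : index16[(v1[pos] << 8) | v1[pos + 1]])   // only candidates whose first 2 bytes already match
    {
        size_t n = 0;
        while (off + n < v0len && pos + n < v1len && v0[off + n] == v1[pos + n]) n++;
        if (n > best) { best = n; *matchOffset = off; }
    }
    return best;
}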
I did evaluate all the existing patch file formats, but none of them were really designed for on-the-fly use. For sensible reasons. I’m sure in exotic filesystems there may have been something like I needed but it would have been impossible to reuse that stuff. I didn’t even look for it.
Now to finally complete the picture, what we do is make version1 of the game out of the version0 packfile + the patch file. The patch that gets shipped to users is no larger than our patch file, then.
Actually the format I devised is kind of bloated, so I do compress the whole thing. Could compress to anywhere from 10% to 99% of the original file size… depends on details. Since this is done on the entire file, and you may have noticed my read-amplification analysis didn’t discuss reading the literal parts, it follows that my patch files are loaded completely into memory. My defense of this is that any engines we have that make such good use of a packfile that they are enormous enough to cause problems, will also be rather efficient about NOT loading too much data in memory, meaning I will have plenty of room to hold a patchfile. Hope I’m not wrong…
I had to modify my engine to try opening myfile.ext.patch any time myfile.ext is opened (actually this analysis is done when our filesystem is built during content building) and then have the ability to do this patching on the fly. But there’s one more read-amplification issue I had to deal with: since my patch file is sector-based, and usually games will read entire multi-sector files from the packfile all in one operation, I don’t want to turn that into several individual-sector reads from the physical filesystem. To make sure this doesn’t happen, I interposed another layer of logic that ‘coalesces’ adjacent reads from the version0 file and doesn’t execute them until a read comes from somewhere else.
Whew, that was a lot of work! I hope it works out. Part of the concept here is that no work is required IN ADVANCE, so despite my testing, I won’t be 100% sure about it until I actually run into the scenario where I have to do the patch (this whole endeavour was precautionary before we shipped a game with huge packfiles).
In case this does happen, I have another strategy planned. What I’ll do is find the best-matching sector’s worth of data in the version0 file and then store the XOR of it vs version1 in the patch. The whole patch will then be compressed (and it’s hopefully mostly zeroes) but it needs to be done one sector at a time, since the mountain of zeroes will be too big for RAM. I think this approach has promise, but it’s just not what I ended up doing. Plus, I think the computation to find “best-matching sector” is pretty rough. I don’t want a patch file that takes days to compute… unless I’m really, REALLY desperate. Since I have the on-the-fly patching built into my engine now, I could make an alternate patch format without too much work. Note that in order to avoid read-amplification issues, I would need to STRONGLY prefer sequences of sectors. I mean, I know it’s paranoid, but I’d rather miss once and have an inefficient sector than even have a chance of a perverse case turning one long sequential read into dozens of short ones.
-
I ran into a problem today with a game that we thought was already totally DONE and basically ready to go to manufacturing. Sony bounced it back because it sat too long at a black screen while booting up (like, almost a minute). They gave me the clue I needed to pretty much immediately figure it out: this only happened if you started the game right after inserting the disc for the first time. After the game is fully copied to the hard disk, the problem doesn’t occur.
This isn’t the first game I’ve had go on a PS4 disc, but I have never actually had to worry about this before. I mean, it’s always a concern if you’re making real CD games but my stuff has been all digital-first with physical editions as an afterthought and so I simply don’t even bother to think about it nowadays. And it’s a relief not to.
This one game posed a problem though because it was loading 70 giant high-def pngs during this time. … on as many threads as the system has. 70MB worth. It would ordinarily take longer to decode them than to read them so the threads are useful. It’s not a great idea on a console to let the OS read the media from so many threads, but it doesn’t really matter for this kind of thing since the reads are done whole-file-at-a-time and there’s no reason for the game or the OS to do anything smart about scheduling them. Actually in practice that work (when the game is fully installed) happens in the blink of an eye.
Except… in this case… PlayGo was busy installing the game to the hard disk. So the disc was being asked to constantly seek between the pngs and the task of reading the whole disc contents. That turned something which was supposed to take a blink of an eye plus an invisible background process (lasting about 10-20 seconds I believe) into 1 minute of thrashing.
Actually I didn’t even know PlayGo was gonna do that. I always thought that it would finish installing the game to the hard disk before it would boot, just like it would for a downloaded game (since I do not configure PlayGo to do anything smart.) But apparently not.
So after considering waiting for PlayGo to report that it was done, or adding a loading screen, I came up with something super safe and easy: my engine has the capability to load rom files into a memory cache by glob. This happens in one thread. So I use the synchronous version of this to load all the pngs into memory cache. This takes about 17 seconds since it still has to fight with PlayGo, but it’s only one thread so it isn’t as bad. Once this is done the pngs are loaded through unmodified code that transparently uses the cached files in my virtual filesystem–and that’s the part that takes a blink of an eye. So I transformed umpteen threads thrashing the disc to 2 and cut the runtime to 33%.
The moral of the story is that when dealing with media access on game consoles you are expected to do this kind of stuff. The OS isn’t gonna bail you out.
-
I’ve got this game with 12 hours and 23 minutes of extra music. It’s pretty good music, so it’s worth it… I’m just listening to some of it for the first time as I write this…
This is supposed to be split up between two skus somehow, and released on switch, so the media size is an issue (since it affects the manufacturing cost). We also have a to-be-determined amount of extra junk in the form of movies and collectible art. So I don’t know yet what format we’re going to ship that music in. And there’s no need to even encode it on the big console versions since the blu ray discs are huge. I want to encode the music as best as possible for the switch version, but it’s not clear what quality level to use. Currently I’m assuming it will be ogg, but I really should make some effort to support opus… but that library is kind of hard to compile. I gave up on it once. Hmm…
We’re using git for source control, so I didn’t want to repeatedly assault my repository with multiple versions of encoded music as I continually fool around and change my mind. I’m constantly worried about git repositories getting too large to be practical. So I commit them as wav to another repository as a submodule, knowing I could completely wipe that repository and recreate it if for some reason I need to. I used wav instead of flac for convenience… it was already gonna be a big repository, and cutting the size to 2/3 won’t really make a big difference.
So I made this little tool that will convert all the wav files into encoded forms in whatever formats and bitrates are needed. For development purposes I have it set to make low quality oggs just so our test builds aren’t huge. OK, problem mostly solved.
Now today I get a bunch more music and one of them is a 170MB file… even flac’ed, that’s more than 100MiB. Did I mention I’m using github for this and there’s a file size limit of exactly 100MiB for any file in the repository? I am pretty sure there is also a limit to how big a single push can be, so I commit and push them in ~100MB batches.
So… hmm… what to do? Well, since I have that encoding tool… I just split the wav in two halves and commit those. The tool now knows how to merge a -PART1.wav and a -PART2.wav file into a single one before encoding it. Problem solved!
-
So I’ve got this game that plays 4 movies at once. Well… due to how the engine works, it actually plays 8 movies at once–a 2nd instance is spun up paused and queued after the 1st instance so that it’s hot and ready to go in order to effect loops. This is actually important… many games in this engine are using it for looping effects, instead of sprite animations. It’s super annoying.
Needless to say, I did not plan for this possibility when porting the engine. When I first started handling this game, I added support for hardware-accelerated movie decoding, because I knew it was doing 2 at once at least (the 4/8 was a surprise) and game consoles can barely manage 1. Well… wouldn’t you know it, but some hardware-accelerated video decoders use, like, 13 threads per video. And game consoles are always limited in resources (the overhead and unpredictability required to create the abstraction of unlimited resources is an unacceptable imposition–you are supposed to accept the limitations in exchange for the benefits) so of course the threads are limited, also. Now the console-makers aren’t stupid. There’s plenty of threads for a game to play a movie or two. But 8 movies? Actually there’s a reason the movie decoder uses 13 threads that really pisses me off and would piss you off too if you heard it, but it’s too revealing to go into. Really unflattering to all involved. To be fair, I am a little wasteful of threads, too. In fact it takes me 3 threads to play a video on my own, on top of the 13 for the hardware decoder. One for mux, one for video, one for audio. But you can tell what some of these 13 threads are used for, and it’s so lazy.
Man, let me tell you some of the other things I did for this game. To cut down on time wasted copying 1080p frames (did I mention these were 1080p and even bigger videos?) from a hardware-accelerated output memory, to an engine-internal soft-surface, and then BACK to a texture… what I do is allocate a texture, specify it as the hardware-accelerated decode target, and then return the mapped pointer as the engine-internal soft-surface… and then, after all that, add some magic logic in my port layer that detects soft-surfaces which are actually mapped texture pointers which then elides the copy when creating the GPU texture from the soft-surface. Whew! That really took some doing. Probably one of the most amazing things I ever did, but what’s even more amazing about it is that I didn’t do it in a fork of our porting framework but feel it’s generally stable enough to keep around forever. WOW that does not happen often.
So… since I added the hardware decoder in order to decode movies fast enough, I no longer have the threads to do it (as soon as the game gets to the near-endgame point where 8 play at once).
I thought about re-engineering my system so it didn’t need 3 of my own threads per video, which would actually get me down to near the limit… I should really not be so sloppy… but I ran the numbers and even though it would put me in the safe zone (which is why I thought of it in the first place) it wouldn’t put me FAR ENOUGH in the safe zone. Who knows what this game may do on top of 8 videos? Ahhh, now’s the time to say, I’ve got like 30 threads reserved… for playing sound effects… sigh… ok, let me explain.
This engine doesn’t really distinguish between music and sound effects. Sound effects are streamed just like music. And sometimes a game will use a long sound effect for something like “guiselect.wav” or “chitchatter.wav” for no good reason–that is, it will have a long silent tail or a super long reverb or somesuch. Yeah, I could fix that, but I’m trying to build a rugged engine port here. And don’t forget, games may be playing some ambient sounds–doubled up for the looping the way this engine does. So I did some measurement and figured out 30 should be safe. I didn’t know I’d need 100 threads for movies when I figured that.
And anyway that re-engineering is… pretty not very safe. I don’t want to do this kind of thing now. This game has been a thorn in our side for months now.
So my next step was to find out if I can fix this doubling-up for the loops that the engine does. Turns out I could. I had to get help from the engine devs to figure out how, but if I simply played the second video after the first one ended, it would look nice enough – on this game, at least. I didn’t want to break any other games using this engine, so I did this hack just for this game. That combined with ANOTHER hack to lower the 30 multimedia threads to 20 (since by now I’d seen the characteristics of this game enough to know it would be safe) got me safely into the safe zone. Yeah, there’s safe zones and then there’s the safe zone in the safe zone. I honestly couldn’t believe I could spin up new movies without even hitching a frame, but that’s how it was. The videos, despite being 1080p, are actually a little low on the framerate side, so that helps things.
Now after all this, would you believe… in the very same scene that plays 8 videos… microseconds after it would have exhausted its threads and crashed… it now HANGS because of a new bug in my hardware-accelerated video decoding that caused threads to go into busy loops trying to consume frames without leaving the main thread free to produce them? Well, that was just my bad. Only a game that used so many videos at once would have ever smoked that out. Took some time to find but wasn’t so hard to fix. Would not have been a problem if I hadn’t had to hardware-accelerate the videos for this game.
(Actually I’m glad I finally did the hardware-acceleration–I felt like such a loser leaning on libvpx to do it, even though I have nice things to say about the library. It’s just not the Console Way to use that, though)
This @*#@ing game also smoked out a 1/10000 times problem in my basic video playback logic which would rarely hang due to a race condition between the engine consuming the last frame and the porting framework producing it. It just plays so many videos that it actually made it happen reliably. So I had previously fixed it (just for this game) but made a mistake in it: I’d added a std::atomic<int> which wasn’t initialized to zero. Yeah, those aren’t initialized to zero. In general it’s completely impossible to predict what anything in c++ is initialized to nowadays, due to some combination of the semantics being completely off the rails in complexity and the language design being absolutely hostile to users (I mean, even 20 years ago we were joking that everyone knew only half of c++ but it’s a different half for everyone); and I am only very slowly learning how to properly initialize everything. I mean, you don’t want to type std::vector<int> = {} – isn’t that stupid? But you better type std::atomic<int> = {0} or else it could be garbage (std::atomic<int> = 0 is a compile error for reasons so complex explaining them would be a great term paper for a graduate student)
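For the record, here’s the trap in miniature (this is basically the bug, reduced; the struct and names are made up):

#include <atomic>
struct FramePipe
{
    std::atomic<int> framesProduced;      // default-constructed: indeterminate value (before C++20, anyway)
    std::atomic<int> framesConsumed{0};   // brace-initialized: reliably zero
};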
By the way, while I’m typing this, I’m nursing the game through an autoplay mode with my left pinkie finger. On top of all this other crap (and more problems just for this game I haven’t told you about) the game is simply running out of memory due to some unlucky timing of spikes before garbage collections. So I’m testing some complete playthroughs now to see if some finetuning is going to work to fix that.
Now I’m adding “playthroughs” and “finetuning” and “autoplay” to my dictionary file in this text editor. Man, what an annoying day.
Well, I’ve got 866MB memory free and the game’s 2/3 over. Time to focus on ending this thing (and then test the campaign again without rebooting the game, just in case). Seeya!
-
One of the funnest parts of my job is picking out icons for projects on our gitlab server. I try to balance encapsulating the spirit of the game with finding something iconic while trying to keep it varied so it’s not a wall of 8bit sprites or anime girls. I don’t always do a good job but I do put some serious consideration into it. I made a collage of jumbled up icons.
-
Let’s talk about hacking roms. Well, more precisely, let’s talk about NOT hacking roms.
I can do it, but I’m not great at it… and I find most emulator debuggers quite disagreeable. I think it would take me a long time to do some of the hacks I need to do. I have some workarounds that have proved quite versatile… they are MORE versatile than hacks, in some cases. And less, in some others. But they can be way faster than hacking, in some cases.
First, I should mention, classic game genie and action replay cheats are easier than hacks. Yeah, sure, always having 99 lives isn’t quite as elegant as it would be to simply never lose any of your 3 starting lives. And blinking invincible as if you just got hit for the entire game isn’t quite as elegant as simply clipping through bad guys without taking damage. But it does get the job done.
Moving on to more interesting things: I call them “input macros”. As a simple example, I have a game with a hidden mode that requires pressing select five times on a menu. Now, I don’t want to even have to expose the select button to the user as a binding. So what I do instead is configure the triangle button to set up a sequence that seizes control of the input and sends a series of five select button presses across several frames. It can get more complex than that. In Turrican, we have dozens of screens full of code doing this kind of stuff. For instance, to drop a bomb you have to crouch and shoot. Ideally we’d have a button that does this instead, so we have to crouch for the right (minimum) number of frames, press the button at the right time, and stay crouched long enough for it to register.
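A toy version of the five-selects macro, just to show the shape of it (all the names here are made up; the real thing lives in the port’s input layer):

#include <cstdint>
static const uint32_t BTN_SELECT = 1u << 2;   // placeholder bit for the emulated select button

struct MacroStep { uint32_t buttons; int frames; };
static const MacroStep kFiveSelects[] = {     // press, release, press, release... across several frames
    { BTN_SELECT, 2 }, { 0, 2 }, { BTN_SELECT, 2 }, { 0, 2 }, { BTN_SELECT, 2 },
    { 0, 2 }, { BTN_SELECT, 2 }, { 0, 2 }, { BTN_SELECT, 2 },
};

struct MacroPlayer
{
    const MacroStep* steps = nullptr; int count = 0, index = 0, framesLeft = 0;
    void start(const MacroStep* s, int n) { steps = s; count = n; index = 0; framesLeft = s[0].frames; }
    bool active() const { return steps && index < count; }
    uint32_t tick()   // called once per frame while active(); returns the buttons fed to the game, ignoring the real pad
    {
        uint32_t b = steps[index].buttons;
        if (--framesLeft == 0 && ++index < count) framesLeft = steps[index].frames;
        return b;
    }
};
// e.g. when the user hits triangle on that menu: macro.start(kFiveSelects, 9);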
Another input-related thing we need to do is to fix menus so they behave in more sensible ways, using the console’s normal accept/cancel buttons (which varies by console and region of course, most noticeably differing between Nintendo and everyone else). This requires unhooking the normal input bindings completely and simulating a different kind of setup depending on whether we’re in a menu. Detecting if we’re in a menu… can be hard. In theory somewhere there’s probably a variable that says what game mode we’re in… maybe, maybe not. Maybe you can only tell by noticing that certain code only runs when you’re in a menu, and you can detect when that PC is getting hit. That’s too close to hacking for my tastes though. Here’s what I do instead: I scan the VRAM for particular patterns of tiles. That works for any screens that have some unique static elements. If that’s not good enough, and it’s a transitory condition, you can set up a state machine to catch transitions between menus.
Scanning VRAM is a dumber, more mechanical thing: the game draws this, I do that. If you try to analyze the game’s programming, you can get into mind games really quickly. The game will be reusing code for something, or it will do every screen in a completely different way (that’s totally normal; each screen may have been programmed by a different person or be pushing the hardware enough to be basically a different program), so your reverse engineering workload ends up proportional to the number of scenes. But scanning VRAM is the same trick every time, and you can build tools to help you do it.
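In miniature, the detection is just this (the address and tile values are made up):

#include <cstdint>
#include <cstring>

// the shop menu always draws these exact tiles at this tilemap address, so seeing them = we're in the shop
bool inShopMenu(const uint8_t* vram)
{
    static const uint8_t kShopHeaderTiles[] = { 0x41, 0x52, 0x60, 0x60, 0x71 };
    return std::memcmp(vram + 0x3C40, kShopHeaderTiles, sizeof(kShopHeaderTiles)) == 0;
}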
The final workaround I want to mention is the inverse of scanning VRAM – it’s modifying it! Suppose a game prints something static in Japanese and you want it to be in English. The tiles are compressed, the tilemap data’s compressed… you’re gonna crack all that, PLUS the code that draws it, and redirect it all somewhere? NOPE! Detect it, and replace it. This is easy if you have spare tiles in VRAM. But you don’t always…
I came up with a neat way to fix that which has totally worked for me on the PC Engine. It’s not directly scalable to the other consoles, but I’ll tell you how it might be (not proven yet.) The PC Engine is odd for having twice as much VRAM addressable for tiles as it has actual VRAM. So you can just park extra tiles in the 2nd half of the address space and rewrite the tilemap to reference them. A big concern when patching art like this is whether it will fade out with the rest of the screen. You could patch it with pngs that don’t use the proper console palette, but what’s going to make it fade out? It can be really tricky (in an emergency you can fade it yourself by checking a known-white color in the palette).
Here’s the plan I have for other consoles (and I’m definitely going to have to attempt it on the SNES soon). Who says the hardware can’t have a separate bitplane tacked onto it in a way that the CPU can’t even access? I can access it from the emulation framework. So I will just “add” tiles by putting them in an auxiliary bank of vram and then set a bit in the separate fake bitplane which will signal the system’s PPU renderer to fetch the tiles from the auxiliary bank.
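Sketched out, with everything hypothetical (the real thing has to live inside the emulator’s PPU fetch path):

#include <cstdint>

extern uint8_t vram[0x10000];       // the console's real VRAM, visible to the emulated CPU
extern uint8_t auxVram[0x10000];    // our extra bank, written only by the porting framework
extern uint8_t useAuxTile[0x1000];  // the fake "bitplane": one flag per tile slot, also invisible to the game

// called by the PPU renderer when it fetches pattern data for a tile
inline const uint8_t* tileData(uint32_t tileIndex, uint32_t bytesPerTile)
{
    const uint8_t* bank = useAuxTile[tileIndex] ? auxVram : vram;
    return bank + tileIndex * bytesPerTile;
}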
This is kind of an approximation of how consoles have extended their PPU capabilities before (the GBA and the NDS did quite a bit of this). There’s no reason you can’t apply it more generally, though. I feel sure you could do it on a playstation, for instance. In fact every HD textures approach for 3d consoles is working in a philosophically similar way. There’s additional resolution packed away in a buffer not accessible to the CPU (because it wouldn’t know how). I look forward to running into the problem on a PSX game, but I have to find a PSX game to port first…
-
There’s something everyone must run into at some point. You’re happily rendering UI straightforwardly and then all of a sudden, for one reason or another, you need to composite two layers. But you want it to look exactly (give or take some numerical precision issues) like it did before.
I’ve done this before but never in a fully satisfactory way. This time I decided I wanted the perfect solution. I decided to write this out while I work through it.
When compositing layers, the first problem you run into is that the alpha values in the top layer are nonsense. Well, what should they even be, anyway? The alpha values are meaningless until you need to composite that layer. You probably have no control over those values. They might be all 0, or all 1. They might be a complete mess.
Let’s say I have layer TOP and BOT and BOT just contains a wallpaper. I clear TOP to 0,0,0,0 and then draw an opaque triangle texture on it. Now to composite I draw TOP normally, as long as I replicate the triangle’s alpha values into TOP.
Suppose I scale the triangle a bit so it’s linear filtered. Now what we will run into is texels (perhaps 1,1,1,0) from outside the triangle texture getting incorporated into the filtering. You have to use premultiplied alpha to fix that (a 1,1,1,0 pixel will become 0,0,0,0 after multiplying rgb *= a). This is well-known. I just wanted to get it out of the way and make you visualize the fuzzy-edged alpha values that will get produced by this. If we had a white triangle, we may end up with (0.5,0.5,0.5,0.5) at the edge (with premultiplied alpha, that’s a half-transparent full-white).
When this is drawn normally onto BOT, everything still works normally.
Now, let’s draw a big red semitransparent box over all that. Let’s do it at alpha=0.1 … now what should the alpha values be where the semitransparent box overlays the fuzzy edges in the triangle? We have 0.5 alpha clashing with 0.1 alpha. Do you want to multiply them? That’s wrong… it suggests we now have an alpha of 0.05 which is basically invisible. The art on TOP should have produced higher alpha values. Do you want to add them? You’ll exceed 1.0 at some point. Addition/subtraction isn’t logical and it won’t work.
There’s not a lot you can do with the ROP (blending) units on the GPU so you may be tempted to go to the shader. But there’s no such thing as a free lunch… you can’t just do what amounts to blending on the shader without losing something, and it turns out that’s performance. GPUs just aren’t meant to do this. Maybe you know a way to do it with a cutting edge GPU and API, but: I doubt it, and there’s still no way it’s as fast as if you used the very old GPU programming models that were designed to be efficient from the outset. And anyway you still have to figure out the math!
In theory, if we weren’t compositing, after drawing the triangle edges we would have something like this:
WALLPAPER * 0.5 + TRIANGLE * 0.5
and after drawing the box we would have
(WALLPAPER * 0.5 + TRIANGLE * 0.5) * 0.9 + BOX * 0.1
-> WALLPAPER * 0.45 + TRIANGLE * 0.45 + BOX * 0.1
We note that the wallpaper represents the BOT layer, so when we composite (using only the information created while rendering TOP) we somehow need to end up with a TOP alpha of 0.55 in order to blend the layers correctly. In particular we needed to get:
0.55 = 1.0 - 0.5 * (1-0.1)
dest_alpha = 1.0 - dest_alpha * (1-src_alpha)
So now we have to figure out how to produce that with normal GPU resources. Right off the bat, we have the ROP blending capabilities problem. While we could multiply DST and ONE_MINUS_SRC, there’s no way to get the extra 1.0- in there. Maybe we can do some algebra to help?
dest_alpha = (1.0 - dest_alpha) + (dest_alpha * src_alpha)
This isn’t helping, there’s just no way to deal with the 1.0-. Even though we have ONE_MINUS_DST_ALPHA, we can’t use it without multiplying it with SRC_ALPHA or DST_ALPHA which we don’t want (that’s simply how the blenders work.)
So what we’ll have to do is invert our sense of alpha so we don’t have this problem. Now we’d need to have 0.45 in there (we’ll un-invert this sense in the compositing stage, trust me). So to be clear:
0.45 = 0.5 * (1-0.1)
dest_alpha = dest_alpha * (1-src_alpha)
In other words, it’s where we got stuck before but without the 1.0- because that’s always baked into the dest_alpha.
Well, this looks much easier. Now we can use a ONE_MINUS_SRC_ALPHA * DST_ALPHA configuration in the blender. But there’s one snag. Instead of clearing to 0,0,0,0 (fully transparent, ordinarily) we now need to clear to 0,0,0,1 (fully transparent, with the inverted alpha sense).
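In OpenGL terms it comes out like this (just a sketch; any API with separate color/alpha blend factors can express the same thing, and topLayerFbo is assumed to exist):

// the offscreen TOP pass: premultiplied color, inverted alpha sense
glBindFramebuffer(GL_FRAMEBUFFER, topLayerFbo);
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);                  // "fully transparent" is alpha=1 in this scheme
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_BLEND);
glBlendFuncSeparate(GL_ONE, GL_ONE_MINUS_SRC_ALPHA,    // color: premultiplied 'over'
                    GL_ZERO, GL_ONE_MINUS_SRC_ALPHA);  // alpha: dest_alpha *= (1 - src_alpha)
// ...draw the UI into TOP as usual...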
Now, how to composite it? Well, this is what’s needed (with the un-inverting taken into consideration):
dest_color = (dest_color * src_alpha) + (src_color * (1.0-src_alpha))
And that’s simply a blender configuration of:
SRC_COLOR * ONE_MINUS_SRC_ALPHA + DST_COLOR * SRC_ALPHA
The inverting of the alpha sense happens here, where you can see things are plainly backwards-seeming.
In all my years of doing this kind of stuff, I think I’ve never seen it written out this way before. You’d think the key characteristics of having an inverted sense of alpha and clearing to 0,0,0,1 would be memorable. But no… Finally, after writing all this, I thought of enough keywords to find an article that mentioned it, in the 2nd half:
http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html
A “little bit of head scratching” indeed.
I posted this anyway so you could see my analytical method involving trial numbers and so I’d have something to use to stop losing gruedorf.
UPDATE 2022:
It seems I forgot something here. Ordinarily when using premultiplied alpha, it affects the blending you use during composition. Without it you’d have:
SRC_COLOR * SRC_ALPHA + DST_COLOR * ONE_MINUS_SRC_ALPHA
With it you have:
SRC_COLOR * ONE + DST_COLOR * ONE_MINUS_SRC_ALPHA
So let me correct the analysis above. I said we’d need this:
dest_color = (dest_color * src_alpha) + (src_color * (1.0-src_alpha))
But actually src_color has src_alpha already incorporated via the premultiplication (actually it’s all muddled due to the inverse sense of alpha). So really what we need is this:
dest_color = (dest_color * src_alpha) + ((src_color/(1.0-src_alpha)) * (1.0-src_alpha))
-> dest_color = (dest_color * src_alpha) + src_color
And thus a blender configuration of
SRC_COLOR * ONE + DST_COLOR * SRC_ALPHA
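Which, in the same OpenGL-sketch terms as before, is just:

// compositing the premultiplied, inverted-alpha TOP texture onto BOT
glBindFramebuffer(GL_FRAMEBUFFER, botLayerFbo);    // 'botLayerFbo' assumed, like before
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_SRC_ALPHA);                 // SRC_COLOR * ONE + DST_COLOR * SRC_ALPHA
// ...draw a fullscreen quad sampling the TOP texture...
-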
“How do you handle movie playback portably?” you ask. I’m glad you asked! I discussed it before. Let me write some more about it.
Every console vendor gives you a video player sample which is about 50 times more complex than it needs to be and is bogged down with unnecessary clutter due to taking care of showing frames and playing sound, and possibly handling 99% useless features like resolutions changing mid-video, metadata events, and convoluted selection of multiple video tracks. OK maybe only 90% useless, but 90% and that’s a fact.
Most game studios will just fiddle around with those samples to make a basic canned FMV player. Heck, all that half of the game studios need is to play a vain publisher’s animated splash screen. But if you want to mix the audio with other things and render stuff on textures, and have fine control over seeking and rewinding and stuff, you have to rip it all apart. It was Hypnospace Outlaw I had to originally do this work for, because a canned FMV player did not cut it for that game, which has videos playing in windows all over the screen and even “glitch” effects achieved by seeking around crazily.
I didn’t look forward to ripping this all apart for several consoles, so what I did instead was find a software video codec that would work for me. Now this isn’t easy. Most software video codecs are either GPL, uncompileable trash, or just plain weird trash. After filtering out all that, I was left with libvpx which is pretty awesome. I got it compiling portably and even using multiple threads for decode without any trouble. And as a codec, it rocks.
There’s another benefit of libvpx which made me glad I picked it: the switch can hardware decode vp8 and vp9 natively. So in theory, I would be able to use the same videos on all consoles without having to change the encoding. Now, I didn’t do that at first, because the Switch could actually decode a 1080p30 video in software fast enough, but eventually I came across the scenarios I’ve mentioned before which made me need to graduate to the hardware video decoder.
Now, it’s important that I share as much code as possible between the hardware and software video players. I came up with a design that turns the actual decoder into a standard interface with a libvpx implementation and a switch hardware implementation. It’s too much detail to go into now, but the gist of it is that it is very simple and consists of just sending it input frames and grabbing output frames. Seeking isn’t supported; the decoder just expects to start with an I-frame (which may be in the middle of the video). If I need to seek I actually tear down the whole apparatus and boot it up again ready to seek to the needed I-frame. None of this is really astonishing, but it was surprising to me how simple a properly designed codec interface can be when the unnecessary concerns are removed. And this information is needed to see how the other parts are organized.
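The interface ends up being roughly this shape (illustrative names, simplified):

#include <cstdint>
#include <cstddef>

struct DecodedFrame { const void* pixels; int width, height, pitch; };

class IVideoDecoder
{
public:
    virtual ~IVideoDecoder() {}
    virtual bool pushFrame(const uint8_t* data, size_t size) = 0;  // one compressed frame, starting from an I-frame
    virtual bool popFrame(DecodedFrame* out) = 0;                  // false if nothing is ready yet
    // no seek: to seek, tear the decoder down and feed it again from the nearest preceding I-frame
};
// one implementation wraps libvpx, the other wraps the switch's hardware decoder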
Before I explain the overall organization I need to tell you about the IVF file format. This is basically a mind-numbingly simple video stream format designed alongside the vpx codecs. It’s so simple that it doesn’t even necessitate a library. Imagine the simplest format you can for doing this kind of thing, and then imagine IVF is simpler. FFmpeg has support for writing to IVF streams.
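From memory (so double-check against the actual spec before relying on it), the whole format is basically these two structs:

#include <cstdint>
#pragma pack(push, 1)
struct IvfFileHeader            // 32 bytes, once at the start of the file
{
    char     signature[4];      // "DKIF"
    uint16_t version;           // 0
    uint16_t headerSize;        // 32
    uint32_t fourcc;            // e.g. "VP80" or "VP90"
    uint16_t width, height;
    uint32_t frameRateNum;      // frame rate = frameRateNum / frameRateDen
    uint32_t frameRateDen;
    uint32_t frameCount;
    uint32_t unused;
};
struct IvfFrameHeader           // 12 bytes before every compressed frame
{
    uint32_t frameSize;         // bytes of frame data that follow
    uint64_t timestamp;         // note: nothing here tells you I-frame vs P-frame
};
#pragma pack(pop)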
So with all that background in mind, finally here’s my approach:
I typically receive webm files with vpx codec in it. I run that through a script that dumps the video stream to a .ivf file and encodes the audio as ogg (it’s not worth the trouble to get libopus compiling natively, that library is a mess.) I then have a tool that runs through the ivf file to create an index for seeking. Yeah, IVF is so simple it doesn’t have that. I’m telling you, when you have simple enough formats and concerns separated properly, it’s easier to look at specs and code everything from scratch than it is to figure out how to wrestle overcomplicated libraries needed for overcomplicated formats. Unfortunately the IVF frame headers do NOT say whether the frame is I-frame or P-frame … it’s somewhere in the packed frame data. So my tool has to run ffprobe to gather that data. I create my own “.dat” file with all this stuff, and it’s now alongside a .ivf and .ogg.
The very first thing that happens during playback is that the ogg is handled by the exact same codec that plays all music in my engine. That stream (call it an OggAudioStream) is routed through a MuxStream which also gets a VpxVideoStream. So already things are simpler by audio being done and out of the picture.
Next I have the portable layer that manages the .dat index. It gets the resolution out of there and does the seeking as needed–it goes to the nearest preceding I-frame for the desired seek timestamp and then drives the VPX decoder, throwing away frames as needed. This layer is seeking through the .ivf file and reading the video frames out. Finally all that’s left is hooking up the correct hardware or software video decoder to do the decoding of the frames read from the .ivf.
You may be asking why I don’t use libwebm directly. It’s because that library was a nightmare to compile and I would have to pipe the audio out of it somehow, and it was easier to just make my own “container” format and leverage IVF.
I could combine the .dat and .ivf into a true container, but… why? I would never put the ogg data in there also–it’s so nice to have it played basically totally autonomously. My MuxStream determines its timing from the audio component (or the video component in case there’s no audio track).
Any videos I receive that aren’t vpx encoded just get transcoded, which isn’t ideal in principle, but vpx is such a good codec that it basically loses no quality, and the information already discarded by the original encoding even helps it produce smaller files. Like, if you ever have to transcode anything, transcoding to vpx is the way to do it. It seems to do an amazing job.
None of this is a great revelation, really. But it is a great relief to have video playback done so well that it’s a fully solved problem… without having to pay for a 3rd party library on every game (bink or smacker or whatever people are using nowadays). The FMV splash-screen displayer process source code is only 100 lines. There are some areas for improvement in the API that’s interposed between all the stuff I’ve described and the “user” API, due to how things evolved, but being able to drop a movie into your game in 100 lines is pretty good. With about 100 more lines I could scale it down, stick it in a window, and make a gamepad-controlled FMV mediaplayer, and that would take just about an hour.
More importantly, I feel good enough about this general approach that I think I can recommend it for everyone, in every situation. This problem is SOLVED.
-
Welp. I forgot oggenc litters its output oggs with random garbage that’s different on every run.
So now I’ve got 600MB patches every time I make a new build unless I change the build process for this game to copy over the ogg files from the first rom version that got approved. That’s because ogg by default puts a serial number in each internal chunk (i.e. 100s of times in the file at unpredictable offsets) which is seeded randomly every time oggenc runs.
Moral of the story: if you’re running oggenc without “-s 0” to seed the serial number to 0, then you’re doing it wrong. Always, always, always, always.
In fact at one point I even made a custom oggenc that always does that so there’s no way I could forget… except… I forgot to use that custom oggenc this time.
-
I’m working on a big Wonder Boy collection. The biggest challenge is keeping all these names straight. Wonder Boy 5 Monster World 3? Wonder boy 3 – dragon’s trap, or monster lair? Wonder boy 4 was skipped because there were two wonder boy 3’s… when I started working on this project, the first thing I did was come up with a naming scheme so I wouldn’t get confused by everything. This post is about WBIML - wonder boy in monster land (basically, monster world 1 [wonder boy in monster world is not monster world 1, it’s monster world 2]).
The game lets you buy some magic spells / items. Fireballs, bombs, that kind of stuff. But there’s no menu to choose what’s equipped. I think it’s because it was actually an arcade game first and they didn’t want people to fool around in the menu forever… nor add another button for hot-swapping it. The SMS port of the game could have been a chance to improve that, but they seem to have kept it just like the arcade game for some strange reason. Like, time limit and all. The time limit really doesn’t make sense for this game…
So what happens is when you get an item, it’s pushed on a stack, and you can’t access whatever item you had before until you exhaust all of that new batch you’ve got.
Well, this game is pretty… bleh. I couldn’t stand it being totally pointlessly insufferable, too. So I added a system that lets you choose your magic by hitting a hotkey.
In order to do this, I have to run through memory to see what the game’s put in the stack. I collect these in large-sized bins and keep track of the order I found them in. Then I clear all that memory and write them back out in 0x7F sized bins. If the user pressed the “select magic” hotkey, I rotate the ordering before writing them back.
I then have to cause the HUD to get updated. In the arcade version I found a flag that does this; I just set it to 1 and the game knows to update the hud (I found the code that acquires a new item setting that flag). For the SMS version I couldn’t find such a flag, but I found the function that updates the hud, so I call that from my frontend by saving the cpu state, meddling with it to have the right SP and PC, and stepping the instruction loop (without stepping the whole machine) until the SP passes the original value (i.e. the RET was executed).
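The trick looks roughly like this, written against a made-up CPU-core API (the real emulator’s interface is different, but this is the shape of it):

void callEmulatedFunction(Z80Core* cpu, uint16_t entryPoint)
{
    Z80State saved = cpu->getState();     // remember exactly where the game was
    uint16_t originalSP = cpu->getSP();
    cpu->setPC(entryPoint);               // jump straight into the game's hud-update routine
    while (cpu->getSP() <= originalSP)    // its final RET pops the stack past the original SP...
        cpu->stepInstruction();           // ...stepping only the CPU, not the rest of the machine
    cpu->setState(saved);                 // then put everything back as if nothing happened
}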
The best way to do these hacks is to hack the rom to use a physical button that doesn’t really exist, but… I like to try doing things the other way sometimes. Just to prove I can. Plus, occasionally (commonly?) the ROM will be totally full and there’s no place to put that code. You can always expand the rom, but this is apparently kind of a fraught process. Or so I’m told. I don’t understand why… really you should just use what little bit is available to set up a “syscall” by ID mechanism that makes a “far call” to another rom bank. Maybe I’ll do it that way some day.
I’m not sure if this feature will make it into the final game… if QA finds problems with it, depending on WHEN they find the problem, I may have to nix it rather than incur more risk. Why we can’t QA while the game is being developed instead of at the last minute is a story for another day…
-
Wonderboy has such delightfully candy-like tiles. It’s always been so nice just to look at!
-
@xor_mgambrell_0 This is cool! I’m excited to see the launcher UI. Also I never knew there was an arcade version of the wondrous lad.
-
We didn’t like how the money in this NES game was spawning (too rare, seemed broken) so we started investigating it. Seems it used a threshold of $A0 and it gave money whenever the threshold was exceeded by a random roll. But this didn’t seem to explain the whole story. So we checked the random number generator.
It seems to be four state bytes, each for a stream, where a stream is just a sequential walk through a table of 256 bytes of entropy. And each time a random number is required, it moves to the next stream, so that as long as there’s an odd number of things requiring random numbers, it will be getting them from multiple streams and not stuck in a predictable sequence unto itself. It works out okay but I don’t know if a choice of four streams makes for a good divisor or whatever. Perhaps something prime?
So the algorithm is basically:
byte curr;
byte state[4];
byte randtable[256]; // supposedly 256 bytes of entropy
byte nextrand() { return randtable[state[curr++ % 4]++]; }
This all seemed alright, except that randtable doesn’t contain entropy. I’m pretty sure it contains SOME OTHER table (I don’t know what, yet). I surmise the devs ran out of room in that bank (I certainly couldn’t find room for a 256 byte table in there) and just said whatever, this looks jumbled enough.
As long as the random numbers are used to choose a random wandering pattern, it’s not so bad, but when this $A0 threshold is used for something simple like a money drop, it makes it almost impossible to tune sensibly. For example, if we changed the threshold to $80 but there were no values between $80 and $A0 in the table, then it would make no difference to the drops.
So we’re either going to TRY HARDER to find room in the rom for the table. Or, since this game has WRAM (battery-backed) and I’ve studied its use of the WRAM, I happen to know that it doesn’t use it all (almost no game does), so we could just copy the table to $6F00 (the save game slots begin at $7000) at bootup using whatever extra banks we can trivially tack on (tacking on banks to run new code at bootup is substantially easier than inserting code patches all over that switch to other banks on the fly, or finding room to jam stuff into the common bank).
One final detail to mention: how is state[] initialized? When entering a dungeon, state[0] is set from the global frame counter, and then state[n] is set as randtable[state[n-1]]. This means there’s only 256 configurations… but I doubt anyone will notice.
-
More NES emulation…
We didn’t like how this other RPG game was rolling random encounters. It happened on the first step too often. Very annoying. I think there needs to be some kind of step counter setup with a minimum value… I don’t know how many games do that. Is it possible they have arranged the statistics so that low rolls are less common? I haven’t checked this in a while.
This game has 4 terrain types, and up to 4 encounters per terrain type; on top of that it is organized into zones in the world map (thus further-away zones have harder encounter sets). There is only one monster per encounter, so encounter type = monster type. The data is organized this way:
struct EncounterDataForWorldZone { byte chances[4][4]; byte monsters[4][4]; };
EncounterDataForWorldZone encounterData[N]; // N zones in the game, N unknown
Now, for each step, the game gets a random number R from a flat distribution. It then considers chances[currterrain][0] = C as a C/256 chance of getting the encounter on that step. We see if R is less than C and do the encounter if so. Otherwise C is subtracted from the random number, so the next chance C’ from chances[currterrain][1] is a C’/(256-C) chance. And so on.
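In code form, the per-step roll is something like this (my reconstruction, using ‘byte’ like the earlier snippet):

typedef unsigned char byte;

// returns which of the 4 encounter slots fires this step, or -1 for no encounter
int rollEncounter(byte R, const byte chances[4])
{
    for (int i = 0; i < 4; i++)
    {
        if (R < chances[i]) return i;   // chances[i] out of whatever range R has left
        R -= chances[i];                // otherwise shrink the roll and try the next slot
    }
    return -1;
}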
Generally the chances near the beginning of the game, where we were testing, are, I don’t know, 16-20 or so. So it’s easy to see why we get too many encounters… and why they can happen on the first step so often.
My first thought was to use the old “turn it into a normal distribution” trick by changing a 1d256 roll into a 2d128. This biases more normal rolls in the middle at the expense of rolls in the tails. So we did this for a while, and it felt okay… but we weren’t testing the whole game, and I got nervous.
Let’s take a simplified scenario where we have 2 encounter chances, each of value 64. Now the original way it was done, each of those would be a 64 in 256 chance (1 in 4). Well, kind of, because of how the subtraction happens in the algorithm. Anyway, with two dice there are fewer ways of rolling the smallest 64 numbers than there are the next 64 numbers (just like there’s only one way to roll a 2 on two dice but there are many ways to roll a 7). And yet since they’re adding up to 128, the new probability distribution is centered on 128, and the absolute probability of an encounter is still the same! All we accomplished was to bias the selection to the later encounters. This would change the character of the game in a way we don’t really have time to test and scrutinize.
So my romhacker came up with a better idea to add a tiny bit of logic to only do encounter checks on every other step. This way you won’t get them on successive steps under any circumstances, and we get the other desired result of just generally fewer encounters–precisely half as many, without having to grapple with the really thorny problem of how to do it, logically speaking, without affecting the design of the game by biasing certain encounters. Actually after a little more iteration we developed an approach to run it with any divisor so that I could give another option to reduce encounters by half again (so a divisor of 4) from the emulation menu, so people would feel a bit empowered to be masters of their own destinies instead of frustrated by the fixed design. (There’s also a cheat to disable encounters entirely)
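The logic of the fix is about this small (a sketch of the idea; the real thing is a patch in the emulation layer, not C):

static unsigned stepCounter = 0;
static unsigned encounterDivisor = 2;   // 2 = half as many checks, 4 = a quarter, selectable from the emulation menu

bool shouldRollEncounterThisStep()
{
    return (++stepCounter % encounterDivisor) == 0;
}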