xor_mgambrell_0

xor_mgambrell_0

Let’s talk about hacking roms. Well, more precisely, let’s talk about NOT hacking roms.

I can do it, but I’m not great at it… and I find most emulator debuggers quite disagreeable. I think it would take me a long time to do some of the hacks I need to do. I have some workarounds that have proved quite versatile… they are MORE versatile than hacks, in some cases. And less, in some others. But they can be way faster than hacking, in some cases.

First, I should mention, classic game genie and action replay cheats are easier than hacks. Yeah, sure, always having 99 lives isn’t quite as elegant as it would be to simply never lose any of your 3 starting lives. And blinking invincible as if you just got hit for the entire game isn’t quite as elegant as simply clipping through bad guys without taking damage. But it does get the job done.

Moving on to more interesting things: I call them “input macros”. As a simple example, I have a game with a hidden mode that requires pressing select five times on a menu. Now, I don’t want to even have to expose the select button to the user as a binding. So what I do instead is configure the triangle button to set up a sequence that seizes control of the input and sends a series of five select button presses across several frames. It can get more complex than that. In Turrican, we have dozens of screens full of code doing this kind of stuff. For instance, to drop a bomb you have to crouch and shoot. Ideally we’d have a button that does this instead, so we have to crouch for the right (minimum) number of frames and then presses the button at the right time and waits for long enough crouched for it to register.

Another input-related thing we need to do is to fix menus so they behave in more sensible ways, using the console’s normal accept/cancel buttons (which varies by console and region of course, most noticeably differing between Nintendo and everyone else). This requires unhooking the normal input bindings completely and simulating a different kind of setup depending on whether we’re in a menu. Detecting if we’re in a menu… can be hard. In theory somewhere there’s probably a variable that says what game mode we’re in… maybe, maybe not. Maybe you can only tell by noticing that certain code only runs when you’re in a menu, and you can detect when that PC is getting hit. That’s too close to hacking for my tastes though. Here’s what I do instead: I scan the VRAM for particular patterns of tiles. That works for any screens that have some unique static elements. If that’s not good enough, and it’s a transitory condition, you can set up a state machine to catch transitions between menus.

Scanning VRAM is kind of more of a dumb thing. You do this, I do that. If you try to analyze the game’s programming, you can get into mind games really quickly. The game will be reusing code for something, or it will do every screen in a completely different way (that’s totally normal; each screen may have been programmed by a different person or be pushing the hardware enough to be basically a different program) so your reverse engineering workload ends up proportional to the number of scenes. But scanning VRAM is the same trick every time, and you can build tools to help you do it.

The final workaround I want to mention is the inverse of scanning VRAM – it’s modifying it! Suppose a game prints something static in Japanese and you want it to be in English. The tiles are compressed, the tilemap data’s compressed… you’re gonna crack all that, PLUS the code that draws it, and redirect it all somewhere? NOPE! Detect it, and replace it. This is easy if you have spare tiles in VRAM. But you don’t always…

I came up with a neat way to fix that which has totally worked for me on the PC Engine. It’s not directly scalable to the other consoles, but I’ll tell you how it might be (not proven yet.) The PC Engine is odd for having twice as much VRAM addressible for tiles as it has VRAM. So you can just park extra tiles in the 2nd half of VRAM and rewrite the tilemap to reference it. A big concern when patching art like this is whether it will fade out with the rest of the screen. You could patch it with pngs that don’t use the proper console palette, but what’s going to make it fade out? It can be really tricky (in an emergency you can fade it yourself by checking a known-white color in the palette).

Here’s the plan I have for other consoles (and I’m definitely going to have to attempt it on the SNES soon). Who says the hardware can’t have a separate bitplane tacked onto it in a way that the CPU can’t even access? I can access it from the emulation framework. So I will just “add” tiles by putting them in an auxiliary bank of vram and then set a bit in the separate fake bitplane which will signal the PPU renderer for the system to fetch the tiles from the auxiliary blank.

This is kind of an approximation of how consoles have extended their PPU capabilities before (the GBA and the NDS did quite a bit of this). There’s no reason you can’t apply it more generally, though. I feel sure you could do it on a playstation, for instance. In fact every HD textures approach for 3d consoles is working in a philosophically similar way. There’s additional resolution packed away in a buffer not accessible to the CPU (because it wouldn’t know how). I look forward to running into the problem on a PSX game, but I have to find a PSX game to port first…

xor_mgambrell_0

There aren’t a lot of PS5 games. Not as many as you’d expect. I mean, you can get a lot of games on your PS5 but most of them are PS4 games.

That may or may not have to do with this:

In order to make a PS5 game, you have to implement this complicated thing Sony is trying to promote called “activities” or possibly “stories”. I don’t know exactly what it’s called; it’s complicated.

The idea is that it should let people jump into the game from the OS with minimum fuss – no splash screens (insofar as it’s possible), no title screens, no menus, no file selects. It’s like a menu your game has to support, outside the game. You could call it kind of like windows jump lists. I might have a limited view of this feature (I think it’s way more powerful) because I have made limited use of it.

If you haven’t heard of this thing, don’t feel bad. I don’t know if anyone actually uses it.

OK, so you want to do this. Prepare to turn your game inside out. You need to boot straight into gameplay, so any accidental dependencies your game had on earlier screens need to be handled differently. You have to tell the OS when the player has begun an activity (so it puts it in the list of resumable activities) and tell the OS when the player has ended it. You need to be able to handle an instruction from the OS to jump from whatever state the game is in IMMEDIATELY to the user’s selected activity. The meaning of jumping to the activity is left up to the game: it could be a save file, or it could spawn you in the starting area rather than where you left off last time so you can begin doing some more chores. That helps a bit.

But I’m working on these emulated games and I can’t turn them inside out. I can only make tiny laser incisions. Here’s some of the things I’ve had to do:

I need to decide when the player has “begun an activity” (namely, “goal: beat the game”.) The game has an attractmode that shows a gameplay demo so I can’t use anything in-game to identify that the player has done this. Before the game begins, it shows the weapon loadout selection menu. This it does NOT show in the attractmode. It DOES show it inbetween levels and when you die. So what I do is this: I check a tile in a certain spot to see if it’s part of the loadout menu. When I find it, I tick a state machine. If I don’t find it, I tick it again: and at that point the “activity” begins. I can’t trigger “begin” when the user exits the title menu because we’re supposed to skip as many menus as possible (making as many assumptions as we please) so I have to wait until after the menu is done and the gameplay has really begun.
I need to jump to the activity begin point even if the player has never begun the activity in the actual game. That means I possibly don’t even have a savestate. So I have prepared a savestate just before the first level fades in, along with an arbitrarily-selected loadout. What else am I supposed to do?
I need to jump to the activity begin point if the player has actually begun the activity. In this case, I use the last savestate he loaded or saved. Unless that one’s not there, then I punt to item 2.
I need to decide when the player has “ended an activity”. I have previously prepared hooks for when each level is beat, for achievements, but these hooks don’t exist in an “un-enhanced” version of the game shipped with the SKU. In case this is what the player’s playing, I still have to detect the end of the game. The achievement hook I created before was done when a specific cutscene is shown at the end of the game. It’s a good, dramatic timing. So, I decided to reproduce that for the activity ending. To do this I will repeatedly memcmp the beginning of VRAM where I happen to know the tiles for that cutscene are. The moment they appear, the game is over. Well… the cutscene is fading in, so the screen is black at that time (that’s not ideal for achievements because screenshots get taken at that time) but that’s what we’ve got for now.

In another game I’m working on, which wasn’t emulated, I had to skip the language selection menu and just assume the user wanted English. What else am I supposed to do? What else is anyone doing? I should probably get a PS5 so I can check some games. In retrospect I should have made an effort to base it on the user’s OS-selected language, but what about people using languages I don’t support?

xor_mgambrell_0

So a game was running slow on the switch. After a while I realized it was because of some text I’d added which was getting rendered every frame. Like… one sentence of text. Using a spritefont and drawn in a batch. What the heck? This shouldn’t be so slow.

After digging in, I realized it was due to the kerning support I’d added to imgui which necessitates lookups of kerning pairs between letters. On top of this, imgui’s basic glyph lookup was slow. (Using imgui was a mistake, but that’s a story for another day and I’m stuck with it for now.)

Both these things are ABSURDLY slow because they use a binary search through a rather sizable list. Binary searches are slow because by definition they hop around memory without paying any heed to where they are in a cache. Every single hop is to a fresh cache line. This is 100% the worst possible case for a memory subsystem. And that’s the weakest point of mobile platforms, which the switch is. Turns out this just cost me a few FPS but it was enough to drop it below 60 and thus to 30.

I’ll have to redo this some day. But not now… I just optimized a different thing entirely to earn back some FPS from a different budget and so the game regained 60fps. However, I do know how to do it. And as luck would have it… I had to do basically the same thing for another text renderer just the same week!

In this other case, I’m high-level emulating the Turbo CD “system card” bios. This is so we can release games that are emulated without having to distribute the copyrighted bios, which there’s no way we could ever get the rights to. This proceeded pretty well. But finally I discovered at the very end of the game we were porting, it would use a built-in SJIS font that the bios has for rendering the credits. Now, I really don’t want to include that copyrighted font verbatim. On the other hand, it takes a LOT of work to make a Japanese font. So I wondered… had they just used someone else’s public domain font? So I spent hours scouring the web and could find no proof that they had. Fortunately while doing so I found 100s of BDF-format dot fonts, a few of which were 16x16 (which is what I needed), and 3 of which had the needed attribute of leaving one column empty (so that the glyphs would be spaced at least a little if put on a grid).

So it was time to make a BDF-processing tool, and a necessary part of that is creating the glyph lookup table so that I could get from an input SJIS code to a specific offset in the dot font bitmap data. And now I will tell you the proper way to do it.

SJIS is a two-byte code. UTF16 and even UTF-8 can be handled the same way (albeit 3-byte codes are common in UTF-8 when it comes to Japanese). You can make a 64k-entry table if you want (the entry is either the offset to the glyph or missing-glyph sentinel). This involves just one random access cache hit per lookup. The table entries really need to be 4 bytes so it ends up being 256KB. Not bad, but we could do better.

So generally what you do is you simply make a 256-entry “1st-level” table. The first byte of the code is the key index and the value is the number of a 2nd-level table. These can be bytes, so long as you know how to calculate the offset of a 2nd-level table from the number [it’s 256+number*sizeof(2nd-level table)]. So the first level table is tiny, although it’s still basically a random access each time. The 2nd level table can be 2 byte entries (to index another table of offset to the final glyph) or 4 byte entries (to directly index the offset of the final glyph).

My glyphs were all fixed-size (32 bytes apiece) so I could calculate them from a number, so I ended up with 2 byte entries. Note that in MANY cases your glyphs will be fixed-size: insofar as they are records containing e.g. texcoords. If you’re careful, a spritefont renderer can operate completely in-cache, and really only the final access of the glyph record is random (the gpu batch data you’re producing is written consecutively so it’s actually the ideal case for any CPU and is basically negligable compared to all the rest, in this case where it’s not much data).

But anyway that’s more optimization than I really care to think about now. By using the 2-level index you change a dozen or more random accesses to 2 or 3 and due to the way these memory bandwidth-constrained devices work, that can directly map to a 4x speedup. I did it with a 2-level index in this case because I actually put the data in my fake bios and I didn’t want to worry about what would happen if it grew too large.

One day I’ll improve that imgui text renderer some more so I can actually use it seriously; in that case I will probably make a direct 64k-entry map from 2byte codes to glyph data. But wait–what about the kerning! We’re not through yet. I brought all this up for a reason! The kerning data is kind of a multi-level index, isn’t it? 1st index is the 1st character and 2nd index is the 2nd character. Now in this case you can end up with some giant tables (if any 2-byte character can be kerned with any other 2-byte character, you can end up with 64k*64k=4GB entries in the worst case…) but still it’s only a small number of random accesses. In practice the table won’t be that big, and if it is too big you can work it into 3 levels somehow. Not a big deal. I’ll write about how that goes when I get around to doing it.

xor_mgambrell_0

I’ve got this game with 12 hours and 23 minutes of extra music. It’s pretty good music, so it’s worth it… I’m just listening to some of it for the first time as I write this…

This is supposed to be split up between two skus somehow, and released on switch, so the media size is an issue (since it affects the manufacturing cost). We also have a to-be-determined amount of extra junk in the form of movies and collectible art. So I don’t know yet what format we’re going to ship that music in. And there’s no need to even encode it on the big console versions since the blu ray discs are huge. I want to encode the music as best as possible for the switch version, but it’s not clear what quality level to use. Currently I’m assuming it will be ogg, but I really should make some effort to support opus… but that library is kind of hard to compile. I gave up on it once. Hmm…

We’re using git for source control, so I didn’t want to repeatedly assault my repository with multiple versions of encoded music as I continually fool around and change my mind. I’m constantly worried about git repositories getting too large to be practical. So I commit them as wav to another repository as a submodule, knowing I could completely wipe the that repository and recreate it if for some reason I need to. I used wav instead of flac for convenience… it was already gonna be a big repository, cutting the size to 2/3 won’t really make a big difference.

So I made this little tool that will convert all the wav files into encoded forms in whatever formats and bitrates are needed. For development purposes I have it set to make low quality oggs just so our test builds aren’t huge. OK, problem mostly solved.

Now today I get a bunch more more music and one of them is a 170MB file… even flac’ed, that’s more than 100MiB. Did I mention I’m using github for this and there’s a file size limit of exactly 100MiB for any file in the repository. I am pretty sure there is also be a limit to how big a single push can be, So I commit and push them in ~100MB batches.

So… hmm… what to do? Well, since I have that encoding tool… I just split the wav in two halves and commit those. The tool now knows how to merge a -PART1.wav and a -PART2.wav file into a single one before encoding it. Problem solved!

xor_mgambrell_0

So I’ve got this game that plays 4 movies at once. Well… due to how the engine works, it actually plays 8 movies at once–a 2nd instance is spun up paused and queued after the 1st instance so that it’s hot and ready to go in order to effect loops. This is actually important… many games in this engine are using it for looping effects, instead of sprite animations. It’s super annoying.

Needless to say, I did not plan for this possibility when porting the engine. When I first started handling this game, I added support for hardware-accelerated movie decoding, because I knew it was doing 2 at once at least (the 4/8 was a surprise) and game consoles can barely manage 1. Well… wouldn’t you know it, but some hardware-accelerated video decoders use, like, 13 threads per video. And game consoles are always limited in resources (the overhead and unpredictability required to create the abstraction of unlimited resources is an unacceptable imposition–you are supposed to accept the limitations in exchange for the benefits) so of course the threads are limited, also. Now the console-makers aren’t stupid. There’s plenty of threads for a game to play a movie or two. But 8 movies? Actually there’s a reason the movie decoder uses 13 threads that really pisses me off and would piss you off too if you heard it, but it’s too revealing to go into. Really unflattering to all involved. To be fair, I am a little wasteful of threads, too. In fact it takes me 3 threads to play a video on my own, on top of the 13 for the hardware decoder. One for mux, one for video, one for audio. But you can tell what some of these 13 threads are used for, and it’s so lazy.

Man, let me tell you some of the other things I did for this game. To cut down on time wasted copying 1080p frames (did I mention these were 1080p and even bigger videos?) from a hardware-accelerated output memory, to an engine-internal soft-surface, and then BACK to a texture… what I do is allocate a texture, specify it as the hardware-accelerated decode target, and then return the mapped pointer as the engine-internal soft-surface… and then, after all that, add some magic logic in my port layer that detects soft-surfaces which are actually mapped texture pointers which then elides the copy when creating the GPU texture from the soft-surface. Whew! That really took some doing. Probably one of the most amazing things I ever did, but what’s even more amazing about it is that I didn’t do it in a fork of our porting framework but feel it’s generally stable enough to keep around forever. WOW that does not happen often.

So… since I added the hardware decoder in order to decode movies fast enough, I no longer have the threads to do it (as soon as the game gets to the near-endgame point where 8 play at once).

I thought about re-engineering my system so it didn’t need 3 of my own threads per video which would actually get me down to near the limit… I should really not be so sloppy… but I ran the numbers and even though it would put me in the safe zone (which is why I thought of it in the first place) it wouldn’t put me FAR ENOUGH in the safe zone. Why knows what this game may do on top of 8 videos? Ahhh now’s the time to say, I’ve got like 30 threads reserved… for playing sound effects… sigh… ok, let me explain.

This engine doesn’t really distinguish between music and sound effects. Sound effects are streamed just like music. And sometimes a game will use a long sound effect for something like “guiselect.wav” or “chitchatter.wav” for no good reason–that is, it will have a long silent tail or a super long reverb or somesuch. Yeah, I could fix that, but I’m trying to build a rugged engine port here. And don’t forget, games may be playing some ambient sounds–doubled up for the looping the way this engine does. So I did some measurement and figured out 30 should be safe. I didn’t know I’d need 100 threads for movies when I figured that.

And anyway that re-engineering is… pretty not very safe. I don’t want to do this kind of thing now. This game has been a thorn in our side for months now.

So my next step was to find out if I can fix this doubling-up for the loops that the engine does. Turns out I could. I had to get help from the engine devs to figure out how, but if I simply played the second video after the first one ended, it would look nice enough – on this game, at least. I didn’t want to break any other games using this engine, so I did this hack just for this game. That combined with ANOTHER hack to lower the 30 multimedia threads to 20 (since by now I’d seen the characteristics of this game enough to know it would be safe) got me safely into the safe zone. Yeah, there’s safe zones and then there’s the safe zone in the safe zone. I honestly couldn’t believe I could spin up new movies without even hitching a frame, but that’s how it was. The videos, despite being 1080p, are actually a little low on the framerate side, so that helps things.

Now after all this, would you believe… in the very same scene that plays 8 videos… microseconds after it would have exhausted its threads and crashed… it now HANGS because of a new bug in my hardware-accelerated video decoding that caused threads to go into busy loops trying to consume frames without leaving the main thread free to produce them? Well, that was just my bad. Only a game that used so many videos at once would have ever smoked that out. Took some time to find but wasn’t so hard to fix. Would not have been a problem if I hadn’t had to hardware-accelerate the videos for this game.

(Actually I’m glad I finally did the hardware-acceleration–I felt like such a loser leaning on libvpx to do it, even though I have nice things to say about the library. It’s just not the Console Way to use that, though)

This @*#@ing game also smoked out a 1/10000 times problem in my basic video playback logic which would rarely hang due to a race condition between the engine consuming the last frame and the porting framework producing it. It just plays so many videos that it actually made it happen reliably. So I had previously fixed it (just for this game) but made a mistake in it: I’d added a std::atomic<int> which wasn’t initialized to zero. Yeah, those aren’t initialized to zero. In general it’s completely impossible to predict what anything in c++ is initialized to nowadays due to some combination of the semantics being completely off the rails in complexity, and the language design being absolutely hostile to users (I mean, even 20 years ago we were joking that everyone knew only half of c++ but it’s a different half for everyone); and I am only very slowly learning how to properly initialize everything. I mean, you don’t want to type std::vector<int> = {} – isnt that stupid? But you better type std::atomic<int> = {0} or else it could be garbage (std::atomic<int> = 0 is a compile error for reasons so complex explaining them would be a great term paper for a graduate student)

By the way, while I’m typing this, I’m nursing the game through an autoplay mode with my left pinkie finger. On top of all this other crap (and more problems just for this game I haven’t told you about) the game is simply running out of memory due to some unlucky timing of spikes before garbage collections. So I’m testing some complete playthroughs now to see if some finetuning is going to work to fix that.

Now I’m adding “playthroughs” and “finetuning” and “autoplay” to my dictionary file in this text editor. Man, what an annoying day.

Well, I’ve got 866MB memory free and the game’s 2/3 over. Time to focus on ending this thing (and then test the campaign again without rebooting the game, just in case). Seeya!

xor_mgambrell_0

One of the funnest parts of my job is picking out icons for projects on our gitlab server. I try to balance encapsulating the spirit of the game with finding something iconic while trying to keep it varied so it’s not a wall of 8bit sprites or anime girls. I don’t always do a good job but I do put some serious consideration into it. I made a collage of jumbled up icons.

xor_mgambrell_0

“How do you handle movie playback portably?” you ask. I’m glad you asked! I discussed it before. Let me write some more about it.

Every console vendor gives you a video player sample which is about 50 times more complex than it needs to be and is bogged down with unnecessary clutter due to taking care of showing frames and playing sound, and possibly handling 99% useless features like resolutions changing mid-video, metadata events, and convoluted selection of multiple video tracks. OK maybe only 90% useless, but 90% and that’s a fact.

Most game studios will just fiddle around with those samples to make a basic canned FMV player. Heck, all half of the game studios need is playing a vain publisher’s animated splash screen. But if you want to mix the audio with other things and render stuff on textures, and have fine control over seeking and rewinding and stuff, you have to rip it all apart. It was Hypnospace Outlaw I had to originally do this work for, because a canned FMV player did not cut it for that game which has videos playing in windows all over the screen and even “glitch” effects by seeking around crazily.

I didn’t look forward to ripping this all apart for several consoles, so what I did instead was find a software video codec that would work for me. Now this isn’t easy. Most software video codecs are either GPL, uncompileable trash, or just plain weird trash. After filtering out all that, I was left with libvpx which is pretty awesome. I got it compiling portably and even using multiple threads for decode without any trouble. And as a codec, it rocks.

There’s another benefit of libvpx which made me glad I found it and chose to pick it. It’s that the switch can hardware decode vp8 and vp9 natively. So in theory, I would be able to use the same videos on all consoles without having to change the encoding. Now, I didn’t do that at first, because the Switch could actually decode a 1080p30 video in software fast enough, but eventually I came across the scenarios I’ve mentioned before which made me need to graduate to the hardware video decoder.

Now, it’s important that I share as much code as possible between the hardware and software video players. I came up with a design that turns the actual decoder into a standard interface with the libvpx and the switch hardware implementation. It’s too much detail to go into now, but the gist of it is that it is very simply and consists just of sending it input frames and grabbing output frames. Seeking isn’t supported; the decoder just expects to start with a I-frame (which may be in the middle of the video). If I need to seek I actually tear down the whole apparatus and boot it up again ready to seek to the needed I-frame. None of this is really astonishing, but it was surprising to me how simple a properly designed codec interface can be when the unnecessary concerns are removed. And this information is needed to see how the other parts are organized.

Before I explain the overall organization I need to tell you about the IVF file format. This is basically a mind-numbingly simple video stream format designed alongside the vpx codecs. It’s so simple that it doesn’t even necessitate a library. Imagine the simplest format you can for doing this kind of thing, and then imagine IVF is simpler. FFmpeg has support for writing to IVF streams.

So with all that background in mind, finally here’s my approach:

I typically receive webm files with vpx codec in it. I run that through a script that dumps the video stream to a .ivf file and encodes the audio as ogg (it’s not worth the trouble to get libopus compiling natively, that library is a mess.) I then have a tool that runs through the ivf file to create an index for seeking. Yeah, IVF is so simple it doesn’t have that. I’m telling you, when you have simple enough formats and concerns separated properly, it’s easier to look at specs and code everything from scratch than it is to figure out how to wrestle overcomplicated libraries needed for overcomplicated formats. Unfortunately the IVF frame headers do NOT say whether the frame is I-frame or P-frame … it’s somewhere in the packed frame data. So my tool has to run ffprobe to gather that data. I create my own “.dat” file with all this stuff, and it’s now alongside a .ivf and .ogg.

The very first thing that happens during playback is the ogg is handled by the exact same codec that plays all music in my engine. That… you could call it OggAudioStream is routed through a MuxStream which also gets a VpxVideoStream. So already things are simpler by audio being done and out of the picture.

Next I have the portable layer that manages the .dat index. It gets the resolution out of there and does the seeking as needed–it goes to the nearest preceding I-frame for the desired seek timestamp and then drives the VPX decoder, throwing away frames as needed. This layer is seeking through the .ivf file and reading the video frames out. Finally all that’s left is hooking up the correct hardware or software video decoder to do the decoding of the frames read from the .ivf.

You may be asking why I don’t use libwebm directly. It’s because that library was a nightmare to compile and I would have to pipe the audio out of it somehow, and it was easier to just make my own “container” format and leverage IVF.

I could combine the .dat and .ivf into a true container, but… why? I would never put the ogg data in there also–it’s so nice to have it played basically totally autonomously. My MuxStream determines its timing from the audio component (or the video component in case there’s no audio track).

Any videos I receive that arent vpx encoded just get transcoded, which isn’t ideal in principle but vpx is such a good codec it basically loses no quality and can take advantage of the lost information from the original encoding to produce smaller files. Like, if you ever have to transcode anything, transcoding to vpx is the way to do it. It seems to do an amazing job.

None of this is a great revelation, really. But it is a great relief to have video playback done so well that it’s a fully solved problem… without having to pay for a 3rd party library on every game (bink or smacker or whatever people are using nowadays). The FMV splash-screen displayer process source code is 100 lines only. There’s some areas for improvement in the API that’s interposed between all the stuff I’ve described and the “user” API due to how things evolved, but being able to drop a movie into your game in 100 lines is pretty good. With about 100 more lines I could scale it down, stick it in a window, and make a gamepad-controlled FMV mediaplayer, and that would take just about an hour.

More importantly, I feel good enough about this general approach that I think I can recommend it for everyone, in every situation. This problem is SOLVED.

xor_mgambrell_0

Well, I’m not working on my own game right now. What if I tell you some interesting stories each week from the other people’s games I’m working on which is keeping me from the real business of making my own game? I hope you don’t mind.

This week I had a few walls of text in FIGS languages which lost formatting during localization. I needed to break the paragraphs at “Unfortunately” and “Watching”. I felt a little powerful for recognizing those words immediately. Italian “Guardare”… that one tripped me for a half a beat. Sounds like ‘block’ or ‘pause’… those I have seen before. Ah, but “regard”… get it?

My general process for double-checking stuff like this (triple check is to ask a native speaker) is to paste it in google translate and see if the splits make sense in English. This is not hard for the FIGS languages… but… with Japanese… ok, let me show you what happens.

Here’s a Japanese sentence we might have to wrap ourselves:

あなたは恐ろしいハンギョ博士の魔の手からこの世界を救ったのです！
“You have saved this world from the terrifying hands of Dr. Hankyo!”

Now what I’m going to do is split it at the most reasonable points and show you how google translate re-interprets the sentence:

あなたは – 恐ろしいハンギョ博士の魔の手からこの世界を救ったのです！
You – He saved this world from the terrifying hands of Dr. Hankyo!

あなたは恐ろし – いハンギョ博士の魔の手からこの世界を救ったのです！
You are scared – He saved this world from the magical hands of Dr. Hankyo!

あなたは恐ろしいハンギョ – 博士の魔の手からこの世界を救ったのです！
You are scary hangyo – He saved this world from the hands of the doctor’s devil!

あなたは恐ろしいハンギョ博士の – 魔の手からこの世界を救ったのです！
You are a terrifying Dr. Hankyo – I saved this world from the hands of the devil!

あなたは恐ろしいハンギョ博士の魔 – の手からこの世界を救ったのです！
You are the terrifying Dr. Hankyo’s demon – I saved this world from my hands!

あなたは恐ろしいハンギョ博士の魔の手からこの – 世界を救ったのです！
You are this from the terrifying Dr. Hankyo’s devil’s hand – You saved the world!

あなたは恐ろしいハンギョ博士の魔の手からこの世界を救 – ったのです！
You save this world from the terrifying hands of Dr. Hankyo – It was!

The guidance on Japanese typesetting will tell you it doesn’t really matter much, as long as maybe you cut it between kanji and kana, but these examples run the whole gamut of persons, tenses, senses, and storylines, so wtf?

Anyway I had to prepare all these FIGS translations in a Photoshop file so I could export the 1080p pngs for display in-game without needing a Photoshop-calibre typesetting engine. I’d say this should be someone else’s job, but apparently it’s my job to do whatever everybody else doesn’t do in cases where it would take me longer to tell them to do it, explain how to do it, and then check their work. Photoshop does have its own wrapping logic which is probably good but in this context I had to be creative about where to wrap it or it would turn into too many lines.

xor_mgambrell_0

Not a day goes by I don’t have to do something like this…

This time, I had to read a png myself and receive the palette indexed data, because I’m using it to patch tiles in the 8bit system VRAM. I can’t (straightforwardly) draw it on top of the finished output because it wouldn’t fade along with the rest of the screen unless I took some really wacky measures. My content building process munges pngs phenomenally into a custom format which uses LZ4 to compress (pngs are very slow to decode and LZ4 advertises “faster than memcpy”). This works well except on big photorealistic stuff which it can inflate by 3x relative to the pngs… everything else is just fairly comparable in size but also loads faster. So I fiddle with my build scripts to replace the munged pngs with specially chosen pristine ones.

Anyway the point of all that was to show what a highly managed process importing PNGs are in my engine, so that I always end up with 32bpp data. And yet in this case I need palettes. And I don’t want to re-paletteize it because sometimes these patches need to use the palette that’s already there (replacing tiles is an order of magnitude easier than replacing palettes, which are computed on the fly [fading, you know] rather than just parked there like tiles and BG entries). So rather than generalize my 32bpp textures system, I added a new 8bpp image loading system.

stb_image is what I use for loading unmunged pngs, which can only load images into 32bpp format. I don’t to add lodepng… I don’t know, I just don’t want another library. Corona is overkill (and another library I’m not using yet). SDL_image is not going to happen because I’m not using SDL. But I already have libpng available as an importable module in my engine (some games need it) so I might as well use that.

Problem is, libpng is like… a nuts and bolts library, not a user-friendly library at all. There are a lot of user-unfriendly parts of it (that is to say, all the parts) but the one in particular that’s got me down today is that it could give me an image with a palette that’s size 16 or 256 depending on how I saved the PNG. So far out of 2 images I have saved the PNG both ways. Now, libpng has a lot of fun transformations it can do on its pixel formats, but not the one I need (or I can’t figure out which one it is). So it’s time to write my own 4bpp-to-8bpp converter! It took a few tries, but I came up with this, which is so neat and nutty its what inspired me to write this post today:

if(bit_depth == 4)
{
  for (int y=0; y<height; y++)
  {
    for (int x=0; x<width; x++)
    {
      int xx=width-x-1;
      row_pointers[y][xx] = (row_pointers[y][xx/2]>>(((xx&1)^1)<<2))&0xF;
    }
  }
}

Here we work from right to left (don’t ask me why I don’t reverse the loop, I have no idea) because we’re expanding the data so the store is always further to the end than the load so that we never step on our own feet. Then we just have to pick the N/2th 4bpp byte for the Nth pixel, shift it down if it’s the upper nybble, and mask it down to 1 nybble in case it wasn’t. Easy! To tell you the truth, I didn’t even check if there was a png transformation, I just did it this way myself immediately because I knew it would be fun.

xor_mgambrell_0

Some platforms have really weird patching systems. If you do something like frameshift all your data by 1 byte in an update, the patch engine will fall apart and it will be 100% the same size as the new file. And there are limits to how large patches can be, too. This sounds like it might be amateurish, but there’s reasons for it: making a fully-powered patching system has performance implications. Specifically on the disk IO system. And consoles are all about the bargain of the platform guaranteeing resources while the developer guarantees he’ll use them. So the patching apparatus cannot be infinitely-powered, like to the extent where it’s a fully optimal difference which requires unbounded time to uncrunch patch data while referencing sectors from all over the version0 datafile. Specifically that last part: in the worst case scenario: the individual seek and IO time will get out of control, as will the wear and tear on the flash media. I call it read amplification because that’s basically what it is.

There’s a piece of the puzzle you’re missing if there was something creepy in the above that was giving you a bad puzzled feeling. And it’s this: the premise for a console platform’s patching engine is that the version0 datafile is always present. It’s in the physical media! The patch must be applied on the fly. There’s no separate storage for it. Even on modern consoles, this same principle may be in effect.

On the developer’s side of that guarantee, then: we are supposed to read the docs about how the platform’s patching apparatus works, and then design disk file formats which don’t cause problems for it. This always catches Unity developers by surprise–the default way for doing things there makes a complete mess out of the data every time you touch anything. It’s always sad to see the horror slowly dawn on them. Well, it catches everyone by surprise at least a little the first time, but at least the developer of a proprietary engine would have had this kind of thing on their radar already (and it’s not as likely to be their first time doing anything.)

But I’m personally in a bit of a pickle. See, I’m porting other people’s engines, and they’ve unknowingly made poor choices about the structure of the packfiles. I don’t want to hack the engine (part of my business plan is to do everything I can to touch engines and games as little as possible, to keep costs low.) But what do I do when I need to patch one of these games? I ran into this scenario last week. So here’s how I handled it.

I have a file format which consists of an 8 byte record per ‘sector’, where the ‘sector’ size can be tuned somewhat depending on the nature of the data. This sector consists of what I call a ‘literal prefix’, a ‘chunk’, and a ‘literal suffix’. Tracking that requires 8 bytes per sector which can be directly-indexed by sector. The literal prefix and suffix are stored in the patch file at the end together. The ‘chunk’ comes from the version0 datafile. The sector may naturally be 100% literal prefix or 100% chunk–and commonly is. After all, we’re dealing with packfiles, and unless internal data has been re-indexed somehow so a bunch of IDs and references are changed (we’ll hope that doesn’t happen *), most of the data is the SAME–only moved around. And most importantly, the implications on read amplification are contained: reading one sector of data from the patched file results in reading no more than two sectors of data from the version0 file.

I made an effort to find the best ‘chunk’ that I possibly could. I do exhaustively search for those, using an acceleration data structure that consists of a list of offsets of strings in the version0 file beginning with each 16bit value. So that actually cuts the number of strings in the version0 file I have to search by 1/65536th. Using that filter, I just look for the longest matching string I can. Even this would be excruciatingly slow (hours?) but I found another technique: It so happens that the best string in version0 for version1’s sector 124 is most commonly the string immediately succeeding version1’s sector 123 chunk (as long as the literal suffix for sector 123 was empty–the sector is concluded with a chunk). That is, in other words, I can find the next piece of data from the packfile’s inner-packed file without even needing to know anything about the structure of the packfile. With THAT (and a 32 thread processor) I can handle a 1GB packfile in 30 seconds or so. The whole algorithm is trivially parallelizable. Even the acceleration table is parallelizable (each of 256 threads indexes the strings starting with that byte). That table is about 10GB for a 1GB packfile, when it’s all said and done.

I did evaluate all the existing patch file formats, but none of them were really designed for on-the-fly use. For sensible reasons. I’m sure in exotic filesystems there may have been something like I needed but it would have been impossible to reuse that stuff. I didn’t even look for it.

Now to finally complete the picture, what we do is make version1 of the game out of the version0 packfile + the patch file. The patch that gets shipped to users is no larger than our patch file, then.

Actually the format I devised is kind of bloated, so I do compress the whole thing. Could compress to anywhere from 10% to 99% of the original file size… depends on details. Since this is done on the entire file, and you may have noticed my read-amplification analysis didn’t discuss reading the literal parts, it follows that my patch files are loaded completely into memory. My defense of this is that any engines we have that make such good use of a packfile that they are enormous enough to cause problems, will also be rather efficient about NOT loading too much data in memory, meaning I will have plenty of room to hold a patchfile. Hope I’m not wrong…

I had to modify my engine to try opening myfile.ext.patch any time myfile.ext is opened (actually this analysis is done when our filesystem is built during content building) and then have the ability to do this patching on the fly. But there’s one more read-amplification issue I had to deal with: since my patch file is sector-based, and usually games will read entire multi-sector files from the packfile all in one operation, I don’t want to turn that into several individual-sector reads from the physical filesystem. To make sure this doesn’t happen, I interposed another layer of logic that ‘coalesces’ adjacent reads from the version0 file and doesn’t execute them until a read comes from somewhere else.

Whew, that was a lot of work! I hope it works out. Part of the concept here is that no work is required IN ADVANCE, so despite my testing, I won’t be 100% sure about it until I actually run into the scenario where I have to do the patch (this whole endeavour was precautionary before we shipped a game with huge packfiles).

In case this does happen, I have another strategy planned. What I’ll do is find the best-matching sector’s worth of data in the version0 file and then store the XOR of it vs version1 in the patch. The whole patch will then be compressed (and it’s hopefully mostly zeroes) but it needs to be done one sector at a time, since the mountain of zeroes will be too big for RAM. I think this approach has promise, but it’s just not what I ended up doing. Plus, I think the computation to find “best-matching sector” is pretty rough. I don’t want a patch file that takes days to compute… unless I’m really, REALLY desperate. Since I have the on-the-fly patching built into my engine now, I could make an alternate patch format without too much work. Note that in order to avoid read-amplificaton issues, I would need to STRONGLY prefer sequences of sectors. I mean, I know it’s paranoid, but I’d rather miss once and have an inefficient sector than even have a chance of a perverse case turning one long sequential read into dozens of short ones.

xor_mgambrell_0

More NES emulation…

We didn’t like how this other RPG game was rolling random encounters. It happened on the first step too often. Very annoying. I think there needs to be some kind of step counter setup with a minimum value… I don’t know how many games do that. Is it possible they have arranged the statistics so that low rolls are less common? I haven’t checked this in a while.

This game has 4 terrain types, and up to 4 encounters per terrain type; on top of that it is organized into zones in the world map (thus further-away zones have harder encounter sets). There is only one monster per encounter, so encounter type = monster type. The data is organized this way:

struct EncounterDataForWorldZone
{
 byte chances[4][4];
 byte monsters[4][4];
}

EncounterDataForWorldZone encounterData[N]; //N zones in the game, N unknown

Now, for each step, the game gets a random number R from a flat distribution. It then considers chances[currterrain][0] = C as a C/256 chance of getting the encounter on that step. We see if R is less than C and do the encounter if so. Otherwise C is subtracted from the random number, so the next chance C’ from chances[currterrain][1] is a C’/(256-C) chance. And so on.

Generally the chances near the beginning of the game where we were testing, is, I dont know, 16-20 or so. So it’s easy to see why we get too many encounters… and why they can happen on the first step so often.

My first thought was to use the old “turn it into a normal distribution” trick by changing a 1d256 roll into a 2d128. This biases more normal rolls in the middle at the expense of rolls in the tails. So we did this for a while, and it felt okay… but we weren’t testing the whole game, and I got nervous.

Let’s take a simplified scenario where we have 2 encounter chances, each of value 64. Now the original way it was done, each of those would be a 64 in 256 chance (1 in 256). Well, kind of, because of how the subtraction happens in the algorithm. Anyway with two dice, there’s fewer ways of rolling the smallest 64 numbers than there are the next 64 numbers (just like there’s only one way to roll a 2 on two dice but there’s many ways to roll a 7). And yet since they’re adding up to 128, the new probability distribution is centered on 128, and the absolute probability of an encounter is still the same! All we accomplished was to bias the selection to the later encounters. This would change the character of the game in a way we don’t really have time to test and scrutinize.

So my romhacker came up with a better idea to add a tiny bit of logic to only do encounter checks on every other step. This way you won’t get them on successive steps under any circumstances, and we get the other desired result of just generally fewer encounters–precisely half as many, without having to grapple with the really thorny problem of how to do it, logically speaking, without affecting the design of the game by biasing certain encounters. Actually after a little more iteration we developed an approach to run it with any divisor so that I could give another option to reduce encounters by half again (so a divisor of 4) from the emulation menu, so people would feel a bit empowered to be masters of their own destinies instead of frustrated by the fixed design. (There’s also a cheat to disable encounters entirely)

xor_mgambrell_0

We didnt like how the money in this NES game was spawning (too rare, seemed broken) so we started investigating it. Seems it used a threshold of $A0 and it gave money whenever the threshold was exceeded by a random roll. But this didnt seem to explain the whole story. So we checked the random number generator.

It seems to be four state bytes, each for a stream, where the stream is just a sequential walk through a table of 256 bytes of entropy. And each time a random number is required, it moves to the next stream, so that as long as theres an odd number of things requiring random numbers, it will be getting then from multiple streams and not stuck in a predictable sequence unto itself. It works out okay but I dont know if a choice of four streams makes for a good divisor or whatever. Perhaps something prime?

So the algorithm is basically:

byte curr;
byte state[4];
byte nextrand() { return randtable[state[curr++%4]++]; }

This all seemed alright, except that randtable doesnt contain entropy. Im pretty sure it contains SOME OTHER table (I dont know what, yet). I surmise the devs ran out of room in that bank (I certainly couldnt find room for a 256 byte table in there) and just said whatever, this looks jumbled enough.

As long as the random numbers are used to choose a random wandering pattern, its not so bad, but when this $A0 threshold is used for something simple like a money drop, it makes it almost impossible to behave sensibly. For example if we changed the threshold to $80 but there were no values between $80 and $A0 in the table, then it would make no difference to the drops.

So we’re either going to TRY HARDER to find room in the rom for the table. Or since this game has WRAM (battery-backed) and Ive studied its use of the WRAM, I happen to know that it doesnt use it all (almost no game does) and so we could just copy the table to $6F00 (the save game slots begin at $7000) at bootup using whatever extra banks we can trivially tack on (tacking on banks to run new code at bootup is substantially easier than inserting code patches all over that switch to other banks on the fly, or finding room to jam stuff into the common bank)

One final detail to mention: how is state[] initialized? When entering a dungeon, state[0] is set from the global frame counter, and then state[n] is set as randtable[state[n-1]]. This means there’s only 256 configurations… but I doubt anyone will notice.

xor_mgambrell_0

I’m working on a big Wonder Boy collection. The biggest challenge is keeping all these names straight. Wonder Boy 5 Monster World 3? Wonder boy 3 – dragon’s trap, or monster lair? Wonder boy 4 was skipped because there were two wonder boy 3’s… when I started working on this project, the first thing I did was come up with a naming scheme so I wouldn’t get confused by everything. This post is about WBIML - wonder boy in monster land (basically, monster world 1 [wonder boy in monster world is not monster world 1, it’s monster world 2]).

The game lets you buy some magic spells / items. Fireballs, bombs, that kind of stuff. But there’s no menu to choose what’s equipped. I think it’s because it’s was actually an arcade game first and they didn’t want people to fool around in the menu forever… nor add another button for hot-swapping it. The SMS port of the game could have been a chance to improve that, but they seem to have kept it just like the arcade game for some strange reason. Like, time limit and all. The time limit really doesn’t make sense for this game…

So what happens is when you get an item, it’s pushed on a stack, and you can’t access whatever item you had before until you exhaust all of that new batch you’ve got.

Well, this game is pretty… bleh. I couldn’t stand it being totally pointlessly insufferable, too. So I added a system that lets you choose your magic by hitting a hotkey.

In order to do this, I have to run through memory to see what the game’s put in the stack. I collect these in large-sized bins and keep track of the order I found them. Then I clear all that memory and write them back out in 0x7F sized bins. If the user pressed the “select magic” hotkey, before I do that I rotate the ordering.

I then have to cause the HUD to get updated. In the arcade version I found a flag that does this; I just set it to 1 and the game knows to update the hud (I found the code that acquires a new item setting that flag). For the SMS version I couldn’t find such a flag, but found the function that updates the hud so I call that from my frontend by saving the cpu state and then meddling with it to have the right SP and PC and stepping the instruction loop without stepping the whole machine until the SP passes the original value (i.e. the RET was excuted).

The best way to do these hacks is to hack the rom to use a physical button that doesn’t really exist, but… I like to try doing things the other way sometimes. Just to prove I can. Plus, occasionally (commonly?) the ROM will be totally full and there’s no place to put that code. You can always expand the rom, but this is apparently kind of a fraught process. Or so I’m told. I don’t understand why… really you should just use what little bit is available to set up a “syscall” by ID mechanism that makes a “far call” to another rom bank. Maybe I’ll do it that way some day.

I’m not sure if this feature will make it into the final game… if QA finds problems with it, depending on WHEN they find the problem, I may have to nix it rather than incur more risk. Why we can’t QA while the game is being developed instead of at the last minute is a story for another day…

xor_mgambrell_0

Welp. I forgot oggenc litters its output oggs with random garbage that’s different on every run.

So now I’ve got 600MB patches every time I make a new build unless I change the build process for this game to copy over the ogg files from the first rom version that got approved. That’s because ogg by default puts a serial number in each internal chunk (i.e. 100s of times in the file at unpredictable offsets) which is seeded randomly every time oggenc runs.

Moral of the story: if you’re running oggenc without “-s 0” to seed the serial number to 0, then you’re doing it wrong. Always, always, always, always.

In fact at one point I even made a custom oggenc that always does that so there’s no way I could forget… except… I forgot to use that custom oggenc this time.

xor_mgambrell_0