[GRUEDORF] mgambrell's making other people's games

xor_mgambrell_0

One of the funnest parts of my job is picking out icons for projects on our gitlab server. I try to balance encapsulating the spirit of the game with finding something iconic while trying to keep it varied so it’s not a wall of 8bit sprites or anime girls. I don’t always do a good job but I do put some serious consideration into it. I made a collage of jumbled up icons.

xor_mgambrell_0

Let’s talk about hacking roms. Well, more precisely, let’s talk about NOT hacking roms.

I can do it, but I’m not great at it… and I find most emulator debuggers quite disagreeable. I think it would take me a long time to do some of the hacks I need to do. I have some workarounds that have proved quite versatile… they are MORE versatile than hacks, in some cases. And less, in some others. But they can be way faster than hacking, in some cases.

First, I should mention, classic game genie and action replay cheats are easier than hacks. Yeah, sure, always having 99 lives isn’t quite as elegant as it would be to simply never lose any of your 3 starting lives. And blinking invincible as if you just got hit for the entire game isn’t quite as elegant as simply clipping through bad guys without taking damage. But it does get the job done.

Moving on to more interesting things: I call them “input macros”. As a simple example, I have a game with a hidden mode that requires pressing select five times on a menu. Now, I don’t want to even have to expose the select button to the user as a binding. So what I do instead is configure the triangle button to set up a sequence that seizes control of the input and sends a series of five select button presses across several frames. It can get more complex than that. In Turrican, we have dozens of screens full of code doing this kind of stuff. For instance, to drop a bomb you have to crouch and shoot. Ideally we’d have a button that does this instead, so we have to crouch for the right (minimum) number of frames and then presses the button at the right time and waits for long enough crouched for it to register.

Another input-related thing we need to do is to fix menus so they behave in more sensible ways, using the console’s normal accept/cancel buttons (which varies by console and region of course, most noticeably differing between Nintendo and everyone else). This requires unhooking the normal input bindings completely and simulating a different kind of setup depending on whether we’re in a menu. Detecting if we’re in a menu… can be hard. In theory somewhere there’s probably a variable that says what game mode we’re in… maybe, maybe not. Maybe you can only tell by noticing that certain code only runs when you’re in a menu, and you can detect when that PC is getting hit. That’s too close to hacking for my tastes though. Here’s what I do instead: I scan the VRAM for particular patterns of tiles. That works for any screens that have some unique static elements. If that’s not good enough, and it’s a transitory condition, you can set up a state machine to catch transitions between menus.

Scanning VRAM is kind of more of a dumb thing. You do this, I do that. If you try to analyze the game’s programming, you can get into mind games really quickly. The game will be reusing code for something, or it will do every screen in a completely different way (that’s totally normal; each screen may have been programmed by a different person or be pushing the hardware enough to be basically a different program) so your reverse engineering workload ends up proportional to the number of scenes. But scanning VRAM is the same trick every time, and you can build tools to help you do it.

The final workaround I want to mention is the inverse of scanning VRAM – it’s modifying it! Suppose a game prints something static in Japanese and you want it to be in English. The tiles are compressed, the tilemap data’s compressed… you’re gonna crack all that, PLUS the code that draws it, and redirect it all somewhere? NOPE! Detect it, and replace it. This is easy if you have spare tiles in VRAM. But you don’t always…

I came up with a neat way to fix that which has totally worked for me on the PC Engine. It’s not directly scalable to the other consoles, but I’ll tell you how it might be (not proven yet.) The PC Engine is odd for having twice as much VRAM addressible for tiles as it has VRAM. So you can just park extra tiles in the 2nd half of VRAM and rewrite the tilemap to reference it. A big concern when patching art like this is whether it will fade out with the rest of the screen. You could patch it with pngs that don’t use the proper console palette, but what’s going to make it fade out? It can be really tricky (in an emergency you can fade it yourself by checking a known-white color in the palette).

Here’s the plan I have for other consoles (and I’m definitely going to have to attempt it on the SNES soon). Who says the hardware can’t have a separate bitplane tacked onto it in a way that the CPU can’t even access? I can access it from the emulation framework. So I will just “add” tiles by putting them in an auxiliary bank of vram and then set a bit in the separate fake bitplane which will signal the PPU renderer for the system to fetch the tiles from the auxiliary blank.

This is kind of an approximation of how consoles have extended their PPU capabilities before (the GBA and the NDS did quite a bit of this). There’s no reason you can’t apply it more generally, though. I feel sure you could do it on a playstation, for instance. In fact every HD textures approach for 3d consoles is working in a philosophically similar way. There’s additional resolution packed away in a buffer not accessible to the CPU (because it wouldn’t know how). I look forward to running into the problem on a PSX game, but I have to find a PSX game to port first…

xor_mgambrell_0

There’s something everyone must run into at some point. You’re happily rendering UI straightforwardly and then all the sudden for one reason or another you need to composite two layers. But you want it to look exactly (give or take some numerical precision issues) like it did before.

I’ve done this before but never in a fully satisfactory way. This time I decided I wanted the perfect solution. I decided to write this out while I work through it.

When compositing layers, the first problem you run into is that the alpha values in the top layer are nonsense. Well, what should they even be, anyway? The alpha values are meaningless until you need to composite that layer. You probably have no control over those values. They might be all 0, or all 1. They might be a complete mess.

Let’s say I have layer TOP and BOT and BOT just contains a wallpaper. I clear TOP to 0,0,0,0 and then draw an opaque triangle texture on it. Now to composite I draw TOP normally, as long as I replicate the triangle’s alpha values into TOP.

Suppose I scale the triangle a bit so it’s linear filtered. Now what we will run into is texels (perhaps 1,1,1,0) from outside the triangle texture getting incorporated into the filtering. You have to use premultiplied alpha to fix that (a 1,1,1,0 pixel will be come 0,0,0,0 after multiplying rgb *= a). This is well-known. I just wanted to get it out of the way and make you visualize the fuzzy-edged alpha values that will get produced by this. If we had a white triangle, we may end up with (0.5,0.5,0.5,0.5) at the edge (with premultiplied alpha, that’s a half-transparent full-white).

When this is drawn normally onto BOT, everything still works normally.

Now, let’s draw a big red semitransparent box over all that. Let’s do it at alpha=0.1 … now what should the alpha values be where the semitransparent box overlays the fuzzy edges in the triangle? We have 0.5 alpha clashing with 0.1 alpha. Do you want to to multiply them? That’s wrong… it suggests we now have an alpha of 0.05 which is basically invisible. The art on TOP should have produced higher alpha values. Do you want to add them? You’ll exceed 1.0 at some point. Addition/subtraction isn’t logical and it won’t work.

There’s not a lot you can do with the ROP (blending) units on the GPU so you may be tempted to go to the shader. But there’s no such thing as a free lunch… you can’t just do what amounts to blending on the shader without losing something, and it turns out that’s performance. GPUs just aren’t meant to do this. Maybe you know a way to do it with a cutting edge GPU and API, but: I doubt it, and there’s still no way it’s as fast as if you used the very old GPU programming models that were designed to be efficient from the outset. And anyway you still have to figure out the math!

In theory, if we weren’t compositing, after drawing the triangle edges we would have something like this:

WALLPAPER * 0.5 + TRIANGLE * 0.5

and after drawing the box we would have

(WALLPAPER * 0.5 + TRIANGLE * 0.5) * 0.9 + BOX * 0.1
-> WALLPAPER * 0.45 + TRIANGLE * 0.45 + BOX * 0.1

We note that the wallpaper represents the BOT layer and so when we composite (using only the information created while rendering TOP) we need to end up with a TOP alpha of 0.55 in order to blend the layers correctly. So somehow we need to end up with an 0.55. In particular we needed to get:

0.55 = 1.0 - 0.5 * (1-0.1)
dest_alpha = 1.0 - dest_alpha * (1-src_alpha)

So now we have to figure out how to produce that with normal GPU resources. Right off the bat, we have the ROP blending capabilities problem. While we could multiply DST and ONE_MINUS_SRC, there’s no way to get the extra 1.0- in there. Maybe we can do some algebra to help?

dest_alpha = (1.0 - dest_alpha) + (dest_alpha * src_alpha)

This isn’t helping, there’s just no way to deal with the 1.0-. Even though we have ONE_MINUS_DST_ALPHA, we can’t use it without multiplying it with SRC_ALPHA or DST_ALPHA which we don’t want (that’s simply how the blenders work.)

So what we’ll have to do is invert our sense of alpha so we don’t have this problem. Now we’d need to have 0.45 in there (we’ll un-invert this sense in the compositing stage, trust me). So to be clear:

0.45 = 0.5 * (1-0.1)
dest_alpha = dest_alpha * (1-src_alpha)

In other words, it’s where we got stuck before but without the 1.0- because that’s always baked into the dest_alpha.

Well, this looks much easier. now we can use a ONE_MINUS_SRC_ALPHA * DST_ALPHA configuration in the blender. But there’s one snag. Instead of clearing to 0,0,0,0 (fully transparent, ordinarily) we now need to clear to 0,0,0,1 (fully transparent, with the inverted alpha sense).

Now, how to composite it? Well, this is what’s needed (with the un-inverting taken into consideration):

dest_color = (dest_color * src_alpha) + (src_color * (1.0-src_alpha))

And that’s simply a blender configuration of:

SRC_COLOR * ONE_MINUS_SRC_ALPHA + DST_COLOR * SRC_ALPHA

The inverting of the alpha sense happens here, where you can see things are plainly backwards-seeming.

In all my years of doing this kind of stuff, I think I’ve never seen it written out this way before. You’d think the key characteristics of having an inverted sense of alpha and clearing to 0,0,0,1 would be memorable. But no… Finally, after writing all this, I thought of enough keywords to find an article that mentioned it, in the 2nd half:

http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html

A “little bit of head scratching” indeed.

I posted this anyway so you could see my analytical method involving trial numbers and so I’d have something to use to stop losing gruedorf.

UPDATE 2022:
It seems I forgot something here. Ordinarily when using premultiplied alpha, it affects the blending you use during composition. Without it you’d have:
SRC_COLOR * SRC_ALPHA + DST_COLOR * ONE_MINUS_SRC_ALPHA
With it you have:
SRC_COLOR * ONE + DST_COLOR * ONE_MINUS_SRC_ALPHA

So let me correct the analysis above. I said we’d need this:

dest_color = (dest_color * src_alpha) + (src_color * (1.0-src_alpha))

But actually src_color has src_alpha already incorporated via the premultiplication (actually it’s all muddled due to the inverse sense of alpha). So really what we need is this:

dest_color = (dest_color * src_alpha) + ((src_color/(1.0-src_alpha) * (1.0-src_alpha))
-> dest_color = (dest_color * src_alpha) + src_color

And thus a blender configuration of
SRC_COLOR * ONE + DST_COLOR * SRC_ALPHA

xor_mgambrell_0

“How do you handle movie playback portably?” you ask. I’m glad you asked! I discussed it before. Let me write some more about it.

Every console vendor gives you a video player sample which is about 50 times more complex than it needs to be and is bogged down with unnecessary clutter due to taking care of showing frames and playing sound, and possibly handling 99% useless features like resolutions changing mid-video, metadata events, and convoluted selection of multiple video tracks. OK maybe only 90% useless, but 90% and that’s a fact.

Most game studios will just fiddle around with those samples to make a basic canned FMV player. Heck, all half of the game studios need is playing a vain publisher’s animated splash screen. But if you want to mix the audio with other things and render stuff on textures, and have fine control over seeking and rewinding and stuff, you have to rip it all apart. It was Hypnospace Outlaw I had to originally do this work for, because a canned FMV player did not cut it for that game which has videos playing in windows all over the screen and even “glitch” effects by seeking around crazily.

I didn’t look forward to ripping this all apart for several consoles, so what I did instead was find a software video codec that would work for me. Now this isn’t easy. Most software video codecs are either GPL, uncompileable trash, or just plain weird trash. After filtering out all that, I was left with libvpx which is pretty awesome. I got it compiling portably and even using multiple threads for decode without any trouble. And as a codec, it rocks.

There’s another benefit of libvpx which made me glad I found it and chose to pick it. It’s that the switch can hardware decode vp8 and vp9 natively. So in theory, I would be able to use the same videos on all consoles without having to change the encoding. Now, I didn’t do that at first, because the Switch could actually decode a 1080p30 video in software fast enough, but eventually I came across the scenarios I’ve mentioned before which made me need to graduate to the hardware video decoder.

Now, it’s important that I share as much code as possible between the hardware and software video players. I came up with a design that turns the actual decoder into a standard interface with the libvpx and the switch hardware implementation. It’s too much detail to go into now, but the gist of it is that it is very simply and consists just of sending it input frames and grabbing output frames. Seeking isn’t supported; the decoder just expects to start with a I-frame (which may be in the middle of the video). If I need to seek I actually tear down the whole apparatus and boot it up again ready to seek to the needed I-frame. None of this is really astonishing, but it was surprising to me how simple a properly designed codec interface can be when the unnecessary concerns are removed. And this information is needed to see how the other parts are organized.

Before I explain the overall organization I need to tell you about the IVF file format. This is basically a mind-numbingly simple video stream format designed alongside the vpx codecs. It’s so simple that it doesn’t even necessitate a library. Imagine the simplest format you can for doing this kind of thing, and then imagine IVF is simpler. FFmpeg has support for writing to IVF streams.

So with all that background in mind, finally here’s my approach:

I typically receive webm files with vpx codec in it. I run that through a script that dumps the video stream to a .ivf file and encodes the audio as ogg (it’s not worth the trouble to get libopus compiling natively, that library is a mess.) I then have a tool that runs through the ivf file to create an index for seeking. Yeah, IVF is so simple it doesn’t have that. I’m telling you, when you have simple enough formats and concerns separated properly, it’s easier to look at specs and code everything from scratch than it is to figure out how to wrestle overcomplicated libraries needed for overcomplicated formats. Unfortunately the IVF frame headers do NOT say whether the frame is I-frame or P-frame … it’s somewhere in the packed frame data. So my tool has to run ffprobe to gather that data. I create my own “.dat” file with all this stuff, and it’s now alongside a .ivf and .ogg.

The very first thing that happens during playback is the ogg is handled by the exact same codec that plays all music in my engine. That… you could call it OggAudioStream is routed through a MuxStream which also gets a VpxVideoStream. So already things are simpler by audio being done and out of the picture.

Next I have the portable layer that manages the .dat index. It gets the resolution out of there and does the seeking as needed–it goes to the nearest preceding I-frame for the desired seek timestamp and then drives the VPX decoder, throwing away frames as needed. This layer is seeking through the .ivf file and reading the video frames out. Finally all that’s left is hooking up the correct hardware or software video decoder to do the decoding of the frames read from the .ivf.

You may be asking why I don’t use libwebm directly. It’s because that library was a nightmare to compile and I would have to pipe the audio out of it somehow, and it was easier to just make my own “container” format and leverage IVF.

I could combine the .dat and .ivf into a true container, but… why? I would never put the ogg data in there also–it’s so nice to have it played basically totally autonomously. My MuxStream determines its timing from the audio component (or the video component in case there’s no audio track).

Any videos I receive that arent vpx encoded just get transcoded, which isn’t ideal in principle but vpx is such a good codec it basically loses no quality and can take advantage of the lost information from the original encoding to produce smaller files. Like, if you ever have to transcode anything, transcoding to vpx is the way to do it. It seems to do an amazing job.

None of this is a great revelation, really. But it is a great relief to have video playback done so well that it’s a fully solved problem… without having to pay for a 3rd party library on every game (bink or smacker or whatever people are using nowadays). The FMV splash-screen displayer process source code is 100 lines only. There’s some areas for improvement in the API that’s interposed between all the stuff I’ve described and the “user” API due to how things evolved, but being able to drop a movie into your game in 100 lines is pretty good. With about 100 more lines I could scale it down, stick it in a window, and make a gamepad-controlled FMV mediaplayer, and that would take just about an hour.

More importantly, I feel good enough about this general approach that I think I can recommend it for everyone, in every situation. This problem is SOLVED.

xor_mgambrell_0

Welp. I forgot oggenc litters its output oggs with random garbage that’s different on every run.

So now I’ve got 600MB patches every time I make a new build unless I change the build process for this game to copy over the ogg files from the first rom version that got approved. That’s because ogg by default puts a serial number in each internal chunk (i.e. 100s of times in the file at unpredictable offsets) which is seeded randomly every time oggenc runs.

Moral of the story: if you’re running oggenc without “-s 0” to seed the serial number to 0, then you’re doing it wrong. Always, always, always, always.

In fact at one point I even made a custom oggenc that always does that so there’s no way I could forget… except… I forgot to use that custom oggenc this time.

xor_mgambrell_0

I’m working on a big Wonder Boy collection. The biggest challenge is keeping all these names straight. Wonder Boy 5 Monster World 3? Wonder boy 3 – dragon’s trap, or monster lair? Wonder boy 4 was skipped because there were two wonder boy 3’s… when I started working on this project, the first thing I did was come up with a naming scheme so I wouldn’t get confused by everything. This post is about WBIML - wonder boy in monster land (basically, monster world 1 [wonder boy in monster world is not monster world 1, it’s monster world 2]).

The game lets you buy some magic spells / items. Fireballs, bombs, that kind of stuff. But there’s no menu to choose what’s equipped. I think it’s because it’s was actually an arcade game first and they didn’t want people to fool around in the menu forever… nor add another button for hot-swapping it. The SMS port of the game could have been a chance to improve that, but they seem to have kept it just like the arcade game for some strange reason. Like, time limit and all. The time limit really doesn’t make sense for this game…

So what happens is when you get an item, it’s pushed on a stack, and you can’t access whatever item you had before until you exhaust all of that new batch you’ve got.

Well, this game is pretty… bleh. I couldn’t stand it being totally pointlessly insufferable, too. So I added a system that lets you choose your magic by hitting a hotkey.

In order to do this, I have to run through memory to see what the game’s put in the stack. I collect these in large-sized bins and keep track of the order I found them. Then I clear all that memory and write them back out in 0x7F sized bins. If the user pressed the “select magic” hotkey, before I do that I rotate the ordering.

I then have to cause the HUD to get updated. In the arcade version I found a flag that does this; I just set it to 1 and the game knows to update the hud (I found the code that acquires a new item setting that flag). For the SMS version I couldn’t find such a flag, but found the function that updates the hud so I call that from my frontend by saving the cpu state and then meddling with it to have the right SP and PC and stepping the instruction loop without stepping the whole machine until the SP passes the original value (i.e. the RET was excuted).

The best way to do these hacks is to hack the rom to use a physical button that doesn’t really exist, but… I like to try doing things the other way sometimes. Just to prove I can. Plus, occasionally (commonly?) the ROM will be totally full and there’s no place to put that code. You can always expand the rom, but this is apparently kind of a fraught process. Or so I’m told. I don’t understand why… really you should just use what little bit is available to set up a “syscall” by ID mechanism that makes a “far call” to another rom bank. Maybe I’ll do it that way some day.

I’m not sure if this feature will make it into the final game… if QA finds problems with it, depending on WHEN they find the problem, I may have to nix it rather than incur more risk. Why we can’t QA while the game is being developed instead of at the last minute is a story for another day…

bengrue

Wonderboy has such delightfully candy-like tiles. It’s always been so nice just to look at!

slimehunter84

@xor_mgambrell_0 This is cool! I’m excited to see the launcher UI. Also I never knew there was an arcade version of the wondrous lad.

xor_mgambrell_0

We didnt like how the money in this NES game was spawning (too rare, seemed broken) so we started investigating it. Seems it used a threshold of $A0 and it gave money whenever the threshold was exceeded by a random roll. But this didnt seem to explain the whole story. So we checked the random number generator.

It seems to be four state bytes, each for a stream, where the stream is just a sequential walk through a table of 256 bytes of entropy. And each time a random number is required, it moves to the next stream, so that as long as theres an odd number of things requiring random numbers, it will be getting then from multiple streams and not stuck in a predictable sequence unto itself. It works out okay but I dont know if a choice of four streams makes for a good divisor or whatever. Perhaps something prime?

So the algorithm is basically:

byte curr;
byte state[4];
byte nextrand() { return randtable[state[curr++%4]++]; }

This all seemed alright, except that randtable doesnt contain entropy. Im pretty sure it contains SOME OTHER table (I dont know what, yet). I surmise the devs ran out of room in that bank (I certainly couldnt find room for a 256 byte table in there) and just said whatever, this looks jumbled enough.

As long as the random numbers are used to choose a random wandering pattern, its not so bad, but when this $A0 threshold is used for something simple like a money drop, it makes it almost impossible to behave sensibly. For example if we changed the threshold to $80 but there were no values between $80 and $A0 in the table, then it would make no difference to the drops.

So we’re either going to TRY HARDER to find room in the rom for the table. Or since this game has WRAM (battery-backed) and Ive studied its use of the WRAM, I happen to know that it doesnt use it all (almost no game does) and so we could just copy the table to $6F00 (the save game slots begin at $7000) at bootup using whatever extra banks we can trivially tack on (tacking on banks to run new code at bootup is substantially easier than inserting code patches all over that switch to other banks on the fly, or finding room to jam stuff into the common bank)

One final detail to mention: how is state[] initialized? When entering a dungeon, state[0] is set from the global frame counter, and then state[n] is set as randtable[state[n-1]]. This means there’s only 256 configurations… but I doubt anyone will notice.

xor_mgambrell_0

More NES emulation…

We didn’t like how this other RPG game was rolling random encounters. It happened on the first step too often. Very annoying. I think there needs to be some kind of step counter setup with a minimum value… I don’t know how many games do that. Is it possible they have arranged the statistics so that low rolls are less common? I haven’t checked this in a while.

This game has 4 terrain types, and up to 4 encounters per terrain type; on top of that it is organized into zones in the world map (thus further-away zones have harder encounter sets). There is only one monster per encounter, so encounter type = monster type. The data is organized this way:

struct EncounterDataForWorldZone
{
 byte chances[4][4];
 byte monsters[4][4];
}

EncounterDataForWorldZone encounterData[N]; //N zones in the game, N unknown

Now, for each step, the game gets a random number R from a flat distribution. It then considers chances[currterrain][0] = C as a C/256 chance of getting the encounter on that step. We see if R is less than C and do the encounter if so. Otherwise C is subtracted from the random number, so the next chance C’ from chances[currterrain][1] is a C’/(256-C) chance. And so on.

Generally the chances near the beginning of the game where we were testing, is, I dont know, 16-20 or so. So it’s easy to see why we get too many encounters… and why they can happen on the first step so often.

My first thought was to use the old “turn it into a normal distribution” trick by changing a 1d256 roll into a 2d128. This biases more normal rolls in the middle at the expense of rolls in the tails. So we did this for a while, and it felt okay… but we weren’t testing the whole game, and I got nervous.

Let’s take a simplified scenario where we have 2 encounter chances, each of value 64. Now the original way it was done, each of those would be a 64 in 256 chance (1 in 256). Well, kind of, because of how the subtraction happens in the algorithm. Anyway with two dice, there’s fewer ways of rolling the smallest 64 numbers than there are the next 64 numbers (just like there’s only one way to roll a 2 on two dice but there’s many ways to roll a 7). And yet since they’re adding up to 128, the new probability distribution is centered on 128, and the absolute probability of an encounter is still the same! All we accomplished was to bias the selection to the later encounters. This would change the character of the game in a way we don’t really have time to test and scrutinize.

So my romhacker came up with a better idea to add a tiny bit of logic to only do encounter checks on every other step. This way you won’t get them on successive steps under any circumstances, and we get the other desired result of just generally fewer encounters–precisely half as many, without having to grapple with the really thorny problem of how to do it, logically speaking, without affecting the design of the game by biasing certain encounters. Actually after a little more iteration we developed an approach to run it with any divisor so that I could give another option to reduce encounters by half again (so a divisor of 4) from the emulation menu, so people would feel a bit empowered to be masters of their own destinies instead of frustrated by the fixed design. (There’s also a cheat to disable encounters entirely)