• xor_mgambrell_0 xor_mgambrell_0

    Some platforms have really weird patching systems. If you do something like frameshift all your data by 1 byte in an update, the patch engine will fall apart and it will be 100% the same size as the new file. And there are limits to how large patches can be, too. This sounds like it might be amateurish, but there's reasons for it: making a fully-powered patching system has performance implications. Specifically on the disk IO system. And consoles are all about the bargain of the platform guaranteeing resources while the developer guarantees he'll use them. So the patching apparatus cannot be infinitely-powered, like to the extent where it's a fully optimal difference which requires unbounded time to uncrunch patch data while referencing sectors from all over the version0 datafile. Specifically that last part: in the worst case scenario: the individual seek and IO time will get out of control, as will the wear and tear on the flash media. I call it read amplification because that's basically what it is.

    There's a piece of the puzzle you're missing if there was something creepy in the above that was giving you a bad puzzled feeling. And it's this: the premise for a console platform's patching engine is that the version0 datafile is always present. It's in the physical media! The patch must be applied on the fly. There's no separate storage for it. Even on modern consoles, this same principle may be in effect.

    On the developer's side of that guarantee, then: we are supposed to read the docs about how the platform's patching apparatus works, and then design disk file formats which don't cause problems for it. This always catches Unity developers by surprise--the default way for doing things there makes a complete mess out of the data every time you touch anything. It's always sad to see the horror slowly dawn on them. Well, it catches everyone by surprise at least a little the first time, but at least the developer of a proprietary engine would have had this kind of thing on their radar already (and it's not as likely to be their first time doing anything.)

    But I'm personally in a bit of a pickle. See, I'm porting other people's engines, and they've unknowingly made poor choices about the structure of the packfiles. I don't want to hack the engine (part of my business plan is to do everything I can to touch engines and games as little as possible, to keep costs low.) But what do I do when I need to patch one of these games? I ran into this scenario last week. So here's how I handled it.

    I have a file format which consists of an 8 byte record per 'sector', where the 'sector' size can be tuned somewhat depending on the nature of the data. This sector consists of what I call a 'literal prefix', a 'chunk', and a 'literal suffix'. Tracking that requires 8 bytes per sector which can be directly-indexed by sector. The literal prefix and suffix are stored in the patch file at the end together. The 'chunk' comes from the version0 datafile. The sector may naturally be 100% literal prefix or 100% chunk--and commonly is. After all, we're dealing with packfiles, and unless internal data has been re-indexed somehow so a bunch of IDs and references are changed (we'll hope that doesn't happen *), most of the data is the SAME--only moved around. And most importantly, the implications on read amplification are contained: reading one sector of data from the patched file results in reading no more than two sectors of data from the version0 file.

    I made an effort to find the best 'chunk' that I possibly could. I do exhaustively search for those, using an acceleration data structure that consists of a list of offsets of strings in the version0 file beginning with each 16bit value. So that actually cuts the number of strings in the version0 file I have to search by 1/65536th. Using that filter, I just look for the longest matching string I can. Even this would be excruciatingly slow (hours?) but I found another technique: It so happens that the best string in version0 for version1's sector 124 is most commonly the string immediately succeeding version1's sector 123 chunk (as long as the literal suffix for sector 123 was empty--the sector is concluded with a chunk). That is, in other words, I can find the next piece of data from the packfile's inner-packed file without even needing to know anything about the structure of the packfile. With THAT (and a 32 thread processor) I can handle a 1GB packfile in 30 seconds or so. The whole algorithm is trivially parallelizable. Even the acceleration table is parallelizable (each of 256 threads indexes the strings starting with that byte). That table is about 10GB for a 1GB packfile, when it's all said and done.

    I did evaluate all the existing patch file formats, but none of them were really designed for on-the-fly use. For sensible reasons. I'm sure in exotic filesystems there may have been something like I needed but it would have been impossible to reuse that stuff. I didn't even look for it.

    Now to finally complete the picture, what we do is make version1 of the game out of the version0 packfile + the patch file. The patch that gets shipped to users is no larger than our patch file, then.

    Actually the format I devised is kind of bloated, so I do compress the whole thing. Could compress to anywhere from 10% to 99% of the original file size.. depends on details. Since this is done on the entire file, and you may have noticed my read-amplification analysis didn't discuss reading the literal parts, it follows that my patch files are loaded completely into memory. My defense of this is that any engines we have that make such good use of a packfile that they are enormous enough to cause problems, will also be rather efficient about NOT loading too much data in memory, meaning I will have plenty of room to hold a patchfile. Hope I'm not wrong...

    I had to modify my engine to try opening myfile.ext.patch any time myfile.ext is opened (actually this analysis is done when our filesystem is built during content building) and then have the ability to do this patching on the fly. But there's one more read-amplification issue I had to deal with: since my patch file is sector-based, and usually games will read entire multi-sector files from the packfile all in one operation, I don't want to turn that into several individual-sector reads from the physical filesystem. To make sure this doesn't happen, I interposed another layer of logic that 'coalesces' adjacent reads from the version0 file and doesn't execute them until a read comes from somewhere else.

    Whew, that was a lot of work! I hope it works out. Part of the concept here is that no work is required IN ADVANCE, so despite my testing, I won't be 100% sure about it until I actually run into the scenario where I have to do the patch (this whole endeavour was precautionary before we shipped a game with huge packfiles).

    • In case this does happen, I have another strategy planned. What I'll do is find the best-matching sector's worth of data in the version0 file and then store the XOR of it vs version1 in the patch. The whole patch will then be compressed (and it's hopefully mostly zeroes) but it needs to be done one sector at a time, since the mountain of zeroes will be too big for RAM. I think this approach has promise, but it's just not what I ended up doing. Plus, I think the computation to find "best-matching sector" is pretty rough. I don't want a patch file that takes days to compute... unless I'm really, REALLY desperate. Since I have the on-the-fly patching built into my engine now, I could make an alternate patch format without too much work. Note that in order to avoid read-amplificaton issues, I would need to STRONGLY prefer sequences of sectors. I mean, I know it's paranoid, but I'd rather miss once and have an inefficient sector than even have a chance of a perverse case turning one long sequential read into dozens of short ones.

    posted in Blogs read more
  • xor_mgambrell_0 xor_mgambrell_0

    So a game was running slow on the switch. After a while I realized it was because of some text I'd added which was getting rendered every frame. Like... one sentence of text. Using a spritefont and drawn in a batch. What the heck? This shouldn't be so slow.

    After digging in, I realized it was due to the kerning support I'd added to imgui which necessitates lookups of kerning pairs between letters. On top of this, imgui's basic glyph lookup was slow. (Using imgui was a mistake, but that's a story for another day and I'm stuck with it for now.)

    Both these things are ABSURDLY slow because they use a binary search through a rather sizable list. Binary searches are slow because by definition they hop around memory without paying any heed to where they are in a cache. Every single hop is to a fresh cache line. This is 100% the worst possible case for a memory subsystem. And that's the weakest point of mobile platforms, which the switch is. Turns out this just cost me a few FPS but it was enough to drop it below 60 and thus to 30.

    I'll have to redo this some day. But not now... I just optimized a different thing entirely to earn back some FPS from a different budget and so the game regained 60fps. However, I do know how to do it. And as luck would have it... I had to do basically the same thing for another text renderer just the same week!

    In this other case, I'm high-level emulating the Turbo CD "system card" bios. This is so we can release games that are emulated without having to distribute the copyrighted bios, which there's no way we could ever get the rights to. This proceeded pretty well. But finally I discovered at the very end of the game we were porting, it would use a built-in SJIS font that the bios has for rendering the credits. Now, I really don't want to include that copyrighted font verbatim. On the other hand, it takes a LOT of work to make a Japanese font. So I wondered... had they just used someone else's public domain font? So I spent hours scouring the web and could find no proof that they had. Fortunately while doing so I found 100s of BDF-format dot fonts, a few of which were 16x16 (which is what I needed), and 3 of which had the needed attribute of leaving one column empty (so that the glyphs would be spaced at least a little if put on a grid).

    So it was time to make a BDF-processing tool, and a necessary part of that is creating the glyph lookup table so that I could get from an input SJIS code to a specific offset in the dot font bitmap data. And now I will tell you the proper way to do it.

    SJIS is a two-byte code. UTF16 and even UTF-8 can be handled the same way (albeit 3-byte codes are common in UTF-8 when it comes to Japanese). You can make a 64k-entry table if you want (the entry is either the offset to the glyph or missing-glyph sentinel). This involves just one random access cache hit per lookup. The table entries really need to be 4 bytes so it ends up being 256KB. Not bad, but we could do better.

    So generally what you do is you simply make a 256-entry "1st-level" table. The first byte of the code is the key index and the value is the number of a 2nd-level table. These can be bytes, so long as you know how to calculate the offset of a 2nd-level table from the number [it's 256+number*sizeof(2nd-level table)]. So the first level table is tiny, although it's still basically a random access each time. The 2nd level table can be 2 byte entries (to index another table of offset to the final glyph) or 4 byte entries (to directly index the offset of the final glyph).

    My glyphs were all fixed-size (32 bytes apiece) so I could calculate them from a number, so I ended up with 2 byte entries. Note that in MANY cases your glyphs will be fixed-size: insofar as they are records containing e.g. texcoords. If you're careful, a spritefont renderer can operate completely in-cache, and really only the final access of the glyph record is random (the gpu batch data you're producing is written consecutively so it's actually the ideal case for any CPU and is basically negligable compared to all the rest, in this case where it's not much data).

    But anyway that's more optimization than I really care to think about now. By using the 2-level index you change a dozen or more random accesses to 2 or 3 and due to the way these memory bandwidth-constrained devices work, that can directly map to a 4x speedup. I did it with a 2-level index in this case because I actually put the data in my fake bios and I didn't want to worry about what would happen if it grew too large.

    One day I'll improve that imgui text renderer some more so I can actually use it seriously; in that case I will probably make a direct 64k-entry map from 2byte codes to glyph data. But wait--what about the kerning! We're not through yet. I brought all this up for a reason! The kerning data is kind of a multi-level index, isn't it? 1st index is the 1st character and 2nd index is the 2nd character. Now in this case you can end up with some giant tables (if any 2-byte character can be kerned with any other 2-byte character, you can end up with 64k*64k=4GB entries in the worst case...) but still it's only a small number of random accesses. In practice the table won't be that big, and if it is too big you can work it into 3 levels somehow. Not a big deal. I'll write about how that goes when I get around to doing it.


    posted in Blogs read more
  • xor_mgambrell_0 xor_mgambrell_0

    Not a day goes by I don't have to do something like this...

    This time, I had to read a png myself and receive the palette indexed data, because I'm using it to patch tiles in the 8bit system VRAM. I can't (straightforwardly) draw it on top of the finished output because it wouldn't fade along with the rest of the screen unless I took some really wacky measures. My content building process munges pngs phenomenally into a custom format which uses LZ4 to compress (pngs are very slow to decode and LZ4 advertises "faster than memcpy"). This works well except on big photorealistic stuff which it can inflate by 3x relative to the pngs.. everything else is just fairly comparable in size but also loads faster. So I fiddle with my build scripts to replace the munged pngs with specially chosen pristine ones.

    Anyway the point of all that was to show what a highly managed process importing PNGs are in my engine, so that I always end up with 32bpp data. And yet in this case I need palettes. And I don't want to re-paletteize it because sometimes these patches need to use the palette that's already there (replacing tiles is an order of magnitude easier than replacing palettes, which are computed on the fly [fading, you know] rather than just parked there like tiles and BG entries). So rather than generalize my 32bpp textures system, I added a new 8bpp image loading system.

    stb_image is what I use for loading unmunged pngs, which can only load images into 32bpp format. I don't to add lodepng... I don't know, I just don't want another library. Corona is overkill (and another library I'm not using yet). SDL_image is not going to happen because I'm not using SDL. But I already have libpng available as an importable module in my engine (some games need it) so I might as well use that.

    Problem is, libpng is like... a nuts and bolts library, not a user-friendly library at all. There are a lot of user-unfriendly parts of it (that is to say, all the parts) but the one in particular that's got me down today is that it could give me an image with a palette that's size 16 or 256 depending on how I saved the PNG. So far out of 2 images I have saved the PNG both ways. Now, libpng has a lot of fun transformations it can do on its pixel formats, but not the one I need (or I can't figure out which one it is). So it's time to write my own 4bpp-to-8bpp converter! It took a few tries, but I came up with this, which is so neat and nutty its what inspired me to write this post today:

    if(bit_depth == 4)
      for (int y=0; y<height; y++)
        for (int x=0; x<width; x++)
          int xx=width-x-1;
          row_pointers[y][xx] = (row_pointers[y][xx/2]>>(((xx&1)^1)<<2))&0xF;

    Here we work from right to left (don't ask me why I don't reverse the loop, I have no idea) because we're expanding the data so the store is always further to the end than the load so that we never step on our own feet. Then we just have to pick the N/2th 4bpp byte for the Nth pixel, shift it down if it's the upper nybble, and mask it down to 1 nybble in case it wasn't. Easy! To tell you the truth, I didn't even check if there was a png transformation, I just did it this way myself immediately because I knew it would be fun.

    posted in Blogs read more
  • xor_mgambrell_0 xor_mgambrell_0

    There aren't a lot of PS5 games. Not as many as you'd expect. I mean, you can get a lot of games on your PS5 but most of them are PS4 games.

    That may or may not have to do with this:

    In order to make a PS5 game, you have to implement this complicated thing Sony is trying to promote called "activities" or possibly "stories". I don't know exactly what it's called; it's complicated.

    The idea is that it should let people jump into the game from the OS with minimum fuss -- no splash screens (insofar as it's possible), no title screens, no menus, no file selects. It's like a menu your game has to support, outside the game. You could call it kind of like windows jump lists. I might have a limited view of this feature (I think it's way more powerful) because I have made limited use of it.

    If you haven't heard of this thing, don't feel bad. I don't know if anyone actually uses it.

    OK, so you want to do this. Prepare to turn your game inside out. You need to boot straight into gameplay, so any accidental dependencies your game had on earlier screens need to be handled differently. You have to tell the OS when the player has begun an activity (so it puts it in the list of resumable activities) and tell the OS when the player has ended it. You need to be able to handle an instruction from the OS to jump from whatever state the game is in IMMEDIATELY to the user's selected activity. The meaning of jumping to the activity is left up to the game: it could be a save file, or it could spawn you in the starting area rather than where you left off last time so you can begin doing some more chores. That helps a bit.

    But I'm working on these emulated games and I can't turn them inside out. I can only make tiny laser incisions. Here's some of the things I've had to do:

    1. I need to decide when the player has "begun an activity" (namely, "goal: beat the game".) The game has an attractmode that shows a gameplay demo so I can't use anything in-game to identify that the player has done this. Before the game begins, it shows the weapon loadout selection menu. This it does NOT show in the attractmode. It DOES show it inbetween levels and when you die. So what I do is this: I check a tile in a certain spot to see if it's part of the loadout menu. When I find it, I tick a state machine. If I don't find it, I tick it again: and at that point the "activity" begins. I can't trigger "begin" when the user exits the title menu because we're supposed to skip as many menus as possible (making as many assumptions as we please) so I have to wait until after the menu is done and the gameplay has really begun.

    2. I need to jump to the activity begin point even if the player has never begun the activity in the actual game. That means I possibly don't even have a savestate. So I have prepared a savestate just before the first level fades in, along with an arbitrarily-selected loadout. What else am I supposed to do?

    3. I need to jump to the activity begin point if the player has actually begun the activity. In this case, I use the last savestate he loaded or saved. Unless that one's not there, then I punt to item 2.

    4. I need to decide when the player has "ended an activity". I have previously prepared hooks for when each level is beat, for achievements, but these hooks don't exist in an "un-enhanced" version of the game shipped with the SKU. In case this is what the player's playing, I still have to detect the end of the game. The achievement hook I created before was done when a specific cutscene is shown at the end of the game. It's a good, dramatic timing. So, I decided to reproduce that for the activity ending. To do this I will repeatedly memcmp the beginning of VRAM where I happen to know the tiles for that cutscene are. The moment they appear, the game is over. Well... the cutscene is fading in, so the screen is black at that time (that's not ideal for achievements because screenshots get taken at that time) but that's what we've got for now.

    In another game I'm working on, which wasn't emulated, I had to skip the language selection menu and just assume the user wanted English. What else am I supposed to do? What else is anyone doing? I should probably get a PS5 so I can check some games. In retrospect I should have made an effort to base it on the user's OS-selected language, but what about people using languages I don't support?


    posted in Blogs read more
  • xor_mgambrell_0 xor_mgambrell_0

    Well, I'm not working on my own game right now. What if I tell you some interesting stories each week from the other people's games I'm working on which is keeping me from the real business of making my own game? I hope you don't mind.

    This week I had a few walls of text in FIGS languages which lost formatting during localization. I needed to break the paragraphs at "Unfortunately" and "Watching". I felt a little powerful for recognizing those words immediately. Italian "Guardare"... that one tripped me for a half a beat. Sounds like 'block' or 'pause'... those I have seen before. Ah, but "regard".. get it?

    My general process for double-checking stuff like this (triple check is to ask a native speaker) is to paste it in google translate and see if the splits make sense in English. This is not hard for the FIGS languages... but... with Japanese... ok, let me show you what happens.

    Here's a Japanese sentence we might have to wrap ourselves:

    "You have saved this world from the terrifying hands of Dr. Hankyo!"

    Now what I'm going to do is split it at the most reasonable points and show you how google translate re-interprets the sentence:

    あなたは -- 恐ろしいハンギョ博士の魔の手からこの世界を救ったのです!
    You -- He saved this world from the terrifying hands of Dr. Hankyo!

    あなたは恐ろし -- いハンギョ博士の魔の手からこの世界を救ったのです!
    You are scared -- He saved this world from the magical hands of Dr. Hankyo!

    あなたは恐ろしいハンギョ -- 博士の魔の手からこの世界を救ったのです!
    You are scary hangyo -- He saved this world from the hands of the doctor's devil!

    あなたは恐ろしいハンギョ博士の -- 魔の手からこの世界を救ったのです!
    You are a terrifying Dr. Hankyo -- I saved this world from the hands of the devil!

    あなたは恐ろしいハンギョ博士の魔 -- の手からこの世界を救ったのです!
    You are the terrifying Dr. Hankyo's demon -- I saved this world from my hands!

    あなたは恐ろしいハンギョ博士の魔の手からこの -- 世界を救ったのです!
    You are this from the terrifying Dr. Hankyo's devil's hand -- You saved the world!

    あなたは恐ろしいハンギョ博士の魔の手からこの世界を救 -- ったのです!
    You save this world from the terrifying hands of Dr. Hankyo -- It was!

    The guidance on Japanese typesetting will tell you it doesn't really matter much, as long as maybe you cut it between kanji and kana, but these examples run the whole gamut of persons, tenses, senses, and storylines, so wtf?

    Anyway I had to prepare all these FIGS translations in a Photoshop file so I could export the 1080p pngs for display in-game without needing a Photoshop-calibre typesetting engine. I'd say this should be someone else's job, but apparently it's my job to do whatever everybody else doesn't do in cases where it would take me longer to tell them to do it, explain how to do it, and then check their work. Photoshop does have its own wrapping logic which is probably good but in this context I had to be creative about where to wrap it or it would turn into too many lines.

    posted in Blogs read more