Well, I have now hit the brick wall: the point where I may end up restarting from scratch, or burning out and taking a significant break from my project.
I managed to finish all my lua bindings for creating meshes and all that and now I’m at the part where I actually test things and see if they run.
Testing goes as it normally does: find segfault, fix segfault. I test compilation regularly, but not always execution, since sometimes I need all the parts made and the lua bindings in before I can write a lua script to test with. Unit tests may have been useful here, but it's far too late and far less fun to do those.
So now I'm at a conniving little segfault, right smack on an assignment operation. I checked, and nothing is null as far as I can tell. Let me give some context, so maybe someone can help, or laugh at how awful this code is and how it's better to do it some other way.
Yes I’m using std::shared_ptr and I’m thinking of changing that, since that’s the thing that’s segfaulting, so here’s the line that’s failing:
(*texture) = TextureLoader::getResource(path);
texture is a std::shared_ptr<Texture>*, or a pointer to a shared pointer. The reason for that is that this is in the lua binding, and the idea is to have a lua userdata that is the shared pointer, so I do a lua_newuserdata(state, sizeof(std::shared_ptr<Texture>)) and that returns a pointer to the shared pointer.
The reason to have lua hold the shared pointer is of course to have the reference count of lua’s reference be included in the shared pointer, that way it won’t get deleted when its only reference is in lua.
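One thing worth checking here: lua_newuserdata only hands back a raw block of memory, and Lua does not run any C++ constructor on it. If the shared pointer in that block is never constructed, assigning to it makes operator= release whatever garbage "previous" control block the bytes happen to describe, which would be consistent with the nonsense address (this=0xa32c049ccdffa6be) in frame #2 of the backtrace further down. It could also explain why a first call can survive, since a block that happens to be all zeroes may look like an empty shared pointer to libstdc++. Below is a minimal sketch of constructing the shared pointer in place before assigning. It assumes nothing about the real binding: Texture, rawUserdata, pushTexture and collectTexture are hypothetical names, and malloc stands in for lua_newuserdata so the sketch runs without Lua.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <new>
#include <string>

// Hypothetical stand-in for the engine's Texture class.
struct Texture {
    std::string path;
};

using TexturePtr = std::shared_ptr<Texture>;

// Stand-in for lua_newuserdata: returns raw, unconstructed bytes.
// malloc is used only so this sketch runs without Lua.
void* rawUserdata(std::size_t size) {
    void* mem = std::malloc(size);
    if (!mem) throw std::bad_alloc();
    return mem;
}

// Construct an empty shared_ptr inside the raw block first (placement new),
// and only then assign the loaded texture to it.
TexturePtr* pushTexture(const TexturePtr& loaded) {
    void* mem = rawUserdata(sizeof(TexturePtr));
    TexturePtr* slot = new (mem) TexturePtr();
    *slot = loaded;  // safe now: the old value is a real, empty shared_ptr
    return slot;
}

// What a __gc metamethod would need to do: run the destructor by hand so the
// reference held by the userdata is dropped.
void collectTexture(TexturePtr* slot) {
    slot->~TexturePtr();
    std::free(slot);  // only because of malloc above; Lua frees its own userdata
}
```

In a real binding the destructor call would live in the userdata's __gc metamethod, so the count drops when lua collects the object; without it the texture would never be released.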
Now here’s the kicker, I know the textureloader code works, it’s already been loading a logo for months and months now without issues.
And I know this assignment operation works, it actually fails on a second call to it from lua, as can be seen from the log output here:
04-20-2021 19:49:26 | D LUASCRIPT: path: assets/textures/Bricks054_2K-PNG/Bricks054_2K_Color.png
04-20-2021 19:49:26 | I TextureLoader: TextureLoader loading: assets/textures/Bricks054_2K-PNG/Bricks054_2K_Color.png
04-20-2021 19:49:26 | D FSManager: loaded file from assets/textures/Bricks054_2K-PNG/Bricks054_2K_Color.png with size 24775201, total loaded datasize: 41202785
04-20-2021 19:49:27 | D Texture: loaded texture with nchannels: 4
04-20-2021 19:49:27 | D Texture: Loaded all the things, scheduled work.
04-20-2021 19:49:27 | D LUASCRIPT: path: assets/textures/Bricks054_2K-PNG/Bricks054_2K_Normal.png
04-20-2021 19:49:27 | I TextureLoader: TextureLoader loading: assets/textures/Bricks054_2K-PNG/Bricks054_2K_Normal.png
04-20-2021 19:49:27 | D FSManager: loaded file from assets/textures/Bricks054_2K-PNG/Bricks054_2K_Normal.png with size 24154252, total loaded datasize: 40581836
04-20-2021 19:49:28 | D Texture: loaded texture with nchannels: 4
04-20-2021 19:49:28 | D Texture: Loaded all the things, scheduled work.
Segmentation fault (core dumped)
It loads the Color texture just fine, no issues, then it goes to load the Normal texture, that finds the image file just fine, you can see that it finds and loads 24154252 bytes, and it has 4 channels.
It exits out of the texture creation.
Then exits from the textureloader, and proceeds to segfault on the assignment, as seen in the backtrace here:
Thread 22 "Nebula3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff977fe640 (LWP 17747)]
0x000055555568e9f8 in __gnu_cxx::__exchange_and_add (__val=-1, __mem=0xa32c049ccdffa6c6) at /usr/include/c++/10.2.0/ext/atomicity.h:50
50 { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
(gdb) bt 10
#0 0x000055555568e9f8 in __gnu_cxx::__exchange_and_add (__val=-1, __mem=0xa32c049ccdffa6c6) at /usr/include/c++/10.2.0/ext/atomicity.h:50
#1 __gnu_cxx::__exchange_and_add_dispatch (__val=-1, __mem=0xa32c049ccdffa6c6) at /usr/include/c++/10.2.0/ext/atomicity.h:84
#2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xa32c049ccdffa6be) at /usr/include/c++/10.2.0/bits/shared_ptr_base.h:155
#3 0x000055555568de4b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff977fd1e8, __in_chrg=<optimized out>)
at /usr/include/c++/10.2.0/bits/shared_ptr_base.h:733
#4 0x00005555556b7fb2 in std::__shared_ptr<Texture, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff977fd1e0, __in_chrg=<optimized out>)
at /usr/include/c++/10.2.0/bits/shared_ptr_base.h:1183
#5 0x00005555556b8af6 in std::__shared_ptr<Texture, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x7fff88218af8, __r=...) at /usr/include/c++/10.2.0/bits/shared_ptr_base.h:1279
#6 0x00005555556b8616 in std::shared_ptr<Texture>::operator= (this=0x7fff88218af8, __r=...) at /usr/include/c++/10.2.0/bits/shared_ptr.h:384
#7 0x00005555557745c1 in loadtexture (state=0x5555570a7df8) at /home/alex/nebula3/luatexturelib.cpp:29
#8 0x00005555558e03f0 in luaD_precall ()
#9 0x00005555558ee4ab in luaV_execute ()
The one other hint I have is that adding two log lines to my render loop (a separate thread from this one, which is the update thread) suddenly makes the whole issue go away. That would maybe indicate some locking issue, and I inserted the log lines around the call on a work queue that is loaded from the function that creates the texture. That's all fine and good, but when I rip out either loading the work into that queue or the call to the queue entirely, the problem still persists. And it's segfaulting consistently here. Whenever I've had a threading issue in the past it would usually be sporadic, but so far this fails on the second lua loadtexture call 100% of the time. Never the first, never any others after that. It never segfaults from any other thread, just consistently on this assignment.
So what are my next steps?
Probably stop using std::shared_ptr and try rolling my own resource counter. It would give me more control at least, and possibly fewer possibilities for bugs, and it may fit well considering I already have loaders for resources. I'd just need to revamp those to use my custom counter thingies, and then have them count somehow. The problem is keeping track of the count, which is easier with shared pointers. Also, there's a chance that doesn't fix this issue.
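For a rough idea of what such a counter could look like, here is a minimal sketch of an intrusive reference count, where the resource carries its own count and a small handle type bumps it. This is not the engine's design: Counted, Handle and DemoTexture are hypothetical names made up for illustration. Note that, like std::shared_ptr, such a handle has a constructor and a destructor, so it would still need to be constructed in place if it lives in memory that lua allocates.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Hypothetical base class: each resource carries its own reference count.
class Counted {
public:
    virtual ~Counted() = default;
    void retain() { count_.fetch_add(1, std::memory_order_relaxed); }
    // Returns true when the reference just dropped was the last one.
    bool release() { return count_.fetch_sub(1, std::memory_order_acq_rel) == 1; }
    int count() const { return count_.load(std::memory_order_relaxed); }
private:
    std::atomic<int> count_{0};
};

// Hypothetical handle: retains on copy, releases (and deletes) on destruction.
template <typename T>
class Handle {
public:
    Handle() = default;
    explicit Handle(T* p) : ptr_(p) { if (ptr_) ptr_->retain(); }
    Handle(const Handle& o) : ptr_(o.ptr_) { if (ptr_) ptr_->retain(); }
    Handle(Handle&& o) noexcept : ptr_(std::exchange(o.ptr_, nullptr)) {}
    // Copy-and-swap: the by-value parameter releases the old resource.
    Handle& operator=(Handle o) noexcept { std::swap(ptr_, o.ptr_); return *this; }
    ~Handle() { if (ptr_ && ptr_->release()) delete ptr_; }
    T* get() const { return ptr_; }
private:
    T* ptr_ = nullptr;
};

// Hypothetical resource used only to exercise the handle.
struct DemoTexture : Counted {
    int id = 0;
};
```

The upside of the intrusive layout is that a raw pointer is enough to recover the count, which can be handy when a loader cache and lua both refer to the same resource.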
Knowing my luck, by the time I roll my own resource counter I'll figure out that this was some simple fix and fix it. Then it'll be too late to go back to the shared pointers, and there'll be tons of bugs in the resource counters to fix. Such is life, I suppose.
Oh, and maybe I'll think about adding gmock/gtest unit tests. I haven't done that in like a year and a half, so it might be worth brushing up on it again, and I can't really argue that it wouldn't be beneficial if I ever decide to make this engine actually a big thing.
An after note while editing/writing:
The “total loaded datasize” from the FSManager seems to be a bit telling. Perhaps things are getting freed unexpectedly by the shared pointers?
The total data size is supposed to be the grand total of all loaded memory, but here it actually decreases between the first and the second loadtexture calls:
04-20-2021 19:49:26 | D FSManager: loaded file from assets/textures/Bricks054_2K-PNG/Bricks054_2K_Color.png with size 24775201, total loaded datasize: 41202785
04-20-2021 19:49:27 | D FSManager: loaded file from assets/textures/Bricks054_2K-PNG/Bricks054_2K_Normal.png with size 24154252, total loaded datasize: 40581836
This could possibly be due to the logo being freed, of course. However, neither total is equal to both image sizes together. Instead, each one is the size of the image just loaded plus 16427584.
So it seems that the lua textures are being freed when they shouldn't be, which could be an issue with something else, but would hopefully be 99% fixed by a custom resource counter implementation.
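The arithmetic can be checked directly from the two FSManager log lines. The small sketch below subtracts each file's size from the total reported right after it was loaded (baseline is a hypothetical helper name; the numbers are copied from the log). Both come out to the same 16427584 bytes, which means the Color file's 24775201 bytes were no longer counted by the time the Normal file was loaded. The log alone does not show whether that is the texture itself being freed or just the file buffer being dropped after decoding.

```cpp
#include <cassert>
#include <cstdint>

// Bytes the FSManager total holds beyond the file that was just loaded.
constexpr std::int64_t baseline(std::int64_t total, std::int64_t fileSize) {
    return total - fileSize;
}

// Values taken from the two "loaded file" log lines.
constexpr std::int64_t kAfterColor = baseline(41202785, 24775201);
constexpr std::int64_t kAfterNormal = baseline(40581836, 24154252);

// Both totals sit on the same baseline, so the first file is not in the second total.
static_assert(kAfterColor == 16427584, "Color total = file size + baseline");
static_assert(kAfterNormal == 16427584, "Normal total = file size + baseline");
```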
Gruedorfing this possibly helped me fix this bug, tune in next time after much excruciating pain implementing my own resource counters to know if that fixed this!