> I finally was able to capture a full heap dump during a crash using the Debug Diagnostics Tool. Not sure if this is
> helpful at all..
The list has a message size limit, but I got it via the list owner, and yes, it helps enormously.
It confirms there's a bug either in the SP's code that uses the GCM encryption support in Santuario, or in Santuario's support for that algorithm, which I had to write largely by guesswork since OpenSSL's documentation consists of its source code. Either is possible, since GCM is never seen in SAML due to the lack of support outside Shibboleth.
It also confirms that the crash is caused by the new cookie recovery feature, which is exactly what I assumed. I don't have an easy way to load test that code at this point, but please file a bug and attach the dump image so I can at least start eyeballing the code.
Your only fix at this point is just not using that feature for the time being.
I greatly appreciate the time you spent getting to the bottom of this; it's sincerely a big help given the small amount of time I have to spend on the SP.
> Your only fix at this point is just not using that feature for the time being.
(*) Technically you could add catchAll="true" to the OutOfProcess element to force it to trap the exception caused by the crash and keep shibd upright, but in practice that's not a great idea since the state of the heap at that point is unclear. The process may just destabilize anyway.
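For reference, the change described above would look roughly like the fragment below in shibboleth2.xml. This is only a sketch: the attribute name and element come from the note above, while the comments and the placeholder for child elements are illustrative, and any attributes or children you already have on OutOfProcess should be left as they are.

```xml
<!-- shibboleth2.xml (fragment). Only catchAll is added here; keep any
     existing attributes and child elements of OutOfProcess unchanged. -->
<OutOfProcess catchAll="true">
    <!-- existing child elements, if any, stay as they are -->
</OutOfProcess>
```

As noted, this only masks the crash; the heap may already be corrupted by the time the exception is trapped, so treat it as a stopgap at best.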
> We are seeing two types of crashes: one is the one I sent you earlier, and this is the second type. Both happen two to
> three times per day. We were thinking that increasing the StackSize on the TCP Listener might help with this one?
No, they're the same GCM bug. You will (largely anyway) not get crashes if you don't use the cookie session option.
The stack size issue is in fact not a factor at all now. I had forgotten that the move to 64-bit ended the constraints on address space. The old 32-bit builds were limited to 2-3 GB of virtual address space depending on the OS, and it was quite possible to exhaust that under enough load with enough threads, though very rarely on Windows. 64-bit systems have no meaningful limit on virtual memory allocation, so the stack size setting is irrelevant now, just a historical artifact.
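To put rough numbers on the old 32-bit constraint, here is a back-of-the-envelope sketch in Python. The stack sizes are assumptions for illustration (1 MiB and 8 MiB are common per-thread stack reservations, not values taken from the SP), and the function name is made up. It gives only an upper bound, since the heap, code, and libraries share the same address space.

```python
# Rough upper bound on thread count when every thread reserves a fixed-size
# stack out of a limited virtual address space (the 32-bit situation).
# Ignores heap, code, and libraries, so real limits were lower.

GIB = 1024 ** 3
MIB = 1024 ** 2

def max_threads(address_space_bytes: int, stack_bytes: int) -> int:
    """Number of thread stacks that fit in the given address space."""
    return address_space_bytes // stack_bytes

# Assumed example figures, not measurements from shibd:
print(max_threads(2 * GIB, 1 * MIB))  # 2 GiB space, 1 MiB stacks -> 2048
print(max_threads(3 * GIB, 8 * MIB))  # 3 GiB space, 8 MiB stacks -> 384
```

On a 64-bit build the address space is so much larger that the same calculation stops being a practical limit, which is why tuning the stack size no longer matters here.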