David Brazdil said:
tl;dr: We should change our spinlock implementation to ticket locks, same as Linux.
Actual test failing:
When testing on the RPi4, test vcpu_state.concurrent_save_restore fails to make progress. Two threads are racing to run a secondary which keeps sending back empty SPCI messages, waiting to be rescheduled and checking that its register state was correctly saved/restored.
The two threads get stuck as the secondary tries to send a message to the primary, just before Hf returns to the primary.
Thread A is stuck looping in api_vcpu_prepare_run(). It acquires vcpu->lock, sees that vcpu->regs_available is FALSE and vcpu->state is not RUNNING, unlocks and repeats. It always wins the race to lock the vcpu struct.
Thread B was the first to see vcpu->regs_available == TRUE, prepared to return to the primary, and is now in the process of saving the secondary vCPU's registers. hvc_handler() returns to assembly, which saves the registers and calls complete_saving_state(), which calls api_regs_state_saved(), which sets vcpu->regs_available to TRUE. However, it has to acquire the same vcpu->lock which thread A keeps acquiring and releasing.
Because the spinlock implementation is not fair, thread B never gets a chance to update vcpu->regs_available and thread A remains stuck waiting for it.
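For reference, the existing lock is roughly a test-and-set spinlock. A portable C11 sketch of that shape (illustrative only, not Hafnium's actual code; names are made up) shows why it is unfair: whichever thread's exclusive pair happens to win gets the lock, and nothing prevents the same thread from winning every time:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of an unfair test-and-set spinlock. On arm64 the exchange
 * lowers to an LDAXRB/STXRB retry loop, so acquisition order is
 * whoever's store-exclusive happens to succeed -- no FIFO ordering. */
struct spinlock {
	atomic_bool locked;
};

static void sl_lock(struct spinlock *l)
{
	/* Spin until we swap in 'true' and the old value was 'false'.
	 * The real code can add WFE here as a power optimization. */
	while (atomic_exchange_explicit(&l->locked, true,
					memory_order_acquire)) {
		/* spin */
	}
}

static void sl_unlock(struct spinlock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}
```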
Wedson had a patch in mind.
David Brazdil said:
Hey Wedson, I know you have something in the works but I was intrigued to learn more about how arm64 guarantees progress and ended up writing my own implementation using the LDXR/STXR pair. Hope you don't mind. Tested it on the Pi4 and it seems to work fine. WDYT? https://hafnium-review.googlesource.com/c/hafnium/+/6661
ascull@ said you were thinking of using the ARMv8.1 CAS? Might make sense for v8.1+ but I think we'll need a v8.0 fallback if we want to support the Pi (Cortex A72 is v8.0).
David Brazdil said:
For the sake of keeping discussions in one place and that place being the buganizer, I'm gonna copy what was said on the CL and continue here...
Not sure this is true. Previously, it lowered to ldaxrb/stxrb, which are just the byte-width versions of these, and there wasn't a guarantee from the architecture that a thread will ever succeed in this.
Addition of the events (WFE/SEV) is the bit that is more likely to be making the difference in progress, but it still doesn't seem to guarantee it.
According to the reference manual, there is some magic underneath. Can you post the assembly snippet from before?
LoadExcl/StoreExcl loops are guaranteed to make forward progress only if, for any LoadExcl/StoreExcl loop within a single thread of execution, the software meets all of the following conditions:
- Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses, preloads, direct or indirect System register writes, address translation instructions, cache or TLB maintenance instructions, exception generating instructions, exception returns, or indirect branches.
- Between the Store-Exclusive returning a failing result and the retry of the corresponding Load-Exclusive:
• There are no stores or PRFM instructions to any address within the Exclusives reservation granule accessed by the Store-Exclusive.
• There are no loads or preloads to any address within the Exclusives reservation granule accessed by the Store-Exclusive that use a different VA alias to that address.
• There are no direct or indirect System register writes, address translation instructions, cache or TLB maintenance instructions, exception generating instructions, exception returns, or indirect branches.
• All loads and stores are to a block of contiguous virtual memory of not more than 512 bytes in size.
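The kind of loop those conditions apply to can be written portably with a weak compare-and-swap, which on arm64 typically lowers to exactly such an LDXR/STXR retry pair (this is an illustrative sketch, not Hafnium code):

```c
#include <stdatomic.h>

/* Atomically increment *counter and return its previous value.
 * atomic_compare_exchange_weak is allowed to fail spuriously -- on
 * arm64 that corresponds to the store-exclusive failing, which is
 * precisely the retry the ARM ARM's forward-progress conditions
 * talk about. */
static unsigned int atomic_increment(atomic_uint *counter)
{
	unsigned int old =
	    atomic_load_explicit(counter, memory_order_relaxed);

	/* Retry while the store-exclusive (the weak CAS) fails; a
	 * failed CAS reloads 'old' with the current value. */
	while (!atomic_compare_exchange_weak_explicit(
	    counter, &old, old + 1, memory_order_acq_rel,
	    memory_order_relaxed)) {
		/* retry */
	}
	return old;
}
```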
I'll let Wedson see if this is what he was concerned about before.
Snippet of the generated asm:
ldaxrb w12, [x8]
stxrb  w13, w11, [x8]
and    w12, w12, #0x1
cmp    w13, #0x0
ccmp   w12, #0x0, #0x0, eq  // eq = none
bne    <to the top>
Re snippet: Right, the only difference is the WFE called if the lock is taken, but that should be just a power optimization.
Hmm, the manual also says it's possible for another thread to break the progress guarantee. But that doesn't sound like what's happening in our test:
The Exclusives monitor can be cleared at any time without an application-related cause, provided that such clearing is not systematically repeated so as to prevent the forward progress in finite time of at least one of the threads that is accessing the Exclusives monitor. However, it is permissible for the LoadExcl/StoreExcl loop not to make forward progress if a different thread is repeatedly doing any of the following in a tight loop:
— Performing stores to a PA covered by the Exclusives monitor.
— Prefetching with intent to write to a PA covered by the Exclusives monitor.
— Executing data cache clean, data cache invalidate, or data cache clean and invalidate instructions to a PA covered by the Exclusives monitor.
— Executing instruction cache invalidate all instructions.
— Executing instruction cache invalidate by VA instructions to a PA covered by the Exclusives monitor.
.. unless "performing stores" includes the lock variable itself. In that case the WFE could be what's breaking the tight loop.
Damn, does anybody understand this?
The part of the quote that concerns me is "provided that such clearing is not systematically repeated so as to prevent the forward progress in finite time of *at least one of the threads* that is accessing the Exclusives monitor."
My reading of this is that an implementation conforms to this spec as long as *one* [unspecified] thread eventually makes progress. This doesn't appear to be strong enough to guarantee liveness, as one unlucky thread may spin forever, which appears to be our issue on the Pi 4.
David, would you mind trying out the following patch? https://hafnium-review.googlesource.com/c/hafnium/+/6662 -- it still has the following loop:
40009a24: 485f7c08  ldxrh w8, [x0]
40009a28: 11000509  add   w9, w8, #0x1
40009a2c: 480a7c09  stxrh w10, w9, [x0]
40009a30: 35ffffaa  cbnz  w10, 40009a24
which is undesirable, but it has the advantage that threads contending for the lock can get through it without acquiring the lock. In our two-thread case, it allows one thread to go through this increment loop while another thread is doing work with the lock held [which is not possible with the current implementation].
What I wanted to do was convert this loop to use ldadd or some variant thereof but Andrew tells me that this was seen on a v8.0 CPU so we can't use ldadd.
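The overall shape of that ticket lock, sketched with C11 atomics (illustrative only, not the actual CL; on v8.1+ the fetch-add can compile to LDADD, while on v8.0 it becomes the LDXRH/STXRH loop above):

```c
#include <stdatomic.h>

/* Minimal ticket-lock sketch. Each acquirer takes a unique ticket and
 * the lock is granted in ticket order, so a thread that keeps
 * re-acquiring the lock cannot starve a waiter the way a test-and-set
 * lock can. */
struct ticket_lock {
	atomic_uint next;  /* next ticket to hand out */
	atomic_uint owner; /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket; this is the increment loop discussed above.
	 * Crucially, a contending thread gets through this even while
	 * another thread holds the lock. */
	unsigned int ticket = atomic_fetch_add(&l->next, 1);

	/* Spin (optionally with WFE on arm64) until our turn comes. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) !=
	       ticket) {
		/* spin */
	}
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```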
David Brazdil said:
It feels like we have two semi-linked issues here:
(a) observed starvation on Cortex A72, and
(b) us being unsure about the fairness and forward progress guarantees of our spinlock implementation.
We have found a way to tackle (a) now; (b) will take a while to get some clarity on.
Wedson, I tried your implementation of the ticket lock and it runs into the same problem: one of the cores never manages to get a ticket. The LDXR/STXR pair simply does not have forward progress guarantees. Inserting WFE between failed attempts fixes it, but I did some experiments and even that is not sufficient if the lock/unlock loop is extremely tight.
Mentioned this briefly when we met Will Deacon today and he said that, unfortunately, it is a known problem and this is exactly what LDADD is meant to address. So what I would propose is to merge the spinlock+WFE, use it as a fallback for ARMv8.0 and implement an LDADD-based version for v8.1+.
https://hafnium-review.googlesource.com/c/hafnium/+/6661/ was submitted but rolled back.
Wedson sent https://hafnium-review.googlesource.com/c/hafnium/+/6880 but it was never submitted.
https://hafnium-review.googlesource.com/c/hafnium/+/6783/ was submitted.
Andrew Scull said:
Will Deacon let us know about something that would likely be affecting the RPI4:
> Ah: the dreaded Cortex-A72! There's a control bit you need to set on that
> CPU so that the cacheline containing the lock isn't held forever by one
> core. I /think/ that's CPUACTLR_EL1 which, despite the name, probably
> needs to be set from EL3.
(Migrated from b/141087046.)