arve@ reported:
Linux fails to boot sometimes when running hafnium with trusty as a guest. It appears to be stuck waiting for a response to a message that was never delivered to trusty. When I look at the state in hafnium, trusty-cpu-0 seems to be waiting for a message (I assume this is what VCPU_STATE_BLOCKED_MAILBOX means). The mailbox for trusty is in the MAILBOX_STATE_RECEIVED state, and the content of the receive buffer matches the message that was last sent to the hafnium socket.
(gdb) print vms[1].vcpus[0].state
$297 = VCPU_STATE_BLOCKED_MAILBOX
(gdb) print vms[1].mailbox
$298 = {state = MAILBOX_STATE_RECEIVED, recv = 0x7fe27000, send = 0x7fe25000, waiter_list = {
next = 0x400d9860 <vms+18448>, prev = 0x400d9860 <vms+18448>}, ready_list = {
next = 0x400d9870 <vms+18464>, prev = 0x400d9870 <vms+18464>}}
(gdb) print/x *(struct trusty_hafnium_req_msg *)(((struct spci_message *)( vms[1].mailbox.recv))->payload + sizeof(struct hf_msg_hdr))
$299 = {id = 0x40, args = {r0 = 0xbc00000a, r1 = 0x3e, r2 = 0x0, r3 = 0x0}}
I think the problem is that the hafnium linux driver only wakes up the vcpu if it was already waiting in the linux driver when the message comes in, or if it could inject an interrupt. If the interrupt is already pending and the message comes in after the vcpu has called read, but before the vcpu's linux thread has set waiting_for_message, then the thread never runs again.
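To make the suspected race concrete, here is a minimal sketch of the send-side wake logic (hf_vcpu, deliver_message and the field names are simplified assumptions, not the actual driver code):

#include <stdbool.h>

struct hf_vcpu {
        bool waiting_for_message; /* set by the vcpu thread before it sleeps */
        bool interrupt_pending;   /* HF_MAILBOX_READABLE_INTID already raised */
};

/* Called when a message arrives for the VM. */
static void deliver_message(struct hf_vcpu *vcpu)
{
        if (vcpu->waiting_for_message) {
                /* wake the sleeping vcpu thread */
        } else if (!vcpu->interrupt_pending) {
                vcpu->interrupt_pending = true;
                /* inject HF_MAILBOX_READABLE_INTID and kick the vcpu */
        }
        /*
         * The gap: if the interrupt is already pending and the vcpu has
         * called read but not yet set waiting_for_message, neither branch
         * does anything and the vcpu thread is never woken again.
         */
}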
ascull@ replied:
My expectation for this series of events would be:
1. the vCPU calls read but hasn't exited to the driver
2. a message is delivered to the VM but, as far as the driver knows, no vCPUs are waiting
3. the driver injects the HF_MAILBOX_READABLE_INTID interrupt
4. the driver tries to wake up the vCPU by
a) setting the flag to stop it sleeping if it exits to the driver
b) waking and kicking the vCPU thread to allow the interrupt to be injected
As Arve points out, the driver will never try step 4 if the HF_MAILBOX_READABLE_INTID interrupt is already pending, so when the vCPU does exit to the driver, it sees no reason not to go to sleep. The dependence on the interrupt to keep things alive might be flawed, or does the Hafnium API require the HF_MAILBOX_READABLE_INTID interrupt to be cleared once the message it refers to has been dealt with?
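For reference, a sketch of the vcpu thread's side of this, showing why a skipped step 4 leaves it asleep forever (run_vcpu, sleep_until_woken, wake_flag and the run result names are assumptions, not the actual driver code):

#include <stdbool.h>

enum run_result { RUN_YIELD, RUN_WAIT_FOR_MESSAGE };

struct hf_vcpu {
        bool waiting_for_message;
        bool wake_flag; /* step 4a: set to stop the thread going to sleep */
};

/* Stand-ins for the real driver/hypervisor calls. */
enum run_result run_vcpu(struct hf_vcpu *vcpu);
void sleep_until_woken(struct hf_vcpu *vcpu);

static void vcpu_thread(struct hf_vcpu *vcpu)
{
        for (;;) {
                if (run_vcpu(vcpu) != RUN_WAIT_FOR_MESSAGE)
                        continue;
                /*
                 * A message that arrives between run_vcpu() returning and
                 * this store sees waiting_for_message == false and falls
                 * back on the interrupt, which may already be pending.
                 */
                vcpu->waiting_for_message = true;
                if (!vcpu->wake_flag)
                        sleep_until_woken(vcpu); /* never woken if step 4 was skipped */
                vcpu->wake_flag = false;
                vcpu->waiting_for_message = false;
        }
}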
Arve, how do you deal with the HF_MAILBOX_READABLE_INTID interrupt in trusty?
arve@ replied:
We use the HF_MAILBOX_READABLE_INTID interrupt to wake up a thread that reads from the mailbox. In this case one of those threads is already running, and since the message it last got was a "fastcall", it did not enable interrupts.
It seems the race I'm hitting is not the one where the HF_MAILBOX_READABLE_INTID interrupt is already pending, but one where it is never sent. vm->vcpu[i].waiting_for_message is uninitialized and non-zero for all secondary cpus in my failing case. Since only vcpu 0 waits for messages, this means the wrong vcpu will be woken up.
https://hafnium-review.googlesource.com/c/hafnium/driver/linux/+/5282/ and https://hafnium-review.googlesource.com/c/hafnium/+/5281/ submitted.
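The gist of the initialization side of the fix, as a sketch (the real changes are in the CLs above; hf_vcpu_init and the field layout are assumptions):

#include <stdbool.h>

struct hf_vcpu {
        bool waiting_for_message;
};

/*
 * Start the flag out as false rather than whatever happened to be in
 * memory, so the driver doesn't pick a secondary vcpu that was never
 * waiting and leave vcpu 0 asleep.
 */
static void hf_vcpu_init(struct hf_vcpu *vcpu)
{
        vcpu->waiting_for_message = false;
}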
arve@ replied:
Initializing the waiting_for_message field is only a partial fix. The race described in #2 is still possible.
ascull@ replied:
Following the direction of SPCI, the delivery of messages was exported to the scheduler VM, with RUN being the call that enacts delivery to a specific vCPU. The transition isn't working in this case because of the assumption that an interrupt is enough to inform the vCPU of the message. If the driver could also mirror the state of the VM mailbox, it would be able to deliver the message when a vCPU calls RECV even if an interrupt has already been sent for that message. The interrupt then either becomes spurious, because the message has already been read, or results in a future message being read.
https://hafnium-review.googlesource.com/c/hafnium/driver/linux/+/5385/ is a checkpoint of some thinking along these lines. The driver only approximately mirrors the mailbox state, since it doesn't get all the signals it needs, but I believe it can learn enough to be useful.
The rough process when a message arrives (sketched in code after the list):
1. If any vCPUs are waiting, pick one to run and deliver the message. Done.
2. Pick a vCPU to send an interrupt about the message having arrived.
3. Assume the message has not been read until RUN is called on a vCPU that was WFM. This means any vCPU that exits with WFM should be run right away if a message is thought to be pending (if the message was already read, this will exit early with WFM again).
4. If another message is asked to be delivered, then it must have been read and we're onto the next message.
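A sketch of that process, including the mirrored mailbox state (the enum, function and field names are illustrative assumptions, not the checkpoint CL's actual code):

/* Driver-side approximation of the VM mailbox state. */
enum hf_mailbox_mirror {
        HF_MAILBOX_ASSUMED_EMPTY,    /* no message believed to be pending */
        HF_MAILBOX_ASSUMED_RECEIVED, /* message arrived; read not yet confirmed */
};

struct hf_vm {
        enum hf_mailbox_mirror mailbox;
        /* vcpus, locks, ... */
};

struct hf_vcpu; /* opaque here */

/* Stand-ins for the real driver operations. */
struct hf_vcpu *find_waiting_vcpu(struct hf_vm *vm);
struct hf_vcpu *pick_any_vcpu(struct hf_vm *vm);
void deliver_and_run(struct hf_vcpu *vcpu);
void inject_mailbox_readable(struct hf_vcpu *vcpu);

static void on_message_arrived(struct hf_vm *vm)
{
        struct hf_vcpu *waiter = find_waiting_vcpu(vm);

        if (waiter) {
                /* 1. a vcpu is waiting: deliver the message and run it */
                deliver_and_run(waiter);
                return;
        }

        /* 2. no vcpu is waiting: raise the interrupt on some vcpu ... */
        inject_mailbox_readable(pick_any_vcpu(vm));

        /*
         * 3. ... and assume the message stays unread until RUN is called
         * on a vcpu that exited with WFM. 4. The next message arrival
         * implies this one was read.
         */
        vm->mailbox = HF_MAILBOX_ASSUMED_RECEIVED;
}

On a WFM exit, the driver would then run the vcpu right away whenever mailbox == HF_MAILBOX_ASSUMED_RECEIVED; a spurious run costs little, since it just exits with WFM again.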
This is a driver logic issue that should still be absorbed into the migration to the SPCI driver.
(Migrated from b/131434148.)