Page MenuHomePhabricator

Stuck in tfm_nspm_thread_entry()
Closed, ResolvedPublic

Description

Background:

Synced to the latest TF-M:

This commit works Okay:
93dabfd3a35faf9ed88285e09997491e93cefa5c

But these two commits caused the problem:
5248af2d7b86775364a0e131eb80ac0330bc81fb
490281df3736b11b62e25bc98d3e2c6e4e10478c

There is a known issue in these two patches, which accesses the privileged string table and function for printing the message. And here is the fix:
cba90782908626f955fe361f803558181a85c6fc

After switching into the latest commit, looks like the memmanage fault happens on push {r0, r1}.

Analysis
It looks like a memory accessing fault, in general, can be caused by two reasons:

  • NSPM uses ARM_LIB_STACK as working stack, check if the 'sp' equals ARM_LIB_STACK, and ARM_LIB_STACK is put with npriv rw.
  • R0 contains an invalid value. R0 should hold the non-secure entry address, this value is been set in:
secure_fw\spm\spm_api_ipc.c:580:tfm_spm_init

578: if (partition->static_data->partition_id == TFM_SP_NON_SECURE_ID) {
579:     p_ns_entry_thread = pth;
580:     pth->param = (void *)tfm_spm_hal_get_ns_entry_point();
581: }

Please check these two places to see if the value in R0 is correct or memory attribute is okay.

Event Timeline

KenLSoft created this task.Jan 22 2020, 2:26 AM

Hi Ken,
Thank you by taking care about it.
Guess, the issue may be caused by the way we are using TFM. We have to compile TFM Secure in one project (instead of separate libraries). Guess, it may cause that some parts of data/code may be placed to unwanted memory sections.
I will try to analyze the generated map file.

AndreyButokNXP added a comment.EditedJan 24 2020, 11:57 AM

So the PSP, during main() has following values:
PSP = 0x300001000 PSPLIM = 0x30000800

On entering to tfm_nspm_thread_entry(), PSP equals to 0x30001048 => guess, this is wrong value as it is outside of the ARM_LIB_STACK (nob-privileged) range , right? => so push {r0, r1} goes to the privileged memory.

BTW: There are a lot of assembler naked code in TFM. We are using the armclang v6 assembler, but TFM is still using the assembler armclang v5=> may be this is a reason for this issue. For example, it may have influence on amount of pushed/popped registers or some optimizations.

After tfm_core_svc_handler() returns, the execution jumps to tfm_nspm_thread_entry().
Strange is, that during this jump PSP is changed from 0x30000FE0 to 0x30001048.

Looks like this was caused by the push operation, and it was caused by accessing an address out of ARM_LIB_STACK.

ARM_LIB_STACK is the initial PSP for main, and it will be used by the veneer entry later, it is set as the NSP stack at here:

tfm_spm_db.h: 965:

const struct tfm_spm_partition_memory_data_t memory_data_list[] =
{
    {
        .stack_bottom         = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
        .stack_top            = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Limit),
        .rw_start             = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
    },

In IPC model, there would be a svc into handler mode to initilize all the threads at the last step of main(); (tfm_core_handler_mode->tfm_spm_init). and then the schduler would run the SP threads one by one and finally the tfm_nspm_thraed_entry, it should be the last one. The stack info is stored in thread list. So:

  • Check the thread list to know if the nspm_thread has been set with a correct stack information, check global 'p_thrd_head' you can find all the thread info, it is the head of the list.
  • Check tthe partition info (listed in upper code block) to see if partition info is correct.

I am on CNY lunar holiday so may delay the reply.

AndreyButokNXP added a comment.EditedJan 27 2020, 8:26 AM

Hi Ken,

I have just checked the tfm_thrd_ctx with
. pfn = tfm_nspm_thread_entry()
and it has correct initialization values:
.sp_btm = 0x30001000
.sp_top = 0x30000800

On entering tfm_nspm_thread_entry() the values of sp_btm and sp_top are the same (not changed/damaged).
So the thread structure is correct. But PSP equals to 0x30001048.

BTW: I have feeling that the mixed assembler and C code has different expectations on push&pop. We are using ArmClang V6.13.1 and Armasm v6.13.1

AndreyButokNXP added a comment.EditedJan 27 2020, 9:07 AM

Hi Ken,

It looks like my guess about the ASM&C&Compiler version conflict issue is valid.
I have just tried MCUx which is using GCC, and the application (PSA tests) works without issues (no faults).
It looks like the assembler code for ArmClang/Armasm V6.13.1 v6.13.1 must be different than for GCC. Could you check on your side?

AndreyButokNXP added a comment.EditedJan 27 2020, 9:33 AM

BTW: At the same time the TFM Regression Tests application (using GCC) goes to infinite reset on test_app()=>tfm_secure_client_run_tests()=>psa_connect(TFM_SECURE_CLIENT_SFN_RUN_TESTS_SID,TFM_SECURE_CLIENT_SFN_RUN_TESTS_VERSION)=>tfm_ns_interface_dispatch()=>fn()=>causes reset

As the PSA application tests are OK (with GCC) so most probably this is other issue for a different task.
In the terminal window:
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
[Sec Thread] Secure image initializing!
...

Another experiment. Switching to Armasm V5 from v6 did not help, the same result - goes to the Memory fault.
So, for this moment only GCC (for PSA tests) is OK.

FYI:
For ARMClang/Keil

  1. Functional API and TFM_LVL=1, all applications/tests are OK!!!
  2. IPC API and TFM_LVL=1 => repetitive reset.
  3. IPC API and TFM_LVL=2 => goes to the mentioned memory fault.
galak added a subscriber: galak.Jan 28 2020, 4:34 PM

I have tried to workaround it. Current steps:

  1. Updated to the latest TFM commit.
  2. Reverted changes related to:
    1. 5248af2d7b86775364a0e131eb80ac0330bc81fb
    2. 490281df3736b11b62e25bc98d3e2c6e4e10478c
    3. 483da6447e4360c514538807275904be68395dff

After that, PSA Test application starts working.
But the regression test application still goes to the Memory fault during the first App ROT partition initialization, on the first SVC call.

  1. Switched App ROT partitions to PSA ROT, in tfm_spm_db.inc => after that regression tests are OK.

Any comments are welcome.

Sorry for the late response and thanks for the findings. Could you share your image binary so that I could take a try? I got a board but not set up the environment yet.

These items are necessary under the package:

  • The .axf file
  • The .map file
  • The tf-m source commit
  • The updatable firmware
  • The debug method.

You can upload packages for different problems, please describe the problem somewhere in the package.

Let me try it on my side. I think most of the assemblers are using __naked attribute, I will do a search at my side or it would be great you can help to point it.

BTW, I need one more week to back to the office and get the board due to virus outbreak, so may not reply in time. After you uploaded your images I can do the first analysis.

Thanks for your help.

KenLSoft added a comment.EditedJan 31 2020, 4:31 AM

More hints for you:

From the description, it looks like the thread structure is inited correctly, but the scheduler does not set correct PSP for this NS thread.

The scheduler doing this switch at (Guess your platform is armv8m.main):

secure_fw\core\arch\tfm_arch_v8m_main.c:64:
__attribute__((naked)) void PendSV_Handler(void)
{
    __ASM volatile(
        "mrs     r0, psp                    \n"
        "mrs     r1, psplim                 \n"
        "push    {r0, r1, r2, lr}           \n"
        "push    {r4-r11}                   \n"
        "mov     r0, sp                     \n"
        "bl      tfm_pendsv_do_schedule     \n"
        "pop     {r4-r11}                   \n"
        "pop     {r0, r1, r2, lr}           \n"
        "msr     psp, r0                    \n"
        "msr     psplim, r1                 \n"
        "bx      lr                         \n"
    );
}

After tfm_pendsv_do_schedule is called, the stacked context {r0,r1,r2,lr,r4~r11} is changed to the thread going to be running.
For example, if the thread is going to be running is ns_thread, then:
r0 = ns_thread.psp (0x30000800)
r1 = ns_thread.psp_limit (0x30001000)
others should be ZERO, and lr should have a value (FFFFFFFD).

After psp is set, bx FFFFFFFD jump to the tfm_nspm_thread_entry and the thread is going to run with the new set PSP.

This stacked context is changed here in C code:

secure_fw\core\ipc\tfm_thread.c:216:

void tfm_thrd_context_switch(struct tfm_state_context_ext *ctxb,
                             struct tfm_thrd_ctx *prev,
                             struct tfm_thrd_ctx *next)
{
    TFM_CORE_ASSERT(prev != NULL);
    TFM_CORE_ASSERT(next != NULL);

    /*
     * First, update latest context into the current thread context.
     * Then, update background context with next thread's context.
     */
    tfm_core_util_memcpy(&prev->state_ctx.ctxb, ctxb, sizeof(*ctxb));
    tfm_core_util_memcpy(ctxb, &next->state_ctx.ctxb,
                         sizeof(next->state_ctx.ctxb));

    /* Update current thread indicator */
    CURR_THRD = next;
}

Hope it can help the debugging.

One thing to confirm, you mean when the instruction runs at the beginning of tfm_nspm_thread_entry(): '"mov r4, r0 \n", the PSP is 0x30001048 already?

There is another issue reporting repeating reset, let me check if they are the same and create another issue if they are not.

BR

AndreyButokNXP added a comment.EditedJan 31 2020, 11:32 AM

Hi Ken,

One thing to confirm, you mean when the instruction runs at the beginning of tfm_nspm_thread_entry(): '"mov r4, r0 \n", the PSP is 0x30001048 already?

Yes, PSP is 0x30001048 already. I will try to investigate a reason.
PSPLIM is correct 0x30000800, only the PSP has 0x48 more.

In the map file 0x30001000 - 0x30002000 is marked as ARM_LIB_HEAP:
{
Execution Region ARM_LIB_HEAP (Exec base: 0x30001000, Load base: 0x10022ca0, Size: 0x00001000, Max: 0x00001000, ABSOLUTE)

Exec Addr    Load Addr    Size         Type   Attr      Idx    E Section Name        Object

0x30001000        -       0x00001000   Zero   RW           11    ARM_LIB_HEAP.bss    anon$$obj.o

}

It looks like the initialization structures are correct:

NS STACK (ARM_LIB_STACK) = [0x30000800, 0x30001000]

const struct tfm_spm_partition_memory_data_t memory_data_list[] =
{

{
    .stack_bottom         = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
    .stack_top            = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Limit),
    .rw_start             = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
},

...

AndreyButokNXP added a comment.EditedJan 31 2020, 2:34 PM

After the following fix - 0x48, the applications does not have the memory fault:

{
    .stack_bottom         = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
    .stack_top            = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Limit) - 0x48, //NXP
    .rw_start             = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Base),
},
AndreyButokNXP added a comment.EditedJan 31 2020, 3:42 PM

So I have returned to the RESET issue.

On the first psa_connect() => goes to reset.

The reset is caused by tfm_core_panic() in:
tfm_svcalls.c, tfm_core_validate_caller() line 1001:

if (stacked_ctx_pos != p_cur_sp->runtime_data.sp_thrd.sp_btm) {
    tfm_core_panic();
}

stacked_ctx_pos = 0x30001000 but p_cur_sp->runtime_data.sp_thrd.sp_btm = 0x30000FB8

!!!!! The difference is 0x48 !!!!!!!
The same value as for the previous issue.

P.S.
If to revert changes in the 483da6447e4360c514538807275904be68395dff commit (Core: Apply naked implementation for secure entry)
for the following files, the issue disappears:
tfm/interface/include/tfm_api.h
tfm/secure_fw/core/tfm_core_svcalls_ipc.c
tfm/secure_fw/include/core/tfm_core_svc.h
tfm/secure_fw/ns_callable/tfm_psa_api_veneers.c

So my workaround, to avoid psa_panic(), is to disable the check:

tfm_svcalls.c, tfm_core_validate_caller() line 1001:

#if 0 
        if (stacked_ctx_pos != p_cur_sp->runtime_data.sp_thrd.sp_btm) {
            tfm_core_panic();
        }
#endif
KenLSoft added a comment.EditedFeb 1 2020, 4:33 AM

Hi Andrey,

I guess the reason was caused by the FP is activated before scheduler gets initialized, so the non-secure EXC_RETURN is not 0xFFFFFFFD (Guess it is 0xFFFFFFED), then extra 0x48 bytes get poped and then caused the error.

There are two fixes for this, one is deactivating float point during very start. Currently, there is not FP requirement during the initializing process. set CONTROL.FPCA = 0 before scheduler get initialized.

Another solution would make more sense, it permits FP usage before scheduler initialized. There should be a function 'tfm_arch_update_ctx' at secure_fw\core\arch\include\tfm_arch_v8m.h:108, add one more line to reset the exc_return here:

__STATIC_INLINE void tfm_arch_update_ctx(struct tfm_state_context_ext *pctx)
{
    __set_PSP(pctx->sp);
    __set_PSPLIM(pctx->sp_limit);
   /* Reset the EXC_RETURN in case FP is invovled before scheduler starts */
    pctx->lr = INIT_LR_UNPRIVILEGED;
}

This should resolve both the init fault problem and the repeating reset problem.

If it works, can you help to push the fix as a patch?

Thanks.

/Ken

KenLSoft added a comment.EditedFeb 1 2020, 4:35 AM

BTW, can you update other tfm_arch_*.h as well?
We can help verify the changes not covered in your side.

Thanks.

/Ken

Hi Ken,

In project settings, we have the Floating Point Hardware set to enabled.
Adding pctx->lr = INIT_LR_UNPRIVILEGED into tfm_arch_update_ctx() has helped.
So you may add the fix to the original code/repository.

Thank you,
Andrej

Hi Ken,
Also, it should be done something with the check in tfm_svcalls.c, tfm_core_validate_caller() line 1001:

if (stacked_ctx_pos != p_cur_sp->runtime_data.sp_thrd.sp_btm) {
    tfm_core_panic();
}

It is calling tfm_core_panic(), as stacked_ctx_pos == 030001048 but p_cur_sp->runtime_data.sp_thrd.sp_btm == 030001000
I have disabled this check so far.

Hi Ken,
after farther testings, PSA Test application still goes to the memory fault.
So I have returned all my two workarounds, which work for all applications:

  1. .stack_top = PART_REGION_ADDR(ARM_LIB_STACK, $$ZI$$Limit) - 0x48,
  2. Disable if (stacked_ctx_pos != p_cur_sp->runtime_data.sp_thrd.sp_btm)
KenLSoft added a comment.EditedFeb 3 2020, 4:52 PM

Thanks for your input, that helps much. I will create fixes when I back to office.

One thing to confirm:

After you applied the change I mentioned, the memory fault is still there, with ctx pos = 30001048 and the expected value is 30001000? Can you tell me the value of PSP_S while the tfm is going to panic?

This checking is the only way (maybe there are better ways but I can't figure out) to detect the veneer re-entrant. The NS re-entrant would lead to NS secure calling is not in the expected sequence, it would make NS thread 1 get the NS thread 2's context.

The checking is simple:

  • After jumping to NS, PSP_S = 30001000
  • Then NS performs a secure call, after SG, it calls SVC, then a state context is pushed into the stack.
  • SPM would check if there is only one state context in the stack, so we uses :PSP_S + sizeof(state context) == PSP_BOTTOM to decide this.

So where is the PSP-S is when the problem is there is a key point.

Should to confirm, that after the patch the issue is still present.

Thanks. I just found the patch missed something, I will re-create a patch to update that.

Hi Andrej,
I have updated patch and test with FP enabled in my side and it works. Can you take a try on your side?
Thanks.

AndreyButokNXP added a comment.EditedFeb 12 2020, 9:40 AM

Hi Ken,
Yes, the final patch has fixed the issue.
It works now without our workarounds.

BTW.
Do you know why this issue has been detected only by us, there is no other complains? Nobody is using IPC&L2?
Thank you

KenLSoft added a comment.EditedFeb 12 2020, 12:10 PM

Hi Andrej,

There is an answer I got after discussion with some team members about the similar behaviour in other projects.

  • They found that some compiler version would have this issue while they compile it on their own PC, the issue is gone.
  • Another team member just met the problem you met but he did not provide more details so I have not taken a look at this.

Some compiler library may access the FP registers to know FP status but some else won't. And in your project, you mentioned that you have enabled the FP, so I think the FP registers are accessed during booting caused the CONTROL.FPCA = 1. While TFM boots in thread mode and SPM get initialized in handler mode (to initialize the threads), the SVC from booting thread to handler mode is affected by CONTROL.FPCA and the exc_return is 0xFFFFFFED.

In TFM default setting, no one enables FP and no one accesses FP, so there is no problem. But some particular system would access FP registers as your platform, then the problem shows up.

Thanks for providing such detail info so that I could fix this.

/Ken

KenLSoft closed this task as Resolved.Feb 13 2020, 7:04 AM