A one in a million bug in Switch kernel

Nintendo Switch firmware 14.0.0 was released yesterday. It contained many minor
changes to the kernel. One of them is that during user-mode cache operations
(flush / clean / zero), a secret byte in the thread local storage (TLS) is now
set to 1.

If an interrupt is received, kernel-mode reads the user-mode byte from TLS, and
if it’s equal to 1, the kernel performs a memory barrier.
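The handshake described above can be sketched as a toy model in Python. This is only an illustration, not Nintendo's code: the names `ThreadTLS`, `cache_op_in_progress`, `Kernel` and `user_cache_op` are made up, the barrier is represented by a counter, and clearing the byte after the loop is an assumption (the post only says the byte is 1 *during* the operation).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThreadTLS:
    # Hypothetical name for the "secret byte"; the post does not name it.
    cache_op_in_progress: int = 0


@dataclass
class Kernel:
    barriers_issued: int = 0

    def on_interrupt(self, tls: ThreadTLS) -> None:
        # Kernel-mode reads the user-mode byte from TLS.
        if tls.cache_op_in_progress == 1:
            # Stand-in for a real memory barrier on the current core.
            self.barriers_issued += 1


def user_cache_op(tls: ThreadTLS, kernel: Kernel, lines: int,
                  interrupt_at: Optional[int] = None) -> None:
    """Model a user-mode cache maintenance loop over `lines` cache lines."""
    tls.cache_op_in_progress = 1          # announce: cache ops in flight
    try:
        for i in range(lines):
            if i == interrupt_at:
                kernel.on_interrupt(tls)  # interrupt lands mid-loop
            # a real implementation would issue one cache op per line here
    finally:
        tls.cache_op_in_progress = 0      # assumption: cleared when done
```

An interrupt that arrives inside the loop triggers one barrier; an interrupt that arrives while the byte is 0 costs nothing extra, which is presumably the point of the flag.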

Why is this complicated TLS communication scheme necessary between user-mode
and kernel? Nintendo would not introduce this out of the blue; there must be
some weird hardware phenomenon going on.

This took some time to figure out, but imagine the following sequence of
instructions executing:

    dc  civac, x8
    add x8, x8, #32
    dc  civac, x8
    add x8, x8, #32
    dc  civac, x8     <-- what happens if you take an interrupt here?
    add x8, x8, #32
    dc  civac, x8
    add x8, x8, #32
    dsb sy            <-- memory barrier
    ret

An interrupt may be received by the CPU at any point during game execution.
Interrupts may lead to “core migration”, which is when the kernel scheduler
moves a thread to a different CPU core.

If we imagine a core migration in this code sequence, we can clearly see the
problem:

    dc  civac, x8     <-- Core 0
    add x8, x8, #32   <-- Core 0
    dc  civac, x8     <-- Core 0
    add x8, x8, #32   <-- Core 0
    dc  civac, x8     <-- Core 1 [interrupt! core migration]
    add x8, x8, #32   <-- Core 1
    dc  civac, x8     <-- Core 1
    add x8, x8, #32   <-- Core 1
    dsb sy            <-- Core 1 [memory barrier]
    ret

Do you see the problem? There was never a memory barrier on core 0!

This means that the cache ops issued on core 0 are *not necessarily* completed
by the time the function returns! For a brief time, the physical DRAM may hold
stale data for some of the cache lines.
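To make the failure mode concrete, here is a small Python simulation. It is a deliberately simplified model, not real hardware behaviour: each core keeps a list of cache ops it has issued, and a barrier on a core only guarantees completion of that core's own ops. The names `Core`, `run`, `migrate_at` and `kernel_barrier_on_interrupt` are invented for this sketch.

```python
from typing import Dict, List, Optional


class Core:
    def __init__(self, name: str) -> None:
        self.name = name
        # Cache ops issued on this core that are not yet guaranteed complete.
        self.pending: List[int] = []

    def dc_civac(self, addr: int) -> None:
        self.pending.append(addr)

    def dsb(self) -> None:
        # In this model, a barrier only completes ops issued on this core.
        self.pending.clear()


def run(migrate_at: Optional[int],
        kernel_barrier_on_interrupt: bool) -> Dict[str, List[int]]:
    """Run the 4-line flush loop; return ops still pending on each core."""
    cores = [Core("core0"), Core("core1")]
    cur = 0
    addr = 0x1000
    for i in range(4):
        if i == migrate_at:
            # Interrupt: the scheduler moves the thread to core 1.
            if kernel_barrier_on_interrupt:
                cores[cur].dsb()   # the 14.0.0-style fix, as modelled here
            cur = 1
        cores[cur].dc_civac(addr)
        addr += 32
    cores[cur].dsb()               # the final `dsb sy` before `ret`
    return {c.name: list(c.pending) for c in cores}
```

With `run(2, False)` the two ops issued on core 0 are still pending when the function returns, while `run(2, True)` and `run(None, False)` leave nothing pending on either core.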

So to summarize, if:
    (1) the CPU takes an interrupt inside a function like this   (super rare)
AND
    (2) the scheduler decides to perform core migration          (super rare)

Then, you’d get some graphical glitches (games mainly use cache operations when
talking to the GPU).

In this situation, devs would probably blame faulty DRAM chips or CPU errata,
but this is totally a pure software bug!

This bug has existed since day zero, which means that it took 5 years (!) for
Nintendo to track it down.

Credit to whichever nameless employee at Nintendo found this bug! The attention
to detail is incredible. And how do you even find / debug a bug like this?

Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I
doubt it!

Thanks to SciresM for discussion / diff.

–plutoo

Posted by SH