QEMU VM Escape

August 29, 2019

pwn writeup

QEMU VM Escape : CVE-2019-14378

This post describes how I exploited CVE-2019-14378, a pointer miscalculation in the network backend of QEMU. The bug is triggered when large fragmented IPv4 packets are reassembled for processing. It was found by code auditing.

Vulnerability Details

There are two parts to networking within QEMU¹:

- The virtual network device that is provided to the guest (e.g. a PCI network card).
- The network backend that interacts with the emulated NIC (e.g. puts packets onto the host’s network).

By default, QEMU creates a SLiRP user network backend and an appropriate virtual network device for the guest (e.g. an e1000 PCI card).

The bug was found in the packet reassembly in SLiRP.

IP fragmentation²

IP fragmentation is an Internet Protocol (IP) process that breaks packets into smaller pieces (fragments), so that the resulting pieces can pass through a link with a smaller maximum transmission unit (MTU) than the original packet size. The fragments are reassembled by the receiving host.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Flags³: 3 bits
- Bit 0: reserved, must be zero
- Bit 1: (DF) 0 = May Fragment, 1 = Don’t Fragment.
- Bit 2: (MF) 0 = Last Fragment, 1 = More Fragments.

Fragment Offset: 13 bits, measured in units of 8 bytes
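To make the layout concrete, here is a minimal sketch of how the flags and the fragment offset share one 16-bit word. The macro and helper names are hypothetical (they are not taken from slirp), and the word is shown in host byte order; on the wire it is big-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Masks for the 16-bit flags/fragment-offset word (host byte order). */
#define FRAG_DF      0x4000u /* Don't Fragment */
#define FRAG_MF      0x2000u /* More Fragments */
#define FRAG_OFFMASK 0x1fffu /* 13-bit offset, in units of 8 bytes */

/* Hypothetical helper: build the word from a byte offset and the MF flag. */
static uint16_t frag_field(uint32_t byte_offset, int more_fragments)
{
    /* Offsets are stored in 8-byte units, so they must be 8-byte aligned. */
    assert(byte_offset % 8 == 0 && (byte_offset / 8) <= FRAG_OFFMASK);
    return (uint16_t)((byte_offset / 8) | (more_fragments ? FRAG_MF : 0u));
}

/* Hypothetical helper: recover the byte offset from the word. */
static uint32_t frag_byte_offset(uint16_t field)
{
    return (uint32_t)(field & FRAG_OFFMASK) * 8;
}

/* Hypothetical helper: nonzero if more fragments follow this one. */
static int frag_more(uint16_t field)
{
    return (field & FRAG_MF) != 0;
}
```

A fragment that starts 1480 bytes into the datagram therefore carries the offset value 185, with MF set on every fragment except the last.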

struct mbuf {
    /* header at beginning of each mbuf: */
    struct mbuf *m_next; /* Linked list of mbufs */
    struct mbuf *m_prev;
    struct mbuf *m_nextpkt; /* Next packet in queue/record */
    struct mbuf *m_prevpkt; /* Flags aren't used in the output queue */
    int m_flags; /* Misc flags */

    int m_size; /* Size of mbuf, from m_dat or m_ext */
    struct socket *m_so;

    char *m_data; /* Current location of data */
    int m_len; /* Amount of data in this mbuf, from m_data */

    ...

    char *m_ext;
    /* start of dynamic buffer area, must be last element */
    char m_dat[];
};

The mbuf structure stores the IP-layer data of a received packet. It has two buffers: m_dat, which lives inside the structure, and m_ext, which is allocated on the heap when m_dat is too small to hold the packet.

For NAT, fragmented incoming packets must be reassembled before they are edited and retransmitted. This reassembly is done by the ip_reass(Slirp *slirp, struct ip *ip, struct ipq *fp) function. ip contains the current IP packet data, and fp is a linked list containing the fragments received so far.

ip_reass does the following:
- If this is the first fragment to arrive (fp == NULL), create a reassembly queue and insert ip into it.
- If the fragment overlaps previously received fragments, discard it.
- Once all fragments have been received, reassemble them and create the header for the new IP packet by modifying the header of the first packet.

/*
 * Take incoming datagram fragment and try to
 * reassemble it into whole datagram.  If a chain for
 * reassembly of this datagram already exists, then it
 * is given as fp; otherwise have to make a chain.
 */
static struct ip *ip_reass(Slirp *slirp, struct ip *ip, struct ipq *fp)
{

    ...
    ...

    /*
     * Reassembly is complete; concatenate fragments.
     */
    q = fp->frag_link.next;
    m = dtom(slirp, q);

    q = (struct ipasfrag *)q->ipf_next;
    while (q != (struct ipasfrag *)&fp->frag_link) {
        struct mbuf *t = dtom(slirp, q);
        q = (struct ipasfrag *)q->ipf_next;
        m_cat(m, t);
    }

    /*
     * Create header for new ip packet by
     * modifying header of first packet;
     * dequeue and discard fragment reassembly header.
     * Make header visible.
     */
    q = fp->frag_link.next;

    /*
     * If the fragments concatenated to an mbuf that's
     * bigger than the total size of the fragment, then and
     * m_ext buffer was alloced. But fp->ipq_next points to
     * the old buffer (in the mbuf), so we must point ip
     * into the new buffer.
     */
    if (m->m_flags & M_EXT) {
        int delta = (char *)q - m->m_dat;
        q = (struct ipasfrag *)(m->m_ext + delta);
    }

The bug is in the calculation of the variable delta. The code assumes that the first fragment is never stored in the external buffer (m_ext). The calculation q - m->m_dat is valid only when the packet data is inside mbuf->m_dat, so that q (a structure holding the fragment list links and the packet data) also lies inside m_dat. If m_ext was already allocated for the first fragment, q lies inside the external buffer and the computed delta is wrong.
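The effect of the wrong delta is easy to see with plain numbers. The following is a minimal sketch that models the addresses as integers rather than real pointers; struct mbuf_addrs and fixup_q are hypothetical names for illustration, and the addresses used with them are made up.

```c
#include <stdint.h>

/* Simplified model: only the two addresses that matter for the bug. */
struct mbuf_addrs {
    uintptr_t m_dat; /* address of the inline buffer inside the mbuf */
    uintptr_t m_ext; /* address of the external heap buffer */
};

/* Mirrors the pointer fix-up in ip_reass, using integers instead of pointers. */
static uintptr_t fixup_q(const struct mbuf_addrs *m, uintptr_t q)
{
    /* The original code implicitly assumes q lies inside m_dat. */
    int delta = (int)((intptr_t)q - (intptr_t)m->m_dat);
    return m->m_ext + (uintptr_t)(intptr_t)delta;
}
```

With m_dat at 0x1000 and m_ext at 0x9000, a q of 0x1010 (inside m_dat) is correctly moved to 0x9010. A q of 0x9010 (already inside m_ext) ends up at 0x11010, well outside the buffer it belonged to.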

slirp/src/ip_input.c:ip_reass
    ip = fragtoip(q);
    ip->ip_len = next;
    ip->ip_tos &= ~1;
    ip->ip_src = fp->ipq_src;
    ip->ip_dst = fp->ipq_dst;

Later, the newly calculated pointer q is converted to an ip structure and its fields are modified. Because delta is wrong, ip points to an incorrect location, and ip_src and ip_dst can be used to write controlled data there. This may also crash QEMU if the calculated ip lies in an unmapped area.

Exploitation

What are we facing

- If we control delta, we can write controlled data relative to m->m_ext. That requires precise control over the heap.
- We need leaks to bypass ASLR.
- There are no useful function pointers on the heap to get code execution, so we have to build an arbitrary write.

Controlling Heap

Let’s look into how heap objects are allocated in slirp.

// How much room is in the mbuf, from m_data to the end of the mbuf
#define M_ROOM(m)                                                        \
    ((m->m_flags & M_EXT) ? (((m)->m_ext + (m)->m_size) - (m)->m_data) : \
                            (((m)->m_dat + (m)->m_size) - (m)->m_data))
// How much free room there is
#define M_FREEROOM(m) (M_ROOM(m) - (m)->m_len)

slirp/src/slirp.c:slirp_input

      m = m_get(slirp); // m_get return mbuf object, internally calls g_malloc(0x668)
      ...
      /* Note: we add 2 to align the IP header on 4 bytes,
       * and add the margin for the tcpiphdr overhead  */
      if (M_FREEROOM(m) < pkt_len + TCPIPHDR_DELTA + 2) { // TCPIPHDR_DELTA + 2 = 
          m_inc(m, pkt_len + TCPIPHDR_DELTA + 2); // allocates a new m_ext buffer since m_dat is insufficient
      }
      ...

      if (proto == ETH_P_IP) {
          ip_input(m);

m_get, m_free, m_inc and m_cat are wrappers for dynamic memory allocation. When a new packet arrives, a new mbuf object is allocated. If m_dat is large enough to hold the packet data, it is used; otherwise a new external buffer is allocated with m_inc and the data is copied into it.

slirp/src/ip_input.c:ip_input
    /*
        * If datagram marked as having more fragments
        * or if this is not the first fragment,
        * attempt reassembly; if it succeeds, proceed.
        */
    if (ip->ip_tos & 1 || ip->ip_off) {
        ip = ip_reass(slirp, ip, fp);
        if (ip == NULL)
            return;

slirp/src/ip_input.c:ip_reass
    /*
     * If first fragment to arrive, create a reassembly queue.
     */
    if (fp == NULL) {
        struct mbuf *t = m_get(slirp);
        ...

If the incoming packet is fragmented, a new mbuf object (fp) is used to hold the fragments until all of them arrive. When the next fragment arrives, it is enqueued onto this list.

This gives us a good primitive for allocating chunks of controlled size ( > 0x608 ) on the heap. A few things to keep in mind: for every packet an mbuf (0x670) is allocated, and if it is the first fragment then another mbuf is allocated (fp: the fragment queue).

malloc(0x670)
if(pkt_len + TCPIPHDR_DELTA + 2 > 0x608)
   malloc(pkt_len + TCPIPHDR_DELTA + 2)
if(ip->ip_off & IP_MF)
   malloc(0x670)

We can use this to spray the heap, so that subsequent allocations are taken from the top chunk, which gives us a predictable heap state.

Getting controlled write on heap

Now that we can control the heap, let’s see how we can use the bug to overwrite something useful.

    q = fp->frag_link.next; // Points to first fragment
    if (m->m_flags & M_EXT) {
        int delta = (char *)q - m->m_dat;
        q = (struct ipasfrag *)(m->m_ext + delta);
    }

Assume this heap state

            +------------+
            |     q      |
            +------------+
            |            |
            |            |
            |  padding   |
            |            |
            |            |
            +------------+
            |   m->m_dat |
            +------------+

Now delta will be -padding. It is added to m->m_ext, and later we can write at that offset. Thus, by controlling this padding we control delta.

When all the fragments have arrived, they are concatenated into one mbuf object by the m_cat function.

slirp/src/mbuf.c
void m_cat(struct mbuf *m, struct mbuf *n)
{
    /*
     * If there's no room, realloc
     */
    if (M_FREEROOM(m) < n->m_len)
        m_inc(m, m->m_len + n->m_len);

    memcpy(m->m_data + m->m_len, n->m_data, n->m_len);
    m->m_len += n->m_len;

    m_free(n);
}


slirp/src/mbuf.c
void m_inc(struct mbuf *m, int size)
{
    ...
    if (m->m_flags & M_EXT) {
        gapsize = m->m_data - m->m_ext;
        m->m_ext = g_realloc(m->m_ext, size + gapsize);
    ...
}

m_inc calls realloc, which returns the same chunk if it can accommodate the requested size. So even after the packets are reassembled, m->m_ext is still the buffer of the first packet. Note that m_ext is allocated for the first fragment, so q points inside this buffer, and adding -padding is therefore relative to q. This makes things a bit easier.

            +------------+
            |  target    |
            +------------+
            |            |
            |            |
            |  padding   |
            |            |
            |            |
m->m_ext -> +------------+  // q = m->m_ext + -padding  will point to target
            |     q      |  // delta = -padding
            +------------+
            |            |
            |            |
            |  padding   |
            |            |
            |            |
            +------------+
            |   m->m_dat |
            +------------+

So after the pointer calculation, q will point to target.

slirp/src/ip_input.c:ip_reass
    ip = fragtoip(q);
    ...
    ip->ip_src = fp->ipq_src;
    ip->ip_dst = fp->ipq_dst;

Since we control fp->ipq_src and fp->ipq_dst, which are the source and destination IP addresses of the packet, we can overwrite the target’s contents.
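It helps to know where the eight controlled bytes land relative to the miscalculated q. The sketch below models the layout under two assumptions that are not stated above: that fragtoip(q) skips a pair of list pointers placed in front of the IP header, and that the header follows the standard IPv4 field order. All type names ending in _model are hypothetical stand-ins, not slirp's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: two list pointers in front of the IPv4 header. */
struct qlink_model {
    void *next;
    void *prev;
};

/* Standard IPv4 header field order (no options). */
struct ip_model {
    uint8_t  ip_vhl;  /* version and header length */
    uint8_t  ip_tos;
    uint16_t ip_len;
    uint16_t ip_id;
    uint16_t ip_off;
    uint8_t  ip_ttl;
    uint8_t  ip_p;
    uint16_t ip_sum;
    uint32_t ip_src;  /* attacker-chosen source address */
    uint32_t ip_dst;  /* attacker-chosen destination address */
};

struct ipasfrag_model {
    struct qlink_model link;
    struct ip_model    ip;
};

/* Offset, from the miscalculated q, of the first fully controlled byte. */
static size_t controlled_write_offset(void)
{
    return offsetof(struct ipasfrag_model, ip) +
           offsetof(struct ip_model, ip_src);
}
```

Under this assumed layout, a 64-bit build puts the controlled bytes at q + 28 through q + 35, so the target field has to line up with that window.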

Arbitrary Write

My initial target was the m_data field, so that the m_cat() call during packet reassembly would give an arbitrary write, but that turned out not to be possible due to alignment and offset issues.

slirp/src/mbuf.c:m_cat
    memcpy(m->m_data + m->m_len, n->m_data, n->m_len);

However, I was able to overwrite the m_len field of the object. Since m_cat does not validate it, a corrupted m_len gives a write relative to m_data. This avoids the alignment issue, and we use it to overwrite the m_data of a different object to get an arbitrary write.

- Send a packet with id 0xdead and the MF bit set (1).
- Send a packet with id 0xcafe and the MF bit set (1).
- Trigger the bug to overwrite the m_len of 0xcafe so that m_data + m_len points to 0xdead’s m_data.
- Send a packet with id 0xcafe and the MF bit unset (0) to trigger reassembly and overwrite 0xdead’s m_data with the target address.
- Send a packet with id 0xdead and the MF bit unset (0), which writes the content of this packet to m_data.
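The core of these steps is that m_cat trusts m_len. Here is a toy model of that one copy, with a small byte array standing in for the heap. mini_mbuf, mini_cat and demo_redirect are hypothetical names, and the offsets are made up; this is a sketch of the idea, not the exploit.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for slirp's mbuf: just the fields m_cat touches. */
struct mini_mbuf {
    char *m_data; /* current location of data */
    int   m_len;  /* amount of data, counted from m_data */
};

/* Same copy as m_cat, without the room check or the free. */
static void mini_cat(struct mini_mbuf *m, const struct mini_mbuf *n)
{
    memcpy(m->m_data + m->m_len, n->m_data, (size_t)n->m_len);
    m->m_len += n->m_len;
}

/*
 * The toy heap holds the data buffer of the 0xcafe chain at offset 0 and
 * the m_data field of the 0xdead mbuf at offset 64. Returns the value of
 * that m_data field after the corrupted append.
 */
static char *demo_redirect(char *target)
{
    unsigned char heap[128] = {0};
    size_t victim_off = 64;
    char *original = (char *)heap + 100;
    char *victim_m_data;

    /* 0xdead's m_data initially points somewhere harmless. */
    memcpy(heap + victim_off, &original, sizeof original);

    struct mini_mbuf first = { (char *)heap, 8 };
    /* Bug primitive: m_len is overwritten so m_data + m_len hits the field. */
    first.m_len = (int)victim_off;

    /* The last 0xcafe fragment carries the target address as its payload. */
    struct mini_mbuf last = { (char *)&target, (int)sizeof target };
    mini_cat(&first, &last);

    memcpy(&victim_m_data, heap + victim_off, sizeof victim_m_data);
    return victim_m_data;
}
```

After the call, the stand-in for 0xdead's m_data holds the target address, so the next append to that chain would write the packet payload there.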

Getting Leaks

We need leaks to bypass ASLR and PIE, so we need some way to transfer data back to the guest. It turns out there is a very common service that matches that description exactly: ICMP echo. The SLiRP gateway responds to ICMP echo requests, reflecting the payload of the packet (after the ICMP header) back unchanged.

We have an arbitrary write, but where do we write, since no addresses are known at this point?

We can do a partial overwrite of m_data and write data onto the heap.
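On a little-endian host, overwriting only the first bytes of a stored pointer changes its low-order bits and leaves the upper bytes as they were, so the result stays close to the original address without knowing the heap base. A minimal sketch of that arithmetic follows; the two-byte width is chosen purely for illustration and the helper name is hypothetical.

```c
#include <stdint.h>

/* Replace only the two least significant bytes of a pointer value. */
static uint64_t partial_overwrite16(uint64_t ptr, uint16_t low)
{
    return (ptr & ~(uint64_t)0xffff) | low;
}
```

For example, a made-up pointer value 0x00007f12345678a0 with its low two bytes set to 0xbeef becomes 0x00007f123456beef.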

Leaks⁴ :

- Use the arbitrary write to create a fake ICMP header on the heap.
- Send an ICMP request with the MF bit set (1).
- Partially overwrite m_data to point to the fake header on the heap.
- Send a packet with the MF bit unset (0) to complete the ICMP request.
- Receive the leaks from the host.
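For reference, this is roughly what an ICMP echo request header looks like when built by hand. The sketch assumes the fake header has to carry a valid checksum over the bytes that follow it, which depends on the checks the receiving side performs. The function names are hypothetical; the checksum is the standard Internet checksum from RFC 1071.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer; result in host order. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8); /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);    /* fold carries back in */
    return (uint16_t)~sum;
}

/*
 * Fill the 8-byte ICMP echo request header at the start of buf.
 * total_len covers the header plus the payload already placed after it.
 */
static void build_echo_request(uint8_t *buf, size_t total_len,
                               uint16_t id, uint16_t seq)
{
    uint16_t sum;

    buf[0] = 8;                    /* type: echo request */
    buf[1] = 0;                    /* code */
    buf[2] = 0;                    /* checksum, zero while summing */
    buf[3] = 0;
    buf[4] = (uint8_t)(id >> 8);   /* identifier, big-endian */
    buf[5] = (uint8_t)(id & 0xff);
    buf[6] = (uint8_t)(seq >> 8);  /* sequence number, big-endian */
    buf[7] = (uint8_t)(seq & 0xff);

    sum = inet_checksum(buf, total_len);
    buf[2] = (uint8_t)(sum >> 8);
    buf[3] = (uint8_t)(sum & 0xff);
}
```

A receiver validates the message by running the same checksum over the whole buffer, which yields zero when the header is intact.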

Getting Code Execution

Timers (more precisely QEMUTimers) provide a means of calling a given routine (a callback) after a time interval has elapsed, passing an opaque pointer to the routine.

struct QEMUTimer {
    int64_t expire_time;        /* in nanoseconds */
    QEMUTimerList *timer_list;
    QEMUTimerCB *cb;
    void *opaque;
    QEMUTimer *next;
    int scale;
};

struct QEMUTimerList {
    QEMUClock *clock;
    QemuMutex active_timers_lock;
    QEMUTimer *active_timers;
    QLIST_ENTRY(QEMUTimerList) list;
    QEMUTimerListNotifyCB *notify_cb;
    void *notify_opaque;
    QemuEvent timers_done_ev;
};

main_loop_tlg is an array in bss that contains the QEMUTimerList associated with the different timers. Each of these holds a list of QEMUTimer structures. QEMU loops through them to check whether any have expired; if so, the cb function is called with the argument opaque.

RIP control⁵ :

- Create a fake QEMUTimer with system as the callback and opaque as its argument.
- Create a fake QEMUTimerList that contains our fake QEMUTimer.
- Overwrite a main_loop_tlg entry with the fake QEMUTimerList.
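To see why a fake timer is enough, here is a minimal sketch of a timer list being serviced. It uses heavily simplified stand-ins for QEMUTimer and QEMUTimerList (most fields dropped) and is not QEMU's actual loop. All names are hypothetical, and a harmless callback that records its argument takes the place of system.

```c
#include <stddef.h>
#include <stdint.h>

typedef void TimerCB(void *opaque);

/* Simplified stand-ins for QEMUTimer and QEMUTimerList. */
struct mini_timer {
    int64_t expire_time;
    TimerCB *cb;
    void *opaque;
    struct mini_timer *next;
};

struct mini_timer_list {
    struct mini_timer *active_timers;
};

/* Run every expired timer at the head of the list; returns how many fired. */
static int run_expired(struct mini_timer_list *tl, int64_t now)
{
    int fired = 0;

    while (tl->active_timers && tl->active_timers->expire_time <= now) {
        struct mini_timer *t = tl->active_timers;
        tl->active_timers = t->next; /* unlink before calling */
        t->cb(t->opaque);            /* both values come from the timer */
        fired++;
    }
    return fired;
}

/* Harmless callback standing in for system(): records its argument. */
static const char *last_arg;

static void record_arg(void *opaque)
{
    last_arg = opaque;
}
```

If the list head points at attacker-built memory with expire_time already in the past, the loop calls the chosen cb with the chosen opaque the next time timers are serviced.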

You can find the full exploit at CVE-2019-14378