libxenctrl (libxc) Domain Image Format

David Vrabel <david.vrabel@citrix.com>

Andrew Cooper <andrew.cooper3@citrix.com>

Wen Congyang <wency@cn.fujitsu.com>

Yang Hongyang <hongyang.yang@easystack.cn>

Revision 2

1 Introduction

1.1 Purpose

The domain save image is the context of a running domain used for snapshots of a domain or for transferring domains between hosts during migration.

There are a number of problems with the format of the domain save image used in Xen 4.4 and earlier (the legacy format).

A new format that addresses the above is required.

ARM does not yet have a domain save image format specified and the format described in this specification should be suitable.

1.2 Not Yet Included

The following features are not yet fully specified and will be included in a future draft.

2 Overview

The image format consists of two main sections:

2.1 Headers

There are two headers: the image header, and the domain header. The image header describes the format of the image (version etc.). The domain header contains general information about the domain (architecture, type etc.).

2.2 Records

The main part of the format is a sequence of different records. Each record type contains information about the domain context. At a minimum there is an END record marking the end of the records section.

2.3 Fields

All the fields within the headers and records have a fixed width.

Fields are always aligned to their size.

Padding and reserved fields are set to zero on save and must be ignored during restore.

Integer (numeric) fields in the image header are always in big-endian byte order.

Integer fields in the domain header and in the records are in the endianness described in the image header (which will typically be the native ordering).

3 Headers

3.1 Image Header

The image header identifies an image as a Xen domain save image. It includes the version of this specification that the image complies with.

Tools supporting version V of the specification shall always save images using version V. Tools shall support restoring from version V. If the previous Xen release produced version V - 1 images, tools shall support restoring from these. Tools may additionally support restoring from earlier versions.

The marker field can be used to distinguish between legacy images and those corresponding to this specification. Legacy images will have one or more zero bits within the first 8 octets of the image.

Fields within the image header are always in big-endian byte order, regardless of the setting of the endianness bit.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| marker                                          |
+-----------------------+-------------------------+
| id                    | version                 |
+-----------+-----------+-------------------------+
| options   | (reserved)                          |
+-----------+-------------------------------------+

Field    Description
marker   0xFFFFFFFFFFFFFFFF.
id       0x58454E46 (“XENF” in ASCII).
version  0x00000003. The version of this specification.
options  bit 0: Endianness. 0 = little-endian, 1 = big-endian.
         bit 1-15: Reserved.

The endianness shall be 0 (little-endian) for images generated on an i386, x86_64, or arm host.
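
As a non-normative illustration, the image header check described above could be sketched in Python as follows. The names IMAGE_HEADER and parse_image_header are hypothetical and are not part of libxc; the sketch assumes the 24-octet layout shown in the diagram.

```python
import struct

# Layout from the diagram: marker (8), id (4), version (4),
# options (2), reserved (6). The image header is always big-endian.
IMAGE_HEADER = struct.Struct(">QIIH6x")

MARKER = 0xFFFFFFFFFFFFFFFF
XENF_ID = 0x58454E46  # "XENF" in ASCII


def parse_image_header(buf):
    """Return (version, big_endian) or raise ValueError."""
    if len(buf) < IMAGE_HEADER.size:
        raise ValueError("truncated image header")
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf)
    if marker != MARKER:
        # A legacy image has at least one zero bit in the first 8 octets.
        raise ValueError("marker mismatch (possibly a legacy image)")
    if ident != XENF_ID:
        raise ValueError("bad id field")
    # Bit 0 of options gives the byte order of everything after this header.
    return version, bool(options & 1)


example = IMAGE_HEADER.pack(MARKER, XENF_ID, 3, 0)
print(parse_image_header(example))
```

The returned endianness flag would then be used when decoding the domain header and the records.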

3.2 Domain Header

The domain header includes general properties of the domain.

 0     1     2     3     4     5     6     7 octet
+-----------------------+-----------+-------------+
| type                  | page_shift| (reserved)  |
+-----------------------+-----------+-------------+
| xen_major             | xen_minor               |
+-----------------------+-------------------------+

Field       Description
type        0x0000: Reserved.
            0x0001: x86 PV.
            0x0002: x86 HVM.
            0x0003 - 0xFFFFFFFF: Reserved.
page_shift  Size of a guest page as a power of two,
            i.e., page size = 2^page_shift.
xen_major   The Xen major version when this image was saved.
xen_minor   The Xen minor version when this image was saved.

The legacy stream conversion tool writes a xen_major version of 0, and sets xen_minor to the version of itself.
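
A minimal, non-normative sketch of decoding the domain header is given below. It assumes the 16-octet layout from the diagram and takes the byte order as an argument, since that comes from the image header. The function name parse_domain_header and the returned dictionary keys are hypothetical.

```python
import struct

# Only the two non-reserved type values from the table.
DOMAIN_TYPES = {0x0001: "x86 PV", 0x0002: "x86 HVM"}


def parse_domain_header(buf, big_endian=False):
    """Decode the 16-octet domain header into a dict."""
    order = ">" if big_endian else "<"
    # type (4), page_shift (2), reserved (2), xen_major (4), xen_minor (4)
    fmt = order + "IH2xII"
    if len(buf) < struct.calcsize(fmt):
        raise ValueError("truncated domain header")
    dom_type, page_shift, xen_major, xen_minor = struct.unpack_from(fmt, buf)
    if dom_type not in DOMAIN_TYPES:
        raise ValueError("reserved domain type 0x%08x" % dom_type)
    return {
        "type": DOMAIN_TYPES[dom_type],
        "page_size": 1 << page_shift,  # page size = 2^page_shift
        "xen_version": (xen_major, xen_minor),
    }


header = struct.pack("<IH2xII", 0x0002, 12, 4, 17)
print(parse_domain_header(header))
```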

4 Records

A record has a record header, type-specific data and trailing padding. If body_length is not a multiple of 8, the body is padded with zeroes to align the end of the record on an 8 octet boundary.

 0     1     2     3     4     5     6     7 octet
+-----------------------+-------------------------+
| type                  | body_length             |
+-----------+-----------+-------------------------+
| body...                                         |
...
|           | padding (0 to 7 octets)             |
+-----------+-------------------------------------+

Field        Description
type         0x00000000: END
             0x00000001: PAGE_DATA
             0x00000002: X86_PV_INFO
             0x00000003: X86_PV_P2M_FRAMES
             0x00000004: X86_PV_VCPU_BASIC
             0x00000005: X86_PV_VCPU_EXTENDED
             0x00000006: X86_PV_VCPU_XSAVE
             0x00000007: SHARED_INFO
             0x00000008: X86_TSC_INFO
             0x00000009: HVM_CONTEXT
             0x0000000A: HVM_PARAMS
             0x0000000B: TOOLSTACK (deprecated)
             0x0000000C: X86_PV_VCPU_MSRS
             0x0000000D: VERIFY
             0x0000000E: CHECKPOINT
             0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
             0x00000010: STATIC_DATA_END
             0x00000011: X86_CPUID_POLICY
             0x00000012: X86_MSR_POLICY
             0x00000013 - 0x7FFFFFFF: Reserved for future mandatory records.
             0x80000000 - 0xFFFFFFFF: Reserved for future optional records.
body_length  Length in octets of the record body.
body         Content of the record.
padding      0 to 7 octets of zeros to pad the whole record to a multiple of 8 octets.

Records may be mandatory or optional. Optional records have bit 31 set in their type. Restoring an image that has an unrecognised or unsupported mandatory record must fail. The contents of optional records may be ignored during a restore.
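
The record framing and the mandatory/optional rule can be illustrated with the following non-normative Python sketch. The helpers read_records and pack_record are hypothetical names introduced for this example; the set of known types is supplied by the caller.

```python
import io
import struct

REC_TYPE_END = 0x00000000
REC_TYPE_OPTIONAL = 0x80000000  # bit 31 marks an optional record


def pack_record(rec_type, body, big_endian=False):
    """Build one record: header, body, zero padding to 8 octets."""
    order = ">" if big_endian else "<"
    padding = b"\x00" * (-len(body) % 8)
    return struct.pack(order + "II", rec_type, len(body)) + body + padding


def read_records(stream, known_types, big_endian=False):
    """Yield (type, body) for each known record, up to and including END."""
    order = ">" if big_endian else "<"
    while True:
        header = stream.read(8)
        if len(header) != 8:
            raise ValueError("truncated record header")
        rec_type, body_length = struct.unpack(order + "II", header)
        padded = (body_length + 7) & ~7  # round up to a multiple of 8
        data = stream.read(padded)
        if len(data) != padded:
            raise ValueError("truncated record body")
        if rec_type not in known_types:
            if rec_type & REC_TYPE_OPTIONAL:
                continue  # unknown optional record: ignore its contents
            raise ValueError("unknown mandatory record 0x%08x" % rec_type)
        yield rec_type, data[:body_length]
        if rec_type == REC_TYPE_END:
            return


stream = io.BytesIO(
    pack_record(0x80000001, b"abc")      # unknown optional record, skipped
    + pack_record(0x00000001, b"12345678")
    + pack_record(REC_TYPE_END, b"")
)
print(list(read_records(stream, known_types={0x0, 0x1})))
```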

The following sub-sections specify the record body format for each of the record types.

4.1 END

An end record marks the end of the image, and shall be the final record in the stream.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+

The end record contains no fields; its body_length is 0.

4.2 PAGE_DATA

The bulk of an image consists of many PAGE_DATA records containing the memory contents.

 0     1     2     3     4     5     6     7 octet
+-----------------------+-------------------------+
| count (C)             | (reserved)              |
+-----------------------+-------------------------+
| pfn[0]                                          |
+-------------------------------------------------+
...
+-------------------------------------------------+
| pfn[C-1]                                        |
+-------------------------------------------------+
| page_data[0]...                                 |
...
+-------------------------------------------------+
| page_data[N-1]...                               |
...
+-------------------------------------------------+

Field      Description
count      Number of pages described in this record.
pfn        An array of count PFNs and their types.
           Bit 63-60: XEN_DOMCTL_PFINFO_* type (from public/domctl.h but shifted by 32 bits)
           Bit 59-52: Reserved.
           Bit 51-0: PFN.
page_data  page_size octets of uncompressed page contents for each page set as present in the pfn array.

Note: count (C) is strictly > 0. N is <= C, and it is possible for there to be no page_data in the record if all pfns are of invalid types.

XEN_DOMCTL_PFINFO_* Page Types.

PFINFO type  Value    Description
NOTAB        0x0      Normal page.
L1TAB        0x1      L1 page table page.
L2TAB        0x2      L2 page table page.
L3TAB        0x3      L3 page table page.
L4TAB        0x4      L4 page table page.
             0x5-0x8  Reserved.
L1TAB_PIN    0x9      L1 page table page (pinned).
L2TAB_PIN    0xA      L2 page table page (pinned).
L3TAB_PIN    0xB      L3 page table page (pinned).
L4TAB_PIN    0xC      L4 page table page (pinned).
BROKEN       0xD      Broken page.
XALLOC       0xE      Allocate only.
XTAB         0xF      Invalid page.

PFNs with type BROKEN, XALLOC, or XTAB do not have any corresponding page_data.

The saver uses the XTAB type for PFNs that become invalid in the guest’s P2M table during a live migration [1].

Restoring an image with unrecognised page types shall fail.
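
To illustrate how the pfn array and the page_data array relate, the following non-normative Python sketch decodes a PAGE_DATA record body. The function name parse_page_data is hypothetical, and the page size is passed in by the caller (it would normally be derived from page_shift in the domain header).

```python
import struct

PFN_MASK = (1 << 52) - 1              # bits 51-0
NO_DATA_TYPES = {0xD, 0xE, 0xF}       # BROKEN, XALLOC, XTAB
RESERVED_TYPES = {0x5, 0x6, 0x7, 0x8}


def parse_page_data(body, page_size=4096, big_endian=False):
    """Return a list of (pfn, type, data-or-None) tuples."""
    order = ">" if big_endian else "<"
    (count,) = struct.unpack_from(order + "I4x", body)
    if count == 0:
        raise ValueError("count must be > 0")
    entries = struct.unpack_from("%s%dQ" % (order, count), body, 8)
    offset = 8 + 8 * count  # page contents start after the pfn array
    pages = []
    for entry in entries:
        ptype = entry >> 60  # bits 63-60
        pfn = entry & PFN_MASK
        if ptype in RESERVED_TYPES:
            raise ValueError("unrecognised page type 0x%x" % ptype)
        if ptype in NO_DATA_TYPES:
            pages.append((pfn, ptype, None))  # no page_data for this pfn
            continue
        data = body[offset:offset + page_size]
        if len(data) != page_size:
            raise ValueError("truncated page_data")
        offset += page_size
        pages.append((pfn, ptype, data))
    return pages


body = (
    struct.pack("<I4x", 2)
    + struct.pack("<2Q", 0x10, (0xF << 60) | 0x11)  # NOTAB pfn, XTAB pfn
    + b"\xaa" * 4096                                # data for pfn 0x10 only
)
for pfn, ptype, data in parse_page_data(body):
    print(hex(pfn), hex(ptype), None if data is None else len(data))
```

Here C is 2 and N is 1, since only the first PFN has a type that carries page contents.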

4.3 X86_PV_INFO

 0     1     2     3     4     5     6     7 octet
+-----+-----+-----------+-------------------------+
| w   | ptl | (reserved)                          |
+-----+-----+-----------+-------------------------+

Field              Description
guest_width (w)    Guest width in octets (either 4 or 8).
pt_levels (ptl)    Number of page table levels (either 3 or 4).

4.4 X86_PV_P2M_FRAMES

 0     1     2     3     4     5     6     7 octet
+-----+-----+-----+-----+-------------------------+
| p2m_start_pfn (S)     | p2m_end_pfn (E)         |
+-----+-----+-----+-----+-------------------------+
| p2m_pfn[p2m frame containing pfn S]             |
+-------------------------------------------------+
...
+-------------------------------------------------+
| p2m_pfn[p2m frame containing pfn E]             |
+-------------------------------------------------+

Field          Description
p2m_start_pfn  First pfn index in the p2m_pfn array.
p2m_end_pfn    Last pfn index in the p2m_pfn array.
p2m_pfn        Array of PFNs containing the guest’s P2M table, for the PFN frames containing the PFN range S to E (inclusive).

4.5 X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS

The formats of these records are identical. They are all binary blobs of data which are accessed using specific pairs of domctl hypercalls.

 0     1     2     3     4     5     6     7 octet
+-----------------------+-------------------------+
| vcpu_id               | (reserved)              |
+-----------------------+-------------------------+
| context...                                      |
...
+-------------------------------------------------+

Field    Description
vcpu_id  The VCPU ID.
context  Binary data for this VCPU.

Record type           Accessor hypercalls
X86_PV_VCPU_BASIC     XEN_DOMCTL_{get,set}vcpucontext
X86_PV_VCPU_EXTENDED  XEN_DOMCTL_{get,set}_ext_vcpucontext
X86_PV_VCPU_XSAVE     XEN_DOMCTL_{get,set}vcpuextstate
X86_PV_VCPU_MSRS      XEN_DOMCTL_{get,set}_vcpu_msrs

4.6 SHARED_INFO

The content of the Shared Info page.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| shared_info                                     |
...
+-------------------------------------------------+

Field        Description
shared_info  Contents of the shared info page. This record should be exactly 1 page long.

4.7 X86_TSC_INFO

Domain TSC information, as accessed by the XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.

 0     1     2     3     4     5     6     7 octet
+------------------------+------------------------+
| mode                   | khz                    |
+------------------------+------------------------+
| nsec                                            |
+------------------------+------------------------+
| incarnation            | (reserved)             |
+------------------------+------------------------+

Field        Description
mode         TSC mode, TSC_MODE_* constant.
khz          TSC frequency, in kHz.
nsec         Elapsed time, in nanoseconds.
incarnation  Incarnation.
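
As a non-normative illustration of a fixed-size record body, the X86_TSC_INFO layout above could be decoded as follows. The function name parse_tsc_info is hypothetical.

```python
import struct


def parse_tsc_info(body, big_endian=False):
    """Decode the 24-octet X86_TSC_INFO record body."""
    order = ">" if big_endian else "<"
    # mode (4), khz (4), nsec (8), incarnation (4), reserved (4)
    fmt = order + "IIQI4x"
    if len(body) != struct.calcsize(fmt):
        raise ValueError("unexpected X86_TSC_INFO length")
    mode, khz, nsec, incarnation = struct.unpack(fmt, body)
    return {"mode": mode, "khz": khz, "nsec": nsec, "incarnation": incarnation}


body = struct.pack("<IIQI4x", 0, 2400000, 123456789, 1)
print(parse_tsc_info(body))
```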

4.8 HVM_CONTEXT

HVM Domain context, as accessed by the XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| hvm_ctx                                         |
...
+-------------------------------------------------+

Field    Description
hvm_ctx  The HVM Context blob from Xen.

4.9 HVM_PARAMS

HVM Domain parameters, as accessed by the HVMOP_{get,set}_param hypercall sub-ops.

 0     1     2     3     4     5     6     7 octet
+------------------------+------------------------+
| count (C)              | (reserved)             |
+------------------------+------------------------+
| param[0].index                                  |
+-------------------------------------------------+
| param[0].value                                  |
+-------------------------------------------------+
...
+-------------------------------------------------+
| param[C-1].index                                |
+-------------------------------------------------+
| param[C-1].value                                |
+-------------------------------------------------+

Field        Description
count        The number of parameters contained in this record. Each parameter in the record contains an index and value.
param index  Parameter index.
param value  Parameter value.
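
A non-normative sketch of decoding the HVM_PARAMS body into (index, value) pairs is shown below. The function name parse_hvm_params is hypothetical; the length check follows directly from the layout above (8 octets of count and reserved, then 16 octets per parameter).

```python
import struct


def parse_hvm_params(body, big_endian=False):
    """Return a list of (index, value) pairs from an HVM_PARAMS body."""
    order = ">" if big_endian else "<"
    (count,) = struct.unpack_from(order + "I4x", body)
    if len(body) != 8 + 16 * count:
        raise ValueError("body length does not match count")
    # The parameters are stored as alternating 64-bit index and value fields.
    flat = struct.unpack_from("%s%dQ" % (order, 2 * count), body, 8)
    return list(zip(flat[0::2], flat[1::2]))


body = struct.pack("<I4x", 2) + struct.pack("<4Q", 1, 0xFEFF, 2, 0xFEFE)
print(parse_hvm_params(body))
```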

4.10 TOOLSTACK (deprecated)

This record was only present for transitionary purposes during development. It should not be used.

An opaque blob provided by and supplied to the higher layers of the toolstack (e.g., libxl) during save and restore.

 0     1     2     3     4     5     6     7 octet
+------------------------+------------------------+
| data                                            |
...
+-------------------------------------------------+

Field  Description
data   Blob of toolstack-specific data.

4.11 VERIFY

A verify record indicates that, while all memory has now been sent, the sender shall send further memory records for debugging purposes.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+

The verify record contains no fields; its body_length is 0.

4.12 CHECKPOINT

A checkpoint record indicates that all the preceding records in the stream represent a consistent view of VM state.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+

The checkpoint record contains no fields; its body_length is 0.

If the stream is embedded in a higher level toolstack stream, the CHECKPOINT record marks the end of the libxc portion of the stream and the stream is handed back to the higher level for further processing.

The higher level stream may then hand the stream back to libxc to process another set of records for the next consistent VM state snapshot. This next set of records may be terminated by another CHECKPOINT record or an END record.

4.13 CHECKPOINT_DIRTY_PFN_LIST

A checkpoint dirty pfn list record is used to convey information about dirty memory in the VM. It is an unordered list of PFNs. It is currently only applicable in the backchannel of a checkpointed stream and is only used by COLO; for more detail, see README.colo.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| pfn[0]                                          |
+-------------------------------------------------+
...
+-------------------------------------------------+
| pfn[C-1]                                        |
+-------------------------------------------------+

The count of pfns (C) is not stored in the record; it is the record's body_length divided by sizeof(uint64_t).
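
As a non-normative illustration, the count can be derived from the body length as follows. The function name parse_dirty_pfn_list is hypothetical.

```python
import struct


def parse_dirty_pfn_list(body, big_endian=False):
    """Return the list of 64-bit PFNs carried in the record body."""
    order = ">" if big_endian else "<"
    if len(body) % 8:
        raise ValueError("body length is not a multiple of 8 octets")
    count = len(body) // 8  # length / sizeof(uint64_t)
    return list(struct.unpack("%s%dQ" % (order, count), body))


body = struct.pack("<3Q", 0x1000, 0x1003, 0x20)
print(parse_dirty_pfn_list(body))
```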

4.14 STATIC_DATA_END

A static data end record marks the end of the static state, i.e. state which is invariant of guest execution.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+

The static data end record contains no fields; its body_length is 0.

4.15 X86_CPUID_POLICY

CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy hypercall sub-ops.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| CPUID_policy                                    |
...
+-------------------------------------------------+

Field         Description
CPUID_policy  Array of xen_cpuid_leaf_t entries.

4.16 X86_MSR_POLICY

MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy hypercall sub-ops.

 0     1     2     3     4     5     6     7 octet
+-------------------------------------------------+
| MSR_policy                                      |
...
+-------------------------------------------------+

Field       Description
MSR_policy  Array of xen_msr_entry_t entries.

5 Layout

The set of valid records depends on the guest architecture and type. No assumptions should be made about the ordering or interleaving of independent records. Record dependencies are noted below.

Some records are used for signalling, and explicitly have zero length. All other records contain data relevant to the migration. Data records with no content should be elided on the source side, as their presence serves no purpose, but results in extra work for the restore side.

5.1 x86 PV Guest

A typical save record for an x86 PV guest image would look like:

There are some strict ordering requirements. The following records must be present in the following order as each of them depends on information present in the preceding ones.

5.2 x86 HVM Guest

A typical save record for an x86 HVM guest image would look like:

HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect the validity of architectural state in the context.

6 Compatibility with older versions

6.1 v3 compat with v2

A v3 stream is compatible with a v2 stream, but mandates the presence of a STATIC_DATA_END record ahead of any memory/register content. This is to ease the introduction of new static configuration records over time.

A v3-compatible receiver interpreting a v2 stream should infer the position of STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END had been sent.
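
The inference rule can be sketched, non-normatively, as a filter over already-decoded records. The function name with_static_data_end is hypothetical, and records are represented here as (name, body) pairs using the record names from this specification rather than numeric types.

```python
def with_static_data_end(records, guest_type):
    """Yield records, inserting STATIC_DATA_END where a v2 stream implies it."""
    # The first record of this kind marks the end of the static state.
    trigger = "X86_PV_P2M_FRAMES" if guest_type == "x86 PV" else "PAGE_DATA"
    seen = False
    for name, body in records:
        if name == "STATIC_DATA_END":
            seen = True  # a v3 stream sends it explicitly
        elif name == trigger and not seen:
            yield ("STATIC_DATA_END", b"")  # behave as if it had been sent
            seen = True
        yield (name, body)


v2_pv = [("X86_PV_INFO", b""), ("X86_PV_P2M_FRAMES", b""),
         ("PAGE_DATA", b""), ("END", b"")]
print([name for name, _ in with_static_data_end(v2_pv, "x86 PV")])
```

A v3 stream, which already carries the record, passes through unchanged.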

6.2 Legacy Images (x86 only)

Restoring legacy images from older tools shall be handled by translating the legacy format image into this new format.

It shall not be possible to save in the legacy format.

There are two different legacy images depending on whether they were generated by a 32-bit or a 64-bit toolstack. These shall be distinguished by inspecting octets 4-7 in the image. If these are zero then it is a 64-bit image.

Possible values for octets 4-7 in legacy images

Toolstack  Field                             Value
64-bit     Bits 32-63 of the p2m_size field  0 (since p2m_size < 2^32)
32-bit     extended-info chunk ID (PV)       0xFFFFFFFF
32-bit     Chunk type (HVM)                  < 0
32-bit     Page count (HVM)                  > 0

This assumes the presence of the extended-info chunk which was introduced in Xen 3.0.
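
Combining the marker check from the image header with the octet 4-7 test above, image detection could be sketched as follows. This is a non-normative illustration; the function name classify_image and the returned labels are hypothetical.

```python
def classify_image(first8):
    """Classify an image from its first 8 octets."""
    if len(first8) < 8:
        raise ValueError("need at least 8 octets")
    first8 = bytes(first8[:8])
    if first8 == b"\xff" * 8:
        return "current"    # marker of this specification
    if first8[4:8] == b"\x00" * 4:
        return "legacy-64"  # upper half of a 64-bit p2m_size
    return "legacy-32"      # chunk ID, chunk type or page count


print(classify_image(b"\xff" * 8))
print(classify_image(b"\x00\x00\x04\x00\x00\x00\x00\x00"))
print(classify_image(b"\x00\x00\x04\x00\xff\xff\xff\xff"))
```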

7 Future Extensions

All changes to this specification should bump the revision number in the title block.

All changes to the image or domain headers require the image version to be increased.

The format may be extended by adding additional record types.

Extending an existing record type must be done by adding a new record type. This allows old images with the old record to still be restored.

The image header may only be extended by appending additional fields. In particular, the marker, id and version fields must never change size or location.

8 Errata

  1. For compatibility with older code, the receiving side of a stream should tolerate and ignore variable-sized records with zero content. Xen releases between 4.6 and 4.8 could end up generating valid HVM_PARAMS or X86_PV_VCPU_{EXTENDED,XSAVE,MSRS} records with zero-length content.

[1] In the legacy format, this is the list of unmapped PFNs in the tail.