Table of Contents |
T3 VM Technical Documentation >
Portable Binary Encoding
T3 Portable Binary Encoding
All T3 binary files are encoded in a portable format that allows a binary file created on one type of computer to be used without any changes with T3 implementations on other types of computers. To achieve this binary portability, T3 binary files use a portable encoding that represents datatypes in a standard format. Each T3 implementation translates between the standard format and the local representation of the datatype when reading and writing files.
Datatypes
The following are the portable datatypes and their encoding:
UTF-8 Text
Unicode text is encoded in UTF-8. This encoding represents each 16-bit Unicode character with one, two, or three bytes:
From
To
Binary Encoding
0x0000
0x007f
0bbbbbbb
0x0080
0x07ff
110bbbbb 10bbbbbb
0x0800
0xffff
1110bbbb 10bbbbbb 10bbbbbb
The bits of the 16-bit value are encoded into the b’s in the table above with the most significant group of bits in the first byte. So: 0x571 has the 16-bit Unicode binary pattern 0000011001110001, and is encoded in UTF-8 as 11011001 10110001.
Note that UTF-8 encodes the most significant bits in the first bytes; this is different from the byte ordering used for other types (such as integers), which are all stored in little-endian format (least-significant byte first). The reason for this disparity is that this ordering makes it possible to compare two UTF-8 strings in a byte-wise fashion. (This type of magnitude comparison is not always especially useful, since it does not produce correct results for a localized sorting order, but it at least produces a uniform sorting order based on the Unicode code points stored in the string and hence may be useful in certain cases for building internal indices and tables.)
Integer values
Integer values are stored in little-endian format (i.e., least-significant byte first) in fixed-size byte arrays. These values are not aligned on any particular boundary in the file. These values can be interpreted as signed or unsigned. Signed values are encoded in 2’s-complement notation.
16-bit integers are stored as 2-byte arrays. The first byte has the low-order 8 bits, the second byte has the high-order 8 bits.
32-bit integers are stored as 4-byte arrays. The first byte has the low-order 8 bits, the second byte has the next more significant 8 bits, the third byte has the next more significant 8 bits, and the fourth byte has the most significant 8 bits.
Data Holders
The T3 VM uses run-time typing, which allows certain types of variables to hold any type of value; this type of value is tagged with its type, so that the VM can interpret the value correctly whenever it is used.
In order to store these “variant” types, the VM defines a composite type called a data holder. This composite contains the type information along with the value.
To store a data holder portably, we store a 5-byte array. The first byte contains the type ID value. The remaining 4 bytes encode the value using the standard primitive type encodings; the table below shows the correspondence between the primitive types and their encodings.
When an encoding does not take up the full 4 bytes, the value is packed into the earlier bytes, and the later bytes have arbitrary values. For example, a property ID is encoded in a data holder as follows:
Byte Index
Value
0
6 (the type code for VM_PROP)
1
low-order 8 bits of property ID value
2
high-order 8 bits of property ID value
3
arbitrary
4
arbitrary
Type ID’s
The table below shows the assigned ID values for the primitive types. (The types shown in italics are reserved for internal use by implementations and will never appear in portable files; we list them for the sake of completeness, but they’ll never be stored persistently and thus are not relevant to the portable file format.)
Type Name
Type ID
Description
Value Encoding
VM_NIL
1
nil (boolean “false” or null pointer)
none
VM_TRUE
2
boolean “true”
none
VM_STACK
3
Reserved for implementation use for storing native machine pointers to stack frames (see note below)
none
VM_CODEPTR
4
Reserved for implementation use for storing native machine pointers to code (see note below)
none
VM_OBJ
5
object reference as a 32-bit unsigned object ID number
UINT4
VM_PROP
6
property ID as a 16-bit unsigned number
UINT2
VM_INT
7
integer as a 32-bit signed number
INT4
VM_SSTRING
8
single-quoted string; 32-bit unsigned constant pool offset
UINT4
VM_DSTRING
9
double-quoted string; 32-bit unsigned constant pool offset
UINT4
VM_LIST
10
list constant; 32-bit unsigned constant pool offset
UINT4
VM_CODEOFS
11
code offset; 32-bit unsigned code pool offset
UINT4
VM_FUNCPTR
12
function pointer; 32-bit unsigned code pool offset
UINT4
VM_EMPTY
13
no value (this is useful in some cases to represent an explicitly unused data slot, such as a slot that has never been initialized)
none
VM_NATIVE_CODE
14
Reserved for implementation use for storing native machine pointers to native code (see note below)
none
VM_ENUM
15
enumerated constant; 32-bit integer
UINT4
VM_BIFPTR
16
built-in function pointer; 32-bit integer, encoding the function set dependency table index in the high-order 16 bits, and the function’s index within its set in the low-order 16 bits.
UINT4
VM_OBJX
17
Reserved for implementation use for an executable object, as a 32-bit object ID number (see note below)
UINT4
Note that types 3 (VM_STACK), 4 (VM_CODEPTR), 14 (VM_NATIVE_CODE), and 17 (VM_OBJX) are reserved for implementation use. These will never appear in a portable binary file; we list them here only for completeness. These types are intended to allow implementations to store native datatypes (such as native machine pointers) for which there is no meaningful portable representation. Implementations are free to use these types for any purposes of their own; the names and descriptions in the table are for mnemonic value only and shouldn’t be taken to imply a required use for these types.
Type Names
The file format specifications use the following names to refer to the portable datatypes:
Name
Description
SBYTE
Signed 8-bit byte
UBYTE
Unsigned 8-bit byte
UTF8
Unicode text encoded as UTF-8
INT2
Signed 16-bit (2-byte) integer
UINT2
Unsigned 16-bit (2-byte) integer
INT4
Signed 32-bit (4-byte) integer
UINT4
Unsigned 32-bit (4-byte) integer
DATA_HOLDER
Data holder for any primitive type
Copyright © 2001, 2006 by Michael J. Roberts.
Revision: September, 2006
TADS 3 Technical Manual
Table of Contents |
T3 VM Technical Documentation >
Portable Binary Encoding