2016-11-19 Virtual machine, – revisited
After implementing the first version of the virtual machine, here is some of my own notes for the next version.
Virtual machine
I work with a clear division of words and chunks.
- a word is 16/32 bits of memory which contains either a tagged integer, or a reference to a chunk in the heap
- a chunk is a piece of memory, containing length, gc-word, etc.
Currently the chunks are handled as the offset of the chunk into the heap. It would be nicer to change this to be a word pointer, to the beginning of the content of the chunk.
The chunk has three words of meta-data
- Word for bookkeeping during garbage collection and compaction
gc-word
- Word with size of data segment in bytes
bytelen
(would be nicer if this was the total size of the content of the chunk in bytes). - Word type and number of pointers in reference segment
ptrlen
ptrlen
words of referencesbytelen
bytes of data- optional 1 byte zero-padding.
Notice that the gc-word
is always zero, except during garbage collection. This means that the first byte after the data is always zero, and strings in chunks can be read as zero-terminated c-strings.
Atoms
At the beginning/back of the stack, is where global atoms/defines, which can be changed, are saved. This works well with the garbage collector, as the stack is the root / only non-persistent/immutable part of the memory.
Functions/code
The code/functions lives on the heap, and may thus be relocated during garbage collection. Instead of just having an instruction pointer, it thus makes sense to see a code address as a function pointer plus an offset into the function code.
A frame pointer will also be useful, as it allows RET-instruction, and ARG1, ARG2, … LOCAL1, … as single byte instructions.
Interaction with the virtual machine
Interaction is done by sending a block of bytecodes to the virtual machine, which is then executed. The virtual machine should have instructions for building function definitions etc. in the heap.
PUSH_BYTE CLOSURE 0 ; write bytecodes for reading first object in closure
PUSH_BYTE CALL_CLOSURE 1 ; write bytecode for printing a value
PUSH_BYTE RET ; write bytecode return, - total number of bytecodes are 4
"Hello world!" ; pushes reference to heap object "Hello world!"
GLOBAL #0 "print" DICT_GET ; get print-function in the table of globals
5 ; length of bytecodes
2 ; length of closure
FUNCTION_TYPE NEW ; create a function object
GLOBAL #0
"hello" ; intended global name of function
LOCAL0 ; the created function
ASSOC
SET_GLOBAL #0
RET
Create a new function using the global function “print” to print hello world, and save it as “hello” in global0.
The above code is written in a semi-assembler. #4
means 4
as a literal bytecode. 17
means the bytecodes that pushes 17 onto the stack (in this case it could be PUSH_BYTE #17
). "foo"
means the bytecode that pushes the string “foo” to the heap (in this case 102 111 111 3 0 STRING_TYPE NEW
).
The bytecode instruction set has not yet stabilised.
List of globals
Not finalised, – lives at the bottom of the stack.
- Table of named values
true
atomfalse
atomundefined
atomnull
atom
Dictionary data type
The language should have a built-in persistent dictionary data types, I intend to make a couple of different implementations to measure which is best.
Simple tree implementation
As a base test case.
TRIE (with path compression)
Each chunk already has the type, the number of bytes in the data section bytelen
, and the number of pointers ptrlen
encoded, so a Compressed Trie node can be implemented as:
- all of the pointers are Leaf nodes (if it has INNER_TRIE type, we have not reached the result)
- the
bytelen - ptrlen
first bytes are the prefix for the node. - the
ptrlen
last bytes are the symbol for each of the pointers.
HAMT
The implementation would be similar to CHAMP.
I would probably make a slight change in the memory layout, so instead of having a separate nodemap
and datamap
, I would have a single bitmap, with two bits per entry:
- 00 empty
- 01 reference to a HAMT for the next part of the hash
- 11 Key-reference and value reference
- 10 Hash-collision: reference to an out-of-node array with keys/values
In addition to the bitmap, there would also be an array of references, – where the position for the current hash-part can be found with popcnt
as in the standard HAMT/CHAMP.
This layout makes it easier to compact the hamt on delete, in my opinion. The general characteristics should be similar to CHAMP.