After I stumbled upon ipfs, I have been thinking a bit about how to implement an efficient data storage / database on top of this, especially targeting linked open data. This has resulted in the following data structures / algorithms, which I would be very curious to implement later, to see how it performs in practise.
- Trustless (ie. intermediaries cannot fiddle with the data)
- Ordered traversal
- Snapshot support
- Compact (I assume that disk/network is the slowest resource)
Essentially any B-tree like append-only data structure using ipfs-addresses as pointers to blocks would do the job.
For persistent/append-only data structures HAMT has shown to have good performance in memory. A Trie data structure includes prefix key compression reducing the size of the tree. B-trees is good for caching / out-of-memory data.
I propose storing data in a trie with a half-octet alphabet (a la HAMT), and path compression, – and combining subtrees into immutable blocks which can then be stored out-of-memory. The half-octet alphabet is choosen, as that yield much smaller nodes than octets, and thus better performance when using it as a persistent data structure, – plus it is easier to pack into balanced blocks.
Conceptually a node in the trie contains the following information:
- The prefix
- Either content/data (for leaf-nodes) or list of 1-16 child nodes.
- Total number of leafs for this subtree
A node could live in two ways: 1) compact encoding within a block, backed to disk/network 2) non-distributed local memory representation. The following will focus on the compact encoding, – the local memory representation, is just a normal datastructure.
In practise there would be four kinds of trie nodes in a block:
- Branch-nodes, – between 2 and 16 children, based on on the current half-octet
- Path-node, – a sequence of half-octets without branches
- Leaf, – the suffix content/data for the key+value. For small values this is encoded in the block, for large data, a single leaf-node lives in a separate block.
- Reference to external node, with count of leaf-nodes in sub-tree
Every node has meta-information about it length (including children encoded in the block), – and node-type. This is an integer + 2 bits, which can be encoded with a variable number of bytes (number of bytes is encoded in 1 bit per byte). This will be 1-3bytes assuming a block size less than half a megabyte.
- Branch nodes are encoded with a 16bit bitmask, indicating which children it has, followed by meta-information for each child, and then the actual children. No pointers are needed, as the length of the children are known, and can just be skipped. Total node memory usage: 2 byte bitmask + 1-3 byte meta information.
- Path nodes contains a variable length encoded integer of the number of half-octets in the path, then the actual path data, and then followed by one child-node with meta-info. A path node thus takes 1-2 byte length + actual data + 1-3 byte meta info.
- Leaf nodes just contains content: actual data excluding prefix + 1-? byte meta information (1-? as a leaf node might be longer than the usual block size). The content might also be compressed, with something similar to lz4, but using other leaf-nodes in block as context too.
- References contains variable integer encoded count of leaf-nodes in subtree, plus the actual reference – link to a ipfs content hash.
Notice that the following information each node is implicit:
- The prefix of the node, is known through the access path to the node.
- The number of leafs in subtree for every, – as the leafs within the block is known, and number of leafs for external reference in the block is known.
- References to children of nodes, via known length and storage structure.
The conversion into block happens through traversal of the non-distributed in-memory representation, where the known number of leafs are used to choose most balanced block-tree.