A look at the Pastry distributed hash table

19 Sep 2020 - tsp
Last update 19 Sep 2020
Reading time 18 mins

Please note: Pastry is not my design or invention; it's well documented in a number of papers. There is also a quick overview available from the authors of Pastry.

What is a DHT?

A distributed hash table (DHT) is a distributed database that provides key-value store semantics. Basically it's a distributed system that allows one to store values under a given key and retrieve them again using that key. It's called a hash table since the key is usually calculated as a hash value of the stored object. A hash function is a one-way function that should yield an evenly distributed key value that is guaranteed to change whenever the object changes. It's also often desirable for it to be collision free - but of course that's not possible in general if the keyspace is smaller than the value space; hash tables are usually capable of dealing with such collisions.

The distributed behavior is achieved by assigning nodes IDs from the same keyspace and assigning objects to the numerically nearest node(s). Redundancy can be achieved by storing objects on the $n$ closest nodes.

Since node IDs are assigned randomly (either by using random number generators, being derived via a hash function from the address of the node or being the hash of a credential like a public key) they’re statistically evenly distributed in keyspace if everything is done right.
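To make the key assignment concrete, here is a minimal Python sketch (my own illustration, not part of Pastry itself) that derives 128 bit keys by truncating a SHA-256 digest and picks the numerically closest node(s) for an object; wraparound at the ends of the ID space is ignored for brevity.

    import hashlib

    KEY_BITS = 128

    def make_key(data: bytes) -> int:
        # Derive a 128 bit key by truncating a SHA-256 digest (the hash choice is arbitrary here)
        return int.from_bytes(hashlib.sha256(data).digest()[:KEY_BITS // 8], "big")

    def closest_nodes(key: int, node_ids, n: int = 1):
        # Objects are stored on the n node IDs numerically closest to their key
        return sorted(node_ids, key=lambda nid: abs(nid - key))[:n]

    # Example: three nodes, one object stored with redundancy n = 2
    nodes = [make_key(b"node-A"), make_key(b"node-B"), make_key(b"node-C")]
    print(closest_nodes(make_key(b"some object"), nodes, n=2))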

Since they operate on some really basic principles DHTs scale well, are self organizing, are robust even in case nodes fail, are fault tolerant and distribute load evenly across the system. Pastry also reduces load on the network by employing a proximity metric while building some of its tables. Pastry is also a little bit different from most other DHTs - it includes an overlay routing network that allows one to route messages to nodes that one doesn't have direct information about. Routing tables are usually updated whenever a node fails or joins the network.

One can build a huge number of services on top of a DHT - for example distributed object or file storage, naming and lookup services or publish/subscribe messaging.

There is a nearly unlimited number of applications - depending on the environment they run in one has to take some security considerations into account though.

Network parameterization

First of all Pastry uses 128 bit node IDs (i.e. 16 bytes). There are a number of tunables though:

- The digit size $b$ in bits (typically $b=4$, i.e. one hexadecimal digit per routing step)
- The leaf set size $\mid L \mid$ (typically $2^b$)
- The neighborhood set size $\mid N \mid$ (typically $2*2^b$)
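To keep the later sketches in this article concrete, the following Python constants pin down this parameterization with the typical values; all names are my own and purely illustrative.

    ID_BITS = 128                         # length of node IDs and keys in bits
    B = 4                                 # bits per digit (the parameter b)
    DIGIT_BASE = 1 << B                   # 16 possible digit values
    ROWS = ID_BITS // B                   # at most 32 routing table rows
    LEAF_SET_SIZE = 1 << B                # |L| = 16
    NEIGHBORHOOD_SET_SIZE = 2 * (1 << B)  # |N| = 32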

Information kept about other nodes

First let's think about the information required to contact other nodes:

- The node ID of the remote node
- The underlying network address used to reach it (e.g. IP address and port)

Additionally the node might locally add:

- The last measured proximity metric value
- A last-seen timestamp used for liveness tracking

Node state

Neighborhood set ($N$)

The neighborhood set is not directly used during message routing but is usually used for locality considerations. It contains the $\mid N \mid$ nearest nodes according to the network metric - one could use hop count, latency or other cost based information. Note that the original paper doesn't specify which metric one should use and the reference implementation allowed external tools like traceroute and ping to be used to determine the proximity value. Usually the application keeps track of the nodes by simply measuring the proximity of known nodes and updates this set accordingly - it's advised to store the entries in an ordered fashion for faster access.

The size is usually set to $2*2^b$, i.e. for the often used $b=4$ this would lead to 32 entries (in this case linear scan might also be suitable during access).

During join of the network a node initializes its neighborhood set to the neighborhood set of its initially contacted seed node.
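As a rough illustration of how such a set could be maintained, the following sketch keeps the $\mid N \mid$ proximity-wise closest nodes in a sorted list; the class and method names are mine and not taken from any Pastry implementation.

    import bisect

    class NeighborhoodSet:
        # Keeps the |N| nodes with the lowest proximity value (e.g. measured RTT), sorted for fast access
        def __init__(self, max_size: int = 32):
            self.max_size = max_size
            self.entries = []          # list of (proximity, node_id), kept sorted by proximity

        def offer(self, node_id: int, proximity: float) -> None:
            # Fast discard: ignore candidates worse than the current worst entry once the set is full
            if len(self.entries) >= self.max_size and proximity >= self.entries[-1][0]:
                return
            # Drop a possibly existing entry for this node, then insert at the sorted position
            self.entries = [e for e in self.entries if e[1] != node_id]
            bisect.insort(self.entries, (proximity, node_id))
            del self.entries[self.max_size:]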

Routing table ($R$)

The routing table is the main fabric used during message routing. Basically it contains a row for every digit (i.e. group of $b$ bits) in the node ID space. For example for 128 bit node IDs and a value of $b=4$ it would contain 32 rows. In the paper describing Pastry the number of (populated) rows is written as $\log_{2^b} N$, where $N$ is the number of nodes in the network.

Each row contains $2^b-1$ possible entries - due to implementation constraints one might implement it with $2^b$ entries per row. Each entry contains the (topologically) closest known node whose ID shares the first row-index digits with the current node's ID and has the corresponding digit at the next position.

For a node ID of $0123$ the table would start with

none 1xxx 2xxx 3xxx 4xxx 5xxx 6xxx 7xxx 8xxx 9xxx
00xx none 02xx 03xx 04xx 05xx 06xx 07xx 08xx 09xx
010x 011x none 013x 014x 015x 016x 017x 018x 019x
0120 0121 0122 none 0124 0125 0126 0127 0128 0129

Note that each entry might be empty, occupied by the own node or contain an arbitrary node that satisfies the given constraint. On average only about $\log_{2^b} N$ rows are populated though, where $N$ again denotes the number of nodes in the network.

The typical network using $b=4$ and 128 bit node IDs would require 32 rows with 15 entries each (or 16 if not optimized to exclude the own node in each row), thus 480 entries which is totally acceptable even for applications on platforms like the ESP8266/ESP32 - depending on the stored information.
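A small sketch of the indexing scheme might help here: the helpers below extract the digit at a given position and compute the length of the shared prefix between two IDs. This is my own illustration, assuming the ID_BITS/B values from the parameterization above.

    ID_BITS, B = 128, 4
    ROWS = ID_BITS // B                    # 32 rows for 128 bit IDs and b = 4

    def digit(node_id: int, pos: int) -> int:
        # Extract the digit (group of b bits) at position pos, counting from the most significant digit
        shift = ID_BITS - (pos + 1) * B
        return (node_id >> shift) & ((1 << B) - 1)

    def shared_prefix_len(a: int, b: int) -> int:
        # Number of leading digits two IDs have in common
        for pos in range(ROWS):
            if digit(a, pos) != digit(b, pos):
                return pos
        return ROWS

    # routing_table[row][col] holds a node sharing `row` leading digits with us and having digit `col` next
    routing_table = [[None] * (1 << B) for _ in range(ROWS)]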

Leaf set ($L$)

The leaf set works somewhat like the neighborhood set and contains the $\frac{\mid L \mid}{2}$ closest nodes that are larger and the $\frac{\mid L \mid}{2}$ closest nodes that are smaller than the current node ID - but this time not in proximity space but in key space. Nodes should be stored in sorted order to allow fast binary search. The size of the leaf set is also configurable and is usually set to $2^b$ entries (for a typical $b=4$ i.e. to $16$ entries, which would be small enough even for unordered access).
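A possible in-memory representation - again just my own sketch - splits the leaf set into a sorted smaller and larger half around the own node ID; wraparound at the ends of the key space is ignored for brevity.

    import bisect

    class LeafSet:
        # Holds the |L|/2 numerically closest smaller and |L|/2 closest larger node IDs
        def __init__(self, own_id: int, half_size: int = 8):
            self.own_id = own_id
            self.half = half_size
            self.smaller = []   # sorted ascending, closest neighbors at the end
            self.larger = []    # sorted ascending, closest neighbors at the front

        def offer(self, node_id: int) -> None:
            if node_id == self.own_id:
                return
            side = self.smaller if node_id < self.own_id else self.larger
            if node_id in side:
                return
            bisect.insort(side, node_id)
            if len(side) > self.half:
                # Drop the entry farthest from our own node ID
                side.pop(0 if side is self.smaller else -1)

        def covers(self, key: int) -> bool:
            # True if the key falls into the key range spanned by the leaf set
            lo = self.smaller[0] if self.smaller else self.own_id
            hi = self.larger[-1] if self.larger else self.own_id
            return lo <= key <= hi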

Message routing

Note that message routing requires the node to have already joined the network and joining requires message routing to be working on other nodes, so this is somewhat of a chicken-and-egg problem. The same routing will be used during key-value storage to determine which node is responsible for a given key. First I'll describe how message routing works. Basically the node receives a message from some other node that carries a destination key (and usually also the source node's information).

The node first checks if the destination key is contained inside the range of its leaf set. It simply checks if the destination key $D$ is larger or equal to the smallest and smaller or equal to the largest known node ID in the leaf set. If this is the case the message is forwarded to the leaf set node (possibly the current node itself) with the minimum numeric distance to the destination key.

If the leaf set could not be used the node tries to use the routing table. This is done by first determining the length $l$ of the prefix that the current node ID and the destination key share. Then it looks up the routing table in row $l$ at the position corresponding to the next digit of the destination key. Extracting this digit can be accomplished by simple shifting and masking - especially if $2^b$ entries are stored per row instead of $2^b-1$. If a live node is found in the routing table the message gets forwarded to that node.

In case this also didn't work out - which is said to be a rare case in the original paper - the message is forwarded to any known node (from routing table, leaf set or neighborhood set) that:

- shares a prefix with the destination key that is at least as long as the prefix shared by the current node, and
- is numerically closer to the destination key than the current node ID.

If no such node exists the message has arrived at its destination node. Note that during join this is used to determine one of the immediate neighbors of a node. This routing procedure is guaranteed to converge except when $\frac{\mid L \mid}{2}$ of the nodes with adjacent node IDs have failed simultaneously.
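Putting the three cases together, a forwarding decision could look roughly like the following sketch; it builds on the digit()/shared_prefix_len() helpers and the LeafSet class sketched above and is not taken from any reference implementation.

    def route(own_id, msg_key, leaf_set, routing_table, all_known_nodes):
        # 1. Leaf set: if the key falls within its range, deliver to the numerically closest node
        if leaf_set.covers(msg_key):
            candidates = leaf_set.smaller + leaf_set.larger + [own_id]
            return min(candidates, key=lambda nid: abs(nid - msg_key))

        # 2. Routing table: forward to the entry matching the next digit of the key
        l = shared_prefix_len(own_id, msg_key)
        entry = routing_table[l][digit(msg_key, l)]
        if entry is not None:
            return entry

        # 3. Rare case: forward to any known node that shares at least as long a prefix
        #    with the key and is numerically closer to it than we are
        for nid in all_known_nodes:
            if shared_prefix_len(nid, msg_key) >= l and abs(nid - msg_key) < abs(own_id - msg_key):
                return nid

        # No better node known: the message has reached its destination
        return own_id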

Node join procedure

In case a new node wants to join the DHT ring it first generates - if this has not happened before - its own credentials (signature and encryption keys) as well as its random node ID. The random node ID is sometimes calculated as the SHA-1 hash of its IP address or as the hash of its public key. After that it requires at least one seed node that has to be supplied by some external mechanism - like being supplied from a central authority which would introduce a single point of failure, using IP multicast which only works in local networks, being supplied manually, etc.

After the seed node has been located it gets contacted by the newly joining node. A join message is transmitted to the seed node with its destination set to the new node ID. The message is routed according to the standard routing protocol - all nodes that are encountered during forwarding transmit all of their tables (neighborhood set, routing table, leaf set) to the newly joining node.

The neighborhood set is initially copied from the seed node since it's assumed to be in proximity to the newly joining node. From this point on the neighborhood set is tracked as it will be during the whole lifetime of the node - i.e. in case a node is discovered that is closer by the proximity metric than the currently known ones it's included in the neighborhood set and possibly replaces the most distant one currently contained in the set.

The last node that the join message arrives on is used as the seed for leaf set information. The leaf set of this node is copied onto the newly joining node's leaf set - and from this moment on it's tracked again to contain the $\mid L \mid$ numerically closest nodes ever seen by this node, replacing more distant nodes if required.

The most problematic table to construct is the routing table. First of all, all fields corresponding to the current node (i.e. sharing a prefix of $row * b$ bits with the current node ID) are occupied automatically by the node itself (see example above). The first row can be directly copied from any node since its entries don't need to share a common prefix with the current node at all and are thus independent of the current node ID. The other entries have to share a prefix with the current node and can initially be filled with information received from the nodes that the join message has been routed through. Note that the length of the shared prefix gets longer and longer for every node encountered on the join message's path.

Note that this is only the initial build of the routing table. The new node may also request additional information (i.e. state tables) from other nodes encountered during this process. It asks all of the encountered nodes for their state tables and updates its own state in case it locates closer nodes. The neighborhood set is pretty useful for this process since it keeps the tables of nodes in the vicinity consistent.

After its own tables have been initialized the node transmits a joined message to all applicable nodes to inform them of its presence so they include the new node in their routing tables.
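The table initialization described above could be sketched roughly as follows; path_states stands for the state tables received from the nodes along the join route (seed node first, numerically closest node last), ROWS and B are the constants from the earlier sketches, and the attribute names are assumptions of mine.

    def init_state_from_join_path(new_id, path_states):
        # Each element of path_states is assumed to expose .neighborhood, .leaf_set and .routing_table
        neighborhood = list(path_states[0].neighborhood)    # copied from the seed (assumed to be nearby)
        leaf_set = list(path_states[-1].leaf_set)           # copied from the numerically closest node

        routing_table = [[None] * (1 << B) for _ in range(ROWS)]
        for i, state in enumerate(path_states):
            # The i-th node on the path shares (at least) i leading digits with the new node ID,
            # so its row i is a reasonable initial value for our own row i
            routing_table[i] = list(state.routing_table[i])
        return neighborhood, leaf_set, routing_table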

Keeping track of the tables during runtime

Leaf set and neighborhood set

These two sets are the easiest ones to track during runtime. For the leaf set $L$ one can simply track the key space distance of all active (not currently joining) nodes encountered during any communication and check if a closer node has been encountered - which would lead to one of the currently known nodes being replaced by the closer one.

The neighborhood set $N$, which is built according to a proximity metric, can be tracked by evaluating the proximity of all encountered nodes on a periodic basis and doing the same thing - just keeping the nodes that are in proximity of the current node. It's a good idea to keep track of the worst proximity value currently contained in the set to allow fast discarding of candidates.
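Tying this to the earlier sketches, the runtime bookkeeping for both sets boils down to offering every live node one encounters to both containers; measure_proximity is a placeholder for whatever metric (ping, traceroute, ...) the application uses.

    def on_node_seen(node_id, measure_proximity, leaf_set, neighborhood):
        # Called for every live (not currently joining) node encountered during normal operation
        leaf_set.offer(node_id)                                   # key space distance bookkeeping
        neighborhood.offer(node_id, measure_proximity(node_id))   # network proximity bookkeeping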

Routing table

The routing table might get updated periodically with newly arriving nodes, nodes that fail can be removed and nodes usually exchange updated routing information with their neighbors.

When any node provides state information to another node it attaches a timestamp value. The receiving node then updates its own routing table according to the state information and eventually notifies the original node - for example - of its arrival. The arriving node attaches the original timestamp so the receiving node can check if its own routing information has changed since then - and might signal a restart of the routing table update to the now joined node.

Node leave procedure

Failure and departure are handled the same way. In case a node is discovered to be dead because no keep-alive is exchanged or a request times out - or in case the node signals the end of its connections in connection oriented setups - the node gets removed from the network.

To replace a node inside the leaf set a node contacts the most distant live entry on the side of the failed node and asks for its leaf set to update its own information about neighboring nodes.

To repair a failed entry in the routing table a node contacts another node from the same row (i.e. with the same shared prefix length) and asks for its entry for the failed digit position. In the event that none of the nodes from the same row knows a suitable live node the net gets cast wider - i.e. the node contacts nodes from the rows below, which share an even longer prefix. This is likely to find an entry in case one exists.
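The repair strategy might be sketched like this; query_entry is a hypothetical RPC that asks a remote node for its routing table entry at the given row and column and returns None if it has none.

    def repair_routing_entry(routing_table, row, col, query_entry):
        # First ask the other live entries in the same row for a replacement ...
        for candidate in routing_table[row]:
            if candidate is None:
                continue
            replacement = query_entry(candidate, row, col)
            if replacement is not None:
                routing_table[row][col] = replacement
                return replacement
        # ... then cast the net wider: nodes from the rows below share an even longer prefix
        for deeper_row in range(row + 1, len(routing_table)):
            for candidate in routing_table[deeper_row]:
                if candidate is None:
                    continue
                replacement = query_entry(candidate, row, col)
                if replacement is not None:
                    routing_table[row][col] = replacement
                    return replacement
        return None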

Network resilience

One of the questions that arises when designing DHTs is the question of resilience against malicious behavior of nodes. Pastry tries to counter malicious nodes by introducing some randomness. Each routing step requires selecting a node that lies closer to the destination key - but the choice between nodes that fulfill this property can be made arbitrarily, with some randomness. It's entirely possible to collect all possible routing hops first and then select one of them. In case of a failed query the query might have to be repeated a number of times, but this process of randomly selecting suboptimal steps allows one to circumvent failed or malicious nodes. Note that the probability distribution of the random number generator used during this process should of course be biased towards the optimal routing decision anyway.
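A biased random hop selection could be sketched as follows; candidates is assumed to already contain only nodes that make progress towards the destination key, and the greediness parameter is an invention of mine to express the bias towards the optimal hop.

    import random

    def pick_next_hop(candidates, msg_key, greediness=0.8):
        # With probability `greediness` take the best admissible hop, otherwise a random one;
        # repeated queries can then route around failed or malicious nodes
        ranked = sorted(candidates, key=lambda nid: abs(nid - msg_key))
        if not ranked:
            return None
        if random.random() < greediness or len(ranked) == 1:
            return ranked[0]
        return random.choice(ranked[1:])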

Network partitions

Underlying network failure (i.e. Internet network failure) may lead to DHT network partitioning. Pastry usually is capable of coping with situations in which a node is only reachable by a subset of nodes as long as the underlying network is not fully partitioned - but in case a network segment gets separated from the Internet (or the local network in which the DHT is used) the routing tables might be updated in a way that isolated islands of the DHT are formed. The DHT itself is not capable of recovering from this situation.

To recover from such situations one has to bypass the overlay routing network for node localization. In local networks one might use periodic multicast queries for neighboring nodes which can then be introduced into the leaf set so islands get reintegrated periodically - the same information might be used to update the routing tables as well.

One might also use some other kind of centralized coordinator to join separated islands on a random periodic basis.

This article is tagged: Programming, Tutorial

