A look at the Pastry distributed hash table

19 Sep 2020 - tsp
Last update 19 Sep 2020
Reading time 18 mins

Please note: Pastry is not my design or invention; it's well documented in a number of papers. There is also a quick overview available from the authors of Pastry.

What is a DHT?

A distributed hash table (DHT) is a distributed database that provides key-value store semantics. Basically it's a distributed system that allows one to store values under a given key and retrieve them again using that key. It's called a hash table since the key is usually calculated as a hash value of the stored object. A hash function is a one-way function that should yield an evenly distributed key value that is guaranteed to change whenever the object changes. It's also often desirable for it to be collision free - but of course that's not possible in general if the keyspace is smaller than the value space; hash tables are usually capable of dealing with such collisions.

The distributed behavior is achieved by assigning nodes IDs from the same keyspace and assigning objects to the numerically nearest node(s). Redundancy can be achieved by storing objects on the $n$ closest nodes.

Since node IDs are assigned randomly (either by using random number generators, being derived via a hash function from the address of the node or being the hash of a credential like a public key) they’re statistically evenly distributed in keyspace if everything is done right.
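To make the key assignment concrete, here is a minimal Python sketch (my own illustration, not part of Pastry itself) that derives 128 bit keys by truncating a SHA-256 digest and picks the numerically closest node(s) for an object; wraparound at the ends of the ID space is ignored for brevity.

    import hashlib

    KEY_BITS = 128

    def make_key(data: bytes) -> int:
        # Derive a 128 bit key by truncating a SHA-256 digest (the hash choice is arbitrary here)
        return int.from_bytes(hashlib.sha256(data).digest()[:KEY_BITS // 8], "big")

    def closest_nodes(key: int, node_ids, n: int = 1):
        # Objects are stored on the n node IDs numerically closest to their key
        return sorted(node_ids, key=lambda nid: abs(nid - key))[:n]

    # Example: three nodes, one object stored with redundancy n = 2
    nodes = [make_key(b"node-A"), make_key(b"node-B"), make_key(b"node-C")]
    print(closest_nodes(make_key(b"some object"), nodes, n=2))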

Since they operate on some really basic principles DHTs scale well, are self organizing, are robust even in case nodes fail, are fault tolerant and distribute load evenly across the system. Pastry also reduces load on the network by employing a proximity metric while building some of its tables. Pastry is also a little bit different from most other DHTs - it includes an overlay routing network that allows one to route messages to nodes that one doesn't have direct information about. Routing tables are usually updated whenever a node fails or joins the network.

One can build a huge number of services on top of a DHT - for example distributed object or file storage, naming and lookup services or publish/subscribe messaging.

There is a nearly unlimited number of applications - depending on the environment they run in one has to take some security considerations into account though.

Network parameterization

First of all Pastry uses 128 bit node IDs (i.e. 16 bytes). There are a number of tunables though:

- The digit size $b$ in bits (typically $b=4$, i.e. one hexadecimal digit per routing step)
- The leaf set size $\mid L \mid$ (typically $2^b$)
- The neighborhood set size $\mid N \mid$ (typically $2*2^b$)
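To keep the later sketches in this article concrete, the following Python constants pin down this parameterization with the typical values; all names are my own and purely illustrative.

    ID_BITS = 128                         # length of node IDs and keys in bits
    B = 4                                 # bits per digit (the parameter b)
    DIGIT_BASE = 1 << B                   # 16 possible digit values
    ROWS = ID_BITS // B                   # at most 32 routing table rows
    LEAF_SET_SIZE = 1 << B                # |L| = 16
    NEIGHBORHOOD_SET_SIZE = 2 * (1 << B)  # |N| = 32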

Information kept about other nodes

First let's think about the information required to contact other nodes:

- The node ID of the remote node
- The underlying network address used to reach it (e.g. IP address and port)

Additionally the node might locally add:

- The last measured proximity metric value
- A last-seen timestamp used for liveness tracking

Node state

Neighborhood set ($N$)

The neighborhood set is not directly used during message routing but is usually used for locality considerations. It contains the $\mid N \mid$ nearest nodes according to the network metric - one could use hop count, latency or other cost based information. Note that the original paper doesn't specify which metric one should use and the reference implementation allowed external tools like traceroute and ping to be used to determine the proximity value. Usually the application keeps track of the nodes by simply measuring the proximity of known nodes and updates this set accordingly - it's advised to store the entries in an ordered fashion for faster access.

The size is usually set to $2*2^b$, i.e. for the often used $b=4$ this would lead to 32 entries (in this case linear scan might also be suitable during access).

During join of the network a node initializes its neighborhood set to the neighborhood set of its initially contacted seed node.
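As a rough illustration of how such a set could be maintained, the following sketch keeps the $\mid N \mid$ proximity-wise closest nodes in a sorted list; the class and method names are mine and not taken from any Pastry implementation.

    import bisect

    class NeighborhoodSet:
        # Keeps the |N| nodes with the lowest proximity value (e.g. measured RTT), sorted for fast access
        def __init__(self, max_size: int = 32):
            self.max_size = max_size
            self.entries = []          # list of (proximity, node_id), kept sorted by proximity

        def offer(self, node_id: int, proximity: float) -> None:
            # Fast discard: ignore candidates worse than the current worst entry once the set is full
            if len(self.entries) >= self.max_size and proximity >= self.entries[-1][0]:
                return
            # Drop a possibly existing entry for this node, then insert at the sorted position
            self.entries = [e for e in self.entries if e[1] != node_id]
            bisect.insort(self.entries, (proximity, node_id))
            del self.entries[self.max_size:]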

Routing table ($R$)

The routing table is the main fabric used during message routing. Basically it contains a row for every digit (i.e. group of $b$ bits) in the node ID space. For example for 128 bit node IDs and a value of $b=4$ it would contain 32 rows. In the paper describing Pastry the number of (populated) rows is written as $\log_{2^b} N$, where $N$ is the number of nodes in the network.

Each row contains $2^b-1$ possible entries - due to implementation constraints one might implement it with $2^b$ entries per row. Each entry contains the (topologically) closest known node whose ID shares the first row-index digits with the current node's ID and has the corresponding digit at the next position.

For a node ID of $0123$ the table would start with

none 1xxx 2xxx 3xxx 4xxx 5xxx 6xxx 7xxx 8xxx 9xxx
00xx none 02xx 03xx 04xx 05xx 06xx 07xx 08xx 09xx
010x 011x none 013x 014x 015x 016x 017x 018x 019x
0120 0121 0122 none 0124 0125 0126 0127 0128 0129

Note that each entry might be empty, occupied by the own node or contain an arbitrary node that satisfies the given constraint. On average only about $\log_{2^b} N$ rows are populated though, where $N$ again denotes the number of nodes in the network.

The typical network using $b=4$ and 128 bit node IDs would require 32 rows with 15 entries each (or 16 if not optimized to exclude the own node in each row), thus 480 entries which is totally acceptable even for applications on platforms like the ESP8266/ESP32 - depending on the stored information.
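A small sketch of the indexing scheme might help here: the helpers below extract the digit at a given position and compute the length of the shared prefix between two IDs. This is my own illustration, assuming the ID_BITS/B values from the parameterization above.

    ID_BITS, B = 128, 4
    ROWS = ID_BITS // B                    # 32 rows for 128 bit IDs and b = 4

    def digit(node_id: int, pos: int) -> int:
        # Extract the digit (group of b bits) at position pos, counting from the most significant digit
        shift = ID_BITS - (pos + 1) * B
        return (node_id >> shift) & ((1 << B) - 1)

    def shared_prefix_len(a: int, b: int) -> int:
        # Number of leading digits two IDs have in common
        for pos in range(ROWS):
            if digit(a, pos) != digit(b, pos):
                return pos
        return ROWS

    # routing_table[row][col] holds a node sharing `row` leading digits with us and having digit `col` next
    routing_table = [[None] * (1 << B) for _ in range(ROWS)]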

Leaf set ($L$)

The leaf set works somewhat like the neighborhood set and contains the $\frac{\mid L \mid}{2}$ closest nodes that are larger and the $\frac{\mid L \mid}{2}$ closest nodes that are smaller than the current node ID - but this time not in proximity space but in key space. Nodes should be stored in sorted order to allow fast binary search. The size of the leaf set is also configurable and is usually set to $2^b$ entries (for a typical $b=4$ i.e. to $16$ entries, which would be small enough even for unordered access).
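A possible in-memory representation - again just my own sketch - splits the leaf set into a sorted smaller and larger half around the own node ID; wraparound at the ends of the key space is ignored for brevity.

    import bisect

    class LeafSet:
        # Holds the |L|/2 numerically closest smaller and |L|/2 closest larger node IDs
        def __init__(self, own_id: int, half_size: int = 8):
            self.own_id = own_id
            self.half = half_size
            self.smaller = []   # sorted ascending, closest neighbors at the end
            self.larger = []    # sorted ascending, closest neighbors at the front

        def offer(self, node_id: int) -> None:
            if node_id == self.own_id:
                return
            side = self.smaller if node_id < self.own_id else self.larger
            if node_id in side:
                return
            bisect.insort(side, node_id)
            if len(side) > self.half:
                # Drop the entry farthest from our own node ID
                side.pop(0 if side is self.smaller else -1)

        def covers(self, key: int) -> bool:
            # True if the key falls into the key range spanned by the leaf set
            lo = self.smaller[0] if self.smaller else self.own_id
            hi = self.larger[-1] if self.larger else self.own_id
            return lo <= key <= hi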

Message routing

Note that message routing requires the node to have already joined the network and joining requires message routing to be working on other nodes, so this is somewhat of a chicken-and-egg problem. The same routing will be used during key-value storage to determine which node is responsible for a given key. First I'll describe how message routing works. Basically the node receives a message from some other node that carries a destination key (and usually also the source node's information).

The node first checks if the destination key is contained inside the range of its leaf set. It simply checks if the destination key $D$ is larger or equal to the smallest and smaller or equal to the largest known node ID in the leaf set. If this is the case the message is forwarded to the leaf set node (possibly the current node itself) with the minimum numeric distance to the destination key.

If the leaf set could not be used the node tries to use the routing table. This is done by first determining the length $l$ of the prefix that the current node ID and the destination key share. Then it looks up the routing table in row $l$ at the position corresponding to the next digit of the destination key. Extracting this digit can be accomplished by simple shifting and masking - especially if $2^b$ entries are stored per row instead of $2^b-1$. If a live node is found in the routing table the message gets forwarded to that node.

In case this also didn't work out - which is said to be a rare case in the original paper - the message is forwarded to any known node (from routing table, leaf set or neighborhood set) that:

- shares a prefix with the destination key that is at least as long as the prefix shared by the current node, and
- is numerically closer to the destination key than the current node ID.

If no such node exists the message has arrived at its destination node. Note that during join this is used to determine one of the immediate neighbors of a node. This routing procedure is guaranteed to converge except when $\frac{\mid L \mid}{2}$ of the nodes with adjacent node IDs have failed simultaneously.
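Putting the three cases together, a forwarding decision could look roughly like the following sketch; it builds on the digit()/shared_prefix_len() helpers and the LeafSet class sketched above and is not taken from any reference implementation.

    def route(own_id, msg_key, leaf_set, routing_table, all_known_nodes):
        # 1. Leaf set: if the key falls within its range, deliver to the numerically closest node
        if leaf_set.covers(msg_key):
            candidates = leaf_set.smaller + leaf_set.larger + [own_id]
            return min(candidates, key=lambda nid: abs(nid - msg_key))

        # 2. Routing table: forward to the entry matching the next digit of the key
        l = shared_prefix_len(own_id, msg_key)
        entry = routing_table[l][digit(msg_key, l)]
        if entry is not None:
            return entry

        # 3. Rare case: forward to any known node that shares at least as long a prefix
        #    with the key and is numerically closer to it than we are
        for nid in all_known_nodes:
            if shared_prefix_len(nid, msg_key) >= l and abs(nid - msg_key) < abs(own_id - msg_key):
                return nid

        # No better node known: the message has reached its destination
        return own_id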

Node join procedure

In case a new node wants to join the DHT ring it first generates - if this has not happened before - its own credentials (signature and encryption keys) as well as its random node ID. The random node ID is sometimes calculated as the SHA-1 hash of its IP address or as the hash of its public key. After that it requires at least one seed node that has to be supplied by some external mechanism - like being supplied from a central authority which would introduce a single point of failure, using IP multicast which only works in local networks, being supplied manually, etc.

After the seed node has been located it gets contacted by the newly joining node. A join message is transmitted to the seed node with its destination set to the new node ID. The message is routed according to the standard routing protocol - all nodes that are encountered during forwarding transmit all of their tables (neighborhood set, routing table, leaf set) to the newly joining node.

The neighborhood set is initially copied from the seed node since it's assumed to be in proximity to the newly joining node. From this point on the neighborhood set is tracked as it will be during the whole lifetime of the node - i.e. in case a node is discovered that is closer by the proximity metric than the currently known ones it's included in the neighborhood set and possibly replaces the most distant one currently contained in the set.

The last node that the join message arrives on is used as the seed for leaf set information. The leaf set of this node is copied onto the newly joining node's leaf set - and from this moment on it's tracked again to contain the $\mid L \mid$ numerically closest nodes ever seen by this node, replacing more distant nodes if required.

The most problematic table to construct is the routing table. First of all, all fields corresponding to the current node (i.e. sharing a prefix of $row * b$ bits with the current node ID) are occupied automatically by the node itself (see example above). The first row can be directly copied from any node since its entries don't need to share a common prefix with the current node at all and are thus independent of the current node ID. The other entries have to share a prefix with the current node and can initially be filled with information received from the nodes that the join message has been routed through. Note that the length of the shared prefix gets longer and longer for every node encountered on the join message's path.

Note that this is only the initial build of the routing table. The new node may also request additional information (i.e. state tables) from other nodes encountered during this process. It asks all of the encountered nodes for their state tables and updates its own state in case it locates closer nodes. The neighborhood set is pretty useful for this process since it keeps the tables of nodes in the vicinity consistent.

After its own tables have been initialized the node transmits a joined message to all applicable nodes to inform them of its presence so they include the new node in their routing tables.
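The table initialization described above could be sketched roughly as follows; path_states stands for the state tables received from the nodes along the join route (seed node first, numerically closest node last), ROWS and B are the constants from the earlier sketches, and the attribute names are assumptions of mine.

    def init_state_from_join_path(new_id, path_states):
        # Each element of path_states is assumed to expose .neighborhood, .leaf_set and .routing_table
        neighborhood = list(path_states[0].neighborhood)    # copied from the seed (assumed to be nearby)
        leaf_set = list(path_states[-1].leaf_set)           # copied from the numerically closest node

        routing_table = [[None] * (1 << B) for _ in range(ROWS)]
        for i, state in enumerate(path_states):
            # The i-th node on the path shares (at least) i leading digits with the new node ID,
            # so its row i is a reasonable initial value for our own row i
            routing_table[i] = list(state.routing_table[i])
        return neighborhood, leaf_set, routing_table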

Keeping track of the tables during runtime

Leaf set and neighborhood set

These two sets are the easiest ones to track during runtime. For the leaf set $L$ one can simply track the key space distance of all active (not currently joining) nodes encountered during any communication and check if a closer node has been encountered - which would lead to one of the currently known nodes being replaced by the closer one.

The neighborhood set $N$, which is built according to a proximity metric, can be tracked by evaluating the proximity of all encountered nodes on a periodic basis and doing the same thing - just keeping the nodes that are in proximity of the current node. It's a good idea to keep track of the worst proximity value currently contained in the set to allow fast discarding of candidates.
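Tying this to the earlier sketches, the runtime bookkeeping for both sets boils down to offering every live node one encounters to both containers; measure_proximity is a placeholder for whatever metric (ping, traceroute, ...) the application uses.

    def on_node_seen(node_id, measure_proximity, leaf_set, neighborhood):
        # Called for every live (not currently joining) node encountered during normal operation
        leaf_set.offer(node_id)                                   # key space distance bookkeeping
        neighborhood.offer(node_id, measure_proximity(node_id))   # network proximity bookkeeping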

Routing table

The routing table might get updated periodically with newly arriving nodes, nodes that fail can be removed and nodes usually exchange updated routing information with their neighbors.

When any node provides state information to another node it attaches a timestamp value. The receiving node then updates its own routing table according to the state information and eventually notifies the original node - for example - of its arrival. The arriving node attaches the original timestamp so the receiving node can check if its own routing information has changed since then - and might signal a restart of the routing table update to the now joined node.

Node leave procedure

Failure and departure are handled the same way. In case a node is discovered to be dead because no keep-alive is exchanged or a request times out - or in case the node signals the end of its connections in connection oriented setups - the node gets removed from the network.

To replace a node inside the leaf set a node contacts the most distant live entry on the side of the failed node and asks for its leaf set to update its own information about neighboring nodes.

To repair a failed entry in the routing table a node contacts another node from the same row (i.e. with the same shared prefix length) and asks for its entry for the failed digit position. In the event that none of the nodes from the same row knows a suitable live node the net gets cast wider - i.e. the node contacts nodes from the rows below, which share an even longer prefix. This is likely to find an entry in case one exists.
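The repair strategy might be sketched like this; query_entry is a hypothetical RPC that asks a remote node for its routing table entry at the given row and column and returns None if it has none.

    def repair_routing_entry(routing_table, row, col, query_entry):
        # First ask the other live entries in the same row for a replacement ...
        for candidate in routing_table[row]:
            if candidate is None:
                continue
            replacement = query_entry(candidate, row, col)
            if replacement is not None:
                routing_table[row][col] = replacement
                return replacement
        # ... then cast the net wider: nodes from the rows below share an even longer prefix
        for deeper_row in range(row + 1, len(routing_table)):
            for candidate in routing_table[deeper_row]:
                if candidate is None:
                    continue
                replacement = query_entry(candidate, row, col)
                if replacement is not None:
                    routing_table[row][col] = replacement
                    return replacement
        return None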

Network resilience

One of the questions that arises when designing DHTs is the question of resilience against malicious behavior of nodes. Pastry tries to counter malicious nodes by introducing some randomness. Each routing step requires selecting a node that lies closer to the destination key - but the choice between nodes that fulfill this property can be made arbitrarily, with some randomness. It's entirely possible to collect all possible routing hops first and then select one of them. In case of a failed query the query might have to be repeated a number of times, but this process of randomly selecting suboptimal steps allows one to circumvent failed or malicious nodes. Note that the probability distribution of the random number generator used during this process should of course be biased towards the optimal routing decision anyway.
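A biased random hop selection could be sketched as follows; candidates is assumed to already contain only nodes that make progress towards the destination key, and the greediness parameter is an invention of mine to express the bias towards the optimal hop.

    import random

    def pick_next_hop(candidates, msg_key, greediness=0.8):
        # With probability `greediness` take the best admissible hop, otherwise a random one;
        # repeated queries can then route around failed or malicious nodes
        ranked = sorted(candidates, key=lambda nid: abs(nid - msg_key))
        if not ranked:
            return None
        if random.random() < greediness or len(ranked) == 1:
            return ranked[0]
        return random.choice(ranked[1:])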

Network partitions

Underlying network failure (i.e. Internet network failure) may lead to DHT network partitioning. Pastry usually is capable of coping with situations in which a node is only reachable by a subset of nodes as long as the underlying network is not fully partitioned - but in case a network segment gets separated from the Internet (or the local network in which the DHT is used) the routing tables might be updated in a way that isolated islands of the DHT are formed. The DHT itself is not capable of recovering from this situation.

To recover from such situations one has to bypass the overlay routing network for node localization. In local networks one might use periodic multicast queries for neighboring nodes which can then be introduced into the leaf set so islands get reintegrated periodically - the same information might be used to update the routing tables as well.

One might also use some other kind of centralized coordinator to join separated islands on a random periodic basis.

This article is tagged: Programming, Tutorial

