If we want to achieve thousands of transactions per second, Bitcoin software
needs to be modified to run in a multi-node, multi-core environment. This is
the only way we can achieve visa-level throughputs. The `bitcoind` service will
not be a single process anymore. It would be composed of multiple processes
running on different machines, all working together to perform the same,
overall `bitcoind` functions. This model is, sometimes called scale-out
architecture, how Google, Facebook, etc. scale their services.
When working at such a scale, it is typically assumed that data won't fit into
any one single machine and also, data won't be always available in memory -- in
other words, data is much much larger than the working-set and cannot be cached
Here are few concepts used in implementing such services:
Sharding is a mechanism where data-set is divided into disjoint partitions. One
or more nodes (typically 3 or 5 if we need fault-tolerance) take the
responsibility for one of the partitions. Responsibilities include, efficient
storage for the objects in the partition, read/write access to the objects in
the partition, performing any higher-level operations on the objects in the
Consistent-Hashing is a technique to distribute the partitions among physical
nodes in the cluster, such that, partition migrations can be avoided (or
minimized) when nodes are added or removed or failed in the cluster. Sharding
and Consistent-Hashing together give us linear throughput, i.e., throughput is
directly-proportional to the number of nodes in the cluster -- adding more
nodes gives you more throughput.
Leader-Election picks one of the nodes the cluster to perform coordination
tasks in a fault-tolerant way. Not all tasks of a service can always be
sharded, so one (or more) of the nodes takes the leader role to perform the
centralized tasks. In the event of a node failure, Leader-Election
automatically picks another node for the leader role/tasks.
There are several other technologies like replication, write-ahead-logs,
replicated write-ahead-logs, etc. that are also necessary to build a full
scale-out solution, but lets ignore them for now.
In the context of Bitcoin, the data-set consists of higher-level objects like,
Mempool, UTXO Database, Merkle-Tree, etc. composed of low-level objects like
Transactions, OutPoints, Blocks, etc.
Some operations on the data-set include (1) distributed UTXO db management, (2)
distributed Merkle-Tree construction for the mempool, (3) additions and
removals of transactions to/from the distribute Merkle-Tree, (4) searching for
transaction outpoints in the distributed mempool, (5) migrating transactions
from the mempool into UTXO when committing a block (6) purging committed
transactions out of the merkle-tree when committing a block (7) keeping track
of multiple chaintips (i.e., different mempool-views, merkle-tree tips), etc.
when conflicting blocks are yet to be resolved (8) receiving/syncing new block
quickly using graphene, etc.
Solving most of these problems efficiently requires CTOR and Lexical-Ordering
is a goto candidate (in distributed systems community at least). For example,
Here is how such a system looks-like in a cluster environment:
Lets assume we have N nodes in the cluster.
We partition the tx-hash-space into N disjoint lexical-ranges (ignoring
virtual-nodes concept from Dynamo paper) and associate one node as the owner
for each range. This configuration takes O(number-of-nodes) space and all nodes
keep this in-memory. Given this organization, all nodes *know* which node to
contact for operations on a tx-hash in constant-time.
Primary operations on the UTXO database are Insert, Find and Delete by
OutPoints which are mostly tx-hashes. So, every node *knows* whom to contact
for an OutPoint in O(1) time becuase partitioning-scheme is cached in every
Mempool is mostly similar to UTXO, but for unconfirmed transactions. Mempool
also has the same Insert, Find and Delete operations like UTXO. So, same
comments apply as above.
Merkle-Tree is constructed on top of mempool. If we have X unconfirmed
transactions, with Lexical-Ordering, merkle-tree can be split into N sub-trees
each hosted by one node. Each node maintains a sub-tree root for *its* part of
the mempool trasactions in the tx-hash-space.
Leader node in the cluster computes the Merkle-Tree root (perhaps,
periodically) by picking the subtree roots of all N nodes in the Lexical-Order
of their partitions. This root will be used to compute the block-header for
mining by the leader node.
Additions and removals of transactions to/from the merkle-tree are performed
only by the node which is the owner of tx-hash-space partition; other nodes,
when they see a tx, can simply ignore the transaction when it is not it's
partition owner. Merkle-tree updates to the sub-tree are completely node-local
and do not incur any coordination.
Upon receiving a block, all nodes could verify the
block-header/nonce/difficulty locally. When the block is to be committed,
transactions from the mempool are flushed/committed into the UTXO. If
partitioning-scheme is the same for UTXO and Mempool then moving txs from
mempool to utxo will be node-local and doesn't incur any network-traffic.
Upon receiving a block, we need to purge confirmed transactions out of the
merkle-tree, so that next merkle-tree-root for the block-header doesn't include
the confirmed transactions. This task would also be node-local and doesn't
incur any network-traffic.
If conflicting blocks arrive, mempool and merkle-tree can be (conceptually)
cloned into two logical-instances. So, each node in the cluster keeps track of
multiple subtrees with different subtree root one for each conflicting block.
Above (incomplete) ideas are *one-way* to build scale-out bitcoin
infrastructure that can give us throughput in thousands of transactions per
second. Given my experience, I think Lexical-Ordering is a good choice to reach
Please ignore any typos or grammatical mistakes. Thanks.
No one has reviewed this piece of content yet