Motivation

If we want to achieve thousands of transactions per second, Bitcoin software
needs to be modified to run in a multi-node, multi-core environment. This is
the only way we can achieve Visa-level throughput. The `bitcoind` service will
no longer be a single process; it will be composed of multiple processes
running on different machines, all working together to perform the same
overall `bitcoind` functions. This model, sometimes called scale-out
architecture, is how Google, Facebook, etc. scale their services.

When working at such a scale, it is typically assumed that the data won't fit
on any single machine and won't always be available in memory -- in other
words, the data-set is much larger than the working set and cannot be fully
cached in RAM.

Here are a few concepts used in implementing such services:

Sharding
Consistent-Hashing
Leader-Election

Sharding is a mechanism where the data-set is divided into disjoint
partitions. One or more nodes (typically 3 or 5 if we need fault-tolerance)
take responsibility for each partition. Responsibilities include efficient
storage of the objects in the partition, read/write access to those objects,
performing any higher-level operations on them, etc.

Consistent-Hashing is a technique to distribute the partitions among the
physical nodes in the cluster such that partition migrations are avoided (or
minimized) when nodes are added, removed, or fail. Sharding and
Consistent-Hashing together give us linear scaling, i.e., throughput is
directly proportional to the number of nodes in the cluster -- adding more
nodes gives you more throughput.
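
To make this concrete, here is a minimal consistent-hashing sketch in Python
(hypothetical node names, no virtual nodes or replication): each node owns the
arc of the hash ring up to its position, so adding or removing a node only
re-maps the keys on the neighbouring arc.

```python
import bisect
import hashlib

def ring_position(key: bytes) -> int:
    """Map a key (node id or tx-hash) onto a 2^256-sized hash ring."""
    return int.from_bytes(hashlib.sha256(key).digest(), "big")

class ConsistentHashRing:
    def __init__(self, nodes):
        # One ring position per node; real systems add virtual nodes and
        # replicas, both ignored in this sketch.
        self._ring = sorted((ring_position(n.encode()), n) for n in nodes)

    def owner(self, key: bytes) -> str:
        """The first node clockwise from the key's position owns the key."""
        idx = bisect.bisect(self._ring, (ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]

# Hypothetical node names.
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.owner(bytes.fromhex("ab" * 32)))  # owner of that tx-hash
```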

Leader-Election picks one of the nodes in the cluster to perform coordination
tasks in a fault-tolerant way. Not all tasks of a service can always be
sharded, so one (or more) of the nodes takes the leader role to perform the
centralized tasks. In the event of a node failure, Leader-Election
automatically picks another node for the leader role/tasks.
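
As an illustration only, here is a deliberately simplified, bully-style
election sketch (the `is_alive` membership check is a hypothetical stand-in
for heartbeats): the lowest-id live node becomes leader, and when it fails the
next-lowest id takes over. Production systems would use ZooKeeper, etcd, or
Raft instead.

```python
def elect_leader(node_ids, is_alive):
    """Pick the lowest-id live node as leader (bully-style sketch).

    node_ids: iterable of node identifiers.
    is_alive: callable(node_id) -> bool, e.g. backed by a membership
              service with heartbeats (not specified here).
    """
    live = [n for n in sorted(node_ids) if is_alive(n)]
    if not live:
        raise RuntimeError("no live nodes -- cluster is down")
    return live[0]

# If node-1 fails, the next election automatically picks node-2.
leader = elect_leader(["node-1", "node-2", "node-3"],
                      is_alive=lambda n: n != "node-1")
assert leader == "node-2"
```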

There are several other technologies like replication, write-ahead-logs,
replicated write-ahead-logs, etc. that are also necessary to build a full
scale-out solution, but let's ignore them for now.

Problems

In the context of Bitcoin, the data-set consists of higher-level objects like
the Mempool, the UTXO database, the Merkle-Tree, etc., which are composed of
low-level objects like transactions, outpoints, and blocks.

Some operations on the data-set include: (1) distributed UTXO database
management, (2) distributed Merkle-Tree construction for the mempool, (3)
additions and removals of transactions to/from the distributed Merkle-Tree,
(4) searching for transaction outpoints in the distributed mempool, (5)
migrating transactions from the mempool into the UTXO database when committing
a block, (6) purging committed transactions out of the Merkle-Tree when
committing a block, (7) keeping track of multiple chain-tips (i.e., different
mempool views and Merkle-Tree tips) while conflicting blocks are yet to be
resolved, and (8) receiving/syncing new blocks quickly using Graphene.

Solving most of these problems efficiently requires CTOR, and Lexical-Ordering
is the go-to candidate (in the distributed-systems community at least). For
example, here is how such a system could look in a cluster environment:

Let's assume we have N nodes in the cluster.

We partition the tx-hash-space into N disjoint lexical ranges (ignoring the
virtual-nodes concept from the Dynamo paper) and assign one node as the owner
of each range. This configuration takes O(number-of-nodes) space and every
node keeps it in memory. Given this organization, every node *knows* which
node to contact for operations on a tx-hash in constant time.
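
A minimal sketch of such a partition map (hypothetical node names, N = 4, no
replication): each node owns one lexical range of the 256-bit tx-hash space,
and the owner lookup is a purely local computation.

```python
# Split the 256-bit tx-hash space into N equal lexical ranges and assign one
# owner per range.  The table is O(N) and cached on every node.
NODES = ["node-0", "node-1", "node-2", "node-3"]   # hypothetical names
N = len(NODES)
RANGE = (1 << 256) // N

def owner_of(tx_hash: bytes) -> str:
    """Return the node that owns this tx-hash (O(1), no network round-trip)."""
    value = int.from_bytes(tx_hash, "big")
    return NODES[min(value // RANGE, N - 1)]

# An OutPoint is (tx-hash, output index); routing only looks at the tx-hash.
def owner_of_outpoint(tx_hash: bytes, vout: int) -> str:
    return owner_of(tx_hash)
```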

The primary operations on the UTXO database are Insert, Find, and Delete by
OutPoint, and an OutPoint is essentially a tx-hash plus an output index. So
every node *knows* whom to contact for an OutPoint in O(1) time, because the
partitioning scheme is cached on every node.

The mempool is similar to the UTXO database, but for unconfirmed transactions.
It supports the same Insert, Find, and Delete operations, so the same comments
apply as above.

The Merkle-Tree is constructed on top of the mempool. With Lexical-Ordering, a
Merkle-Tree over X unconfirmed transactions can be split into N sub-trees,
each hosted by one node. Each node maintains a sub-tree root for *its* portion
of the mempool transactions in the tx-hash-space.
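
Here is a sketch of how one node could compute the sub-tree root over its own
lexical slice of the mempool (Bitcoin-style double-SHA256; the sub-tree is
simply rebuilt from the sorted tx-hashes rather than updated incrementally):

```python
import hashlib

def dsha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def subtree_root(tx_hashes):
    """Merkle root over this node's slice of the mempool.

    tx_hashes must already be in lexical (CTOR) order; with Lexical-Ordering
    that is simply the sorted order of the hashes.
    """
    level = list(tx_hashes)
    if not level:
        return b"\x00" * 32              # placeholder for an empty slice
    while len(level) > 1:
        if len(level) % 2:               # Bitcoin duplicates the last hash
            level.append(level[-1])
        level = [dsha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```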

The leader node in the cluster computes the Merkle-Tree root (perhaps
periodically) by combining the sub-tree roots of all N nodes in the lexical
order of their partitions. The leader then uses this root to build the
block-header for mining.
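
A hedged sketch of the leader's side, reusing `dsha256` from the sub-tree
sketch above: it collects each node's sub-tree root in the lexical order of
their partitions and folds them pairwise into one root. (A real implementation
would have to keep the sub-trees power-of-two aligned for this combined root
to equal the root of a single flat Merkle-Tree; that detail is glossed over
here, and the RPC used to fetch the roots is hypothetical.)

```python
def combined_root(subtree_roots_in_lexical_order):
    """Fold the N per-node sub-tree roots into one root, pairwise."""
    level = list(subtree_roots_in_lexical_order)
    if not level:
        raise ValueError("need at least one sub-tree root")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [dsha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Leader, conceptually (network RPC not shown):
# roots = [rpc_get_subtree_root(n) for n in NODES]   # hypothetical RPC
# merkle_root = combined_root(roots)
```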

Additions and removals of transactions to/from the Merkle-Tree are performed
only by the node that owns the corresponding tx-hash-space partition; any
other node that sees a transaction it does not own can simply ignore it.
Sub-tree updates are completely node-local and do not incur any coordination.
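
For example, the per-node handler for a newly seen transaction could look like
the following sketch (hypothetical `SELF` and `local_mempool` names, reusing
`owner_of` and `subtree_root` from the sketches above): transactions outside
the node's own range are dropped without any coordination.

```python
SELF = "node-2"                 # this node's identity (hypothetical)
local_mempool = {}              # tx_hash -> raw tx, only this node's slice

def on_transaction(tx_hash: bytes, raw_tx: bytes):
    if owner_of(tx_hash) != SELF:
        return None             # not our partition: ignore, no coordination
    local_mempool[tx_hash] = raw_tx
    # The sub-tree depends only on local data, so refreshing its root never
    # touches the network.
    return subtree_root(sorted(local_mempool))
```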

Upon receiving a block, every node can verify the
block-header/nonce/difficulty locally. When the block is committed, its
transactions are flushed from the mempool into the UTXO database. If the
partitioning scheme is the same for the UTXO database and the mempool, moving
transactions from the mempool to the UTXO database is node-local and incurs no
network traffic.
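
Continuing the same sketch: committing a block on each node is then a purely
local move of the confirmed transactions from its mempool slice into its UTXO
slice, since both are partitioned by the same tx-hash ranges (output values,
scripts, and transactions seen only in the block are elided here).

```python
local_utxo = {}                 # (tx_hash, vout) -> output, this node's slice

def commit_block(block_tx_hashes):
    """Flush this node's share of a confirmed block: mempool -> UTXO."""
    for tx_hash in block_tx_hashes:
        if owner_of(tx_hash) != SELF:
            continue            # some other node commits that transaction
        raw_tx = local_mempool.pop(tx_hash, None)
        if raw_tx is None:
            continue            # tx arrived only via the block; not handled here
        # Real code would parse raw_tx and insert every output; elided.
        local_utxo[(tx_hash, 0)] = raw_tx
    # The next subtree_root(sorted(local_mempool)) no longer includes the
    # committed transactions -- all without any network traffic.
```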

Upon receiving a block, we need to purge the confirmed transactions out of the
Merkle-Tree, so that the next Merkle-Tree root for the block-header does not
include them. This task is also node-local and incurs no network traffic.

If conflicting blocks arrive, the mempool and Merkle-Tree can be
(conceptually) cloned into two logical instances. Each node in the cluster
then keeps track of multiple sub-trees with different sub-tree roots, one for
each conflicting block.
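
One very rough way to express that cloning (a full copy is shown for clarity;
a real implementation would use copy-on-write structures, and the tip names
are hypothetical):

```python
# One logical mempool view per unresolved chain-tip.
views = {"tip-A": dict(local_mempool)}     # view for the current tip

def fork_view(parent_tip: str, new_tip: str) -> None:
    """Clone the parent view when a conflicting block creates a new tip."""
    views[new_tip] = dict(views[parent_tip])

def subtree_root_for_tip(tip: str) -> bytes:
    return subtree_root(sorted(views[tip]))
```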

Summary

The above (incomplete) ideas are *one way* to build scale-out Bitcoin
infrastructure that can give us throughput in the thousands of transactions
per second. Given my experience, I think Lexical-Ordering is a good choice to
get there.

Please ignore any typos or grammatical mistakes. Thanks.

 

Comments

If we want to achieve thousands of transactions per second,
Back in 2016 Flowee did 4000 transactions a second.
Currently the number is closer to 50,000 transactions per second.
The ideas are not new, but I really hope I can see the code to back up your research so we can actually see if it works.
In the meantime I don't want lexical ordering because it destroys the
already-working parallel validation in Flowee. That would be a shame, given
that nobody actually has anything faster. Just theories.

@TomZ
"At scale, everything breaks" -- Google. If 50,000 is working already, try
100,000 tps or 1,000,000 tps and I am sure things will break.
Fortunately, these ideas are not theory. There are systems built on these
ideas that have been running for a decade by now. I have added links to the
corresponding papers containing performance-evaluation numbers below. We don't
need to re-evaluate these ideas; instead, we can build on top of them.
Given that we have so much contention and politics around protocol changes at
the current adoption levels, it is only going to get worse in the future. We
should make such changes as early as possible.
