I have been measuring the performance and optimising it for over a year now. When I first looked we were at 368 million transactions a day; today, with better software and faster hardware, we are closer to 900 million a day. This means that the software on modern hardware now has a peak performance, in ideal situations, about equal to Visa's peak levels of 10 years ago.
These numbers are not sustainable for various reasons. But they could be[1].
In my research I have identified several bottlenecks, and all of them are straightforward enough to solve by sheer engineering. No protocol changes are needed for any of this; just more mature software.
The bottlenecks I found are all in the Satoshi codebase, most of them present for a long time, some introduced by Core. This may make it tempting to just start a new client, but as any accomplished software engineer knows, dropping the codebase and starting over is not a viable option when we are required to follow the thousands of undocumented rules that make up Bitcoin, so that we always stay compliant with the network. I know people may try, but I can't imagine them ever truthfully being able to state they are Bitcoin compliant[2].

Bottlenecks (improvement points);

Representing block-chain data internally.

The transactions, the blocks and, generally speaking, the block-chain data have as a primary property that they are immutable. This is why we call Bitcoin a "chain": we only append more data, we never change old data.
To my surprise nobody has taken advantage of this property in the actual software. Whenever we receive a block, or open one for historical purposes, code going back to Satoshi reads the entire thing into a huge array of data structures. This is not only wasteful, it doesn't scale.
Measurements show that reading a single 100MB block causes about 400MB of memory to be malloced, in many thousands of chunks, roughly four per transaction. The overhead is highest during reading, but even afterwards a large percentage over the actual block size stays in memory at all times.
As we go up in block size, this creates problems in memory allocation and fragmentation. Failures in that subsystem are fatal and trigger an abrupt node shutdown, so larger block sizes make for less stable software.
I solved this with memory-mapped blocks, dropping the allocation to near zero regardless of block size.
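To illustrate the idea, here is a minimal sketch (not the actual Bitcoin Classic code; the class name and error handling are my own): a block file is mapped read-only so the kernel pages data in on demand, instead of the node copying every byte into freshly allocated structures.

```cpp
// Sketch only: map a block file read-only and hand out zero-copy views
// into it, instead of copying every transaction into malloced structures.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>
#include <string>

class MappedBlockFile {
public:
    explicit MappedBlockFile(const std::string &path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0)
            throw std::runtime_error("cannot open " + path);
        struct stat st;
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed on " + path);
        }
        m_size = static_cast<size_t>(st.st_size);
        m_data = ::mmap(nullptr, m_size, PROT_READ, MAP_SHARED, fd, 0);
        ::close(fd); // the mapping stays valid after closing the descriptor
        if (m_data == MAP_FAILED)
            throw std::runtime_error("mmap failed on " + path);
    }
    ~MappedBlockFile() { ::munmap(m_data, m_size); }

    // Zero-copy view; the kernel pages data in on demand and can evict it
    // under memory pressure, so resident memory stays small even for
    // very large blocks.
    const uint8_t *data() const { return static_cast<const uint8_t *>(m_data); }
    size_t size() const { return m_size; }

private:
    void *m_data = nullptr;
    size_t m_size = 0;
};
```

A parser can then walk transactions directly inside the mapping, only copying out the few fields it actually needs.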

Transaction validation is almost entirely done in a single thread.

Validation of incoming transactions, whether new or in a block, is done one after another; never multiple at the same time, even though the hardware can easily do that.
We have seen a small improvement in Core in this regard, but only for a very small set of use cases that practically never occur in production (due to compact-blocks/xthin). In practice this has remained unchanged since the Satoshi client.
What this means is that as the number of incoming transactions per second grows with larger user numbers, a node will choke, taking longer and longer to respond until it simply stops being useful. Buying faster computers won't help, because the software will still only use one core. Production hardware has 32 or more threads available, of which we use only one!
The solution for this is to refactor the validation engine and make it aware of multiple threads. Make it able to validate, process and reject the incoming data on all cores available to it.
I created a new framework for transaction validation which is fully parallel and which shows large improvements with minimal changes. Many more improvements seem possible. At minimum the validation code will be much more readable.
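To illustrate the general approach (this is a rough sketch under my own assumptions, not the framework itself; the type and function names are invented), the per-transaction checks that do not depend on other transactions can be fanned out over all cores, while the order-dependent UTXO updates stay serial:

```cpp
// Sketch: run context-free transaction checks on all available cores,
// then apply the order-dependent UTXO changes afterwards, in block order.
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

struct Tx { /* parsed transaction data */ };

bool checkTransaction(const Tx &tx) {
    // placeholder for the sanity and script checks that do not depend
    // on other transactions in the same block
    return true;
}

bool validateBlockParallel(const std::vector<Tx> &txs) {
    const size_t workers = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunk = (txs.size() + workers - 1) / workers;

    std::vector<std::future<bool>> results;
    for (size_t start = 0; start < txs.size(); start += chunk) {
        const size_t end = std::min(start + chunk, txs.size());
        results.push_back(std::async(std::launch::async, [&txs, start, end]() {
            for (size_t i = start; i < end; ++i)
                if (!checkTransaction(txs[i]))
                    return false;
            return true;
        }));
    }
    bool ok = true;
    for (auto &f : results)
        ok = f.get() && ok; // always collect every future
    return ok;
    // The actual spending of inputs and creation of outputs would still
    // be applied serially after this step; only the per-transaction
    // checks parallelise freely.
}
```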

The network (p2p) layer is unable to handle large data or multiple sources

Quite similar to the previous point, the network layer architecture has not been updated since Satoshi wrote its first version.
If a peer sends us a new block, we will not process the next incoming message, from any peer, until that block has been fully processed. Everything is single-threaded, and this means a newer CPU will have very limited effect.
Fundamentally, with this old architecture, as we go up in block size there will be various ways to slow a node to a crawl and make it useless, simply by sending it the right large data or by requesting large data from it.
Some easy fixes are possible, but honestly nothing will have a big impact until the previous bottlenecks are fixed.
In my research I wrote a new network manager which actually scales across all processor cores and which, unlike the current one, does not let big block data block small but important requests. If we were to re-create the p2p layer on top of it, this problem would be solved.
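To sketch the principle (again a simplified example of my own, not the network manager mentioned above), the socket thread can be reduced to enqueueing work, with a pool of workers doing the actual message processing, so one peer's 100MB block never stalls a small request from another peer:

```cpp
// Sketch: the socket thread only enqueues raw messages; a pool of
// workers processes them, so a huge block from one peer never blocks
// a small ping or getdata from another.
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class MessageDispatcher {
public:
    explicit MessageDispatcher(size_t workers = std::thread::hardware_concurrency()) {
        for (size_t i = 0; i < std::max<size_t>(1, workers); ++i)
            m_workers.emplace_back([this] { workLoop(); });
    }
    ~MessageDispatcher() {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_stop = true;
        }
        m_cond.notify_all();
        for (auto &t : m_workers)
            t.join();
    }

    // Called from the socket thread; returns immediately.
    void post(std::function<void()> handleMessage) {
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_queue.push(std::move(handleMessage));
        }
        m_cond.notify_one();
    }

private:
    void workLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_lock);
                m_cond.wait(lock, [this] { return m_stop || !m_queue.empty(); });
                if (m_stop && m_queue.empty())
                    return;
                job = std::move(m_queue.front());
                m_queue.pop();
            }
            job(); // block validation, tx relay, ping, etc.
        }
    }

    std::vector<std::thread> m_workers;
    std::queue<std::function<void()>> m_queue;
    std::mutex m_lock;
    std::condition_variable m_cond;
    bool m_stop = false;
};
```

The queue here is deliberately simplistic; a real implementation would also prioritise small messages and bound the memory used per peer.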

UTXO database slowdown due to growth

This bottleneck was actually introduced by Core; it did not exist in the original Satoshi code. The change in version 0.15 of Core made it slightly better, but the problem remains[3].
The growth of the unspent outputs (essentially, who owns how much money) is stored in a rather inappropriate manner: large immutable data is kept in a dynamic database. It is as if we stored entire MP3 files in a database and then wondered why searching became slow.
The direct result is that if we add a lot of users, the database will become so slow that it eventually slows down everything else in the system. This is made worse by a second problem: the UTXO database can only ever be modified from one thread.
There are many ways to solve this, the most important being that we move the script (the bulky file data) out of the records we actually select on: the unspent outputs themselves.
Secondly, some database knowledge leads us to conclude that this can be done quite easily on top of a standard SQL database, in which we can then store the main data.
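As a rough illustration of that split (the table layout, column names and the choice of SQLite are my own assumptions, used only as a stand-in for "a standard SQL database"), the small, hot lookup data is what gets indexed, while the bulky scripts are only referenced from a separate append-only store:

```cpp
// Sketch: keep txid/vout/amount in the indexed SQL table and point at
// the script data stored elsewhere, so queries never drag the scripts in.
#include <sqlite3.h>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

void createUtxoSchema(sqlite3 *db) {
    const char *ddl =
        "CREATE TABLE IF NOT EXISTS utxo ("
        "  txid      BLOB NOT NULL,"     // 32-byte transaction id
        "  vout      INTEGER NOT NULL,"  // output index
        "  amount    INTEGER NOT NULL,"  // satoshis
        "  script_id INTEGER NOT NULL,"  // reference into the external script store
        "  PRIMARY KEY (txid, vout)"
        ") WITHOUT ROWID;";
    char *err = nullptr;
    if (sqlite3_exec(db, ddl, nullptr, nullptr, &err) != SQLITE_OK) {
        std::string msg = err ? err : "schema creation failed";
        sqlite3_free(err);
        throw std::runtime_error(msg);
    }
}

void insertUtxo(sqlite3 *db, const std::vector<uint8_t> &txid, int vout,
                int64_t amount, int64_t scriptId) {
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO utxo (txid, vout, amount, script_id) VALUES (?, ?, ?, ?)",
        -1, &stmt, nullptr);
    sqlite3_bind_blob(stmt, 1, txid.data(), int(txid.size()), SQLITE_TRANSIENT);
    sqlite3_bind_int(stmt, 2, vout);
    sqlite3_bind_int64(stmt, 3, amount);
    sqlite3_bind_int64(stmt, 4, scriptId);
    sqlite3_step(stmt);      // error handling omitted in this sketch
    sqlite3_finalize(stmt);
}
```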

Results of proof-of-concept

These bottlenecks are each a rather large project to fix and will take time to do correctly; priority should be on quality. To me the solution to each of them is obvious, but the details are what slow us down.
As hinted in the text above, several of these projects are well underway, so I can already show some results.
I think the most exciting project is the block validation engine, which is optimised for parallel processing. As an extra I added benchmarking options which can be enabled at compile time. Using these benchmarks I can show some graphs of what we gained.
🠳 Full initial block sync with small optimisations only. Total time; 265 min.
https://ibb.co/mTfb0w
🠳 Now using the new validation engine, which gives us more detailed timing benchmarks. Notice that the green bar is the UTXO, taking roughly 80% of the time. Notice also the 'knee' in the green area about halfway through: it gets slower after a certain amount of usage. This is a LevelDB problem, as we can see below.
https://ibb.co/kuJktG
🠳 The last image replaces the UTXO store with a std::map, pure in-memory storage. The result is a complete full import of 8 years of block-chain data in 136 minutes of CPU time. The actual wall-clock time for the 8-year import was around 2 hours.
https://ibb.co/f3dktG

Final words

This is a proof of concept; most of the benefits of the different ideas have not yet been realised. The initial results are promising, though. We know we can read 150GB in 2 hours while building a UTXO set out of 8 years of Bitcoin data. I expect we can do this in at most 30 minutes once the bottlenecks I described are all taken care of. Additionally, no limitations are expected due to block size, even if blocks reach many gigabytes.
I do this work in a completely Free and Open Source (copyleft) implementation based on the original and most used Satoshi codebase. Many of these improvements can be found in Bitcoin Classic today. I welcome anyone interested to contact me if they want to help out.
Footnotes;
1) Many people have tried the quick fix of simply increasing the static limits coded by Satoshi, but eventually they all realised that this will not get them very far because of the bottlenecks this article explains.
2) Writing compliant software while there is no paper documenting the actual protocol is going to become a larger and larger issue going forward. A secondary goal of slowly massaging this codebase into shape is to end up with a readable codebase that can be used to document the protocol.
3) In Core 0.15 the UTXO database structure was changed, which has a positive effect on memory consumption. This was a very reactive response to an attack, without realising that it ignored the actual problem.