I’ve got an idea for extreme data compression. It would allow for essentially arbitrarily large amounts of data to be transmitted or stored in a very small space – even within the OP_Return space on the Bitcoin Cash blockchain.
Disclaimer: I am a complete and utter novice. I don’t know anything, nor do I have any technical expertise. This idea popped into my head a few nights ago when thinking about how to send longer messages through memo.cash. However, I’ve run this idea past a few people I respect, and some parts look promising. It might have a catastrophic flaw or assumption that I’m missing – or it might already be in common use, which wouldn’t be surprising – but it’s good enough to share and risk looking silly. Since I don’t speak the technical language, I can only communicate the big-picture concepts without technical precision.
The general idea is this: compress data by assigning unique characters to unique patterns of data, then store or transmit the characters instead of the data. The characters can then be decompressed to reconstruct the original data. I call it “character compression.”
There are currently more than 130,000 unique characters in Unicode alone, which could be a substantial enough set for significant data compression.
So, let’s take a simple example. Let’s say I want to communicate the phrase, “Bitcoin Cash is peer-to-peer electronic cash.” That’s 45 individual characters. Now imagine I assign the following, using Armenian characters, which are 2 bytes each in UTF-8:
“Bitcoin Cash” = “Ի”
“Is” = “Թ”
“Peer-to-peer” = “Զ”
“Electronic cash” = “Բ”
Call these four assignments a “compression key.” Now, I can communicate the phrase “Bitcoin Cash is peer-to-peer electronic cash” using only 4 characters – ԻԹԶԲ. If the receiver of the information shares the same compression key, then he’ll be able to read the 4 symbols and fully reconstruct the phrase “Bitcoin Cash is peer-to-peer electronic cash.” Simple. In this case, the message goes from about 45 bytes down to 8 – a more than 80% reduction in size.
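(To make the mechanics concrete, here’s a minimal sketch in Python of what using a shared compression key could look like. The key contents and function names are made up for illustration – this isn’t an existing standard or library.)

# Minimal sketch of "character compression" with a shared key.
# The key below is a toy example; a real key would be a large,
# agreed-upon table shared by sender and receiver.
COMPRESSION_KEY = {
    "Bitcoin Cash": "Ի",
    "is": "Թ",
    "peer-to-peer": "Զ",
    "electronic cash": "Բ",
}
DECOMPRESSION_KEY = {symbol: phrase for phrase, symbol in COMPRESSION_KEY.items()}

def compress(text):
    # Replace longer phrases first so "electronic cash" wins over plain "cash".
    for phrase in sorted(COMPRESSION_KEY, key=len, reverse=True):
        text = text.replace(phrase, COMPRESSION_KEY[phrase])
    return text

def decompress(code):
    for symbol, phrase in DECOMPRESSION_KEY.items():
        code = code.replace(symbol, phrase)
    return code

message = "Bitcoin Cash is peer-to-peer electronic cash."
packed = compress(message)
print(packed, "-", len(packed.encode("utf-8")), "bytes")  # the 4 symbols plus spaces and the period
print(decompress(packed) == message)                      # True

(A real implementation would match whole words rather than raw substrings, but this shows the shape of the idea.)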
So, I propose we make a public database for assigning specific characters to the most used English words. Apparently, there are about 3,000 English words that make up 95% of English communication. To start, we could assign an individual character to each of those 3,000 words. We could use any of the characters not in use by English speakers, like Japanese kanji. “感” could mean anything. That way, we could communicate most English words simply by sharing the corresponding characters with each other. The fewer bytes taken up by the chosen character, the better.
Take, for example, the play Romeo and Juliet. It’s got about 25,000 individual words. If each word had its own character, we could compress Romeo and Juliet down to 25,000 characters. Still not small enough to fit inside a BCH transaction, but we can get much smaller.
Extreme compression
With 135,000 characters to use right now, and, say, 5,000 assigned to the words in Romeo and Juliet, that still leaves about 130,000 characters that we’re free to assign. Here’s where the biggest compression happens: what if we assign unique characters to the most popular patterns in English speech – multiple words at a time, including the spaces? The words “give or take” are 12 characters together, but they come up all the time in written communication. What if we assigned “give or take” its own character?
You could have one list of, say, the top 100,000 most common English phrases (each a few words long), and compress each one down to a single character. Communicate the character – or put it inside the blockchain – and then somebody else with the same compression key could reconstruct the original data.
In other words, you could essentially hack the character limit on platforms like Twitter or memo.cash. With Twitter, you might only be able to use 280 characters, or with memo.cash, you might only be able to send 80 bytes worth of compressed characters per transaction, but those characters would correspond to a much larger amount of information reconstructed on the receiver’s computer. The reduction in bandwidth would be enormous.
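(As a rough sanity check on what fits, here’s a little arithmetic in Python. The 80-byte figure is just the number used above, and the average phrase length is an assumption – actual OP_Return limits and dictionaries will differ.)

# Sketch: how many 2-byte dictionary symbols fit in an 80-byte payload,
# and how much English text they could stand for. Numbers are illustrative.
PAYLOAD_BYTES = 80        # assumed OP_Return budget from above
BYTES_PER_SYMBOL = 2      # e.g. Armenian characters in UTF-8
AVG_PHRASE_BYTES = 12     # assumed average length of a dictionary phrase

symbols = PAYLOAD_BYTES // BYTES_PER_SYMBOL
print(symbols, "symbols per transaction")                          # 40
print(symbols * AVG_PHRASE_BYTES, "bytes of reconstructed text")   # 480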
Even phrases made of short, common words could be reduced by 50%. Take these random phrases:
“And then,”
“To be”
“To the”
“The other”
“A different”
“Not again,” etc.
All of these could be compressed more than 50% if they had their own characters. The longer the phrases or the bigger the words, the more compression.
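(Here’s a quick way to check that arithmetic for the phrases above, assuming each one is replaced by a single 2-byte symbol – the symbol size is an assumption, since different Unicode ranges cost different amounts.)

# Sketch: percentage saved if each phrase is replaced by one 2-byte symbol.
phrases = ["And then", "To be", "To the", "The other", "A different", "Not again"]
for p in phrases:
    original = len(p.encode("utf-8"))
    saved = 100 * (original - 2) / original
    print(f"{p}: {original} bytes -> 2 bytes ({saved:.0f}% smaller)")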
(Also, in the long run, why stop at 135,000 characters? Why couldn’t we come up with 10,000,000 individual characters that we could assign to unique and more complex pieces of data, to compress things even further? We could create new, 1-byte characters that are not in use in any language, for the sole purpose of assigning common patterns to them. I don’t know the process involved in creating these things, so maybe this is impossible for some reason.)
All of this compression and reconstruction could be done behind the scenes. Humans wouldn’t need to see the symbols; they’d just see the reconstruction of it. Instead of communicating raw data with somebody, you’d instead be communicating the blueprint for reconstructing the data locally.
Public Database
This system requires creating a public database – the compression key that we all share. I don’t think it would be hard. It’d simply be a very large list of words and phrases, with their corresponding symbols that they get compressed to. The key could be hosted locally on people’s computers (kind of like downloading a new font for your computer to understand), or it could be hosted on something like Github. We could create a kind of universal standard for the compression of ordinary English into characters. Then, your computer would either automatically convert the phrases to characters, or you might have something like a Chrome plugin that does it for you – or perhaps individual companies will offer this kind of compression on their platform (I’m lookin’ at you, Memo.cash).
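(To be concrete about what that shared database could look like, here’s one hedged sketch: a plain JSON file mapping phrases to symbols, loaded by whatever client does the converting. The file name and contents are purely illustrative.)

# Sketch: loading a hypothetical shared compression key from a JSON file.
# Example contents of "english_key_v1.json" (made up for illustration):
#   { "Bitcoin Cash": "Ի", "peer-to-peer": "Զ", "give or take": "感" }
import json

with open("english_key_v1.json", encoding="utf-8") as f:
    key = json.load(f)

reverse_key = {symbol: phrase for phrase, symbol in key.items()}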
Playing out the idea further, companies could even have their own unique compression keys, if particular strings of data come up a lot in their systems. There’s no reason why characters would have to be assigned to English words only; they could also correspond to particular pieces of computer code, or any other string of data that’s in a repeated pattern.
If it’s not already widely in use, this system could not only dramatically reduce the amount of information communicated to people, but it might also dramatically reduce the amount of information stored on hard drives. Think how many times Google has the phrase “I am hungry” stored on their hard drives, throughout all the emails and YouTube comments they host. If the data is duplicated over and over, it could be dramatically reduced by replacing the phrases with a single character, then storing the compression key for data reconstruction when necessary. Heck, you could write a computer program that specifically looks for repeated patterns of information, assigns those patterns a particular symbol, then overwrites the data with a single character and keeps the compression key for later reconstruction. These compressed symbols become placeholders for patterns. Maximum compression would be the replacement of all repeating patterns with single-character placeholders. Again, I don’t know if this is something that’s already done, but if it’s not, it seems like a good idea.
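(Here’s a hedged sketch of what that kind of pattern-finder might look like, using simple word n-gram counting. Real deduplication tools and dictionary coders are far more sophisticated, but this shows the shape of the idea. The use of Private Use Area characters as placeholders is just an assumption for the example.)

# Sketch: find the most repeated multi-word patterns in a text and
# assign each one a placeholder symbol, producing a per-text compression key.
from collections import Counter

def build_key(text, max_entries=100, ngram=3):
    words = text.split()
    counts = Counter(
        " ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)
    )
    # Only patterns that actually repeat are worth a dictionary entry.
    repeated = [p for p, c in counts.most_common(max_entries) if c > 1]
    # Placeholder symbols taken from the Unicode Private Use Area, simply
    # because those code points carry no other meaning.
    return {pattern: chr(0xE000 + i) for i, pattern in enumerate(repeated)}

def apply_key(text, key):
    for pattern, symbol in sorted(key.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(pattern, symbol)
    return text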
This idea is very broad and open-ended. Who knows: there could be industry-standard conventions for character compression that differ per language or use-case. Or, it could be used to power a computer program that receives its instructions by monitoring the blockchain for specific inputs. Large amounts of data could be written/executed remotely by communicating only a small number of symbols. Or, if your data isn’t compact enough, you could simply chain together BCH transactions to encode larger and larger amounts of data that could all be decompressed by the same key. Or, perhaps you could compress the compression with a different key – finding patterns in your compression and then using a secondary key to compress even further.
Perhaps this could even be used in P2P file sharing. Instead of sharing the entire file with somebody, perhaps you could share the compression key with them plus the instructions for reconstructing the data on their own machines. (Again, this key could be unique to the particular data that’s being compressed. It could maximize the compression on a per-file basis. So, every time that a common pattern appears, it’s compressed down to one character. The more repetition, the more compression.) Or maybe it could just be used to compress information between two people, who share their own unique keys.
Or maybe it’s just a clever hack to pack more information into limited blockspace. What do you think? There are obviously lots of details to work out, but it seems like this system could help us pack more information into tight spaces like OP_Return.
 

Comments
  earned 25.0¢
This seems really smart to me. I'm surprised it's not a thing already!
  earned $2.45
You may want to read about lossless data compression. People have been working on this problem for quite a long time. The best compression does an amazing job of removing statistical redundancy (the generalized idea you are describing) from data. https://en.wikipedia.org/wiki/Data_compression#Lossless
It is possible that there is room to optimize one of the standard algorithms to fit the idea of limited text data --> compressed text data. It has to generalize to all languages, vocabularies, etc. The end result would probably get about the same result as existing techniques.
  earned 25.0¢
@emergent_reasons - "It has to generalize to all languages, vocabularies, etc." But no it doesn't. That's pretty explicit. English speakers would have a very strong use-case for themselves.
  earned 0.0¢
This is why I love Bitcoin, constant thinking outside the box!
  earned 25.0¢
What about the rest of the world that isn’t writing in English?
  earned 10.0¢
@Steve Patterson I prefer to stick to "keep it simple" until we really have a clear need for something more complex.
However if Bitcoin achieves large scale adoption, a standards group will emerge and one of the components will be compression. A natural language compression option like you are suggesting will almost certainly be an option. There will be other options as well such as purely algorithmic methods. I would be willing to bet a large amount of money that this problem (compressing small data / language packets) is well studied.
I actually hope that we can find a way to avoid that kind of external and static database/standards by using Bitcoin itself. Otherwise it opens up vectors for capture of Bitcoin.
  earned 25.0¢
@emergent_reasons - I agree simpler is better. I think the system I'm describing is pretty simple, though. And optional. We're talking about platforms like memo, whose whole existence is based on the idea of putting natural language messages into the blockchain, using this on their backend. I don't think that's a big attack vector. I do hope that people are developing these kinds of things. I just don't see them and can't use them. So hopefully they'll release it soon, so we can share longer messages with each other.
  earned 70.0¢
Now you have me thinking about an on-chain compression algorithm...
  earned $1.00
I have been very excited about the idea of embedding RDF databases into BCH addresses. If you did something like the Datomic database and combined it with some compression scheme like you are describing here, you could write Single Page Applications that store all of their preferences in the blockchain. Rather than having to store all of your usernames, passwords, and site preferences in a MySQL db that you have to pay to maintain someplace, the client app could drop all those things into a BCH address via op codes, for a one-time fee. If they came back years later, the data would still be there. You could encrypt each entry and deposit them in many different Cash Wallets, so it may be possible that you could securely store private data on a public chain. No permission is needed to do this kind of thing. It will be interesting to see what effect this will have on scaling. We will soon see if "Scaling isn't a problem" is really true.
  earned 25.0¢
Current text compression algorithms are very good. The problem with your approach is that the larger the piece of data you assign to a key (like words with more than 5 characters), the less frequently it will occur in any given text. Words like 'a', 'the', and 'where' are common and frequent, but longer words do not occur frequently enough to be used for efficient pattern compression. Whereas randomly emerging patterns of sequential non-word characters are likely to occur more frequently, even though the patterns themselves would be shorter.
  earned 25.0¢
Reminds me of my idea for super compression that I had almost twenty years ago. Since millions of people use the same operating systems, like Windows Home or Windows Pro, the OS could be used as a common data sink: slice a video file, for example, into small chunks of bytes that match small chunks of bytes in the OS binaries, and then instead of sending a large video file you could send a smaller reference file with small keys that point to large chunks of bytes in the OS that the receiver already has. I wrote this in Python, but it turned out that the keys (addresses) pointing into the OS were larger than the data to be retrieved. So, that didn't work.

Made me think of doing the same with a blockchain instead of an OS. But a blockchain is vastly larger than an OS, so the keys would end up being much larger than the data patterns, I would think. Well, to be honest, the OS keys and patterns were close, but I didn't have the time nor the expertise to find a solution. Perhaps someone today could make the common-blockchain-sink compression tech work.
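(Back-of-the-envelope in Python, the pointer overhead is easy to see – the corpus sizes below are just illustrative guesses.)

# Sketch: minimum bytes needed just to address a byte offset into a shared corpus.
import math

def offset_bytes(corpus_size_bytes):
    return math.ceil(math.log2(corpus_size_bytes) / 8)

print(offset_bytes(20 * 10**9))    # ~20 GB OS install: 5 bytes per pointer
print(offset_bytes(150 * 10**9))   # ~150 GB blockchain: 5 bytes per pointer
# Each reference also needs a length field, so matched chunks have to be
# comfortably larger than the pointer-plus-length overhead to save anything.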
  earned 75.0¢
It's been fun thinking through this idea. I'll share with others what I've found in looking into this. There have been several attempts to do something similar before, though not for blockchain messaging. Here's one paper: https://www.sciencedirect.com/science/article/abs/pii/S0920548914000737 They even have an open source implementation: https://github.com/unixba/b64pack
I tried using it to see if I could get space reduction on messages, but thus far the reductions are less than 10% in all the cases I've tried. However, in the best-case scenario, where both sides have a shared dictionary of characters that covers most words and you eliminate spaces, I think it's possible to get over 40% compression. This is with only a few hours of testing from a novice, so I'd love to hear from more experienced people.
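(For comparison, here's a quick way to see how a general-purpose compressor does on a message this short, using Python's built-in zlib – the exact numbers will vary with the message.)

# Sketch: measure a general-purpose compressor on a short message.
import zlib

message = "Bitcoin Cash is peer-to-peer electronic cash."
raw = message.encode("utf-8")
compressed = zlib.compress(raw, 9)
print(len(raw), "bytes raw ->", len(compressed), "bytes compressed")
# On very short texts, the container overhead often eats most of the savings.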
  earned 25.0¢
I didn't go over the statistics and do the math, but if this can be made efficient, there should be a library that suggests to the user which expressions are in the set while they're typing.
  earned $1.00
Yes, this is the basic idea behind *all* data compression. You figure out which messages are "more likely" than others, and you give the "more likely" messages shorter codes. The problem is, "more likely" usually depends on context. General-purpose compression schemes make general-purpose assumptions about the data. They usually just look for duplicates and such, which works OK-ish in most cases.
We aren't dealing with general-purpose data on Memo.cash, though - we are dealing with English messages! Therefore, we can use this information to do a *lot* better than a general-purpose compression scheme. We can exploit the fact that we expect English words to be more likely than gibberish words like "qrrbrbirlbel". This is information a general-purpose data compression algorithm just wouldn't know.
You should look up something called "Huffman coding" (https://en.wikipedia.org/wiki/Huffman_coding). This is a data-compression technique that allows you to assign bit-codes to various messages, depending on how likely those messages are. Somebody could pre-load a Huffman table with all the words in the English language, weighted by frequency, plus letters and punctuation. All Memo.cash users would have a copy of this table, so messages sent in "English mode" could gain the compression benefits from this shared understanding.
Putting the English words into a Huffman table is basically the same idea as this blog post; it's just a more efficient way to encode those words than using Unicode characters. The problem with using Unicode characters is that they don't all cost the same. English letters generally take 1 byte each, but Chinese characters take 3 bytes each, and many Emoji take 4 bytes each. Huffman encoding is basically just a fancy way to balance out the cost/reward ratio. More common words get shorter codes.
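Here's a rough sketch of the idea in Python, building Huffman codes over a tiny made-up frequency table – a real table would cover the whole vocabulary, be built from a big corpus, and be standardized in advance:

# Sketch: build Huffman codes for a tiny, made-up word-frequency table.
import heapq
from itertools import count

def huffman_codes(frequencies):
    tick = count()  # tie-breaker so heap comparisons never compare the dicts
    heap = [(freq, next(tick), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))
    return heap[0][2]

# Illustrative frequencies only.
table = {"the": 500, "be": 300, "to": 290, "of": 260, "cash": 40, "qrrbrbirlbel": 1}
for word, code in sorted(huffman_codes(table).items(), key=lambda kv: len(kv[1])):
    print(word, "->", code)  # more common words get shorter bit-codes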
  earned 0.0¢
Maybe this is what hieroglyphs are? Characters representing words or sets of words. There is talk that those pyramids were not actually built by Egyptians (I can see valid reasoning that this might just be the case), and if so, it would indicate a much more advanced civilization living on Earth before us (or along with us). The idea is definitely interesting and worth looking into. I wonder if the Bitcoin Cash devs haven't thought of this already – they did mention Graphene (I think that was the name), a compression tech already being looked at and developed, which they said could reduce sizes by a factor of 10.
  earned 0.0¢
There are ways to do this already. For example, Smaz does exactly what you propose, but algorithmically and thus more efficiently – representing more frequent words or word parts with fewer bits. You train it on the kind of text you will represent later, so you can also teach it some common parts of URLs.
Actually, computer scientists have been thinking about this a lot, and there are many improvements (Huffman encoding is one) over the "naive" idea we could figure out ourselves – I don't mean to downplay your input; actually, it's nice to see how you think even though you did not study computer science. I recommend googling a college textbook on compression; it will give you more ideas.
And of course what matters is to actually implement it.