Our cloud-covered world could be in for a storm. Beijing-based tech company Terark has developed algorithms that allow databases to run up to 200x faster by compressing their data further. On top of that, the algorithms allow the data to be read without decompressing it. This means one server running TerarkDB can do the job of five servers running industry-standard database engines. The cost savings for companies will be huge, plus it slots straight into existing database ecosystems “like changing a battery,” which makes it easy for Terark to offer free trials.
Terark itself has secured a $1 million contract with Alibaba Cloud and is already profitable despite only being established in November 2015. They’re not looking for further investment, but they are now heading to Europe and the US to try to explain the revolutionary concept to potential clients (keep reading for our attempts), though they’re also making the code free to small users running just one server.
The company is already in talks with undisclosed large clients around the world.
“It’s not an incremental improvement,” said VP Remy Trichard. “You can get up to 200 times faster on random reads. [Big companies like IBM, Google] spend a lot of time and money trying to optimize their servers, adding more memory to get only incremental improvements. In terms of cost savings, it’s five times. One server can do the work of five, in some scenarios, ten.”
So far their biggest client is Alibaba, which has signed a $1 million deal with Terark to integrate its technology into Alibaba Cloud (Aliyun 阿里云), the world’s third largest cloud company according to Sean Fu, Terark’s CEO. This angel client will give its cloud service users the choice to switch to TerarkDB in a few weeks’ time.
Pricing structures are not yet clear, though the team divulged that Alibaba Cloud will save money when customers switch to the system.
How does it work?
The team uses various analogies (scroll down) to explain how the solution works and the inventor of the algorithms, CTO Lei Peng, even drew diagrams to explain the difference. “Our whole logic system is different,” said Lei as he got his whiteboard pen.
Databases store their data in blocks with a corresponding index. When data is needed, a search of the index is made and the relevant block is retrieved. Currently, those blocks are compressed and need to be decompressed. The blocks are managed by a file system cache and have to be dropped into a block cache to be decompressed and read, which puts a huge demand on servers.
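To make the block-and-index arrangement concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the code of any real database engine: the `BlockStore` class, its four-record blocks and the use of zlib and JSON are all invented for illustration. The point it shows is that reading a single record means decompressing the whole block that holds it.

```python
import bisect
import json
import zlib


class BlockStore:
    """Toy block-compressed key-value store with a sparse index (illustrative only)."""

    def __init__(self, records, block_size=4):
        items = sorted(records.items())
        self.first_keys = []  # sparse index: only the first key of each block
        self.blocks = []      # each block is a zlib-compressed JSON list of pairs
        for i in range(0, len(items), block_size):
            chunk = items[i:i + block_size]
            self.first_keys.append(chunk[0][0])
            self.blocks.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
        self.blocks_decompressed = 0  # counts how much decompression work was done

    def get(self, key):
        # Use the sparse index to find the last block whose first key is <= key.
        pos = bisect.bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None
        # The whole block has to be decompressed before any record can be checked.
        chunk = json.loads(zlib.decompress(self.blocks[pos]).decode("utf-8"))
        self.blocks_decompressed += 1
        for k, v in chunk:
            if k == key:
                return v
        return None


store = BlockStore({f"user{i:03d}": f"value-{i}" for i in range(20)})
print(store.get("user007"), store.blocks_decompressed)
```

Even a lookup for a key that turns out not to exist pays for one block decompression, which is the overhead the block cache is there to absorb.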
TerarkDB compresses the data further, but its indexing system is where the real difference is. “Traditional system can only index 1% but we can index 100% using the Nested Succinct Trie [pronounced “try”],” said Lei. Because the index holds far more information about what is in the data, blocks don’t have to be retrieved and decompressed; they can be read in situ. The compressed index is more comprehensive, which means the data doesn’t have to be compressed block by block but can be compressed as a whole (“global compression”), allowing for far greater query speeds.
“We can search directly into the data without decompressing it so we don’t need a big block cache. Traditional databases need to find the relevant block, decompress it, check if it’s the right data, if not then put it back and pick another,” said Trichard.
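The idea of an index that covers every key, rather than one entry per block, can be sketched with an ordinary trie. This is a plain dictionary-based trie, not Terark’s Nested Succinct Trie, whose internals the team did not spell out; the `FullKeyIndex` name and its layout are assumptions made for illustration, and the sketch does not model compression at all. It only shows that when every key is in the index, a lookup lands on exactly one record with no block to scan.

```python
END = ""  # sentinel label; it can never collide with a one-character edge


class FullKeyIndex:
    """Toy trie that indexes every key and points at a single record slot."""

    def __init__(self, records):
        self.values = []  # record storage, one slot per key
        self.root = {}    # nested dicts: one level per character of the key
        for key, value in records.items():
            node = self.root
            for ch in key:
                node = node.setdefault(ch, {})
            node[END] = len(self.values)  # the index stores the record's slot
            self.values.append(value)

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None  # the index alone proves the key is absent
        slot = node.get(END)
        return None if slot is None else self.values[slot]


index = FullKeyIndex({"user001": "alice", "user002": "bob", "use": "carol"})
print(index.get("user002"), index.get("user"))
```

A missing key is rejected while walking the index, before any record is touched, which matches Trichard’s point about not having to fetch a block just to check whether it holds the right data.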
Lei came up with the algorithm when devising a way for Chinese characters to be suggested more quickly when typing pinyin into a keyboard. “It was quite a gradual step by step process in itself, but the breakthrough was applying something very specific to something very general—databases,” said Lei.
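The pinyin problem Lei mentions is a classic use of a trie: every typed prefix leads to a node, and everything below that node is a candidate. The sketch below is a generic prefix lookup, not Lei’s algorithm, and the tiny pinyin table and the `build_trie` and `suggest` helpers are made up for illustration.

```python
END = ""  # sentinel key marking that a complete pinyin string ends at this node


def build_trie(entries):
    """Build a nested-dict trie from (pinyin, characters) pairs."""
    root = {}
    for pinyin, text in entries:
        node = root
        for ch in pinyin:
            node = node.setdefault(ch, {})
        node.setdefault(END, []).append(text)
    return root


def suggest(root, prefix):
    """Return every candidate whose pinyin starts with the typed prefix."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    results = []
    stack = [node]  # walk the subtree below the prefix node
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                results.extend(child)
            else:
                stack.append(child)
    return sorted(results)


# Hypothetical sample data, far smaller than a real input-method dictionary.
trie = build_trie([("ni", "你"), ("nihao", "你好"), ("nin", "您"), ("hao", "好")])
print(suggest(trie, "ni"))
```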
Analogy #1 The Zip File
One explanation of how it works is to think of the blocks as being like a Zip file of vacation photos. You can’t see individual photos within the file, so you either have to decompress it to view them and then recompress, or leave them decompressed and taking up more space. With Terark you can access them within the file, still zipped.
Analogy #2 The Library
Trichard prefers the library scenario. Think of blocks as sections of books in a library, such as architecture or history. Each book has a table of contents at the front, and the library has an index of all the books. So if you want a book on architecture, the librarian/index can direct you to the architecture section/block, but you have to look at each book’s contents page to decide if that’s the book you need. Terark lets you put all the tables of contents into the overall library index.
“It’s like putting all your library on Google – you just type the keyword for what you want,” said Trichard.
Plug ‘n’ play—much faster
Will all that speed make your smartphone melt? “The users of everyday apps and websites may notice a faster experience, but it would really be for the company itself. They would be able to reduce their number of servers and [increase] the speed of querying data from the servers,” says CEO Sean Fu.
“We’ve developed a new engine, not a car,” said Fu. The solution can be slotted straight into existing databases, meaning companies can keep running ecosystems such as MongoDB and MySQL, the most commonly used worldwide.
“Everything stays the same, the interface stays the same – the only difference is they get better speed, better storage, better efficiency,” said Fu.
The team has only ten members, and they don’t see the point of scaling it up or opening offices elsewhere. We met them at their small office within a Tencent-run startup space (you have to use WeChat to get in a meeting room) on the edge of Beijing. “Technology does not have the boundaries of countries – if it’s good, people can use it anywhere. We can do almost everything online, though may need sales engineers in some places,” said Fu.
The Nested Succinct Trie is only the beginning. Terark has six patents for its various innovations, but the team is quite resigned to the fact that the key algorithm for the indexing compression is nearing maturity. “There will be an evolution, but then there will have to [be] different indexes,” said Lei. The team is looking into creating indexes suited to handling different types of data sets as they are approached by more interested parties. They may end up developing a range of products targeted at different client types such as genetics companies. “Different indexes will be more efficient for different data,” said Lei.