“Data is the new oil” has been a popular technology motto of the last decade. And yet, this claim does not hold under scrutiny. I can fill terabytes of disk space by taking millions of high-resolution pictures of the same empty wall. No one will believe that I have produced value by this pointless exercise. Unlike oil, data is very heterogeneous in value. While in this example the value (or lack thereof) might be easy to recognize, in general it is not.
A second reason for skepticism: rather than oil, data is closer to bitumen, a mixture of oil and sand. Data, likewise, is a mixture of useful information and noise. I can store and communicate petabytes of autonomous-vehicle sensor recordings at submicron precision, even if the sensors are only accurate to the millimeter. Most of the digits I store will be noise. Sand.
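As a back-of-the-envelope illustration of that last point (the numbers below are hypothetical, chosen only for the sake of the example, not real sensor specifications), one can count how many of the stored bits lie below the accuracy floor:

```python
import math

# Hypothetical numbers, for illustration only (not real sensor specifications):
# positions over a 100 m range, stored with a 0.1 micrometer quantization step,
# while the sensor itself is only accurate to about 1 millimeter.
measurement_range_m = 100.0
stored_resolution_m = 1e-7   # sub-micron quantization step
sensor_accuracy_m = 1e-3     # actual measurement accuracy

# Bits needed to cover the full range at the stored resolution.
bits_stored = math.log2(measurement_range_m / stored_resolution_m)

# Bits that resolve detail finer than the sensor can measure: pure noise.
bits_noise = math.log2(sensor_accuracy_m / stored_resolution_m)

print(f"bits stored per value  : {bits_stored:.1f}")
print(f"bits of noise per value: {bits_noise:.1f}")
print(f"fraction that is noise : {bits_noise / bits_stored:.0%}")
```

With these made-up numbers, roughly 13 of the 30 bits in every stored value carry nothing but quantization of noise.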
A third reason for skepticism is that data can be riskier to communicate and store than oil. Privacy and security are constantly at stake with data, and these risks are much harder to detect and mitigate than those involved in transporting oil.
Part of the “data is oil” confusion might arise from the fact that we use bits to measure two very different quantities: amount of information and data size. Bits of information are the truly valuable quantity in data: they are what we want to store, communicate, and use to make decisions, train models, and so on. Bits of data are the quantity our computing infrastructure is built around. We are billed for every bit we store, communicate, process, or acquire, and we are billed for redundant bits at the same rate as for information-bearing bits.
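A minimal sketch of the gap between the two quantities, using a synthetic stand-in for the empty-wall photos and zlib from the Python standard library as a generic compressor:

```python
import zlib

# A synthetic stand-in for "millions of pictures of the same empty wall":
# ten megabytes of identical pixel values.
wall = bytes([200]) * (10 * 1024 * 1024)

compressed = zlib.compress(wall, 9)

print(f"bits of data      : {len(wall) * 8:,}")
print(f"bits after zlib   : {len(compressed) * 8:,}")
print(f"compression ratio : {len(wall) / len(compressed):,.0f}x")
```

The storage bill is proportional to the first number; the information actually contained in the file is far closer to the second.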
What convinced me to join Granica is the opportunity (and the challenge) to bridge this gap. The long-term vision is to redesign cloud computing around bits of information rather than bits of data. I think this vision is as compelling from a science and technology point of view as it is from an economic one.
Of course, this is a very ambitious objective, and it begins with very fundamental questions. How do we define and measure information? How do we encode it? What is noise? What information is useful for learning AI models, predicting trends, and so on? These questions are a century old, and classical formalizations were given within Information Theory. However, our times require us to rethink many of them from scratch. In the past century we thought of data as a sequence of symbols to be produced and consumed by humans, for instance text communicated from one person to another, or voice recordings. The mathematical models and methods developed back then are largely tailored to this setting.
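For concreteness, the classical answer to the first question is Shannon's entropy, which measures information in bits per symbol (this is the standard textbook definition, not anything specific to our work):

```latex
% Shannon entropy of a source emitting symbol x with probability p(x),
% measured in bits per symbol:
\[
  H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x).
\]
```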
Modern applications are dramatically different. We are not interested in efficiently representing one or a few images, each taking a few hundred kilobytes of disk space. We must deal with billions of such images, adding up to petabytes. These images will never be looked at by any human; they will instead be used to train AI systems. They have not been acquired by humans, but by autonomous systems. This poses new challenges and opens the door to new opportunities.
While ambitious, this vision is extremely concrete. The research problems it poses are difficult and of a fundamental nature, but they have broad impact and hard metrics. All of these reasons convinced me to join this endeavor. The past year has been intellectually stimulating and has helped me develop a fresh perspective on many research problems.
To give a concrete example, we have been developing a palette of new data compression algorithms tailored to data formats of interest to many data-centric industries. Before embarking on this work, I was under the impression that data compression was, at a fundamental level, a “well understood” problem. I had learnt as a student that the fundamental limits of data compression were characterized by the Shannon-McMillan-Breiman Theorem and could be achieved in a universal fashion using a Lempel-Ziv (LZ) style algorithm. These algorithms also enjoy strong competitive guarantees against finite-state machines. Many efficient, open-source implementations of these approaches have been developed over the years.
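For reference, here is the textbook statement I have in mind, in informal form: for a stationary, ergodic source, the per-symbol information content converges to the entropy rate, which is then the best achievable compression rate.

```latex
% Shannon-McMillan-Breiman theorem (informal statement).
% For a stationary, ergodic source X_1, X_2, ... with entropy rate
%   H = lim_{n -> infinity} (1/n) H(X_1, ..., X_n),
% the per-symbol information content converges almost surely:
\[
  -\frac{1}{n}\,\log_2 p(X_1,\dots,X_n) \;\xrightarrow[\;n\to\infty\;]{\text{a.s.}}\; H ,
\]
% so no code can use fewer than H bits per symbol asymptotically, and
% universal LZ-style compressors approach H without knowing p in advance.
```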
However, this classical “solution” relies on two crucial assumptions about the data: ergodicity and stationarity. Facing the problem from a practical viewpoint made me appreciate that these assumptions are rarely a good approximation to real data and, as a consequence, LZ-type algorithms can be substantially suboptimal. For instance, in a recent paper we show empirically that the industry standard for LZ-style compression (Zstandard) can be beaten by a significant margin on tabular data. We also develop a mathematical model for these data within which the gain over classical schemes can be rigorously characterized.
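As a toy illustration of why stationarity matters (this is a synthetic experiment, not the paper's model and not Granica's algorithm): serializing a table row by row interleaves columns with very different statistics, which a generic LZ-style compressor handles poorly, while grouping each column before compressing often recovers a sizable part of that loss. The sketch below uses zlib as a stand-in LZ-style compressor and made-up data.

```python
import random
import zlib

random.seed(0)
n_rows = 100_000

# A synthetic table whose columns follow very different statistics:
# a slowly drifting sensor reading, a small categorical field, and a row counter.
sensor = [f"{20.0 + 0.0001 * i + random.gauss(0, 0.01):.4f}" for i in range(n_rows)]
status = [random.choice(["OK", "OK", "OK", "WARN"]) for _ in range(n_rows)]
index = [str(i) for i in range(n_rows)]

# Row-wise serialization (CSV-like): the columns are interleaved line by line.
row_wise = "\n".join(",".join(row) for row in zip(index, status, sensor)).encode()

# Column-wise serialization: each column is stored contiguously.
col_wise = "\n".join("\n".join(col) for col in (index, status, sensor)).encode()

row_size = len(zlib.compress(row_wise, 9))
col_size = len(zlib.compress(col_wise, 9))

print(f"row-wise, compressed   : {row_size:,} bytes")
print(f"column-wise, compressed: {col_size:,} bytes")
print(f"saving from the columnar layout: {1 - col_size / row_size:.0%}")
```

The point of the sketch is only that the row-major stream is far from stationary, so a compressor built on that assumption leaves bits on the table.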
While this is only one among several research directions that we are pursuing, I believe it is extremely rich in itself, with numerous connections to machine learning, statistics, and optimization. My team and I look forward to exploring these questions in the near future.
June 08, 2023