Granica

Research

Nearly every analytic and ML pipeline spends more energy hauling entropy than learning from it. At Granica we frame the research question this way: if you can compress data to within a breath of the Shannon limit, can the compression step itself teach the system enough semantics that storage becomes a reasoning organ? Our answer is E∑L. In the ∑ step, incoming exabytes are not just squeezed but also augmented with learned signal. The outcome is that cost is bounded by residual uncertainty, not by the number of bytes you can brute-force through a cluster.
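To make the "bounded by residual uncertainty" framing concrete, here is a minimal sketch contrasting the raw byte cost of a column with the Shannon-entropy lower bound that any lossless coder is held to. The toy column and numbers are purely illustrative.

```python
import math
from collections import Counter

def shannon_entropy_bits(values):
    """Empirical Shannon entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy low-cardinality column: raw storage pays per byte,
# but a lossless coder only has to pay ~entropy bits per row.
column = ["US", "US", "EU", "US", "APAC", "EU", "US", "US"] * 1000
raw_bits = sum(len(v.encode()) for v in column) * 8            # bytes actually stored
entropy_bits = shannon_entropy_bits(column) * len(column)      # Shannon lower bound

print(f"raw: {raw_bits:,} bits, entropy bound: {entropy_bits:,.0f} bits")
```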

That substrate unlocks a family of frontier problems: loss-bounded compression that preserves analytic fidelity; sub-millisecond subselection that skips 99.9% of blocks; generative augmentation for rare-event inference; retrieval and indexing that exploit grid-aware attention; and probabilistic execution plans with deterministic fallbacks, all under continual learning from live traffic.

If turning entropy into intelligence at exabyte scale sounds like research that would stretch you, reach us at hello@granica.ai.

Featured Research

Scaling laws for learning with real and surrogate data

SYNTHETIC GENERATION

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. We introduce a weighted empirical risk minimization (ERM) approach for integrating augmented or 'surrogate' data into training.

Read paper
NeurIPS 2024
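As a rough illustration of the weighted-ERM idea, the sketch below fits a linear model in which real examples enter the loss with weight 1 and surrogate examples with a smaller weight alpha. The model, synthetic data, and the choice alpha=0.3 are illustrative placeholders, not the estimator or weighting analyzed in the paper.

```python
import numpy as np

def weighted_erm_fit(X_real, y_real, X_surr, y_surr, alpha=0.3, ridge=1e-3):
    """Linear model fit by weighted ERM: real points get weight 1, surrogate points weight alpha."""
    X = np.vstack([X_real, X_surr])
    y = np.concatenate([y_real, y_surr])
    w = np.concatenate([np.ones(len(y_real)), np.full(len(y_surr), alpha)])
    # Minimize sum_i w_i (y_i - x_i . beta)^2 + ridge * ||beta||^2 in closed form.
    XtWX = X.T @ (w[:, None] * X) + ridge * np.eye(X.shape[1])
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X_real = rng.normal(size=(50, 3))
y_real = X_real @ beta_true + 0.1 * rng.normal(size=50)
X_surr = rng.normal(size=(500, 3))
y_surr = X_surr @ beta_true + 1.0 * rng.normal(size=500)   # plentiful but noisier surrogate labels

print(weighted_erm_fit(X_real, y_real, X_surr, y_surr, alpha=0.3))
```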

Towards a statistical theory of data selection under weak supervision

INFORMATION DISTILLATION

Given a sample of size N, it is often useful to select a subsample of smaller size n<N to be used for statistical estimation or learning. Such a data selection step is useful to reduce the requirements of data labeling and the computational complexity of learning.

Read paper
ICLR 2024
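The sketch below illustrates one generic form of data selection: a weak model scores all N points (here by predictive entropy) and a biased subsample of size n is drawn with probability increasing in the score. The scoring rule, temperature, and data are invented for illustration and are not the selection schemes analyzed in the paper.

```python
import numpy as np

def select_subsample(scores, n, rng, temperature=1.0):
    """Draw n of N indices, without replacement, with probability increasing in score."""
    logits = scores / max(temperature, 1e-9)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(scores), size=n, replace=False, p=p)

rng = np.random.default_rng(1)
N, n = 10_000, 500
X = rng.normal(size=(N, 2))

beta_weak = np.array([0.8, -0.6])                   # a crude stand-in for weak supervision
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_weak)))      # weak model's predicted class probabilities
scores = -(p_hat * np.log(p_hat) + (1 - p_hat) * np.log(1 - p_hat))  # predictive entropy per point

keep = select_subsample(scores, n, rng)
print(f"selected {len(keep)} of {N} points for labeling and training")
```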

Compressing Tabular Data via Latent Variable Estimation

TABULAR COMPRESSION

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data.

Read paper
ICML 2023
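To show why a better probability model shortens codes, the toy sketch below compares the ideal (arithmetic-coding) length of a small categorical table under an independent-columns model versus a joint model that captures the dependence between columns. The table is invented, and the comparison only gestures at the latent-variable estimation the paper actually develops.

```python
import math
from collections import Counter

def ideal_bits(rows, model_counts, total):
    """Ideal code length in bits (what an arithmetic coder approaches): sum of -log2 p(row)."""
    return sum(-math.log2(model_counts[r] / total) for r in rows)

# Toy categorical table with two correlated columns (e.g. region, plan tier).
rows = [("US", "pro"), ("US", "pro"), ("EU", "free"), ("US", "pro"),
        ("EU", "free"), ("APAC", "free"), ("US", "pro"), ("EU", "free")] * 500

# Model 1: columns modeled independently.
col0, col1 = Counter(r[0] for r in rows), Counter(r[1] for r in rows)
bits_indep = sum(-math.log2(col0[a] / len(rows)) - math.log2(col1[b] / len(rows)) for a, b in rows)

# Model 2: a joint model that captures the dependence between the columns.
bits_joint = ideal_bits(rows, Counter(rows), len(rows))

print(f"independent-columns model: {bits_indep:,.0f} bits, joint model: {bits_joint:,.0f} bits")
```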

Sampling, Diffusion, and Stochastic Localization

ALGORITHMS

Diffusions are a successful technique to sample from high-dimensional distributions that are not given explicitly but rather learnt from a collection of samples. We generalize the construction of stochastic localization processes.
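The sketch below simulates a textbook stochastic localization sampler for a distribution with finite support: the observation process y_t has drift equal to the posterior mean E[X | y_t], and y_t / t concentrates on a draw from the target as t grows. The target, step size, and horizon are illustrative choices, and this is the basic construction rather than the generalized processes studied in the paper.

```python
import numpy as np

def stochastic_localization_sample(support, probs, t_max=25.0, dt=0.02, rng=None):
    """Approximately sample a finite-support distribution by simulating the localization SDE
    dy_t = m(y_t, t) dt + dB_t, where m(y, t) = E[X | t X + B_t = y]."""
    if rng is None:
        rng = np.random.default_rng()
    y, t = 0.0, 0.0
    while t < t_max:
        # Posterior weights: w_k proportional to p_k * exp(x_k * y - t * x_k^2 / 2),
        # since y_t | X = x is N(t x, t).
        log_w = np.log(probs) + support * y - t * support ** 2 / 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        drift = float(w @ support)                     # posterior mean m(y_t, t)
        y += drift * dt + np.sqrt(dt) * rng.normal()   # Euler-Maruyama step
        t += dt
    return y / t                                       # y_t / t concentrates on a draw from the target

support = np.array([-2.0, 0.5, 3.0])
probs = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(7)
draws = [stochastic_localization_sample(support, probs, rng=rng) for _ in range(200)]
nearest = np.array([support[np.argmin(np.abs(support - d))] for d in draws])
print([round(float(np.mean(nearest == s)), 2) for s in support])   # roughly [0.2, 0.5, 0.3]
```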

Scaling Training Data with Lossy Image Compression

LOSSY COMPRESSION

To capture the trade-off between model performance and the optimal storage of training data, we propose a 'storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image.

Read paper
KDD 2024
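To illustrate what a storage scaling law lets you do, the sketch below assumes a hypothetical parametric error curve in n (samples) and b (bits per image) and sweeps a fixed storage budget n·b across compression levels. The functional form and every constant are invented for illustration and are not the law fitted in the paper.

```python
import numpy as np

def hypothetical_test_error(n, bits_per_image, e_floor=0.05, a=2.0, alpha=0.35, c=0.5, beta=5e-4):
    """Illustrative (assumed) storage scaling law, NOT the law fitted in the paper:
    error = irreducible floor + sample-size term + compression-distortion term."""
    return e_floor + a * n ** (-alpha) + c * np.exp(-beta * bits_per_image)

# With a fixed storage budget, trade "fewer, higher-fidelity images" against "more, more-compressed images".
budget_bits = 1e9
for bits_per_image in (2_000, 8_000, 32_000):
    n = budget_bits / bits_per_image
    err = hypothetical_test_error(n, bits_per_image)
    print(f"{bits_per_image:>6} bits/image -> n = {n:>9,.0f} images, hypothetical test error = {err:.3f}")
```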

Inline Data Detection in Large Data Streams

LOSSLESS DATA REDUCTION

We present a novel data processing and reduction method that involves receiving an input data stream and computing a set of features that are representative of or unique to the stream.
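As a generic illustration of computing stream-representative features inline, the sketch below builds a bottom-k sketch incrementally as chunks arrive, hashing overlapping byte windows and keeping only the smallest hash values. The window size, k, and hash function are arbitrary choices for the sketch, not the patented method.

```python
import hashlib

def stream_sketch(chunks, k=64, shingle=32):
    """Bottom-k sketch computed inline: hash every overlapping `shingle`-byte window
    as chunks arrive and keep only the k smallest distinct hash values."""
    minima, tail = [], b""
    for chunk in chunks:                          # chunks arrive one at a time (streaming)
        buf = tail + chunk
        for i in range(len(buf) - shingle + 1):
            h = int.from_bytes(hashlib.blake2b(buf[i:i + shingle], digest_size=8).digest(), "big")
            minima.append(h)
        minima = sorted(set(minima))[:k]          # retain only the k smallest hashes seen so far
        tail = buf[-(shingle - 1):]               # carry overlap so windows span chunk boundaries
    return set(minima)

# Sketch an incoming stream without ever buffering it in full.
stream = [b"log line %d\n" % (i % 50) for i in range(2000)]
features = stream_sketch(stream)
print(f"stream summarized by {len(features)} sketch features")
```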

Efficient Data Deduplication through Sketch Computation and Similarity Metrics

LOSSLESS DATA REDUCTION

The methods provide a more efficient and effective way of handling large data streams, which can be particularly beneficial in applications that require real-time data processing and reduction.
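Continuing the sketching idea above, this snippet estimates the Jaccard similarity of two streams from bottom-k sketches and applies a threshold to decide whether the second stream should be stored as a delta against the first. The parameters, threshold, and dedup decision are illustrative stand-ins rather than the actual method.

```python
import hashlib

def bottom_k_sketch(data, k=64, shingle=32):
    """Bottom-k sketch of a byte string: the k smallest hashes over all `shingle`-byte windows."""
    hashes = {int.from_bytes(hashlib.blake2b(data[i:i + shingle], digest_size=8).digest(), "big")
              for i in range(len(data) - shingle + 1)}
    return set(sorted(hashes)[:k])

def estimated_jaccard(sk_a, sk_b, k=64):
    """Estimate Jaccard similarity from two bottom-k sketches: among the k smallest hashes
    of the union of the sketches, count the fraction present in both."""
    union_bottom = sorted(sk_a | sk_b)[:k]
    return sum(1 for h in union_bottom if h in sk_a and h in sk_b) / len(union_bottom)

common = b"".join(b"record-%06d;" % i for i in range(2000))   # content shared by both streams
stream_a = common + b"tail-A" * 20
stream_b = common + b"tail-B" * 20

sim = estimated_jaccard(bottom_k_sketch(stream_a), bottom_k_sketch(stream_b))
print(f"estimated similarity: {sim:.2f}")
if sim > 0.8:   # illustrative threshold for the dedup decision
    print("near-duplicate streams: keep one full copy and store the other as a delta")
```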