Edit: Also check out the story by the Washington Post and on cancer.gov.

Shirley is a collaborator of mine who works on using gene expression data to get a better understanding of ovarian cancer. She has a remarkable personal story that is featured in a podcast about our work together. I laughed, I cried, I can’t recommend it enough. It can be found on itunes and on soundcloud (link below).

As a physicist, I’m drawn towards simple principles that can explain phenomena that look complex. In biology, on the other hand, explanations tend to be messy and complicated. My recent work has really revolved around trying to use information theory to cut through messy data to discover the strongest signals. My work with Shirley applies this idea to gene expression data for patients with ovarian cancer. Thanks to Shirley’s amazing work, we were able to find a ton of interesting biological signals that could potentially have a real impact on treating this deadly disease. You can see a preprint of our work here.

I want to share one quick result. People often judge clusters discovered in gene expression data based on how well they recover known biological signals. The plot below shows how well our method (CorEx) does compared to a standard method (k-means) and a very popular method in the literature (hierarchical clustering). We are doing a much better job of finding biologically meaningful clusters (at least according to gene ontology databases), and this is very useful for connecting our discovery of hidden factors that affect long-term survival to new drugs that might be useful for treating ovarian cancer.

TCGA clusters



Here’s one way to solve a problem. (1) Visualize what a good solution would look like. (2) Quantify what makes that solution “good”. (3) Search over all potentials solutions for one that optimizes the goodness.

I like working on this whole pipeline, but I have come to the realization that I have been spending too much time on (3). What if there were a easy, general, powerful framework for doing (3) that would work pretty well most of the time? That’s really what tensorflow is. In most cases, I could spend some time engineering a task-specific optimizer that will be better, but this is really premature optimization of my optimization and, as Knuth famously said: “About 97% of the time, premature optimization is the root of all evil”.The docker whale

Abstract starfish

This one is just for fun. There’s no deeper meaning, just a failed experiment that resulted in some cool looking pictures.

Abstract bear


You have just eaten the most delicious soup of your life. You beg the cook for a recipe, but soup makers are notoriously secretive and soup recipes are traditionally only passed on to the eldest heir. Surreptitiously and with extreme caution, you pour some soup into a hidden soup compartment in your pocket.

The Information Sieve

When you get back to your mad laboratory, you begin reverse engineering the soup using an elaborate set of sieves. You pour the soup through the first sieve which has very large holes. “Eureka! The first ingredient is an entire steak.” Pleased with yourself, you continue by pouring the soup through the next sieve with slightly smaller holes. “Mushrooms, of course!” You continue to an even smaller sieve, “Peppers, I knew it!”. Since it is not a just a laboratory, but a mad laboratory, you even have a set of molecular sieves that can separate the liquid ingredients so that you are able to tell exactly how much salt and water are in the soup. You publish the soup recipe on your blog and the tight-lipped chef is ruined and his family’s legacy is destroyed. “This is for the greater good,” you say to yourself, somberly, “Information wants to be free.”

This story is the allegorical view of my latest paper, “The Information Sieve“, which I’ll present at ICML this summer (and the code is here). Like soup, most data is a mix of different things and we’d really like to identify the main ingredients. The sieve tries to pull out the main ingredient first. In this case, the main ingredient is the factor that explains most of the relationships in the data. After we’ve removed this ingredient, we run it through the sieve again, identifying successively more subtle ingredients. At the end, we’ve explained all the relationships in the data in terms of a (hopefully) small number of ingredients. The surprising things are the following:

  1. We can actually reconstruct the “most informative factor”!
  2. After we have identified it, we can say what it means to “take it out”, leaving the “remainder information” intact.
  3. The third surprise is negative: for discrete data, this process is not particularly practical (because of the difficulty of constructing remainder information). However, an exciting sequel will appear soon showing that this is actually very practical and useful for continuous data.

Update: The continuous version is finally out and is much more practical and useful. A longer post on that will follow.

The Bandwagon

Shannon’s birthday has passed, but I thought I would jump on the bandwagon late, as usual. Shannon himself recognized that information theory was so compelling that it encouraged over-use. He wrote an article saying as much way back in the 50’s.

It will be all too easy for our somewhat artificial prosperity to collapse overnight when it is realized that the use of a few exciting words like information, entropy, redundancy do not solve all of our problems”


As a researcher who tries to use information theory far beyond the domains for which it was intended, I take this note of caution seriously. Information as defined in Shannon’s theory has quite a narrow (but powerful!) focus on communication between two parties, A and B. When we try apply information theory to gene expression, neuroimaging, or language, we have many, many variables and there is not an obvious or unique sense of what A and B should be. We don’t really have a complete theory of information for many variable systems, but I think that is where this bandwagon is headed.


As a child, I was visited by an alien. I remember the sensation of not being able to move or speak and seeing this other-worldly face. Some time later, when I saw a documentary about people who had been visited by aliens, I felt a chill of recognition. Their experiences matched my own.

People all over the world describe the appearance of aliens in a similar way, and this is often cited as proof that they are among us. I think that there is a different and more plausible explanation for the universality of this phenomena.

The alien image above was generated automatically from a collection of human faces. How was that done? Basically, I train some “neurons” to capture as much information about human faces as possible. These neurons try to split up the work in an optimal way, and robustly we see a neuron that represents the alien face. When you add this together with a few other informative facial features, you can flexibly recognize many types of faces. Seeing this pattern causes a strong sense of recognition because it activates core features in our facial recognition circuitry. However, having this neuron fire by itself is unnatural – no human face would cause just this one pattern to fire without any accompanying ones. Therefore it also strikes us as alien.

Dreams and drugs cause random firing patterns that sometimes activate unusual combinations of our facial recognition circuitry. The universality of people’s perceptions of alien faces only reflects the universal principles underlying the circuits in our head. Now we just have to reconstruct these principles: the truth is out there (in here?).

IMG_0041Once upon a time, a boy from a farm in Iowa got an exciting opportunity to move to the west coast. While many new experiences awaited him there, he found himself imprisoned in a cage made of cars. After many years, he had never managed to escape the prison of cars to do simple things like learning to surf or exploring the natural beauty nearby. Later he realized that the prison was in his mind and driving to Joshua Tree is really not that big a deal. And it’s totally worth it.

In academic work, page restrictions in publications often mean that there is not enough space to explore interesting but tangential relationships between ideas or to give more than bare-bones proofs of mathematical ideas. I am hoping to remedy this, at least initially, with a webcast series of three talks at ISI. I’ll post the links here. These are on the technical side. The eHarmony talk is a little more general of an introduction.

Learning Succinct and Informative Representations: Background and Big Picture

This talk contains some background about information theory and a few famous ideas on how to use it for learning.

  • InfoMax   This famous principle says something very intuitive. A good representation should have maximal mutual information with the data. I briefly discuss why this is wrong. (The short version: maximizing mutual information is like memorizing, and memorizing is not the way to build powerful representations with layers of abstraction.)
  • Information decomposition   This is one of my favorite topics. I talk about some classic Venn diagrams in information theory, Partial Information Decomposition from Beer/Williams, and a classic result from Watanabe about how multivariate mutual information can be decomposed. This decomposition motivates an interpretation of CorEx as hierarchical information decomposition.
  • Information bottleneck  The principle behind the bottleneck is to lossily compress data in a way that minimizes some distortion measure. In this case, they focus on supervised learning and take maximizing relevance about labels as a distortion measure. CorEx can be viewed as a compression with an unsupervised distortion measure where we try to retain the most redundant information in the data.
  • Independent component analysis  This one was at the end and got short shrift. ICA also has a compression interpretation. CorEx finds successively less dependent components at each layer (same with the information sieve, which we’ve used for discrete ICA).
  • Generative models  A popular way to do learning is to assume some generative model and then fit parameters to maximize the likelihood of the data. This requires a lot of up front assumptions. The perspective we take is the opposite: you say what type of computational structure you can support (i.e., calculating some probabilistic functions in parallel), and then optimize an informational objective with those resources. This doesn’t require model assumptions and, depending on the objective, has an operational meaning even if your model is mis-specified.

Part 2 will have a quick recap, filling in some things I missed in part 1. Then we’ll get into some in-depth derivations and implementation details about using CorEx to learn representations. Part 3 will get into the new directions (information sieve, temporal representations, …).

CorEx special case

This shows the Venn diagram for a special case where Y explains correlations between two variables, X1, X2. In that case, the objective reduces to the triple information and is maximized if X1 and X2 are conditionally independent given Y.

I just put up a new paper with the (hopefully) intriguing title “The Information Sieve“. The motivation is that when we humans look at the world, we tend to identify a new pattern or learn a new trick, and then we move on to the next thing. There are two amazing things about this:

1. We don’t learn everything at once.

2. We don’t re-learn the same thing over and over again (usually).

These may seem inconsequential, but it turns out to be very difficult to get machines to learn in this way.

The information sieve introduces a new way of learning things piece by piece. There is some amount of information in whatever data we are looking at, but we don’t know how much (and it’s usually impossible to exactly find out because of limited data/computation). We pass the data through the first layer of the sieve to extract the “most informative” pattern in the data, the data is transformed and the remaining information trickles down to the next layer of the sieve. This “remainder information” contains all the information from the original data except for what was already learned. This allows incremental learning that is guaranteed to improve at each step, and to never duplicate effort by re-learning what is already known.

The bonus eigen-faces below are not in the paper, but they show what the sieve extracts at various layers (when looking at a classic dataset called the Olivetti faces). Any face can “activate” any of these 10 learned factors. The blue/red shows how different pixels in the image contribute to whether that factor is activated. One seems to correspond to faces looking left or right (bottom, second from right). Others seem to focus on different parts of the face reflecting facial expressions. Anyway, there is more to be done to make this method practical on larger datasets, but this seems to be a promising first step. (The paper also shows how this method applies to lossy and lossless compression and independent component analysis, in case that is your bailiwick.)

Some eigen-faces learned with the information sieve.

Some eigen-faces learned with the information sieve.