I want to maintain my amazing once-per-year blogging streak, but I spent all my holiday break time on a new but still secret project that won’t be unveiled until next year 😦

There were lots of exciting developments in 2023, most of which I’m sure I’ll miss.

  • Information theoretic diffusion: this approach was introduced at ICLR 2023, and we explore some applications in an ICLR 2024 submission. I feel this perspective still has unique insights to offer (compared to the SDE/ODE/score matching/VAE/nonequilibrium thermodynamics perspectives on diffusion). The ICLR 2024 submission was partially inspired by a fun workshop with the small but fantastic “information decomposition” community, DeMICS.
  • It’s been exciting working on more causality stuff again with the smart and prolific Myrl Marmarelis.
  • I’ve also spent more time on my role as an Amazon visiting academic. There have been two exciting research developments there that I can’t share until there is a public paper out.
  • A few random (but not that personal) developments: (1) I made a cold tub to up my Wim Hof game, (2) I doubled my consecutive push-up count, and (3) I really like the new NYT Connections game.
Random AI image to liven things up

UC in 2023

30Dec22

I am happy to announce that I am starting 2023 as an associate professor at the University of California, Riverside, in the computer science department. While I will miss USC, I hope to maintain close connections with the many wonderful colleagues and students I have had the privilege of working with there (I will continue to have a title as an adjunct research associate professor at USC). I will also continue my role as a part-time visiting academic at Amazon Alexa AI. 

I am looking forward to using the freedom of tenure to explore several exciting new research directions; see here for a preview of one. I was fortunate to participate in many exciting projects in 2022; here is a sampling of some major developments.

  • My PhD student Rob Brekelmans defended and is now a postdoc at Vector Institute. His ICLR 2022 paper presented exciting new results on mutual information estimation (talk).
  • A second PhD student in my group defended. Sami Abu-El-Haija is now at Google Research. He contributed a number of significant and influential ideas in graph representation learning.
  • Umang Gupta will defend in 2023. His work at ACL 2022 on reducing bias in text generation seems especially relevant given the recent interest in ChatGPT. 
  • Hrayr Harutyunyan will also defend in 2023. He has made some major contributions to the deep problem of understanding generalization in neural networks, building on his already influential NeurIPS 2021 paper with recent results at ITW 2022.
  • Materials science! This is my first paper in materials science, led by the multi-talented physicist Marcin Abram. The paper appeared in npj Computational Materials, showing an intriguing way to use self-supervised machine learning to discover changes in material microstructure.
  • Neuroimaging. Several papers build on our ongoing work on harmonizing MRIs across sites, and on federated learning. An older line of work of mine, CorEx, saw some new applications in neuroimaging with my first cover article for a journal, an issue of Entropy (shown below).
Cover article in Entropy
Cool picture from the npj Computational Materials article.
For fun: one of my Lensa avatars, astronaut mode.

There’s lots of exciting work recently that I haven’t had time to describe. Sami wrote a nice blog post about his NeurIPS 2021 paper on dramatically speeding up graph representation learning with an implicit form of SVD.

I will continue to be too busy to blog much, especially because of the new class I’m teaching on dynamics of representation learning.


A few posts back, I talked about how fairness could be related to information theory. By removing any information that could be used to identify a group, you make it impossible to give that group preferential treatment. A talented student in our group, Umang Gupta, has taken that line of reasoning further and shown how information theory can give guarantees about the trade-off between fairness and accuracy on a task. Umang made this cool one-minute explainer video for his paper, which will appear at AAAI, and it sums things up better than I can.
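
Very roughly, and in my own shorthand rather than the paper’s exact statement, the guarantee has the following flavor: if a learned representation keeps only a little information about the protected attribute, then no downstream predictor built on it can be very unfair, while the achievable accuracy is capped by how much task information the representation keeps.

```latex
% My shorthand for the flavor of the guarantee, not the paper's exact statement.
% x = data, z = learned representation, c = protected attribute, y = task label.
% Any downstream predictor f(z) can only be so unfair if z knows little about c:
\[
\Delta_{\text{parity}}\big(f(z)\big) \;\le\; g\big(I(z;c)\big),
\qquad g \text{ increasing, } g(0)=0,
\]
% while accuracy on the task is capped by how much task information z retains:
\[
I(z;y) \;\le\; I(x;y).
\]
```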


I’m excited to share a student paper that was just accepted to ICML. Neural networks are capable of memorizing training labels, but if this happens they will generalize poorly when applied to test data. Where is that information about memorized labels stored? Well, it has to be stored in the neural network weights somewhere. You can write down an information measure that captures the amount of memorization. Unfortunately, it’s very difficult to estimate or control this term, because it involves high-dimensional quantities. Hrayr found an interesting way to control this information, by using a separate neural network that estimates gradients without relying too much on label information.
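
In symbols (my shorthand; the paper sets this up more carefully), the quantity being controlled is a conditional mutual information between the learned weights and the training labels:

```latex
% W = learned weights, X = training inputs, Y = training labels (my notation).
% How much do the weights remember about the labels beyond what the inputs
% already explain?
\[
I(W ; Y \mid X)
\]
% Keeping this small limits how much label noise can be stored in the weights;
% the trick is to control it indirectly, since estimating it directly for
% high-dimensional W is hopeless.
```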

The top line (y) shows the actual labels in the Clothing-1M dataset provided by human annotators. As you can see, many of these labels are wrong or confusing. Our approach can be used to identify these noisy labels and, in some cases, correct them (second line, y hat).

One of the most fun parts of this project is that we can then check which labels in a dataset seem to require the most memorization. In the figure, you can see that the labels provided by humans (y) for this dataset are often wrong or confusing, while our network, which tries to learn a classifier without memorizing labels, often does better (y hat).


ICML and MixHop

01Jun19

ICML 2019 is coming up soon, and I plan to be there (except I’m missing Tuesday). I want to briefly tout the excellent work of a fantastic student who joined our lab, Sami Abu-El-Haija. If you’ve kept up with developments in learning on graphs, you may be aware of graph convolutional networks, which combine the best of neural networks and spectral learning on graphs to produce some top-notch results on graph datasets. There is one drawback of the original graph convolution approach. Unlike visual convolutions, it doesn’t allow for complex weighting of locations within the kernel: the equivalent relative weighting in graph convolutions is constant, which makes it difficult to distinguish how immediate neighbors in a graph might systematically differ in their effect from neighbors of neighbors. Sami’s paper MixHop rectifies this issue and shows that this gives some nice performance boosts. Sami will be presenting the work at ICML, and code for the approach is also available on GitHub.
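
To give a flavor of the mechanics, here is a minimal numpy sketch of a MixHop-style layer (my own toy version, not Sami’s implementation; the names `normalize_adjacency` and `mixhop_layer` are just for illustration). Each power of the normalized adjacency matrix gets its own weight matrix, and the resulting feature blocks are concatenated, so the network can weight immediate neighbors differently from neighbors of neighbors.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize an adjacency matrix with self-loops:
    D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def mixhop_layer(A_norm, H, weight_list):
    """One MixHop-style layer: propagate features over successive powers of
    the normalized adjacency (power 0, 1, 2, ...), give each power its own
    weight matrix, and concatenate the resulting feature blocks."""
    outputs, propagated = [], H          # A_norm^0 @ H = H
    for W in weight_list:
        outputs.append(np.maximum(propagated @ W, 0.0))  # ReLU(A^j H W_j)
        propagated = A_norm @ propagated                 # move to the next power
    return np.concatenate(outputs, axis=1)

# Tiny usage example on a 4-node path graph with 3-dimensional features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 8)) for _ in range(3)]  # powers 0, 1, 2
out = mixhop_layer(normalize_adjacency(A), H, weights)
print(out.shape)  # (4, 24): three 8-dimensional blocks concatenated
```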

MixHop architecture


binary confederate rip

Southern information theorists after the Civil War realized that although they could no longer exclude former slaves from the polls, they could exclude people based on other criteria, like, say, education, and that these criteria happened to be highly correlated with having been a slave who was denied an education. Republican information theorists continue to exploit this observation in ugly ways to exclude voters, like disenfranchising voters without street addresses (many Native Americans).

In the era of big data, these deplorable types of discrimination have become more insidious. Algorithms determine things like what you see on your Facebook feed or whether you are approved for a home loan. Although a loan approval can’t be explicitly based on race, it might depend on your zip code, which may be highly correlated with race for historical reasons.

This leads to an interesting mathematical question: can we design algorithms that are good at predicting things like whether you are a promising applicant for a home loan without being discriminatory? This type of question is at the heart of the emerging field of “fair representation learning”.

This is effectively an information theory question. We want a representation of our data that retains information about the thing we would like to predict but is not informative about some protected variable. The contribution of a great PhD student in my group, Daniel Moyer, to the growing field of fair representation learning was to come up with an explicit and direct information-theoretic characterization of this problem. His results will appear as a paper at NIPS this year.
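
In rough terms (this is my shorthand, not the exact objective from the paper), the characterization asks for an encoding z of the data x that stays informative about x while carrying as little information as possible about the protected variable c:

```latex
% x = data, z = learned representation, c = protected variable (my shorthand).
\[
\max_{q(z \mid x)} \;\; I(x ; z) \;-\; \lambda \, I(z ; c)
\]
% Neither term is tractable exactly, so both are replaced by variational
% bounds, giving a VAE-like reconstruction objective plus a penalty that
% discourages z from encoding c, with no adversary required.
```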

He showed that an information-theoretic approach could be more effective, with less effort, than previous approaches, which rely on an adversary to test whether any protected information has leaked through. He also showed that you can use this approach in other fun ways. For instance, you can ask counterfactual questions like what the data would look like if we changed just the protected variable and nothing else. As a concrete visual example, imagine that our “protected variable” is the digit in a handwritten picture. Our neural net learns to represent the image without knowing which specific digit was written. Then we can run the neural net in reverse to reconstruct an image that looks stylistically similar to the original but with any value of the digit that we choose.


Fig. 3 from our paper.

Going back to the original motivation about fairness: even though we can define fairness in this information-theoretic way, it’s not clear that this matches a human conception of fairness in all scenarios. Formulating fairness in a way that meets societal goals in different situations and is quantifiable is an ongoing problem in this interesting new field.


It’s officially been a year since my last blog post. There have been so many exciting new things going on that it’s been hard to take time out for some nice big-picture blog posts. Here are a few areas that I have the best of intentions of getting to.


Consider a little science experiment we’ve all done, to find out if a switch controls a light. How many data points does it usually take to convince you? Not many! Even if you didn’t do a randomized trial yourself, and observed somebody else manipulating the switch you’d figure it out pretty quickly. This type of science is easy!

One thing that makes this easy is that you already know the right level of abstraction for the problem: what a switch is, and what a bulb is. You also have some prior knowledge, e.g. that switches typically have two states, and that they often control things like lights. What if the data you had was actually a million variables, representing the state of every atom in the switch, or in the room?

Even though, technically, this data includes everything about the state of the switch, it’s overkill and not directly useful. For it to be useful, it would be better to boil it back down to a “macro” description consisting of just a switch with two states. Unfortunately, it’s not very easy to go from the micro description to the macro one. One reason for this is the “curse of dimensionality”: a few samples from a million-dimensional space leave it severely under-sampled, and directly applying machine learning methods to this type of data typically leads to unreliable results.

As an example of another thing that could go wrong, imagine that we detect, with p<0.000001, that atom 173 is a perfect predictor of the light being on or off. Headlines immediately proclaim the important role of atom 173 in the production of light. A complicated apparatus to manipulate atom 173 is devised, only to reveal… nothing. The role of this atom is meaningless in isolation from the rest of the switch. And this hints at the meaning of “macro-causality”: to identify (simple) causal effects, we first have to describe our system at the right level of abstraction. Then we can say that flipping the switch causes the light to go on. There is a causal story involving all the atoms in the switch, electrons, etc., but it is not very useful.

Social science’s micro-macro problem

Social science has a similar micro-macro problem. If we get “micro” data about every decision an individual makes, is it possible to recover the macro-state of the individual? You could ask the same question where the micro-variables are individuals and you want to know the state of an organization like a company.

Currently, we use expert intuition to come up with macro-states. For individuals, this might be a theory of personality or mood and include states like extroversion, or a test of depression, etc. After dreaming up a good idea for a macro-state, the expert makes up some questions that they think reflect that factor. Finally, they ask an individual to answer these questions. There are many places where things can go wrong in this process. Do experts really know all the macro states for individuals, or organizations? Do the questions they come up with accurately gauge these states? Are the answers that individuals provide a reliable measure?

Most of social science is about answering the last two questions. We assume that we know what the right macro-states are (mood, personality, etc.) and we just need better ways to measure them. What if we are wrong? There may be hidden states underlying human behavior that remain unknown. This brings us back to the light switch example. If we can identify the right description of our system (a switch with two states), experimenting with the effects of the switch is easy.

Macro approaches and limitations

The mapping from micro to macro is sometimes called “coarse-graining” by physicists. Unfortunately, coarse-graining in physics usually relies on reasoning from the physical laws of the universe, allowing us, for instance, to analytically derive an expression that takes us from describing a box of atoms with many degrees of freedom to a simple description involving just three macro-variables: volume, pressure, and temperature.
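
The textbook endpoint of that kind of derivation is the ideal gas law, which ties those three macro-variables together in one line:

```latex
% N particles, Boltzmann constant k_B: pressure, volume, and temperature are
% all that remain of ~10^23 microscopic degrees of freedom.
\[
P V = N k_B T
\]
```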

The analytic approach isn’t going to work for social science. If we ask, “why didn’t Jane go to the party?”, an answer involving the firing of neurons in her brain is not very useful, even if it is technically correct. We want a macro-state description that gives us a more abstract causal explanation, even if the connection with micro-states is not as clean as the ideal gas law.

There are some more data-driven approaches to coarse-graining. One of the things I work on, CorEx, says that “a good macro-variable description should explain most of the relationships among the micro-variables.” We have gotten some mileage from this idea, finding useful structure in gene expression data, social science data, and (in ongoing work) brain imaging, but it’s far from enough to solve this problem. Currently these approaches address coarse-graining without handling the causal aspect. One promising direction for future work is to jointly model abstractions and causality.
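
In symbols, the CorEx idea is roughly the following (my compressed version of it): measure the total dependence among the micro-variables with the total correlation, and look for latent macro-variables that account for as much of that dependence as possible.

```latex
% Total correlation (multivariate mutual information) of micro-variables X_1..X_n:
\[
TC(X) = \sum_{i=1}^{n} H(X_i) \;-\; H(X_1, \dots, X_n)
\]
% CorEx searches for latent macro-variables Y that explain this dependence,
% i.e. that leave as little residual dependence as possible given Y:
\[
\max_{Y} \;\; TC(X) - TC(X \mid Y)
\]
```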

A long line of research sometimes called computational mechanics, developed by Shalizi and Crutchfield, among others, says that our macro-state description should be a minimal sufficient statistic for optimally predicting the future from the past (the construction is sketched below). Here’s an older high-level summary and a long list of other publications. One of the main problems with this approach for social science is that the sufficient statistics may not give us much insight for systems with many internal degrees of freedom. We would like to be able to structure our macro-states in a meaningful way. A similar approach also looks for compressed state-space representations but focuses more on the complexity and efficiency of simulating a system using its macro-state description.
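
Stated loosely, the construction groups histories of a process into the same macro-state exactly when they make the same predictions about the future:

```latex
% Causal-state equivalence, loosely stated: lump together past trajectories
% that induce the same conditional distribution over futures.
\[
\overleftarrow{x} \,\sim\, \overleftarrow{x}'
\;\iff\;
P\big(\overrightarrow{X} \mid \overleftarrow{X}=\overleftarrow{x}\big)
= P\big(\overrightarrow{X} \mid \overleftarrow{X}=\overleftarrow{x}'\big)
\]
% The equivalence classes are the causal states: a minimal sufficient
% statistic of the past for predicting the future.
```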

Eberhardt and Chalupka introduced me to the term “macro-causal”, which they have formalized in the “causal coarsening theorem”. A review of their work using causal coarsening includes a nice example where they infer macro-level climate effects (El Niño) from micro-level wind and temperature data.

Finally, an idea that I really like but is still in its infancy is to find macro-states that maximize interventional efficiency. A good macro-state description is one that, if we manipulated it, would produce strong causal effects. Scott Aaronson gives a humorous critique of this work, “Higher level causation exists (but I wish it didn’t)”. I don’t think the current formulation of interventional efficiency is correct either, but I think the idea has potential.

It’s hard to be both practical and rigorous when it comes to complex systems involving human behavior. The most successful example of discovering a macro-causal effect, I think, comes from the El Niño example, but this is still relatively easy compared to the problems in social science. In the climate case, they were able to restrict the focus to a homogeneous array of sensors in an area where we expect to find only a small number of relevant macro-variables. For humans, we get a jumble of missing and heterogeneous data and (intuitively) feel that there are many hidden factors at play even in the simplest questions. These challenges guarantee lots of room for improvement.

Cross-posted to Approximately Correct.


The Grue language doesn’t have words for “blue” or “green”. Instead Grue speakers have the following concepts:

grue: green during the day and blue at night

bleen: blue during the day and green at night

(This example is adapted from the original grue thought experiment.) To us, these concepts seem needlessly complicated. However, to a Grue speaker, it is our language that is unnecessarily complicated. For him, green has the cumbersome definition of “grue during the day and bleen at night”.

How can we wipe the smug smile off this Grue speaker’s face and convince him of the obvious superiority of our own concepts of blue and green? What we do is sneak into his house at night and blindfold and drug him. We take him to a cave deep underground and leave him there for a few days. When he wakes up, he has no idea whether it is day or night. We remove his blindfold and present him with a simple choice: press the grue button and we let him go, but press the bleen button… Now he’s forced to admit the shortcomings of “grue” as a concept. Without the extra information about the time of day (information that seems irrelevant to color), grue-ness provides no information about visual appearance. Obviously, if we told him to press the green button, he’d be much better off.

We say that grue-ness and time of day exhibit “informational synergy” with respect to predicting the visual appearance of an object. Synergy means the “whole is more than the sum of the parts”: in this case, knowing either the time of day or the grue-ness of an object alone does not help you predict its appearance, but knowing both together gives you perfect information.
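
If you want to see this claim as actual numbers, here is a toy calculation I put together for this post (not from any paper): color and time of day are independent fair coin flips, grue-ness is determined by both, and the three mutual informations come out exactly as described.

```python
import itertools
from collections import Counter
from math import log2

def grueness(color, time):
    # grue = green during the day or blue at night; otherwise bleen
    return "grue" if (color, time) in {("green", "day"), ("blue", "night")} else "bleen"

# Joint distribution over (color, time, grue-ness): color and time are
# independent fair coin flips, grue-ness is a deterministic function of both.
joint = Counter()
for color, time in itertools.product(["green", "blue"], ["day", "night"]):
    joint[(color, time, grueness(color, time))] += 0.25

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(indices):
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in indices)] += p
    return out

def mutual_information(idx_a, idx_b):
    # I(A;B) = H(A) + H(B) - H(A,B)
    return entropy(marginal(idx_a)) + entropy(marginal(idx_b)) - entropy(marginal(idx_a + idx_b))

print(mutual_information((2,), (0,)))    # I(grue-ness; color)       = 0.0 bits
print(mutual_information((1,), (0,)))    # I(time; color)            = 0.0 bits
print(mutual_information((1, 2), (0,)))  # I(time, grue-ness; color) = 1.0 bit
```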

Grues in deep learning

This whimsical story is a very close analogy for what happens in the field of “representation learning”. Neural nets and the like learn representations of data consisting of “neurons” that we can think of as concepts or words in a language, like “grue”. There’s no reason for generic deep learners to prefer a representation involving grue/bleen to one with blue/green, because either will have the same ability to make good predictions. And so most learned representations are synergistic, and when we look at individual neurons in these representations, they have no apparent meaning.

The importance of interpretable models is becoming acutely apparent in biomedical fields, where black-box predictions can be actively dangerous. We would like to quantify and minimize synergies in representation learning to encourage more interpretable and robust representations. Early attempts to do this are described in this paper about synergy, and another paper demonstrates some benefits of a less synergistic factor model.

Revenge of the Grue

Now, after making this case, I want to expose our linguo-centrism and provide the Grue apologist’s argument, adapted from a conversation with Jimmy Foulds. It turns out the Grue speakers live on an island that has two species of jellyfish: a bleen-colored one that is deadly poisonous and a grue-colored one which is delicious. Since the Grue people encounter these jellyfish on a daily basis and their very lives are at stake, they find it very convenient to speak of “grue” jellyfish, since in the time it takes them to warn about a “blue during the day but green at night jellyfish”, someone could already be dead. This story doesn’t contradict the previous one but highlights an important point. Synergy only makes sense with respect to a certain set of predicted variables. If we minimize synergies in our mental model of the world, then our most common observations and tasks will determine what constitutes a parsimonious representation of our reality.

Acknowledgments

I want to thank some of the PhD students who have been integral to this work. Rob Brekelmans did many nice experiments for the synergy paper and has provided code for the character disentangling benchmark task in the paper. Dave Kale suggested key aspects of this setup. Finally, Hrayr Harutyunyan has been doing some amazing work on understanding and improving different aspects of these models. The code for the disentangled linear factor models is here; I hope to do some in-depth posts about different aspects of that model (like blessings of dimensionality!).