Demystifying Information-Theoretic Clustering

Quick summary: Finding clusters in data is a fundamental problem in machine learning, and ideally you’d like to do it without making any assumptions about your data. Information theory provides a good way to do this, but the first few attempts have been fundamentally flawed. We fix the flaws and show that doing so leads to much better clusters. Almost every clustering paper shows a picture like the one on the left below. What they don’t show is that they have carefully picked the sizes of the circles so that their clustering method works. Our method (CVR) works for arbitrarily sized clusters.

Using information theory to find clusters is a nice idea, but previous attempts (NIC, MI) were totally wrong and would only work for carefully hand-picked examples. Our method (CVR) works for arbitrarily sized clusters.
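
To see why the hand-picked sizes matter, here is a small illustration of the failure mode (my own sketch, using scikit-learn’s spectral clustering as a stand-in for any method with a fixed scale parameter; this is not NIC, MI, or CVR):

```python
# Illustration only (not the method from the paper): a clustering
# algorithm with a hand-tuned scale parameter can separate one
# two-circles geometry and fail when the relative radii change.
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

for factor in (0.3, 0.8):  # ratio of inner to outer circle radius
    X, y = make_circles(n_samples=400, factor=factor, noise=0.05,
                        random_state=0)
    labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=20.0,
                                random_state=0).fit_predict(X)
    print(f"radius ratio {factor}: agreement with truth (ARI) = "
          f"{adjusted_rand_score(y, labels):.2f}")
```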

MAX 2-SAT with up to 108 qubits

Quick summary: If you read Shtetl-optimized, you’re aware of the ongoing drama surrounding the D-Wave quantum chip. There was a widely publicized paper showing some speed-ups in solving certain problems with the D-Wave chip. Those speed-ups evaporated when compared to state-of-the-art classical solvers. In the debate at Shtetl, Scott says:

In the comments, many people tried repeatedly to change the subject from [whether a speed-up has been demonstrated] to various subsidiary questions. For example: isn’t it possible that D-Wave’s current device will be found to provide a speedup on some other distribution of instances, besides the one that was tested?

That is basically what we were trying to explore in our paper. We have some knobs to tune the distribution of problems to see which ones are easier or harder. Can we reveal a difference in hardness for the classical and quantum solvers? The answer is a tentative yes, but there is more to be done here.
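
To give a flavor of what such a knob can look like (a hypothetical generator of my own, not the actual instance distribution from the paper), here is a random MAX 2-SAT generator where the clause-to-variable ratio alpha tunes hardness:

```python
# Hypothetical random MAX 2-SAT generator; the clause-to-variable
# ratio alpha is one "knob" that shifts how hard instances are.
import random

def random_max2sat(n_vars, alpha, seed=0):
    """Return int(alpha * n_vars) random 2-clauses; literal v means
    variable v appears positively, -v means it is negated."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(int(alpha * n_vars)):
        v1, v2 = rng.sample(range(1, n_vars + 1), 2)  # two distinct variables
        clauses.append((rng.choice((-1, 1)) * v1, rng.choice((-1, 1)) * v2))
    return clauses

def num_satisfied(clauses, assignment):
    """Count clauses with at least one true literal (assignment: var -> bool)."""
    return sum(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

clauses = random_max2sat(108, alpha=3.0)  # 108 variables, one per qubit
rng = random.Random(1)
assignment = {v: rng.random() < 0.5 for v in range(1, 109)}
print(num_satisfied(clauses, assignment), "of", len(clauses), "clauses satisfied")
```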

Other things

I went to the WIN workshop again this year. I talked about some recent ideas about detecting contagion. Although “latent homophily” has gotten a lot of discussion as a methodological problem, we showed that it can be overcome, while “external dynamics” provide a more fundamental (and frequently ignored) obstacle to detecting contagion.

Shuyang Gao gave an outstanding talk about “stylistic accommodation”. Have you ever noticed that when one person yawns, it makes other people yawn too? The question in this line of research is whether one person’s use of a particular style of speech affects the style of speech of people responding. Previous research had purported to find strong effects of this kind, but Shuyang showed that almost all of these correlations are due to confounding factors.

Finally, I’ll be in India in December, and I’m planning to visit the Tata Institute of Fundamental Research (in Mumbai) while I’m there.


Non-Parametric Entropy Estimation Toolbox (NPEET)

I finally got around to packaging together bits of Python code that are useful for estimating information-theoretic quantities for both continuous and discrete variables. This is in preparation for a tutorial I’m giving next week at ICWSM.

The documentation [pdf link] contains detailed descriptions along with a discussion of the technical challenges.
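
For a flavor of the interface, here is a minimal usage sketch (it assumes the entropy_estimators module and its entropy and mi functions from the repository; treat the documentation above as authoritative):

```python
# Minimal NPEET sketch: k-nearest-neighbor estimates of entropy and
# mutual information for continuous data. Function names assumed from
# the entropy_estimators module -- see the documentation for details.
import numpy as np
from npeet import entropy_estimators as ee

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))             # continuous samples, one per row
y = x + 0.5 * rng.normal(size=(1000, 1))   # a noisy copy of x

print("H(X)   ~", ee.entropy(x))  # Kozachenko-Leonenko entropy estimate
print("I(X;Y) ~", ee.mi(x, y))    # Kraskov-style mutual information estimate
```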


My AISTATS talk has been posted online.


If you’ve been reading science news, you’ve seen over the years that obesity, happiness, loneliness, and divorce are all contagious. Just a few weeks ago, I read that grades are also contagious. Certainly, it’s not surprising to find out that friends’ behaviors are correlated on all these fronts, but is that enough to say for sure that your friends cause you to become more obese/happy/lonely etc.?

Generally, the answer is no. Shalizi explains on his blog and in a paper with Andrew Thomas why this is the case. The basic idea is expressed in the picture below. It could be that Alice influences Bob directly, but it could also be (for example) that Alice has joined a Pie-Eating club and Bob has joined the same club. The Pie-Eating club explains why they both have a tendency to become obese and it explains how they became friends. The club, in this case, is a hidden variable that gives an alternate explanation for correlations in obesity.

influence or homophily

The problem is that it is easy to come up with more and more elaborate hidden variables that might explain correlations in, e.g., obesity, and we could never hope to account for them all. Is hope lost? Not quite.

It turns out that a similar problem arises in quantum physics. Einstein saw the “spooky action at a distance” implied by quantum physics and declared that it could not be: ultimately, there must be some hitherto unmeasured hidden variables that explain the correlations between distantly separated particles. Amazingly, Einstein was wrong, and John Bell demonstrated a simple test that is satisfied by any hidden variable theory but violated by quantum physics. What we do is extend this reasoning to correlations in social networks. We don’t want to account for every possible source of correlations (like Pie-Eating Club, yum); we want a test that tells us that no hidden variables explain the correlations, and therefore there must be some influence between friends.
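
For concreteness, here is the classic CHSH form of Bell’s test (standard physics, included for background): writing E(a,b) for the correlation between measurements made with settings a and b, every local hidden variable theory must satisfy

```latex
% CHSH inequality: obeyed by every local hidden variable theory,
% while quantum mechanics can reach 2*sqrt(2).
\[
  \left| E(a,b) + E(a,b') + E(a',b) - E(a',b') \right| \;\le\; 2 .
\]
```

Our tests play the same role for social networks: bounds that any hidden variable (latent homophily) explanation must satisfy, so that an observed violation points to influence.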

Bell inequality for networks

Everything else is just mathematical details. We show how to construct these types of tests in a general way. In the end, we were able to show that hidden variable theories do not explain the correlations in obesity and, therefore, some other causal effect is needed to explain what is going on. Of course, my language here is purposely cagey, as all causality researchers are:

You will have to see the paper for all the caveats, but I can make the bold statement that according to the best causal tests, obesity is contagious, probably.

I’m presenting this work at AISTATS; here is the paper, which builds on work I previously discussed.


Thanks to the magic of Mathematica 9, you can now make pretty pictures of your social networks with a single command. I was impressed by how well it automatically captured the true structure of my network, which I labeled and included below. I’m not sure what you can conclude about me as a person, except that I’m happy to route all socializing through my more extroverted wife. It’s probably also pretty clear which parts of my life are geographically separated.
Greg Ver Steeg's Facebook network
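
For the Mathematica-less, a rough Python equivalent (my sketch with networkx and matplotlib, using a built-in graph as a placeholder for real friend-network data):

```python
# Rough stand-in for Mathematica 9's one-liner: detect communities
# and color the nodes by community (karate club graph as toy data).
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
color = {n: i for i, c in enumerate(greedy_modularity_communities(G))
         for n in c}

nx.draw_spring(G, node_color=[color[n] for n in G],
               cmap=plt.cm.Set2, node_size=80)
plt.show()
```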


Black holes

21Mar13

In grad school I collaborated with Chris Adami on some papers about black holes. We suggested that some of the mysteries surrounding black holes may evaporate if you correctly include the effects of stimulated emission. Chris is presenting some of these results now at the APS meeting, and on his new blog. (paper)


Can we measure the effect of one person’s words on another person? I want to describe the main idea behind my recent paper, Information-Theoretic Measures of Influence Based on Content Dynamics, which I’m presenting at WSDM 2013.

Ostensibly, information theory is a dry, academic subject about how to send 0’s and 1’s through a noisy channel (like browsing the web using your cellular signal). The goal is to encode the information so that all the bits can be recovered after they are sent through a noisy channel. Shannon’s theory tells us that the amount of information we can send through the noisy channel is related to a quantity called “mutual information”.

noisy channel
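
As a toy version of Shannon’s setup (my example, not from the paper): a binary symmetric channel flips each bit with probability p, and the capacity, the maximum of the mutual information over input distributions, works out to 1 − H(p) bits per use:

```python
# Toy noisy-channel example: the capacity of a binary symmetric channel
# that flips each bit with probability p is 1 - H(p), the maximum of
# the mutual information I(input; output) over input distributions.
from math import log2

def binary_entropy(p):
    """Entropy H(p) of a coin with bias p, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5):
    print(f"flip probability {p}: capacity = {1 - binary_entropy(p):.3f} bits/use")
```

At p = 0.5 the output is independent of the input, the mutual information drops to zero, and no information gets through.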

What does this have to do with influence, human speech, or social media? This abstract framework is remarkably flexible. What if the input is a statement made by Alice? Then the “noisy channel” consists of (e.g.) sound waves, the eardrum, and the brain of Bob. Now Bob “outputs” some other statement. In the example below, Bob has said something very relevant to Alice’s statement: WSDM is in Rome, so Alice should definitely have some coffee while she’s there. Bob’s statement gives us some information about what Alice’s original statement was.

cappuccino

If, on the other hand, Bob had proclaimed his love of borscht, it’s not obvious that this has anything to do with Alice’s statement. Clearly, Bob lives in his own universe, day-dreaming of borscht. His statements carry no information about Alice’s statements.

borscht

How do we measure this notion of whether Bob’s statements carry information about Alice’s? We just use the standard information-theoretic notion of mutual information.

\[
  I(X;Y) \;=\; \sum_{x,y} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)}
\]

Unfortunately, this quantity depends on us being able to determine the probability of every pair of statements we might see Alice and Bob utter. In practice, at best we might have observed a few hundred statements from Alice and Bob (on Twitter, for example). We have to do two things:

1. Simplify our representation of these statements (or “reduce the dimensionality of the signal”). That means Alice’s statement is reduced to a few keywords describing what it might be about. (Some details: we use a popular technique called topic modeling to achieve this.)

2. Estimate mutual information using these simplified signals. (Some details: we actually use a more nuanced measure called conditional mutual information, or transfer entropy. We use non-parametric entropy estimators that avoid the step of estimating high-dimensional probability distributions. A sketch of this step follows below.)
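
Here is a toy sketch of step 2 on synthetic “topic” signals (my own stand-in data, and it assumes NPEET’s entropy_estimators.cmi function for conditional mutual information; the real pipeline works on topic vectors from actual tweets):

```python
# Toy transfer-entropy sketch: does Alice's previous "topic" help
# predict Bob's next one, beyond what Bob's own history already says?
# Assumes NPEET exposes entropy_estimators.cmi(x, y, z) ~ I(X;Y|Z).
import numpy as np
from npeet import entropy_estimators as ee

rng = np.random.default_rng(0)
n = 500
alice = rng.normal(size=(n, 2))      # Alice's topic vectors over time
bob = 0.2 * rng.normal(size=(n, 2))
bob[1:] += 0.8 * alice[:-1]          # Bob echoes Alice's previous topic

# Transfer entropy TE(Alice -> Bob) ~ I(Alice_past ; Bob_now | Bob_past)
print("TE(Alice -> Bob) ~", ee.cmi(alice[:-1], bob[1:], bob[:-1]))
print("TE(Bob -> Alice) ~", ee.cmi(bob[:-1], alice[1:], alice[:-1]))
```

With data constructed like this, the first estimate should come out clearly positive and the second near zero; that asymmetry is exactly what we look for between real users.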

Surprisingly, we were able to carry this procedure out for some users on Twitter and detect signals in human speech! One way to understand this result is to say that we could find user pairs where Bob’s statements could be predicted better if we knew Alice’s recent statements. It will come as no surprise that human speech is nuanced and complex, so we were only able to detect very predictable relationships. For instance, one triad of strongly connected users consisted of three trilingual sisters: if one of the sisters tweeted to the others in a specific language, the others would always respond in the same language. Other strong relationships included news dissemination and web site promotion. Can we do better and detect more complex forms of influence? Sure: we need more data, better representations of content, or better entropy estimators. We’re working on all three!


New preprint: http://arxiv.org/abs/1208.4475. The title is changing to: “Information-Theoretic Measures of Influence Based on Content Dynamics”. I’ll give a detailed, readable summary in a week or two. I’ll be presenting about this at WIN workshop, so please come!

For now, imagine the following problem. There are hundreds of people in a large room talking to each other. You can hear what everybody is saying, but you don’t know who’s talking to whom. How could you figure it out, based on what they are saying? If one person says something and another responds, we can usually tell that the response fits with the original statement somehow. We quantify this intuition using information theory. If we interpret both people’s utterances as arbitrary signals, then we can use very general information-theoretic tools to tell us whether one signal is predictable from the other, i.e., whether one person’s statements become more predictable once we know what the other person is saying.


Sadly, I was not born with a sensical one-word last name, leading to various problems throughout life. For googling purposes, the all-too-common misspellings:
GV Steeg
Greg Steeg
Greg VerSteeg


A busy month

14Mar12

Next month on April 9 I’ll give a talk for UC Irvine’s AI/ML seminar. The next week I’ll fly to Lyon, France for WWW (the World Wide Web conference). I’ll be giving a talk at a workshop before WWW called “Making Sense of Microposts”. Then I’ll present “Information Transfer in Social Media”, a paper with Aram Galstyan, at WWW.

In the meantime, I hope to finish a paper for UAI while continuing to teach “Physics and Computation”. After all that, perhaps a visit this summer to a research group in Singapore? More details on that later.