The Information Sieve
You have just eaten the most delicious soup of your life. You beg the cook for a recipe, but soup makers are notoriously secretive and soup recipes are traditionally only passed on to the eldest heir. Surreptitiously and with extreme caution, you pour some soup into a hidden soup compartment in your pocket.
When you get back to your mad laboratory, you begin reverse engineering the soup using an elaborate set of sieves. You pour the soup through the first sieve, which has very large holes. “Eureka! The first ingredient is an entire steak.” Pleased with yourself, you continue by pouring the soup through the next sieve with slightly smaller holes. “Mushrooms, of course!” You continue to an even smaller sieve. “Peppers, I knew it!” Since it is not just a laboratory, but a mad laboratory, you even have a set of molecular sieves that can separate the liquid ingredients, so you are able to tell exactly how much salt and water are in the soup. You publish the soup recipe on your blog and the tight-lipped chef is ruined and his family’s legacy is destroyed. “This is for the greater good,” you say to yourself, somberly. “Information wants to be free.”
This story is an allegory for my latest paper, “The Information Sieve”, which I’ll present at ICML this summer (and the code is here). Like soup, most data is a mix of different things and we’d really like to identify the main ingredients. The sieve tries to pull out the main ingredient first. In this case, the main ingredient is the factor that explains most of the relationships in the data. After we’ve removed this ingredient, we run the remainder through the sieve again, identifying successively more subtle ingredients. At the end, we’ve explained all the relationships in the data in terms of a (hopefully) small number of ingredients. The surprising things are the following:
- We can actually reconstruct the “most informative factor”!
- After we have identified it, we can say what it means to “take it out”, leaving the “remainder information” intact.
- The third surprise is negative: for discrete data, this process is not particularly practical (because of the difficulty of constructing remainder information). However, an exciting sequel will appear soon showing that this is actually very practical and useful for continuous data.
Update: The continuous version is finally out and is much more practical and useful. A longer post on that will follow.
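To make the "extract a factor, take it out, repeat" loop concrete, here is a rough sketch for continuous data. It is an analogy, not the method from the paper: as a stand-in for the "most informative factor" it uses the leading principal component, and as a stand-in for the "remainder information" it uses the residual after linearly regressing each variable on that factor. The function names (`extract_factor`, `remainder`, `sieve`) are made up for this illustration.

```python
import numpy as np

def extract_factor(x):
    """Stand-in for the 'most informative factor' (an assumption of this
    sketch): the scores of the leading principal component."""
    xc = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[0]

def remainder(x, y):
    """Stand-in for 'taking the factor out': subtract from each column
    the part that is linearly predictable from the factor y."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    coef = xc.T @ yc / (yc @ yc)  # least-squares slope for each column
    return xc - np.outer(yc, coef)

def sieve(x, n_layers):
    """Pour the data through n_layers sieves, one factor per layer."""
    factors = []
    for _ in range(n_layers):
        y = extract_factor(x)
        factors.append(y)
        x = remainder(x, y)  # the next layer only sees what is left over
    return np.column_stack(factors), x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "soup": five observed variables mixed from two hidden ingredients.
    sources = rng.normal(size=(500, 2))
    mixing = rng.normal(size=(2, 5))
    data = sources @ mixing + 0.1 * rng.normal(size=(500, 5))

    factors, rest = sieve(data, n_layers=2)
    print("variance before:", data.var(axis=0).sum())
    print("variance left over:", rest.var(axis=0).sum())
```

In this toy setup, two layers should soak up most of the shared structure, leaving a remainder that is linearly uncorrelated with both extracted factors.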
Posted by Greg Ver Steeg