exponentially surprised | theoryblog


First ArXiv Paper and a New Title!

Hey everyone! Check out my first arXiv paper, Mapping Between Natural Movie fMRI Responses and Word-Sequence Representations, to be presented at the NIPS 2016 Workshop on Representation Learning in Artificial and Biological Neural Networks. This work is the culmination of my senior thesis and summer research at Princeton CS and Neuroscience. I will be making some updates soon (more results and experiments), so stay tuned.

Secondly, I’ve finally replaced the placeholder title for this website. I really like the phrase “exponentially surprised”: it’s a reference to an intuitive way of thinking about the information-theoretic quantity entropy as “surprise”. The more entropy a system has, the less sure you are about what it will output next, and thus the more surprised you are. A uniform distribution is maximally surprising, while putting all the probability mass on a single point is completely unsurprising: you know exactly what will happen.
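To make the surprise picture concrete, here’s a tiny Python sketch (a toy example of my own, not from any particular library) computing entropy as expected surprise for the two distributions just mentioned:

```python
import math

def entropy(p):
    """Shannon entropy in bits: the expected surprise -log2 p(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

uniform = [0.25] * 4          # maximally uncertain over 4 outcomes
point_mass = [1.0, 0, 0, 0]   # completely certain

print(entropy(uniform))       # 2.0 bits: every outcome is surprising
print(entropy(point_mass))    # 0.0 bits: no surprise at all
```

The uniform distribution attains the maximum entropy (log2 of the number of outcomes), while the point mass attains the minimum, zero.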

Now what is “exponentially surprised”? Well, it’s being perplexed! Perplexity is exactly the exponential of entropy: 2 raised to the surprise, measured in bits. I find myself constantly perplexed by strange things (like how ridiculously tightly the maximum eigenvalue of a random matrix concentrates), so it seems like a good fit. Perplexity is also a common measure of language model performance in natural language understanding. If a model is constantly perplexed (in terms of its probability model) by the occurrences of words in language, it’s clearly not a good model. The more the language model understands about the conditional distributions of words given context, the less perplexed it will be, and thus the better the model.
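As a sanity check on the “exponential of surprise” reading, here’s a toy Python sketch (the numbers and names are made up for illustration) computing perplexity as 2 raised to the average per-word surprise a model incurs on a sequence:

```python
import math

def perplexity(probs):
    """Perplexity = 2^(average surprise), where surprise is -log2 of the
    probability the model assigned to each observed word."""
    avg_surprise = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_surprise

# Probabilities a hypothetical model assigned to the words it observed:
confident = [0.5, 0.5, 0.5, 0.5]  # like guessing between 2 words each step
clueless = [0.1] * 4              # like guessing among 10 words each step

print(perplexity(confident))  # 2.0
print(perplexity(clueless))   # ~10.0
```

A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k words at each step, which is why lower is better.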

This setup is not quite right for the case where you want to build a generative language model, though, since you don’t want to generate incredibly predictable things all the time (i.e. let’s NOT output the maximum likelihood at every time step, since the models we train are too weak for the maximum likelihood to actually be the right thing). For instance, in dialogue generation models (see the amusing examples in this paper), we want to avoid getting stuck in boring optima where two chatbots repeatedly tell each other that the other doesn’t know what it’s talking about. You in fact want to be a little perplexed by what someone says to you; you hope for new information in any exchange. So there is some tradeoff here which is not completely clear.
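The greedy-versus-sampling tension can be sketched in a few lines of Python; the toy next-word distribution below is made up for illustration:

```python
import random

# Next-word distribution a hypothetical dialogue model might produce:
next_word = {"i": 0.4, "don't": 0.25, "know": 0.2, "pineapple": 0.15}

def greedy(dist):
    """Always emit the maximum-likelihood word: predictable, often boring."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Sample from the full distribution: occasionally surprising."""
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs, k=1)[0]

rng = random.Random(0)
print(greedy(next_word))                           # always "i"
print([sample(next_word, rng) for _ in range(5)])  # a mix of words
```

Greedy decoding collapses to the same safe response every time, while sampling lets less likely (more surprising) words through; much of the tradeoff above is about where between these two extremes to sit.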

The original title of this website was “Representing Things”, not just language. Perplexity can be considered a measure for any generative unsupervised model: the goal, in some sense, is to represent a summary containing all the information necessary either to reproduce the thing we were trying to represent (autoencoders) or to generate from something close to its true distribution (the generative setting). I suppose you could think of the probabilistic generative model setting as “autoencoders for probability distributions”. Surprise and perplexity therefore measure the analogue of minimum description length (compare with coding theory) or low rank (compare with linear-algebraic approaches) for this setting.

Anyways, that’s the explanation for the new title, as well as some ideas I’ve been thinking about recently toward a framework for unsupervised learning (some potentially interesting theory on “linear autoencoders” is in this NIPS 2016 paper by Elad Hazan and Tengyu Ma at Princeton). It would be particularly interesting to extend such a framework to the generative setting: log-linear models (as in the word vector paper by Sanjeev Arora et al. at Princeton), energy-based models, and generative adversarial networks all share the notion of an autoencoder-like reconstruction of a true probability density function, with a self-conflicting objective when generation is also desirable. It would be great to formalize this notion in a contained framework which handles both representation and generation: minimize surprise, yet be consistently perplexed.

Presentations at ICML 2016

Update: Check out the presentation I gave in the projects section of my website!

Well it’s been a year since my first post, so let’s do another post today in the hopes that I’ll actually start writing things this summer!

Over the past few days I’ve been attending ICML 2016, and it’s been pretty interesting so far. I’m presenting a poster and a talk for the Multi-View Representation Learning Workshop on Thursday, so I’ve pretty much been preparing for that (poster, presentation and revisions) and attending sessions. I’ve kind of neglected the poster sessions, so I’ll probably spend more time wandering around the posters and engaging people tomorrow.

There were some really crystal-clear presentations in the Learning Theory and Continuous Optimization sessions today, which I quite enjoyed. On the other hand, I felt most of the Deep Learning sessions fell kind of flat, with a few exceptions. Perhaps one issue with Deep Learning papers is that they’re not really suited to the presentation format unless something truly novel is going on in the architecture. Too often the punchline of a deep learning paper seems to be that some slightly modified technique was tried and it improved upon the baseline on some standard dataset/task by a small margin. That isn’t necessarily a bad thing, but it does make for difficult talks, since there’s no concrete punchline you can get at without getting dirty with the details - and talks are not typically the place to get dirty with the details. Of course, there are exceptions to this rule in my mind; for instance, sequence-to-sequence learning felt like a genuine innovation compared to previous work in the area.

Anyways, it’s been pretty interesting attending, as this is my first machine learning conference. I also really liked Fei-Fei Li’s and Dan Spielman’s talks today; both were particularly good presenters in terms of clarity and command. One advantage of attending most of the talks is that you start to see how to improve your own delivery, which is an important skill pretty much anywhere, though particularly in academia. One thing my advisor has reminded me of is the importance of avoiding too many words in presentations and posters, which is something I’m trying to work on. Coming up with the right figures to clearly deliver a point is pretty challenging at first!

So, now to talk about the future. I’ve been sitting on several ideas for blog posts for the past several months (probably more like 9 months). I plan to get to them sometime soon, after organizing all my computer/internet clutter so that I can be more productive. Here’s a list of potential topics:

  • What is meant by “Machine Learning”? An overview of the fields I think overlap with it.

  • Machine Learning courses in colleges.

  • A list of reading I’m doing this summer.

  • Machine Learning as Science++: Various positions on what constitutes truth.

  • Discussion of the current state of NLP/NLU: what I believe, what I’m skeptical of, and what I plan to test and implement with my friend Holden Lee.

  • Discussion of my current research and perhaps some fun visualizations.

  • Some technical exercise? I need to practice technical writing skills… This is supposed to be my theory blog after all.

MathJax Tests

First post, woo hoo! This blog will be roughly about my thoughts about research and maybe other things, I haven’t really decided yet. Probably some notes and projects.

In the meantime, here are some tests to ensure MathJax is working.

\begin{equation} x_{t+1} = \prod_{\mathcal{K}} \left(x_{t} - \eta \nabla_t\right) \tag{1} \label{eq:OGD} \end{equation}

Equation \eqref{eq:OGD} is the online gradient descent update. See \eqref{eq:vandermonde} for a matrix. Consider \(x, y \in \mathbb{R}\): Then, \(x + y \in \mathbb{R}\) as well (this math is inline).
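Equation (1) can also be sketched in code. Here’s a minimal Python sketch, assuming for concreteness that the feasible set \(\mathcal{K}\) is the Euclidean unit ball (any convex set with a computable projection would do); the function names are my own:

```python
import math

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius (the set K)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return x
    return [xi * radius / norm for xi in x]

def ogd_step(x, grad, eta, radius=1.0):
    """One online gradient descent update: x_{t+1} = Proj_K(x_t - eta * grad_t)."""
    return project_ball([xi - eta * gi for xi, gi in zip(x, grad)], radius)

x = [3.0, 4.0]                         # starts outside the unit ball
x = ogd_step(x, [0.0, 0.0], eta=0.1)   # zero gradient: the step just projects
print(x)                               # [0.6, 0.8], which has norm 1
```

With a zero gradient the update reduces to pure projection, which is a handy sanity check: the point [3, 4] has norm 5, so it lands on the ball at [0.6, 0.8].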

Here’s the Vandermonde matrix:

\begin{equation} \begin{pmatrix} 1 & a_1 & {a_1}^2 & \cdots & {a_1}^n \\ 1 & a_2 & {a_2}^2 & \cdots & {a_2}^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & a_m & {a_m}^2 & \cdots & {a_m}^n \end{pmatrix} \tag{2.1} \label{eq:vandermonde} \end{equation}

Here’s some matrix multiplication:

\begin{align} \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi} \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} &= \begin{pmatrix} u \\ -v \end{pmatrix} \tag{2.2} \\
&\equiv \nonumber \\
\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} &= \begin{bmatrix} x \\ -y \end{bmatrix} \nonumber \end{align}

And finally, we have the Cauchy-Schwarz inequality:

\begin{equation} \left( \sum_{k=1}^{n} a_k b_k \right)^2 \leq \left( \sum_{k=1}^{n} a_k^2 \right) \left( \sum_{k=1}^{n} b_k^2 \right) \tag{3} \label{eq:cauchyschwarz} \end{equation}

Here’s a useful link to some MathJax tricks.