Notes from MIT Sublinear Workshop (June 10-13)

See website and schedule.

I’m trying something new out today and taking notes directly in Markdown and posting directly to my website instead of compiling a pdf with LaTeX. Let’s see how it goes! Something I’m already missing is my macros in LaTeX… need to see how to add that to Markdown. Also things like writing definitions etc. I’m sure there are ways to add TeX style files to Markdown/MathJaX.

Madhu Sudan: A few things struck me — we do have a mix of disciplines sitting around here. I wanted to quiz our organizers: What led to this particular choice? In terms of what you’d like to learn, and in terms of what we have learned already from interacting with each other, maybe some technical results. I would also love to hear how are we different in terms of philosophies. We all define things very differently and that’s one thing we can learn from each other. We can also gripe a bit and see how our communities’ results are badly misunderstood by everyone else.

I’ll start with Philippe: Why do you think we are sitting together as a group?

Philippe Rigollet: As a disclaimer, I have an answer to all your questions, except it might be yes or no. I will start by saying I’m a statistician, I’m an interloper in TCS. I think there are truly interesting ideas coming out of TCS. I’ve had several of you ask me how to get statisticians to understand that what I’m doing is new and interesting. I think I have more acknowledgements in the Annals of Statistics than actual papers. There are some philosophy differences. My impression is that computer scientists are excellent problem solvers — you’ll solve the hell out of a problem using every assumption in the paper; if you did not provide assumptions, that’s the problem you’ll be solving. Statisticians are different. What are the big open problems in stats? There are none! Statisticians solve the problem before they state it – they have intuition something will work, then they reverse-engineer the model and make it work. For instance, the SBM – modeling robustness etc. is hard. On the other hand you have Gaussian assumptions; the moment you use the 68th moment, a statistician is going to be concerned.

David: I had an identity crisis: I would go to conferences in different fields: CS, math, operations research (my degree), and never feel at home at any one of them. I feel like a constant misfit in any of the fields, as long as the fields are well-defined. Maybe that’s a strength, not a weakness! The most interesting things I came up with, in my judgement, came not just when I was thinking very hard about one problem, but when I learned about something in a completely different field and realized there’s a connection. This workshop (we should give big credit to Ronitt) is exemplary of that. Only when you feel like a misfit do you feel pushed to learn something. We have people from CS (the majority here, which is great), statisticians, mathematicians; we had a wonderful bootcamp talk by Yufei on combinatorics, and I think that’s great. I personally enjoy coming to workshops where there is variety, where I will learn something, where I feel like a misfit.

Ronitt: I want to say sorry about the poster session, it was really going strong when I had to stop it! Now, I think we actually do share a lot in techniques. Maybe the results and models are not quite the same (we have so many it’s not clear), but techniques we do share a lot! People cared about other people’s posters because they were similar enough! The combination of what should be in this workshop took shape gradually. There was a whole panel of organizers; they suggested a lot of names and directions; it was a joint effort all the way. Back in 2000, the idea of having sublinear algorithms and combining sublinear space and sublinear time seemed natural. Already there were mathematical and coding theory questions, graph problems were considered, sketching wasn’t quite there yet; this community stayed together, but there were crazy things in the techniques! Goldreich and Ron wanted to distinguish graphs which are expanders from those which are not. We looked at this and wanted to say something about mixing properties. Then we started testing uniformity, arbitrary distributions, etc. These were statistics questions, and we went around asking the statisticians, but we didn’t know Philippe back then! No one we talked to thought it was interesting! We got a lot of negative feedback from the statistics community and did not find the right people at that time. There were a lot of relevant techniques later on (Costis etc.) and more and more it became clear that there was a snowball effect. Now it’s getting larger.

Devavrat: The zeroth-order answer is that we are here because of Piotr leading MIFODS. Personally, I did my undergrad and Ph.D. in CS, my advisor in EE was doing probability, and I did a postdoc in statistics. Like David, I feel a constant identity crisis, and this is exciting. We are here because there are a few classes of questions which have brought together different communities: coding (complexity), communication (different reasons), applied math, mixed techniques, graphical models, etc. I grew up with questions where we were looking at belief propagation algorithms and why they’re robust. Those algorithms are fundamentally local, even though they were not designed with that as a goal. In 2009 a student of mine was working on questions for graphical models that related to local property testing. I think these questions are coming up again: matrix estimation, graphons, community detection. Everyone is coming from a different angle, there are commonalities amid the differences, and we are in a great melting pot.

Rachel: I also share the sentiment of feeling like a misfit: undergrad was math, Ph.D. was applied math (advisor Ingrid Daubechies). I liked the way she found problems in diverse areas and tried to solve interesting problems using mathematics. My background is applied harmonic analysis, I dabbled in various areas like statistics, ML, and optimization, and now I’m in a math department. I really don’t know why I’m here, actually! There are great people here, and I feel comfortable not sitting anywhere in particular. Looking at the history of statistics on its own (now there’s a lot of robust high-dimensional stats; maybe that highlights that stats was overly focused on low-dimensional problems for a long time, probably since a community wants to fine-tune things that work and solve problems already going in that direction), it’s interesting to see what happens when we have to start over. You can say the same about theoretical computer science: very good at solving problems, with lots of open problems, which is interesting. The whole idea of being an applied mathematician is to formalize problems; if they’ve already been stated, then it’s a pure math problem!

Madhu: I completely subscribe to that last point! P = NP is a pure math problem, completely formalized. Anything further to say?

Rachel: A lot of problems are similar, just using different notation, assumptions, and focuses; if there were more talking, that would be realized more. Real talking: one-on-one, nitty-gritty details, to see if there are more similarities than we know.

Yihong: I am in a statistics and data science department. Property testing and sublinear algorithms form an area where statistics and computer science interact quite well and fruitfully. Sublinearity is not well understood in stats, especially in high-dimensional problems. If you have $$d$$ parameters, you need $$d$$ samples — this is the rule of thumb. You can do better if you have structure (sparsity), and also without sparsity when you only care about certain properties of things: if you only want to know the length of a high-dimensional vector, you can do it with $$\sqrt{d}$$ samples ($$d$$ if you want to know the whole vector). This shows up in distributed testing all the time, and in stats all the time, but we don’t know in general when sublinear is possible. In TCS formulations, people know much more. I think this workshop is a good opportunity for people to exchange ideas, learn techniques, and learn about neighboring directions.
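(My aside, not from the panel: a toy simulation of this claim. With $$n$$ samples from $$N(\theta, I_d)$$, the estimator $$\|\bar{X}\|^2 - d/n$$ is unbiased for $$\|\theta\|^2$$, and roughly $$\sqrt{d}$$ samples already pin it down to small relative error. Everything below, including the constants, is my own illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
theta = np.ones(d)                 # unknown mean vector; ||theta||^2 = d

def estimate_sq_norm(n):
    """Unbiased estimate of ||theta||^2 from n samples of N(theta, I_d)."""
    X = rng.standard_normal((n, d)) + theta
    xbar = X.mean(axis=0)
    # E||xbar||^2 = ||theta||^2 + d/n, so subtract the bias term.
    return float(xbar @ xbar - d / n)

# Only ~sqrt(d) samples, far fewer than the d needed to learn theta itself.
n = int(np.sqrt(d))
est = estimate_sq_norm(n)
rel_err = abs(est - d) / d
```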

Piotr: I’m at the end of the line, all the good ideas were already taken. In the registration, you were asked what is your field. We have a very skewed distribution towards computer science. About 90 people who mentioned CS, 40 who mentioned statistics, and 8 mentioned mathematics. On the other hand, we are in the math department so it compensates a bit. We are trying to get people from all the fields, with special emphasis on computer science and statistics. More generally, I think that research is easier to do these days than twenty years ago. Trying to understand different fields was more like an adventure. These days we have the internet, conferences like NIPS and ICML which are huge mixes of people, and there’s interests in similar problems. So there’s motivation and logistical easiness for interacting and exchanging ideas. This creates opportunities; compressed sensing is a classic example – attracted attention from everyone: math, stats, ML, cs. Best way I know to get new ideas is to get people together and make things happen.

Madhu: Now let’s turn it to the audience.

Q1: I’d like to know, from the pure and applied sides, what the main challenges of sublinear algorithms are for us to be talking about. In stats, high-dimensional statistics is at the forefront; what about beyond that?

Philippe Rigollet: First thing is, when I think of property testing, it’s testing whether a distribution is like this or like that. Usually you separate along some functional of your distribution: you analyze it, ask whether it is bounded away from zero, etc. I asked Koltchinskii — people are moving into learning functionals. Yihong is an expert. There are also ideas from Nemirovski. I was asking Koltchinskii: why? He said people are tired of high-dimensional slow rates, they want to learn fast things! Even for functional estimation we don’t have the right answer. Polynomial approximation of functionals is dual to moment-matching lower bounds. But why polynomials? Why that basis? Are things tailored to polynomials without us knowing? (e.g., why not cosines?) There are much higher-level questions, more remote than ‘‘what is the rate of estimation?’’. That’s at a higher level of what the questions are. We don’t understand random forests, or why deep nets generalize. There are big questions. I could come up with some, but I don’t think I could convince you they are interesting in the small amount of time allotted to me.

David: (talking about local algorithms in random structures) ‘‘We take care of the issue of measurability in lecture 3 and then move on! But this paper was talking a lot about measurability, it was the main thing!’’ Interesting problems in sublinear algorithms mean many things in many fields — it’s a cross-disciplinary concept anyway.

Piotr: I’ll wear my applied hat for a second. I find this fascinating: sketching, streaming, compressed sensing, etc. Whenever we prove theorems, we use random objects: random Gaussians, random hash functions, etc. We saw multiple talks about this today. Random measurements, hash functions, etc. are wonderful and work well in practice. But over the last few years we’ve seen that if you know your data, you can do better than random. You can learn: design the hash function, the projection, the sample you take. Especially now that there are lots of machine learning tools, I think figuring out how to use these methods to improve over random is a fascinating challenge. Another challenge is how to say anything meaningful about it. I guess if you’re using deep learning to design a measurement matrix, you can’t say anything until you solve deep learning, so that might be a hopeless question. But I think this is generally a fascinating direction.

Ronitt: I’d tell you my favorite open problem, but if I tell you, I’d have to shoot you.

Piotr: No, we wouldn’t want this.

Ronitt: Sometimes you stumble across something. If you’re into property testing of graphs: what if you’re not in the sparse case, and not in the dense case? Different sets of degrees? Is there any other interesting problem that’s not constant-time testable, or graph isomorphism, or needs a linear number of queries? Yes there is, but it’s not a natural problem (Oded, etc.). Also with distributions: are there other distribution properties we care about beyond entropy, number of distinct elements, etc.? What else do we care about? What properties would actually help people if we could find these things in the data? I would give talks and people would ask: did you implement this? Who cares about this? And I would say no, and I don’t know. But this is changing; people are paying more attention. Where in real life could we find these questions? Maybe there are ideas for other properties to test, other things to look at.

Devavrat: These are distributions you’re testing with independent samples — what if the samples come in a structured manner? Or there’s a graphical model? What variations would that introduce in your understanding?

Costis: When you’re sampling from a graphical model, or...?

Devavrat: Some form of hypothesis testing: Distribution as a whole is given, you’re sampling independently…

Costis: If the samples are not i.i.d. but weakly correlated, you can extend a lot of techniques – Dobrushin’s condition and other weak dependence conditions let you extend current results. If your distribution of interest is a graphical model (Devavrat: But you see partial observations). I haven’t thought about that — a high-dimensional distribution where you get one sample from it: testing from one sample is a super interesting problem.

Madhu: I think there was a result by Servedio and O’Donnell from a while back — you can improve upon learning if you hold a large number of variables constant and only vary a few; you can study those few much better.

Moitra: We don’t know better than the trivial algorithm for learning DNF, but once the examples come from a random walk on the hypercube, you can do it in polynomial time.

Madhu: Good, that’s exactly the result I was referring to. Hardness on average is not related to this workshop per se, but it underlies what we think about constantly. Philippe used hardness of planted clique to infer hardness of other learning problems. These techniques we really need to build on. Can you find a clique of size larger than $$\log n$$ in a random graph? We have no clue how to answer it — we need to understand it a lot better, and if we can’t, we can just build cryptosystems based on this!

Moses Charikar: What are your favorite tools and techniques that you wish would be better exported?

Yihong: Something like the notion of duality, which can be made very operational. Writing a problem as a maximization problem and taking the dual somehow becomes an achievable algorithm. If you want to do worst-case analysis in statistics, you work with the least favorable prior. Primal = strategy, dual = worst-case distributions. Philosophically interesting and operationally implementable. This is pretty cool.

Rachel: That’s what I was going to say! Let’s come back to me.

Devavrat: In the context of this workshop: looking at the local structure of random objects like random graphs. It’s super useful for establishing things you may not expect. For example, on a random graph, you’re at a node and broadcast, and others get a noisy copy; if you go too far, you forget where you started, but if you stay too close, you’re in a very local neighborhood. How far do you go to balance? $$\sqrt{n}$$, with a small window, which is remarkable.

Rachel: I would say looking at continuous limits of discrete problems. That could be emphasized more in theoretical computer science. Sometimes certain scaling issues matter. Take rates for total variation minimization in compressed sensing: the problem became much easier when we looked at the continuous limit of doing total variation on $$L_2$$ functions, because there were tools there. The same goes for gradient and stochastic gradient descent (ODE limits give a better understanding of momentum and step size).

Moses: I thought we get to ask the questions!

David: I won’t zoom in — if I had to, I’d say method of moments…

Phillipe: That was mine!

David: Or the probabilistic method. It’s incredible how well it works, and how the method of moments is built in under different names like union bounds. Orthogonal to that, there’s a community not represented here: statistical physics. It has featured in many presentations. A lesson I learned from them: before you solve the problem, try to understand it first. Let’s not try to find the best algorithm right away, but instead understand the problem from different perspectives. I like that structural philosophy. Before we rush to propose different methods that work well, understand the problem.

Philippe: Ok, we don’t have the exact same definition of the method of moments. I’m thinking of the statistics definition, where the idea is to compute moments and solve for the parameters. You need to go to order-3 moments at least, and it requires a lot of careful analysis and technique. For instance, the tensor stuff uses this a lot in latent variable modeling, where you don’t know the assignments. The go-to method is always the method of moments, even though in statistics, in practice, everyone will go for maximum likelihood. The behavior of maximum likelihood is better for finite samples, and the variational formulation is more robust, yet every time we want computational efficiency we have to go to the method of moments! Everyone is interested in nonconvex optimization right now, but we’re using the wrong tools. We need better tools... right now it’s like the five best grad students in the country who can do this stuff, but we are definitely using the wrong tools, we don’t understand this stuff. It’s like the Greeks trying to solve modern-day problems... hey! I’m not saying anything about Costis! For example, Costis has a paper on EM that uses these techniques that’s going in the right direction. But I’m interested in whether we can get sharp non-asymptotics.
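(An illustration of my own of the statistics-style method of moments, in its simplest form: recover the two parameters of a Gamma distribution by solving the first two moment equations. The order-3 tensor machinery Philippe mentions is much more involved; this is only the one-dimensional warm-up.)

```python
import numpy as np

rng = np.random.default_rng(1)

# True Gamma(shape=3, scale=2); the method of moments recovers both
# parameters from the first two empirical moments.
shape_true, scale_true = 3.0, 2.0
x = rng.gamma(shape_true, scale_true, size=200_000)

m = x.mean()           # E[X]   = shape * scale
v = x.var()            # Var[X] = shape * scale^2
scale_hat = v / m      # solve the two moment equations
shape_hat = m * m / v
```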

Ronitt: Ok, I thought of something. The regularity lemma result we saw about graphs was very interesting: we can say all large graphs can be approximately described by constant-size graphs. Another general area that’s interesting is distributed local computation, where you talk to a constant number of neighbors in a round — that’s really local algorithms. They have developed a whole machinery. There are lots of beautiful ideas there that should be mentioned.

Madhu: I want to thank the organizers.

Interactively Visualizing Large Datasets using Sketches (Parikshit Gopalan)

Introduction

I will be talking about something which involves some theory, and a lot of coding. I think they may have been expecting ‘‘coding theory’’...

This talk is about taking some work which has been done and finding a nice application. The problem we’ll discuss today is that of visualizing a really large dataset. If you have a small dataset, it fits on your computer, and this is a solved problem: there are lots of good solutions for visualizing the data. The problem is more challenging when your data is spread across a bunch of different machines; the complexity goes up several notches. MapReduce and Spark have come into existence, but you need a higher level of expertise to write a Spark job versus looking at a bar chart in Excel. These are also distributed jobs, and they take a lot of time. They introduce a lot of latency, making the business of interactivity a lot more challenging. Also, maybe your data is growing larger and larger; presumably your screen isn’t.

Demo

So let’s discuss the problem of visualizing a large, tabular dataset. This is data in tables with billions of rows — you can’t use Excel for this. We are trying to lower the bar to using such a thing. It should be as simple as Excel, just point and click. Hillview is a big-data spreadsheet. I’ll do something unconventional for a theory talk and start with a demo. The thing about demos is that when they work well, they’re very entertaining, and when they fail, it’s hilarious. We will look at a dataset of all flights within the US (129 million rows); our software is running on 20 machines, each with 25 GB RAM and 4 cores. Let’s look at the demo. I’ve loaded the data already. First, I’ll show you nothing — this is an important principle — we’ll show you nothing you didn’t ask to see. [screen doesn’t display the computer] Oh! You truly are seeing nothing. One second. [shows Hillview, with a sideways histogram for each of the airline carriers] You can then expand out by origin city, for instance; it recomputes what it ought to show. You can do something like frequent elements. You can do a bunch of things in a tabular view. Let’s look at flight date and look at a histogram. This plots a histogram and its CDF. You can also very easily filter the data: it will replot the histogram for the particular months you select.

There’s a headlong rush to do machine learning, but before you do learning with the dataset, you might want to do some understanding of it — you want to know what features are correlated, etc. Let’s look at color-coding the histogram over time by the fraction of flights which are late. You can also normalize the heights of the histogram bars to compare how delays occur over time. This helps you look for trends in the data. Another thing which is easy is looking at outliers. You can look at flights which last 1000 minutes or more, return to the table view, and look at where they’re going to and from, who the carrier is, etc. Then we can find the heavy hitters on this — aha, it’s United! Everything is mundane and easy, except this is running on a 100-million-row dataset. Even if you had 10 billion rows, the delay would only be 10-15 seconds. And it requires much less effort than Spark etc.

Let me show you what is under the hood of Hillview. First, we show exactly what the user asks for and no more. This is a standard principle in visualization and graphics; we take it one level down, to the underlying system. We do lazy computations: we try to do the minimum amount of computation needed — what you see is what you get, and we do nothing more than what you ask us to do. Approximation is a big part of what we need. You’re going to have to quantize histogram plots anyway; if there is already going to be an error, there is no point in computing answers to larger accuracy than that. This sounds like a minor technicality, but it makes a huge difference computation-wise. Also, parallelism: we are never going to move data around. The computation goes to where the data resides; we don’t move the data around in our system. Everything we do in Hillview is a sketch: it’s basically a distributed sketching system. Any rendering you want to display admits a sketch which has both low computation and low communication cost.

What is a rendering? Nothing but a function from tuples of rows to some object which can be displayed on your screen. The screen is finite, so what you’re trying to compute is of constant size. So here’s our model of computation. We distribute the input among a bunch of workers. Workers sketch their input using sampling/streaming; we’d like the size of the sketch to be proportional to what we’re trying to display, and we also want the workers to be efficient, which is why we sketch via sampling/streaming. We then merge the sketches at the end.

How to implement new functionality in Hillview? First, implement the sketch function — then, write the second function which tells you how to put two different sketches from different machines together. Finally, you need to write code to produce a graphical display out of it.
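(My own toy sketch of this three-part recipe, not the actual Hillview API: a per-worker histogram sketch, an associative merge, and whatever rendering you like on the merged result.)

```python
from functools import reduce

def sketch(shard, lo, hi, nbuckets):
    """Per-worker sketch: a fixed-width histogram of one data shard."""
    counts = [0] * nbuckets
    width = (hi - lo) / nbuckets
    for x in shard:
        b = min(int((x - lo) / width), nbuckets - 1)
        counts[b] += 1
    return counts

def merge(a, b):
    """Combine two sketches. Merging must be associative so workers can
    finish in any order — this is what lets results from stragglers
    trickle in incrementally."""
    return [x + y for x, y in zip(a, b)]

# Simulate three workers, each holding a shard of the data.
shards = [[1, 2, 2], [3, 8, 9], [5, 5, 9]]
sketches = [sketch(s, lo=0, hi=10, nbuckets=5) for s in shards]
hist = reduce(merge, sketches)   # this is what gets rendered
```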

You can also display results incrementally. When your cluster gets large, there’s a notion of stragglers. With our model, the messages get aggregated as workers finish their computation, so you’ll see something right from the get-go.

Limitations

Now, we’re not doing a full SQL database; for that you can use something like Tableau. If you need to do joins, do them somewhere else and then visualize the results using Hillview — there’s no real hope of doing joins in real time. Hillview is also less powerful than Spark and MapReduce: those do not really fit the streaming model, they move things around, which we have said we do not want to do. Our hypothesis is that the subset of database functionality needed for interactive visualization is exactly sketching and streaming.

Background

Now I want to point out there’s a lot of work on related topics. Sampling to estimate the size of database queries in fact started the whole sketching/sampling/streaming trend. When you’re trying to build a visualization, you should tune the sampling weights so that the result is indistinguishable from having done the full work. Back when the area started, though, 40,000 data points was a lot and screens were 200 pixels; so it is worth revisiting today, when things are much more feasible.

This also fits into a long line of systems work at Microsoft SVC. There also was work by Budiu, others, and Andoni called sketch spreadsheet — but this was lost when Microsoft closed that lab down. So we decided to get together and rebuild it, and this time we would make it open source. So we put it on GitHub. In unrelated news, Microsoft then bought GitHub.

Scrolling

Consider the problem: there are $$h$$ vertical pixels, but a billion rows. You don’t have room to scroll to any particular row, so how do we convey the information? What does a pixel mean? Each pixel represents a single quantile. Our goal when we scroll is to show you consecutive records from the same quantile; we are not picky about where in the quantile we start. We do this in two steps. First, an approximate quantile query based on where you scroll to on the page: you can do this by sampling $$\mathcal{O}(h^2)$$ rows. E.g., if you drag your mouse 0.39 of the way down the screen, the screen should show you things from roughly that quantile. So this gives you a single row. Given the top row you just calculated, you need to fill up the rest of the screen. This is also easy to do in a single pass: each machine computes the next twenty rows it sees locally, and then we merge. This only gets you some subset of the table, but when you’re scrolling you’re just trying to get a sense of the data.
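(A rough sketch of the first step as I understood it; the function names and sizes are my own, and a real implementation would sample across machines rather than from one list.)

```python
import random

random.seed(0)

def scroll_anchor(data, frac, h):
    """Approximate-quantile scroll: sample O(h^2) rows, sort the sample,
    and return the value at the requested scroll fraction.
    `frac` is how far down the scrollbar the user dragged (0.0-1.0)."""
    sample = random.choices(data, k=h * h)   # with replacement is fine here
    sample.sort()
    return sample[min(int(frac * len(sample)), len(sample) - 1)]

data = list(range(1_000_000))   # stand-in for a billion-row column
h = 50                          # vertical pixels on screen
anchor = scroll_anchor(data, 0.39, h)
# `anchor` lands near the 0.39 quantile, i.e. around 390_000
```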

Histograms

How about histograms? We need to figure out what the right bucket boundaries are. For numeric data, just figure out the min and max, and then take equally sized buckets. If it’s ordinal data (e.g., strings like names of cities), there’s an ordering but no metric; equal widths don’t make sense, though min and max still do. If I wanted to artificially impose a metric, I might group things together by first letter, but this is a pretty bad solution — in general it’s not good to impose a metric where there isn’t one. A more reasonable thing is to look at how many keys you have: each bucket will contain an equal number of keys. How do you compute the bucket boundaries? They are really quantiles again, but over the set of distinct keys that appear in the column. We only care about the number of distinct values — this tells you how to make buckets which are almost equal, and it’s equivalent to $$\ell_0$$-sampling. Then we can sample from the set of keys, and the empirical quantiles are close to the true quantiles. We also do things like heatmaps, where you’re computing things on a log scale.
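(A toy version of mine of the equal-distinct-count bucketing. Here I compute the distinct keys exactly; the real system would work from an $$\ell_0$$-sample of the keys instead of an exact pass.)

```python
def ordinal_buckets(keys, nbuckets):
    """Bucket boundaries for ordinal data: each bucket gets a (near-)equal
    number of *distinct* keys, since there is no metric to make widths
    equal. Returns the left boundary key of each bucket."""
    distinct = sorted(set(keys))
    per = len(distinct) / nbuckets
    return [distinct[int(i * per)] for i in range(nbuckets)]

cities = ["Austin", "Boston", "Boston", "Chicago", "Denver",
          "El Paso", "Fresno", "Houston", "Houston"]
bounds = ordinal_buckets(cities, 3)   # 7 distinct keys into 3 buckets
```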

How about visualizing time series? First, there’s more data than pixels, so we need to break things into buckets and average. Then you need to smooth, by averaging over a sliding window. The more sketch-efficient way of doing it is to take a geometric average over the past. The question becomes the width of the window you’re looking at: what should it be to get a good rendering? What is the right level of smoothing? There’s a paper from VLDB last year called ASAP. You’re trying to smooth out local perturbations but keep the global behavior of the time series. You ought to smooth in a way that minimizes variance but does not decrease the kurtosis, related to the hypercontractivity of a random variable. There’s some interesting work to do here.
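(For concreteness: the ‘‘geometric average over the past’’ is an exponentially weighted moving average, which needs only constant state per series, unlike a sliding window. A minimal version of mine:)

```python
def ewma(series, alpha):
    """Exponentially weighted moving average: constant O(1) state per
    series (just the running value s), which is what makes it more
    sketch-friendly than a sliding window."""
    out, s = [], series[0]
    for x in series:
        s = alpha * x + (1 - alpha) * s
        out.append(s)
    return out

noisy = [10, 12, 9, 11, 50, 10, 11, 9, 12, 10]   # one outlier spike
smooth = ewma(noisy, alpha=0.3)
# the spike at 50 is heavily damped in the smoothed series
```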

There’s a simple sampling approach, and you can get better results if you’re willing to make a full pass. First, this is an in-memory system like Spark: the entire dataset is loaded into RAM. When you have more and more data, you increase the number of machines. This keeps latency down; you want the response time to be independent of the size of the data. In such a model, how do sampling and streaming compare? In streaming, you’re doing one pass and doing things in parallel — the speed is the same regardless of the number of machines. In sampling, the work depends on the screen resolution, so you’re asking each machine to do less and less work as you increase the size of the dataset. You have the paradoxical view that sampling gets faster as you increase the number of servers. Sounds like sampling is great! When we can do sampling, we will. But streaming has advantages: you can get away with smaller sketches, and there are deterministic guarantees for heavy hitters, etc. I didn’t talk about detecting outliers; you have to do those kinds of things with streaming. We’ve also been doing filtering — it’s not obvious how you sample from a filtered dataset. That’s where you use streaming: you set up by doing one pass over the filtered data, and then you hope you sample from it many times to justify the cost of that first pass. This is essentially conditional sampling. We also did a comparison to commercial databases.

To summarize, the thesis of this talk is that sketching and streaming is the right model for interactive visualization of large tables, where you don’t want to move data around but just do things quickly interactively. We believe Hillview does pretty well, it’s open source (on Github, for now at least ;)). We are climbing up the machine learning stack slowly.

Recent Advances in NP-Hard Variants of Low Rank Approximation (David Woodruff)

Outline and Problem Description

I’ll focus on some low-rank matrix approximation problems, but in non-standard error measures where the problem becomes NP-hard. I’ll focus mostly on weighted low-rank matrix approximation, and I’ll mention $$\ell_1$$-low-rank matrix approximation. The standard case is the Frobenius norm case: we get a matrix $$A$$ and a parameter $$k$$, and want rank-$$k$$ matrices $$U, V$$ making $$|A - UV|_F^2$$ small. We want a $$(1 + \epsilon)$$ factor compared to the best approximation. You can solve this with the SVD, sped up using fast matrix multiplication, but this is not so great in other situations (for instance, non-negative factorizations). We also might want our error measure to be robust to noise: if you corrupt entries of $$A$$, you might be paying too much attention to these corruptions because we are looking at sums of squares of differences. You might also have missing entries in your matrix; what does the problem mean in that setting?
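(The easy base case in code: the best rank-$$k$$ approximation in Frobenius norm via the truncated SVD (Eckart–Young). A NumPy sketch of mine, not from the talk.)

```python
import numpy as np

rng = np.random.default_rng(2)

def best_rank_k(A, k):
    """Best rank-k approximation in Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# A rank-2 matrix plus small noise: rank-2 truncation recovers most of it,
# leaving a residual on the order of the noise.
A = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
A_noisy = A + 0.01 * rng.standard_normal((50, 40))
A2 = best_rank_k(A_noisy, 2)
err = np.linalg.norm(A_noisy - A2)   # Frobenius norm of the residual
```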

Let’s define the weighted low-rank approximation problem. We are again given $$A, \epsilon$$, but now we are also given $$W \in \mathbb{R}^{n \times n}$$. We try to minimize a weighted squared Frobenius norm, where each entry is attached a weight: look at $$|W \odot \left(\hat{A} - A\right)|_F^2$$, where $$\hat{A} = UV$$ has rank $$k$$. We still want a $$(1 + \epsilon)$$ guarantee.
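(The objective in code, just to fix notation — my own illustration: $$W$$ weights each entry of the residual before squaring, and all-ones weights recover the plain Frobenius problem.)

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_objective(A, W, U, V):
    """|| W ∘ (A - U V) ||_F^2 with ∘ the entrywise (Hadamard) product."""
    R = W * (A - U @ V)
    return float(np.sum(R * R))

n, k = 6, 2
A = rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

# All-ones weights give back the plain squared Frobenius objective;
# zero/one weights would encode missing entries (matrix completion).
weighted = weighted_objective(A, np.ones((n, n)), U, V)
plain = np.linalg.norm(A - U @ V, "fro") ** 2
```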

Motivation

Why do we care about doing this? Well, we might want to renormalize the different categories of movie ratings (in this application) by normalizing by the variances of the entries in each category. So you have a normalization factor associated with each category. This is the weight matrix, and can be thought of as having a small number of distinct columns (e.g., movie types). We are assuming that movies are similar somehow. We can also assume that users (the rows) behave similarly too. So the weight matrix can capture a lot of settings, we’re dealing with similar rows and columns.

Another problem we can look at is matrix completion. It’s a hard problem, even to approximate, and it is a special case of weighted low-rank approximation (weight $$1$$ on observed entries, $$0$$ on missing ones), which suggests this problem is hard too.

Results

But we will get algorithms for special cases of the weight matrix. Case 1: small number of distinct rows and columns. More generally Case 2: the weight matrix has a small number of distinct columns, but any rows. More generally, Case 3: the weight matrix has rank at most $$r$$.

Our results guarantee $$|W \odot (A - \hat{A})|_F^2 \leq (1 + \epsilon)OPT$$ with high probability, in runtime depending on the sparsity of $$A, W$$ and $$n^{\gamma + 1}\cdot 2^{\mathcal{O}(poly(k, r))}$$. In the last case, we get an algorithm in time $$n^{\mathcal{O}(k^2r/\epsilon^2)}$$.

Hardness

Let me mention some hardness results to justify the large runtimes. We appeal to Feige’s random 4-SAT hypothesis, which assumes a certain allocation problem is hard, and to the Maximum Edge Biclique problem (MEB). Under the random 4-SAT hypothesis, any algorithm that can distinguish MEB instances from two families — (1) there is a bipartite clique with a certain number of edges, and (2) every bipartite clique has fewer than a constant factor times that many edges — requires $$2^{\Omega(n)}$$ time.

Here’s a reduction from MEB to weighted low-rank approximation, which shows that weighted low-rank approximation is hard to approximate even for $$k = 1$$. The reduction places a $$1$$ in the weight matrix if there’s an edge between the corresponding pair of vertices, and $$n^6$$ otherwise. Now suppose the weight matrix has $$r$$ distinct columns. You can get a $$2^{\Omega(r)}$$ hardness — just take MEB and apply the reduction to an $$r \times r$$ matrix planted within the $$n \times n$$ matrix.
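To make the reduction concrete, here is a hypothetical sketch of building the weight matrix from a small bipartite edge set (the edge set and the size $$n = 4$$ are invented for illustration):

```python
import numpy as np

n = 4
edges = {(0, 1), (0, 2), (1, 1), (2, 3)}   # hypothetical bipartite edges (i, j)

# Weight 1 on edges, a huge n^6 penalty elsewhere: a cheap rank-1 fit is
# then forced to concentrate on a large biclique.
W = np.full((n, n), float(n ** 6))
for i, j in edges:
    W[i, j] = 1.0
```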

Techniques

We use a polynomial system verifier — it’s like linear programming, except with polynomial constraints instead of linear constraints. In the ellipsoid algorithm, you cut down an ellipsoid and need to know when to stop — you need a lower bound on the objective function. We will need an analogous notion for polynomial systems. Then I will describe the main technique here — we’ll use sketching, or data dimensionality reduction, to reduce the number of variables in a polynomial system. Typically sketching is used to turn a polynomial-time algorithm into a linear-time or faster one. Instead, we will reduce a large number of variables to a small number of variables, going from an exponential-time algorithm to a polynomial-time one. This is a somewhat surprising use of sketching. We call this ‘‘guessing a sketch’’.

A polynomial system verifier (Renegar ‘92) is as follows: you’re given a real polynomial system $$P(x)$$ with $$v$$ variables, $$m$$ polynomial constraints, and polynomials of degree at most $$d$$. We also say that the polynomial coefficients can be specified with at most $$H$$ bits. It takes $$(md)^{\mathcal{O}(v)} \cdot \text{poly}(H)$$ time to verify whether there is a solution to the system $$P$$.

Now we want a lower bound on the cost. For linear programming you can get a lower bound that is only exponentially small in the number of variables. What about for a general polynomial system? You can’t get any such lower bound: Consider the polynomial $$(xy - 1)^2 + y^2$$, which can never equal zero but can get arbitrarily close to zero. So we need to intersect our semialgebraic set with a ball, requiring that the variables we output have bit complexity at most $$H$$. This gives us a lower bound: the cost is either $$0$$ or $$\geq (2^H + m)^{-d^v}$$.
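A quick numerical check of that example (my own illustration, not from the talk):

```python
# (xy - 1)^2 + y^2 is never zero, but taking x = 1/y as y -> 0 sends the
# first term to (1 - 1)^2 = 0 while the second term y^2 -> 0 as well.
def p(x, y):
    return (x * y - 1) ** 2 + y ** 2

vals = [p(1.0 / y, y) for y in (1e-1, 1e-3, 1e-6)]
# Each value is about y^2: strictly positive, but shrinking toward 0.
```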

Now we do a multiple regression sketch. We’re given $$m$$ matrices $$A_1, \cdots, A_m$$ (tall and thin) and vectors $$b_1, \cdots, b_m$$, and we are just trying to solve least squares regression: for each $$j$$, minimize $$|A_jx - b_j|$$. Now choose $$S$$ to be a random Gaussian matrix, and apply $$S$$ to both $$A_jx$$ and $$b_j$$. This yields a much smaller regression problem. We have the guarantee that if we set the number of rows of $$S$$ to $$\mathcal{O}(k/\epsilon)$$ (note that this is independent of the number of regression problems, as well as the dimension of the regression problems), then the sum of squares of regression costs is at most a $$(1 + \epsilon)$$ factor more than the optimal cost. The sketch size only depends on $$k/\epsilon$$ — no union bound going on, just a statement about expectations.
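A small NumPy experiment illustrating the sketch (the dimensions and the $$1/\sqrt{s}$$ Gaussian scaling are my choices, not the talk’s):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, s = 500, 5, 3, 60   # m tall regression problems, sketched to s rows

# One shared Gaussian sketch S — its size does not grow with the number of
# problems m, only with the target quality.
S = rng.standard_normal((s, n)) / np.sqrt(s)

total_opt = total_sketched = 0.0
for _ in range(m):
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    x_opt = np.linalg.lstsq(A, b, rcond=None)[0]          # exact least squares
    x_sk = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]   # solve the sketched problem
    total_opt += np.linalg.norm(A @ x_opt - b) ** 2
    total_sketched += np.linalg.norm(A @ x_sk - b) ** 2

ratio = total_sketched / total_opt   # summed costs stay within a small factor
```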

Algorithm for WLRA

We could create $$2nk$$ variables for $$U, V$$. But the polynomial verifier runs in time exponential in the number of variables. So we’ll use a multiple-regression sketch with few rows, and use the fact that the weight matrix has rank at most $$r$$, to evade this issue.

Guess a Sketch

Consider the $$j^{th}$$ column of $$W$$, $$W_j$$. We take the objective and re-write it as a multiple regression problem, multiplying by the diagonal version of $$W_j$$ (as a matrix) — we scale all the entries according to $$W_j$$. We do this for each column in order. This weighted low-rank approximation problem is the same as this multiple regression problem. Now we can choose a Gaussian matrix $$S$$ and replace the multiple regression problem with its sketch — multiply by $$S$$ on both sides. What if we create variables for the tiny matrix $$SD_{W_j}U$$? This is what we get after multiplying by $$S$$ and $$D_{W_j}$$. Then just create variables for all these entries. Now, you can solve for $$V_j$$ in terms of all these variables! Recall we are looking at $$\sum_{j = 1}^n \|SD_{W_j}UV_j - SD_{W_j}A_j\|_2^2$$. We have a bunch of small matrices — we only need to create variables for $$r$$ of these, and express everything else in terms of the variables we chose. So we take the weighted low-rank objective, write it as multiple regression, use sketching to say we can solve the smaller problem, and then create variables for $$SD_{W_j}U$$ for a linearly independent subset of these matrices. This is a way of reducing the number of variables. Once we have a small number of variables, we can solve for $$V$$ in a closed-form expression. The solution is a rational function, but we can clear the denominator with tricks. This makes the difference between getting something with exponential time and something with polynomial time, because the time depends heavily on the total number of variables.
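Here is a hedged numerical sketch of the last step: once the tiny matrices $$SD_{W_j}U$$ are known (in the real algorithm they are guessed as variables), each column $$V_j$$ has a closed-form least-squares solution. In this toy version I simply fix a random $$U$$ instead of guessing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, s = 50, 2, 20

A = rng.standard_normal((n, n))
W = rng.uniform(0.5, 1.5, size=(n, n))
U = rng.standard_normal((n, k))                # stand-in for the left factor
S = rng.standard_normal((s, n)) / np.sqrt(s)   # Gaussian sketch

# Once M_j = S D_{W_j} U (an s x k matrix) is in hand, V_j solves the small
# sketched regression  min_v |M_j v - S D_{W_j} A_j|  in closed form.
V = np.zeros((k, n))
for j in range(n):
    D = np.diag(W[:, j])
    M = S @ D @ U
    rhs = S @ D @ A[:, j]
    V[:, j] = np.linalg.lstsq(M, rhs, rcond=None)[0]
```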

Open Problems

We don’t know if this problem is fixed-parameter tractable, since the upper bound is $$n^{\mathcal{O}(k^2r/\epsilon^2)}$$ while the lower bound is $$2^{\Omega(r)}$$.

Submodular Maximization via Approximate Data Structures (Huy Nguyen)

Background and Motivation

One setting that has not been looked at as much in sublinear algorithms is the data structures used to support optimization problems. There’s been a lot of success in linear programming and other areas — I’ll give some examples and hopefully inspire people to look into such things.

Let’s look at what happened in submodular functions: First, a submodular function is a function $$f: 2^V \to \mathbb{R}$$ such that $$f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$$ for $$A \subseteq B$$ and $$x \not\in B$$ — this is a diminishing returns notion. It captures a lot of interesting models.
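A tiny Python check that a coverage function satisfies diminishing returns (the set system is a made-up example):

```python
from itertools import combinations

sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}   # hypothetical set system

def f(A):
    """Coverage function: size of the union of the chosen sets."""
    covered = set()
    for i in A:
        covered |= sets[i]
    return len(covered)

# Exhaustively verify f(A ∪ {x}) - f(A) >= f(B ∪ {x}) - f(B) for A ⊆ B, x ∉ B.
ground = set(sets)
for r in range(len(ground) + 1):
    for B in map(set, combinations(sorted(ground), r)):
        for ra in range(len(B) + 1):
            for A in map(set, combinations(sorted(B), ra)):
                for x in ground - B:
                    assert f(A | {x}) - f(A) >= f(B | {x}) - f(B)
```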

For instance, coverage functions (the number of elements in the union) are submodular. A graph cut is also a submodular function. In this talk we will focus on maximizing submodular functions subject to constraints: the partition matroid and the graphic matroid. This is a fairly interesting setup which models problems you might care about.

Here’s an example of a partition matroid problem: Submodular welfare maximization. You have $$m$$ items and $$k$$ players. The goal is to maximize total welfare: You want to maximize the sum of the utility of all people involved. How do we model this problem using the partition matroid? Let’s assume the utility function is submodular. The objective is just the sum of utility functions. The constraint is that each item can only be assigned to one person: That’s the partition matroid constraint. You can only pick one of (item 1 to player 1, item 1 to player 2, …), as well as (item 2 to player 1, item 2 to player 2, …) for all items. Together, they form a single partition matroid constraint. For the purpose of this talk we will talk about the partition matroid with constraints of this type.

Often, people use greedy algorithms to solve this problem. The way the algorithm works is as follows: You look at the marginal gain of adding each item to the current solution, starting from an empty solution set, and pick the one with the maximal gain. Once you have something in your solution, because of diminishing returns, the marginal gains don’t stay the same. In a classical work by Nemhauser, Wolsey, and Fisher ‘78, for monotone functions, greedy achieves a constant factor approximation. If you have a cardinality constraint, you can get $$\sim 63\%$$ (a $$1 - 1/e$$ approximation). With a matroid constraint, you can get a $$1/2$$ approximation.
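A minimal sketch of greedy under a cardinality constraint (the function names and the toy coverage instance are mine):

```python
def greedy(f, ground, budget):
    """Repeatedly add the element of largest marginal gain until the budget
    is spent — the classical NWF '78 scheme for monotone submodular f."""
    S = set()
    for _ in range(budget):
        gains = {x: f(S | {x}) - f(S) for x in ground - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.add(best)
    return S

# Toy coverage instance: choose 2 sets covering as many elements as possible.
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}

def cov(S):
    out = set()
    for i in S:
        out |= sets[i]
    return len(out)

picked = greedy(cov, set(sets), 2)   # picks set 2 (gain 3), then set 0 (gain 2)
```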

Some questions greedy doesn’t seem to address: How do we get an optimal approximation for monotone maximization with a matroid constraint? Non-monotone maximization? We’ll focus on the first one. The greedy algorithm is not optimal for more complex constraints. For instance, suppose you have 3 items. The constraints are that you can take at most one of the first two, and at most one of the third. You can come up with an example of a function where greedy does badly: $$f(\{1\}) = 1 + \epsilon, f(\{2\}) = f(\{3\}) = 1, f(\{2, 3\}) = 2, f(\{1, 3\}) = 1$$. This case shows that greedy can be myopic with no way back.

People have come up with ideas to get around this problem: Don’t ever commit to something you might regret later! Make small decisions and only look at those: This is the canonical relaxation via continuous optimization, after which you round the resulting relaxation. You don’t want to make big decisions with no way to fix them!

We can look at the multilinear relaxation ([CCPV’07]). $$F$$ is the multilinear extension of $$f$$, and $$P$$ is a convex relaxation of the feasible set $$\mathcal{F}$$. Here $$F(x) = \mathbb{E}[f(\hat{x})]$$, where $$\hat{x}$$ independently rounds each $$x_i$$ to $$1$$ with probability $$x_i$$ and to $$0$$ otherwise. This is hard to optimize — it’s not convex — but if we are interested in approximation, you can do it. You can solve it approximately in a ‘‘continuous greedy’’ way: Find the direction of maximum potential gain (look at the maximum dot product with $$\nabla F$$ over points in the feasible set). For very specific cases, you can do provably good rounding, but with several matroid constraints, it’s tricky — we don’t know the best constant. For a single matroid, you can round without any loss, but that is very special.
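The sampling-based evaluation of $$F$$ looks roughly like this (a sketch with a made-up coverage function; each sample costs one evaluation of $$f$$):

```python
import numpy as np

rng = np.random.default_rng(3)
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}   # hypothetical instance

def f(S):
    out = set()
    for i in S:
        out |= sets[i]
    return len(out)

def multilinear_estimate(x, samples=20000):
    """Monte Carlo estimate of F(x) = E[f(x_hat)], where x_hat keeps
    coordinate i independently with probability x_i."""
    total = 0.0
    for _ in range(samples):
        S = {i for i in range(len(x)) if rng.random() < x[i]}
        total += f(S)
    return total / samples

est = multilinear_estimate([0.5, 0.5, 0.5, 0.5])
# By linearity of expectation the exact value here is 4 * 0.75 + 0.5 = 3.5.
```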

This multilinear extension and continuous greedy method, however, is quite a bit slower than the regular greedy approach. There’s no general way to evaluate this thing other than ‘‘take some samples from this distribution and see what the value may be’’ — this will converge to the right value, but it takes time: A single evaluation takes $$\mathcal{O}(n)$$ function evaluations. Even if the continuous optimization is efficient, you need $$\mathcal{O}(n^2)$$ evaluations. Rounding is also significant, to the tune of $$\mathcal{O}(k^3)$$ time.

So, greedy is fast and can be made to run in near-linear time, and continuous greedy is slow, but has a better quality guarantee. So can we get the best of both worlds? We’re going to be looking at partition matroids (picking items), we can also consider the graphic matroid (you can pick any forest of the graph).

Key Idea

We are going to combine the discrete and continuous algorithms. If the total marginal gain of elements is large, you can analyze some part of greedy and show it will work just fine. When the total marginal gains become small, greedy might make a bad choice which you can’t get back from. The continuous version runs much faster here, since the total marginal gain corresponds to the variance of the estimate of the function value $$F$$ (recall $$F$$ is an expectation). If the variance is small, we can do the continuous thing quickly. If the variance is big, you can just do greedy — it’s fast, and you don’t have to worry about the approximation.

Implementation

We have a current solution $$S$$. We look at the value of $$f(S + i) - f(S)$$ for each candidate $$i$$. Find a base $$B$$ of maximum total value. Then, add a random element of $$B$$ to your current solution. So this is a variation of the greedy discussed before — there we took the maximum gain. Here, we pick, say, the ten elements whose marginal gains are the biggest, and then pick one of them randomly. This is like hedging your bet — an adversary could plant a bad element, and we’re trying to avoid it. In general, this algorithm doesn’t work, but it works when the value of the base solution $$B$$ is large. Then we can use this and get a lot of gain. When $$B$$ is small, we switch to the continuous thing and use that.
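One step of this hedged greedy, in a simplified sketch (taking the $$t$$ top-gain elements as a stand-in for the maximum-value base $$B$$; the coverage instance is invented):

```python
import random

def random_greedy_step(f, S, candidates, t, rng):
    """Compute marginal gains, keep the t elements of largest gain
    (a stand-in for the max-value base B), and add one at random."""
    order = sorted(candidates - S, key=lambda x: f(S | {x}) - f(S), reverse=True)
    S.add(rng.choice(order[:t]))
    return S

sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}

def cov(S):
    out = set()
    for i in S:
        out |= sets[i]
    return len(out)

# One step from the empty solution: one of the two best sets is added.
S = random_greedy_step(cov, set(), set(sets), t=2, rng=random.Random(0))
```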

For this algorithm to run fast, we need to maintain the maximum-value base $$B$$ here. As soon as an element is added to $$S$$, the marginal gains change for everyone; you need to get the new base $$B$$, and you have to come up with it quickly! You have to do this $$n$$ times.

Data Structure Problem

You have to approximately maintain the maximum weight base $$B$$ through a sequence of weight updates (decrements). For the graphic matroid, you can use a dynamic MST data structure. For a partition matroid, there are ad-hoc solutions like bucketing or linked lists. However, when you add an item to the solution, all marginal gains could change, and we cannot afford to update all the weights! So if you want to design an algorithm here, you have to put it in the context of these iterative algorithms. The solution is a classical idea from sublinear algorithms: spot-checking. We have a current maximum weight base, and then the rest of the items. Now what do we do? We add a random element from the maximum weight base to the solution. After we add an element to the current solution, the marginal gains change. We then have to update the weights and find a new base. We are going to just randomly check the weights: Sample a random element and look at its marginal gain. If the weight dropped a lot, we update it and call the dynamic MST structure; then we randomly check another element. It’s possible a weight drops a little bit, but not by much — then the current estimate is good. So you keep doing this: Sample a random element; did the weight drop by a lot? If so, update; otherwise, the current estimate is good. The point is that a weight cannot drop too many times, since each time it has to drop sufficiently — after sufficiently many drops, it hits zero and it’s gone. That is what we piggyback on. Once the number of tests without an update exceeds the number of tests with an update by more than $$\log n$$, you can be confident the maintained weights are correct with high probability. This gives a fast implementation using spot-checking.
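A hedged toy version of the spot-checking loop (the stopping rule, the slack threshold, and the flat-array layout are my simplifications of what was described):

```python
import math
import random

def spot_check(true_w, cached_w, rng, slack=0.5):
    """Sample random elements; if a cached weight has dropped by more than
    `slack`, refresh it (in the real algorithm this is where the dynamic
    MST / base data structure would be updated). Stop after ~log n
    consecutive clean checks and declare the cached weights trustworthy."""
    n = len(true_w)
    clean_streak = checks = 0
    while clean_streak <= math.log(n):
        i = rng.randrange(n)
        checks += 1
        if cached_w[i] - true_w[i] > slack:   # stale weight caught by the check
            cached_w[i] = true_w[i]
            clean_streak = 0
        else:
            clean_streak += 1
    return checks

rng = random.Random(4)
true_w = [1.0] * 20
cached = [1.0] * 18 + [3.0, 3.0]   # two weights silently dropped since last refresh
checks = spot_check(true_w, cached, rng)
```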

Takeaways

You can get faster algorithms by combining continuous and discrete approaches. Often, when designing iterative algorithms, you can use approximate data structures with fast runtimes. This is an invitation for sublinear algorithms researchers to look into iterative methods.

Questions

Madhu Sudan: Is there a nice description of the problem that can be cleanly separated from the application? Answer: I don’t have a good answer — it’s mainly to do with the sequence of iterates. Costis Daskalakis: Is there a relation to Follow-the-Perturbed-Leader? Answer: Not sure. There are relations to Frank-Wolfe.

Jelani Nelson on $$\ell_2$$-norm sketching.

I missed most of the presentation, but it seemed interesting — the punchline was using the generic chaining to get a constant factor approximation. I have to look at the paper to see exactly what happened.