# Theory of Computing Blog Aggregator

### Difference Sets with Local Properties

I recently attended a wonderful workshop about Algebraic methods in combinatorics, which took place in Harvard’s CMSA. There were many interesting participants from a variety of combinatorial fields, and a very friendly/productive atmosphere. My talk focused on a recent work with Cosmin Pohoata, and I also mentioned some distinct distances result that we derived. During […]

by Adam Sheffer at December 12, 2017 04:50 PM UTC

### Interesting Probability on a VERY OLD TV show

I have posted about things I see in TV or Movies that are math or CS related:

Do TV shows overestimate how much a genius can help solve crimes or make really good crystal meth which seems to be blue. YES, see here

Do TV shows get math wrong. YES, see here and about 90% of the episodes of Numb3rs

Closer to home- do TV shows say stupid things about P vs NP. Elementary (one of the two Modern Day Sherlock Holmes shows did) does  see  here

Did Kirk and Spock really defeat a computer by a trick that wouldn't work now. Yes, see Lance's post on this here

Do TV shows use the word Quantum incorrectly? They do but they are not alone as such, see here

Do people writing Futrama get their math right! Yes- see here

Do people writing 24 get their math wrong! Yes- see here

Does the Big Bang Theory mostly get things right? Yes! - see here

There are more (Seinfeld things comedians should learn proofs! Really- see here) but I can make my point just with the ones above.

ALL of the TV shows except Star Trek were from after 2000 (or so).  So, with the exception of Science Fiction, math-refs and sci-refs in TV shows are relatively recent- I had thought.

Which is why I was surprised and delighted to see, an episode of the old western (anti-western? satire of a western?) Maverick, from 1958 (before I was born!), called Rope of Cards a CORRECT and INTERESTING  math reference.Maverick bets that a random 25 cards from a deck of cards can be arranged into four  5-card pat hands (I had to look that up-- hands where you don't want to discard any cards, so  flush, a straight, a full house would qualify. 4 of a kind would be pat if there were no wild cards).  The sucker takes the bet and loses. Maverick later says the odds are high and called the game Maverick Solitaire.And that is now the name of the puzzle- see here. The prob is around 0.98.

I call this a mention of math since it has to do with probability- which may be a stretch. And I doubt the scene would encourage people to go into math. But it might encourage one to learn probability either to sucker others or to not be suckered.

So the question now is- are there other non-science-fiction, refs to math in older TV shows?
I suspect yes - similar to the one above which is gambling and probability. What is the earliest mention of math on a TV show? The oldest that did not involve science fiction or gambling?

by GASARCH (noreply@blogger.com) at December 12, 2017 02:33 PM UTC

### Special Topics in Complexity Theory, Lecture 18

from Emanuele Viola

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 18, Scribe: Giorgos Zirdelis

In this lecture we study lower bounds on data structures. First, we define the setting. We have $n$ bits of data, stored in $s$ bits of memory (the data structure) and want to answer $m$ queries about the data. Each query is answered with $d$ probes. There are two types of probes:

• bit-probe which return one bit from the memory, and
• cell-probe in which the memory is divided into cells of $\log n$ bits, and each probe returns one cell.

The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answer to previous probes. For bit-probes it means that we answer a query with depth-$d$ decision trees.

Finally, there are two types of data structure problems:

• The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
• The dynamic case, in which we have update queries that change the memory and also run in bounded time.

In this lecture we focus on the non-adaptive, bit-probe, and static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:

• $s=m$ and $d=1$, i.e. you write down all the answers, and
• $s=n$ and $d=n$, i.e. you can always answer a query about the data if you read the entire data.

Next, we review the best current lower bound, a bound proved in the 80’s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of $k$-wise independence.

Problem 1. The data is a seed of size $n=k \log m$ for a $k$-wise independent distribution over $\{0,1\}^m$. A query $i$ is defined to be the $i$-th bit of the sample.

The question is: if we allow a little more space than seed length, can we compute such distributions fast?

Theorem 2. For the above problem with $k=m^{1/3}$ it holds that

\begin{aligned} d \geq \Omega \left ( \frac {\lg m}{\lg (s/n)} \right ). \end{aligned}

It follows, that if $s=O(n)$ then $d$ is $\Omega (\lg m)$. But if $s=n^{1+\Omega (1)}$ then nothing is known.

Proof. Let $p=1/m^{1/4d}$. We have the memory of $s$ bits and we are going to subsample it. Specifically, we will select a bit of $s$ with probability $p$, independently.

The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample $k$-wise independence.

For the “shrinking” part we have the following. We expect to keep $p\cdot s$ memory bits. By a Chernoff bound, it follows that we keep $O(p\cdot s)$ bits except with probability $2^{-\Omega (p \cdot s)}$.

For the “answer a lot of queries” part, recall that each query probes $d$ bits from the memory. We keep one of the $m$ queries if it so happens that we keep all the $d$ bits that it probed in the memory. For a fixed query, the probability that we keep all its $d$ probes is $p^d = 1/m^{1/4}$.

We claim that with probability at least $1/m^{O(1)}$, we keep $\sqrt {m}$ queries. This follows by Markov’s inequality. We expect to not keep $m - m^{3/4}$ queries on average. We now apply Markov’s inequality to get that the probability that we don’t keep at least $m - \sqrt {m}$ queries is at most $(m - m^{3/4})/(m-\sqrt {m})$.

Thus, if $2^{-\Omega (p\cdot s)} \leq 1/m^{O(1)}$, then there exists a fixed choice of memory bits that we keep, to achieve both the “shrinking” part and the “answer a lot of queries” part as above. This inequality is true because $s \geq n > m^{1/3}$ and so $p \cdot s \ge m^{-1/4 + 1/3} = m^{\Omega (1)}$. But now we have $O(p \cdot s)$ bits of memory while still answering as many as $\sqrt {m}$ queries.

The minimum seed length to answer that many queries while maintaining $k$-wise independence is $k \log \sqrt {m} = \Omega (k \lg m) = \Omega (n)$. Therefore the memory has to be at least as big as the seed. This yields

\begin{aligned} O(ps) \ge \Omega (n) \end{aligned}

from which the result follows. $\square$

This lower bound holds even if the $s$ memory bits are filled arbitrarily (rather than having entropy at most $n$). It can also be extended to adaptive cell probes.

We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with $s$ nodes on the left and $m$ nodes on the right. Every node on the right side has degree $d$. We answer each probe with an XOR of its neighbor bits. By the Vazirani XOR lemma, it suffices to show that any subset $S \subseteq [m]$ of at most $k$ memory bits has an XOR which is unbiased. Hence it suffices that every subset $S \subseteq [m]$ with $|S| \leq k$ has a unique neighbor. For that, in turn, it suffices that $S$ has a neighborhood of size greater than $\frac {d |S|}{2}$ (because if every element in the neighborhood of $S$ has two neighbors in $S$ then $S$ has a neighborhood of size $< d|S|/2$). We pick the graph at random and show by standard calculations that it has this property with non-zero probability.

\begin{aligned} & \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { s.t. } |\mathsf {neighborhood}(S)| \leq \frac {d |S|}{2} \right ] \\ & = \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { and } \exists T \subseteq [s], |T| \leq \frac {d|S|}{2} \textrm { s.t. all neighbors of S land in T} \right ] \\ & \leq \sum _{i=1}^k \binom {m}{i} \cdot \binom {s}{d \cdot i/2} \cdot \left (\frac {d \cdot i}{s}\right )^{d \cdot i} \\ & \leq \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot s} {d \cdot i/2}\right )^{d\cdot i/2} \cdot \left (\frac {d \cdot i}{s}\right )^{d \cdot i} \\ & = \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d \cdot i/2} \\ & = \sum _{i=1}^k \left [ \underbrace { \frac {e \cdot m}{i} \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d/2} }_{C} \right ]^{i}. \end{aligned}

It suffices to have $C \leq 1/2$, so that the probability is strictly less than 1, because $\sum _{i=1}^{k} 1/2^i = 1-2^{-k}$. We can match the lower bound in two settings:

• if $s=m^{\epsilon }$ for some constant $\epsilon$, then $d=O(1)$ suffices,
• $s=O(k \cdot \log m)$ and $d=O(\lg m)$ suffices.

Remark 3. It is enough if the memory is $(d\cdot k)$-wise independent as opposed to completely uniform, so one can have $n = d \cdot k \cdot \log s$. An open question is if you can improve the seed length to optimal.

As remarked earlier the lower bound does not give anything when $s$ is much larger than $n$. In particular it is not clear if it rules out $d=2$. Next we show a lower bound which applies to this case.

Problem 4. Take $n$ bits to be a seed for $1/100$-biased distribution over $\{0,1\}^m$. The queries, like before, are the bits of that distribution. Recall that $n=O(\lg m)$.

Theorem 5. You need $s = \Omega (m)$.

Proof. Every query is answered by looking at $d=2$ bits. But $t = \Omega (m)$ queries are answered by the same 2-bit function $f$ of probes (because there is a constant number of functions on 2-bits). There are two cases for $f$:

1. $f$ is linear (or affine). Suppose for the sake of contradiction that $t>s$. Then you have a linear dependence, because the space of linear functions on $s$ bits is $s$. This implies that if you XOR those bits, you always get 0. This in turn contradicts the assumption that the distributions has small bias.
2. $f$ is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as they probe at least one new memory bit. If $t > s$ when we stop we have a query left such that both their probes query bits that have already been queried. This means that there exist two queries $q_1$ and $q_2$ whose probes cover the probes of a third query $q_3$. This in turn implies that the queries are not close to uniform. That is because there exist answers to $q_1$ and $q_2$ that fix bits probed by them, and so also fix the bits probed by $q_3$. But this contradicts the small bias of the distribution.

$\square$

### References

[Sie04]   Alan Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. on Computing, 33(3):505–543, 2004.

by Emanuele at December 12, 2017 02:01 PM UTC

from Theory Matters

1. One great tradition at the ITCS conference is the “graduating bits” session, where graduating PhD students and postdocs give brief overviews of their research in advance of going out on the job market.  See here for instructions on how to apply (the deadline is January 8, 2018, and the session is four days later).  Hopefully these talks will be recorded and archived as well.
2. Avi Wigderson has a new book coming out, Mathematics and Computation, with a draft freely available on the Web.  In addition to being a great overview of many of the “greatest hits” of theoretical computer science, the last chapter (Chapter 20) lays out how TCS has influenced and connected to the life sciences, the social sciences, and technology.

by timroughgarden at December 12, 2017 08:34 AM UTC

### First third of my ICM2018 paper – Three Puzzles on Mathematics, Computation and Games. Corrections and comments welcome

from Gil Kalai

I have a very strict December 20 deadline (self-imposed, I missed the official one)  for my ICM2018 paper. I plan to talk about three puzzles on mathematics, computation and games, and here is a draft of the first third. Corrections and comments are very welcome! This part is around the simplex algorithm for linear programming.

by Gil Kalai at December 11, 2017 05:28 PM UTC

### Clustering and Sum of Squares Proofs, Part 1

I am excited to (temporarily) join the Windows on Theory family as a guest blogger!

This is the first post in a series which will appear on Windows on Theory in the coming weeks. The aim is to give a self-contained tutorial on using the Sum of Squares algorithm for unsupervised learning problems, and in particular in Gaussian mixture models. This will take several posts: let’s get started.

In an unsupervised learning problem, the goal is generally to recover some parameters $\theta \in \mathbb{R}^d$ given some data $X_1,\ldots,X_n \sim P(X \, | \, \theta)$, where $P$ is a probability distribution on, say, $\mathbb{R}^p$ which is parameterized by $\theta$. The goal is to estimate $\theta$ by some estimator $\hat{\theta}(X_1,\ldots,X_n)$ which (a) does not require too many samples and (b) can be computed from those samples in polynomial time. This basic setup can be instantiated (sometimes with minor adaptations) to capture numerous important problems in statistics and machine learning: a few examples are

•  component analysis and its many variants (PCA, ICA, Sparse PCA, etc.)
• Netflix problem / matrix completion / tensor completion
• dictionary learning / blind source separation
• community detection and recovery / stochastic block models
• many clustering problems

Excellent resources on any of these topics are just a Google search away, and our purpose here is not to survey the vast literature on unsupervised learning, or even on provable “TCS-style” algorithms for these problems. Instead, we will try to give a simple exposition of one technique which has now been applied successfully to many unsupervised learning problems: the Sum of Squares method for turning identifiability proofs into algorithms.

Identifiability is a concept from statistics. If one hopes for an algorithm which recovers parameters $\hat{\theta}(X_1,\ldots,X_n) \approx \theta$, it must at least be true that $X_1,\ldots,X_n$ uniquely (or almost uniquely) determine $\theta$, with high probability: when this occurs we say $\theta$ is identifiable from the samples $X_1,\ldots,X_n$.

Classically, identifiability is often proved by analysis of a (typically) inefficient brute-force algorithm. First, one invents some property $Q(X_1,\ldots,X_n)$ of $\theta$. For example, in a maximum-likelihood argument, one shows that $Pr(X_1,\ldots,X_n \, | \, \theta) > p$ for some $p$. Then, often via a concentration-of-measure argument, one shows that no $\theta'$ far from $\theta$ satisfies property $Q$. In the maximum-likelihood example, this would entail showing that if $\theta'$ is far from $\theta$ then $Pr(X_1,\ldots,X_n \, | \, \theta') < p$.

An identifiability argument like this typically has no implications for computationally-efficient algorithms, since finding $\theta$ which satisfies $Q$ often appears to require brute-force search. Thus, often when the investigation turns to efficient algorithms, the identifiability argument is abandoned and more explicitly algorithmic techniques are brought to bear: convex relaxations, spectral methods, and even heuristic methods.

The method we present here for designing computationally-efficient algorithms begins with a return to identifiability proofs. The main insight is that if both

• the property $Q$ and, more importantly,
• the proof that any $\theta'$ satisfying $Q$ must be close to $\theta$

are sufficiently simple, then identifiability of $\theta$ from $X_1,\ldots,X_n$ does imply the existence of an efficient algorithm to (approximately) recover $\theta$ from $X_1,\ldots,X_n$!

For us, simple has a formal meaning: the proof should be captured in the low-degree Sum of Squares proof system.

The algorithms which result in the end follow a familiar recipe: solve some convex relaxation whose constraints depend on $X_1,\ldots,X_n$ and round it to obtain $\hat{\theta}(X_1,\ldots,X_n)$. But the design and analysis of this relaxation are heavily informed by the simple identifiability proof described above. In particular, the convex programs which result will not be familiar relaxations of maximum likelihood problems!

In this series of blog posts, we are going to carry out this strategy from start to finish for a classic unsupervised learning problem: clustering mixtures of Gaussians. So that we can get down to business as quickly as possible, we defer a short survey of the literature on this “proofs-to-algorithms” method to a later post.

## Mixtures of Gaussians

Gaussian mixtures are classic objects in statistics, dating at least to Pearson in 1894. The basic idea is: suppose you have a data set $X_1,\ldots,X_n \in \mathbb{R}^d$ which was generated in a heterogeneous fashion: some points were sampled from probability distribution $\mathcal{D}_1$, some from $\mathcal{D}_2$, and so on up to $\mathcal{D}_k$, but you do not know which points came from which distributions. Under what settings can you cluster the points by which distribution they came from, and perhaps also recover some properties of those distributions, such as their means $\mu_i = \mathbb{E}_{X \sim \mathcal{D}_i} X$?

Pearson, in 1894, was faced with a collection of body measurements of crabs. The distribution of one such attribute in the data — the ratio of forehead length to body width — curiously deviated from a Gaussian distribution. Pearson concluded that in fact two distinct species of crabs were present in his data set, and that the data should therefore be modeled as a mixture of two Gaussians. He invented the method of moments to discover the means of both the Gaussians in question.1 In the years since, Gaussian mixtures have become a fundamental statistical modeling tool: algorithms to fit Gaussian mixtures to data sets are included in basic machine learning packages like sklearn.

A mixture of $2$ Gaussians in $\mathbb{R}^2$.2

Here is our mixture of Gaussians model, formally.

Mixtures of separated spherical Gaussians:
Let

• $\Delta > 0$.
• $\mu_1,\ldots,\mu_k \in \mathbb{R}^d$ be such that $\|\mu_i - \mu_j\| \geq \Delta$ for every $i \neq j$.
• $\mathcal{N}_1(\mu_1,I),\ldots,\mathcal{N}_k(\mu_k,I)$ be $d$-dimensional spherical Gaussians, centered at the means $\mu_i$.
• $X_1,\ldots,X_n \in \mathbb{R}^d$ be iid samples, each drawn by selecting $j \sim [k]$ uniformly, then drawing $X \sim \mathcal{N}(\mu_j, I)$.
• $S_1,\ldots,S_k$ be the induced partition of $[n]$ into $k$ parts, where $i \in S_j$ if samples $X_i$ was drawn from $\mathcal{N}(\mu_j, I)$

The problem is: given $X_1,\ldots,X_n$, output a partition $T_1,\ldots,T_k$ of $[n]$ into $k$ parts, each of size $n/k$, such that (up to renaming the clusters $1,\ldots,k$),

$|S_i \cap T_i| \geq (1 - \delta) \cdot \frac nk$

for each $i \in [k]$ and some small number $\delta > 0$.

Another mixture of 2 Gaussians. The means have Euclidean distance about 3.5.

To avoid some minor but notationally annoying matters, we are going to work with a small variant of the model, where we assume that exactly $n/k$ samples $X_i$ came from each Gaussian $\mathcal{N}(\mu_j, I)$. We call a mixture like this “equidistributed”.

We will work up to a proof of this theorem, which was proved recently (as of Fall 2017) in 3 independent works.

Main Theorem (Hopkins-Li, Kothari-Steinhardt, Diakonikolas-Kane-Stewart):
For arbitrarily-large $t \in \mathbb{N}$ there is an algorithm requiring $n = d^{O(t)} k^{O(1)}$ samples from the equidistributed mixtures of Gaussians model and running in time $n^{O(t)}$ which outputs a partition $T_1,\ldots,T_k$ of $[n]$ into parts of size $N = n/k$ such that with high probability,

$\displaystyle \frac{|S_i \cap T_i|}{N} \geq 1 - k^{10} \cdot \left ( \frac{C \sqrt t}{\Delta} \right )^{t}$

for some universal constant $C$.3

In particular:

• If $\Delta = k^\epsilon$ for some $\epsilon > 0$, then by choosing $t = 100/\epsilon$ the algorithm recovers the correct clustering up to $1/poly(k)$ errors in $poly(k,d)$ time with $poly(k,d)$-many samples.
• If $\Delta = C \sqrt{\log k}$ for a large-enough universal constant $C$, then choosing $t = O(\log k)$ gives a quasipolynomial-time algorithm (using quasipolynomially-many samples) to recover clusters up to $1/poly(k)$ error.4

One (rather weak) consequence of the main theorem is that, for $n = d^{O(t)} k^{O(1)}$ samples, there is enough information in the samples to determine the underlying clustering, up to error $\delta(t,\Delta) = \tfrac{2^{O(t)} \cdot t^{t/2} \cdot k^{10}}{\Delta^t}$. Our strategy to prove the main theorem will start with proving the latter statement independently — as we have discussed, such an argument is often called a proof of identifiability because we say that the clusters are identifiable from the samples (up to $\delta(t,\Delta)$ errors).

While identifiability itself does not carry immediate algorithmic consequences, our proof of identifiability will be somewhat special: it will be simple in a formal sense, namely, that it will be captured by a formal proof system of restricted power. This simplicity of the proof of identifiability will almost immediately imply the main theorem: the construction of a computationally-efficient algorithm from a simple proof of identifiability is the heart of the proofs-to-algorithms method.

## Identifiability proof: 1 dimension

Our eventual goal is to work up to a proof in the low-degree Sum of Squares proof system that clusters $S_1,\ldots,S_k$ are identifiable from samples $X_1,\ldots,X_n$ from a mixture of Gaussians. Since we have not yet defined low-degree Sum of Squares proofs, for now we will focus on constructing an identifiability proof which avoids mathematical facts which we deem “too complicated”. In particular, we will avoid any Chernoff/union bound style arguments.

To get to the heart of the matter it helps to simplify the setting. Our first simplification is to restrict attention to the $d = 1$ case, so that distributions $\mathcal{N}(\mu_i,1)$ are univariate Gaussians with unit variance.

Before stating our first lemma, let’s discuss the key property of a collection $Y_1,\ldots,Y_m$ of samples from a Gaussian $\mathcal{N}(0,1)$ which we will use. Recall that if $Y \sim \mathcal{N}(0,1)$ is a standard Gaussian, then for every $t \in \mathbb{N}$,

$\mathbb{E} |Y|^t \leq t^{t/2}.$

If $Y_1,\ldots,Y_m$ are samples from $Y$, then for $m = m(t)$ large enough, the empirical distribution of $Y_1,\ldots,Y_m$ inherits this property, up to some small fluctuations. Namely, with high probability we would have

$\mathbb{E}_{i \sim [m]} |Y_i|^t \leq 1.1 \cdot t^{t/2}.$

(We could have replaced $1.1$ by any small constant greater than $1$.) Here, the notation $i \sim [m]$ means that an index $i$ is chosen uniformly among $\{1,\ldots,m\}$.

If $Y \sim \mathcal{N}(\mu,1)$ for some $\mu \in \mathbb{R}$, then the same discussion applies immediately to $\mathbb{E}|Y - \mu|^t$ and $\mathbb{E}_{i \sim [m]} |Y_i - \mu|^t$. But even more is true: if $\overline{\mu}$ is the empirical mean of $Y_1,\ldots,Y_m$ (that is, $\overline{\mu} = \mathbb{E}_{i \sim [m]} Y_i$), then with high probability the $t$-th centered empirical moment also inherits the same bound:

$\mathbb{E}_{i \sim [m]} |Y_i - \overline{\mu}|^t \leq 1.1 \cdot t^{t/2}.$

In the Gaussian mixture setting, so long as enough samples are drawn from each Gaussian $\mathcal{N}(\mu_i, 1)$, each cluster will have $t$-th empirical moments satisfying the above bound (with high probability).

In our identifiability proof, we choose to forget that the samples $X_1,\ldots,X_n$ were drawn from Gaussians, and we remember only that they break up into underlying clusters, each of which satisfies that empirical moment bound. We do not even remember the “true” means of each Gaussian; instead we talk only about the empirical means. We will show that no clustering far from that underlying ground-truth clustering results in clusters which satisfy the empirical moment bound.

Lemma 1. Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $N = n/k$ such that for each $S_i$, the collection of numbers $\{X_j\}_{j \in S_i}$ obeys the following moment bound:

$\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}$

where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i} X_j$ and $t$ is some number in $\mathbb{N}$. Let $\Delta > 0$ be such that $|\mu_i - \mu_j| \geq \Delta$ for every $i \neq j$. Suppose $t$ is large enough that $10 \sqrt t k^{1/t} \leq \Delta$.

Let $S \subseteq [n]$ have size $|S| = N = n/k$ and be such that $\{X_i\}_{i \in S}$ obey the same moment-boundedness property:

$\mathbb{E}_{j \sim S} |X_j - \mu_S|^t < 2 \cdot t^{t/2}$

for the same $t \in \mathbb{N}$, where $\mu_S$ is the mean $\mu_S = \mathbb{E}_{j \sim S} X_j$. Then there exists an $S_i$ such that

$\displaystyle \frac{|S \cap S_i|}{N} \geq 1 - k \cdot \left ( \frac{C \sqrt{t}}{\Delta} \right ) ^t.$

for some universal constant $C$.

How do we interpret Lemma 1 as a statement of cluster identifiability? The lemma implies that the clusters are, up to $\delta(t,\Delta)$ errors, the only subsets of $[n]$ of size $n/k$ which satisfy the $t$-th moment bound. This is our property $Q$, like we discussed earlier in this post. The true clustering $S_1,\ldots,S_k$ satisfies $Q$ (i.e. if you group $X_1,\ldots,X_n$ by this ground-truth clustering, the resulting clusters will have bounded empirical $t$-th moments), and every clustering which satisfies this bounded $t$-th moment property must be close to the true clustering. Thus, the correct clusters could be identified by searching for subsets of $[n]$ which satisfy the $t$-th moment bound (never mind that such a search would naively require about $2^n$ time).

We said that to use the sum of squares method to turn our identifiability proof into an algorithm, both the property $Q$ and the proof of identifiability need to be simple.

This $t$-th moment boundedness property will turn out to be simple enough. What about the proof of Lemma 1? By the end of this post we will prove Lemma 1 in a way which uses only Holder’s inequality for the $t$ vs $\tfrac t {t-1}$ norms and the triangle inequality for the $t$-norm. Later, we will show that these inequalities are simple in the correct formal sense: they are captured by a proof system of restricted power.

Our proof of Lemma 1 relies on the following key fact.

Fact 1. Let $S,S' \subseteq \mathbb{R}$ have $|S| = |S'| = N$. Let $X$ denote a uniform sample from $S$ and similarly for $X'$. Let $\mu = \mathbb{E} X$ and $\mu' = \mathbb{E} X'$. Suppose $X,X'$ satisfy the $t$-th moment bound

$\mathbb{E} |X - \mu|^t \leq 2 \cdot t^{t/2} \text{ and } \mathbb{E}|X' - \mu'|^t \leq 2 \cdot t^{t/2}.$

Then

$|\mu - \mu'| \leq 4 \sqrt t \cdot \left ( \frac{|S \cap S'|}{N} \right )^{-1/t}.$

A slightly more general interpretation of this fact is that a pair of random variables $X,X'$ on $\mathbb{R}$ which have bounded $t$-th moments and whose total variation distance $TV(X,X') \leq 1-\alpha$ cannot have means which are too far apart: $|\mathbb{E} X - \mathbb{E} X'| \leq 4 \sqrt t / \alpha^{1/t}$.

Proof of Fact 1.
Let $\alpha = |S \cap S'|/N$. Observe that there is a coupling of the random variables $X,X'$ so that $Pr(X = X') = \alpha$. The coupling chooses with probability $\alpha$ to select a uniform sample $Y \sim S \cap S'$, then lets $X = X' = Y$. With probability $1-\alpha$, the random variable $X$ is a uniform sample from $S \setminus S'$ and similarly for $X'$.

Let $(X,X')$ be a coupled pair of samples. We expand a quantity related to the one we want to bound, and then apply Holder’s inequality with the $t$ and $\tfrac t {t-1}$ norms. (The underlying inner product space assigns functions $f, g : (X,X') \mapsto \mathbb{R}$ the inner product $\mathbb{E}_{(X,X')} f(X,X') \cdot g(X,X')$.)

$\alpha \cdot |\mu - \mu'| = \mathbb{E}_{(X,X')} \left [ \mathbf{1}_{X = X'} \cdot |(\mu - X) - (\mu' - X')| \right ]$

$\leq \left ( \mathbb{E} (\mathbf{1}_{X = X'})^{t/(t-1)} \right )^{\tfrac {t-1} t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t}$

$= \alpha^{1-1/t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right)^{1/t}.$

Now we can apply the triangle inequality for the $t$-norm to the last term, followed by our $t$-th moment assumptions.

$\left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t} \leq \left ( \mathbb{E} |\mu - X|^t \right )^{1/t} + \left ( \mathbb{E} |\mu' - X'|^t \right )^{1/t} \leq 4 \sqrt t$

Putting everything together, we get $|\mu - \mu'| \leq \tfrac {4 \sqrt t}{\alpha^{1/t}}$. QED.

Keeping in mind our eventual goal of constructing a low-degree Sum of Squares proof, we record the observation that the only inequalities we required to prove Fact 1 were the $t$ vs. $\tfrac t {t-1}$ Holder’s inequality and the triangle inequality for the $t$-norm.

Armed with Fact 1, we can prove Lemma 1.The main idea in the proof is that if $S_1$ and $S_2$ are the two clusters with greatest intersection with $S$, then $\mu_S$ can only be close to one of $\mu_1, \mu_2$.

Proof of Lemma 1.
Let $S \subseteq [n]$ have size $|S| = n/k = N$, with mean $\mu_S = \mathbb{E}_{i \sim S} X_i$ and $t$-th moment bound $\mathbb{E}_{i \sim S} |X_i - \mu_S|^t \leq 2t^{t/2}$. Without loss of generality, order the clusters so that $S_1$ has largest intersection with $S$, then $S_2$, and so on: that is $|S \cap S_1| \geq |S \cap S_2| \geq \ldots \geq |S \cap S_k|$. If $|S \cap S_1| = (1 -\delta)N$, then $|S \cap S_2| \geq \tfrac \delta k N$, just by counting.

Since $|\mu_1 - \mu_2| \geq \Delta$, either $|\mu_1 - \mu_S| \geq \Delta/2$ or $|\mu_2 - \mu_S| \geq \Delta/2$. We claim it must be the second. Using Fact 1,

$|\mu_1 - \mu_S| \leq \frac{4 \sqrt t}{(1 - \delta)^{1/t}} \leq 4 \sqrt t \cdot k^{1/t} < \Delta/2$

since certainly $(1-\delta) \geq \tfrac 1k$, and we assumed $10 \sqrt t k^{1/t} \leq \Delta$. We conclude that $|\mu_2 - \mu_S| \geq \Delta/2$.

We have obtained $\tfrac{|S \cap S_2|}{N} \geq \tfrac \delta k$ and $|\mu_2 - \mu_S| \geq \Delta/2$. Putting these together with Fact 1, we find

$\frac \Delta 2 \leq |\mu_2 - \mu_S| \leq 4 \sqrt t \cdot \left ( \frac k \delta \right )^{1/t}.$

Rearranging, this reads $\delta \leq \frac{2^{O(t)} t^{t/2} k}{\Delta^t}.$ QED.

This concludes our one-dimensional identifiability proof, which will form the core of our proof of Theorem 1. In our next post we will lift this proof to a Sum of Squares proof (for which we will need to define Sum of Squares proofs). After that, with a Sum of Squares proof in hand, we will finish designing our mixture of Gaussians algorithm for the one-dimensional case. Then we will show that the same ideas, nearly unchanged, imply that the algorithm works in higher dimensions.

### Thanks

Many thanks to Boaz Barak, David Steurer, and especially Daniel Freund for their helpful remarks on early drafts of this post.

#### Footnotes

1. Moritz Hardt on Gaussian mixtures and Pearson’s approach

2. Image credit: Mathematica Stack Exchange

3. To see how to apply the ideas in this tutorial to a much broader class of clustering problems, see my joint paper with Jerry Li and the recent paper of Pravesh Kothari and Jacob Steinhardt.

4. Before these recent works, the best polynomial-time algorithms for the clustering mixtures of Gaussians could not tolerate any $\Delta < k^{1/4}$ (when $\Delta \geq k^{1/4}$ a simple greedy algorithm can be shown to solve the clustering problem to high accuracy). On the other hand, known lower bounds show that when $\Delta = C \sqrt{\log k}$, clustering is impossible (even using exponential time) with $k^{O(1)}$ samples, so one cannot hope to improve the guarantees of this theorem too much further [Regev-Vijayaraghavan]. (That said, reducing the sample complexity and running time to $poly(d,k)$ when $\Delta = C \sqrt{\log k}$ is a fascinating open problem.)

Variants of this theorem, which may be found in all three of the sources listed, offer algorithms which additionally output estimates of the means $\mu_1,\ldots,\mu_k$, work for many distributions besides Gaussians (without even knowing the underlying distribution!), and tolerate some amount of advesarial corruption among the samples $X_1,\ldots,X_n$. We note also that the theorems in those works handle the usual mixtures of Gaussians problem, rather than the equidistributed version, and can tolerate non-uniform mixtures; i.e. those where some clusters are much smaller than others.

by samhop at December 11, 2017 03:58 AM UTC

### Graph-based time-space trade-offs for approximate near neighbors

Authors: Thijs Laarhoven
Abstract: We take a first step towards a rigorous asymptotic analysis of graph-based approaches for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of (randomized) greedy walks on the approximate near neighbor graph. For random data sets of size $n = 2^{o(d)}$ on the $d$-dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor $c > 1$ in query time $n^{\rho_q + o(1)}$ and space $n^{1 + \rho_s + o(1)}$, for arbitrary $\rho_q, \rho_s \geq 0$ satisfying \begin{align} (2c^2 - 1) \rho_q + 2 c^2 (c^2 - 1) \sqrt{\rho_s (1 - \rho_s)} \geq c^4. \end{align} Graph-based near neighbor searching is especially competitive with hash-based methods for small $c$ and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches the recent optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [SODA'17]. We further study how the trade-offs scale when the data set is of size $n = 2^{\Theta(d)}$, and analyze asymptotic complexities when applying these results to lattice sieving.

### Periortree: An Extention of R-Tree for Periodic Boundary Conditions

Authors: Toru Niina
Abstract: Searching spatial data is an important operation for scientific simulations which are performed mostly with periodic boundary conditions. An R-Tree is a well known tree data structure used to contain spatial objects and it is capable of answering to spatial searching queries in an efficient way. In this paper, a novel method to construct an R-Tree considering periodic boundary conditions is presented. Unlike existing methods, the proposed method works without any kind of extra objects or queries. Moreover, because this method reduces the volume of bounding box for each node under the periodic boundary conditions, it is expected to increase the overall efficiency. While the extension of an R-Tree is presented in this work, this method is not only applicable to an R-Tree but also to other data structures that use axis-aligned bounding boxes with periodic boundary conditions. The implementation is available on GitHub.

### Deciding the existence of a cherry-picking sequence is hard on two trees

Authors: Janosch Döcker, Leo van Iersel, Steven Kelk, Simone Linz
Abstract: Here we show that deciding whether two rooted binary phylogenetic trees on the same set of taxa permit a cherry-picking sequence, a special type of elimination order on the taxa, is NP-complete. This improves on an earlier result which proved hardness for eight or more trees. Via a known equivalence between cherry-picking sequences and temporal phylogenetic networks, our result proves that it is NP-complete to determine the existence of a temporal phylogenetic network that contains topological embeddings of both trees. The hardness result also greatly strengthens previous inapproximability results for the minimum temporal-hybridization number problem. This is the optimization version of the problem where we wish to construct a temporal phylogenetic network that topologically embeds two given rooted binary phylogenetic trees and that has a minimum number of indegree-2 nodes, which represent events such as hybridization and horizontal gene transfer. We end on a positive note, pointing out that fixed parameter tractability results in this area are likely to ensure the continued relevance of the temporal phylogenetic network model.

### How to Net a Convex Shape

Authors: Sariel Har-Peled, Mitchell Jones

(A) Let $\Net$ be a sample of size $\tldO(d^2 \eps^{-1})$ points from the original point set (which is no longer needed), where $\tldO$ hides polylogarithmic terms. Given a convex body $\body$, via a separation oracle, the algorithm performs a small sequence of (oracle) stabbing queries (computed from $\Net$) -- if none of the query points hits $\body$, then $\body$ contains less than an $\eps$-fraction of the input points. The number of stabbing queries performed is $O( d^2 \log \eps^{-1})$, and the time to compute them is $\tldO(d^9 \eps^{-1})$. To the best of our knowledge, this is the first weak $\eps$-net related construction where all constants/bounds are polynomial in the dimension.

(B) If one is allowed to expand the convex range before checking if it intersects the sample, then a sample of size $\Bigl.\tldO(\eps^{-(d+1)/2})$, from the original point set, is sufficient to form a net.

(C) We show a construction of weak $\eps$-nets which have the following additional property: For a heavy body, there is a net point that stabs the body, and it is also a good centerpoint for the points contained inside the body.

(D) We present a variant of a known algorithm for approximating a centerpoint, improving the running time from $\tldO(d^9)$ to $\tldO(d^7)$. Our analysis of this algorithm is arguably cleaner than the previous version.

### On Topology Optimization and Canonical Duality Method

Authors: David Yang Gao
Abstract: The general problem in topology optimization is correctly formulated as a double-$\min$ mixed integer nonlinear programming (MINLP) problem based on the minimum total potential energy principle. It is proved that for linear elastic structures, the alternative iteration leads to a Knapsack problem, which is considered to be NP-hard in computer science. However, by using canonical duality theory (CDT) developed recently by the author, this challenging 0-1 integer programming problem can be solved analytically to obtain global optimal solution at each design iteration. The novel CDT method for general topology optimization is refined and tested mainly by both 2-D benchmark problems in topology optimization. Numerical results show that without using filter, the CDT method can obtain exactly 0-1 optimal density distribution with almost no checkerboard pattern. Its performance and novelty are compared with the popular SIMP and BESO approaches. A brief review on the canonical duality theory for solving a unified problem in multi-scale systems is given in Appendix.

### Decomposing arrangements of hyperplanes: VC-dimension, combinatorial dimension, and point location

Authors: Esther Ezra, Sariel Har-Peled, Haim Kaplan, Micha Sharir
Abstract: $\renewcommand{\Re}{\mathbb{R}}$ We re-examine parameters for the two main space decomposition techniques---bottom-vertex triangulation, and vertical decomposition, including their explicit dependence on the dimension $d$, and discover several unexpected phenomena, which show that, in both techniques, there are large gaps between the VC-dimension (and primal shatter dimension), and the combinatorial dimension.

For vertical decomposition, the combinatorial dimension is only $2d$, the primal shatter dimension is at most $d(d+1)$, and the VC-dimension is at least $1 + d(d+1)/2$ and at most $O(d^3)$. For bottom-vertex triangulation, both the primal shatter dimension and the combinatorial dimension are $\Theta(d^2)$, but there seems to be a significant gap between them, as the combinatorial dimension is $\frac12d(d+3)$, whereas the primal shatter dimension is at most $d(d+1)$, and the VC-dimension is between $d(d+1)$ and $5d^2 \log{d}$ (for $d\ge 9$).

Our main application is to point location in an arrangement of $n$ hyperplanes is $\Re^d$, in which we show that the query cost in Meiser's algorithm can be improved if one uses vertical decomposition instead of bottom-vertex triangulation, at the cost of some increase in the preprocessing cost and storage. The best query time that we can obtain is $O(d^3\log n)$, instead of $O(d^4\log d\log n)$ in Meiser's algorithm. For these bounds to hold, the preprocessing and storage are rather large (super-exponential in $d$). We discuss the tradeoff between query cost and storage (in both approaches, the one using bottom-vertex trinagulation and the one using vertical decomposition).

### The Complexity of Maximum $k$-Order Bounded Component Set Proble

Authors: Sounaka Mishra, Shijin Rajakrishnan
Abstract: Given a graph $G=(V, E)$ and a positive integer $k$, in \textsc{Maximum $k$-Order Bounded Component Set} problem (\maxkcomp{k}), it is required to find a vertex set $S \subseteq V$ of maximum size such that each component in the induced graph $G[S]$ has at most $k$ vertices. We prove that for constant $k$, \maxkcomp{k} is hard to approximate within a factor of $n^{1 -\epsilon}$, for any $\epsilon > 0$, unless $\mathsf{P} = \mathsf{NP}$. We provide lower bounds on the approximability when $k$ is not a constant as well. \maxkcomp{k} can be seen as a generalization of \textsc{Maximum Independent Set} problem (\textsc{Max-IS}). We generalize Tur\'{a}n's greedy algorithm for \textsc{Max-IS} and prove that it approximates \maxkcomp{k} within a factor of $(2k - 1)\overline{d} + k$, where $\overline{d}$ is the average degree of the input graph $G$. This approximation factor is a generalization of Tur\'{a}n's approximation factor for \textsc{Max-IS}.

### Core Discovery in Hidden Graphs

Abstract: Massive network exploration is an important research direction with many applications. In such a setting, the network is, usually, modeled as a graph $G$, whereas any structural information of interest is extracted by inspecting the way nodes are connected together. In the case where the adjacency matrix or the adjacency list of $G$ is available, one can directly apply graph mining algorithms to extract useful knowledge. However, there are cases where this is not possible because the graph is \textit{hidden} or \textit{implicit}, meaning that the edges are not recorded explicitly in the form of an adjacency representation. In such a case, the only alternative is to pose a sequence of \textit{edge probing queries} asking for the existence or not of a particular graph edge. However, checking all possible node pairs is costly (quadratic on the number of nodes). Thus, our objective is to pose as few edge probing queries as possible, since each such query is expected to be costly. In this work, we center our focus on the \textit{core decomposition} of a hidden graph. In particular, we provide an efficient algorithm to detect the maximal subgraph of $S_k$ of $G$ where the induced degree of every node $u \in S_k$ is at least $k$. Performance evaluation results demonstrate that significant performance improvements are achieved in comparison to baseline approaches.

## What Makes a Programming Language?

There is an alphabet, words, grammar, statements, semantics, and various ways to organize the previous in order to create a computer program in a programming language. Flex helps developers create a tool called a lexical analyzer which identifies the words of a program during the compilation/interpretation process.

A compiler takes a text file and parses it by character trying to match patterns at each of the aforementioned levels. The initial parse is the lexical analysis, this pass ensures that there are no lexical errors, which are characters that don’t contribute to meaningful words.

Meaningful words for a programming language are described by a regular language. An implementation for describing a regular language is regular expressions. An implementation for parsing text while looking for matches to regular expressions is a flex lexical analyzer.

Essentially, programming a flex lexer means defining various regular expressions which detail all of the possible words and characters that are meaningful in a correct program for your language.

## BooleanLogicLanguage Example Language

To illustrate flex, BooleanLogicLanguage is an example language which a flex-based lexer will lexically analyze. For no reason in particular, the purpose of this language is to evaluate Boolean logic expressions.

An example of a program made in BooleanLogicLanguage

This diagram is an example of what a correct program would look like in this BooleanLogicLanguage. The lexical analyzer should be able to parse this without errors. Note that in this language, ‘INTEGER(‘ and ‘INTEGER)’ are the ‘words’ used to separate pieces of code — as opposed to ‘(‘, ‘)’, ‘{‘, or ‘}’. When creating your own language, you are free to do whatever you want, but following convention, to some degree, just ensures that the tool you are creating is easy to use.

NOTE: A programming language is a tool for using a computer. A GUI is a tool for using a computer. Siri or Cortana are tools for using a computer, etc.

Regular expressions describing the lexical structure of BooleanLogicLanguage

This diagram is a sketch of the regular expressions which will be used in the flex program in order to describe meaningful words. The phrases on the left hand side are token names. Token names are not necessarily important during lexical analysis, but they are of the utmost importance when performing the syntactic analysis (which comes after a lexical analysis during compilation/interpretation — not covered in this post).

The most difficult part of this process is defining tokens and figuring out what sort of regular expression should describe them. For this example, I decided to make a language that would evaluate Boolean logic. Then I started writing potential programs in this language, and once I wrote enough that looked interesting and useful, I defined tokens and regular expressions to ensure those particular programs would be correct. I made code up that I liked and then fit tokens and regex to make them work.

A Quick Note on Regular Expressions (Regex)

[…] denotes a single thing, whose identity is dictated by what’s inside the brackets.

A single character is a single thing.

* denotes zero or more of the single thing before it.

? denotes one or zero of the single thing before it.

(…) is just a grouping, usually used with * or ?.

The difference between […] and (…) is that the square brackets represents one of what is inside of it and the parentheses are a grouping. For example [abc] and (abc): ‘a’, ‘b’, ‘c’ all match the former, and ‘abc’ matches the latter.

– denotes a range and has specific applications that are very useful: A-Z, a-z, 0-9.

## Flex

Flex is like a C program, except it has a  further defined layout.

%{

C code declarations

%}

definitions of regular expressions

%%

regular expression1   code to execute on match (such as a macro)

regular expression2  other code to execute on match (such as a function)

%%

C code functions

main function

The power of Flex comes from being able to define regular expressions (between ‘%}’ and ‘%%’) and then attach them to code (between ‘%%’ and ‘%%’). The additional areas for C code are both handy and what gives the lexer its true functionality (doing something when a regex is matched).

NOTE: flex essentially turns this flex file (extension ‘.l’) into a C program which is then compiled like you would compile any C program. The result is an object-file/program which you execute on/with a text file containing programming code. And in this case, the output of the program is to a text file (Tokens.txt) and also to stdout (terminal).

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 %{ int numErrors = 0; char* arg; typedef struct node { char* lexeme; char* token; struct node* next; } node_t; node_t head; node_t* current = &head; int yywrap(void); void store(char* lexeme); void error(void); void printStored(void); %} whitespace (\t|" "|\r) newline \n real -?[0-9]+\.[0-9]+ integer -?[0-9]+ string \".*\" boolean_op_binary (" AND "|" OR "|" XOR ") boolean_op_unary "NOT " boolean_literal ("True"|"False") identifier [A-Z]+ separator_open ({real}|{integer})$separator_close$({real}|{integer}) assignment = IO "print " terminal ; %% {whitespace} {ECHO;} {newline} {ECHO;} {real} {ECHO;} {integer} {ECHO;} {string} {ECHO;} {boolean_op_binary} {ECHO; arg = "BOOL_OP_BINARY"; store(yytext);} {boolean_op_unary} {ECHO; arg = "BOOL_OP_UNARY"; store(yytext);} {boolean_literal} {ECHO; arg = "BOOL_LITERAL"; store(yytext);} {identifier} {ECHO; arg = "IDENTIFIER"; store(yytext);} {separator_open} {ECHO; arg = "OPEN"; store(yytext);} {separator_close} {ECHO; arg = "CLOSE"; store(yytext);} {assignment} {ECHO; arg = "ASSIGN"; store(yytext);} {IO} {ECHO; arg = "IO"; store(yytext);} {terminal} {ECHO; arg = "TERMINAL"; store(yytext);} . {ECHO; numErrors++; error();} %% int yywrap(void) { return 1; } void store(char* lexeme) { current->lexeme = malloc(sizeof(strlen(lexeme)+1)); strcpy(current->lexeme,lexeme); current->token = malloc(sizeof(strlen(arg)+1)); strcpy(current->token,arg); node_t* temp; temp = malloc(sizeof(node_t)); current->next = temp; current = current->next; } void error(void) { printf("[e]"); } void printStored(void) { node_t* c = &head; FILE* f = fopen("Tokens.txt","w"); while (c->next) { fprintf(f,"%s\t%s\n",c->lexeme,c->token); c = c->next; } fclose(f); printf("Tokens.txt written.\n"); } int main(int argc, char *argv[]) { // ensures number of command line arguments if (argc != 2) { printf("Please enter one filename as an argument.\n"); return -1; } // opens the file with name of second argument yyin = fopen(argv[1],"r"); yylex(); // close file fclose(yyin); printf("\nLexicalErrors %d\n",numErrors); printStored(); return 0; }

The above code is a flex file which parses the BooleanLogicLanguage.

 1 2 3 4 5 6 7 8 9 CC=gcc CFLAGS= LexerFile=lexer lexer: lex.yy.c $(CC)$(CCFLAGS) -o lexer lex.yy.c lex.yy.c: $(LexerFile).l flex$(LexerFile).l

The above code is a makefile, which when run in the same directory as the flex file, will create the ‘lexer’ program. This was tested on an Ubuntu 16.04 operating system. The GNU C Compiler (gcc) is required in addition to flex.

 1 2 3 4 5 P = True; R = False; Q = 1(NOT P)1 XOR 2(P AND 3(NOT R)3)2; print Q; 

The above text should be parsed without errors by our lexer. And the lexer should output a Tokens.txt matching each lexeme to the token describing its regex.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 P IDENTIFIER = ASSIGN True BOOL_LITERAL ; TERMINAL R IDENTIFIER = ASSIGN False BOOL_LITERAL ; TERMINAL Q IDENTIFIER = ASSIGN 1( OPEN NOT BOOL_OP_UNARY P IDENTIFIER )1 CLOSE XOR BOOL_OP_BINARY 2( OPEN P IDENTIFIER AND BOOL_OP_BINARY 3( OPEN NOT BOOL_OP_UNARY R IDENTIFIER )3 CLOSE )2 CLOSE ; TERMINAL print IO Q IDENTIFIER ; TERMINAL 

Here is the Token.txt for that initial errorless program.

Lexical errors are designated with a ‘[e]’ for this lexer, and it comes jsut after the erroneous character.

The above is a screenshot of an example of parsing a program with lexical errors.

What is Next?

Classically, once you can identify words, including what type of word (token), you can then create a grammar. You could think of a token as ‘noun’ or ‘verb’ or ‘adjective’, if comparing a programming language to a natural language. The step after lexical analysis (checking for correctness of words) is syntactic analysis (checking for correctness of grammar). Making a comparison to natural languages again, an English grammar could be PHRASE: article noun verb (The dog ran, A bird flies, etc). In BooleanLogicLanguage there could be STATEMENT: IDENTIFIER ASSIGN BOOL_LITERAL (Q = True, A = False, etc). But each of those are an example of just one type of phrase or statement, you can have multiple definitions for a production in the grammar for your language.

Non-Classically? I don’t know. Natural Language Processing (NLP) is a very active area, but I’m not aware of anyone using those techniques for parsing textual programming languages.

The post Flex, Regular Expressions, and Lexical Analysis appeared first on XRDS.

by Alexander DeForge at December 10, 2017 11:44 PM UTC

### Women in Theory 2018

from CS Theory Events

June 19-22, 2018 Harvard University, Cambridge, MA https://womenintheory.wordpress.com/ Registration deadline: January 16, 2018 This is a workshop for female graduate students and exceptional undergraduates (fourth year) in theoretical computer science. The workshop will have first-rate technical content and will be a great opportunity for students to meet their peers from around the world. We will … Continue reading Women in Theory 2018

by shacharlovett at December 10, 2017 08:03 PM UTC

### Preview: The solution by Keller and Lifshitz to several open problems in extremal combinatorics

from Gil Kalai

Peter Frankl (right) and Zoltan Furedi

## The news

A new paper by Nathan Keller and Noam Lifshitz settles several open problems in extremal combinatorics for wide range of parameters. Those include the three problems we mention next.

### Three central open problems in extremal combinatorics are

The 1975 Erdős-Sós forbidding one intersection problem, asks for the maximal
size of a k-uniform hypergraph that does not contain two edges whose intersection
is of size exactly t−1;

The 1987 Frankl-Füredi special simplex problem asks for the maximal
size of a k-uniform hypergraph that does not contain the following forbidden configuration: d+1 edges $E_1,\dots ,E_{d+1}$ such that there exists a set $S= \{v_1,\dots , v_{d+1}\}$ for which $E_i\cap S = S \backslash \{v_i\}$ for any i and the sets {Ei \ S} are pairwise disjoint.

The 1974 Erdős-Chvátal’s simplex conjecture proposes an answer for the maximal
size of a k-uniform hypergraph that does not contain a d-simplex. Here, a d-simplex is a family of d+1 sets that have empty intersection, such that the intersection
of any d of them is nonempty.

All these questions are related to the Erdős-Ko-Rado theorem (see this post and many others). For $t=1$, two edges whose intersection is of size exactly t−1 are just two disjoint edges and so is a 1-simplex and a special 1-simplex.

### The papers by Keller and Lifshitz and by Ellis, Keller, and Lifshitz

The paper by Keller and Lifshitz settles all these problems for a wide range of parameters! A subsequent paper by David Ellis, Nathan Keller, and Noam Lifshitz extends (among various other results) the range of parameters for the  Erdős-Sós problem even further.

## Plans for posts

Michel Deza

I have an ambitious plan to devote two or three posts to these developments (but not before January). In the first post I will give some general background on Turan’s problem for hypergraphs and the new new exciting results, Then (perhaps in a second post) I will give little background on two major methods, the Delta-system method initiated by Deza, Erdos and Frankl and used in many earlier papers mainly by Frankl and Furedi, and the Junta method initiated by Friedgut which is used (among other ingredients) in the new paper.  Then I will write about the new results in the last post.

Paul Erdos, Thomas Luczak, Ehud Friedgut, and Svante Janson

by Gil Kalai at December 10, 2017 05:26 PM UTC

### On the (Im)possiblity of intelligence explosion

(In this post I am following the venerable tradition of bloggers opining about matters on which they don’t really know much about. I hope I learn something from the feedback –Boaz).

Nothing is impossible,
Child, nothing is impossible.
Every bridge is crossable.
Every tooth is flossable.
Every win is lossable.
Every worker’s bossable.
Every yak’s a lhasa bull.
Nothing is impossible,
Child, nothing is impossible.

Okay, teacher, can you name something that ISN’T possible?

No, Child. Nothing is impossible.

So, there IS something that’s impossible. Naming something that’s impossible is impossible.

(From “The Teacher and The Child”by Chris Harris)

In this world where reasoned arguments and respect for facts seem increasigly rare, some people are actually worried about the opposite problem of “intelligence explosion”.
Recently, through a blog post of Scott Aaronson, I came across this essay by François Chollet. Given Scott’s less than raving review, I fully expected not to like it, but actually found myself agreeing with some parts of this essay (though not all, and in particular not with the title).

The basic fear of “intelligence explosion” is that:

1. Once we develop sufficiently advanced artificial intelligence, it will go on to use this intelligence to build better and better AI, quickly leaving us behind.
2. The AI will develop some form of consciousness and rather than using their intelligence to make our lives better, will be about as kind to us as we were to the Neanderthals.

Consciousness is a hard concept to define. Humans used to attribute consciousness to the weather, praying to the sun and sea gods. After all it made perfect sense, the weather is unpredictable and dangerous, and seemed to vary in capricious ways. As we have grown to understand weather better, we no longer think that it is conscious. Yes, there is still an element of unpredictability and randomness to it, but we do understand the basic mechanisms at play, and can place some probabilities or bounds on its behavior. So these days the weather is stochastic rather than adversarial.
Arthur C. Clarke famously said that “Any sufficiently advanced technology is indistinguishable from magic.”. Similarly, one can say that any system that is sufficiently adversarial is indistinguishable from being conscious.

If we follow this definition, then we have already created conscious machines, since we can definitely already create technologies that we understand so poorly that we cannot model it in any way other than adversarial. In some sense this is the underlying lesson of results such as the Halting problem and NP completeness. Moreover, ever since the Atom bomb, mankind’s potential to damage the planet and ourselves has gone far beyond what nature can do (and of course, as witnessed by global warming, nuclear energy is not the only technology that can affect the whole planet). Also, as anyone getting “on the air updates” for their gadgets can attest to, we already have systems that continuously improve over time, often with a feedback loop. With more and more of the world’s economy becoming dependent on the transfer of information as opposed to goods (which of course is somewhat of an artificial distinction), the speed of progress has become much faster. So, if the question is whether we should worry about us developing and deploying technology whose behavior we can’t completely predict, and one that could result in very bad consequences, then I am with the alarmists.

What I find less appealing about the “AI risk” position is the focus on notions such as “intelligence” and “conciousness”. There is already an algorithm outperforming most humans on IQ tests and surely soon there will be an algorithm with an IQ of 300, or whatever the maximum possible value is. However, as Chollet points out, while some individual humans have had profound influence on the course of humanity, it is typically not intelligence alone that helped them (see e.g. the rather sad story of William James Sidis). That said, one can’t dispute that in the 200K years we’ve been around, Homo Sapiens have managed to make some significant progress, and so if you could simulate a population of even average humans, but do it with more people and larger speed, you’d speed up scientific discovery. Indeed (and this is where I part ways with Chollet) scientific progress has been accelerating, precisely because we use scientific progress to make more progress, whether it’s computer simulations helping in doing physics or quantum physics helping us build new computers, and so on and so forth. We are likely to continue to progress at an accelerating pace, not by trying to simulate a population of average or genius humans, but rather by continuously applying our tools and understanding to build better tools and improve our understanding.

But all of the above does not mean that modelling computation and communication systems as a new species and anthropomorphizing them is helpful. Research can and should be done on trying to verify that the behavior of computing systems does not deviate from certain parameters, whether it is Windows 10 or an AI algorithm.
With the progress of time computers are likely to continue to do more and more tasks currently associated with human intelligence, and yes, we do have a nontrivial chance of creating a technology that may eventually destroy us. But I don’t see why thinking of algorithms in anthropomorphic terms is helpful any more than thinking of the weather in terms of deities. If anything, understanding human “intelligence” and “consciousness” in more algorithmic ways  seems like a better path forward.

by Boaz Barak at December 09, 2017 08:22 PM UTC

### Research Staff Member at IBM Almaden Research theory group (apply by January 20, 2018)

from CCI: jobs

The Theory Group anticipates one or more available positions as Research Staff Members, pursuing research in theoretical computer science. The positions would start in 2018. Please see the webpage for more information.

Website: http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=4491
Email: klclarks@us.ibm.com

by shacharlovett at December 08, 2017 11:55 PM UTC

### Is there any book on the philosophical implications of Theoretical Computer Science?

I find some books about computers, but all of them are about technology. I want something more linked to theory.

by Raphael Augusto at December 08, 2017 10:46 PM UTC

### faculty at Northeastern University, Boston (apply by January 1, 2018)

from CCI: jobs

Northeastern University, Boston, continues to hire at all levels and in all areas, including theory. In addition to a number of positions in the College of Computer Science, there is also a position joint with the Mathematics department that includes a focus on Data Science, Machine Learning, Discrete and Computational Mathematics, Probability and Statistics, or
Topological Data Analysis.

by shacharlovett at December 08, 2017 10:40 PM UTC

### P=NP: Perhaps I Change My Mind

from Richard Lipton

An old result put a new way (in a now-fixed-up post)

Albert Meyer knows circuit lower bounds. He co-authored a paper with the late Larry Stockmeyer that proves that small instances of the decision problem of a certain weak second-order logical theory require Boolean circuits with more gates than there are atoms in the observable universe. The instances almost fit into two Tweets using just the Roman typewriter keys.

Today Ken and I discuss a simple but perhaps underlooked connection between P=NP and circuit lower bounds.

Albert recently co-authored, with Eric Lehman of Google and his MIT colleague Tom Leighton, the textbook Mathematics for Computer Science. It looks familiar to us because it uses the same MIT Press fonts and layout package as our quantum computing book. They say the following in their Introduction:

Simply put, a proof is a method of establishing truth. Like beauty, “truth” sometimes depends on the eye of the beholder, and it should not be surprising that what constitutes a proof differs among fields.

We would say that the game is not only about making truth evident from a proof but also from the way a theorem statement is expressed. This post uses an old result of Albert’s as an example.

## New Statement of Old Result

Well, any criticism of Albert for how his theorem was stated is really criticism of myself, because Dick Karp and I were the first to state it in a paper. Here is exactly how we wrote it in our STOC 1980 version:

There was no paper by Albert to cite. In our 1982 final version we included a proof of his theorem:

As you can see, our proof—Albert’s proof?—used completeness for ${\mathsf{EXP}}$ and some constructions earlier in the paper. In a preliminary section we wrote that our proofs about classes ${\mathcal{L}}$ such as ${\mathsf{EXP}}$ involved showing inclusions

$\displaystyle K \in \mathcal{V}/f \implies K \in \mathcal{S'},$

“where the set of strings ${K}$ is complete in ${\mathcal{L}}$ with respect to an appropriate reducibility.”

But in this case the proof does not need completeness for ${\mathsf{EXP}}$. I came up with this realization on Wednesday and Ken found essentially the same proof in these lecture notes by Kristoffer Hansen:

This proof uses ${\mathsf{EXP} \subseteq \mathsf{P/poly}}$ not only for ${L(M) \in \mathsf{P/poly}}$ but also for ${L_M \in \mathsf{P/poly}}$. For the latter it suffices to have ${L_M}$ polynomial-time reduce to ${L(M)}$. This follows so long as ${L(M)}$ is complete for some class to which ${L_M}$ also belongs.

Suppose ${M}$ runs in deterministic time ${t(n)}$. Then both ${L(M)}$ and ${L_M}$ belong to ${\mathsf{DTIME}[t(n)]}$, the latter because on input ${\langle x,i,t,z \rangle}$ we can run ${M}$ for ${t \leq t(n)}$ steps. With minimal assumptions on the function ${t}$, ${\mathsf{DTIME}[t(n)]}$ has complete sets ${L(M)}$, and then ${L_M \in \mathsf{P/poly}}$ follows from ${L(M) \in \mathsf{P/poly}}$. So we can state the theorem more generally:

Theorem 1 (toujours Meyer) For any time function ${t(n) \leq 2^{n^{O(1)}}}$, ${\mathsf{DTIME}[t(n)] \subset \mathsf{P/poly} \implies \mathsf{DTIME}[t(n)] \subseteq \Sigma_2^p}$.

In fact we would get ${\mathsf{DTIME}[t(n)]}$ into ${\Sigma_2^p \cap \Pi_2^p}$ and even smaller classes but that’s going beyond our simple point.

## The Point

Our point comes through if we think of a concrete case like deterministic time ${2^{(\log n)^{O(1)}}}$, called ${\mathsf{QP}}$ for quasipolynomial time. So we have:

Corollary 2 If ${\mathsf{QP} \subset \mathsf{P/poly}}$ then ${\mathsf{QP} \subseteq \Sigma_2^p}$.

Now mentally substitute ${\mathsf{QP}}$ for ${\mathsf{EXP}}$ (and ${\subseteq}$‘ for ${=}$‘) in the way Karp and I summarized the final implication in our paper:

What you get after contraposing and using the hierarchy theorem for ${\mathsf{P \neq QP}}$ is:

Corollary 3 If ${\mathsf{P = NP}}$ then ${\mathsf{QP} \not\subseteq \mathsf{P/poly}}$.

The point is that we can also do this for time ${t(n) = n^{\log^* n}}$ and even smaller proper super-classes of ${\mathsf{P}}$. What follows is:

Any attempt to prove ${\mathsf{P=NP}}$ entails proving strong nonuniform circuit lower bounds on languages that are arbitrarily close to being in ${\mathsf{P}}$.

## Reflecting…

Again in the case of ${\mathsf{EXP}}$ this implication too has been variously noted. Scott Aaronson mentions it in one sentence of his great recent 121-page survey on the ${\mathsf{P}}$ versus ${\mathsf{NP}}$ question (p65):

“[I]f someone proved ${\mathsf{P} = \mathsf{NP}}$, that wouldn’t be a total disaster for lower bounds re-search: at least it would immediately imply ${\mathsf{EXP} \not\subseteq \mathsf{P/poly}}$ (via ${\mathsf{EXP = EXP^{NP^{NP}}}}$).”

Maybe I (Dick) considered this in terms of ${\mathsf{EXP}}$ in weighing my thoughts about ${\mathsf{P} = \mathsf{NP}}$. But that it applies to ${\mathsf{QP}}$ in place of ${\mathsf{EXP}}$ gives me pause. This greatly amplifies idle thoughts about the irony of how proving ${\mathsf{P} = \mathsf{NP}}$ yields the same type of lower bounds against ${\mathsf{P/poly}}$ that are involved in the “Natural Proofsbarrier against proving ${\mathsf{P} \neq \mathsf{NP}}$. Ryan Williams had to combine many ideas just to separate ${\mathsf{NEXP}}$ from nonuniform ${\mathsf{ACC}}$—not even getting ${\mathsf{EXP}}$ on the left nor ${\mathsf{P/poly}}$ on the right. (Incidentally, we note this nice recent MIT profile of Ryan.) So having such lower bounds for ${\mathsf{QP}}$ just drop from the sky upon ${\mathsf{P} = \mathsf{NP}}$ seems jarring.

So I’m rethinking my angle on ${\mathsf{P} = \mathsf{NP}}$. I’ve always propounded that good lower bounds flow like ripples from new upper bounds, but the wake of ${\mathsf{P} = \mathsf{NP}}$ seems a tsunami. We wonder if Bill Gasarch will do a 3rd edition of his famous poll about ${\mathsf{P}}$ versus ${\mathsf{NP}}$. Ken and I offset each other with our votes last time, but maybe not this time.

We also wonder whether Theorem 1 can be given even stronger statements in ways that are useful. In the original version of this post we overlooked a point noted first by Ryan Williams here and thought we had ${\mathsf{EXP} \cap \mathsf{P/poly} \subseteq \Sigma_2^p}$. To patch it, call a language ${L}$ in ${\mathsf{EXP}}$ “reflective” if there is a TM ${M}$ running in exponential time such that ${L(M) = L}$ and ${L_M}$ (namely, the “tableau” language defined above) polynomial-time reduces to ${L}$. The complete sets ${L}$ mentioned above for classes within ${\mathsf{EXP}}$ are reflective. If we let ${\mathsf{EXP}_R}$ denote the subclass of reflective languages, then we can say:

$\displaystyle \mathsf{EXP}_R \cap \mathsf{P/poly} \subseteq \Sigma_2^p.$

Note that per Lance Fortnow’s comment here, sparse languages ${L}$ are candidates for being non-reflective: the tableau language ${L_M}$ which we would wish to polynomial-time Turing reduce to ${L}$ is generally dense.

## Open Problems

Is this realization about ${\mathsf{P} = \mathsf{NP}}$ and strong circuit lower bounds arbitrarily close to ${\mathsf{P}}$ really new? Can our readers point us to other discussions of it?

Is the notion of “reflective” known? useful?

[fixed error in original Theorem 1 and surrounding text; added paragraph about it before “Open Problems”; moved query about “cosmological” formulas to a comment.]

by RJLipton+KWRegan at December 08, 2017 07:55 PM UTC

### Generalization Theory and Deep Nets, An introduction

Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper Understanding Deep Learning requires Rethinking Generalization played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say Alexnet, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is also a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly) and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.

But I would be remiss if I didn’t report that at the Simons Institute Semester on theoretical ML in spring 2017 generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see this paper and for a full account, Behnam’s thesis) had made points fairly similar to Zhang et al. pertaining to deep nets.

But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded up in subgroups to address this challenge: these resulted in papers by Dzigaite and Roy, then Bartlett, Foster, and Telgarsky and finally Neyshabur, Bhojapalli, MacAallester, Srebro. (The latter two were presented at NIPS’17 this week.)

Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from some basic confusion about whether or not current generalization theory is prescriptive or merely descriptive. These confusions arise from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in my graduate seminar.

### Prescriptive versus descriptive theory

To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”

Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”

I call such a diagnosis descriptive, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:

Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”

Such a diagnosis is prescriptive.

## Generalization theory: descriptive or prescriptive?

Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bound, consist of attaching a descriptive label to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.

Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = {(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)}$ of $m$ examples from some distribution ${\mathcal D}$. A loss function $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the regression loss is $(h(x) -y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on samplepoints in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$. Training generalizes if the hypothesis $h$ that minimises $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:

Lack of generalization: $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1).$

In practice, lack of generalization is detected by taking a second sample (“held out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds expected loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude

### Generalization Theory: Descriptive Parts

Let’s discuss Rademacher complexity, which I will simplify a bit for this discussion. (See also scribe notes of my lecture.) For convenience assume in this discussion that labels and loss are $0,1$, and assume that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the heldout set $S_2$, meaning

$\Delta_S(h) - \Delta_{S_2}(h) \approx - 1 \qquad (3)$

Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. Flip the labels of points in $S_2$. Now try to find a classifier $C$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1- \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if with high probability this quantity is small, say close to $0$.

But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1- \Delta_{S_2}(h)$ is very small.

In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.

The VC dimension bound is similarly descriptive. VC dimension is defined to be at least $k +1$ if there exists a set of size $k$ such that the following is true. If we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the sample, then we can find all possible $2^{k}$ sequences of $0$’s and $1$’s.

If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as Schapire’s) can be used to show that VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.

Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is higher than $\Omega(m/log m)$.”

One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in scribe notes from my lecture.

Why do students get confused and think that such tools of generalization theory gives some powerful technique to guide design of machine learning algorithms?

Answer: Probably because standard presentation in lecture notes and textbooks seems to pretend that we are computationally-omnipotent beings who can compute VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation. The only way to actually lowerbound Rademacher complexity of such complicated learning architectures is to try training a classifier, and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.

## Toward a prescriptive generalization theory: the new papers

In our medical analogy we saw that the doctor needs to at least do a physical examination to have a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple 2-layer networks couple decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. hone in on stable rank of the weight matrices of the layers of the deep net. These can be seen as an instance of a “flat minimum” which has been discussed in neural nets literature for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.

Dziugaite and Roy take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$ then for every posterior distributions $Q$ (which could depend on the data) on the hypotheses the generalization error of the average classifier picked according to $Q$ is upper bounded as follows where $D()$ denotes KL divergence:

This allows upperbounds on generalization error (specifically, upperbounds on number of samples that guarantee such an upperbound) by proceeding as in Langford and Caruana’s old paper where $P$ is a uniform gaussian, and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge ${i, j}$ in the trained net, then $Q$ consists of adding a gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: Use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization) and (b) the right hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from Q works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)

Applying this method on simple fully-connected neural nets trained on MNIST dataset, they can prove that the method achieves error $17$ percent error on MNIST (whereas the actual error is much lower at 2-3 percent). Hence the title of their paper, which promises nonvacuous generalization bounds. What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the metaquestions about nonconvex optimization, namely, why does deep learning not overfit!

### Postdoc at UT Austin (apply by January 3, 2018)

from CCI: jobs

The Computer Science Department at UT Austin invites applications for a Postdoctoral Fellow in theoretical computer science for the 2018-19 academic year. The Fellow will work with Dana Moshkovitz and David Zuckerman on pseudorandomness and computational complexity. Applications will be accepted until the position is filled.

Website: https://utdirect.utexas.edu/apps/hr/jobs/nlogon/171128010712
Email: diz@cs.utexas.edu

by shacharlovett at December 08, 2017 04:50 PM UTC

### Cologne-Twente Workshop on Graphs and Combinatorial Optimization

from CS Theory Events

June 18-20, 2018 Paris, France http://ctw18.lipn.univ-paris13.fr/ Submission deadline: February 2, 2018 Registration deadline: May 2, 2018 The Cologne-Twente Workshop on Graphs and Combinatorial Optimization 2018 welcomes contributions on theory and applications of discrete algorithms, graphs and combinatorial optimization in the wide sense (4-page “long abstract” submissions invited).Filed under: workshop

by shacharlovett at December 08, 2017 11:26 AM UTC

### Googatory

from Scott Aaronson

When I awoke with glowing, translucent hands, and hundreds of five-pointed yellow stars lined up along the left of my visual field, my first thought was that a dream must have made itself self-defeatingly obvious. I was a 63-year-old computer science professor. I might’ve been dying of brain cancer, but my mind was lucid enough that I’d refused hospice care, lived at home, still even met sometimes with my students, and most importantly: still answered my email, more or less. I could still easily distinguish dreams from waking reality. Couldn’t I?

I stared at the digital clock beside my bed: 6:47am. After half a minute it changed to 6:48. No leaping around haphazardly. I picked up the two-column conference paper by my nightstand. “Hash-and-Reduce: A New Approach to Distributed Proximity Queries in the Cloud.” I scanned the abstract and first few paragraphs. It wasn’t nonsense—at least, no more so than the other papers that I still sometimes reviewed. The external world still ticked with clockwork regularity. This was no dream.

Nervously, I got up. I saw that my whole body was glowing and translucent. My pajamas, too. A second instance of my body, inert and not translucent, remained in the bed. I looked into the mirror: I had no reflection. The mirror showed a bedroom unoccupied but for the corpse on the bed.

OK, so I was a ghost.

Just then I heard my nurse enter through the front door. “Bob, how you feeling this morning?” I met her in the foyer. “Linda, look what happened! I’m a ghost now, but interestingly enough, I can still..”

Linda walked right through me and into the bedroom. She let out a small gasp when she saw the corpse, then started making phone calls.

Over the following days, I accompanied my body to the morgue. I attended my little memorial session at the university, made note of which of my former colleagues didn’t bother to show up. I went to my funeral. At the wake, I stood with my estranged wife and grown children, who mostly remained none the wiser—except when they talked about how eerie it was, how it felt like I was still there with them. Or maybe I’d say something, and get no response from my family, but then five minutes later their conversation would mysteriously veer toward the topic I’d broached. It seemed that I still had full input from the world of the living, but that my only output channel was doing spooky haunted things that still maintained plausible deniability about my existence.

Questions flooded my mind: were there other ghosts? Why was I in this purgatory … or whatever it was? Would I be here forever? And: what was that column of yellow stars in the left of my visual field, the stars that followed me everywhere?

Once it seemed clear that I was here to stay, for some definition of “here,” I figured I might as well do the same stuff that filled my waking hours when I was alive. I pulled up a chair and sat at my laptop. I hit up The Washington Post, The Onion, xkcd, SMBC Comics, Slate Star Codex. They all worked fine.

Then I switched to the Gmail tab. Hundreds of new messages. Former students asking for recommendation letters, prospective students wanting to work with me, grant managers howling about overdue progress reports, none of them bothering to check if I was dead.

I replied to one randomly-chosen email:

Dear Ashish,
Thanks for your interest in joining our group. Alas, I’m currently dead and walking the earth as a translucent wraith. For that reason, I’m unable to take on new PhD students at this time.
Best of luck!
–Bob

I clicked “Send” and—part of me was expecting this—got an error. Message not sent. Email couldn’t cross the barrier from the dead to the living: too obvious.

Next I opened my “Starred” folder. I was greeted by 779 starred messages: each one a pressing matter that I’d promised myself I’d get to while alive but didn’t.

Dear Bob,
Hope you’re well. I think I’ve found another error in your 2002 paper ‘Cache-Oblivious Approximation Algorithms for Sparse Linear Algebra on Big Data.’ Specifically, in the proof of Lemma 4.2, you assume a spectral bound [har har, spectral], even though your earlier definition of the matrix A_i seems to allow arbitrary norm…

I chuckled. Well, I did spend most of my life on this stuff, didn’t I? Shouldn’t I sort this out, just for the sake of my intellectual conscience?

I opened up my old paper in Ghostview (what else?) and found the offending lemma. Then I took out pen and paper—they worked, luckily, although presumably my scribblings remained invisible to the living—and set to work. After an hour, I’d satisfied myself that the alleged error was nothing too serious, just a gap requiring a few sentences of clarification. I sadly had no direct way to tell my years-ago correspondent that, assuming the correspondent was still even alive and research-active and at the same email address. But still: good for my peace of mind, right?

Then something happened: the first intimation of what my life, or rather undeath, was to consist of from then on. Faintly but unmistakably, one of the tiny yellow stars in the left of my visual field became a blue-gray outline. It was no longer filled with yellow.

Excitedly, I clicked through more starred emails. Some I saw no easy way to deal with. But every time I could satisfy myself that an email was no longer relevant—whether it was an invitation to a long-ago workshop, a grant that I never applied for, a proposed research collaboration rendered moot by subsequent work—one of those yellow stars in my visual field lost its yellow filling. Before long there were ten blue-gray outline stars, then twenty.

One day, while I invisibly attended an old haunt (har har)—the weekly faculty lunch in my former department—I encountered a fellow ghost: a former senior colleague of mine, who’d died twenty years prior. He and I got to talking.

For the most part, my fellow specter confirmed what I’d already guessed. Yes, in some long-ago past, purgatory no doubt had a different character. Yes, it’s no doubt different for others, who lived different lives and faced different psychic burdens. For us, though, for the faculty, purgatory is neither more nor less than the place where you must reply to every last email that was still starred “important” when you died.

In the afterlife, it turns out, it doesn’t matter how “virtuous” you were, unless altruism happens to have been your obsession while alive. What matters is just that you free yourself from whatever burdened you every night when you went to sleep, that you finish what you started. Those unable to do so remain ghosts forever.

“So,” I asked the other polter-guest at the faculty lunch, “how long does it take a professor to finish answering a lifetime’s worth of emails?”

“Depends. I’ve been doing it for twenty years.  Hoping to finish in twenty more.”

“I see. And when you’ve dealt with the last email, what then?”

“You pass to another place. None of us know exactly where. But”—and here his voice dropped to a whisper, as if anyone else present could hear ghosts—“it’s said to be a place of breathtaking tranquility. Where researchers like us wear flowing robes, and sit under olive trees, and contemplate truth and beauty with Plato and Euclid, and then go out for lunch buffet. Where there’s no email, no deadlines, no journals, no grant applications, no responsibilities but one: to explore whatever has captured your curiosity in the present moment. Some call it the Paradise of Productivity.”

“Does everyone have to pass through purgatory first, before they go there?”

“It’s said that, among all the computer scientists who’ve lived, only Alan Turing went straight to Paradise. And he died before email was even invented. When his time comes, Donald Knuth might also escape purgatory, since he forswore email in 1990. But Knuth, alas, might spend tens of thousands of years in a different purgatory, finishing Volume 4 of The Art of Computer Programming.

“As for the rest of us, we all spend more or less time here with our wretched emails—for most of us, more. For one computer scientist—an Umesh Vazi-something, I believe, from Berkeley—it’s rumored that when he enters this place, even a trillion years won’t suffice to leave it. It’s said that the Sun will swallow the Earth, the night sky will go dark, and yet there Umesh will be, still clearing his inbox.”

After a few years, I’d knocked off all the easy stuff in my Starred folder. Then, alas, I was left with missives like this:

Hey, earth to Bob!
The rest of us have done our part in writing up the paper. We’re all waiting on you to integrate the TeX files, and to craft an introduction explaining why anyone cared about the problem in the first place. Also, would you mind making a detailed pass through Sections 4.3 and 5.2?

Ugh. There were so many slightly different TeX files. Which were the most recent? This could take a while.

Nevertheless, after weeks of … ghosting on the project, I got to work revising the paper. There was, of course, the practical difficulty that I couldn’t directly communicate my edits back to the world of the living. Fortunately, I could still do haunted stuff. One day, for example, one of my former coauthors opened her old TeX file, and “discovered” that I’d actually done way more work on the paper while I was alive than anyone remembered I had. The mysteries of when exactly I did that work, and why no one knew about it at the time, were never satisfactorily resolved.

Finally, after fourteen years, I’d succeeded in putting to rest 731 of my 779 starred emails. In the corner of my visual field was a vast array of blue-gray stars—but still, ominously, 48 yellow stars scattered among them.

“God in Heaven!” I cried. “Whoever you are! I can’t handle any of the remaining starred emails, and thereby pass to the Paradise of Productivity, without sending replies back into the world of the living. Please, I beg you: let me breach this metaphysical wall.”

A booming voice came down from on high. “YEA, BOB, WHAT THOU REQUESTETH IS POSSIBLE.  THOU WOULDST NOT EVEN BE THE FIRST GHOUL FOR WHOM I WOULDST GRANTETH THIS REQUEST: FAR FROM IT.  BUT I MUST WARN THEE: BREACHING THE WALL BETWEEN LIVING AND DEAD WILL BRINGETH FRUITS THAT THOU MAYST NOT LIKE.”

“I think I’ll take my chances with those fruits.”

“VERY WELL,” said God.

And that’s how it is that, half a century after my death, I remain in purgatory still, my days now filled with missives like the following:

Dear Bob,
Thanks for the reply! I’m sorry to hear that you’re now a ghost condemned to answer emails before he can pass to the next world. My sympathies. Having said that, I have to confess that I still don’t understand Section 4.2 of your paper. When you get a chance, could you please clarify? I’ve cc’ed my coauthors, who might have additional followup questions.

Note: To anyone who emailed me lately, I apologize for the delay in replying. I was writing this story. –SA

### Women in Theory 2018 – Call for Application

The wonderful Women in Theory (WIT) biennial series of workshops started in 2008 and the 6th meeting will take place at Harvard University, Jun 19 – 22, 2018. Please see below the call for application.

WIT is one of my favorite (if not the favorite) program in the theory community.  I was lucky enough to participate in the outskirts of two WIT meetings, but naturally and rightfully never took part of its inner sanctum. So how can I vouch for an event I never fully attended?  Experiencing the passion of the organizers, and most notably of Tal Rabin, and the enthusiastic reactions of its participants, there can be no doubt. I am far from being the only one feeling this way, and this year both Harvard and Stanford competed to host the workshop. So if you fit the workshop’s qualifications – please do yourself a favor and apply!

We will have a “Women In Theory” workshop for female graduate students and exceptional undergraduates (fourth year) in theoretical computer science at Harvard University, Cambridge, MA  from Tuesday, June 19 to Friday, June 22 , 2018https://womenintheory.wordpress.com/ .

We are writing to draw your attention to the workshop and ask you to inform female students in your department about this workshop. The workshop will have first-rate technical content and will be a great opportunity for students to meet their peers from around the world.  We have received very enthusiastic feedback from participants in previous years and we think that this could be an exciting event for your student as well.  Please forward this email to the female students in your department.

We will supply dorm rooms, breakfast and lunch, and will probably also be able to cover at least part if not all of their travel expenses. It would be great if you would be able to cover the remaining expenses.

For any questions, please email womenintheory2018@gmail.com

The deadline to apply is January 16, 2018.  Each student applicant needs to finish the application form on
https://womenintheory.wordpress.com/apply/ , and her advisor also needs to supply a short letter of recommendation.

Best,

The organizing committee:
Tal Rabin
Shubhangi Saraf
Lisa Zhang

by Omer Reingold at December 07, 2017 05:23 PM UTC

### Razor's Edge

Informally the sensitivity conjecture asks whether every hard Boolean function has a razor's edge input, where flipping a random bit has a reasonable chance of flipping the output.

Let's be more precise. We consider functions f mapping {0,1}n to {0,1}. For every input x, the decision tree complexity at x is the least number of bits of  x you would need to query to decide whether the function outputs 0 or 1. The decision tree complexity of a function is the maximum decision tree complexity over all possible x. Most interesting functions have high decision tree complexity, even the lowly OR function requires querying every bit on the input of all zeroes. The decision tree complexity is polynomially-equivalent to randomized-complexity, quantum complexity, certificate complexity, and the degree of a polynomial that computes the function exactly or approximately. The recent paper by Aaronson, Ben-David and Kothari gives a nice chart showing the relationship between these measures and references to the various papers giving the bounds.

The sensitivity of f on an input x is the number of bit locations i such that f(x)≠f(x⊕i) where x⊕i is x with the ith bit flipped. The sensitivity of f is the maximum sensitivity over all inputs. The sensitivity conjecture states that there is some ε>0 such that the sensitivity of f is at least mε if the decision tree complexity is at least m. If the conjecture were true then for any function with maximal decision tree complexity n (querying every input bit) there must be some razor's edge input x such that flipping a random bit of x has probability at least n of flipping the output.

I find it surprising that we have no proof or counterexample to this purely combinatorial question. There is a generalization of sensitivity known as block sensitivity which is the largest set of disjoint blocks where flipping the bits in any block flips the output bit. Block sensitivity is known to be polynomially related to decision tree complexity.

In a future post I'll talk about some approaches towards resolving this conjecture.

by Lance Fortnow (noreply@blogger.com) at December 07, 2017 03:21 PM UTC

### Hausdorff School on Combinatorial Optimization

from CS Theory Events

August 20-24, 2018 Bonn, Germany http://www.hcm.uni-bonn.de/combinatorial-optimization-2018/ Submission deadline: April 30, 2018 In this summer school, leading experts present recent progress on classical combinatorial optimization problems, utilizing a variety of new techniques. Each of the five invited speakers teaches a mini-course spanning three lectures.Filed under: school

by shacharlovett at December 07, 2017 01:48 PM UTC

### International research school on Graph Limits

from CS Theory Events

January 22-26, 2018 Lyon, France https://graphlimitslyon.sciencesconf.org The International research school on Graph Limits is part of the thematic semester “Graphs, groups and Dynamics” held in Lyon between September 2017 and February 2018.Filed under: school

by shacharlovett at December 07, 2017 01:47 PM UTC

### News for November 2017

A quiet month, with only two papers. Perhaps the calm before the storm? Please let us know in the comments if something slipped under our radar.

Agreement tests on graphs and hypergraphs, by Irit Dinur, Yuval Filmus, and Prahladh Harsha (ECCC). This work looks at agreement tests and agreement theorems, which argue that if one checks if a number of local functions agree, then there exists a global function which agrees with most of them. This work extends previous work on direct product testing to local functions of higher degree, which corresponds to agreement tests on graphs and hypergraphs.

Testing Conditional Independence of Discrete Distributions, by Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart (arXiv). This paper focuses on testing whether a bivariate discrete distribution has independent marginals, conditioned on the value of a tertiary discrete random variable. More precisely, given realizations of $(X, Y, Z)$, test if $X \perp Y \mid Z$. Unconditional independence testing (corresponding to the case when $Z$ is constant) has been extensively studied by the community, with tight upper and lower bounds showing that the sample complexity has two regimes, depending on the tradeoff between the support size and the accuracy desired. This paper shows gives upper and lower bounds for this more general problem, showing a rich landscape depending on the relative value of the parameters.

by Gautam "G" Kamath at December 07, 2017 12:51 AM UTC

### Which groups are amenable to proving exponent two for matrix multiplication?

Authors: Jonah Blasiak, Thomas Church, Henry Cohn, Joshua A. Grochow, Chris Umans
Abstract: The Cohn-Umans group-theoretic approach to matrix multiplication suggests embedding matrix multiplication into group algebra multiplication, and bounding $\omega$ in terms of the representation theory of the host group. This framework is general enough to capture the best known upper bounds on $\omega$ and is conjectured to be powerful enough to prove $\omega = 2$, although finding a suitable group and constructing such an embedding has remained elusive. Recently it was shown, by a generalization of the proof of the Cap Set Conjecture, that abelian groups of bounded exponent cannot prove $\omega = 2$ in this framework, which ruled out a family of potential constructions in the literature.

In this paper we study nonabelian groups as potential hosts for an embedding. We prove two main results:

(1) We show that a large class of nonabelian groups---nilpotent groups of bounded exponent satisfying a mild additional condition---cannot prove $\omega = 2$ in this framework. We do this by showing that the shrinkage rate of powers of the augmentation ideal is similar to the shrinkage rate of the number of functions over $(\mathbb{Z}/p\mathbb{Z})^n$ that are degree $d$ polynomials; our proof technique can be seen as a generalization of the polynomial method used to resolve the Cap Set Conjecture.

(2) We show that symmetric groups $S_n$ cannot prove nontrivial bounds on $\omega$ when the embedding is via three Young subgroups---subgroups of the form $S_{k_1} \times S_{k_2} \times \dotsb \times S_{k_\ell}$---which is a natural strategy that includes all known constructions in $S_n$.

By developing techniques for negative results in this paper, we hope to catalyze a fruitful interplay between the search for constructions proving bounds on $\omega$ and methods for ruling them out.

### Lifting Linear Extension Complexity Bounds to the Mixed-Integer Setting

Authors: Alfonso Cevallos, Stefan Weltge, Rico Zenklusen
Abstract: Mixed-integer mathematical programs are among the most commonly used models for a wide set of problems in Operations Research and related fields. However, there is still very little known about what can be expressed by small mixed-integer programs. In particular, prior to this work, it was open whether some classical problems, like the minimum odd-cut problem, can be expressed by a compact mixed-integer program with few (even constantly many) integer variables. This is in stark contrast to linear formulations, where recent breakthroughs in the field of extended formulations have shown that many polytopes associated to classical combinatorial optimization problems do not even admit approximate extended formulations of sub-exponential size.

We provide a general framework for lifting inapproximability results of extended formulations to the setting of mixed-integer extended formulations, and obtain almost tight lower bounds on the number of integer variables needed to describe a variety of classical combinatorial optimization problems. Among the implications we obtain, we show that any mixed-integer extended formulation of sub-exponential size for the matching polytope, cut polytope, traveling salesman polytope or dominant of the odd-cut polytope, needs $\Omega(n/\log n)$ many integer variables, where $n$ is the number of vertices of the underlying graph. Conversely, the above-mentioned polyhedra admit polynomial-size mixed-integer formulations with only $O(n)$ or $O(n \log n)$ (for the traveling salesman polytope) many integer variables.

Our results build upon a new decomposition technique that, for any convex set $C$, allows for approximating any mixed-integer description of $C$ by the intersection of $C$ with the union of a small number of affine subspaces.

### Arrangements of Pseudocircles: On Circularizability

Authors: Stefan Felsner, Manfred Scheucher
Abstract: An arrangement of pseudocircles is a collection of simple closed curves on the sphere or in the plane such that every pair is either disjoint or intersects in exactly two crossing points. We call an arrangement intersecting if every pair of pseudocircles intersects twice. An arrangement is circularizable if there is a combinatorially equivalent arrangement of circles.

Kang and M\"uller showed that every arrangement of at most 4 pseudocircles is circularizable. Linhart and Ortner found an arrangement of 5 pseudocircles which is not circularizable.

In this paper we show that there are exactly four non-circularizable arrangements of 5 pseudocircles, exactly one of them is intersecting. For $n=6$, we show that there are exactly three non-circularizable digon-free intersecting arrangements. We also have some additional examples of non-circularizable arrangements of 6 pseudocircles.

The proofs of non-circularizability use various techniques, most depend on incidence theorems, others use arguments involving metric properties of arrangements of planes, or angles in planar figures. The claims that we have all non-circularizable arrangements with the given properties are based on a program that generated all connected arrangements of $n\leq 6$ pseudocircles and all intersecting arrangements of $n\leq 7$ pseudocircles. Given the complete lists of arrangements, we used heuristics to find circle representations. Examples where the heuristics failed had to be examined by hand.

One of the digon-free intersecting arrangements of pseudocircles with $n=6$ is particularly interesting. It only has $8$ triangles and occurs as a subarrangement of every digon-free intersecting arrangement with less than $2n-4$ triangles for $n=7,8,9$. Hence it may be true that every digon-free intersecting arrangement of circles contains at least $2n-4$ triangles.

### Exact Algorithms With Worst-case Guarantee For Scheduling: From Theory to Practice

Authors: Lei Shang
Abstract: This PhD thesis summarizes research works on the design of exact algorithms that provide a worst-case (time or space) guarantee for NP-hard scheduling problems. Both theoretical and practical aspects are considered with three main results reported. The first one is about a Dynamic Programming algorithm which solves the F3Cmax problem in O*(3^n) time and space. The algorithm is easily generalized to other flowshop problems and single machine scheduling problems. The second contribution is about a search tree method called Branch & Merge which solves the 1||SumTi problem with the time complexity converging to O*(2^n) and in polynomial space. Our third contribution aims to improve the practical efficiency of exact search tree algorithms solving scheduling problems. First we realized that a better way to implement the idea of Branch & Merge is to use a technique called Memorization. By the finding of a new algorithmic paradox and the implementation of a memory cleaning strategy, the method succeeded to solve instances with 300 more jobs with respect to the state-of-the-art algorithm for the 1||SumTi problem. Then the treatment is extended to another three problems 1|ri|SumCi, 1|dtilde|SumwiCi and F2||SumCi. The results of the four problems all together show the power of the Memorization paradigm when applied on sequencing problems. We name it Branch & Memorize to promote a systematic consideration of Memorization as an essential building block in branching algorithms like Branch and Bound. The method can surely also be used to solve other problems, which are not necessarily scheduling problems.

### Oblivious Routing via Random Walks

Authors: Michael Schapira, Gal Shahaf
Abstract: We present novel oblivious routing algorithms for both splittable and unsplittable multicommodity flow. Our algorithm for minimizing congestion for \emph{unsplittable} multicommodity flow is the first oblivious routing algorithm for this setting. As an intermediate step towards this algorithm, we present a novel generalization of Valiant's classical load balancing scheme for packet-switched networks to arbitrary graphs, which is of independent interest. Our algorithm for minimizing congestion for \emph{splittable} multicommodity flow improves upon the state-of-the-art, in terms of both running time and performance, for graphs that exhibit good expansion guarantees. Our algorithms rely on diffusing traffic via iterative applications of the random walk operator. Consequently, the performance guarantees of our algorithms are derived from the convergence of the random walk operator to the stationary distribution and are expressed in terms of the spectral gap of the graph (which dominates the mixing time).

### Deterministic Heavy Hitters with Sublinear Query Time

Authors: Yi Li, Vasileios Nakos
Abstract: This paper studies the classic problem of finding heavy hitters in the turnstile streaming model. We give the first deterministic linear sketch that has $O(\epsilon^{-2} \log n \cdot \log^*(\epsilon^{-1}))$ rows and answers queries in sublinear time. The number of rows is only a factor of $\log^*(\epsilon^{-1})$ more than that used by the state-of-the-art algorithm prior to our paper due to Nelson, Nguyen and Woodruff (RANDOM'12). Their algorithm runs in time at least linear in the universe size $n$, which is highly undesirable in streaming applications. Our approach is based on an iterative procedure, where most unrecovered heavy hitters are identified in each iteration. Although this technique has been extensively employed in the related problem of sparse recovery, this is the first time, to the best of our knowledge, that it has been used in the context of $\ell_1$ heavy hitters.

Along the way, we also give sublinear time algorithms for the closely related problems of combinatorial group testing and $\ell_1/\ell_1$ compressed sensing, matching the space usage of previous (super-)linear time algorithms.

### Optimal Quasi-Gray Codes: Does the Alphabet Matter?

Authors: Diptarka Chakraborty, Debarati Das, Michal Koucký, Nitin Saurabh
Abstract: A quasi-Gray code of dimension $n$ and length $\ell$ over an alphabet $\Sigma$ is a sequence of distinct words $w_1,w_2,\dots,w_\ell$ from $\Sigma^n$ such that any two consecutive words differ in at most $c$ coordinates, for some fixed constant $c>0$. In this paper we are interested in the read and write complexity of quasi-Gray codes in the bit-probe model, where we measure the number of symbols read and written in order to transform any word $w_i$ into its successor $w_{i+1}$.

We present construction of quasi-Gray codes of dimension $n$ and length $3^n$ over the ternary alphabet $\{0,1,2\}$ with worst-case read complexity $O(\log n)$ and write complexity $2$. This generalizes to arbitrary odd-size alphabets. For the binary alphabet, we present quasi-Gray codes of dimension $n$ and length at least $2^n - 20n$ with worst-case read complexity $6+\log n$ and write complexity $2$.

Our results significantly improve on previously known constructions and for the odd-size alphabets we break the $\Omega(n)$ worst-case barrier for space-optimal (non-redundant) quasi-Gray codes with constant number of writes. We obtain our results via a novel application of algebraic tools together with the principles of catalytic computation [Buhrman et al. '14, Ben-Or and Cleve '92, Barrington '89, Coppersmith and Grossman '75]. We also establish certain limits of our technique in the binary case. Although our techniques cannot give space-optimal quasi-Gray codes with small read complexity over the binary alphabet, our results strongly indicate that such codes do exist.

### Is there a P-complete problem on diophantine equations?

In general deciding whether a diophantine equation has any integer solutions is equivalent to the halting problem. I believe that deciding if a quadratic diophantine equation has any solution is NP-complete. Does there exist a further restriction on the equations involved that yields a P-complete problem?

by Jacob Edelman at December 06, 2017 08:02 PM UTC

### TCS+ talk: Wednesday, December 13th — Sebastien Bubeck, MSR

The last TCS+ talk of the Fall will take place this coming Wednesday, December 13th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 18:00 UTC). Sebastien Bubeck from MSR will speak about his recent breakthrough with Michael B. Cohen, James R. Lee, Yin Tat Lee, and Aleksander Madry on “k-server via multiscale entropic regularization”  (abstract below).

Please make sure you reserve a spot for your group to join us live by signing up on the online form. As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: I will start by describing how mirror descent is a natural strategy for online decision making, specifically in online learning and metrical task systems. To motivate the k-server problem I will also briefly recall what we know and what we don’t know for structured state/action spaces in these models. Using the basic mirror descent calculations I will show how to easily obtain a log(k)-competitive algorithm for k-paging. I will then introduce our new parametrization of fractional k-server on a tree, and explain how to analyze the movement cost of entropy-regularized mirror descent on this parametrization. This leads to a depth*log(k)-competitive (fractional) algorithm for general trees, and log^2(k) for HSTs. I will also briefly mention dynamic embeddings to go beyond the standard log(n) loss in the reduction from general metrics to HSTs.

Joint work with Michael B. Cohen, James R. Lee, Yin Tat Lee, and Aleksander Madry.

by plustcs at December 06, 2017 06:25 PM UTC