*Unsupervised learning*, as the name suggests, is the science of learning from unlabeled data. A look at the Wikipedia page shows that this term has many interpretations:

**(Task A)** *Learning a distribution from samples.* (Examples: Gaussian mixtures, topic models, variational autoencoders, ...)

**(Task B)** *Understanding latent structure in the data.* This is not the same as Task A; for example, principal component analysis, clustering, and manifold learning identify latent structure but don’t learn a distribution per se.

**(Task C)** *Feature Learning.* Learn a mapping from *datapoint* $\rightarrow$ *feature vector* such that classification tasks are easier to carry out on feature vectors than on raw datapoints. For example, unsupervised feature learning could lower the number of *labeled* samples needed to learn a classifier, or be useful for *domain adaptation*.

Task B is often a subcase of Task C, as the intended users of “structure found in data” are humans (scientists) who pore over the representation of the data to gain some intuition about its properties, and these “properties” can often be phrased as a classification task.

This post explains the relationship between Tasks A and C, and why the two get mixed up in students’ minds. We hope there is also some food for thought here for experts, namely our discussion of the fragility of the usual “perplexity” definition of unsupervised learning, which explains why solving Task A doesn’t in practice lead to a good enough solution for Task C. For example, it has been believed for many years that unsupervised pretraining should help improve the training of deep nets, but this has been hard to show in practice.

If $x$ is a datapoint, each of these methods seeks to map it to a new “high-level” representation $h$ that captures its “essence.” This is why it helps to have access to $h$ when performing machine learning tasks on $x$ (e.g., classification). The difficulty, of course, is that a “high-level representation” is not uniquely defined. For example, $x$ may be an image, and $h$ may contain the information that it contains a person and a dog. But another $h$ may say that it shows a poodle and a person wearing pyjamas standing on the beach. This non-uniqueness seems inherent.

Unsupervised learning tries to learn high-level representations using unlabeled data. Each method makes an implicit assumption about how the hidden $h$ relates to the visible $x$. For example, in k-means clustering the hidden $h$ is simply the index of the cluster the datapoint belongs to. Clearly, such a simple clustering-based representation has rather limited expressive power, since it groups datapoints into disjoint classes; this limits its application in complicated settings. For example, if one clusters images according to the labels “human”, “animal”, “plant”, etc., then which cluster should contain an image showing a man and a dog standing in front of a tree?
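
To make the k-means example concrete, here is a minimal numpy sketch (our illustration, not part of the original discussion); the data and cluster count are arbitrary choices, and the point is that the entire representation of a datapoint is a single cluster index:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns centers and, for each datapoint,
    the index of its cluster -- the whole 'representation' h."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: h_t = index of the nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        h = d.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        for j in range(k):
            if (h == j).any():
                centers[j] = X[h == j].mean(axis=0)
    return centers, h

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centers, h = kmeans(X, k=3)
print(h[:10])  # each datapoint's entire representation is one integer
```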

The search for a descriptive language for talking about the possible relationships between representations and data leads us naturally to Bayesian models. (Note that these are viewed with some skepticism in machine learning theory – compared to assumptionless models like PAC learning, online learning, etc. – but we do not know of another suitable vocabulary in this setting.)

Bayesian approaches capture the relationship between the “high-level” representation $h$ and the datapoint $x$ by postulating a *joint distribution* $p_{\theta}(x, h)$ of the data $x$ and representation $h$, such that the prior $p_{\theta}(h)$ and the conditional $p_{\theta}(x \mid h)$ have a simple form as a function of the parameters $\theta$. These are also called *latent variable* probabilistic models, since $h$ is a latent (hidden) variable.

The standard goal in distribution learning is to find the $\theta$ that “best explains” the data (what we called Task A above). This is formalized using maximum-likelihood estimation, going back to Fisher (~1910-1920): find the $\theta$ that maximizes the *log probability* of the training data. Mathematically, indexing the samples with $t$, we can write this as

\begin{equation} \max_{\theta} \sum_{t} \log p_{\theta}(x_t) \qquad (1) \end{equation}

where

$$p_{\theta}(x_t) = \sum_{h_t} p_{\theta}(x_t, h_t).$$

(Note that $\sum_{t} \log p_{\theta}(x_t)$, suitably normalized, is the empirical estimate of $E_{x}[\log p_{\theta}(x)]$, where $x$ is distributed according to $p^*$, the true distribution of the data; its negation is the *cross-entropy* of $p_{\theta}$ relative to $p^*$. Thus the above method looks for the distribution with the best cross-entropy on the empirical data, and this cross-entropy is also the log of the *perplexity* of $p_{\theta}$.)
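
As a concrete illustration of objective (1) and the perplexity connection, here is a minimal numpy sketch for a toy latent-variable model of our own choosing (a three-component, unit-variance Gaussian mixture; all parameters are assumptions for the example):

```python
import numpy as np

# Toy latent-variable model: h ~ Categorical(prior), x | h ~ N(mu_h, 1).
prior = np.array([0.5, 0.3, 0.2])    # p_theta(h)
mu    = np.array([-4.0, 0.0, 4.0])   # means of p_theta(x | h)

def log_px(x):
    """log p_theta(x) = log sum_h p(h) p(x|h): the summand in objective (1)."""
    log_joint = (np.log(prior) - 0.5 * (x[:, None] - mu) ** 2
                 - 0.5 * np.log(2 * np.pi))      # log p(h) + log p(x | h)
    m = log_joint.max(axis=1, keepdims=True)     # stable log-sum-exp over h
    return (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))).ravel()

rng = np.random.default_rng(0)
x = rng.normal(mu[rng.choice(3, size=1000, p=prior)], 1.0)  # samples of x
avg_ll = log_px(x).mean()               # empirical estimate of E[log p_theta(x)]
print("average log-likelihood:", avg_ll)
print("perplexity:", np.exp(-avg_ll))   # log of the perplexity = cross-entropy
```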

In the limit $t \to \infty$, this estimator is *consistent* (converges in probability to the ground-truth value) and *efficient* (has the lowest asymptotic mean-square error among all consistent estimators). See the Wikipedia page. (Aside: maximum-likelihood estimation is often NP-hard, which is one of the reasons for the renaissance of method-of-moments and tensor-decomposition algorithms for learning latent variable models, which Rong wrote about some time ago.)

Simply learning the distribution $p_{\theta}(x, h)$ does not yield a representation *per se.* To get a representation of a datapoint $x$, we need access to the posterior $p_{\theta}(h \mid x)$: a sample from this posterior can then be used as a “representation” of $x$. (Aside: sometimes, in settings where $p_{\theta}(h \mid x)$ has a simple description, this description can be viewed as the representation of $x$.)

Thus solving Task C requires learning distribution parameters $\theta$ *and* figuring out how to efficiently sample from the posterior distribution.

Note that the sampling problem for the posterior can be #P-hard even for very simple families. The reason is that by Bayes’ rule, $p_{\theta}(h \mid x) = \frac{p_{\theta}(h) p_{\theta}(x \mid h)}{p_{\theta}(x)}$. Even if the numerator is easy to calculate, as is the case for simple families, the denominator $p_{\theta}(x) = \sum_h p_{\theta}(h) p_{\theta}(x \mid h)$ involves a big summation (or integral) and is often hard to calculate.
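
For intuition, here is a sketch of Bayes’ rule in the toy mixture above (our example): the normalizer $p_{\theta}(x)$ is a three-term sum here, so the posterior is trivially computable; the hardness appears only when $h$ ranges over exponentially many configurations:

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])    # p_theta(h)
mu    = np.array([-4.0, 0.0, 4.0])   # means of p_theta(x | h)

def posterior(x):
    """p(h | x) by direct Bayes' rule; the sum in the denominator is the
    quantity that becomes intractable for richer latent spaces."""
    joint = prior * np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)  # p(h) p(x|h)
    return joint / joint.sum()                                         # divide by p(x)

q = posterior(3.5)
print(q)                                     # mass concentrates on the mu_h = 4 component
h = np.random.default_rng(0).choice(3, p=q)  # a sampled "representation" of x = 3.5
print("sampled representation h =", h)
```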

Note that max-likelihood parameter estimation (Task A) and approximating the posterior distribution $p(h \mid x)$ (Task C) can have radically different complexities: sometimes A is easy but C is NP-hard (example: topic modeling with “nice” topic-word matrices but short documents; see also Bresler 2015); or vice versa (example: topic modeling with long documents but worst-case chosen topic matrices, Arora et al. 2011).

Of course, one may hope (as usual) that computational complexity is a worst-case notion and may not apply in practice. But there is a bigger issue with this setup, having to do with accuracy.

The above description assumes that the parametric model $p_{\theta}(x, h)$ for the data is *exact*, whereas in practice it is only *approximate* (i.e., it suffers from modeling error). Furthermore, computational difficulties may restrict us to approximately correct inference even if the model were exact. So in practice, we may only have an *approximation* $q(h \mid x)$ to the posterior distribution $p_{\theta}(h \mid x)$. (Below we describe a popular method for computing such approximations.)

How good an approximation to the true posterior do we need?

Recall that we are trying to answer this question through the lens of Task C, i.e., solving some classification task. We take the following point of view:

For $t=1, 2,\ldots,$ nature picks some $(h_t, x_t)$ from the joint distribution and presents us with $x_t$. The true label $y_t$ of $x_t$ is $\mathcal{C}(h_t)$, where $\mathcal{C}$ is an unknown classifier. Our goal is to classify according to these labels.

To simplify notation, assume the output of $\mathcal{C}$ is binary. If we wish to use $q(h \mid x)$ as a surrogate for the true posterior $p_{\theta}(h \mid x)$, we need to have

$$\Pr_{h_t \sim q(\cdot \mid x_t)}[\mathcal{C}(h_t) = y_t] \approx \Pr_{h_t \sim p_{\theta}(\cdot \mid x_t)}[\mathcal{C}(h_t) = y_t].$$

How close must $q(h \mid x)$ and $p(h \mid x)$ be to let us conclude this? We will use KL divergence as “distance” between the distributions, for reasons that will become apparent in the following section. We claim the following:

CLAIM: The probability of obtaining different answers on classification tasks done using the ground truth $h$ versus the representations obtained using $q(h_t \mid x_t)$ is less than $\epsilon$ if $KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t)) \leq 2\epsilon^2.$

Here’s a proof sketch. The natural distance between these two distributions $q(h \mid x)$ and $p(h \mid x)$ with respect to accuracy on classification tasks is the *total variation (TV)* distance. Indeed, if the TV distance between $q(h\mid x)$ and $p(h \mid x)$ is bounded by $\epsilon$, then for any event $\Omega$,

$$\left| \Pr_{h \sim q(h \mid x)}[\Omega] - \Pr_{h \sim p(h \mid x)}[\Omega] \right| \leq \epsilon.$$

The CLAIM now follows by instantiating this with the event $\Omega =$ “Classifier $\mathcal{C}$ outputs something different from $y_t$ given representation $h_t$ for input $x_t$”, and then relating TV distance to KL divergence using Pinsker’s inequality, which gives $\mbox{TV}(q(h_t \mid x_t),p(h_t \mid x_t)) \leq \sqrt{\frac{1}{2} KL(q(h_t \mid x_t) \parallel p(h_t \mid x_t))}$. *QED*
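
The claim is easy to check numerically; here is a small sketch (our own sanity check, not part of the proof) confirming $\mbox{TV}(q, p) \leq \sqrt{\frac{1}{2} KL(q \parallel p)}$ on random discrete distributions, so that $KL \leq 2\epsilon^2$ indeed forces $\mbox{TV} \leq \epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):  # total variation distance
    return 0.5 * np.abs(p - q).sum()

def kl(q, p):  # KL divergence KL(q || p)
    return (q * np.log(q / p)).sum()

# Pinsker's inequality on random pairs of distributions over 10 outcomes.
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    assert tv(p, q) <= np.sqrt(0.5 * kl(q, p)) + 1e-12
    print(f"TV = {tv(p, q):.4f}   sqrt(KL/2) = {np.sqrt(0.5 * kl(q, p)):.4f}")
```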

This observation explains why solving Task A in practice does not automatically lead to very useful representations for classification tasks (Task C): the posterior distribution has to be learnt extremely accurately, which probably doesn’t happen (whether due to model mismatch or computational complexity).

As noted, distribution learning (Task A) goes via cross-entropy/maximum-likelihood fitting, which looks like an information-coding task. Representation learning (Task C) via sampling the posterior seems fairly distinct. Why do students often conflate the two? Because in practice the most common way to solve Task A does implicitly compute posteriors, and thus also appears to solve Task C. (Although, as noted above, the accuracy may be insufficient.)

The generic way to learn latent variable models involves variational methods, which can be viewed as a generalization of the famous EM algorithm (Dempster et al. 1977).

Variational methods maintain at all times a *proposed distribution* $q(h \mid x)$ (called the *variational distribution*). The methods rely on the observation that for every such $q(h \mid x)$ the following lower bound holds:
\begin{equation} \log p(x) \geq E_{q(h \mid x)}[\log p(x,h)] + H(q(h\mid x)) \qquad (2) \end{equation}
where $H$ denotes Shannon entropy (or differential entropy, depending on whether $h$ is discrete or continuous). The RHS above is often called the *ELBO* (evidence lower bound). This inequality follows from a bit of algebra using the non-negativity of KL divergence, applied to the distributions $q(h \mid x)$ and $p(h\mid x)$. More concretely, the chain of (in)equalities is as follows:

$$\log p(x) = E_{q(h \mid x)}\left[\log \frac{p(x,h)}{p(h \mid x)}\right] = E_{q(h \mid x)}[\log p(x,h)] + H(q(h \mid x)) + KL\big(q(h \mid x) \parallel p(h \mid x)\big) \geq E_{q(h \mid x)}[\log p(x,h)] + H(q(h \mid x)).$$

Furthermore, *equality* is achieved if $q(h\mid x) = p(h\mid x)$. (This can be viewed as some kind of “duality” theorem for distributions, and dates all the way back to Gibbs.)
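
The identity behind (2), namely $\log p(x) = \mathrm{ELBO} + KL(q(h \mid x) \parallel p(h \mid x))$, can be verified numerically on a small random joint distribution; the following sketch (our own toy check) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random joint p(x, h) on a 4 x 6 grid and a random variational q(h | x).
p_joint = rng.dirichlet(np.ones(24)).reshape(4, 6)  # p(x, h)
p_x     = p_joint.sum(axis=1)                       # marginal p(x)
p_h_x   = p_joint / p_x[:, None]                    # posterior p(h | x)
q       = rng.dirichlet(np.ones(6), size=4)         # q(h | x), one row per x

for x in range(4):
    elbo = (q[x] * np.log(p_joint[x])).sum() - (q[x] * np.log(q[x])).sum()
    div  = (q[x] * np.log(q[x] / p_h_x[x])).sum()   # KL(q(.|x) || p(.|x))
    # log p(x) = ELBO + KL >= ELBO, which is exactly inequality (2)
    assert np.isclose(np.log(p_x[x]), elbo + div)
print("identity log p(x) = ELBO + KL verified")
```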

Algorithmically, observation (2) is used by forgoing the maximum-likelihood optimization (1) and instead solving

$$\max_{\theta,\, \{q(h_t \mid x_t)\}} \sum_{t} \Big( E_{q(h_t \mid x_t)}[\log p_{\theta}(x_t, h_t)] + H(q(h_t \mid x_t)) \Big).$$

Since the variables are naturally divided into two blocks, the model parameters $\theta$ and the variational distributions $q(h_t\mid x_t)$, a natural way to optimize the above is to *alternate*: optimize over one block while keeping the other fixed. (This meta-algorithm is often called variational EM, for obvious reasons.)

Of course, optimizing over all possible distributions $q$ is an ill-defined problem, so typically one constrains $q$ to lie in some parametric family (e.g., “standard Gaussian transformed by depth-$4$ neural nets of a certain size and architecture”) such that maximizing the ELBO over $q$ is a tractable problem in practice. Clearly, if the parametric family of distributions is expressive enough, and the (non-convex) optimization problem doesn’t get stuck in bad local minima, then the variational EM algorithm will give us not only values of the parameters $\theta$ which are close to the ground-truth ones, but also variational distributions $q(h\mid x)$ which accurately track $p(h\mid x)$. But as we saw above, this accuracy would need to be very high to get meaningful representations.
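
For concreteness, here is a minimal sketch of variational EM on a toy mixture like the one above (our example; model and data are assumptions). The variational family is taken to be all distributions over $h$, so the $q$-step recovers the exact posterior and the procedure reduces to classical EM:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([-4.0, 0.0, 4.0])
x = rng.normal(true_mu[rng.choice(3, size=2000)], 1.0)  # unlabeled data

# Variational EM for a 3-component, unit-variance Gaussian mixture.
prior, mu = np.ones(3) / 3, np.array([-1.0, 0.0, 1.0])  # initial theta
for _ in range(100):
    # q-step: maximize the ELBO over q with theta fixed -> q = p_theta(h | x_t)
    log_joint = np.log(prior) - 0.5 * (x[:, None] - mu) ** 2
    q = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # theta-step: maximize the ELBO over theta with q fixed (closed form here)
    prior = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(np.sort(mu))  # should end up close to the ground-truth means (-4, 0, 4)
```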

In the next post, we will describe our recent work further clarifying this issue of representation learning via a Bayesian viewpoint.

**Authors:** Marius Zimand **Download:** PDF **Abstract:** Distributed compression is the task of compressing correlated data by several
parties, each one possessing one piece of data and acting separately. The
classical Slepian-Wolf theorem (D. Slepian, J. K. Wolf, IEEE Transactions on
Inf. Theory, 1973) shows that if data is generated by independent draws from a
joint distribution, that is by a memoryless stochastic process, then
distributed compression can achieve the same compression rates as centralized
compression when the parties act together. Recently, the author (M. Zimand,
STOC 2017) has obtained an analogue version of the Slepian-Wolf theorem in the
framework of Algorithmic Information Theory (also known as Kolmogorov
complexity). The advantage over the classical theorem is that the AIT version
works for individual strings, without any assumption regarding the generative
process. The only requirement is that the parties know the complexity profile
of the input strings, which is a simple quantitative measure of the data
correlation. The goal of this paper is to present, in an accessible form that omits some technical details, the main ideas from the reference (M. Zimand, STOC 2017).

**Authors:** Tobias Friedrich, Anton Krohmer, Ralf Rothenberger, Thomas Sauerwald, Andrew M. Sutton **Download:** PDF **Abstract:** Propositional satisfiability (SAT) is one of the most fundamental problems in
computer science. The worst-case hardness of SAT lies at the core of
computational complexity theory. The average-case analysis of SAT has triggered
the development of sophisticated rigorous and non-rigorous techniques for
analyzing random structures.

Despite a long line of research and substantial progress, nearly all theoretical work on random SAT assumes a uniform distribution on the variables. In contrast, real-world instances often exhibit large fluctuations in variable occurrence. This can be modeled by a scale-free distribution of the variables, which results in distributions closer to industrial SAT instances.

We study random k-SAT on n variables, $m=\Theta(n)$ clauses, and a power law distribution on the variable occurrences with exponent $\beta$. We observe a satisfiability threshold at $\beta=(2k-1)/(k-1)$. This threshold is tight in the sense that instances with $\beta\le(2k-1)/(k-1)-\varepsilon$ for any constant $\varepsilon>0$ are unsatisfiable with high probability (w.h.p.). For $\beta\geq(2k-1)/(k-1)+\varepsilon$, the picture is reminiscent of the uniform case: instances are satisfiable w.h.p. for sufficiently small constant clause-variable ratios $m/n$; they are unsatisfiable above a ratio $m/n$ that depends on $\beta$.

**Authors:** Marc Roth **Download:** PDF **Abstract:** We present a framework for the complexity classification of parameterized
counting problems that can be formulated as the summation over the numbers of
homomorphisms from small pattern graphs H_1,...,H_l to a big host graph G with
the restriction that the coefficients correspond to evaluations of the Möbius
function over the lattice of a graphic matroid. This generalizes the idea of
Curticapean, Dell and Marx [STOC 17] who used a result of Lovász stating that
the number of subgraph embeddings from a graph H to a graph G can be expressed
as such a sum over the lattice of partitions of H. In the first step we
introduce what we call graphically restricted homomorphisms that, inter alia,
generalize subgraph embeddings as well as locally injective homomorphisms. We
provide a complete parameterized complexity dichotomy for counting such
homomorphisms, that is, we identify classes of patterns for which the problem
is fixed-parameter tractable (FPT), including an algorithm, and prove that all
other pattern classes lead to #W[1]-hard problems. The main ingredients of the
proof are the complexity classification of linear combinations of homomorphisms
due to Curticapean, Dell and Marx [STOC 17] as well as a corollary of Rota's
NBC Theorem which states that the sign of the Möbius function over a
geometric lattice only depends on the rank of its arguments. We use the general
theorem to classify the complexity of counting locally injective homomorphisms
as well as homomorphisms that are injective in the r-neighborhood for constant
r. Furthermore, we show that the former has "real" FPT cases by considering the
subgraph counting problem restricted to trees on both sides. Finally we show
that the dichotomy for counting graphically restricted homomorphisms readily
extends to so-called linear combinations.

**Authors:** Tommi Junttila, Matti Karppa, Petteri Kaski, Jukka Kohonen (Aalto University, Department of Computer Science) **Download:** PDF **Abstract:** This paper presents a technique for symmetry reduction that adaptively
assigns a prefix of variables in a system of constraints so that the generated
prefix-assignments are pairwise nonisomorphic under the action of the symmetry
group of the system. The technique is based on McKay's canonical extension
framework [J. Algorithms 26 (1998), no. 2, 306-324]. Among key features of the
technique are (i) adaptability - the prefix sequence can be user-prescribed and
truncated for compatibility with the group of symmetries; (ii)
parallelisability - prefix-assignments can be processed in parallel
independently of each other; (iii) versatility - the method is applicable
whenever the group of symmetries can be concisely represented as the
automorphism group of a vertex-colored graph; and (iv) implementability - the
method can be implemented relying on a canonical labeling map for
vertex-colored graphs as the only nontrivial subroutine. To demonstrate the
tentative practical applicability of our technique we have prepared a
preliminary implementation and report on a limited set of experiments that
demonstrate ability to reduce symmetry on hard instances.

**Authors:** Arnold Filtser **Download:** PDF **Abstract:** In the Steiner point removal (SPR) problem, we are given a weighted graph
$G=(V,E)$ and a set of terminals $K\subset V$ of size $k$. The objective is to
find a minor $M$ of $G$ with only the terminals as its vertex set, such that
the distance between the terminals will be preserved up to a small
multiplicative distortion. Kamma, Krauthgamer and Nguyen [KKN15] used a
ball-growing algorithm with exponential distributions to show that the
distortion is at most $O(\log^5 k)$. Cheung [Che17] improved the analysis of
the same algorithm, bounding the distortion by $O(\log^2 k)$. We improve the
analysis of this ball-growing algorithm even further, bounding the distortion
by $O(\log k)$.

**Authors:** Ahmad Biniaz, Anil Maheshwari, Michiel Smid **Download:** PDF **Abstract:** Counting the number of interior disjoint empty convex polygons in a point set
is a typical Erdős-Szekeres-type problem. We study this problem for 4-gons.
Let $P$ be a set of $n$ points in the plane and in general position. A subset
$Q$ of $P$ with four points is called a $4$-hole in $P$ if the convex hull of
$Q$ is a quadrilateral and does not contain any point of $P$ in its interior.
Two 4-holes in $P$ are compatible if their interiors are disjoint. We show that
$P$ contains at least $\lfloor 5n/11\rfloor {-} 1$ pairwise compatible 4-holes.
This improves the lower bound of $2\lfloor(n-2)/5\rfloor$ which is implied by a
result of Sakai and Urrutia (2007).

**Authors:** Nina Chiarelli, Tatiana R. Hartinger, Matthew Johnson, Martin Milanič, Daniël Paulusma **Download:** PDF **Abstract:** We perform a systematic study of the computational complexity of the
connected variant of three related transversal problems: Vertex Cover, Feedback
Vertex Set, and Odd Cycle Transversal. Just like their original counterparts,
these variants are NP-complete for general graphs. However, apart from the fact
that Connected Vertex Cover is NP-complete for line graphs (and thus for
claw-free graphs) not much is known when the input is restricted to $H$-free
graphs. We show that the three connected variants remain NP-complete if $H$
contains a cycle or claw. In the remaining case $H$ is a linear forest. We show
that Connected Vertex Cover, Connected Feedback Vertex Set, and Connected Odd
Cycle Transversal are polynomial-time solvable for $sP_2$-free graphs for every
constant $s\geq 1$. For proving these results we use known results on the price
of connectivity for vertex cover, feedback vertex set, and odd cycle
transversal. This is the first application of the price of connectivity that
results in polynomial-time algorithms.

**Authors:** Heng Zhou, Zhiqiang Xu **Download:** PDF **Abstract:** The $p$-set, which is in a simple analytic form, is well distributed in unit
cubes. The well-known Weil's exponential sum theorem presents an upper bound of
the exponential sum over the $p$-set. Based on the result, one shows that the
$p$-set performs well in numerical integration, in compressed sensing as well
as in UQ. However, the $p$-set is somewhat rigid, since its cardinality is a prime $p$ and the set depends only on the prime number $p$. The purpose of this paper is to present generalizations of $p$-sets, say $\mathcal{P}_{d,p}^{{\mathbf a},\epsilon}$, which are more flexible.
Particularly, when a prime number $p$ is given, we have many different choices
of the new $p$-sets. Under the assumption that the Goldbach conjecture holds, for
any even number $m$, we present a point set, say ${\mathcal L}_{p,q}$, with
cardinality $m-1$ by combining two different new $p$-sets, which overcomes a
major bottleneck of the $p$-set. We also present the upper bounds of the
exponential sums over $\mathcal{P}_{d,p}^{{\mathbf a},\epsilon}$ and ${\mathcal
L}_{p,q}$, which imply these sets have many potential applications.

**Authors:** Prosenjit Bose, Jean-Lou De Carufel, Alina Shaikhet, Michiel Smid **Download:** PDF **Abstract:** Art Gallery Localization (AGL) is the problem of placing a set $T$ of
broadcast towers in a simple polygon $P$ in order for a point to locate itself
in the interior. For any point $p \in P$: for each tower $t \in T \cap V(p)$
(where $V(p)$ denotes the visibility polygon of $p$) the point $p$ receives the
coordinates of $t$ and the Euclidean distance between $t$ and $p$. From this
information $p$ can determine its coordinates. We study the computational
complexity of the AGL problem. We show that the problem of determining the minimum
number of broadcast towers that can localize a point anywhere in a simple
polygon $P$ is NP-hard. We show a reduction from the Boolean Three Satisfiability
problem to our problem and give a proof that the reduction takes polynomial
time.

**Authors:** Zhaocheng Yang, Rodrigo C. de Lamare, Weijian Liu **Download:** PDF **Abstract:** We present a novel sparsity-based space-time adaptive processing (STAP)
technique based on the alternating direction method to overcome the severe
performance degradation caused by array gain/phase (GP) errors. The proposed
algorithm reformulates the STAP problem as a joint optimization problem of the
spatio-Doppler profile and GP errors in both single and multiple snapshots, and
introduces a target detector using the reconstructed spatio-Doppler profiles.
Simulations are conducted to illustrate the benefits of the proposed algorithm.

**Authors:** Erik D. Demaine, Mikhail Rudoy **Download:** PDF **Abstract:** In this paper, we introduce a new problem called Tree-Residue Vertex-Breaking
(TRVB): given a multigraph $G$ some of whose vertices are marked "breakable,"
is it possible to convert $G$ into a tree via a sequence of "vertex-breaking"
operations (disconnecting the edges at a degree-$k$ breakable vertex by
replacing that vertex with $k$ degree-$1$ vertices)? We consider the special
cases of TRVB with any combination of the following additional constraints: $G$
must be planar, $G$ must be a simple graph, the degree of every breakable
vertex must belong to an allowed list $B$, and the degree of every unbreakable
vertex must belong to an allowed list $U$. We fully characterize these variants
of TRVB as polynomially solvable or NP-complete. The two results which we
expect to be most generally applicable are that (1) TRVB is polynomially
solvable when breakable vertices are restricted to have degree at most $3$; and
(2) for any $k \ge 4$, TRVB is NP-complete when the given multigraph is
restricted to be planar and to consist entirely of degree-$k$ breakable
vertices. To demonstrate the use of TRVB, we give a simple proof of the known
result that Hamiltonicity in max-degree-$3$ square grid graphs is NP-hard.

**Authors:** Andrew Drucker **Download:** PDF **Abstract:** We describe a communication game, and a conjecture about this game, whose
proof would imply the well-known Sensitivity Conjecture asserting a polynomial
relation between sensitivity and block sensitivity for Boolean functions. The
author defined this game and observed the connection in Dec. 2013 - Jan. 2014.
The game and connection were independently discovered by Gilmer, Koucký, and
Saks, who also established further results about the game (not proved by us)
and published their results in ITCS '15 [GKS15].

This note records our independent work, including some observations that did not appear in [GKS15]. Namely, the main conjecture about this communication game would imply not only the Sensitivity Conjecture, but also a stronger hypothesis raised by Chung, Füredi, Graham, and Seymour [CFGS88]; and, another related conjecture we pose about a "query-bounded" variant of our communication game would suffice to answer a question of Aaronson, Ambainis, Balodis, and Bavarian [AABB14] about the query complexity of the "Weak Parity" problem---a question whose resolution was previously shown by [AABB14] to follow from a proof of the Chung et al. hypothesis.

**Authors:** Kyle Kloster, Philipp Kuinke, Michael P. O'Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel **Download:** PDF **Abstract:** The Flow Decomposition problem, which asks for the smallest set of weighted
paths that "covers" a flow on a DAG, has recently been used as an important
computational step in genetic assembly problems. We prove the problem is in FPT
when parameterized by the number of paths, and we give a practical linear FPT
algorithm. Combining this approach with algorithm engineering, we implement a
Flow Decomposition solver and demonstrate its competitiveness with a
state-of-the-art heuristic on RNA sequencing data. We contextualize our design
choices with two hardness results related to preprocessing and weight recovery.
First, the problem does not admit polynomial kernels under standard complexity
assumptions. Second, the related problem of assigning weights to a given set of
paths is NP-hard even when the weights are known.

**Authors:** Radi Muhammad Reza, Mohammed Eunus Ali, Muhammad Aamir Cheema **Download:** PDF **Abstract:** Recently, with the advancement of GPS-enabled cellular technologies, location-based services (LBS) have gained in popularity. Nowadays, an
increasingly large number of map-based applications enable users to ask a
wider variety of queries. Researchers have studied the ride-sharing, the
carpooling, the vehicle routing, and the collective travel planning problems
extensively in recent years. Collective traveling has the benefit of being
environment-friendly by reducing the global travel cost, the greenhouse gas
emission, and the energy consumption. In this paper, we introduce several
optimization problems to recommend a suitable route and stops of a vehicle, in
a road network, for a group of users intending to travel collectively. The goal
of each problem is to minimize the aggregate cost of the individual travelers'
paths and the shared route under various constraints. First, we formulate the
problem of determining the optimal pair of end-stops, given a set of queries
that originate and terminate near the two prospective end regions. We outline a
baseline polynomial-time algorithm and propose a new faster solution - both
calculating an exact answer. In our approach, we utilize the path-coherence
property of road networks to develop an efficient algorithm. Second, we define
the problem of calculating the optimal route and intermediate stops of a
vehicle that picks up and drops off passengers en-route, given its start and
end stoppages, and a set of path queries from users. We outline an exact
solution of both time and space complexities exponential in the number of
queries. Then, we propose a novel polynomial-time-and-space heuristic algorithm
that performs reasonably well in practice. We also analyze several variants of
this problem under different constraints. Last, we perform extensive
experiments that demonstrate the efficiency and accuracy of our algorithms.

**Authors:** Ervin Győri, Tamás Róbert Mezei **Download:** PDF **Abstract:** We prove that every simply connected orthogonal polygon of $n$ vertices can
be partitioned into $\left\lfloor\frac{3 n +4}{16}\right\rfloor$ (simply
connected) orthogonal polygons of at most 8 vertices. It yields a new and
shorter proof of the theorem of A. Aggarwal that $\left\lfloor\frac{3 n
+4}{16}\right\rfloor$ mobile guards are sufficient to control the interior of
an $n$-vertex orthogonal polygon. Moreover, we strengthen this result by
requiring combinatorial guards (visibility is only required at the endpoints of
patrols) and prohibiting intersecting patrols. This yields positive answers to
two questions of O'Rourke. Our result is also a further example of the
"metatheorem" that (orthogonal) art gallery theorems are based on partition
theorems.

The 2017 STOC is over, and I thought it went very well. The new format ran with what seemed to me to be minimal to non-existent glitches, and overall it sounded like people enjoyed it. The local arrangements were terrific -- much much thanks to Hamed Hatami and Pierre McKenzie who made the whole thing look easy. (It's not.) I'd have liked a few dozen more people, but I'm hoping we'll see some positive momentum going into next year.

A heads-up to mark your calendars now: STOC 2018 will be held June 23-27 in Los Angeles.

I'm putting this post up to see if anyone wants to make general comments about the TheoryFest/STOC 2017 experience. Feedback is always useful, and if there's any constructive criticism and/or wild enthusiasm for any parts of the 2017 STOC, we'll keep that in mind as we go forward next year. Please, however, be respectful to those who did the work of putting everything together.

And for those who went: commenting here doesn't absolve you from filling out the survey that will be sent out, though!

The Panel on TCS: The Next Decade

By the numbers: 370 attendees, 46% students. 103 accepted papers out of 421 submitted. These numbers are moderate increases over recent years.

The Panel on TCS: The Next Decade talked about everything but the next decade. A few of my favorite quotes: "Hard instances are everywhere except where people care" (Russell Impagliazzo, who walked back a little from it later in the discussion). "I never know when I proved my last theorem" (Dan Spielman on why he keeps trying). Generally the panel gave great advice on how to do research and talk with other disciplines.

Avi Wigderson argued that theory of computing has become "an independent academic discipline" which has strong ties to many others, of which computer science is just one example. He didn't quite go as far as suggesting a separate department but he outlined a TCS major and argued that our concepts should be taught as early as elementary school.

Oded Goldreich received the Knuth Prize and said that researchers should focus on their research and not on their careers. The SIGACT Distinguished Service Award went to Alistair Sinclair for his work at the Simons Institute.

Oded apologized for lying about why he was attending STOC this year. TheoryFest will be a true success when you need reasons to not attend STOC. All happens again next year in Los Angeles (June 23-27) for the 50th STOC. Do be there.

**Authors:** Arya Mazumdar, Barna Saha **Download:** PDF **Abstract:** Suppose we are given a set of $n$ elements to be clustered into $k$
(unknown) clusters, and an oracle/expert labeler that can interactively answer
pair-wise queries of the form, "do two elements $u$ and $v$ belong to the same
cluster?". The goal is to recover the optimum clustering by asking the minimum
number of queries. In this paper, we initiate a rigorous theoretical study of
this basic problem of query complexity of interactive clustering, and provide
strong information theoretic lower bounds, as well as nearly matching upper
bounds. Most clustering problems come with a similarity matrix, which is used
by an automated process to cluster similar points together. Our main
contribution in this paper is to show the dramatic power of side information
aka similarity matrix on reducing the query complexity of clustering. A
similarity matrix represents noisy pair-wise relationships such as one computed
by some function on attributes of the elements. A natural noisy model is where
similarity values are drawn independently from some arbitrary probability
distribution $f_+$ when the underlying pair of elements belong to the same
cluster, and from some $f_-$ otherwise. We show that given such a similarity
matrix, the query complexity reduces drastically from $\Theta(nk)$ (no
similarity matrix) to $O(\frac{k^2\log{n}}{\mathcal{H}^2(f_+\|f_-)})$ where $\mathcal{H}^2$
denotes the squared Hellinger divergence. Moreover, this is also
information-theoretic optimal within an $O(\log{n})$ factor. Our algorithms are
all efficient, and parameter free, i.e., they work without any knowledge of $k,
f_+$ and $f_-$, and depend only logarithmically on $n$. Along the way, our
work also reveals intriguing connection to popular community detection models
such as the *stochastic block model*, significantly generalizes them, and
opens up many avenues for interesting future research.

**Authors:** Steve Hanneke, Liu Yang **Download:** PDF **Abstract:** This work explores the query complexity of property testing for general
piecewise functions on the real line, in the active and passive property
testing settings. The results are proven under an abstract zero-measure
crossings condition, which has as special cases piecewise constant functions
and piecewise polynomial functions. We find that, in the active testing
setting, the query complexity of testing general piecewise functions is
independent of the number of pieces. We also identify the optimal dependence on
the number of pieces in the query complexity of passive testing in the special
case of piecewise constant functions.

**Authors:** René Sitters, Liya Yang **Download:** PDF **Abstract:** We give a $(2 + \epsilon)$-approximation algorithm for minimizing total
weighted completion time on a single machine under release time and precedence
constraints. This settles a recent conjecture made in [18].

**Authors:** Arya Mazumdar, Barna Saha **Download:** PDF **Abstract:** In this paper, we initiate a rigorous theoretical study of clustering with
noisy queries (or a faulty oracle). Given a set of $n$ elements, our goal is to
recover the true clustering by asking a minimum number of pairwise queries to an
oracle. The oracle can answer queries of the form: "do elements $u$ and $v$ belong
to the same cluster?" -- the queries can be asked interactively (adaptive
queries), or non-adaptively up-front, but its answer can be erroneous with
probability $p$. In this paper, we provide the first information theoretic
lower bound on the number of queries for clustering with noisy oracle in both
situations. We design novel algorithms that closely match this query complexity
lower bound, even when the number of clusters is unknown. Moreover, we design
computationally efficient algorithms both for the adaptive and non-adaptive
settings. The problem captures/generalizes multiple application scenarios. It
is directly motivated by the growing body of work that use crowdsourcing for
*entity resolution*, a fundamental and challenging data mining task aimed
to identify all records in a database referring to the same entity. Here crowd
represents the noisy oracle, and the number of queries directly relates to the
cost of crowdsourcing. Another application comes from the problem of *sign edge prediction* in social networks, where social interactions can be both
positive and negative, and one must identify the sign of all pair-wise
interactions by querying a few pairs. Furthermore, clustering with noisy oracle
is intimately connected to correlation clustering, leading to improvement
therein. Finally, it introduces a new direction of study in the popular *stochastic block model*, where one has an incomplete stochastic block model
matrix to recover the clusters.

**Authors:** Arne Leitert, Feodor F. Dragan **Download:** PDF **Abstract:** We develop efficient parameterized approximation algorithms, with additive error, for the (Connected) $r$-Domination problem and the (Connected)
$p$-Center problem for unweighted and undirected graphs. Given a graph $G$, we
show how to construct a (connected) $\big(r + \mathcal{O}(\mu)
\big)$-dominating set $D$ with $|D| \leq |D^*|$ efficiently. Here, $D^*$ is a
minimum (connected) $r$-dominating set of $G$ and $\mu$ is our graph parameter,
which is the tree-breadth or the cluster diameter in a layering partition of
$G$. Additionally, we show that a $+ \mathcal{O}(\mu)$-approximation for the
(Connected) $p$-Center problem on $G$ can be computed in polynomial time. Our
interest in these parameters stems from the fact that in many real-world
networks, including Internet application networks, web networks, collaboration
networks, social networks, biological networks, and others, and in many
structured classes of graphs these parameters are small constants.

**Authors:** Felipe Cucker, Peter Bürgisser, Pierre Lairez **Download:** PDF **Abstract:** We describe and analyze an algorithm for computing the homology (Betti
numbers and torsion coefficients) of basic semialgebraic sets which works in
weak exponential time. That is, out of a set of exponentially small measure in
the space of data the cost of the algorithm is exponential in the size of the
data. All algorithms previously proposed for this problem have a complexity
which is doubly exponential (and this is so for almost all data).

I have been physically handicapped the past several months, but managed to make it with some difficulty and pain to the 50th celebration of ACM Turing Award.

I liked the deep learning panel pitting the thinkers -- Stuart Russell, Michael Jordan and others -- against/with the doers like Ilya Sutskever. The panel had some zingers, like Stuart's reference to the Graduate Student Descent method. The panel more or less agreed on the strengths of deep learning -- the power of learning circuits as concepts, throwing a lot of computing power at problems, etc. -- and on its challenges -- thinking, reasoning. Judea kept pushing the panel to look at the limitations of deep learning: "if I added another layer, I still can not do.... what?"

Joan Feigenbaum gave a powerful introduction to the privacy-vs-security conundrum and had a superb panel that debated the issues in the nuanced way crypto and security people tend to do when confronted with the intersection of their work with social, policy and human issues. The panel was chock full of examples of "security holes" that are mainly human failings.

These meetings give me a chance to see people, of course, but also to see several lines of research, how far they have stretched, and how far they still need to stretch.

PS: It was great to have Moni and Amos get the Kanellakis Prize and the Diff Privacy folks get the Gödel Prize.

I'm in San Francisco for the ACM conference celebrating 50 years of the Turing Award. I'll post on STOC and the Turing award celebration next week. Today though we remember another member of Bletchley Park, Joan Clarke, born one hundred years ago today, five years and a day after Turing.

Clarke became one of the leading cryptanalysts at Bletchley Park during the Second World War. She mastered the technique of Banburismus developed by Alan Turing, the only woman to do so, to help break German codes. Bletchley Park promoted her to linguist, even though she didn't know any languages, to partially compensate for a lower pay scale for women at the time. Keira Knightley played Joan Clarke in The Imitation Game.

Joan Clarke had a close friendship with Turing and a brief engagement. In this video Joan Clarke talks about that time in her life.

I was the chair of the plenary session on Wednesday, so was too focused on keeping track of time and such to pay full attention to the talks. Having said that, all the speakers we've had so far have done a bang-up job of keeping within their time window without much prompting at all.

So I can only give my very brief thoughts on the talks. For more information, go here.

Atri Rudra was up first with a neat way to generalize joins, inference in probabilistic models and even matrix multiplication all within a generic semi-ring framework, which allowed the authors to provide faster algorithms for join estimation and inference. In fact, these are being used right now to get SOTA join implementations that beat what Oracle et al. have to offer. Neat!

Vasilis Syrgkakis asked a very natural question: when players are playing a game and learning, what happens if we treat *all* players as learning agents, rather than analyzing each player's behavior with respect to an adversary? It turns out that you can show better bounds on convergence to equilibrium as well as approximations to optimal welfare (i.e., the price of anarchy). There's more work to do here with more general learning frameworks (beyond bandits, for example).

Chris Umans talked about how the resolution of the cap set conjecture implies bad news for all current attempts to prove that $\omega = 2$ for matrix multiplication. He also provided the "book proof" for the cap set conjecture that came out of the recent flurry of work by Croot, Lev, Pach, Ellenberg, Gijswijt and Tao (and cited Tao's blog post as well as the papers, which I thought was neat).

I hope the slides will be up soon. If not for anything else, for Atri's explanation of graphical models in terms of "why is my child crying", which so many of us can relate to.

Minority Report (the movie) is 15 years old. Who knew!

Well I certainly didn't, till I was called by a reporter from CNN who wanted to talk about the legacy of the movie. Here's the link to the story.

It was a depressing conversation. We went over some of the main themes from the movie, and I realized to my horror how many of them are now part of our reality.

**Precogs are now PredPol**. Algorithms that claim to know where crime will happen. The companies building predictive policing software will often take umbrage at references to Minority Report because they say they're not targeting people. But all I say is "….yet".

**Predictions have errors**. The **very title** of the movie telegraphs the idea of errors in the prediction system. And much of the movie is about a coverup of such a 'minority report'. And yet today we treat our algorithms (precogs) as infallible, and their predictions as truth.

**VERY personalized advertising**. The main character is targeted by personalized advertising and a good section of the plot involves him trying to get a replacement eyeball so retina scans don't detect him. And then we have this.

**Feedback loops**. The debate between Agatha (the minority precog) and Anderton about free will leads him to a decision to change his future, which then changes the prediction system. In other words, feedback loops! But feedback loops work both ways. Firstly, predictions are not set in stone: they can be changed by our actions. Secondly, if we don't realize that predictions can be affected by feedback from earlier decisions, our decision-making apparatus can spiral out of control, **provably so** (consider this a teaser: I'll have more to say in a few days).

What's a little sad for me is that, because I wasn't sufficiently 'woke' when I first saw the movie, I thought that the coolest part of it was the ingenious visual interfaces on display. We're actually not too far from such systems with VR and AR. But that now seems like such a minor and insignificant part of the future the movie describes.

There's a weird phenomenon in the world of streaming norm estimation: for $\ell_0, \ell_1, \ell_2$ norm estimation, there are polylog (or less)-space streaming approximation algorithms. But once you get to $\ell_p, p > 2$, the required space suddenly jumps to polynomial in $n$. What's worse is that if you change norms, you need a new algorithm and have to prove all your results all over again.

This paper gives a universal algorithm for estimating a class of norms called "symmetric" (which basically means that the norm is invariant under coordinate permutations and sign flips - you can think of this as being invariant under a very special class of rotations and reflections if you like). This class includes the $\ell_p$ norms as a special case, so this result generalizes (up to polylog factors) all the prior work.

The result works in a very neat way. The key idea is to define a notion of concentration relative to a Euclidean ball. Specifically, fix the unit $\ell_2$ ball in $d$ dimensions, and look at the maximum value of your norm $\ell$ over this ball: call it $b_\ell$. Now look at the median value of your norm (with respect to a uniform distribution over the sphere): call it $m_\ell$. Define the *modulus of concentration* as

$$ mc(\ell) = \frac{b_\ell}{m_\ell} $$

Notice that this is 1 for $\ell_2$. For $\ell_1$, the maximum value is larger: it's $\sqrt{d}$. The median value, as it turns out, is also $\Theta(\sqrt{d})$, and so the modulus of concentration is constant. Interestingly, for $p > 2$, the modulus of concentration for $\ell_p$ is $d^{\frac{1}{2}(1 - 2/p)}$, which looks an awful lot like the bound of $d^{1-2/p}$ for sketching $\ell_p$.

As it turns out, this is precisely the point. The authors show that the streaming complexity of any norm $\ell$ can be expressed in terms of the square of $mc(\ell)$. There are some technical details - this is not exactly the theorem statement - but you can read the paper for more information.
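
To get a feel for $mc(\ell)$, here is a small Monte Carlo sketch (my own illustration, not from the paper). It uses the Hölder-type bound $b_{\ell_p} = d^{\max(1/p - 1/2,\,0)}$ for the maximum of the $\ell_p$ norm over the unit $\ell_2$ sphere, and estimates the median by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

def mc(p, samples=2000):
    """Monte Carlo estimate of the modulus of concentration of ell_p."""
    b = d ** max(1.0 / p - 0.5, 0.0)  # max of ||v||_p over the unit ell_2 sphere
    g = rng.standard_normal((samples, d))
    v = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform points on the sphere
    m = np.median(np.linalg.norm(v, ord=p, axis=1))   # median of ||v||_p
    return b / m

for p in [1, 2, 4, 8]:
    print(f"p = {p}:  mc ~ {mc(p):.2f}")  # roughly constant for p <= 2, grows for p > 2
```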

**Update:** Symmetric norms show up in a later paper as well. Andoni, Nikolov, Razenshteyn and Waingarten show how to do approximate near neighbors with $\log \log n$ approximation in spaces endowed with a symmetric norm. It doesn't appear that they use the same ideas from this paper though.

January 7-10, 2018, New Orleans. https://simplicityalgorithms.wixsite.com/sosa Submission deadline: August 24, 2017. The Symposium on Simplicity in Algorithms is a new conference in theoretical computer science dedicated to advancing simplicity and elegance in the design and analysis of algorithms. The 1st SOSA will be co-located with SODA 2018 in New Orleans. Ideal submissions will present simpler …

Postdoc in algorithms and complexity at Oxford with Leslie Ann Goldberg. Applicant will join the project: http://www.cs.ox.ac.uk/people/leslieann.goldberg/mcc.html Webpage: http://www.cs.ox.ac.uk/news/1329-full.html Email: leslie.goldberg@cs.ox.ac.uk
