Theory of Computing Blog Aggregator
Do TV shows overestimate how much a genius can help solve crimes or make really good crystal meth (which seems to be blue)? YES, see here.
Do TV shows get math wrong? YES, see here, and in about 90% of the episodes of Numb3rs.
Closer to home, do TV shows say stupid things about P vs NP? YES. Elementary (one of the two modern-day Sherlock Holmes shows) did; see here.
Did Kirk and Spock really defeat a computer by a trick that wouldn't work now? Yes, see Lance's post on this here.
Do TV shows use the word Quantum incorrectly? They do, but they are not alone in this; see here.
Do the people writing Futurama get their math right? Yes, see here.
Do the people writing 24 get their math wrong? Yes, see here.
Does The Big Bang Theory mostly get things right? Yes! See here.
There are more (Seinfeld thinks comedians should learn proofs! Really, see here) but I can make my point just with the ones above.
ALL of the TV shows except Star Trek are from after 2000 (or so). So, with the exception of science fiction, math references and science references in TV shows are relatively recent, or so I had thought.
Which is why I was surprised and delighted to see, in an episode of the old western (anti-western? satire of a western?) Maverick from 1958 (before I was born!), called Rope of Cards, a CORRECT and INTERESTING math reference. Maverick bets that a random 25 cards from a deck of cards can be arranged into five 5-card pat hands (I had to look that up: pat hands are hands where you don't want to discard any cards, so a flush, a straight, or a full house would qualify; 4 of a kind would be pat if there were no wild cards). The sucker takes the bet and loses. Maverick later says the odds are high, and calls the game Maverick Solitaire. And that is now the name of the puzzle; see here. The probability is around 0.98.
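For the curious, that 0.98 figure is easy to sanity-check by Monte Carlo. Here is a sketch in Python; the pat-hand rules and the backtracking search are my own reconstruction of the puzzle, not code from the show or the linked page:

```python
# Monte Carlo sanity check of Maverick Solitaire (hypothetical reconstruction).
# Pat hands: flush, straight, full house, four of a kind (no wild cards).
import random
from collections import Counter
from itertools import combinations

DECK = [(rank, suit) for rank in range(13) for suit in range(4)]  # rank 0 = ace

def is_pat(hand):
    ranks = [r for r, _ in hand]
    suits = {s for _, s in hand}
    counts = sorted(Counter(ranks).values())
    if len(suits) == 1:
        return True                      # flush (covers straight flushes too)
    if counts in ([2, 3], [1, 4]):
        return True                      # full house or four of a kind
    distinct = sorted(set(ranks))
    if len(distinct) == 5:               # straight, ace low or ace high
        return distinct[4] - distinct[0] == 4 or distinct == [0, 9, 10, 11, 12]
    return False

def splits_into_pat_hands(cards):
    """Backtracking search: can `cards` be partitioned into 5-card pat hands?"""
    if not cards:
        return True
    first, rest = cards[0], cards[1:]
    # the first card must belong to some pat hand; try all 4-card completions
    for completion in combinations(rest, 4):
        if is_pat((first,) + completion):
            remaining = tuple(c for c in rest if c not in completion)
            if splits_into_pat_hands(remaining):
                return True
    return False

trials = 200   # enough for two digits; the search may take a little while
hits = sum(splits_into_pat_hands(tuple(random.sample(DECK, 25)))
           for _ in range(trials))
print(f"estimated probability: {hits / trials:.2f}")   # comes out near 0.98
```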
I call this a mention of math since it has to do with probability, which may be a stretch. And I doubt the scene would encourage people to go into math. But it might encourage someone to learn probability, either to sucker others or to avoid being suckered.
So the question now is: are there other non-science-fiction references to math in older TV shows?
I suspect yes, similar to the one above, involving gambling and probability. What is the earliest mention of math on a TV show? The oldest that did not involve science fiction or gambling?
by GASARCH (noreply@blogger.com) at December 12, 2017 02:33 PM UTC
Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola
Lecture 18, Scribe: Giorgos Zirdelis
In this lecture we study lower bounds on data structures. First, we define the setting. We have $n$ bits of data, stored in $m$ bits of memory (the data structure), and we want to answer queries about the data. Each query is answered with $q$ probes. There are two types of probes:
 bit-probes, where each probe returns one bit from the memory, and
 cell-probes, where the memory is divided into cells of $w$ bits and each probe returns one cell.
The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answers to previous probes. For bit-probes this means that we answer a query with a depth-$q$ decision tree.
Finally, there are two types of data structure problems:
 The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
 The dynamic case, in which we have update queries that change the memory and also run in bounded time.
In this lecture we focus on the non-adaptive, bit-probe, static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:
 $m$ equal to the number of queries and $q = 1$, i.e., you write down all the answers, and
 $m = n$ and $q = n$, i.e., you can always answer a query about the data if you read the entire data.
Next, we review the best current lower bound, a bound proved in the '80s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of $k$-wise independence.
Problem 1. The data is a seed of length $s$ for a $k$-wise independent distribution over $\{0,1\}^n$. A query $i$ is defined to be the $i$-th bit of the sample.
The question is: if we allow a little more space than the seed length, can we compute such distributions fast?
Theorem 2. For the above problem with it holds that
It follows that if the memory $m$ is close to the seed length then the number of probes $q$ must be large. But if $m$ is much larger then nothing is known.
Proof. Let $p$ be a subsampling probability, to be fixed later. We have the memory of $m$ bits and we are going to subsample it. Specifically, we will select each bit of the memory with probability $p$, independently.
The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample a $k$-wise independent distribution.
For the “shrinking” part we have the following. We expect to keep $pm$ memory bits. By a Chernoff bound, it follows that we keep $O(pm)$ bits except with probability $2^{-\Omega(pm)}$.
For the “answer a lot of queries” part, recall that each query probes $q$ bits from the memory. We keep one of the queries if it so happens that we keep all the bits that it probed in the memory. For a fixed query, the probability that we keep all its probes is $p^q$.
We claim that with probability at least $p^q/2$, we keep $p^q n/2$ queries. This follows by Markov's inequality. We expect to not keep $(1-p^q)n$ queries on average. We now apply Markov's inequality to get that the probability that we don't keep at least $p^q n/2$ queries is at most $(1-p^q)/(1-p^q/2)$.
Thus, if $2^{-\Omega(pm)} < p^q/2$, then there exists a fixed choice of memory bits that we keep which achieves both the “shrinking” part and the “answer a lot of queries” part as above. This inequality holds for a suitable choice of $p$. But now we have $O(pm)$ bits of memory while still answering as many as $p^q n/2$ queries.
The minimum seed length required to answer that many queries while maintaining $k$-wise independence is $\Omega(k \log(p^q n))$. Therefore the memory has to be at least as big as the seed. This yields
$pm \geq \Omega(k \log(p^q n)),$
from which the result follows.
This lower bound holds even if the memory bits are filled arbitrarily (rather than having entropy at most the seed length). It can also be extended to adaptive cell probes.
We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with $m$ nodes on the left (the memory bits) and $n$ nodes on the right (the queries). Every node on the right side has degree $q$. We answer each query with an XOR of its neighboring memory bits. By the Vazirani XOR lemma, it suffices to show that the XOR of any subset $S$ of at most $k$ queries is unbiased. Hence it suffices that every subset $S$ with $|S| \leq k$ has a unique neighbor. For that, in turn, it suffices that $S$ has a neighborhood of size greater than $q|S|/2$ (because if every element in the neighborhood of $S$ had two neighbors in $S$ then $S$ would have a neighborhood of size at most $q|S|/2$). We pick the graph at random and show by standard calculations that it has this property with nonzero probability. (A toy simulation of this construction appears after the parameter settings below.)
It suffices to have , so that the probability is strictly less than 1, because . We can match the lower bound in two settings:
 if for some constant , then suffices,
 and suffices.
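Here is the toy simulation promised above; the parameter values are illustrative, not the ones from the lecture:

```python
# Toy version of the XOR-of-neighbors construction (illustrative parameters).
import random

n, m, q = 1024, 128, 6          # queries, memory bits, probes per query
rng = random.Random(0)
memory = [rng.randrange(2) for _ in range(m)]
# random bipartite graph: each query (right node) is wired to q memory bits
neighbors = [rng.sample(range(m), q) for _ in range(n)]

def answer(i):
    """Answer query i by XORing the q memory bits it is wired to."""
    bit = 0
    for j in neighbors[i]:
        bit ^= memory[j]
    return bit

# with uniformly random memory each answer is an XOR of q uniform bits;
# k-wise independence needs the unique-neighbor property discussed above
print(sum(answer(i) for i in range(n)) / n)   # close to 1/2
```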
Remark 3. It is enough if the memory is $k$-wise independent as opposed to completely uniform. An open question is whether you can improve the seed length to optimal.
As remarked earlier, the lower bound does not give anything when the memory $m$ is much larger than the seed length. In particular it is not clear if it rules out data structures with $q = 2$ probes. Next we show a lower bound which applies to this case.
Problem 4. Take the data to be a seed for a small-bias distribution over $\{0,1\}^n$. The queries, like before, are the bits of a sample from that distribution. Recall that such a seed can have length $O(\log(n/\epsilon))$ for bias $\epsilon$.
Theorem 5. You need .
Proof. Every query is answered by looking at $q = 2$ bits. But many queries are answered by the same 2-bit function $f$ of their probes (because there is only a constant number of functions on 2 bits). There are two cases for $f$:
 $f$ is linear (or affine). Suppose for the sake of contradiction that there are more such queries than the dimension of the space of linear functions on the $m$ memory bits. Then you have a linear dependence among the queries. This implies that if you XOR the corresponding bits of the distribution, you always get 0. This in turn contradicts the assumption that the distribution has small bias.
 $f$ is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as each new query probes at least one new memory bit. When we stop, we are left with a query both of whose probes land on bits that have already been probed. This means that there exist two queries $Q_1$ and $Q_2$ whose probes cover the probes of a third query $Q_3$. This in turn implies that the query answers are not close to uniform: there exist answers to $Q_1$ and $Q_2$ that fix the bits probed by them, and so also fix the bits probed by $Q_3$. But this contradicts the small bias of the distribution.
References
[Sie04] Alan Siegel. On universal classes of extremely random constant-time hash functions. SIAM Journal on Computing, 33(3):505–543, 2004.
by Emanuele at December 12, 2017 02:01 PM UTC
 One great tradition at the ITCS conference is the “graduating bits” session, where graduating PhD students and postdocs give brief overviews of their research in advance of going out on the job market. See here for instructions on how to apply (the deadline is January 8, 2018, and the session is four days later). Hopefully these talks will be recorded and archived as well.
 Avi Wigderson has a new book coming out, Mathematics and Computation, with a draft freely available on the Web. In addition to being a great overview of many of the “greatest hits” of theoretical computer science, the last chapter (Chapter 20) lays out how TCS has influenced and connected to the life sciences, the social sciences, and technology.
by timroughgarden at December 12, 2017 08:34 AM UTC
I have a very strict December 20 deadline (self-imposed; I missed the official one) for my ICM 2018 paper. I plan to talk about three puzzles on mathematics, computation, and games, and here is a draft of the first third. Corrections and comments are very welcome! This part is around the simplex algorithm for linear programming.
by Gil Kalai at December 11, 2017 05:28 PM UTC
I am excited to (temporarily) join the Windows on Theory family as a guest blogger!
This is the first post in a series which will appear on Windows on Theory in the coming weeks. The aim is to give a self-contained tutorial on using the Sum of Squares algorithm for unsupervised learning problems, and in particular for Gaussian mixture models. This will take several posts: let's get started.
In an unsupervised learning problem, the goal is generally to recover some parameters $\theta$ given samples $x_1,\ldots,x_n$ drawn from a probability distribution $p_\theta$ on, say, $\mathbb{R}^d$, which is parameterized by $\theta$. The goal is to estimate $\theta$ by some estimator $\hat{\theta} = \hat{\theta}(x_1,\ldots,x_n)$ which (a) does not require too many samples and (b) can be computed from those samples in polynomial time. This basic setup can be instantiated (sometimes with minor adaptations) to capture numerous important problems in statistics and machine learning: a few examples are
 component analysis and its many variants (PCA, ICA, Sparse PCA, etc.)
 Netflix problem / matrix completion / tensor completion
 dictionary learning / blind source separation
 community detection and recovery / stochastic block models
 many clustering problems
Excellent resources on any of these topics are just a Google search away, and our purpose here is not to survey the vast literature on unsupervised learning, or even on provable “TCS-style” algorithms for these problems. Instead, we will try to give a simple exposition of one technique which has now been applied successfully to many unsupervised learning problems: the Sum of Squares method for turning identifiability proofs into algorithms.
Identifiability is a concept from statistics. If one hopes for an algorithm which recovers parameters $\theta$, it must at least be true that the samples $x_1,\ldots,x_n$ uniquely (or almost uniquely) determine $\theta$, with high probability: when this occurs we say $\theta$ is identifiable from the samples.
Classically, identifiability is often proved by analysis of a (typically) inefficient brute-force algorithm. First, one invents some property $P$ of a candidate parameter $\theta'$ and the samples. For example, in a maximum-likelihood argument, one shows that $P(\theta', x_1,\ldots,x_n)$ holds when $\theta'$ makes the samples at least as likely as any other parameter does. Then, often via a concentration-of-measure argument, one shows that no $\theta'$ far from the true $\theta$ satisfies property $P$. In the maximum-likelihood example, this would entail showing that if $\theta'$ is far from $\theta$ then the samples are much less likely under $\theta'$ than under $\theta$.
An identifiability argument like this typically has no implications for computationally-efficient algorithms, since finding a $\theta'$ which satisfies $P$ often appears to require brute-force search. Thus, often when the investigation turns to efficient algorithms, the identifiability argument is abandoned and more explicitly algorithmic techniques are brought to bear: convex relaxations, spectral methods, and even heuristic methods.
The method we present here for designing computationally-efficient algorithms begins with a return to identifiability proofs. The main insight is that if both
 the property $P$ and, more importantly,
 the proof that any $\theta'$ satisfying $P$ must be close to $\theta$
are sufficiently simple, then identifiability of $\theta$ from the samples does imply the existence of an efficient algorithm to (approximately) recover $\theta$ from $x_1,\ldots,x_n$!
For us, simple has a formal meaning: the proof should be captured in the low-degree Sum of Squares proof system.
The algorithms which result in the end follow a familiar recipe: solve some convex relaxation whose constraints depend on the samples $x_1,\ldots,x_n$ and round it to obtain $\hat{\theta}$. But the design and analysis of this relaxation are heavily informed by the simple identifiability proof described above. In particular, the convex programs which result will not be familiar relaxations of maximum likelihood problems!
In this series of blog posts, we are going to carry out this strategy from start to finish for a classic unsupervised learning problem: clustering mixtures of Gaussians. So that we can get down to business as quickly as possible, we defer a short survey of the literature on this “proofs-to-algorithms” method to a later post.
Mixtures of Gaussians
Gaussian mixtures are classic objects in statistics, dating at least to Pearson in 1894. The basic idea is: suppose you have a data set which was generated in a heterogeneous fashion: some points were sampled from probability distribution $D_1$, some from $D_2$, and so on up to $D_k$, but you do not know which points came from which distributions. Under what settings can you cluster the points by which distribution they came from, and perhaps also recover some properties of those distributions, such as their means $\mu_1,\ldots,\mu_k$?
Pearson, in 1894, was faced with a collection of body measurements of crabs. The distribution of one such attribute in the data — the ratio of forehead length to body width — curiously deviated from a Gaussian distribution. Pearson concluded that in fact two distinct species of crabs were present in his data set, and that the data should therefore be modeled as a mixture of two Gaussians. He invented the method of moments to discover the means of both the Gaussians in question.^{1} In the years since, Gaussian mixtures have become a fundamental statistical modeling tool: algorithms to fit Gaussian mixtures to data sets are included in basic machine learning packages like sklearn.
Here is our mixture of Gaussians model, formally.
Mixtures of separated spherical Gaussians:
Let
 $\Delta > 0$ be a separation parameter.
 $\mu_1,\ldots,\mu_k \in \mathbb{R}^d$ be such that $\|\mu_i - \mu_j\| \geq \Delta$ for every $i \neq j$.
 $D_1,\ldots,D_k$ be $d$-dimensional spherical Gaussians, centered at the means $\mu_1,\ldots,\mu_k$.
 $X_1,\ldots,X_n$ be iid samples, each drawn by selecting $i \in [k]$ uniformly, then drawing a sample from $D_i$.
 $S_1,\ldots,S_k$ be the induced partition of $[n]$ into $k$ parts, where $j \in S_i$ if sample $X_j$ was drawn from $D_i$.
The problem is: given the samples $X_1,\ldots,X_n$, output a partition of $[n]$ into $k$ parts, each of size $n/k$, such that (up to renaming the clusters) each part differs from the corresponding $S_i$ in at most an $\epsilon$-fraction of its elements,
for each $i$ and some small number $\epsilon$.
To avoid some minor but notationally annoying matters, we are going to work with a small variant of the model, where we assume that exactly $n/k$ samples came from each Gaussian $D_i$. We call a mixture like this “equidistributed”.
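For concreteness, here is a small sampler for the equidistributed model; the parameter values and variable names below are illustrative only:

```python
# Sampler for the equidistributed mixture model (names/parameters are my own).
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 4, 10, 4000      # clusters, dimension, total number of samples
Delta = 8.0                # required minimum separation between means

# draw candidate means until they are Delta-separated (fine for a toy demo)
while True:
    mu = rng.normal(scale=Delta, size=(k, d))
    dists = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)
    if (dists + 2 * Delta * np.eye(k) >= Delta).all():
        break

# exactly n/k samples from each spherical Gaussian N(mu_i, I)
X = np.concatenate([mu[i] + rng.normal(size=(n // k, d)) for i in range(k)])
ground_truth = np.repeat(np.arange(k), n // k)   # the hidden partition S_1..S_k
```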
We will work up to a proof of this theorem, which was proved recently (as of Fall 2017) in 3 independent works.
Main Theorem (Hopkins-Li, Kothari-Steinhardt, Diakonikolas-Kane-Stewart):
For arbitrarily large $t$ there is an algorithm requiring $n = (dk)^{O(t)}$ samples from the equidistributed mixtures of Gaussians model and running in time $n^{O(t)}$ which outputs a partition of $[n]$ into $k$ parts of size $n/k$ such that, with high probability, the partition agrees with the true clustering up to error $k \cdot (C\sqrt{t}/\Delta)^t$, for some universal constant $C$.^{3}
In particular:
 If $\Delta \geq k^{\epsilon}$ for some $\epsilon > 0$, then by choosing $t = O(1/\epsilon)$ the algorithm recovers the correct clustering up to $1/\mathrm{poly}(k)$ errors in polynomial time with polynomially many samples.
 If $\Delta \geq C'\sqrt{\log k}$ for a large-enough universal constant $C'$, then choosing $t = O(\log k)$ gives a quasipolynomial-time algorithm (using quasipolynomially many samples) to recover the clusters up to $1/\mathrm{poly}(k)$ error.^{4}
One (rather weak) consequence of the main theorem is that, given enough samples, there is enough information in the samples to determine the underlying clustering, up to small error. Our strategy to prove the main theorem will start with proving the latter statement independently — as we have discussed, such an argument is often called a proof of identifiability because we say that the clusters are identifiable from the samples (up to errors).
While identifiability itself does not carry immediate algorithmic consequences, our proof of identifiability will be somewhat special: it will be simple in a formal sense, namely, it will be captured by a formal proof system of restricted power. This simplicity of the proof of identifiability will almost immediately imply the main theorem: the construction of a computationally-efficient algorithm from a simple proof of identifiability is the heart of the proofs-to-algorithms method.
Identifiability proof: 1 dimension
Our eventual goal is to work up to a proof in the low-degree Sum of Squares proof system that clusters are identifiable from samples from a mixture of Gaussians. Since we have not yet defined low-degree Sum of Squares proofs, for now we will focus on constructing an identifiability proof which avoids mathematical facts which we deem “too complicated”. In particular, we will avoid any Chernoff/union-bound style arguments.
To get to the heart of the matter it helps to simplify the setting. Our first simplification is to restrict attention to the case $d = 1$, so that the distributions are univariate Gaussians with unit variance.
Before stating our first lemma, let's discuss the key property of a collection of samples from a Gaussian which we will use. Recall that if $X$ is a standard Gaussian, then for every $t$, $\mathbb{E}\,|X - \mathbb{E}X|^t \leq t^{t/2}$.
If $x_1,\ldots,x_n$ are samples from $X$, then for $n$ large enough, the empirical distribution of the samples inherits this property, up to some small fluctuations. Namely, with high probability we would have $\mathbb{E}_{i \sim [n]}\,|x_i - \mathbb{E}X|^t \leq 2\,t^{t/2}$.
(We could have replaced the $2$ by any small constant greater than $1$.) Here, the notation $i \sim [n]$ means that an index $i$ is chosen uniformly among $\{1,\ldots,n\}$.
If $X \sim N(\mu,1)$ for some $\mu$, then the same discussion applies immediately to $X$ and $x_1,\ldots,x_n$. But even more is true: if $\bar{\mu}$ is the empirical mean of $x_1,\ldots,x_n$ (that is, $\bar{\mu} = \mathbb{E}_{i \sim [n]}\,x_i$), then with high probability the $t$-th centered empirical moment also inherits the same bound: $\mathbb{E}_{i \sim [n]}\,|x_i - \bar{\mu}|^t \leq 2\,t^{t/2}$.
In the Gaussian mixture setting, so long as enough samples are drawn from each Gaussian $D_i$, each cluster will have $t$-th empirical moments satisfying the above bound (with high probability).
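This bound is easy to check empirically. The following snippet uses the moment bound in the form reconstructed above ($\mathbb{E}|x_i - \bar{\mu}|^t \leq 2\,t^{t/2}$), with illustrative parameters:

```python
# Empirical check of the centered t-th moment bound E|x - mu|^t <= 2 t^(t/2)
# for samples from a unit-variance Gaussian (parameters are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n, t = 100_000, 8
x = rng.normal(loc=3.0, scale=1.0, size=n)     # samples from N(3, 1)
mu_hat = x.mean()                               # empirical mean
emp_moment = np.mean(np.abs(x - mu_hat) ** t)   # centered empirical t-th moment
print(emp_moment, 2 * t ** (t / 2))             # roughly 105 versus 8192
```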
In our identifiability proof, we choose to forget that the samples were drawn from Gaussians, and we remember only that they break up into $k$ underlying clusters, each of which satisfies the empirical moment bound. We do not even remember the “true” means of the Gaussians; instead we talk only about the empirical means. We will show that no clustering far from the underlying ground-truth clustering results in clusters which satisfy the empirical moment bound.
Lemma 1. Let $x_1,\ldots,x_n \in \mathbb{R}$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $n/k$ such that for each $i$, the collection of numbers $\{x_j\}_{j \in S_i}$ obeys the following moment bound:
$\mathbb{E}_{j \sim S_i}\,|x_j - \mu_i|^t \leq 2\,t^{t/2},$
where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i}\,x_j$ and $t$ is some even number. Let $\Delta > 0$ be such that $|\mu_i - \mu_j| \geq \Delta$ for every $i \neq j$. Suppose $\Delta$ is large enough that $\Delta \geq C\sqrt{t}$ for a universal constant $C$.
Let $S \subseteq [n]$ have size $n/k$ and be such that $\{x_j\}_{j \in S}$ obey the same moment-boundedness property:
$\mathbb{E}_{j \sim S}\,|x_j - \mu_S|^t \leq 2\,t^{t/2}$
for the same $t$, where $\mu_S$ is the mean $\mathbb{E}_{j \sim S}\,x_j$. Then there exists an $i$ such that
$|S \cap S_i| \geq \left(1 - (C\sqrt{t}/\Delta)^{t/(t-1)}\right)|S|$
for some universal constant $C$.
How do we interpret Lemma 1 as a statement of cluster identifiability? The lemma implies that the clusters $S_1,\ldots,S_k$ are, up to small errors, the only subsets of $[n]$ of size $n/k$ which satisfy the $t$-th moment bound. This is our property $P$, like we discussed earlier in this post. The true clustering satisfies $P$ (i.e., if you group the samples by this ground-truth clustering, the resulting clusters will have bounded empirical $t$-th moments), and every clustering which satisfies this bounded $t$-th moment property must be close to the true clustering. Thus, the correct clusters could be identified by searching for subsets of $[n]$ which satisfy the $t$-th moment bound (never mind that such a search would naively require exponential time).
We said that to use the Sum of Squares method to turn our identifiability proof into an algorithm, both the property $P$ and the proof of identifiability need to be simple.
This $t$-th moment boundedness property will turn out to be simple enough. What about the proof of Lemma 1? By the end of this post we will prove Lemma 1 in a way which uses only Hölder's inequality for the $t$ vs. $t/(t-1)$ norms and the triangle inequality for the $t$-norm. Later, we will show that these inequalities are simple in the correct formal sense: they are captured by a proof system of restricted power.
Our proof of Lemma 1 relies on the following key fact.
Fact 1. Let $S, S' \subseteq \mathbb{R}$ be sets of equal size with $|S \cap S'| \geq (1-\delta)|S|$. Let $X$ denote a uniform sample from $S$ and similarly $X'$ for $S'$. Let $\mu = \mathbb{E}X$ and $\mu' = \mathbb{E}X'$. Suppose $X$ and $X'$ satisfy the $t$-th moment bound
$\mathbb{E}\,|X - \mu|^t \leq 2\,t^{t/2} \quad \text{and} \quad \mathbb{E}\,|X' - \mu'|^t \leq 2\,t^{t/2}.$
Then $|\mu - \mu'| \leq O(\sqrt{t})\,\delta^{1-1/t}$.
A slightly more general interpretation of this fact is that a pair of random variables on $\mathbb{R}$ which have bounded $t$-th moments and whose total variation distance is at most $\delta$ cannot have means which are too far apart: $|\mu - \mu'| \leq O(\sqrt{t})\,\delta^{1-1/t}$.
Proof of Fact 1.
Observe that there is a coupling of the random variables $X$ and $X'$ so that $\Pr[X \neq X'] \leq \delta$. The coupling chooses with probability $1-\delta$ to select a uniform sample $Y$ from $S \cap S'$, then lets $X = X' = Y$. With probability $\delta$, the random variable $X$ is a uniform sample from $S \setminus S'$ and similarly $X'$ from $S' \setminus S$.
Let $(X, X')$ be a coupled pair of samples. We expand a quantity related to the one we want to bound, and then apply Hölder's inequality with the $t$ and $t/(t-1)$ norms. (The underlying inner product space assigns functions $f$ and $g$ the inner product $\mathbb{E}\,f(X,X')\,g(X,X')$.)
Now we can apply the triangle inequality for the $t$-norm to the last term, followed by our $t$-th moment assumptions.
Putting everything together, we get $|\mu - \mu'| \leq O(\sqrt{t})\,\delta^{1-1/t}$. QED.
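In symbols, the two steps can be carried out as follows; this is a sketch of the computation as I reconstruct it, using $\Pr[X \neq X'] \leq \delta$ from the coupling. By Hölder's inequality with exponents $t$ and $t/(t-1)$,

\[ |\mu - \mu'| = \left|\mathbb{E}[X - X']\right| = \left|\mathbb{E}\left[(X - X')\,\mathbf{1}_{X \neq X'}\right]\right| \leq \left(\mathbb{E}\,|X - X'|^t\right)^{1/t} \Pr[X \neq X']^{1 - 1/t}. \]

Writing $\|Z\|_t = (\mathbb{E}|Z|^t)^{1/t}$, the triangle inequality and the $t$-th moment assumptions give

\[ \|X - X'\|_t \leq \|X - \mu\|_t + \|X' - \mu'\|_t + |\mu - \mu'| \leq O(\sqrt{t}) + |\mu - \mu'|, \]

so $|\mu - \mu'| \leq \delta^{1-1/t}\left(O(\sqrt{t}) + |\mu - \mu'|\right)$, and rearranging gives $|\mu - \mu'| \leq O(\sqrt{t})\,\delta^{1-1/t}$ whenever $\delta$ is bounded away from $1$.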
Keeping in mind our eventual goal of constructing a low-degree Sum of Squares proof, we record the observation that the only inequalities we required to prove Fact 1 were Hölder's inequality for the $t$ vs. $t/(t-1)$ norms and the triangle inequality for the $t$-norm.
Armed with Fact 1, we can prove Lemma 1. The main idea in the proof is that if $S_1$ and $S_2$ are the two clusters with greatest intersection with $S$, then $S$ can only be close to one of them.
Proof of Lemma 1.
Let $S \subseteq [n]$ have size $n/k$, with mean $\mu_S$ and the $t$-th moment bound from the lemma. Without loss of generality, order the clusters so that $S_1$ has largest intersection with $S$, then $S_2$, and so on: that is, $|S \cap S_1| \geq |S \cap S_2| \geq \cdots \geq |S \cap S_k|$. If $|S \cap S_1| = (1-\delta)|S|$, then $|S \cap S_2| \leq \delta|S|$, just by counting.
Since $\Delta \leq |\mu_1 - \mu_2| \leq |\mu_S - \mu_1| + |\mu_S - \mu_2|$, either $|\mu_S - \mu_1| \geq \Delta/2$ or $|\mu_S - \mu_2| \geq \Delta/2$. We claim it must be the second. Indeed, using Fact 1,
$|\mu_S - \mu_1| \leq O(\sqrt{t})\,\delta^{1-1/t} < \Delta/2,$
since certainly $\delta \leq 1$, and we assumed $\Delta$ is large compared to $\sqrt{t}$. We conclude that $|\mu_S - \mu_2| \geq \Delta/2$.
We have obtained and . Putting these together with Fact 1, we find
Rearranging, this reads as the bound claimed in the lemma. QED.
Looking ahead
This concludes our one-dimensional identifiability proof, which will form the core of our proof of the main theorem. In our next post we will lift this proof to a Sum of Squares proof (for which we will need to define Sum of Squares proofs). After that, with a Sum of Squares proof in hand, we will finish designing our mixture of Gaussians algorithm for the one-dimensional case. Then we will show that the same ideas, nearly unchanged, imply that the algorithm works in higher dimensions.
Thanks
Many thanks to Boaz Barak, David Steurer, and especially Daniel Freund for their helpful remarks on early drafts of this post.
Footnotes
1. Moritz Hardt on Gaussian mixtures and Pearson's approach.
2. To see how to apply the ideas in this tutorial to a much broader class of clustering problems, see my joint paper with Jerry Li and the recent paper of Pravesh Kothari and Jacob Steinhardt.
3. Before these recent works, the best polynomial-time algorithms for clustering mixtures of Gaussians could not tolerate any separation much smaller than $k^{1/4}$ (when the separation is much larger than that, a simple greedy algorithm can be shown to solve the clustering problem to high accuracy). On the other hand, known lower bounds show that when the separation is $o(\sqrt{\log k})$, clustering is impossible (even using exponential time) with polynomially many samples, so one cannot hope to improve the guarantees of this theorem too much further [Regev-Vijayaraghavan]. (That said, reducing the sample complexity and running time to polynomial in this regime is a fascinating open problem.)
4. Variants of this theorem, which may be found in all three of the sources listed, offer algorithms which additionally output estimates of the means $\mu_1,\ldots,\mu_k$, work for many distributions besides Gaussians (without even knowing the underlying distribution!), and tolerate some amount of adversarial corruption among the samples. We note also that the theorems in those works handle the usual mixtures of Gaussians problem, rather than the equidistributed version, and can tolerate non-uniform mixtures, i.e., those where some clusters are much smaller than others.
by samhop at December 11, 2017 03:58 AM UTC
Authors: Thijs Laarhoven
Download: PDF
Abstract: We take a first step towards a rigorous asymptotic analysis of graph-based approaches for finding (approximate) nearest neighbors in high-dimensional spaces, by analyzing the complexity of (randomized) greedy walks on the approximate near neighbor graph. For random data sets of size $n = 2^{o(d)}$ on the $d$-dimensional Euclidean unit sphere, using near neighbor graphs we can provably solve the approximate nearest neighbor problem with approximation factor $c > 1$ in query time $n^{\rho_q + o(1)}$ and space $n^{1 + \rho_s + o(1)}$, for arbitrary $\rho_q, \rho_s \geq 0$ satisfying \begin{align} (2c^2 - 1) \rho_q + 2 c^2 (c^2 - 1) \sqrt{\rho_s (1 - \rho_s)} \geq c^4. \end{align} Graph-based near neighbor searching is especially competitive with hash-based methods for small $c$ and near-linear memory, and in this regime the asymptotic scaling of a greedy graph-based search matches the recent optimal hash-based trade-offs of Andoni-Laarhoven-Razenshteyn-Waingarten [SODA'17]. We further study how the trade-offs scale when the data set is of size $n = 2^{\Theta(d)}$, and analyze asymptotic complexities when applying these results to lattice sieving.
at December 11, 2017 12:00 AM UTC
Authors: Toru Niina
Download: PDF
Abstract: Searching spatial data is an important operation for scientific simulations, which are performed mostly with periodic boundary conditions. An R-Tree is a well-known tree data structure used to contain spatial objects, and it is capable of answering spatial search queries in an efficient way. In this paper, a novel method to construct an R-Tree considering periodic boundary conditions is presented. Unlike existing methods, the proposed method works without any kind of extra objects or queries. Moreover, because this method reduces the volume of the bounding box for each node under the periodic boundary conditions, it is expected to increase the overall efficiency. While the extension of an R-Tree is presented in this work, the method is not only applicable to an R-Tree but also to other data structures that use axis-aligned bounding boxes with periodic boundary conditions. The implementation is available on GitHub.
at December 11, 2017 01:42 AM UTC
Authors: Janosch Döcker, Leo van Iersel, Steven Kelk, Simone Linz
Download: PDF
Abstract: Here we show that deciding whether two rooted binary phylogenetic trees on the same set of taxa permit a cherry-picking sequence, a special type of elimination order on the taxa, is NP-complete. This improves on an earlier result which proved hardness for eight or more trees. Via a known equivalence between cherry-picking sequences and temporal phylogenetic networks, our result proves that it is NP-complete to determine the existence of a temporal phylogenetic network that contains topological embeddings of both trees. The hardness result also greatly strengthens previous inapproximability results for the minimum temporal-hybridization number problem. This is the optimization version of the problem where we wish to construct a temporal phylogenetic network that topologically embeds two given rooted binary phylogenetic trees and that has a minimum number of indegree-2 nodes, which represent events such as hybridization and horizontal gene transfer. We end on a positive note, pointing out that fixed-parameter tractability results in this area are likely to ensure the continued relevance of the temporal phylogenetic network model.
at December 11, 2017 01:41 AM UTC
Authors: Sariel HarPeled, Mitchell Jones
Download: PDF
Abstract: $\renewcommand{\Re}{\mathbb{R}} \newcommand{\eps}{\varepsilon}
\newcommand{\Net}{S}
\newcommand{\tldO}{{\widetilde{O}}}
\newcommand{\body}{C}
$ We revisit the problem of building weak $\eps$-nets for convex ranges over a point set in $\Re^d$. Unfortunately, the known constructions of weak $\eps$-nets yield sets that are of size $\Omega(\eps^{-d} e^{c d^2})$, where $c$ is some constant. We offer two alternative schemes that yield significantly smaller sets, and two related results, as follows:
(A) Let $\Net$ be a sample of size $\tldO(d^2 \eps^{-1})$ points from the original point set (which is no longer needed), where $\tldO$ hides polylogarithmic terms. Given a convex body $\body$, via a separation oracle, the algorithm performs a small sequence of (oracle) stabbing queries (computed from $\Net$); if none of the query points hits $\body$, then $\body$ contains less than an $\eps$-fraction of the input points. The number of stabbing queries performed is $O(d^2 \log \eps^{-1})$, and the time to compute them is $\tldO(d^9 \eps^{-1})$. To the best of our knowledge, this is the first weak $\eps$-net related construction where all constants/bounds are polynomial in the dimension.
(B) If one is allowed to expand the convex range before checking if it intersects the sample, then a sample of size $\tldO(\eps^{-(d+1)/2})$, from the original point set, is sufficient to form a net.
(C) We show a construction of weak $\eps$-nets which have the following additional property: For a heavy body, there is a net point that stabs the body, and it is also a good centerpoint for the points contained inside the body.
(D) We present a variant of a known algorithm for approximating a centerpoint, improving the running time from $\tldO(d^9)$ to $\tldO(d^7)$. Our analysis of this algorithm is arguably cleaner than the previous version.
at December 11, 2017 12:00 AM UTC
Authors: David Yang Gao
Download: PDF
Abstract: The general problem in topology optimization is correctly formulated as a
double-$\min$ mixed integer nonlinear programming (MINLP) problem based on the
minimum total potential energy principle. It is proved that for linear elastic
structures, the alternative iteration leads to a Knapsack problem, which is
considered to be NP-hard in computer science. However, by using canonical
duality theory (CDT) developed recently by the author, this challenging 0-1
integer programming problem can be solved analytically to obtain global optimal
solution at each design iteration. The novel CDT method for general topology
optimization is refined and tested mainly by both 2D benchmark problems in
topology optimization. Numerical results show that without using filter, the
CDT method can obtain exactly the 0-1 optimal density distribution with almost no
checkerboard pattern. Its performance and novelty are compared with the popular
SIMP and BESO approaches. A brief review on the canonical duality theory for
solving a unified problem in multiscale systems is given in Appendix.
at December 11, 2017 12:00 AM UTC
Authors: Esther Ezra, Sariel HarPeled, Haim Kaplan, Micha Sharir
Download: PDF
Abstract: $\renewcommand{\Re}{\mathbb{R}}$ We re-examine parameters for the two main space decomposition techniques, bottom-vertex triangulation and vertical decomposition, including their explicit dependence on the dimension $d$, and discover several unexpected phenomena, which show that, in both techniques, there are large gaps between the VC-dimension (and primal shatter dimension) and the combinatorial dimension.
For vertical decomposition, the combinatorial dimension is only $2d$, the primal shatter dimension is at most $d(d+1)$, and the VC-dimension is at least $1 + d(d+1)/2$ and at most $O(d^3)$. For bottom-vertex triangulation, both the primal shatter dimension and the combinatorial dimension are $\Theta(d^2)$, but there seems to be a significant gap between them, as the combinatorial dimension is $\frac12d(d+3)$, whereas the primal shatter dimension is at most $d(d+1)$, and the VC-dimension is between $d(d+1)$ and $5d^2 \log{d}$ (for $d\ge 9$).
Our main application is to point location in an arrangement of $n$ hyperplanes in $\Re^d$, in which we show that the query cost in Meiser's algorithm can be improved if one uses vertical decomposition instead of bottom-vertex triangulation, at the cost of some increase in the preprocessing cost and storage. The best query time that we can obtain is $O(d^3\log n)$, instead of $O(d^4\log d\log n)$ in Meiser's algorithm. For these bounds to hold, the preprocessing and storage are rather large (super-exponential in $d$). We discuss the trade-off between query cost and storage (in both approaches, the one using bottom-vertex triangulation and the one using vertical decomposition).
at December 11, 2017 02:05 AM UTC
Authors: Sounaka Mishra, Shijin Rajakrishnan
Download: PDF
Abstract: Given a graph $G=(V, E)$ and a positive integer $k$, in \textsc{Maximum
$k$Order Bounded Component Set} problem (\maxkcomp{k}), it is required to find
a vertex set $S \subseteq V$ of maximum size such that each component in the
induced graph $G[S]$ has at most $k$ vertices. We prove that for constant $k$,
\maxkcomp{k} is hard to approximate within a factor of $n^{1-\epsilon}$, for
any $\epsilon > 0$, unless $\mathsf{P} = \mathsf{NP}$. We provide lower bounds
on the approximability when $k$ is not a constant as well. \maxkcomp{k} can be
seen as a generalization of \textsc{Maximum Independent Set} problem
(\textsc{MaxIS}). We generalize Tur\'{a}n's greedy algorithm for
\textsc{MaxIS} and prove that it approximates \maxkcomp{k} within a factor of
$(2k - 1)\overline{d} + k$, where $\overline{d}$ is the average degree of the
input graph $G$. This approximation factor is a generalization of Tur\'{a}n's
approximation factor for \textsc{MaxIS}.
at December 11, 2017 01:42 AM UTC
Authors: Panagiotis Strouthopoulos, Apostolos Papadopoulos
Download: PDF
Abstract: Massive network exploration is an important research direction with many
applications. In such a setting, the network is, usually, modeled as a graph
$G$, whereas any structural information of interest is extracted by inspecting
the way nodes are connected together. In the case where the adjacency matrix or
the adjacency list of $G$ is available, one can directly apply graph mining
algorithms to extract useful knowledge. However, there are cases where this is
not possible because the graph is \textit{hidden} or \textit{implicit}, meaning
that the edges are not recorded explicitly in the form of an adjacency
representation. In such a case, the only alternative is to pose a sequence of
\textit{edge probing queries} asking for the existence or not of a particular
graph edge. However, checking all possible node pairs is costly (quadratic on
the number of nodes). Thus, our objective is to pose as few edge probing
queries as possible, since each such query is expected to be costly. In this
work, we center our focus on the \textit{core decomposition} of a hidden graph.
In particular, we provide an efficient algorithm to detect the maximal subgraph
$S_k$ of $G$ where the induced degree of every node $u \in S_k$ is at least
$k$. Performance evaluation results demonstrate that significant performance
improvements are achieved in comparison to baseline approaches.
at December 11, 2017 01:40 AM UTC
What Makes a Programming Language?
There is an alphabet, words, grammar, statements, semantics, and various ways to organize the previous in order to create a computer program in a programming language. Flex helps developers create a tool called a lexical analyzer which identifies the words of a program during the compilation/interpretation process.
A compiler takes a text file and parses it character by character, trying to match patterns at each of the aforementioned levels. The initial pass is the lexical analysis; this pass ensures that there are no lexical errors, which are characters that don't contribute to meaningful words.
Meaningful words for a programming language are described by a regular language. Regular languages are in turn described by regular expressions, and a flex lexical analyzer is an implementation for parsing text while looking for matches to regular expressions.
Essentially, programming a flex lexer means defining various regular expressions which detail all of the possible words and characters that are meaningful in a correct program for your language.
BooleanLogicLanguage Example Language
To illustrate flex, BooleanLogicLanguage is an example language which a flex-based lexer will lexically analyze. For no reason in particular, the purpose of this language is to evaluate Boolean logic expressions.
This diagram is an example of what a correct program would look like in this BooleanLogicLanguage. The lexical analyzer should be able to parse this without errors. Note that in this language, 'INTEGER(' and ')INTEGER' are the 'words' used to separate pieces of code — as opposed to '(', ')', '{', or '}'. When creating your own language, you are free to do whatever you want, but following convention, to some degree, just ensures that the tool you are creating is easy to use.
NOTE: A programming language is a tool for using a computer. A GUI is a tool for using a computer. Siri or Cortana are tools for using a computer, etc.
This diagram is a sketch of the regular expressions which will be used in the flex program in order to describe meaningful words. The phrases on the left hand side are token names. Token names are not necessarily important during lexical analysis, but they are of the utmost importance when performing the syntactic analysis (which comes after a lexical analysis during compilation/interpretation — not covered in this post).
The most difficult part of this process is defining tokens and figuring out what sort of regular expression should describe them. For this example, I decided to make a language that would evaluate Boolean logic. Then I started writing potential programs in this language, and once I wrote enough that looked interesting and useful, I defined tokens and regular expressions to ensure those particular programs would be correct. I made code up that I liked and then fit tokens and regex to make them work.
A Quick Note on Regular Expressions (Regex)
[…] denotes a single thing, whose identity is dictated by what’s inside the brackets.
A single character is a single thing.
* denotes zero or more of the single thing before it.
? denotes one or zero of the single thing before it.
(…) is just a grouping, usually used with * or ?.
The difference between […] and (…) is that the square brackets represents one of what is inside of it and the parentheses are a grouping. For example [abc] and (abc): ‘a’, ‘b’, ‘c’ all match the former, and ‘abc’ matches the latter.
- denotes a range and has specific applications that are very useful: A-Z, a-z, 0-9.
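These rules are easy to experiment with. For such simple patterns, Python's re module behaves the same way flex does, so a quick check looks like this (a small illustrative demo, not part of the lexer):

```python
# The [...] vs (...) distinction, checked with Python's re module
import re

print(bool(re.fullmatch(r"[abc]", "a")))     # True: the class matches one char
print(bool(re.fullmatch(r"[abc]", "abc")))   # False: still only one char
print(bool(re.fullmatch(r"(abc)", "abc")))   # True: the group matches "abc"
print(bool(re.fullmatch(r"-?[0-9]+\.[0-9]+", "-3.14")))  # a `real`-style pattern
```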
Flex
A flex file is like a C program, except it has a more strictly defined layout.
%{
C code declarations
%}
definitions of regular expressions
%%
regular_expression_1   code to execute on match (such as a macro)
regular_expression_2   other code to execute on match (such as a function)
%%
C code functions
main function
The power of Flex comes from being able to define regular expressions (between ‘%}’ and ‘%%’) and then attach them to code (between ‘%%’ and ‘%%’). The additional areas for C code are both handy and what gives the lexer its true functionality (doing something when a regex is matched).
NOTE: flex essentially turns this flex file (extension '.l') into a C program which is then compiled like you would compile any C program. The result is an object file/program which you execute on a text file containing programming code. And in this case, the output of the program goes to a text file (Tokens.txt) and also to stdout (the terminal).
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int numErrors = 0;
char* arg;

typedef struct node {
    char* lexeme;
    char* token;
    struct node* next;
} node_t;

node_t head;
node_t* current = &head;

int yywrap(void);
void store(char* lexeme);
void error(void);
void printStored(void);
%}

whitespace          (\t|" "|\r)
newline             \n
real                -?[0-9]+\.[0-9]+
integer             -?[0-9]+
string              \".*\"
boolean_op_binary   (" AND "|" OR "|" XOR ")
boolean_op_unary    "NOT "
boolean_literal     ("True"|"False")
identifier          [A-Z]+
separator_open      ({real}|{integer})\(
separator_close     \)({real}|{integer})
assignment          =
IO                  "print "
terminal            ;

%%
{whitespace}          {ECHO;}
{newline}             {ECHO;}
{real}                {ECHO;}
{integer}             {ECHO;}
{string}              {ECHO;}
{boolean_op_binary}   {ECHO; arg = "BOOL_OP_BINARY"; store(yytext);}
{boolean_op_unary}    {ECHO; arg = "BOOL_OP_UNARY"; store(yytext);}
{boolean_literal}     {ECHO; arg = "BOOL_LITERAL"; store(yytext);}
{identifier}          {ECHO; arg = "IDENTIFIER"; store(yytext);}
{separator_open}      {ECHO; arg = "OPEN"; store(yytext);}
{separator_close}     {ECHO; arg = "CLOSE"; store(yytext);}
{assignment}          {ECHO; arg = "ASSIGN"; store(yytext);}
{IO}                  {ECHO; arg = "IO"; store(yytext);}
{terminal}            {ECHO; arg = "TERMINAL"; store(yytext);}
.                     {ECHO; numErrors++; error();}
%%

int yywrap(void) { return 1; }

/* append the (lexeme, token) pair to the linked list */
void store(char* lexeme) {
    current->lexeme = malloc(strlen(lexeme) + 1);
    strcpy(current->lexeme, lexeme);
    current->token = malloc(strlen(arg) + 1);
    strcpy(current->token, arg);
    node_t* temp = calloc(1, sizeof(node_t));   /* zeroed so temp->next is NULL */
    current->next = temp;
    current = current->next;
}

void error(void) { printf("[e]"); }

/* write the collected (lexeme, token) pairs to Tokens.txt */
void printStored(void) {
    node_t* c = &head;
    FILE* f = fopen("Tokens.txt", "w");
    while (c->next) {
        fprintf(f, "%s\t%s\n", c->lexeme, c->token);
        c = c->next;
    }
    fclose(f);
    printf("Tokens.txt written.\n");
}

int main(int argc, char* argv[]) {
    /* ensure the number of command line arguments */
    if (argc != 2) {
        printf("Please enter one filename as an argument.\n");
        return 1;
    }
    /* open the file named by the second argument */
    yyin = fopen(argv[1], "r");
    yylex();
    /* close file */
    fclose(yyin);
    printf("\nLexicalErrors %d\n", numErrors);
    printStored();
    return 0;
}
The above code is a flex file which parses the BooleanLogicLanguage.
CC=gcc
CFLAGS=
LexerFile=lexer

lexer: lex.yy.c
	$(CC) $(CFLAGS) -o lexer lex.yy.c

lex.yy.c: $(LexerFile).l
	flex $(LexerFile).l
The above code is a makefile which, when run in the same directory as the flex file, will create the 'lexer' program. This was tested on an Ubuntu 16.04 operating system. The GNU C Compiler (gcc) is required in addition to flex.
P = True;
R = False;
Q = 1(NOT P)1 XOR 2(P AND 3(NOT R)3)2;
print Q;
The above text should be parsed without errors by our lexer. And the lexer should output a Tokens.txt matching each lexeme to the token describing its regex.
P	IDENTIFIER
=	ASSIGN
True	BOOL_LITERAL
;	TERMINAL
R	IDENTIFIER
=	ASSIGN
False	BOOL_LITERAL
;	TERMINAL
Q	IDENTIFIER
=	ASSIGN
1(	OPEN
NOT	BOOL_OP_UNARY
P	IDENTIFIER
)1	CLOSE
XOR	BOOL_OP_BINARY
2(	OPEN
P	IDENTIFIER
AND	BOOL_OP_BINARY
3(	OPEN
NOT	BOOL_OP_UNARY
R	IDENTIFIER
)3	CLOSE
)2	CLOSE
;	TERMINAL
print	IO
Q	IDENTIFIER
;	TERMINAL
Here is the Tokens.txt for that initial error-free program.
The above is a screenshot of an example of parsing a program with lexical errors.

What is Next?
Classically, once you can identify words, including what type of word (token) each is, you can then create a grammar. You could think of a token as 'noun' or 'verb' or 'adjective', if comparing a programming language to a natural language. The step after lexical analysis (checking for correctness of words) is syntactic analysis (checking for correctness of grammar). Making a comparison to natural languages again, an English grammar could be PHRASE: article noun verb (The dog ran, A bird flies, etc.). In BooleanLogicLanguage there could be STATEMENT: IDENTIFIER ASSIGN BOOL_LITERAL (Q = True, A = False, etc.). But each of those is an example of just one type of phrase or statement; you can have multiple definitions for a production in the grammar for your language.
Non-classically? I don't know. Natural Language Processing (NLP) is a very active area, but I'm not aware of anyone using those techniques for parsing textual programming languages.
The post Flex, Regular Expressions, and Lexical Analysis appeared first on XRDS.
by Alexander DeForge at December 10, 2017 11:44 PM UTC
Peter Frankl (right) and Zoltán Füredi
The news
A new paper by Nathan Keller and Noam Lifshitz settles several open problems in extremal combinatorics for a wide range of parameters. Those include the three problems we mention next.
Three central open problems in extremal combinatorics are
The 1975 Erdős-Sós forbidding one intersection problem asks for the maximal
size of a $k$-uniform hypergraph that does not contain two edges whose intersection
is of size exactly $t-1$;
The 1987 Frankl-Füredi special simplex problem asks for the maximal
size of a $k$-uniform hypergraph that does not contain the following forbidden configuration: $d+1$ edges $E_1,\ldots,E_{d+1}$ such that there exists a set $S = \{v_1,\ldots,v_{d+1}\}$ for which $E_i \cap S = S \setminus \{v_i\}$ for any $i$ and the sets $E_i \setminus S$ are pairwise disjoint.
The 1974 Erdős-Chvátal simplex conjecture proposes an answer for the maximal
size of a $k$-uniform hypergraph that does not contain a $d$-simplex. Here, a $d$-simplex is a family of $d+1$ sets that have empty intersection, such that the intersection
of any $d$ of them is nonempty.
All these questions are related to the Erdős-Ko-Rado theorem (see this post and many others). For $t = 1$ and $d = 1$, two edges whose intersection is of size exactly $t-1 = 0$ are just two disjoint edges, and two disjoint edges form a 1-simplex and a special 1-simplex.
The papers by Keller and Lifshitz and by Ellis, Keller, and Lifshitz
The paper by Keller and Lifshitz settles all these problems for a wide range of parameters! A subsequent paper by David Ellis, Nathan Keller, and Noam Lifshitz extends (among various other results) the range of parameters for the Erdős-Sós problem even further.
Plans for posts
Michel Deza
I have an ambitious plan to devote two or three posts to these developments (but not before January). In the first post I will give some general background on Turán's problem for hypergraphs and the new exciting results. Then (perhaps in a second post) I will give a little background on two major methods: the Delta-system method, initiated by Deza, Erdős and Frankl and used in many earlier papers mainly by Frankl and Füredi, and the Junta method, initiated by Friedgut, which is used (among other ingredients) in the new paper. Then I will write about the new results in the last post.
Paul Erdős, Tomasz Łuczak, Ehud Friedgut, and Svante Janson
by Gil Kalai at December 10, 2017 05:26 PM UTC
(In this post I am following the venerable tradition of bloggers opining about matters they don't really know much about. I hope I learn something from the feedback. –Boaz)
Nothing is impossible,
Child, nothing is impossible.
Every bridge is crossable.
Every tooth is flossable.
Every win is lossable.
Every worker’s bossable.
Every cookie’s tossable.
Every yak’s a lhasa bull.
Nothing is impossible,
Child, nothing is impossible.

Okay, teacher, can you name something that ISN'T possible?
No, Child. Nothing is impossible.
So, there IS something that’s impossible. Naming something that’s impossible is impossible.
(From “The Teacher and The Child” by Chris Harris)
In this world where reasoned arguments and respect for facts seem increasingly rare, some people are actually worried about the opposite problem of “intelligence explosion”.
Recently, through a blog post of Scott Aaronson, I came across this essay by François Chollet. Given Scott’s less than raving review, I fully expected not to like it, but actually found myself agreeing with some parts of this essay (though not all, and in particular not with the title).
The basic fear of “intelligence explosion” is that:
 Once we develop sufficiently advanced artificial intelligence, it will go on to use this intelligence to build better and better AI, quickly leaving us behind.
 The AI will develop some form of consciousness and, rather than using its intelligence to make our lives better, will be about as kind to us as we were to the Neanderthals.
Consciousness is a hard concept to define. Humans used to attribute consciousness to the weather, praying to the sun and sea gods. After all it made perfect sense, the weather is unpredictable and dangerous, and seemed to vary in capricious ways. As we have grown to understand weather better, we no longer think that it is conscious. Yes, there is still an element of unpredictability and randomness to it, but we do understand the basic mechanisms at play, and can place some probabilities or bounds on its behavior. So these days the weather is stochastic rather than adversarial.
Arthur C. Clarke famously said that “Any sufficiently advanced technology is indistinguishable from magic.”. Similarly, one can say that any system that is sufficiently adversarial is indistinguishable from being conscious.
If we follow this definition, then we have already created conscious machines, since we can definitely already create technologies that we understand so poorly that we cannot model them in any way other than adversarial. In some sense this is the underlying lesson of results such as the Halting problem and NP-completeness. Moreover, ever since the atom bomb, mankind's potential to damage the planet and ourselves has gone far beyond what nature can do (and of course, as witnessed by global warming, nuclear energy is not the only technology that can affect the whole planet). Also, as anyone getting “over the air” updates for their gadgets can attest, we already have systems that continuously improve over time, often with a feedback loop. With more and more of the world's economy becoming dependent on the transfer of information as opposed to goods (which of course is somewhat of an artificial distinction), the speed of progress has become much faster. So, if the question is whether we should worry about developing and deploying technology whose behavior we can't completely predict, and one that could result in very bad consequences, then I am with the alarmists.
What I find less appealing about the “AI risk” position is the focus on notions such as “intelligence” and “consciousness”. There is already an algorithm outperforming most humans on IQ tests, and surely soon there will be an algorithm with an IQ of 300, or whatever the maximum possible value is. However, as Chollet points out, while some individual humans have had profound influence on the course of humanity, it is typically not intelligence alone that helped them (see e.g. the rather sad story of William James Sidis). That said, one can't dispute that in the 200K years we've been around, Homo sapiens have managed to make some significant progress, and so if you could simulate a population of even average humans, but with more people and at greater speed, you'd speed up scientific discovery. Indeed (and this is where I part ways with Chollet) scientific progress has been accelerating, precisely because we use scientific progress to make more progress, whether it's computer simulations helping in doing physics or quantum physics helping us build new computers, and so on and so forth. We are likely to continue to progress at an accelerating pace, not by trying to simulate a population of average or genius humans, but rather by continuously applying our tools and understanding to build better tools and improve our understanding.
But all of the above does not mean that modelling computation and communication systems as a new species and anthropomorphizing them is helpful. Research can and should be done on trying to verify that the behavior of computing systems does not deviate from certain parameters, whether it is Windows 10 or an AI algorithm.
With the progress of time computers are likely to continue to do more and more tasks currently associated with human intelligence, and yes, we do have a nontrivial chance of creating a technology that may eventually destroy us. But I don’t see why thinking of algorithms in anthropomorphic terms is helpful any more than thinking of the weather in terms of deities. If anything, understanding human “intelligence” and “consciousness” in more algorithmic ways seems like a better path forward.
by Boaz Barak at December 09, 2017 08:22 PM UTC
The Theory Group anticipates one or more available positions as Research Staff Members, pursuing research in theoretical computer science. The positions would start in 2018. Please see the webpage for more information.
Website: http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=4491
Email: klclarks@us.ibm.com
by shacharlovett at December 08, 2017 11:55 PM UTC
I have found some books about computers, but all of them are about technology. I want something more closely linked to theory.
by Raphael Augusto at December 08, 2017 10:46 PM UTC
Northeastern University, Boston, continues to hire at all levels and in all areas, including theory. In addition to a number of positions in the College of Computer Science, there is also a position joint with the Mathematics department that includes a focus on Data Science, Machine Learning, Discrete and Computational Mathematics, Probability and Statistics, or
Topological Data Analysis.
Website: https://www.ccis.northeastern.edu/openpositions/
Email: http://www.ccs.neu.edu/theory/index.html
by shacharlovett at December 08, 2017 10:40 PM UTC
An old result put a new way (in a now-fixed-up post)
Albert Meyer knows circuit lower bounds. He coauthored a paper with the late Larry Stockmeyer that proves that small instances of the decision problem of a certain weak second-order logical theory require Boolean circuits with more gates than there are atoms in the observable universe. The instances almost fit into two Tweets using just the Roman typewriter keys.
Today Ken and I discuss a simple but perhaps overlooked connection between P=NP and circuit lower bounds.
Albert recently coauthored, with Eric Lehman of Google and his MIT colleague Tom Leighton, the textbook Mathematics for Computer Science. It looks familiar to us because it uses the same MIT Press fonts and layout package as our quantum computing book. They say the following in their Introduction:
Simply put, a proof is a method of establishing truth. Like beauty, “truth” sometimes depends on the eye of the beholder, and it should not be surprising that what constitutes a proof differs among fields.
We would say that the game is not only about making truth evident from a proof but also from the way a theorem statement is expressed. This post uses an old result of Albert’s as an example.
New Statement of Old Result
Well, any criticism of Albert for how his theorem was stated is really criticism of myself, because Dick Karp and I were the first to state it in a paper. Here is exactly how we wrote it in our STOC 1980 version:
There was no paper by Albert to cite. In our 1982 final version we included a proof of his theorem:
As you can see, our proof—Albert’s proof?—used completeness for and some constructions earlier in the paper. In a preliminary section we wrote that our proofs about classes such as involved showing inclusions
“where the set of strings is complete in with respect to an appropriate reducibility.”
But in this case the proof does not need completeness for . I came up with this realization on Wednesday and Ken found essentially the same proof in these lecture notes by Kristoffer Hansen:
This proof uses not only for but also for . For the latter it suffices to have polynomialtime reduce to . This follows so long as is complete for some class to which also belongs.
Suppose runs in deterministic time . Then both and belong to , the latter because on input we can run for steps. With minimal assumptions on the function , has complete sets , and then follows from . So we can state the theorem more generally:
In fact we would get into and even smaller classes but that’s going beyond our simple point.
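Spelled out, the standard chain behind this runs as follows (a sketch following Hansen's notes, not the exact displays of our paper):

\[ \mathsf{EXP} \subseteq \mathsf{P/poly} \implies \mathsf{EXP} = \Sigma_2^p \quad \text{(Meyer)}, \]

\[ \mathsf{P} = \mathsf{NP} \implies \Sigma_2^p = \mathsf{P} \quad \text{(the hierarchy collapses)}, \]

so together $\mathsf{P} = \mathsf{NP}$ and $\mathsf{EXP} \subseteq \mathsf{P/poly}$ would give $\mathsf{EXP} = \mathsf{P}$, contradicting the time hierarchy theorem. Hence $\mathsf{P} = \mathsf{NP}$ implies $\mathsf{EXP} \not\subseteq \mathsf{P/poly}$, and the same argument goes through with $\mathsf{EXP}$ replaced by any deterministic time class that has suitable complete sets and provably differs from $\mathsf{P}$.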
The Point
Our point comes through if we think of a concrete case like deterministic quasipolynomial time, called QP. So we have:
Corollary 2 If then .
Now mentally substitute QP for EXP (and `QP' for `EXP') in the way Karp and I summarized the final implication in our paper:
What you get after contraposing and using the hierarchy theorem for QP is:
Corollary 3 If then .
The point is that we can also do this for time and even smaller proper superclasses of . What follows is:
Any attempt to prove $\mathsf{P} = \mathsf{NP}$ entails proving strong nonuniform circuit lower bounds on languages that are arbitrarily close to being in $\mathsf{P}$.
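To see the whole chain in one place (our rendering, using only the standard facts that $\mathsf{P} = \mathsf{NP}$ collapses the polynomial hierarchy and that the time hierarchy theorem separates $\mathsf{QP}$ from $\mathsf{P}$):

$\mathsf{P} = \mathsf{NP} \;\Rightarrow\; \Sigma_2^p = \mathsf{P};$

$\mathsf{QP} \subseteq \mathsf{P/poly} \;\Rightarrow\; \mathsf{QP} \subseteq \Sigma_2^p \quad \text{(Corollary 2)},$ so under $\mathsf{P} = \mathsf{NP}$ we would get $\mathsf{QP} = \mathsf{P}$, contradicting the time hierarchy theorem;

hence $\mathsf{P} = \mathsf{NP} \;\Rightarrow\; \mathsf{QP} \not\subseteq \mathsf{P/poly}$, and the same template works with any proper superclass of $\mathsf{P}$ that has the needed complete sets in place of $\mathsf{QP}$.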
Reflecting…
Again in the case of $\mathsf{EXP}$ this implication too has been variously noted. Scott Aaronson mentions it in one sentence of his great recent 121-page survey on the $\mathsf{P}$ versus $\mathsf{NP}$ question (p65):
“[I]f someone proved $\mathsf{P} = \mathsf{NP}$, that wouldn’t be a total disaster for lower bounds research: at least it would immediately imply $\mathsf{EXP} \not\subseteq \mathsf{P/poly}$ (via $\mathsf{EXP} = \Sigma_2^p$).”
Maybe I (Dick) considered this in terms of $\mathsf{EXP}$ in weighing my thoughts about $\mathsf{P}$ versus $\mathsf{NP}$. But that it applies to $\mathsf{QP}$ in place of $\mathsf{EXP}$ gives me pause. This greatly amplifies idle thoughts about the irony of how proving $\mathsf{P} = \mathsf{NP}$ yields the same type of lower bounds against $\mathsf{P/poly}$ that are involved in the “Natural Proofs” barrier against proving $\mathsf{P} \neq \mathsf{NP}$. Ryan Williams had to combine many ideas just to separate $\mathsf{NEXP}$ from nonuniform $\mathsf{ACC}^0$—not even getting $\mathsf{EXP}$ on the left nor $\mathsf{P/poly}$ on the right. (Incidentally, we note this nice recent MIT profile of Ryan.) So having such lower bounds for $\mathsf{QP}$ just drop from the sky upon $\mathsf{P} = \mathsf{NP}$ seems jarring.
So I’m rethinking my angle on $\mathsf{P}$ versus $\mathsf{NP}$. I’ve always propounded that good lower bounds flow like ripples from new upper bounds, but the wake of $\mathsf{P} = \mathsf{NP}$ seems a tsunami. We wonder if Bill Gasarch will do a 3rd edition of his famous poll about $\mathsf{P}$ versus $\mathsf{NP}$. Ken and I offset each other with our votes last time, but maybe not this time.
We also wonder whether Theorem 1 can be given even stronger statements in ways that are useful. In the original version of this post we overlooked a point noted first by Ryan Williams here and thought we had the conclusion for every language in the class, not just complete ones. To patch it, call a language $L$ in $\mathsf{EXP}$ “reflective” if there is a TM $M$ running in exponential time such that $L(M) = L$ and $T_M$ (namely, the “tableau” language defined above) polynomial-time reduces to $L$. The complete sets mentioned above for classes within $\mathsf{EXP}$ are reflective. If we let $\mathsf{EXP}_r$ denote the subclass of reflective languages, then we can say: if $L \in \mathsf{EXP}_r$ and $L \in \mathsf{P/poly}$, then $L \in \Sigma_2^p$.
Note that per Lance Fortnow’s comment here, sparse languages are candidates for being non-reflective: the tableau language which we would wish to polynomial-time Turing reduce to $L$ is generally dense.
Open Problems
Is this realization about $\mathsf{P} = \mathsf{NP}$ and strong circuit lower bounds arbitrarily close to $\mathsf{P}$ really new? Can our readers point us to other discussions of it?
Is the notion of “reflective” known? useful?
[fixed error in original Theorem 1 and surrounding text; added paragraph about it before “Open Problems”; moved query about “cosmological” formulas to a comment.]
by RJLipton+KWRegan at December 08, 2017 07:55 PM UTC
Deep learning holds many mysteries for theory, as we have discussed on this blog. Lately many ML theorists have become interested in the generalization mystery: why do trained deep nets perform well on previously unseen data, even though they have way more free parameters than the number of datapoints (the classic “overfitting” regime)? Zhang et al.’s paper Understanding Deep Learning requires Rethinking Generalization played some role in bringing attention to this challenge. Their main experimental finding is that if you take a classic convnet architecture, say Alexnet, and train it on images with random labels, then you can still achieve very high accuracy on the training data. (Furthermore, usual regularization strategies, which are believed to promote better generalization, do not help much.) Needless to say, the trained net is subsequently unable to predict the (random) labels of still-unseen images, which means it doesn’t generalize. The paper notes that the ability to fit a classifier to data with random labels is essentially a traditional measure in machine learning called Rademacher complexity (which we will discuss shortly), and thus Rademacher complexity gives no meaningful bounds on sample complexity. I found this paper entertainingly written and recommend reading it, despite having given away the punchline. Congratulations to the authors for winning best paper at ICLR 2017.
But I would be remiss if I didn’t report that at the Simons Institute Semester on theoretical ML in spring 2017 generalization theory experts expressed unhappiness about this paper, and especially its title. They felt that similar issues had been extensively studied in context of simpler models such as kernel SVMs (which, to be fair, is clearly mentioned in the paper). It is trivial to design SVM architectures with high Rademacher complexity which nevertheless train and generalize well on real-life data. Furthermore, theory was developed to explain this generalization behavior (and also for related models like boosting). On a related note, several earlier papers of Behnam Neyshabur and coauthors (see this paper and for a full account, Behnam’s thesis) had made points fairly similar to Zhang et al. pertaining to deep nets.
But regardless of such complaints, we should be happy about the attention brought by Zhang et al.’s paper to a core theory challenge. Indeed, the passionate discussants at the Simons semester themselves banded up in subgroups to address this challenge: these resulted in papers by Dziugaite and Roy, then Bartlett, Foster, and Telgarsky, and finally Neyshabur, Bhojanapalli, McAllester, and Srebro. (The latter two were presented at NIPS’17 this week.)
Before surveying these results let me start by suggesting that some of the controversy over the title of Zhang et al.’s paper stems from basic confusion about whether current generalization theory is prescriptive or merely descriptive. This confusion arises from the standard treatment of generalization theory in courses and textbooks, as I discovered while teaching the recent developments in my graduate seminar.
Prescriptive versus descriptive theory
To illustrate the difference, consider a patient who says to his doctor: “Doctor, I wake up often at night and am tired all day.”
Doctor 1 (without any physical examination): “Oh, you have sleep disorder.”
I call such a diagnosis descriptive, since it only attaches a label to the patient’s problem, without giving any insight into how to solve the problem. Contrast with:
Doctor 2 (after careful physical examination): “A growth in your sinus is causing sleep apnea. Removing it will resolve your problems.”
Such a diagnosis is prescriptive.
Generalization theory: descriptive or prescriptive?
Generalization theory notions such as VC dimension, Rademacher complexity, and PAC-Bayes bounds consist of attaching a descriptive label to the basic phenomenon of lack of generalization. They are hard to compute for today’s complicated ML models, let alone to use as a guide in designing learning systems.
Recall what it means for a hypothesis/classifier $h$ to not generalize. Assume the training data consists of a sample $S = \{(x_1, y_1), (x_2, y_2),\ldots, (x_m, y_m)\}$ of $m$ examples from some distribution ${\mathcal D}$. A loss function $\ell$ describes how well hypothesis $h$ classifies a datapoint: the loss $\ell(h, (x, y))$ is high if the hypothesis didn’t come close to producing the label $y$ on $x$ and low if it came close. (To give an example, the regression loss is $(h(x) - y)^2$.) Now let us denote by $\Delta_S(h)$ the average loss on sample points in $S$, and by $\Delta_{\mathcal D}(h)$ the expected loss on samples from distribution ${\mathcal D}$. Training generalizes if the hypothesis $h$ that minimizes $\Delta_S(h)$ for a random sample $S$ also achieves very similarly low loss $\Delta_{\mathcal D}(h)$ on the full distribution. When this fails to happen, we have:
Lack of generalization: $\Delta_S(h) \ll \Delta_{\mathcal D}(h) \qquad (1)$
In practice, lack of generalization is detected by taking a second sample (“held-out set”) $S_2$ of size $m$ from ${\mathcal D}$. By concentration bounds, the expected loss of $h$ on this second sample closely approximates $\Delta_{\mathcal D}(h)$, allowing us to conclude

$\Delta_S(h) - \Delta_{S_2}(h) \ll 0 \qquad (2)$
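To make the held-out check concrete, here is a minimal numpy sketch (the classifier object h and its predict method are placeholder names of ours, not anything from the papers under discussion):

import numpy as np

# Empirical average 0-1 loss Delta_S(h) on a labeled sample (X, y).
def avg_loss(h, X, y):
    return np.mean(h.predict(X) != y)

# Delta_S(h) - Delta_{S_2}(h); a value far below 0 signals (2), i.e.,
# lack of generalization detected via the held-out set S_2.
def generalization_gap(h, X_train, y_train, X_held, y_held):
    return avg_loss(h, X_train, y_train) - avg_loss(h, X_held, y_held)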
Generalization Theory: Descriptive Parts
Let’s discuss Rademacher complexity, which I will simplify a bit for this discussion. (See also scribe notes of my lecture.) For convenience assume that labels and loss are $0,1$, and that the badly generalizing $h$ predicts perfectly on the training sample $S$ and is completely wrong on the held-out set $S_2$, meaning
$\Delta_S(h) - \Delta_{S_2}(h) \approx -1 \qquad (3)$
Rademacher complexity concerns the following thought experiment. Take a single sample of size $2m$ from $\mathcal{D}$, split it into two and call the first half $S$ and the second $S_2$. Flip the labels of points in $S_2$. Now try to find a classifier $h$ that best describes this new sample, meaning one that minimizes $\Delta_S(h) + 1 - \Delta_{S_2}(h)$. This expression follows since flipping the label of a point turns good classification into bad and vice versa, and thus the loss function for $S_2$ is $1$ minus the old loss. We say the class of classifiers has high Rademacher complexity if with high probability this quantity is small, say close to $0$.
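Here is the thought experiment as a runnable sketch, with train standing in for an arbitrary learning algorithm and rng a numpy Generator such as np.random.default_rng(0) (all names here are ours):

import numpy as np

# One run of the label-flipping experiment on a sample (X, y) of size 2m,
# with binary labels in {0, 1}. Returns Delta_S(h) + 1 - Delta_{S_2}(h) for
# the fitted h; values near 0 across random splits indicate high Rademacher
# complexity in the simplified sense used above.
def rademacher_experiment(train, X, y, rng):
    m = len(y) // 2
    idx = rng.permutation(len(y))
    S, S2 = idx[:m], idx[m:]
    y_flipped = y.copy()
    y_flipped[S2] = 1 - y_flipped[S2]        # flip labels on the second half
    h = train(X[idx], y_flipped[idx])        # fit the relabeled sample
    loss_S = np.mean(h.predict(X[S]) != y[S])
    loss_S2 = np.mean(h.predict(X[S2]) != y[S2])
    return loss_S + 1 - loss_S2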
But a glance at (3) shows that it implies high Rademacher complexity: $S, S_2$ were random samples of size $m$ from $\mathcal{D}$, so their combined size is $2m$, and when generalization failed we succeeded in finding a hypothesis $h$ for which $\Delta_S(h) + 1 - \Delta_{S_2}(h)$ is very small.
In other words, returning to our medical analogy, the doctor only had to hear “Generalization didn’t happen” to pipe up with: “Rademacher complexity is high.” This is why I call this result descriptive.
The VC dimension bound is similarly descriptive. VC dimension is defined to be at least $k$ if there exists a set of $k$ datapoints such that the following is true: if we look at all possible classifiers in the class, and the sequence of labels each gives to the $k$ datapoints in the set, then we find all possible $2^{k}$ sequences of $0$’s and $1$’s.
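As a toy illustration of the definition, one can brute-force whether a small hypothesis class shatters a set of points, i.e., realizes all $2^k$ label patterns on it; the threshold-classifier example below is our own:

# Does the hypothesis class shatter the given points, i.e., realize
# all 2^k label patterns on them?
def shatters(hypotheses, points):
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

# Example: threshold classifiers on the integers have VC dimension 1.
thresholds = [lambda x, t=t: int(x >= t) for t in range(-5, 6)]
print(shatters(thresholds, [0]))     # True: a single point is shattered
print(shatters(thresholds, [0, 1]))  # False: the pattern (1, 0) is unrealizable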
If generalization does not happen as in (2) or (3) then this turns out to imply that VC dimension is at least around $\epsilon m/\log m$ for some $\epsilon >0$. The reason is that the $2m$ data points were split randomly into $S, S_2$, and there are $2^{2m}$ such splittings. When the generalization error is $\Omega(1)$ this can be shown to imply that we can achieve $2^{\Omega(m)}$ labelings of the $2m$ datapoints using all possible classifiers. Now the classic Sauer’s lemma (see any lecture notes on this topic, such as Schapire’s) can be used to show that VC dimension is at least $\epsilon m/\log m$ for some constant $\epsilon>0$.
Thus again, the doctor only has to hear “Generalization didn’t happen with sample size $m$” to pipe up with: “VC dimension is higher than $\Omega(m/\log m)$.”
One can similarly show that PAC-Bayes bounds are also descriptive, as you can see in scribe notes from my lecture.
Why do students get confused and think that such tools of generalization theory give some powerful technique to guide the design of machine learning algorithms?
Answer: Probably because the standard presentation in lecture notes and textbooks seems to pretend that we are computationally omnipotent beings who can compute VC dimension and Rademacher complexity and thus arrive at meaningful bounds on sample sizes needed for training to generalize. While this may have been possible in the old days with simple classifiers, today we have complicated classifiers with millions of variables, which furthermore are products of nonconvex optimization techniques like backpropagation. The only way to actually lower-bound the Rademacher complexity of such complicated learning architectures is to try training a classifier and detect lack of generalization via a held-out set. Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.
Toward a prescriptive generalization theory: the new papers
In our medical analogy we saw that the doctor needs to at least do a physical examination to arrive at a prescriptive diagnosis. The authors of the new papers intuitively grasp this point, and try to identify properties of real-life deep nets that may lead to better generalization. Such an analysis (related to “margin”) was done for simple two-layer networks a couple of decades ago, and the challenge is to find analogs for multilayer networks. Both Bartlett et al. and Neyshabur et al. home in on the stable rank of the weight matrices of the layers of the deep net. These can be seen as an instance of a “flat minimum,” which has been discussed in the neural nets literature for many years. I will present my take on these results as well as some improvements in a future post. Note that these methods do not as yet give any nontrivial bounds on the number of datapoints needed for training the nets in question.
Dziugaite and Roy take a slightly different tack. They start with McAllester’s 1999 PAC-Bayes bound, which says that if the algorithm’s prior distribution on the hypotheses is $P$, then for every posterior distribution $Q$ (which could depend on the data) on the hypotheses, the generalization error of the average classifier picked according to $Q$ is upper bounded as follows, where $D(\cdot\|\cdot)$ denotes KL divergence and the bound holds with probability at least $1-\delta$ over the choice of the sample $S$ of size $m$:

$\mathbb{E}_{h\sim Q}[\Delta_{\mathcal D}(h)] \;\le\; \mathbb{E}_{h\sim Q}[\Delta_S(h)] + \sqrt{\frac{D(Q\|P) + \ln(m/\delta)}{2(m-1)}}$
This allows upper bounds on generalization error (specifically, upper bounds on the number of samples that guarantee such an upper bound) by proceeding as in Langford and Caruana’s old paper, where $P$ is a fixed Gaussian and $Q$ is a noised version of the trained deep net (whose generalization we are trying to explain). Specifically, if $w_{ij}$ is the weight of edge $\{i, j\}$ in the trained net, then $Q$ consists of adding a Gaussian noise $\eta_{ij}$ to weight $w_{ij}$. Thus a random classifier according to $Q$ is nothing but a noised version of the trained net. Now we arrive at the crucial idea: use nonconvex optimization to find a choice for the variance of $\eta_{ij}$ that balances two competing criteria: (a) the average classifier drawn from $Q$ has training error not much more than the original trained net (again, this is a quantification of the “flatness” of the minimum found by the optimization) and (b) the right-hand side of the above expression is as small as possible. Assuming (a) and (b) can be suitably bounded, it follows that the average classifier from $Q$ works reasonably well on unseen data. (Note that this method only proves generalization of a noised version of the trained classifier.)
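A sketch of the resulting computation, in the simplified case where both $P$ and $Q$ are factorized Gaussians over the weights (the bound Dziugaite and Roy actually optimize differs in details, so the constants here are illustrative):

import numpy as np

# KL(Q || P) between factorized Gaussians over the network weights;
# mu_*, sigma_* are numpy arrays with one entry per weight.
def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
                  - 0.5)

# McAllester-style bound on the expected loss of the Q-average classifier,
# given its training loss, the KL term, sample size m, and confidence delta.
def pac_bayes_bound(train_loss_q, kl, m, delta=0.05):
    return train_loss_q + np.sqrt((kl + np.log(m / delta)) / (2 * (m - 1)))

Criterion (a) constrains train_loss_q while criterion (b) asks to make pac_bayes_bound small; the noise variances sigma_q are the knobs the nonconvex optimizer turns.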
Applying this method to simple fully-connected neural nets trained on the MNIST dataset, they can prove that the noised net achieves error at most $17$ percent on MNIST (whereas the actual error is much lower, around 2-3 percent). Hence the title of their paper, which promises nonvacuous generalization bounds. What I find most interesting about this result is that it uses the power of nonconvex optimization (harnessed above to find a suitable noised distribution $Q$) to cast light on one of the meta-questions about nonconvex optimization, namely: why does deep learning not overfit?
at December 08, 2017 06:00 PM UTC
The Computer Science Department at UT Austin invites applications for a Postdoctoral Fellow in theoretical computer science for the 2018-19 academic year. The Fellow will work with Dana Moshkovitz and David Zuckerman on pseudorandomness and computational complexity. Applications will be accepted until the position is filled.
Website: https://utdirect.utexas.edu/apps/hr/jobs/nlogon/171128010712
Email: diz@cs.utexas.edu
by shacharlovett at December 08, 2017 04:50 PM UTC
When I awoke with glowing, translucent hands, and hundreds of five-pointed yellow stars lined up along the left of my visual field, my first thought was that a dream must have made itself self-defeatingly obvious. I was a 63-year-old computer science professor. I might’ve been dying of brain cancer, but my mind was lucid enough that I’d refused hospice care, lived at home, still even met sometimes with my students, and most importantly: still answered my email, more or less. I could still easily distinguish dreams from waking reality. Couldn’t I?
I stared at the digital clock beside my bed: 6:47am. After half a minute it changed to 6:48. No leaping around haphazardly. I picked up the two-column conference paper by my nightstand. “Hash-and-Reduce: A New Approach to Distributed Proximity Queries in the Cloud.” I scanned the abstract and first few paragraphs. It wasn’t nonsense—at least, no more so than the other papers that I still sometimes reviewed. The external world still ticked with clockwork regularity. This was no dream.
Nervously, I got up. I saw that my whole body was glowing and translucent. My pajamas, too. A second instance of my body, inert and not translucent, remained in the bed. I looked into the mirror: I had no reflection. The mirror showed a bedroom unoccupied but for the corpse on the bed.
OK, so I was a ghost.
Just then I heard my nurse enter through the front door. “Bob, how you feeling this morning?” I met her in the foyer. “Linda, look what happened! I’m a ghost now, but interestingly enough, I can still..”
Linda walked right through me and into the bedroom. She let out a small gasp when she saw the corpse, then started making phone calls.
Over the following days, I accompanied my body to the morgue. I attended my little memorial session at the university, made note of which of my former colleagues didn’t bother to show up. I went to my funeral. At the wake, I stood with my estranged wife and grown children, who mostly remained none the wiser—except when they talked about how eerie it was, how it felt like I was still there with them. Or maybe I’d say something, and get no response from my family, but then five minutes later their conversation would mysteriously veer toward the topic I’d broached. It seemed that I still had full input from the world of the living, but that my only output channel was doing spooky haunted things that still maintained plausible deniability about my existence.
Questions flooded my mind: were there other ghosts? Why was I in this purgatory … or whatever it was? Would I be here forever? And: what was that column of yellow stars in the left of my visual field, the stars that followed me everywhere?
Once it seemed clear that I was here to stay, for some definition of “here,” I figured I might as well do the same stuff that filled my waking hours when I was alive. I pulled up a chair and sat at my laptop. I hit up The Washington Post, The Onion, xkcd, SMBC Comics, Slate Star Codex. They all worked fine.
Then I switched to the Gmail tab. Hundreds of new messages. Former students asking for recommendation letters, prospective students wanting to work with me, grant managers howling about overdue progress reports, none of them bothering to check if I was dead.
I replied to one randomly chosen email:
Dear Ashish,
Thanks for your interest in joining our group. Alas, I’m currently dead and walking the earth as a translucent wraith. For that reason, I’m unable to take on new PhD students at this time.
Best of luck!
–Bob
I clicked “Send” and—part of me was expecting this—got an error. Message not sent. Email couldn’t cross the barrier from the dead to the living: too obvious.
Next I opened my “Starred” folder. I was greeted by 779 starred messages: each one a pressing matter that I’d promised myself I’d get to while alive but didn’t.
Dear Bob,
Hope you’re well. I think I’ve found another error in your 2002 paper ‘Cache-Oblivious Approximation Algorithms for Sparse Linear Algebra on Big Data.’ Specifically, in the proof of Lemma 4.2, you assume a spectral bound [har har, spectral], even though your earlier definition of the matrix A_i seems to allow arbitrary norm…
I chuckled. Well, I did spend most of my life on this stuff, didn’t I? Shouldn’t I sort this out, just for the sake of my intellectual conscience?
I opened up my old paper in Ghostview (what else?) and found the offending lemma. Then I took out pen and paper—they worked, luckily, although presumably my scribblings remained invisible to the living—and set to work. After an hour, I’d satisfied myself that the alleged error was nothing too serious, just a gap requiring a few sentences of clarification. I sadly had no direct way to tell my years-ago correspondent that, assuming the correspondent was still even alive and research-active and at the same email address. But still: good for my peace of mind, right?
Then something happened: the first intimation of what my life, or rather undeath, was to consist of from then on. Faintly but unmistakably, one of the tiny yellow stars in the left of my visual field became a blue-gray outline. It was no longer filled with yellow.
Excitedly, I clicked through more starred emails. Some I saw no easy way to deal with. But every time I could satisfy myself that an email was no longer relevant—whether it was an invitation to a long-ago workshop, a grant that I never applied for, a proposed research collaboration rendered moot by subsequent work—one of those yellow stars in my visual field lost its yellow filling. Before long there were ten blue-gray outline stars, then twenty.
One day, while I invisibly attended an old haunt (har har)—the weekly faculty lunch in my former department—I encountered a fellow ghost: a former senior colleague of mine, who’d died twenty years prior. He and I got to talking.
For the most part, my fellow specter confirmed what I’d already guessed. Yes, in some longago past, purgatory no doubt had a different character. Yes, it’s no doubt different for others, who lived different lives and faced different psychic burdens. For us, though, for the faculty, purgatory is neither more nor less than the place where you must reply to every last email that was still starred “important” when you died.
In the afterlife, it turns out, it doesn’t matter how “virtuous” you were, unless altruism happens to have been your obsession while alive. What matters is just that you free yourself from whatever burdened you every night when you went to sleep, that you finish what you started. Those unable to do so remain ghosts forever.
“So,” I asked the other polterguest at the faculty lunch, “how long does it take a professor to finish answering a lifetime’s worth of emails?”
“Depends. I’ve been doing it for twenty years. Hoping to finish in twenty more.”
“I see. And when you’ve dealt with the last email, what then?”
“You pass to another place. None of us know exactly where. But”—and here his voice dropped to a whisper, as if anyone else present could hear ghosts—“it’s said to be a place of breathtaking tranquility. Where researchers like us wear flowing robes, and sit under olive trees, and contemplate truth and beauty with Plato and Euclid, and then go out for lunch buffet. Where there’s no email, no deadlines, no journals, no grant applications, no responsibilities but one: to explore whatever has captured your curiosity in the present moment. Some call it the Paradise of Productivity.”
“Does everyone have to pass through purgatory first, before they go there?”
“It’s said that, among all the computer scientists who’ve lived, only Alan Turing went straight to Paradise. And he died before email was even invented. When his time comes, Donald Knuth might also escape purgatory, since he forswore email in 1990. But Knuth, alas, might spend tens of thousands of years in a different purgatory, finishing Volume 4 of The Art of Computer Programming.
“As for the rest of us, we all spend more or less time here with our wretched emails—for most of us, more. For one computer scientist—an Umesh Vazisomething, I believe, from Berkeley—it’s rumored that when he enters this place, even a trillion years won’t suffice to leave it. It’s said that the Sun will swallow the Earth, the night sky will go dark, and yet there Umesh will be, still clearing his inbox.”
After a few years, I’d knocked off all the easy stuff in my Starred folder. Then, alas, I was left with missives like this:
Hey, earth to Bob!
The rest of us have done our part in writing up the paper. We’re all waiting on you to integrate the TeX files, and to craft an introduction explaining why anyone cared about the problem in the first place. Also, would you mind making a detailed pass through Sections 4.3 and 5.2?
Ugh. There were so many slightly different TeX files. Which were the most recent? This could take a while.
Nevertheless, after weeks of … ghosting on the project, I got to work revising the paper. There was, of course, the practical difficulty that I couldn’t directly communicate my edits back to the world of the living. Fortunately, I could still do haunted stuff. One day, for example, one of my former coauthors opened her old TeX file, and “discovered” that I’d actually done way more work on the paper while I was alive than anyone remembered I had. The mysteries of when exactly I did that work, and why no one knew about it at the time, were never satisfactorily resolved.
Finally, after fourteen years, I’d succeeded in putting to rest 731 of my 779 starred emails. In the corner of my visual field was a vast array of blue-gray stars—but still, ominously, 48 yellow stars scattered among them.
“God in Heaven!” I cried. “Whoever you are! I can’t handle any of the remaining starred emails, and thereby pass to the Paradise of Productivity, without sending replies back into the world of the living. Please, I beg you: let me breach this metaphysical wall.”
A booming voice came down from on high. “YEA, BOB, WHAT THOU REQUESTETH IS POSSIBLE. THOU WOULDST NOT EVEN BE THE FIRST GHOUL FOR WHOM I WOULDST GRANTETH THIS REQUEST: FAR FROM IT. BUT I MUST WARN THEE: BREACHING THE WALL BETWEEN LIVING AND DEAD WILL BRINGETH FRUITS THAT THOU MAYST NOT LIKE.”
“I think I’ll take my chances with those fruits.”
“VERY WELL,” said God.
And that’s how it is that, half a century after my death, I remain in purgatory still, my days now filled with missives like the following:
Dear Bob,
Thanks for the reply! I’m sorry to hear that you’re now a ghost condemned to answer emails before he can pass to the next world. My sympathies. Having said that, I have to confess that I still don’t understand Section 4.2 of your paper. When you get a chance, could you please clarify? I’ve cc’ed my coauthors, who might have additional followup questions.
Note: To anyone who emailed me lately, I apologize for the delay in replying. I was writing this story. –SA
by Scott at December 08, 2017 02:32 AM UTC
The wonderful Women in Theory (WIT) biennial series of workshops started in 2008 and the 6th meeting will take place at Harvard University, Jun 19 – 22, 2018. Please see below the call for application.
WIT is one of my favorite (if not the favorite) programs in the theory community. I was lucky enough to participate in the outskirts of two WIT meetings, but naturally and rightfully never took part in its inner sanctum. So how can I vouch for an event I never fully attended? Having experienced the passion of the organizers, most notably Tal Rabin, and the enthusiastic reactions of the participants, I have no doubt. I am far from being the only one feeling this way, and this year both Harvard and Stanford competed to host the workshop. So if you fit the workshop’s qualifications – please do yourself a favor and apply!
We will have a “Women In Theory” workshop for female graduate students and exceptional undergraduates (fourth year) in theoretical computer science at Harvard University, Cambridge, MA from Tuesday, June 19 to Friday, June 22, 2018: https://womenintheory.wordpress.com/ .
We are writing to draw your attention to the workshop and ask you to inform female students in your department about this workshop. The workshop will have first-rate technical content and will be a great opportunity for students to meet their peers from around the world. We have received very enthusiastic feedback from participants in previous years and we think that this could be an exciting event for your students as well. Please forward this email to the female students in your department.
We will supply dorm rooms, breakfast and lunch, and will probably also be able to cover at least part if not all of their travel expenses. It would be great if you would be able to cover the remaining expenses.
For any questions, please email womenintheory2018@gmail.com.
The deadline to apply is January 16, 2018. Each student applicant needs to finish the application form on https://womenintheory.wordpress.com/apply/ , and her advisor also needs to supply a short letter of recommendation.
Best,
The organizing committee:
Tal Rabin
Shubhangi Saraf
Lisa Zhang
by Omer Reingold at December 07, 2017 05:23 PM UTC
I find it surprising that we have no proof or counterexample to this purely combinatorial question. There is a generalization of sensitivity known as block sensitivity, which is the largest number of disjoint blocks of input bits such that flipping the bits in any one block flips the output bit. Block sensitivity is known to be polynomially related to decision tree complexity.
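To make the definitions concrete, here is a brute-force computation of both quantities for small Boolean functions (exponential time, toy sizes only; the code is our illustration, not from the post):

from itertools import product, combinations

def flip(x, block):
    # Flip the bits of x in the given block of coordinates.
    return tuple(b ^ 1 if i in block else b for i, b in enumerate(x))

def sensitivity(f, n):
    # Max over inputs x of the number of single bits whose flip changes f(x).
    return max(sum(f(flip(x, {i})) != f(x) for i in range(n))
               for x in product((0, 1), repeat=n))

def block_sensitivity(f, n):
    # Max over x of the largest number of pairwise disjoint blocks,
    # each of whose flips changes f(x). Exact search via set packing.
    best = 0
    for x in product((0, 1), repeat=n):
        blocks = [frozenset(c) for r in range(1, n + 1)
                  for c in combinations(range(n), r)
                  if f(flip(x, c)) != f(x)]
        def pack(avail, used):
            nonlocal best
            best = max(best, used)
            for i, b in enumerate(avail):
                pack([c for c in avail[i + 1:] if not (c & b)], used + 1)
        pack(blocks, 0)
    return best

f_or = lambda x: int(any(x))
print(sensitivity(f_or, 3), block_sensitivity(f_or, 3))  # 3 3, at the all-zeros input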
In a future post I'll talk about some approaches towards resolving this conjecture.
by Lance Fortnow (noreply@blogger.com) at December 07, 2017 03:21 PM UTC
A quiet month, with only two papers. Perhaps the calm before the storm? Please let us know in the comments if something slipped under our radar.
Agreement tests on graphs and hypergraphs, by Irit Dinur, Yuval Filmus, and Prahladh Harsha (ECCC). This work looks at agreement tests and agreement theorems, which say that if a collection of local functions passes a test checking that they mostly agree with one another, then there exists a global function which agrees with most of them. This work extends previous work on direct product testing to local functions of higher degree, which corresponds to agreement tests on graphs and hypergraphs.
Testing Conditional Independence of Discrete Distributions, by Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart (arXiv). This paper focuses on testing whether a bivariate discrete distribution has independent marginals, conditioned on the value of a third discrete random variable. More precisely, given realizations of \((X, Y, Z)\), test if \(X \perp Y \mid Z\). Unconditional independence testing (corresponding to the case when \(Z\) is constant) has been extensively studied by the community, with tight upper and lower bounds showing that the sample complexity has two regimes, depending on the tradeoff between the support size and the accuracy desired. This paper gives upper and lower bounds for this more general problem, showing a rich landscape depending on the relative values of the parameters.
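For intuition about the property being tested, here is a naive plug-in statistic (our illustration only; the paper's testers are far more sample-efficient than this):

from collections import Counter, defaultdict

# Naive plug-in statistic for X independent of Y given Z: for each value z,
# the L1 distance between the empirical joint of (X, Y) and the product of
# the empirical marginals, weighted by the empirical probability of z.
# Values near 0 are consistent with conditional independence.
def cond_independence_stat(samples):
    samples = list(samples)          # iterable of (x, y, z) triples
    n = len(samples)
    by_z = defaultdict(list)
    for x, y, z in samples:
        by_z[z].append((x, y))
    stat = 0.0
    for z, pairs in by_z.items():
        m = len(pairs)
        joint = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        l1 = sum(abs(joint[(x, y)] / m - (px[x] / m) * (py[y] / m))
                 for x in px for y in py)
        stat += (m / n) * l1
    return stat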
by Gautam "G" Kamath at December 07, 2017 12:51 AM UTC
Authors: Jonah Blasiak, Thomas Church, Henry Cohn, Joshua A. Grochow, Chris Umans
Download: PDF
Abstract: The Cohn-Umans group-theoretic approach to matrix multiplication suggests embedding matrix multiplication into group algebra multiplication, and bounding $\omega$ in terms of the representation theory of the host group. This framework is general enough to capture the best known upper bounds on $\omega$ and is conjectured to be powerful enough to prove $\omega = 2$, although finding a suitable group and constructing such an embedding has remained elusive. Recently it was shown, by a generalization of the proof of the Cap Set Conjecture, that abelian groups of bounded exponent cannot prove $\omega = 2$ in this framework, which ruled out a family of potential constructions in the literature.

In this paper we study nonabelian groups as potential hosts for an embedding. We prove two main results:

(1) We show that a large class of nonabelian groups (nilpotent groups of bounded exponent satisfying a mild additional condition) cannot prove $\omega = 2$ in this framework. We do this by showing that the shrinkage rate of powers of the augmentation ideal is similar to the shrinkage rate of the number of functions over $(\mathbb{Z}/p\mathbb{Z})^n$ that are degree $d$ polynomials; our proof technique can be seen as a generalization of the polynomial method used to resolve the Cap Set Conjecture.

(2) We show that symmetric groups $S_n$ cannot prove nontrivial bounds on $\omega$ when the embedding is via three Young subgroups, i.e., subgroups of the form $S_{k_1} \times S_{k_2} \times \dotsb \times S_{k_\ell}$, which is a natural strategy that includes all known constructions in $S_n$.

By developing techniques for negative results in this paper, we hope to catalyze a fruitful interplay between the search for constructions proving bounds on $\omega$ and methods for ruling them out.
at December 07, 2017 01:55 AM UTC
Authors: Alfonso Cevallos, Stefan Weltge, Rico Zenklusen
Download: PDF
Abstract: Mixed-integer mathematical programs are among the most commonly used models for a wide set of problems in Operations Research and related fields. However, there is still very little known about what can be expressed by small mixed-integer programs. In particular, prior to this work, it was open whether some classical problems, like the minimum odd-cut problem, can be expressed by a compact mixed-integer program with few (even constantly many) integer variables. This is in stark contrast to linear formulations, where recent breakthroughs in the field of extended formulations have shown that many polytopes associated to classical combinatorial optimization problems do not even admit approximate extended formulations of subexponential size.

We provide a general framework for lifting inapproximability results of extended formulations to the setting of mixed-integer extended formulations, and obtain almost tight lower bounds on the number of integer variables needed to describe a variety of classical combinatorial optimization problems. Among the implications we obtain, we show that any mixed-integer extended formulation of subexponential size for the matching polytope, cut polytope, traveling salesman polytope or dominant of the odd-cut polytope, needs $\Omega(n/\log n)$ many integer variables, where $n$ is the number of vertices of the underlying graph. Conversely, the above-mentioned polyhedra admit polynomial-size mixed-integer formulations with only $O(n)$ or $O(n \log n)$ (for the traveling salesman polytope) many integer variables.

Our results build upon a new decomposition technique that, for any convex set $C$, allows for approximating any mixed-integer description of $C$ by the intersection of $C$ with the union of a small number of affine subspaces.
at December 07, 2017 01:55 AM UTC
Authors: Stefan Felsner, Manfred Scheucher
Download: PDF
Abstract: An arrangement of pseudocircles is a collection of simple closed curves on the sphere or in the plane such that every pair is either disjoint or intersects in exactly two crossing points. We call an arrangement intersecting if every pair of pseudocircles intersects twice. An arrangement is circularizable if there is a combinatorially equivalent arrangement of circles.

Kang and Müller showed that every arrangement of at most 4 pseudocircles is circularizable. Linhart and Ortner found an arrangement of 5 pseudocircles which is not circularizable.

In this paper we show that there are exactly four non-circularizable arrangements of 5 pseudocircles, exactly one of them intersecting. For $n=6$, we show that there are exactly three non-circularizable digon-free intersecting arrangements. We also have some additional examples of non-circularizable arrangements of 6 pseudocircles.

The proofs of non-circularizability use various techniques; most depend on incidence theorems, others use arguments involving metric properties of arrangements of planes, or angles in planar figures. The claims that we have all non-circularizable arrangements with the given properties are based on a program that generated all connected arrangements of $n\leq 6$ pseudocircles and all intersecting arrangements of $n\leq 7$ pseudocircles. Given the complete lists of arrangements, we used heuristics to find circle representations. Examples where the heuristics failed had to be examined by hand.

One of the digon-free intersecting arrangements of pseudocircles with $n=6$ is particularly interesting. It only has $8$ triangles and occurs as a subarrangement of every digon-free intersecting arrangement with less than $2n-4$ triangles for $n=7,8,9$. Hence it may be true that every digon-free intersecting arrangement of circles contains at least $2n-4$ triangles.
at December 07, 2017 01:57 AM UTC
Authors: Lei Shang
Download: PDF
Abstract: This PhD thesis summarizes research work on the design of exact algorithms that provide a worst-case (time or space) guarantee for NP-hard scheduling problems. Both theoretical and practical aspects are considered, with three main results reported. The first one is about a Dynamic Programming algorithm which solves the F3||Cmax problem in O*(3^n) time and space. The algorithm is easily generalized to other flowshop problems and single machine scheduling problems. The second contribution is about a search tree method called Branch & Merge which solves the 1||ΣTi problem with the time complexity converging to O*(2^n) and in polynomial space. Our third contribution aims to improve the practical efficiency of exact search tree algorithms solving scheduling problems. First we realized that a better way to implement the idea of Branch & Merge is to use a technique called Memorization. Through the finding of a new algorithmic paradox and the implementation of a memory cleaning strategy, the method succeeded in solving instances with 300 more jobs than the state-of-the-art algorithm for the 1||ΣTi problem. Then the treatment is extended to another three problems: 1|ri|ΣCi, 1|d̃i|ΣwiCi and F2||ΣCi. The results on the four problems all together show the power of the Memorization paradigm when applied on sequencing problems. We name it Branch & Memorize to promote a systematic consideration of Memorization as an essential building block in branching algorithms like Branch and Bound. The method can surely also be used to solve other problems, which are not necessarily scheduling problems.
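For flavor, here is the textbook O*(2^n) subset dynamic program for single-machine total tardiness, the kind of generic DP that the thesis improves upon (our sketch, not the thesis' Branch & Merge):

from functools import lru_cache

# O*(2^n) dynamic program for 1||Sum Ti: best(mask) is the minimum total
# tardiness of scheduling exactly the jobs in mask first, in some order.
def min_total_tardiness(p, d):
    n = len(p)
    psum = [0] * (1 << n)                 # total processing time of each subset
    for mask in range(1, 1 << n):
        j = (mask & -mask).bit_length() - 1
        psum[mask] = psum[mask ^ (1 << j)] + p[j]

    @lru_cache(maxsize=None)
    def best(mask):
        if mask == 0:
            return 0
        # choose which job of mask finishes last, completing at time psum[mask]
        return min(best(mask ^ (1 << j)) + max(0, psum[mask] - d[j])
                   for j in range(n) if mask >> j & 1)

    return best((1 << n) - 1)

print(min_total_tardiness([3, 2, 4], [2, 4, 6]))  # 5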
at December 07, 2017 12:00 AM UTC
Authors: Michael Schapira, Gal Shahaf
Download: PDF
Abstract: We present novel oblivious routing algorithms for both splittable and unsplittable multicommodity flow. Our algorithm for minimizing congestion for \emph{unsplittable} multicommodity flow is the first oblivious routing algorithm for this setting. As an intermediate step towards this algorithm, we present a novel generalization of Valiant's classical load balancing scheme for packet-switched networks to arbitrary graphs, which is of independent interest. Our algorithm for minimizing congestion for \emph{splittable} multicommodity flow improves upon the state-of-the-art, in terms of both running time and performance, for graphs that exhibit good expansion guarantees. Our algorithms rely on diffusing traffic via iterative applications of the random walk operator. Consequently, the performance guarantees of our algorithms are derived from the convergence of the random walk operator to the stationary distribution and are expressed in terms of the spectral gap of the graph (which dominates the mixing time).
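The diffusion primitive in the last sentences can be pictured in a few lines of numpy (a toy illustration of iterating the lazy random-walk operator, not the paper's routing scheme):

import numpy as np

# Iterate the lazy random-walk operator W = (I + D^{-1} A) / 2 on an initial
# traffic vector; convergence to the stationary distribution is controlled by
# the spectral gap. Assumes every vertex has at least one edge.
def diffuse(adj, demand, steps):
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    walk = 0.5 * (np.eye(len(deg)) + adj / deg[:, None])
    x = np.asarray(demand, dtype=float)
    for _ in range(steps):
        x = walk.T @ x     # one step of the walk applied to the traffic
    return x

# Example: traffic starting at vertex 0 of a 4-cycle spreads toward uniform.
cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
print(diffuse(cycle, [1, 0, 0, 0], steps=50))  # close to [0.25, 0.25, 0.25, 0.25]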
at December 07, 2017 01:54 AM UTC
Authors: Yi Li, Vasileios Nakos
Download: PDF
Abstract: This paper studies the classic problem of finding heavy hitters in the turnstile streaming model. We give the first deterministic linear sketch that has $O(\epsilon^{-2} \log n \cdot \log^*(\epsilon^{-1}))$ rows and answers queries in sublinear time. The number of rows is only a factor of $\log^*(\epsilon^{-1})$ more than that used by the state-of-the-art algorithm prior to our paper due to Nelson, Nguyen and Woodruff (RANDOM'12). Their algorithm runs in time at least linear in the universe size $n$, which is highly undesirable in streaming applications. Our approach is based on an iterative procedure, where most unrecovered heavy hitters are identified in each iteration. Although this technique has been extensively employed in the related problem of sparse recovery, this is the first time, to the best of our knowledge, that it has been used in the context of $\ell_1$ heavy hitters.

Along the way, we also give sublinear time algorithms for the closely related problems of combinatorial group testing and $\ell_1/\ell_1$ compressed sensing, matching the space usage of previous (super)linear time algorithms.
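For comparison, the familiar randomized baseline for turnstile heavy hitters is the Count-Min sketch; the paper's contribution is to achieve this deterministically with sublinear query time (the snippet below is only the standard randomized baseline, in our own rendering):

import random

class CountMin:
    # Standard Count-Min sketch: depth x width counters, one hash row each.
    def __init__(self, width, depth, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def update(self, item, delta=1):
        # Turnstile update: delta may be negative.
        for row, col in self._cells(item):
            self.table[row][col] += delta

    def estimate(self, item):
        # Valid overestimate in the strict turnstile model (net counts >= 0).
        return min(self.table[row][col] for row, col in self._cells(item))

s = CountMin(width=64, depth=4)
for _ in range(30):
    s.update('heavy')
s.update('light'); s.update('light', -1)
print(s.estimate('heavy'))  # 30, up to hash collisions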
at December 07, 2017 01:41 AM UTC
Authors: Diptarka Chakraborty, Debarati Das, Michal Koucký, Nitin Saurabh
Download: PDF
Abstract: A quasi-Gray code of dimension $n$ and length $\ell$ over an alphabet $\Sigma$ is a sequence of distinct words $w_1,w_2,\dots,w_\ell$ from $\Sigma^n$ such that any two consecutive words differ in at most $c$ coordinates, for some fixed constant $c>0$. In this paper we are interested in the read and write complexity of quasi-Gray codes in the bit-probe model, where we measure the number of symbols read and written in order to transform any word $w_i$ into its successor $w_{i+1}$.

We present a construction of quasi-Gray codes of dimension $n$ and length $3^n$ over the ternary alphabet $\{0,1,2\}$ with worst-case read complexity $O(\log n)$ and write complexity $2$. This generalizes to arbitrary odd-size alphabets. For the binary alphabet, we present quasi-Gray codes of dimension $n$ and length at least $2^n - 20n$ with worst-case read complexity $6+\log n$ and write complexity $2$.

Our results significantly improve on previously known constructions, and for the odd-size alphabets we break the $\Omega(n)$ worst-case barrier for space-optimal (non-redundant) quasi-Gray codes with a constant number of writes. We obtain our results via a novel application of algebraic tools together with the principles of catalytic computation [Buhrman et al. '14, Ben-Or and Cleve '92, Barrington '89, Coppersmith and Grossman '75]. We also establish certain limits of our technique in the binary case. Although our techniques cannot give space-optimal quasi-Gray codes with small read complexity over the binary alphabet, our results strongly indicate that such codes do exist.
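For contrast with these constructions, the classic binary reflected Gray code changes exactly one bit per step (write complexity 1), but its natural successor rule may need to read many bits, whereas the paper keeps reads logarithmic. A quick sketch:

# The i-th word of the n-bit reflected Gray code: i XOR (i >> 1).
def gray(i, n):
    return format((i ^ (i >> 1)) % (1 << n), '0{}b'.format(n))

print([gray(i, 3) for i in range(8)])
# ['000', '001', '011', '010', '110', '111', '101', '100']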
at December 07, 2017 12:00 AM UTC
In general, deciding whether a Diophantine equation has any integer solutions is equivalent to the halting problem. I believe that deciding if a quadratic Diophantine equation has any solution is NP-complete. Does there exist a further restriction on the equations involved that yields a P-complete problem?
by Jacob Edelman at December 06, 2017 08:02 PM UTC
The last TCS+ talk of the Fall will take place this coming Wednesday, December 13th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 18:00 UTC). Sebastien Bubeck from MSR will speak about his recent breakthrough with Michael B. Cohen, James R. Lee, Yin Tat Lee, and Aleksander Madry on “k-server via multiscale entropic regularization” (abstract below).
Please make sure you reserve a spot for your group to join us live by signing up on the online form. As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.
Abstract: I will start by describing how mirror descent is a natural strategy for online decision making, specifically in online learning and metrical task systems. To motivate the k-server problem I will also briefly recall what we know and what we don’t know for structured state/action spaces in these models. Using the basic mirror descent calculations I will show how to easily obtain a log(k)-competitive algorithm for k-paging. I will then introduce our new parametrization of fractional k-server on a tree, and explain how to analyze the movement cost of entropy-regularized mirror descent on this parametrization. This leads to a depth*log(k)-competitive (fractional) algorithm for general trees, and log^2(k) for HSTs. I will also briefly mention dynamic embeddings to go beyond the standard log(n) loss in the reduction from general metrics to HSTs.
Joint work with Michael B. Cohen, James R. Lee, Yin Tat Lee, and Aleksander Madry.
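For readers new to the area: mirror descent with the entropy regularizer over the probability simplex is exactly the multiplicative-weights update, sketched here for the simple experts setting (not k-server itself; the names are ours):

import numpy as np

# Entropic mirror descent over the simplex = multiplicative weights.
# loss_vectors: iterable of loss vectors in [0, 1]^n; yields the play each round.
def multiplicative_weights(loss_vectors, eta=0.1):
    w = None
    for loss in loss_vectors:
        loss = np.asarray(loss, dtype=float)
        if w is None:
            w = np.ones_like(loss)
        yield w / w.sum()                # current distribution over actions
        w = w * np.exp(-eta * loss)      # entropic mirror step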
by plustcs at December 06, 2017 06:25 PM UTC