## NP-complete problems

In a previous post, I tried to explain the P vs NP problem. I gave a rough definition of the NP complexity class: NP is the class of the problems for which, for each “yes” instance, I can get convinced in polynomial time that the answer is yes by providing me with a proof that it is indeed the case. And, at the end of that post, I vaguely mentioned (without naming it) the notion of NP-complete problems.

The usual definition of a NP-complete problem is that it’s a problem which belongs to NP, and which is NP-hard. So now I need to explain what NP-hard means. Informally, it’s a problem that is at least as hard as all other problems of the NP class. Another informal way to say that is that it’s at least as hard as another NP-hard problem. The problem with that definition is the bootstrap problem: you’d need a first NP-hard problem to compare it to the others.

The archetypal NP-hard problem is SAT, which was the topic of my previous post. Since SAT is also in NP, SAT is also the archetypal NP-complete problem. It’s not a priori obvious that SAT is NP-hard. But one can prove that even 3-SAT is NP-hard, where 3-SAT represents the CNF instances of SAT where all clauses have exactly 3 literals. I’m not going to go through the proof, because that would imply that I’d explain a lot of things that I kind of don’t want to explain. For the people who are interested in the topic, it’s been proven in 1971, and it is known as the Cook-Levin theorem.

Now that we do have a NP-hard problem, we can find other ones by reduction. We reduce a NP-hard problem (for instance, 3-SAT), to another problem that we want to prove is NP-hard (call that problem A). To do that, we show that if we can solve A efficiently (that is to say, in polynomial time), then we can solve 3-SAT efficiently.

The idea of that kind of proof is to take an arbitrary instance of the NP-hard problem, and we transform it, in polynomial time, into an instance of A that allows to conclude on the original instance. Now suppose that we can solve problem A (that is to say, any instance of problem A) in polynomial time. Then we can take any instance of SAT, transform it into an instance of problem A, solve the instance of problem A, and get a conclusion. The transformation is done in polynomial time; the solving of A is in polynomial time; summing both yields again a polynomial, so we can solve any instance of SAT in polynomial time, so we can solve SAT in polynomial time. Easy.

NP-complete problems are, in a way, “complexity equivalent”: if we know how to solve one in polynomial time, then we can solve all of them in polynomial time as well, and we can solve all problems from NP in polynomial time (since NP-hard problems are at least as hard as any problem from NP). So, if we find a way to solve in polynomial time all the instances of any NP-complete problem… we proved that P = NP. And won a million dollars. And surprised a whole lot of people.

There is a large set of problems that are known to be NP-hard, or NP-complete if they are as well in NP. And there are people who look at… exotic problems, shall we say: let me give a few links for the people who are interested and may get a laugh out of it:

After this fun interlude, I’m going to give an example of reduction, so that you see that kind of proof works. I’m going to prove that the CLIQUE problem is NP-hard. It’s pretty much the basic example that everybody gives when they’re teaching that kind of things, but there must be a reason (I guess: it’s fairly easy to explain, and the reduction is also fairly easy compared to some others). What I’m explaining here is largely coming from the CLRS, that is to say the Cormen, Leiserson, Rivest and Stein, that is to say Introduction to Algorithms – that’s the reference book for A LOT of algorithms classes. (And I have a copy at home and a copy at work – you never know.)

The CLIQUE problem is described as follows: given a graph (a set of vertices and edges) $G$, does it contain a clique of size $k$? A clique of size $k$ is a set of $k$ vertices that are all connected to one another. For instance, this thing below is a clique of size 8. Also note that it does contain cliques of size 1 (single vertex), 2 (an edge), 3 (a triangle), 4, 5, 6, 7, since we can find sets of vertices of that size that are all connected to one another.

Let us reduce 3-SAT to that problem; since 3-SAT is NP-hard, CLIQUE will then be proven to be NP-hard as well. CLIQUE is in NP, because if I provide you with $k$ vertices, it’s possible to check that all these vertices are indeed connected to one another. So, if CLIQUE is NP-hard, CLIQUE is NP-complete.

To start with my reduction, I pick an arbitrary 3-SAT formula – more precisely, a 3-CNF-SAT formula (I explained the meaning of this in my post on SAT), which is a formula that has that look:

$(x_1 \vee x_2 \vee x_3) \wedge (\bar x_2 \vee x_4 \vee \bar x_5) \wedge (x_1 \vee \bar x_2 \vee \bar x_3) \wedge ...$

that is to say, a set of $m$ clauses connected by AND and composed of 3 literals connected by OR.

From there, we’re creating a graph that contains $3m$ vertices. Every vertex corresponds to a literal of the formula; vertices can be duplicated. For the above formula (truncated before the …), this yields the following vertices :

$x_1, x_2, x_3 ; \bar x_2, \bar x_4, \bar x_5 ; x_1, \bar x_2, \bar x_3$.

Then, we add edges almost everywhere. We don’t add edges in the “groups” that correspond to the clauses, and we also do not add edges between literals that are not compatible, that is to say inverse of each other. If I have two literals $x_1$ and $\bar x_1$, I’m not creating an edge between them. For the formula above, this is the graph in question:

And these are the edges that are NOT in the previous graph:

And now that we have that graph, we want to show that the SAT formula can be satisfied if and only if the graph (the first one) has a clique of size $m$, where $m$ is the number of clauses in the SAT formula. So I need to prove two things:

• if the formula can be satisfied, then the graph has a clique of size $m$,
• if the graph has a clique of size $m$, then the formula can be satisfied.

Let’s start with the first point. Suppose the formula can be satisfied. This means that we can assign a value to all the variables so that the formula is satisfied. This means that all clauses can be satisfied by this assignment. This also means that, for each clause, there’s a literal with value 1 (either a “positive” literal, for instance $x_1$ if the variable is assigned to 1, or a “negative” literal, for instance $\bar x_1$, if the variable is assigned to 0). Now remember that we created vertices grouped by “clause”, so for each clause, we can pick the vertex corresponding to that 1-valued literal (and if several literals have value 1 in a clause, we can pick an arbitrary one). Since we pick a vertex by clause, and we have $m$ clauses, we have $m$ vertices. We now want to prove that these $m$ vertices form a clique, which means that there is an edge between every pair of vertices of that set. We constructed the graph such that there is no edge between two vertices of a given clause, and we’re fine there, because we chose exactly one vertex per clause. Moreover, there is no edge between a literal and its negation – we’re also fine there, because we only picked literals that have value 1, and $x_i$ and $\bar x_i$ can’t both have value 1. These are the only conditions for which there is no edge between two vertices; which means that our $m$ vertices are all connected with each other, which yields a clique of size $m$, which is what we want.

Now, let’s look at the second point: if the graph has a clique of size $m$, then the formula can be satisfied. Suppose that the graph has a clique of size $m$. Since the vertices corresponding to the literals of a clause are not connected, this means that we have a vertex for each clause. We can give the value 1 to all the literals corresponding to these vertices. We cannot have a clique containing $x_i$ and $\bar x_i$, because there would be no edge between these two vertices, which goes against the definition of a clique, where all edges are present. So if a clique of size $m$ exists, this means that we found, for each clause, a literal whose value can be 1, and that all of these literals are compatible with each other. And so, we can satisfy the formula corresponding to the graph, which we also wanted to prove.

So, if we can solve CLIQUE in polynomial time, then we can solve 3-SAT in polynomial time; since 3-SAT is NP-hard, CLIQUE is NP-hard, so CLIQUE is NP-complete, which concludes the proof.

Showing that a problem is NP-complete is, in a way, an indicator that the problem in question is difficult, but this needs to be mitigated a lot. For one thing, it does not say anything about a specific instance of the problem. It does not say anything about a specific subset of instances either – let me explain that. If I say that CLIQUE is difficult, it doesn’t mean that, for example, deciding whether a triangle (a clique of size 3) is in a graph is difficult. I can take all sets of 3 vertices, look at whether they form a triangle, and conclude. There are approximately $n^3$ sets of 3 vertices in a graph with $n$ vertices (okay, there’s exactly $\displaystyle \frac{n(n-1)(n-2)}{6}$ – anyway, roughly $n^3$), so I can actually decide that in polynomial time (because I’m doing $n^3$ operations which are basically checking if three edges are in a graph). So I can decide 3-CLIQUE in polynomial time. Well, I’m not going to be a millionaire with that, because the CLIQUE problem is wider than just 3. I can also decide 1000-CLIQUE (clique of size 1000) in polynomial time with the same principle. Well, it’s a polynomial of degree 1000, but who cares 😛

But, in the general case, I cannot decide whether a graph over $n$ vertices contains a clique of $\displaystyle \frac{n}{2}$, or $\displaystyle \frac{n}{10000}$ vertices, or even $\log n$ vertices in polynomial time with this algorithm that looks at all groups of size $k$ (so, here, $k$ would be equal to $\displaystyle \frac{n}{2}$, $\displaystyle \frac{n}{10000}$, or $\log n$) and that looks at the edges of the group, because that would mean doing roughly $n^{n/2}$, $n^{n/10000}$ and $n^{\log n}$ operations, respectively, and that in any case it’s not polynomial. And, generally speaking, nobody knows if it’s possible to do that in polynomial time, since CLIQUE is NP-complete and we don’t know if P is equal to NP. Many people think that not, but nobody knows.

Even that does not say anything about the difficulty of a specific instance. If I’m given a graph that contains $\displaystyle \frac{n(n-1)}{2}$ edges for $n$ vertices, then I can decide pretty easily that it does contain a clique of size $\displaystyle \frac{n}{2}$, $\displaystyle \frac{n}{10000}$ and $\log n$. That’s because a graph that has $\displaystyle \frac{n(n-1)}{2}$ for $n$ vertices contains all the possible edges for the graph: it’s a clique on $n$ vertices… which means that every subgraph over $k$ vertices, for all $k \leq n$, is a clique of size $k$. Same thing if I’m given a graph over $n$ vertices with 0 edge, it’s not going to be hard to say that it doesn’t contain a clique of size $\displaystyle \frac n 2$.

SAT is also NP-hard, but can solve 2-CNF-SAT instances (where all clauses contain 2 literals) in polynomial, and even linear time. There are also SAT instances that are deemed “easy”, for instance the ones that have only few dependencies between the clauses: we know that we can to solve them, and we even know how (for the people who are interested in what I mean exactly by what I mean by “a few dependencies”, I’m thinking specifically about “the ones that are in the scope of the Lovász Local Lemma“, whose application to SAT is not trivial from the Wikipedia definition, but which might be clearer here).

Generally speaking, it may even be pretty difficult to create NP-hard instances of problems that we cannot solve in polynomial time. By the way, it’s not easy to show that we don’t know how to solve a given instance in polynomial time. We can find instances problem instances that are hard (i.e. require exponential time) for a given algorithm, and easy for another algorithm…

And that’s a bit of what happens in real life: we have instance of problems for which we know that the general case is NP-hard (or worse), and that we still solve without great difficulty. It can be that the problem is “small” (it’s not a huge problem to put an exponential algorithm on a problem of size 4), but it can also be that the problem belongs to an “easy” subset of instances. Considering that an instance of a problem is hard because the general problem is hard, and giving up on it (because we won’t know how to do it) may not be a good idea – it may be that we forgot to consider an easy restriction of the problem!

As usual, “it’s not that simple”. I’ve spent a fair amount of time with theory, and I tend to look at the problems themselves more than at given instances. But it’s sometimes nice to remember that stuff that’s “theoretically hard” can end up being “practically easy” – it’s a nice change from the things that are theoretically easy but that we don’t know how to do practically… 🙂

## The SAT problem

Note: this is a translation with some re-writing of a previously published post: Le problème SAT.

I’m going to explain what the SAT problem is, because I’m going to talk about it a bit more soon. It’s one of the fundamental blocks around the P vs NP question; I’m going to write a blog post about NP-hard problems, but writing a post about SAT specifically is worth it. Also: I did my master thesis on SAT-related questions, so it’s a bit of a favorite topic of mine 🙂

SAT is a common abbreviation for “boolean satisfiability problem”. Everything revolves around the concept of boolean formulas, and a boolean formula can look like this:

$(x \vee y \vee z) \wedge (\bar x \vee \bar y) \wedge (\bar x \vee y \vee \bar z)$

Let me unpack this thing. The first concept is variables: here, $x$, $y$ or $z$. They’re kind of like the “unknowns” of an equation: we want to find values for $x$, $y$ and $z$. Moreover, we’re in a somewhat weird universe, the boolean universe: the only values $x$, $y$ and $z$ can take are 0 or 1.

Then, we have some weird symbols. $\vee$ means “or”, and $\wedge$ means “and”. And the small bars above some of the letters indicate a negation. We combine these symbols with variables to get expressions, whose value can be computed as follow:

• If $x = 1$, then $\bar x = 0$, otherwise $\bar x = 1$ ($\bar x$ is the opposite value of $x$). We read this “not $x$“.
• If $x = 1$, then $x \vee y = 1$; if $y = 1$, then $x \vee y = 1$; if $x = 0$ and $y = 0$, then $x \vee y = 0$. In other words, $x \vee y = 1$ if $x = 1$ or $y = 1$. Note that in this context, when we say “or”, we do not use exactly the same meaning than in everyday language: we mean “if $x = 1$ or if $y = 1$ or if both are equal to 1″. We read this: “$x$ or $y$“.
• If $x = 1$ and $y = 1$, then $x \wedge y = 1$, otherwise $x \wedge y = 0$. We read this: “$x$ and $y$“.

I’m summarizing all this in this table, for every possible value of $x$ and $y$:

$\begin{array}{|c|c||c|c|c|c|} \hline x & y & \bar x &\bar y & x \vee y & x \wedge y \\ \hline 0 & 0 & 1 & 1 & 0 & 0 \\ \hline 0 & 1 & 1 & 0 & 1 & 0 \\ \hline 1 & 0 & 0 & 1 & 1 & 0 \\ \hline 1 & 1 & 0 & 0 & 1 & 1 \\ \hline \end{array}$

We can combine these expressions any way that we want to get boolean formulas. So the following examples are boolean formulas:

$x \wedge \bar y \wedge z \wedge t \wedge \bar u$
$(x \vee y) \wedge (\bar x \vee z) \wedge ((x \wedge z) \vee (z \wedge \bar t))$
$(x \vee y \vee z) \wedge (\bar x \vee \bar y) \wedge (\bar x \vee y \vee \bar z)$
$(x \wedge y \wedge z) \vee (\bar x \wedge \bar y) \vee (\bar x \wedge y \wedge \bar z)$

And then, by assigning values (0 or 1) to the individual variables, we can evaluate a formula for the assignment, that is to say, say if the value that the formula “computes” is 0 or 1. If the formula evaluates to 1 with an assignment, we say that this assignment satisfies the formula. And we say that a formula is satisfiable if there exists an assignment that makes it evaluate to 1 – hence the name of the problem, “boolean satisfiability”.

Boolean satisfiability in general is kind of “annoying” because it doesn’t have much structure to it: it’s hard to think about and it’s hard to say much about it, really. One thing that is always feasible, however, is to look at all the possibilities for all the variables, and see if there exists a combination of values that satisfies the formula. Picking one of the example above, and calling the formula $F$:

$F = (x \vee y \vee z) \wedge (\bar x \vee \bar y) \wedge (\bar x \vee y \vee z)$

And let’s enumerate all possibilities:

$\begin{array}{|c|c|c||c|c|c||c|} \hline x & y & z & x \vee y \vee z & \bar x \vee \bar y & \bar x \vee y \vee z & F \\ \hline 0 & 0 & 0 & 0 & 1 & 1 & 0 \\ \hline 0 & 0 & 1 & 1 & 1 & 1 & 1 \\ \hline 0 & 1 & 0 & 1 & 1 & 1 & 1\\ \hline 0 & 1 & 1 & 1 & 1 & 1 & 1\\ \hline 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ \hline 1 & 0 & 1 & 1 & 1 & 1 & 1 \\ \hline 1 & 1 & 0 & 1 & 0 & 1 & 0\\ \hline 1 & 1 & 1 & 1 & 0 & 1 & 0\\ \hline \end{array}$

This table answers the question “is there values of $x$, $y$ and $z$ for which $F$ evaluates to 1 (the answer is yes) and it even goes further by giving said values (for instance, $x=1$, $y = 0$ and $z = 1$ are values that satisfy the formula).

The issue is that it’s not really manageable to write such a table as soon as we get a lot of variables. The reason for that is that, for every variable, I need to check what happens for its 0 value and for its 1 value. So I have two choices for the first variable, two choices for the second variable, two choices for the third variable, and so on. The choices multiply with each other: similarly to the table above, I need a line for every possible combination of values of variables. For 3 variables, $2\times 2 \times 2 = 2^8 = 8$ lines. For 5 variables, $2 \times 2 \times 2 \times 2 \times 2 = 2^5 = 32$ lines, that starts to be annoying to do by hand. For 20 variables, $2^{20} = 1,048,576$ lines, the result is not necessarily instant anymore on a computer. And it continues growing: for $n$ variables, I need $2^n$ lines, which it grows faster and faster: the joys of power functions.

For the people who followed the previous explanations, this is NOT a polynomial time algorithm: it’s an exponential time algorithm. (Not even considering the fact that I’m only enumerating the possibilities and not evaluating the formula yet).

Since the general SAT problem seems fairly complicated to tackle, one can wonder if there are “subgroups” that are easy to solve, or that are easier to think about. And, indeed, there are such subgroups.

Suppose that, instead of getting an arbitrary combination of variables and symbols, I restrict myself of formulas such as this one: $(x \wedge y \wedge z) \vee (y \wedge \bar z \wedge t) \vee (u \wedge \bar y \wedge \bar t)$. That’s a form where I have “groups” of variables linked by “$\wedge$“, themselves grouped by $\vee$. Then that’s an “easy” case: for the formula to be satisfiable, I only need one of the “groups” to evaluate to 1 (because they are linked by “or” operators, so the formula evaluates to 1 if at least one of the groups evaluates to 1). And a given group evaluates to 1, unless it contains both one variable and its negation, like $x \wedge y \wedge \bar y$. In that case, since within a group I need all the variables to be equal to 1 for the group to evaluate to 1, and since a variable and its negation cannot be equal both to 1, the group would evaluate to 0. So unless all the groups contain a variable and its negation, the formula is satisfiable – a fairly easy condition to verify.

If I consider the “opposite” of the previous setup, with formulas such as $(x \vee y \vee z) \wedge (\bar x \vee \bar y) \wedge (\bar x \vee y \vee z)$, we have “groups” of variables linked by $\vee$, themselves grouped by $\wedge$. This type of formulas are called CNF formulas (where CNF means “conjunctive normal form”). A CNF formula is defined as a set of clauses (the “groups”), all linked by $\wedge$ symbols (“and”). A clause is a set of one or several literals that are linked by $\vee$ symbols (“or”). A literal is either a variable (for example $x$) or its negation (for example $\bar x$).

The question asked by an instance of the SAT problem is “can I find values for all variables such that the whole formula evaluates to 1?”. If we restrict ourselves to CNF formulas (and call the restricted problem CNF-SAT), this means that we want that all the clauses evaluate to 1, because since they are linked by “and” in the formula, I need that all of its clauses evaluate to 1. And for each clause to have value 1, I need at least one of the literals to have value 1 in the clause (I want the first literal or the second literal or the third literal or… to be equal to 1). As previously mentioned: it can happen that all the literals are equal to 1 – the clause will still be equal to 1.

It turns out that we can again look at subgroups within CNF formulas. Let us consider CNF formulas that have exactly 2 literals in all clauses – they are called 2-CNF formulas. The associated problem is called 2-CNF-SAT, or more often 2-SAT. Well, 2-SAT can be solved in linear time. This means that there exists an algorithm that, given any 2-CNF formula, can tell in linear time whether the formula is satisfiable or not.

Conversely, for CNF formulas that have clauses that have more than 2 literals… in general, we don’t know. We can even restrict ourselves to CNF formulas that have exactly 3 literals in each clause and still not know. The associated problem is called 3-CNF-SAT, or (much) more often 3-SAT.

From a “complexity class” point of view, 3-SAT is in NP. If I’m given a satisfiable 3-CNF formula and values for all variables of the formula, I can check in polynomial time that it works: I can check that a formula is satisfiable if I have a proof of it. (This statement is also true for general SAT formulas.)

What we do not know, however, is whether 3-SAT is in P: we do not know if there exists a polynomial time algorithm that allows us to decide if a 3-CNF formula can be satisfied or not in the general case. The difficulty of the problem still does not presume of the difficulty of individual instances. In general, we do not know how to solve the 3-SAT problem; in practice, 3-SAT instances are solved every day.

We also do not know whether 3-SAT is outside of P: we do not know if we need an algorithm that is “more powerful” than a polynomial time algorithm to solve the problem in the general case. To answer that question (and to prove it) would actually allow to solve the P vs NP problem – but to explain that, I need to explain NP-hard problems, and that’s something for the next post, in which I’ll also explain why it’s pretty reasonable to restrict oneself to 3-SAT (instead of the general SAT problem) when it comes to studying the theory of satisfiability 🙂