LLVM源码解析- EarlyCSE

Abstract

Common sub-expression elimination (CSE) is an important optimization for compilers, which is similar to partial redundancies elimination optimization.
CSE is designed to eliminate those expressions with identical and semantically equivalent components, with consideration for some properties like commutativity, associativity of operators.
For LLVM, there is EarlyCSE pass as one of implementation for CSE. The "Early" in EarlyCSE means that simple, fast and can be applied in every stages it needs.

A Top-Down View

EarlyCSE iterates down all BasicBlocks in DFS order within dom-tree (only once), which guarantees that expressions in current expressions will be dominated after the expressions iterated before.

Besides, EarlyCSE tags every Node(or BasicBlock) with a generation number for memory instructions, since memory insts in LLVM doesn't fit into SSA, which we must hack in other ways. And every time we meet a branch (current BB has more than one predecessors), we have to increment generation by one.

If this block has a single predecessor, then the predecessor is the parent of the domtree node and all of the live out memory values are still current in this block. If this block has multiple predecessors, then they could have invalidated the live-out memory values of our parent value. For now, just be conservative and invalidate memory if this block has multiple predecessors.

Then, in processBlock function, we handle the most key case where "SimpleValue" can handle. We maintain a hash table called "AvailableValues". And when we encounter an instruction, we lookup this table for the hash value of the instruction. If no such hash in table, insert it. Otherwise, we compare whether those with the same hash is equivalent in instruction level. If equivalent, we replace the latter with the former higher in dom-tree.

In this way, we handle the most SSA. Memory operations are discussed later.

How is the available values maintained?

When DFS the dom-tree, EarlyCSE actually maintains a scoped map and a stack (emulating the function stack). When entering a new BB, push a Node to the stack and insert relevant hash in BB. When exiting BB, pop the Node and erase relevant hash in BB.

How is lookup implemented?

Let's take a look at getHashValueImpl of SimpleValue:

static unsigned getHashValueImpl(SimpleValue Val) {
  Instruction *Inst = Val.Inst;
  // Hash in all of the operands as pointers.
  if (BinaryOperator *BinOp = dyn_cast<BinaryOperator>(Inst)) {
    Value *LHS = BinOp->getOperand(0);
    Value *RHS = BinOp->getOperand(1);
    if (BinOp->isCommutative() && BinOp->getOperand(0) > BinOp->getOperand(1))
      std::swap(LHS, RHS);

    return hash_combine(BinOp->getOpcode(), LHS, RHS);
  }

  if (CmpInst *CI = dyn_cast<CmpInst>(Inst)) {
    // Compares can be commuted by swapping the comparands and
    // updating the predicate.  Choose the form that has the
    // comparands in sorted order, or in the case of a tie, the
    // one with the lower predicate.
    Value *LHS = CI->getOperand(0);
    Value *RHS = CI->getOperand(1);
    CmpInst::Predicate Pred = CI->getPredicate();
    CmpInst::Predicate SwappedPred = CI->getSwappedPredicate();
    if (std::tie(LHS, Pred) > std::tie(RHS, SwappedPred)) {
      std::swap(LHS, RHS);
      Pred = SwappedPred;
    }
    return hash_combine(Inst->getOpcode(), Pred, LHS, RHS);
  }
  ....
}

As we can see, $h a s h (b i n o p) = h a s h (o p c o d e, l h s, r h s)$ , where $l h s$ is the pointer of lhs, $r h s$ is that of rhs. It means that what we can eliminate once is those instruction with the same references/pointers of the same value.

For the DFS order in dom-tree, for the same two $o p (a, b)$ in BB1 and BB2, only when BB1 dominates BB2 or BB2 dominates BB1, can we eliminate them. However, GVN could solve it for its RPO iteration order (More expensive one).

Besides, IR flags like nsw, nuw having no effect on the what IR actually does are ignored.

With such easy implementation, EarlyCSE is cheap with $O (n)$ time, but less effective than GVN.

Ignore/Combine IR flag

When hashing instructions, we always ignore the flags like nsw, nuw. But for memory instructions, we will combine the flags like matching id, atomicity.

AvailableLoads.insert(MemInst.getPointerOperand(),
                              LoadValue(&Inst, CurrentGeneration,
                                        MemInst.getMatchingId(),
                                        MemInst.isAtomic(),
                                        MemInst.isLoad()));

Memory CSE

EarlyCSE eliminates memory operations mostly based on Memory SSA analysis. And it records the generation of BasicBlock. Currently, such generation is equivalent to the iteration order number (or DFS number) of BasicBlocks.

If generations of two memory operations differs, we can't state they are identical, since the live-out memory parental value could be invalidated by multiple predecessors.

In processNode function, EarlyCSE handles some trivial dead store elimination.

/// LastStore - Keep track of the last non-volatile store that we saw... for
/// as long as there in no instruction that reads memory.  If we see a store
/// to the same location, we delete the dead store.  This zaps trivial dead
/// stores which can occur in bitfield code among other things.
Instruction *LastStore = nullptr;

For non-trivial memory operations, EarlyCSE applies specific methods. Let's take a look at its implementation after lookup:

ParseMemoryInst MemInst(&Inst, TTI);
// If this is a non-volatile load, process it.
if (MemInst.isValid() && MemInst.isLoad()) {
  if (MemInst.isVolatile() || !MemInst.isUnordered()) {
    LastStore = nullptr;
    ++CurrentGeneration;
  }

Here we drop the last store, since volatile/ordered memory operation make the store unCSEable.

if (MemInst.isInvariantLoad()) {
  // If we pass an invariant load, we know that memory location is
  // indefinitely constant from the moment of first dereferenceability.
  // We conservatively treat the invariant_load as that moment.  If we
  // pass a invariant load after already establishing a scope, don't
  // restart it since we want to preserve the earliest point seen.
  auto MemLoc = MemoryLocation::get(&Inst);
  if (!AvailableInvariants.count(MemLoc))
    AvailableInvariants.insert(MemLoc, CurrentGeneration);
}

For invariant loop, its memory location, or pointer will keep invariant in later stages. So we keep the earliest load, to maximize its effect.

// If we have an available version of this load, and if it is the right
// generation or the load is known to be from an invariant location,
// replace this instruction.
//
// If either the dominating load or the current load are invariant, then
// we can assume the current load loads the same value as the dominating
// load.
LoadValue InVal = AvailableLoads.lookup(MemInst.getPointerOperand());
if (Value *Op = getMatchingValue(InVal, MemInst, CurrentGeneration)) {
  // Something related to debug information
  if (InVal.IsLoad)
    if (auto *I = dyn_cast<Instruction>(Op))
      combineMetadataForCSE(I, &Inst, false);
  if (!Inst.use_empty())
    Inst.replaceAllUsesWith(Op);
  // Something related to updating analysis and debug information
  Inst.eraseFromParent();
  Changed = true;
  ++NumCSELoad;
  continue;
}

Similar to SimpleValue case, besides getting matching value throught MemorySSA.

Difference between GVN and EarlyCSE

To be continued