Abstract
Common sub-expression elimination (CSE) is an important optimization for compilers, which is similar to partial redundancies elimination optimization.
CSE is designed to eliminate those expressions with identical and semantically equivalent components, with consideration for some properties like commutativity, associativity of operators.
For LLVM, there is EarlyCSE pass as one of implementation for CSE. The "Early" in EarlyCSE means that simple, fast and can be applied in every stages it needs.
A Top-Down View
EarlyCSE iterates down all BasicBlocks in DFS order within dom-tree (only once), which guarantees that expressions in current expressions will be dominated after the expressions iterated before.
Besides, EarlyCSE tags every Node(or BasicBlock) with a generation number for memory instructions, since memory insts in LLVM doesn't fit into SSA, which we must hack in other ways. And every time we meet a branch (current BB has more than one predecessors), we have to increment generation by one.
If this block has a single predecessor, then the predecessor is the parent of the domtree node and all of the live out memory values are still current in this block. If this block has multiple predecessors, then they could have invalidated the live-out memory values of our parent value. For now, just be conservative and invalidate memory if this block has multiple predecessors.
Then, in processBlock
function, we handle the most key case where "SimpleValue" can handle. We maintain a hash table called "AvailableValues". And when we encounter an instruction, we lookup this table for the hash value of the instruction. If no such hash in table, insert it. Otherwise, we compare whether those with the same hash is equivalent in instruction level. If equivalent, we replace the latter with the former higher in dom-tree.
In this way, we handle the most SSA. Memory operations are discussed later.
How is the available values maintained?
When DFS the dom-tree, EarlyCSE actually maintains a scoped map and a stack (emulating the function stack). When entering a new BB, push a Node to the stack and insert relevant hash in BB. When exiting BB, pop the Node and erase relevant hash in BB.
How is lookup implemented?
Let's take a look at getHashValueImpl
of SimpleValue:
1 | static unsigned getHashValueImpl(SimpleValue Val) { |
As we can see, \(hash(binop) = hash(opcode, lhs, rhs)\), where \(lhs\) is the pointer of lhs, \(rhs\) is that of rhs. It means that what we can eliminate once is those instruction with the same references/pointers of the same value.
For the DFS order in dom-tree, for the same two \(op(a,b)\) in BB1 and BB2, only when BB1 dominates BB2 or BB2 dominates BB1, can we eliminate them. However, GVN could solve it for its RPO iteration order (More expensive one).
Besides, IR flags like nsw, nuw
having no effect on the what IR actually does are ignored.
With such easy implementation, EarlyCSE is cheap with \(O(n)\) time, but less effective than GVN.
Ignore/Combine IR flag
When hashing instructions, we always ignore the flags like nsw, nuw
. But for memory instructions, we will combine the flags like matching id, atomicity.
1 | AvailableLoads.insert(MemInst.getPointerOperand(), |
Memory CSE
EarlyCSE eliminates memory operations mostly based on Memory SSA analysis. And it records the generation of BasicBlock. Currently, such generation is equivalent to the iteration order number (or DFS number) of BasicBlocks.
If generations of two memory operations differs, we can't state they are identical, since the live-out memory parental value could be invalidated by multiple predecessors.
In processNode
function, EarlyCSE handles some trivial dead store elimination.
1 | /// LastStore - Keep track of the last non-volatile store that we saw... for |
For non-trivial memory operations, EarlyCSE applies specific methods. Let's take a look at its implementation after lookup:
1 | ParseMemoryInst MemInst(&Inst, TTI); |
Here we drop the last store, since volatile/ordered memory operation make the store unCSEable.
1 |
|
For invariant loop, its memory location, or pointer will keep invariant in later stages. So we keep the earliest load, to maximize its effect.
1 | // If we have an available version of this load, and if it is the right |
Similar to SimpleValue case, besides getting matching value throught MemorySSA.
Difference between GVN and EarlyCSE
To be continued