sys-devel/llvm/files/cherry/bb7d3af1139c36270bc9948605e06f40e4c51541.patch - third_party/overlays/chromiumos-overlay - Git at Google

 commit bb7d3af1139c36270bc9948605e06f40e4c51541
 Author: Roman Lebedev <lebedev.ri@gmail.com>
 Date:   Mon Sep 7 23:54:06 2020 +0300

     Reland [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline

     This was reverted in 503deec2183d466dad64b763bab4e15fd8804239
     because it caused gigantic increase (3x) in branch mispredictions
     in certain benchmarks on certain CPU's,
     see https://reviews.llvm.org/D84108#2227365.

     It has since been investigated and here are the results:
     https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
     > It's an amazingly severe regression, but it's also all due to branch
     > mispredicts (about 3x without this). The code layout looks ok so there's
     > probably something else to deal with. I'm not sure there's anything we can
     > reasonably do so we'll just have to take the hit for now and wait for
     > another code reorganization to make the branch predictor a bit more happy :)
     >
     > Thanks for giving us some time to investigate and feel free to recommit
     > whenever you'd like.
     >
     > -eric

     So let's just reland this.
     Original commit message:


     I've been looking at missed vectorizations in one codebase.
     One particular thing that stands out is that some of the loops
     reach vectorizer in a rather mangled form, with weird PHI's,
     and some of the loops aren't even in a rotated form.

     After taking a more detailed look, that happened because
     the loop's headers were too big by then. It is evident that
     SimplifyCFG's common code hoisting transform is at fault there,
     because the pattern it handles is precisely the unrotated
     loop basic block structure.

     Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
     by default, and is always run, unlike it's friend, common code sinking
     transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
     by default and is only run once very late in the pipeline.

     I'm proposing to harmonize this, and disable common code hoisting
     until //late// in pipeline. Definition of //late// may vary,
     here currently i've picked the same one as for code sinking,
     but i suppose we could enable it as soon as right after
     loop rotation happens.

     Experimentation shows that this does indeed unsurprizingly help,
     more loops got rotated, although other issues remain elsewhere.

     Now, this undoubtedly seriously shakes phase ordering.
     This will undoubtedly be a mixed bag in terms of both compile- and
     run- time performance, codesize. Since we no longer aggressively
     hoist+deduplicate common code, we don't pay the price of said hoisting
     (which wasn't big). That may allow more loops to be rotated,
     so we pay that price. That, in turn, that may enable all the transforms
     that require canonical (rotated) loop form, including but not limited to
     vectorization, so we pay that too. And in general, no deduplication means
     more [duplicate] instructions going through the optimizations. But there's still
     late hoisting, some of them will be caught late.

     As per benchmarks i've run {F12360204}, this is mostly within the noise,
     there are some small improvements, some small regressions.
     One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
     this will expose many more pre-existing missed optimizations, as usual :S

     llvm-compile-time-tracker.com thoughts on this:
     http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
     * this does regress compile-time by +0.5% geomean (unsurprizingly)
     * size impact varies; for ThinLTO it's actually an improvement

     The largest fallout appears to be in GVN's load partial redundancy
     elimination, it spends *much* more time in
     `MemoryDependenceResults::getNonLocalPointerDependency()`.
     Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
     There does not appear to be a proper solution to this issue,
     other than silencing the compile-time performance regression
     by tuning cut-off thresholds in `MemoryDependenceResults`,
     at the cost of potentially regressing run-time performance.
     D84609 attempts to move in that direction, but the path is unclear
     and is going to take some time.

     If we look at stats before/after diffs, some excerpts:
     * RawSpeed (the target) {F12360200}
       * -14 (-73.68%) loops not rotated due to the header size (yay)
       * -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
       * -3937 (-64.19%) common instructions hoisted
       * +561 (+0.06%) x86 asm instructions
       * -2 basic blocks
       * +2418 (+0.11%) IR instructions
     * vanilla test-suite + RawSpeed + darktable  {F12360201}
       * -36396 (-65.29%) common instructions hoisted
       * +1676 (+0.02%) x86 asm instructions
       * +662 (+0.06%) basic blocks
       * +4395 (+0.04%) IR instructions

     It is likely to be sub-optimal for when optimizing for code size,
     so one might want to change tune pipeline by enabling sinking/hoisting
     when optimizing for size.

     Reviewed By: mkazantsev

     Differential Revision: https://reviews.llvm.org/D84108

     This reverts commit 503deec2183d466dad64b763bab4e15fd8804239.

 diff --git a/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h b/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
 index 46f6ca0462f..fb3a7490346 100644
 --- a/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
 +++ b/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
 @@ -25,7 +25,7 @@ struct SimplifyCFGOptions {
    bool ForwardSwitchCondToPhi = false;
    bool ConvertSwitchToLookupTable = false;
    bool NeedCanonicalLoop = true;
 -  bool HoistCommonInsts = true;
 +  bool HoistCommonInsts = false;
    bool SinkCommonInsts = false;
    bool SimplifyCondBranch = true;
    bool FoldTwoEntryPHINode = true;
 diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
 index 9df6a985789..9a2e895d7b7 100644
 --- a/llvm/lib/Passes/PassBuilder.cpp
 +++ b/llvm/lib/Passes/PassBuilder.cpp
 @@ -1160,11 +1160,14 @@ ModulePassManager PassBuilder::buildModuleOptimizationPipeline(
    // convert to more optimized IR using more aggressive simplify CFG options.
    // The extra sinking transform can create larger basic blocks, so do this
    // before SLP vectorization.
 -  OptimizePM.addPass(SimplifyCFGPass(SimplifyCFGOptions().
 -                                     forwardSwitchCondToPhi(true).
 -                                     convertSwitchToLookupTable(true).
 -                                     needCanonicalLoops(false).
 -                                     sinkCommonInsts(true)));
 +  // FIXME: study whether hoisting and/or sinking of common instructions should
 +  //        be delayed until after SLP vectorizer.
 +  OptimizePM.addPass(SimplifyCFGPass(SimplifyCFGOptions()
 +                                         .forwardSwitchCondToPhi(true)
 +                                         .convertSwitchToLookupTable(true)
 +                                         .needCanonicalLoops(false)
 +                                         .hoistCommonInsts(true)
 +                                         .sinkCommonInsts(true)));

    // Optimize parallel scalar instruction chains into SIMD instructions.
    if (PTO.SLPVectorization)
 diff --git a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
 index 8b15898c1c1..d7a14a3dc77 100644
 --- a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
 +++ b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
 @@ -455,6 +455,7 @@ void AArch64PassConfig::addIRPasses() {
                                              .forwardSwitchCondToPhi(true)
                                              .convertSwitchToLookupTable(true)
                                              .needCanonicalLoops(false)
 +                                            .hoistCommonInsts(true)
                                              .sinkCommonInsts(true)));

    // Run LoopDataPrefetch
 diff --git a/llvm/lib/Target/ARM/ARMTargetMachine.cpp b/llvm/lib/Target/ARM/ARMTargetMachine.cpp
 index 55ac332e2c6..5068f9b5a0f 100644
 --- a/llvm/lib/Target/ARM/ARMTargetMachine.cpp
 +++ b/llvm/lib/Target/ARM/ARMTargetMachine.cpp
 @@ -407,7 +407,8 @@ void ARMPassConfig::addIRPasses() {
    // ldrex/strex loops to simplify this, but it needs tidying up.
    if (TM->getOptLevel() != CodeGenOpt::None && EnableAtomicTidy)
      addPass(createCFGSimplificationPass(
 -        SimplifyCFGOptions().sinkCommonInsts(true), [this](const Function &F) {
 +        SimplifyCFGOptions().hoistCommonInsts(true).sinkCommonInsts(true),
 +        [this](const Function &F) {
            const auto &ST = this->TM->getSubtarget<ARMSubtarget>(F);
            return ST.hasAnyDataBarrier() && !ST.isThumb1Only();
          }));
 diff --git a/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp b/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
 index 6728306db3d..37cf391c998 100644
 --- a/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
 +++ b/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
 @@ -327,6 +327,7 @@ void HexagonPassConfig::addIRPasses() {
                                                .forwardSwitchCondToPhi(true)
                                                .convertSwitchToLookupTable(true)
                                                .needCanonicalLoops(false)
 +                                              .hoistCommonInsts(true)
                                                .sinkCommonInsts(true)));
      if (EnableLoopPrefetch)
        addPass(createLoopDataPrefetchPass());
 diff --git a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
 index 326d1ab28b6..caa9a98ecb0 100644
 --- a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
 +++ b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
 @@ -784,10 +784,13 @@ void PassManagerBuilder::populateModulePassManager(
    // convert to more optimized IR using more aggressive simplify CFG options.
    // The extra sinking transform can create larger basic blocks, so do this
    // before SLP vectorization.
 +  // FIXME: study whether hoisting and/or sinking of common instructions should
 +  //        be delayed until after SLP vectorizer.
    MPM.add(createCFGSimplificationPass(SimplifyCFGOptions()
                                            .forwardSwitchCondToPhi(true)
                                            .convertSwitchToLookupTable(true)
                                            .needCanonicalLoops(false)
 +                                          .hoistCommonInsts(true)
                                            .sinkCommonInsts(true)));

    if (SLPVectorize) {
 diff --git a/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp b/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
 index db5211df397..b0435bf6e4e 100644
 --- a/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
 +++ b/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
 @@ -63,8 +63,8 @@ static cl::opt<bool> UserForwardSwitchCond(
      cl::desc("Forward switch condition to phi ops (default = false)"));

  static cl::opt<bool> UserHoistCommonInsts(
 -    "hoist-common-insts", cl::Hidden, cl::init(true),
 -    cl::desc("hoist common instructions (default = true)"));
 +    "hoist-common-insts", cl::Hidden, cl::init(false),
 +    cl::desc("hoist common instructions (default = false)"));

  static cl::opt<bool> UserSinkCommonInsts(
      "sink-common-insts", cl::Hidden, cl::init(false),
 diff --git a/llvm/test/Transforms/PGOProfile/chr.ll b/llvm/test/Transforms/PGOProfile/chr.ll
 index c2e1ae4f53a..1a22d7f0b84 100644
 --- a/llvm/test/Transforms/PGOProfile/chr.ll
 +++ b/llvm/test/Transforms/PGOProfile/chr.ll
 @@ -2006,9 +2006,16 @@ define i64 @test_chr_22(i1 %i, i64* %j, i64 %v0) !prof !14 {
  ; CHECK-NEXT:  bb0:
  ; CHECK-NEXT:    [[REASS_ADD:%.*]] = shl i64 [[V0:%.*]], 1
  ; CHECK-NEXT:    [[V2:%.*]] = add i64 [[REASS_ADD]], 3
 +; CHECK-NEXT:    [[C1:%.*]] = icmp slt i64 [[V2]], 100
 +; CHECK-NEXT:    br i1 [[C1]], label [[BB0_SPLIT:%.*]], label [[BB0_SPLIT_NONCHR:%.*]], !prof !15
 +; CHECK:       bb0.split:
  ; CHECK-NEXT:    [[V299:%.*]] = mul i64 [[V2]], 7860086430977039991
  ; CHECK-NEXT:    store i64 [[V299]], i64* [[J:%.*]], align 4
  ; CHECK-NEXT:    ret i64 99
 +; CHECK:       bb0.split.nonchr:
 +; CHECK-NEXT:    [[V299_NONCHR:%.*]] = mul i64 [[V2]], 7860086430977039991
 +; CHECK-NEXT:    store i64 [[V299_NONCHR]], i64* [[J]], align 4
 +; CHECK-NEXT:    ret i64 99
  ;
  bb0:
    %v1 = add i64 %v0, 3
 diff --git a/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll b/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
 index 1d8cce6879e..314af1c1414 100644
 --- a/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
 +++ b/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
 @@ -5,14 +5,11 @@
  ; RUN: opt -O3 -rotation-max-header-size=1 -S < %s                    | FileCheck %s --check-prefixes=HOIST,THR1,FALLBACK2
  ; RUN: opt -passes='default<O3>' -rotation-max-header-size=1 -S < %s  | FileCheck %s --check-prefixes=HOIST,THR1,FALLBACK3

 -; RUN: opt -O3 -rotation-max-header-size=2 -S < %s                    | FileCheck %s --check-prefixes=HOIST,THR2,FALLBACK4
 -; RUN: opt -passes='default<O3>' -rotation-max-header-size=2 -S < %s  | FileCheck %s --check-prefixes=HOIST,THR2,FALLBACK5
 +; RUN: opt -O3 -rotation-max-header-size=2 -S < %s                    | FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_OLDPM,FALLBACK4
 +; RUN: opt -passes='default<O3>' -rotation-max-header-size=2 -S < %s  | FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_NEWPM,FALLBACK5

 -; RUN: opt -O3 -rotation-max-header-size=3 -S < %s                    | FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_OLDPM,FALLBACK6
 -; RUN: opt -passes='default<O3>' -rotation-max-header-size=3 -S < %s  | FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_NEWPM,FALLBACK7
 -
 -; RUN: opt -O3 -rotation-max-header-size=4 -S < %s                    | FileCheck %s --check-prefixes=ROTATE,ROTATE_OLDPM,FALLBACK8
 -; RUN: opt -passes='default<O3>' -rotation-max-header-size=4 -S < %s  | FileCheck %s --check-prefixes=ROTATE,ROTATE_NEWPM,FALLBACK9
 +; RUN: opt -O3 -rotation-max-header-size=3 -S < %s                    | FileCheck %s --check-prefixes=ROTATE,ROTATE_OLDPM,FALLBACK6
 +; RUN: opt -passes='default<O3>' -rotation-max-header-size=3 -S < %s  | FileCheck %s --check-prefixes=ROTATE,ROTATE_NEWPM,FALLBACK7

  ; This example is produced from a very basic C code:
  ;
 @@ -61,8 +58,8 @@ define void @_Z4loopi(i32 %width) {
  ; HOIST-NEXT:    br label [[FOR_COND:%.*]]
  ; HOIST:       for.cond:
  ; HOIST-NEXT:    [[I_0:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY:%.*]] ], [ 0, [[FOR_COND_PREHEADER]] ]
 -; HOIST-NEXT:    tail call void @f0()
  ; HOIST-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[I_0]], [[TMP0]]
 +; HOIST-NEXT:    tail call void @f0()
  ; HOIST-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
  ; HOIST:       for.cond.cleanup:
  ; HOIST-NEXT:    tail call void @f2()
 @@ -80,17 +77,17 @@ define void @_Z4loopi(i32 %width) {
  ; ROTATED_LATER_OLDPM-NEXT:    br i1 [[CMP]], label [[RETURN:%.*]], label [[FOR_COND_PREHEADER:%.*]]
  ; ROTATED_LATER_OLDPM:       for.cond.preheader:
  ; ROTATED_LATER_OLDPM-NEXT:    [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
 -; ROTATED_LATER_OLDPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_OLDPM-NEXT:    [[EXITCOND_NOT3:%.*]] = icmp eq i32 [[TMP0]], 0
  ; ROTATED_LATER_OLDPM-NEXT:    br i1 [[EXITCOND_NOT3]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY:%.*]]
  ; ROTATED_LATER_OLDPM:       for.cond.cleanup:
 +; ROTATED_LATER_OLDPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_OLDPM-NEXT:    tail call void @f2()
  ; ROTATED_LATER_OLDPM-NEXT:    br label [[RETURN]]
  ; ROTATED_LATER_OLDPM:       for.body:
  ; ROTATED_LATER_OLDPM-NEXT:    [[I_04:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_COND_PREHEADER]] ]
 +; ROTATED_LATER_OLDPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_OLDPM-NEXT:    tail call void @f1()
  ; ROTATED_LATER_OLDPM-NEXT:    [[INC]] = add nuw i32 [[I_04]], 1
 -; ROTATED_LATER_OLDPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_OLDPM-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[TMP0]]
  ; ROTATED_LATER_OLDPM-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
  ; ROTATED_LATER_OLDPM:       return:
 @@ -102,19 +99,19 @@ define void @_Z4loopi(i32 %width) {
  ; ROTATED_LATER_NEWPM-NEXT:    br i1 [[CMP]], label [[RETURN:%.*]], label [[FOR_COND_PREHEADER:%.*]]
  ; ROTATED_LATER_NEWPM:       for.cond.preheader:
  ; ROTATED_LATER_NEWPM-NEXT:    [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
 -; ROTATED_LATER_NEWPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_NEWPM-NEXT:    [[EXITCOND_NOT3:%.*]] = icmp eq i32 [[TMP0]], 0
  ; ROTATED_LATER_NEWPM-NEXT:    br i1 [[EXITCOND_NOT3]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_COND_PREHEADER_FOR_BODY_CRIT_EDGE:%.*]]
  ; ROTATED_LATER_NEWPM:       for.cond.preheader.for.body_crit_edge:
  ; ROTATED_LATER_NEWPM-NEXT:    [[INC_1:%.*]] = add nuw i32 0, 1
  ; ROTATED_LATER_NEWPM-NEXT:    br label [[FOR_BODY:%.*]]
  ; ROTATED_LATER_NEWPM:       for.cond.cleanup:
 +; ROTATED_LATER_NEWPM-NEXT:    tail call void @f0()
  ; ROTATED_LATER_NEWPM-NEXT:    tail call void @f2()
  ; ROTATED_LATER_NEWPM-NEXT:    br label [[RETURN]]
  ; ROTATED_LATER_NEWPM:       for.body:
  ; ROTATED_LATER_NEWPM-NEXT:    [[INC_PHI:%.*]] = phi i32 [ [[INC_0:%.*]], [[FOR_BODY_FOR_BODY_CRIT_EDGE:%.*]] ], [ [[INC_1]], [[FOR_COND_PREHEADER_FOR_BODY_CRIT_EDGE]] ]
 -; ROTATED_LATER_NEWPM-NEXT:    tail call void @f1()
  ; ROTATED_LATER_NEWPM-NEXT:    tail call void @f0()
 +; ROTATED_LATER_NEWPM-NEXT:    tail call void @f1()
  ; ROTATED_LATER_NEWPM-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC_PHI]], [[TMP0]]
  ; ROTATED_LATER_NEWPM-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_FOR_BODY_CRIT_EDGE]]
  ; ROTATED_LATER_NEWPM:       for.body.for.body_crit_edge:
 @@ -129,19 +126,19 @@ define void @_Z4loopi(i32 %width) {
  ; ROTATE_OLDPM-NEXT:    br i1 [[CMP]], label [[RETURN:%.*]], label [[FOR_COND_PREHEADER:%.*]]
  ; ROTATE_OLDPM:       for.cond.preheader:
  ; ROTATE_OLDPM-NEXT:    [[CMP13_NOT:%.*]] = icmp eq i32 [[WIDTH]], 1
 -; ROTATE_OLDPM-NEXT:    tail call void @f0()
  ; ROTATE_OLDPM-NEXT:    br i1 [[CMP13_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
  ; ROTATE_OLDPM:       for.body.preheader:
  ; ROTATE_OLDPM-NEXT:    [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
  ; ROTATE_OLDPM-NEXT:    br label [[FOR_BODY:%.*]]
  ; ROTATE_OLDPM:       for.cond.cleanup:
 +; ROTATE_OLDPM-NEXT:    tail call void @f0()
  ; ROTATE_OLDPM-NEXT:    tail call void @f2()
  ; ROTATE_OLDPM-NEXT:    br label [[RETURN]]
  ; ROTATE_OLDPM:       for.body:
  ; ROTATE_OLDPM-NEXT:    [[I_04:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
 +; ROTATE_OLDPM-NEXT:    tail call void @f0()
  ; ROTATE_OLDPM-NEXT:    tail call void @f1()
  ; ROTATE_OLDPM-NEXT:    [[INC]] = add nuw nsw i32 [[I_04]], 1
 -; ROTATE_OLDPM-NEXT:    tail call void @f0()
  ; ROTATE_OLDPM-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[TMP0]]
  ; ROTATE_OLDPM-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
  ; ROTATE_OLDPM:       return:
 @@ -153,19 +150,19 @@ define void @_Z4loopi(i32 %width) {
  ; ROTATE_NEWPM-NEXT:    br i1 [[CMP]], label [[RETURN:%.*]], label [[FOR_COND_PREHEADER:%.*]]
  ; ROTATE_NEWPM:       for.cond.preheader:
  ; ROTATE_NEWPM-NEXT:    [[CMP13_NOT:%.*]] = icmp eq i32 [[WIDTH]], 1
 -; ROTATE_NEWPM-NEXT:    tail call void @f0()
  ; ROTATE_NEWPM-NEXT:    br i1 [[CMP13_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
  ; ROTATE_NEWPM:       for.body.preheader:
  ; ROTATE_NEWPM-NEXT:    [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
  ; ROTATE_NEWPM-NEXT:    [[INC_1:%.*]] = add nuw nsw i32 0, 1
  ; ROTATE_NEWPM-NEXT:    br label [[FOR_BODY:%.*]]
  ; ROTATE_NEWPM:       for.cond.cleanup:
 +; ROTATE_NEWPM-NEXT:    tail call void @f0()
  ; ROTATE_NEWPM-NEXT:    tail call void @f2()
  ; ROTATE_NEWPM-NEXT:    br label [[RETURN]]
  ; ROTATE_NEWPM:       for.body:
  ; ROTATE_NEWPM-NEXT:    [[INC_PHI:%.*]] = phi i32 [ [[INC_0:%.*]], [[FOR_BODY_FOR_BODY_CRIT_EDGE:%.*]] ], [ [[INC_1]], [[FOR_BODY_PREHEADER]] ]
 -; ROTATE_NEWPM-NEXT:    tail call void @f1()
  ; ROTATE_NEWPM-NEXT:    tail call void @f0()
 +; ROTATE_NEWPM-NEXT:    tail call void @f1()
  ; ROTATE_NEWPM-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC_PHI]], [[TMP0]]
  ; ROTATE_NEWPM-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_FOR_BODY_CRIT_EDGE]]
  ; ROTATE_NEWPM:       for.body.for.body_crit_edge:
 diff --git a/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll b/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
 index b58017ba7ef..37cbc4640e4 100644
 --- a/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
 +++ b/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
 @@ -1,7 +1,7 @@
  ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
  ; RUN: opt -simplifycfg -hoist-common-insts=1 -S < %s                    | FileCheck %s --check-prefixes=HOIST
  ; RUN: opt -simplifycfg -hoist-common-insts=0 -S < %s                    | FileCheck %s --check-prefixes=NOHOIST
 -; RUN: opt -simplifycfg                       -S < %s                    | FileCheck %s --check-prefixes=HOIST,DEFAULT
 +; RUN: opt -simplifycfg                       -S < %s                    | FileCheck %s --check-prefixes=NOHOIST,DEFAULT

  ; This example is produced from a very basic C code:
  ;
	commit bb7d3af1139c36270bc9948605e06f40e4c51541
	Author: Roman Lebedev <lebedev.ri@gmail.com>
	Date: Mon Sep 7 23:54:06 2020 +0300

	Reland [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline

	This was reverted in 503deec2183d466dad64b763bab4e15fd8804239
	because it caused gigantic increase (3x) in branch mispredictions
	in certain benchmarks on certain CPU's,
	see https://reviews.llvm.org/D84108#2227365.

	It has since been investigated and here are the results:
	https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
	> It's an amazingly severe regression, but it's also all due to branch
	> mispredicts (about 3x without this). The code layout looks ok so there's
	> probably something else to deal with. I'm not sure there's anything we can
	> reasonably do so we'll just have to take the hit for now and wait for
	> another code reorganization to make the branch predictor a bit more happy :)
	>
	> Thanks for giving us some time to investigate and feel free to recommit
	> whenever you'd like.
	>
	> -eric

	So let's just reland this.
	Original commit message:


	I've been looking at missed vectorizations in one codebase.
	One particular thing that stands out is that some of the loops
	reach vectorizer in a rather mangled form, with weird PHI's,
	and some of the loops aren't even in a rotated form.

	After taking a more detailed look, that happened because
	the loop's headers were too big by then. It is evident that
	SimplifyCFG's common code hoisting transform is at fault there,
	because the pattern it handles is precisely the unrotated
	loop basic block structure.

	Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
	by default, and is always run, unlike it's friend, common code sinking
	transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
	by default and is only run once very late in the pipeline.

	I'm proposing to harmonize this, and disable common code hoisting
	until //late// in pipeline. Definition of //late// may vary,
	here currently i've picked the same one as for code sinking,
	but i suppose we could enable it as soon as right after
	loop rotation happens.

	Experimentation shows that this does indeed unsurprizingly help,
	more loops got rotated, although other issues remain elsewhere.

	Now, this undoubtedly seriously shakes phase ordering.
	This will undoubtedly be a mixed bag in terms of both compile- and
	run- time performance, codesize. Since we no longer aggressively
	hoist+deduplicate common code, we don't pay the price of said hoisting
	(which wasn't big). That may allow more loops to be rotated,
	so we pay that price. That, in turn, that may enable all the transforms
	that require canonical (rotated) loop form, including but not limited to
	vectorization, so we pay that too. And in general, no deduplication means
	more [duplicate] instructions going through the optimizations. But there's still
	late hoisting, some of them will be caught late.

	As per benchmarks i've run {F12360204}, this is mostly within the noise,
	there are some small improvements, some small regressions.
	One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
	this will expose many more pre-existing missed optimizations, as usual :S

	llvm-compile-time-tracker.com thoughts on this:
	http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
	* this does regress compile-time by +0.5% geomean (unsurprizingly)
	* size impact varies; for ThinLTO it's actually an improvement

	The largest fallout appears to be in GVN's load partial redundancy
	elimination, it spends much more time in
	`MemoryDependenceResults::getNonLocalPointerDependency()`.
	Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
	There does not appear to be a proper solution to this issue,
	other than silencing the compile-time performance regression
	by tuning cut-off thresholds in `MemoryDependenceResults`,
	at the cost of potentially regressing run-time performance.
	D84609 attempts to move in that direction, but the path is unclear
	and is going to take some time.

	If we look at stats before/after diffs, some excerpts:
	* RawSpeed (the target) {F12360200}
	* -14 (-73.68%) loops not rotated due to the header size (yay)
	* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
	* -3937 (-64.19%) common instructions hoisted
	* +561 (+0.06%) x86 asm instructions
	* -2 basic blocks
	* +2418 (+0.11%) IR instructions
	* vanilla test-suite + RawSpeed + darktable {F12360201}
	* -36396 (-65.29%) common instructions hoisted
	* +1676 (+0.02%) x86 asm instructions
	* +662 (+0.06%) basic blocks
	* +4395 (+0.04%) IR instructions

	It is likely to be sub-optimal for when optimizing for code size,
	so one might want to change tune pipeline by enabling sinking/hoisting
	when optimizing for size.

	Reviewed By: mkazantsev

	Differential Revision: https://reviews.llvm.org/D84108

	This reverts commit 503deec2183d466dad64b763bab4e15fd8804239.

	diff --git a/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h b/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
	index 46f6ca0462f..fb3a7490346 100644
	--- a/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
	+++ b/llvm/include/llvm/Transforms/Utils/SimplifyCFGOptions.h
	@@ -25,7 +25,7 @@ struct SimplifyCFGOptions {
	bool ForwardSwitchCondToPhi = false;
	bool ConvertSwitchToLookupTable = false;
	bool NeedCanonicalLoop = true;
	- bool HoistCommonInsts = true;
	+ bool HoistCommonInsts = false;
	bool SinkCommonInsts = false;
	bool SimplifyCondBranch = true;
	bool FoldTwoEntryPHINode = true;
	diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
	index 9df6a985789..9a2e895d7b7 100644
	--- a/llvm/lib/Passes/PassBuilder.cpp
	+++ b/llvm/lib/Passes/PassBuilder.cpp
	@@ -1160,11 +1160,14 @@ ModulePassManager PassBuilder::buildModuleOptimizationPipeline(
	// convert to more optimized IR using more aggressive simplify CFG options.
	// The extra sinking transform can create larger basic blocks, so do this
	// before SLP vectorization.
	- OptimizePM.addPass(SimplifyCFGPass(SimplifyCFGOptions().
	- forwardSwitchCondToPhi(true).
	- convertSwitchToLookupTable(true).
	- needCanonicalLoops(false).
	- sinkCommonInsts(true)));
	+ // FIXME: study whether hoisting and/or sinking of common instructions should
	+ // be delayed until after SLP vectorizer.
	+ OptimizePM.addPass(SimplifyCFGPass(SimplifyCFGOptions()
	+ .forwardSwitchCondToPhi(true)
	+ .convertSwitchToLookupTable(true)
	+ .needCanonicalLoops(false)
	+ .hoistCommonInsts(true)
	+ .sinkCommonInsts(true)));

	// Optimize parallel scalar instruction chains into SIMD instructions.
	if (PTO.SLPVectorization)
	diff --git a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
	index 8b15898c1c1..d7a14a3dc77 100644
	--- a/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
	+++ b/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
	@@ -455,6 +455,7 @@ void AArch64PassConfig::addIRPasses() {
	.forwardSwitchCondToPhi(true)
	.convertSwitchToLookupTable(true)
	.needCanonicalLoops(false)
	+ .hoistCommonInsts(true)
	.sinkCommonInsts(true)));

	// Run LoopDataPrefetch
	diff --git a/llvm/lib/Target/ARM/ARMTargetMachine.cpp b/llvm/lib/Target/ARM/ARMTargetMachine.cpp
	index 55ac332e2c6..5068f9b5a0f 100644
	--- a/llvm/lib/Target/ARM/ARMTargetMachine.cpp
	+++ b/llvm/lib/Target/ARM/ARMTargetMachine.cpp
	@@ -407,7 +407,8 @@ void ARMPassConfig::addIRPasses() {
	// ldrex/strex loops to simplify this, but it needs tidying up.
	if (TM->getOptLevel() != CodeGenOpt::None && EnableAtomicTidy)
	addPass(createCFGSimplificationPass(
	- SimplifyCFGOptions().sinkCommonInsts(true), [this](const Function &F) {
	+ SimplifyCFGOptions().hoistCommonInsts(true).sinkCommonInsts(true),
	+ [this](const Function &F) {
	const auto &ST = this->TM->getSubtarget<ARMSubtarget>(F);
	return ST.hasAnyDataBarrier() && !ST.isThumb1Only();
	}));
	diff --git a/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp b/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
	index 6728306db3d..37cf391c998 100644
	--- a/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
	+++ b/llvm/lib/Target/Hexagon/HexagonTargetMachine.cpp
	@@ -327,6 +327,7 @@ void HexagonPassConfig::addIRPasses() {
	.forwardSwitchCondToPhi(true)
	.convertSwitchToLookupTable(true)
	.needCanonicalLoops(false)
	+ .hoistCommonInsts(true)
	.sinkCommonInsts(true)));
	if (EnableLoopPrefetch)
	addPass(createLoopDataPrefetchPass());
	diff --git a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
	index 326d1ab28b6..caa9a98ecb0 100644
	--- a/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
	+++ b/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
	@@ -784,10 +784,13 @@ void PassManagerBuilder::populateModulePassManager(
	// convert to more optimized IR using more aggressive simplify CFG options.
	// The extra sinking transform can create larger basic blocks, so do this
	// before SLP vectorization.
	+ // FIXME: study whether hoisting and/or sinking of common instructions should
	+ // be delayed until after SLP vectorizer.
	MPM.add(createCFGSimplificationPass(SimplifyCFGOptions()
	.forwardSwitchCondToPhi(true)
	.convertSwitchToLookupTable(true)
	.needCanonicalLoops(false)
	+ .hoistCommonInsts(true)
	.sinkCommonInsts(true)));

	if (SLPVectorize) {
	diff --git a/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp b/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
	index db5211df397..b0435bf6e4e 100644
	--- a/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
	+++ b/llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
	@@ -63,8 +63,8 @@ static cl::opt<bool> UserForwardSwitchCond(
	cl::desc("Forward switch condition to phi ops (default = false)"));

	static cl::opt<bool> UserHoistCommonInsts(
	- "hoist-common-insts", cl::Hidden, cl::init(true),
	- cl::desc("hoist common instructions (default = true)"));
	+ "hoist-common-insts", cl::Hidden, cl::init(false),
	+ cl::desc("hoist common instructions (default = false)"));

	static cl::opt<bool> UserSinkCommonInsts(
	"sink-common-insts", cl::Hidden, cl::init(false),
	diff --git a/llvm/test/Transforms/PGOProfile/chr.ll b/llvm/test/Transforms/PGOProfile/chr.ll
	index c2e1ae4f53a..1a22d7f0b84 100644
	--- a/llvm/test/Transforms/PGOProfile/chr.ll
	+++ b/llvm/test/Transforms/PGOProfile/chr.ll
	@@ -2006,9 +2006,16 @@ define i64 @test_chr_22(i1 %i, i64* %j, i64 %v0) !prof !14 {
	; CHECK-NEXT: bb0:
	; CHECK-NEXT: [[REASS_ADD:%.]] = shl i64 [[V0:%.]], 1
	; CHECK-NEXT: [[V2:%.*]] = add i64 [[REASS_ADD]], 3
	+; CHECK-NEXT: [[C1:%.*]] = icmp slt i64 [[V2]], 100
	+; CHECK-NEXT: br i1 [[C1]], label [[BB0_SPLIT:%.]], label [[BB0_SPLIT_NONCHR:%.]], !prof !15
	+; CHECK: bb0.split:
	; CHECK-NEXT: [[V299:%.*]] = mul i64 [[V2]], 7860086430977039991
	; CHECK-NEXT: store i64 [[V299]], i64* [[J:%.*]], align 4
	; CHECK-NEXT: ret i64 99
	+; CHECK: bb0.split.nonchr:
	+; CHECK-NEXT: [[V299_NONCHR:%.*]] = mul i64 [[V2]], 7860086430977039991
	+; CHECK-NEXT: store i64 [[V299_NONCHR]], i64* [[J]], align 4
	+; CHECK-NEXT: ret i64 99
	;
	bb0:
	%v1 = add i64 %v0, 3
	diff --git a/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll b/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
	index 1d8cce6879e..314af1c1414 100644
	--- a/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
	+++ b/llvm/test/Transforms/PhaseOrdering/loop-rotation-vs-common-code-hoisting.ll
	@@ -5,14 +5,11 @@
	; RUN: opt -O3 -rotation-max-header-size=1 -S < %s \| FileCheck %s --check-prefixes=HOIST,THR1,FALLBACK2
	; RUN: opt -passes='default<O3>' -rotation-max-header-size=1 -S < %s \| FileCheck %s --check-prefixes=HOIST,THR1,FALLBACK3

	-; RUN: opt -O3 -rotation-max-header-size=2 -S < %s \| FileCheck %s --check-prefixes=HOIST,THR2,FALLBACK4
	-; RUN: opt -passes='default<O3>' -rotation-max-header-size=2 -S < %s \| FileCheck %s --check-prefixes=HOIST,THR2,FALLBACK5
	+; RUN: opt -O3 -rotation-max-header-size=2 -S < %s \| FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_OLDPM,FALLBACK4
	+; RUN: opt -passes='default<O3>' -rotation-max-header-size=2 -S < %s \| FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_NEWPM,FALLBACK5

	-; RUN: opt -O3 -rotation-max-header-size=3 -S < %s \| FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_OLDPM,FALLBACK6
	-; RUN: opt -passes='default<O3>' -rotation-max-header-size=3 -S < %s \| FileCheck %s --check-prefixes=ROTATED_LATER,ROTATED_LATER_NEWPM,FALLBACK7
	-
	-; RUN: opt -O3 -rotation-max-header-size=4 -S < %s \| FileCheck %s --check-prefixes=ROTATE,ROTATE_OLDPM,FALLBACK8
	-; RUN: opt -passes='default<O3>' -rotation-max-header-size=4 -S < %s \| FileCheck %s --check-prefixes=ROTATE,ROTATE_NEWPM,FALLBACK9
	+; RUN: opt -O3 -rotation-max-header-size=3 -S < %s \| FileCheck %s --check-prefixes=ROTATE,ROTATE_OLDPM,FALLBACK6
	+; RUN: opt -passes='default<O3>' -rotation-max-header-size=3 -S < %s \| FileCheck %s --check-prefixes=ROTATE,ROTATE_NEWPM,FALLBACK7

	; This example is produced from a very basic C code:
	;
	@@ -61,8 +58,8 @@ define void @_Z4loopi(i32 %width) {
	; HOIST-NEXT: br label [[FOR_COND:%.*]]
	; HOIST: for.cond:
	; HOIST-NEXT: [[I_0:%.]] = phi i32 [ [[INC:%.]], [[FOR_BODY:%.*]] ], [ 0, [[FOR_COND_PREHEADER]] ]
	-; HOIST-NEXT: tail call void @f0()
	; HOIST-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[I_0]], [[TMP0]]
	+; HOIST-NEXT: tail call void @f0()
	; HOIST-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	; HOIST: for.cond.cleanup:
	; HOIST-NEXT: tail call void @f2()
	@@ -80,17 +77,17 @@ define void @_Z4loopi(i32 %width) {
	; ROTATED_LATER_OLDPM-NEXT: br i1 [[CMP]], label [[RETURN:%.]], label [[FOR_COND_PREHEADER:%.]]
	; ROTATED_LATER_OLDPM: for.cond.preheader:
	; ROTATED_LATER_OLDPM-NEXT: [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
	-; ROTATED_LATER_OLDPM-NEXT: tail call void @f0()
	; ROTATED_LATER_OLDPM-NEXT: [[EXITCOND_NOT3:%.*]] = icmp eq i32 [[TMP0]], 0
	; ROTATED_LATER_OLDPM-NEXT: br i1 [[EXITCOND_NOT3]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_BODY:%.]]
	; ROTATED_LATER_OLDPM: for.cond.cleanup:
	+; ROTATED_LATER_OLDPM-NEXT: tail call void @f0()
	; ROTATED_LATER_OLDPM-NEXT: tail call void @f2()
	; ROTATED_LATER_OLDPM-NEXT: br label [[RETURN]]
	; ROTATED_LATER_OLDPM: for.body:
	; ROTATED_LATER_OLDPM-NEXT: [[I_04:%.]] = phi i32 [ [[INC:%.]], [[FOR_BODY]] ], [ 0, [[FOR_COND_PREHEADER]] ]
	+; ROTATED_LATER_OLDPM-NEXT: tail call void @f0()
	; ROTATED_LATER_OLDPM-NEXT: tail call void @f1()
	; ROTATED_LATER_OLDPM-NEXT: [[INC]] = add nuw i32 [[I_04]], 1
	-; ROTATED_LATER_OLDPM-NEXT: tail call void @f0()
	; ROTATED_LATER_OLDPM-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[TMP0]]
	; ROTATED_LATER_OLDPM-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
	; ROTATED_LATER_OLDPM: return:
	@@ -102,19 +99,19 @@ define void @_Z4loopi(i32 %width) {
	; ROTATED_LATER_NEWPM-NEXT: br i1 [[CMP]], label [[RETURN:%.]], label [[FOR_COND_PREHEADER:%.]]
	; ROTATED_LATER_NEWPM: for.cond.preheader:
	; ROTATED_LATER_NEWPM-NEXT: [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
	-; ROTATED_LATER_NEWPM-NEXT: tail call void @f0()
	; ROTATED_LATER_NEWPM-NEXT: [[EXITCOND_NOT3:%.*]] = icmp eq i32 [[TMP0]], 0
	; ROTATED_LATER_NEWPM-NEXT: br i1 [[EXITCOND_NOT3]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_COND_PREHEADER_FOR_BODY_CRIT_EDGE:%.]]
	; ROTATED_LATER_NEWPM: for.cond.preheader.for.body_crit_edge:
	; ROTATED_LATER_NEWPM-NEXT: [[INC_1:%.*]] = add nuw i32 0, 1
	; ROTATED_LATER_NEWPM-NEXT: br label [[FOR_BODY:%.*]]
	; ROTATED_LATER_NEWPM: for.cond.cleanup:
	+; ROTATED_LATER_NEWPM-NEXT: tail call void @f0()
	; ROTATED_LATER_NEWPM-NEXT: tail call void @f2()
	; ROTATED_LATER_NEWPM-NEXT: br label [[RETURN]]
	; ROTATED_LATER_NEWPM: for.body:
	; ROTATED_LATER_NEWPM-NEXT: [[INC_PHI:%.]] = phi i32 [ [[INC_0:%.]], [[FOR_BODY_FOR_BODY_CRIT_EDGE:%.*]] ], [ [[INC_1]], [[FOR_COND_PREHEADER_FOR_BODY_CRIT_EDGE]] ]
	-; ROTATED_LATER_NEWPM-NEXT: tail call void @f1()
	; ROTATED_LATER_NEWPM-NEXT: tail call void @f0()
	+; ROTATED_LATER_NEWPM-NEXT: tail call void @f1()
	; ROTATED_LATER_NEWPM-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC_PHI]], [[TMP0]]
	; ROTATED_LATER_NEWPM-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_FOR_BODY_CRIT_EDGE]]
	; ROTATED_LATER_NEWPM: for.body.for.body_crit_edge:
	@@ -129,19 +126,19 @@ define void @_Z4loopi(i32 %width) {
	; ROTATE_OLDPM-NEXT: br i1 [[CMP]], label [[RETURN:%.]], label [[FOR_COND_PREHEADER:%.]]
	; ROTATE_OLDPM: for.cond.preheader:
	; ROTATE_OLDPM-NEXT: [[CMP13_NOT:%.*]] = icmp eq i32 [[WIDTH]], 1
	-; ROTATE_OLDPM-NEXT: tail call void @f0()
	; ROTATE_OLDPM-NEXT: br i1 [[CMP13_NOT]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_BODY_PREHEADER:%.]]
	; ROTATE_OLDPM: for.body.preheader:
	; ROTATE_OLDPM-NEXT: [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
	; ROTATE_OLDPM-NEXT: br label [[FOR_BODY:%.*]]
	; ROTATE_OLDPM: for.cond.cleanup:
	+; ROTATE_OLDPM-NEXT: tail call void @f0()
	; ROTATE_OLDPM-NEXT: tail call void @f2()
	; ROTATE_OLDPM-NEXT: br label [[RETURN]]
	; ROTATE_OLDPM: for.body:
	; ROTATE_OLDPM-NEXT: [[I_04:%.]] = phi i32 [ [[INC:%.]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	+; ROTATE_OLDPM-NEXT: tail call void @f0()
	; ROTATE_OLDPM-NEXT: tail call void @f1()
	; ROTATE_OLDPM-NEXT: [[INC]] = add nuw nsw i32 [[I_04]], 1
	-; ROTATE_OLDPM-NEXT: tail call void @f0()
	; ROTATE_OLDPM-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[TMP0]]
	; ROTATE_OLDPM-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]
	; ROTATE_OLDPM: return:
	@@ -153,19 +150,19 @@ define void @_Z4loopi(i32 %width) {
	; ROTATE_NEWPM-NEXT: br i1 [[CMP]], label [[RETURN:%.]], label [[FOR_COND_PREHEADER:%.]]
	; ROTATE_NEWPM: for.cond.preheader:
	; ROTATE_NEWPM-NEXT: [[CMP13_NOT:%.*]] = icmp eq i32 [[WIDTH]], 1
	-; ROTATE_NEWPM-NEXT: tail call void @f0()
	; ROTATE_NEWPM-NEXT: br i1 [[CMP13_NOT]], label [[FOR_COND_CLEANUP:%.]], label [[FOR_BODY_PREHEADER:%.]]
	; ROTATE_NEWPM: for.body.preheader:
	; ROTATE_NEWPM-NEXT: [[TMP0:%.*]] = add nsw i32 [[WIDTH]], -1
	; ROTATE_NEWPM-NEXT: [[INC_1:%.*]] = add nuw nsw i32 0, 1
	; ROTATE_NEWPM-NEXT: br label [[FOR_BODY:%.*]]
	; ROTATE_NEWPM: for.cond.cleanup:
	+; ROTATE_NEWPM-NEXT: tail call void @f0()
	; ROTATE_NEWPM-NEXT: tail call void @f2()
	; ROTATE_NEWPM-NEXT: br label [[RETURN]]
	; ROTATE_NEWPM: for.body:
	; ROTATE_NEWPM-NEXT: [[INC_PHI:%.]] = phi i32 [ [[INC_0:%.]], [[FOR_BODY_FOR_BODY_CRIT_EDGE:%.*]] ], [ [[INC_1]], [[FOR_BODY_PREHEADER]] ]
	-; ROTATE_NEWPM-NEXT: tail call void @f1()
	; ROTATE_NEWPM-NEXT: tail call void @f0()
	+; ROTATE_NEWPM-NEXT: tail call void @f1()
	; ROTATE_NEWPM-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC_PHI]], [[TMP0]]
	; ROTATE_NEWPM-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY_FOR_BODY_CRIT_EDGE]]
	; ROTATE_NEWPM: for.body.for.body_crit_edge:
	diff --git a/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll b/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
	index b58017ba7ef..37cbc4640e4 100644
	--- a/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
	+++ b/llvm/test/Transforms/SimplifyCFG/common-code-hoisting.ll
	@@ -1,7 +1,7 @@
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -simplifycfg -hoist-common-insts=1 -S < %s \| FileCheck %s --check-prefixes=HOIST
	; RUN: opt -simplifycfg -hoist-common-insts=0 -S < %s \| FileCheck %s --check-prefixes=NOHOIST
	-; RUN: opt -simplifycfg -S < %s \| FileCheck %s --check-prefixes=HOIST,DEFAULT
	+; RUN: opt -simplifycfg -S < %s \| FileCheck %s --check-prefixes=NOHOIST,DEFAULT

	; This example is produced from a very basic C code:
	;