[autotest] limit repair failed count to the same host and hqe
The limit is added so we won't repeatedly repair a host for a job created
from AFE. The code path has a bug that sets the host to Repair Failed
status even for jobs created with a meta_host, and even when the host is
being repaired for the first time.
This CL limits the repair job count to the jobs with the same host and hqe.
Thus, a host can still be repaired when an hqe has failed on multiple hosts.
DEPLOY=scheduler
BUG=chromium:392496,chromium:426905
TEST=local
Set max_repair_limit in the global config to 0, and raise an exception in
reset to force reset to fail.
test frontend job:
Create a job from AFE with a given host. Confirm that the DUT goes into
Repair Failed status and no repair job is queued.
test suite job:
create a suite job
When max_repair_limit is set to 0, confirm that the DUTs go into Repair
Failed status and no repair job is queued.
When max_repair_limit is set to 2, confirm that a repair job is created
after the reset failure.
Change-Id: Icf737f7ff90a96edd6f08b5d79f431b66313d242
Reviewed-on: https://chromium-review.googlesource.com/225442
Reviewed-by: Dan Shi <dshi@chromium.org>
Commit-Queue: Dan Shi <dshi@chromium.org>
Tested-by: Dan Shi <dshi@chromium.org>
diff --git a/scheduler/prejob_task.py b/scheduler/prejob_task.py
index 64c63c3..4524fd7 100644
--- a/scheduler/prejob_task.py
+++ b/scheduler/prejob_task.py
@@ -125,9 +125,13 @@
# limit, since then we overwrite the PARSING state of the HQE.
self.queue_entry.requeue()
+ # Limit repairs of a host when a prejob task fails, e.g., reset or
+ # verify. The repair job count is limited to the tasks with the
+ # specific HQE and host.
previous_repairs = models.SpecialTask.objects.filter(
task=models.SpecialTask.Task.REPAIR,
- queue_entry_id=self.queue_entry.id).count()
+ queue_entry_id=self.queue_entry.id,
+ host_id=self.queue_entry.host_id).count()
if previous_repairs >= scheduler_config.config.max_repair_limit:
self.host.set_status(models.Host.Status.REPAIR_FAILED)
self._fail_queue_entry()
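The fix above can be sketched without Django as follows. `SpecialTask`, `REPAIR`, and `MAX_REPAIR_LIMIT` here are simplified stand-ins for autotest's `models.SpecialTask` and `scheduler_config.config.max_repair_limit`, assumed purely for illustration; the real code filters a Django queryset instead of a list.

```python
# Sketch of the repair-limit check, using stand-ins for the real
# autotest models and scheduler config.
from dataclasses import dataclass

REPAIR = "Repair"          # stand-in for models.SpecialTask.Task.REPAIR
MAX_REPAIR_LIMIT = 2       # stand-in for scheduler_config.config.max_repair_limit


@dataclass
class SpecialTask:
    task: str
    queue_entry_id: int
    host_id: int


def count_previous_repairs(tasks, queue_entry_id, host_id):
    """Count repair tasks for this specific HQE *and* host.

    Before the fix, the count filtered only on queue_entry_id, so
    repairs of other hosts tried for the same meta_host HQE counted
    against this host's limit, and a never-repaired host could be put
    straight into Repair Failed.
    """
    return sum(1 for t in tasks
               if t.task == REPAIR
               and t.queue_entry_id == queue_entry_id
               and t.host_id == host_id)


def should_fail_queue_entry(tasks, queue_entry_id, host_id):
    """True when the host has exhausted its repair attempts for this HQE."""
    return count_previous_repairs(tasks, queue_entry_id, host_id) >= MAX_REPAIR_LIMIT


# One HQE (id 7) whose earlier scheduling already triggered repairs on
# two *other* hosts. Host 3 has no repairs yet for this HQE, so it may
# still be repaired rather than marked Repair Failed.
tasks = [
    SpecialTask(REPAIR, queue_entry_id=7, host_id=1),
    SpecialTask(REPAIR, queue_entry_id=7, host_id=2),
]
print(should_fail_queue_entry(tasks, queue_entry_id=7, host_id=3))  # False
```

Under the pre-fix behavior the same scenario would have counted two prior repairs for host 3 and skipped repairing it entirely.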