tradefed_test: tune the wait for filelock.
1) Must not use a small integer to wait, much safer to use a float.
2) Use exponential backoff to increase wait time variance.
This is all very ugly. Trying to do the least amount of damage here.
The original bug still needs root causing.
BUG=b:70539494
TEST=None.
Change-Id: Icaf88b315c8699cf6559cd0e949e6e6908d89159
Reviewed-on: https://chromium-review.googlesource.com/831065
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>
(cherry picked from commit 1c941f54a08612015ec51080e7d78f6b56c14449)
Reviewed-on: https://chromium-review.googlesource.com/831175
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
(cherry picked from commit 2f6ec00cca87729ce687a46799471833d38bd69e)
Reviewed-on: https://chromium-review.googlesource.com/831176
diff --git a/server/cros/tradefed_test.py b/server/cros/tradefed_test.py
index cc4b70c..8e05e2b 100644
--- a/server/cros/tradefed_test.py
+++ b/server/cros/tradefed_test.py
@@ -245,11 +245,16 @@
attempts = 0
while not filelock.i_am_locking():
try:
- attempts += 1
logging.info('Waiting for cache lock...')
- filelock.acquire(random.randint(1, 5))
+ # We must not use a random integer as the filelock implementations
+ # may underflow an integer division.
+ filelock.acquire(random.uniform(0.0, pow(2.0, attempts)))
+ attempts += 1
except (lockfile.AlreadyLocked, lockfile.LockTimeout):
- if attempts > 200:
+ # Our goal is to wait long enough to be sure something very bad
+ # happened to the locking thread. 11 attempts is between 15 and
+ # 30 minutes.
+ if attempts > 11:
# Normally we should aqcuire the lock immediately. Once we
# wait on the order of 10 minutes either the dev server IO is
# overloaded or a lock didn't get cleaned up. Take one for the
@@ -257,6 +262,11 @@
# the lock for following tests. If the failure affects more than
# one job look for a deadlock or dev server overload.
logging.error('Permanent lock failure. Trying to break lock.')
+ # TODO(ihf): Think how to do this cleaner without having a
+ # recursive lock breaking problem. We may have to kill every
+ # job that is currently waiting. The main goal though really is
+ # to have a cache that does not corrupt. And cache updates
+ # only happen once a month or so, everything else are reads.
filelock.break_lock()
raise error.TestFail('Error: permanent cache lock failure.')
else:
@@ -1115,7 +1125,7 @@
expected_fail_files += glob.glob(expected_fail_dir + '/*.yaml')
waivers = cts_expected_failure_parser.ParseKnownCTSFailures(
- expected_fail_files)
+ expected_fail_files)
return waivers.find_waivers(test_board, test_arch)
def _get_release_channel(self, host):