commit	c1363b207f1c94fe7e2ed91e25203f91e078bfd3
author	Luigi Semenzato <semenzato@chromium.org>	Fri Dec 09 11:51:01 2016 -0800
committer	Keith Haddow <haddowk@chromium.org>	Wed Mar 15 00:29:24 2017 +0000
tree	d570d5ffb089757849fe97624ae36e2d64a6adfc
parent	87f5627aa9e202caadcf5cbb9c850e4b53c22ff2

SSHHost.run(): add API to retry ssh calls on probable ssh failure

We see a very small rate of probable ssh failures (timeout,
or command status 255), which usually translate into complete
failure for the testing in progress.  Since a single run of
the release builders involves millions of ssh commands, even
a small failure rate results in frequent breakage.

The failures are not understood, but because the ssh commands
following a failure usually succeed, it is likely that a simple
retry strategy, when possible, will recover from them in most
cases.

Each failure is either reported as ssh error 255, or by a timeout
enforced by the python code running the ssh command.  In either
case we cannot tell whether the failure is due to ssh and/or
network errors, or problems with the command executing on the
DUT.  So we just add the keyword parameter "ssh_failure_retry_ok"
to SSHHost.run(), so that individual calling sites can choose
the new behavior, with the understanding that the retries will
happen on "probable" ssh errors.

Failure modes of ssh are complicated by the presence of the
ssh control master, which sits in between the ssh client and
the ssh daemon on the DUT.  So during the retry attempts, we
eventually restart the control master.

This is the flow for calls that specify
ssh_failure_retry_ok = True:

1. try first time, but disable exceptions on timeout
2. if success or DNS error, same behavior as before (return on
success, retry once on DNS error)
3. if timeout or status 255 is returned, retry identically
4. if timeout or status 255 is returned again, retry once more
after restarting the control master and restoring the original
timeout behavior.
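
The flow above can be sketched roughly as follows.  This is an
illustration only, not the code in this change: run_once,
restart_master and CommandTimeout are hypothetical names, a timeout
with exceptions disabled is modelled as a None status, and the
unchanged DNS-error path from step 2 is left out.

```python
SSH_ERROR_STATUS = 255


class CommandTimeout(Exception):
    """Hypothetical error raised when a command exceeds its timeout."""


def probable_ssh_failure(status):
    """A timeout (modelled as None) or exit status 255 is a probable ssh failure."""
    return status is None or status == SSH_ERROR_STATUS


def run_with_retry(run_once, restart_master, ssh_failure_retry_ok=False):
    """Run a command, retrying on probable ssh failures when allowed.

    run_once(ignore_timeout) is assumed to return the exit status, to
    return None on a timeout when ignore_timeout is True, and to raise
    CommandTimeout on a timeout when ignore_timeout is False.
    """
    if not ssh_failure_retry_ok:
        # Old behavior: a single attempt, timeouts raise.
        return run_once(ignore_timeout=False)

    # Steps 1 and 3: up to two identical attempts with timeout
    # exceptions disabled, so a timeout can be retried.
    for _ in range(2):
        status = run_once(ignore_timeout=True)
        if not probable_ssh_failure(status):
            return status

    # Step 4: restart the control master, then make a final attempt
    # with the original timeout behavior (a timeout raises again).
    restart_master()
    return run_once(ignore_timeout=False)
```

In this sketch, a status of 255 from the final attempt is returned
to the caller unchanged and a final timeout raises, which matches
what a caller would have seen before opting in.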

When switching a call site to use ssh_failure_retry_ok, consider
that the worst-case timeout could be 3x longer than the specified
timeout, and adjust it as needed.  An adjustment should be
necessary only (1) for very long timeouts or (2) for code that
can recover from such timeouts (if the code is going to fail
anyway, it doesn't matter if it takes a little longer).
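
As a rough aid for sizing timeouts at a call site, the arithmetic
can be written out as below.  The helper names are hypothetical,
and the estimate assumes three attempts at the full timeout while
ignoring the time taken to restart the control master.

```python
MAX_ATTEMPTS = 3  # first try, identical retry, retry after master restart


def worst_case_seconds(timeout, ssh_failure_retry_ok):
    """Upper bound on time spent in attempts for one run() call."""
    return timeout * MAX_ATTEMPTS if ssh_failure_retry_ok else timeout


def per_attempt_timeout(total_budget):
    """Timeout to pass so that all attempts fit within total_budget seconds."""
    return total_budget / MAX_ATTEMPTS
```

For example, a call with a 60 second timeout may take up to 180
seconds once retries are enabled; to stay within a 90 second budget,
the call would pass a 30 second timeout instead.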

BUG=chromium:664587
TEST=none

Change-Id: I1f61cdba98b6ed1f3543e5ab38fa7f5bfc37bdc3
Reviewed-on: https://chromium-review.googlesource.com/418691
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/455336
Reviewed-by: Michael Tang <ntang@chromium.org>
Tested-by: Keith Haddow <haddowk@chromium.org>

2 files changed


README.md

Autotest: Automated integration testing for Android and Chrome OS Devices

Autotest is a framework for fully automated testing. It was originally designed to test the Linux kernel, and was later expanded by the Chrome OS team to validate complete system images of Chrome OS and Android.

Autotest is composed of a number of modules that help you run standalone tests or set up a fully automated test grid, depending on your needs. A non-exhaustive list of functionality:

A body of code to run tests on the device under test. In this setup, test logic executes on the machine being tested, and results are written to files for later collection from a development machine or lab infrastructure.
A body of code to run tests against a remote device under test. In this setup, test logic executes on a development machine or piece of lab infrastructure, and the device under test is controlled remotely via SSH/adb/some combination of the above.
Developer tools to execute one or more tests. test_that for Chrome OS and test_droid for Android allow developers to run tests against a device connected to their development machine on their desk. These tools are written so that the same test logic that runs in the lab will run at their desk, reducing the number of configurations under which tests are run.
Lab infrastructure to automate the running of tests. This infrastructure is capable of managing and running tests against thousands of devices in various lab environments. This includes code for both synchronous and asynchronous scheduling of tests. Tests are run against this hardware daily to validate every build of Chrome OS.
Infrastructure to set up miniature replicas of a full lab. A full lab entails a certain amount of administrative work that isn't appropriate for a work group interested in automated tests against a small set of devices. Since this scale is common during device bringup, a special setup, called Moblab, allows a natural progression from desk -> mini lab -> full lab.

Run some autotests

See the guides to test_that and test_droid:

test_droid Basic Usage

test_that Basic Usage

Write some autotests

See the best practices guide, existing tests, and comments in the code.

Autotest Best Practices

Grabbing the latest source

git clone https://chromium.googlesource.com/chromiumos/third_party/autotest

Hacking and submitting patches

See the coding style guide for guidance on submitting patches.

Coding Style