SSHHost.run(): add API to retry ssh calls on probable ssh failure We see a very small rate of probable ssh failures (timeout, or command status 255), which usually translate into complete failure for the testing in progress. Since a single run of the release builders involves millions of ssh commands, even a small failure rate results in frequent breakage. The failures are not understood, but because the ssh commands following a failure usually succeed, it is likely that a simple retry strategy, when possible, will recover from them in most cases. Each failure is either reported as ssh error 255, or by a timeout enforced by the python code running the ssh command. In either case we cannot tell whether the failure is due to ssh and/or network errors, or problems with the command executing on the DUT. So we just add the keyword parameter "ssh_failure_retry_ok" to SSHHost.run(), so that individual calling sites can choose the new behavior, with the understanding that the retries will happen on "probable" ssh errors. Failure modes of ssh are complicated by the presence of the ssh control master, which sits in between the ssh client and the ssh daemon on the DUT. So during the retry attempts, we eventually restart the control master. This is the flow for calls that specify sh_failure_retry_ok = True: 1. try first time, but disable exceptions on timeout 2. if success or DNS error, same behavior as before (return on success, retry once on DNS error) 3. if timeout or status 255 is returned, retry identically 4. if timeout or status 255, retry again after restarting the control master and resetting the timeout behavior to the original one. When switching a call site to use ssh_failure_retry_ok, consider that the worst-case timeout could be 3x longer than the specified timeout, and adjust it as needed. This should be necessary only 1. for very long timeouts or 2. for code that can recover from such timeouts (if it's going to fail anyway, it doesn't matter if it takes a little longer). BUG=chromium:664587 TEST=none Change-Id: I1f61cdba98b6ed1f3543e5ab38fa7f5bfc37bdc3 Reviewed-on: https://chromium-review.googlesource.com/418691 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/455336 Reviewed-by: Michael Tang <ntang@chromium.org> Tested-by: Keith Haddow <haddowk@chromium.org>

commit: c1363b207f1c94fe7e2ed91e25203f91e078bfd3 [log] [tgz]
author: Luigi Semenzato <semenzato@chromium.org> Fri Dec 09 11:51:01 2016 -0800
committer: Keith Haddow <haddowk@chromium.org> Wed Mar 15 00:29:24 2017 +0000
tree: d570d5ffb089757849fe97624ae36e2d64a6adfc
parent: 87f5627aa9e202caadcf5cbb9c850e4b53c22ff2 [diff]
diff --git a/server/hosts/abstract_ssh.py b/server/hosts/abstract_ssh.py
index 57d030d..f539381 100644
--- a/server/hosts/abstract_ssh.py
+++ b/server/hosts/abstract_ssh.py

@@ -704,6 +704,16 @@
         os.remove(self.known_hosts_file)
 
 
+    def restart_master_ssh(self):
+        """
+        Stop and restart the ssh master connection.  This is meant as a last
+        resort when ssh commands fail and we don't understand why.
+        """
+        logging.debug('Restarting master ssh connection')
+        self._cleanup_master_ssh()
+        self.start_master_ssh(timeout=30)
+
+
     def _cleanup_master_ssh(self):
         """
         Release all resources (process, temporary directory) used by an active

diff --git a/server/hosts/ssh_host.py b/server/hosts/ssh_host.py
index abc44b1..7e43695 100644
--- a/server/hosts/ssh_host.py
+++ b/server/hosts/ssh_host.py

@@ -18,6 +18,13 @@
 from autotest_lib.server import utils
 from autotest_lib.server.hosts import abstract_ssh
 
+# In case cros_host is being ran via SSP on an older Moblab version with an
+# older chromite version.
+try:
+    from chromite.lib import metrics
+except:
+    metrics = None
+
 
 class SSHHost(abstract_ssh.AbstractSSHHost):
     """
@@ -97,7 +104,7 @@
 
     def _run(self, command, timeout, ignore_status,
              stdout, stderr, connect_timeout, env, options, stdin, args,
-             ignore_timeout):
+             ignore_timeout, ssh_failure_retry_ok):
         """Helper function for run()."""
         ssh_cmd = self.ssh_command(connect_timeout, options)
         if not env.strip():
@@ -117,27 +124,112 @@
         #
         # We work around it by detecting the first DNS resolution error
         # and retrying exactly one time.
-        dns_retry_count = 2
+        dns_error_retry_count = 1
+
+        def counters_inc(counter_name, failure_name):
+            """Helper function to increment metrics counters.
+            @param counter_name: string indicating which counter to use
+            @param failure_name: string indentifying an error, or 'success'
+            """
+            if counter_name == 'call':
+                # ssh_counter records the outcome of each ssh invocation
+                # inside _run(), including exceptions.
+                ssh_counter = metrics.Counter('chromeos/autotest/ssh/calls')
+                fields = {'error' : failure_name or 'success',
+                          'attempt' : ssh_call_count}
+                ssh_counter.increment(fields=fields)
+
+            if counter_name == 'run':
+                # run_counter records each call to _run() with its result
+                # and how many tries were made.  Calls are recorded when
+                # _run() exits (including exiting with an exception)
+                run_counter = metrics.Counter('chromeos/autotest/ssh/runs')
+                fields = {'error' : failure_name or 'success',
+                          'attempt' : ssh_call_count}
+                run_counter.increment(fields=fields)
+
+        # If ssh_failure_retry_ok is True, retry twice on timeouts and generic
+        # error 255: if a simple retry doesn't work, kill the ssh master
+        # connection and try again.  (Note that either error could come from
+        # the command running in the DUT, in which case the retry may be
+        # useless but, in theory, also harmless.)
+        if ssh_failure_retry_ok:
+            # Ignore ssh command timeout, even though it could be a timeout due
+            # to the command executing in the remote host.  Note that passing
+            # ignore_timeout = True makes utils.run() return None on timeouts
+            # (and only on timeouts).
+            original_ignore_timeout = ignore_timeout
+            ignore_timeout = True
+            ssh_failure_retry_count = 2
+        else:
+            ssh_failure_retry_count = 0
+
+        ssh_call_count = 0
+
         while True:
-            result = utils.run(full_cmd, timeout, True, stdout, stderr,
-                               verbose=False, stdin=stdin,
-                               stderr_is_expected=ignore_status,
-                               ignore_timeout=ignore_timeout)
-            dns_retry_count -= 1
-            if (result and result.exit_status == 255 and
-                    re.search(r'^ssh: .*: Name or service not known',
-                              result.stderr)):
-                if dns_retry_count:
-                    logging.debug('Retrying because of DNS failure')
-                    continue
-                logging.debug('Retry failed.')
-                autotest_stats.Counter('dns_retry_hack.fail').increment()
-            elif not dns_retry_count:
-                logging.debug('Retry succeeded.')
-                autotest_stats.Counter('dns_retry_hack.pass').increment()
+            try:
+                # Increment call count first, in case utils.run() throws an
+                # exception.
+                ssh_call_count += 1
+                result = utils.run(full_cmd, timeout, True, stdout, stderr,
+                                   verbose=False, stdin=stdin,
+                                   stderr_is_expected=ignore_status,
+                                   ignore_timeout=ignore_timeout)
+            except Exception as e:
+                # No retries on exception.
+                counters_inc('call', 'exception')
+                counters_inc('run', 'exception')
+                raise e
+
+            failure_name = None
+
+            if result:
+                if result.exit_status == 255:
+                    if re.search(r'^ssh: .*: Name or service not known',
+                                 result.stderr):
+                        failure_name = 'dns_failure'
+                    else:
+                        failure_name = 'error_255'
+                elif result.exit_status > 0:
+                    failure_name = 'nonzero_status'
+            else:
+                # result == None
+                failure_name = 'timeout'
+
+            # Record the outcome of the ssh invocation.
+            counters_inc('call', failure_name)
+
+            if failure_name:
+                # There was a failure: decide whether to retry.
+                if failure_name == 'dns_failure':
+                    if dns_error_retry_count > 0:
+                        logging.debug('retrying ssh because of DNS failure')
+                        dns_error_retry_count -= 1
+                        continue
+                else:
+                    if ssh_failure_retry_count == 2:
+                        logging.debug('retrying ssh command after %s',
+                                       failure_name)
+                        ssh_failure_retry_count -= 1
+                        continue
+                    elif ssh_failure_retry_count == 1:
+                        # After two failures, restart the master connection
+                        # before the final try.
+                        logging.debug('retry 2: restarting master connection')
+                        self.restart_master_ssh()
+                        # Last retry: reinstate timeout behavior.
+                        ignore_timeout = original_ignore_timeout
+                        ssh_failure_retry_count -= 1
+                        continue
+
+            # No retry conditions occurred.  Exit the loop.
             break
 
+        # The outcomes of ssh invocations have been recorded.  Now record
+        # the outcome of this function.
+
         if ignore_timeout and not result:
+            counters_inc('run', 'ignored_timeout')
             return None
 
         # The error messages will show up in band (indistinguishable
@@ -148,30 +240,42 @@
         if result.exit_status == 255:
             if re.search(r'^ssh: connect to host .* port .*: '
                          r'Connection timed out\r$', result.stderr):
+                counters_inc('run', 'final_timeout')
                 raise error.AutoservSSHTimeout("ssh timed out", result)
             if "Permission denied." in result.stderr:
                 msg = "ssh permission denied"
+                counters_inc('run', 'final_eperm')
                 raise error.AutoservSshPermissionDeniedError(msg, result)
 
         if not ignore_status and result.exit_status > 0:
+            counters_inc('run', 'final_run_error')
             raise error.AutoservRunError("command execution error", result)
 
+        counters_inc('run', failure_name)
         return result
 
 
     def run(self, command, timeout=3600, ignore_status=False,
             stdout_tee=utils.TEE_TO_LOGS, stderr_tee=utils.TEE_TO_LOGS,
             connect_timeout=30, options='', stdin=None, verbose=True, args=(),
-            ignore_timeout=False):
+            ignore_timeout=False, ssh_failure_retry_ok=False):
         """
         Run a command on the remote host.
         @see common_lib.hosts.host.run()
 
-        @param connect_timeout: connection timeout (in seconds)
+        @param timeout: command execution timeout
+        @param connect_timeout: ssh connection timeout (in seconds)
         @param options: string with additional ssh command options
         @param verbose: log the commands
         @param ignore_timeout: bool True if SSH command timeouts should be
                 ignored.  Will return None on command timeout.
+        @param ssh_failure_retry_ok: True if the command may be retried on
+                probable ssh failure (error 255 or timeout).  When true,
+                the command may be executed up to three times, the second
+                time after restarting the ssh master connection.  Use only for
+                commands that are idempotent, because when a "probable
+                ssh failure" occurs, we cannot tell if the command executed
+                or not.
 
         @raises AutoservRunError: if the command failed
         @raises AutoservSSHTimeout: ssh connection has timed out
@@ -187,7 +291,8 @@
         try:
             return self._run(command, timeout, ignore_status,
                              stdout_tee, stderr_tee, connect_timeout, env,
-                             options, stdin, args, ignore_timeout)
+                             options, stdin, args, ignore_timeout,
+                             ssh_failure_retry_ok)
         except error.CmdError, cmderr:
             # We get a CmdError here only if there is timeout of that command.
             # Catch that and stuff it into AutoservRunError and raise it.
commit	c1363b207f1c94fe7e2ed91e25203f91e078bfd3	[log] [tgz]
author	Luigi Semenzato <semenzato@chromium.org>	Fri Dec 09 11:51:01 2016 -0800
committer	Keith Haddow <haddowk@chromium.org>	Wed Mar 15 00:29:24 2017 +0000
tree	d570d5ffb089757849fe97624ae36e2d64a6adfc
parent	87f5627aa9e202caadcf5cbb9c850e4b53c22ff2 [diff]