If you need to return StatusChain<CryptohomeError> or StatusChain<Cryptohome*Error>, then you should:

- Use MakeStatus<> to create the error; check existing usage in the codebase for examples.
- Use .Wrap() to wrap errors from lower levels, if any.
- Add the location enum used in CRYPTOHOME_ERR_LOC to cryptohome/error/locations.h.

For login-related UserDataAuth DBus APIs, we use a set of error handling and reporting mechanisms known as CryptohomeError.
The goals are:

* Better Visibility for Users: Surfacing distinct errors to Chrome so that users have more information and visibility when something goes wrong, including the ability to redirect them to a help center page specific to the error that they're receiving.
* Put Users in Control: Surfacing the handling of these errors (such as giving recommendations on how to handle the error) to Chrome so that users have more control over errors, especially those that might cause them to lose their data.
* Effective Monitoring: More fine-grained and systematic error monitoring so that we can better find new errors or trends within UMA.
For errors that are represented by CryptohomeError, there are 2 key concepts:

- Recommended Actions: what the caller or user can do to resolve the situation or error.
- Error ID: a fine-grained identifier that identifies the exact error.
There are 2 types of recommended actions:
One of them is PrimaryAction, which is used when cryptohome is sure about the cause of the error or situation. For instance, Mount() failed because the vault migration is incomplete.
The other is PossibleAction, which is used when cryptohome is unsure about what caused the error but has reason to believe that a certain action could clear the condition. For instance, if Mount() failed because the mount point is busy, then cryptohome could recommend that the user reboot.
Therefore, when raising an error in cryptohomed, we can either list several possible actions that we believe to be relevant, or list one primary action when we are sure about the root cause. When the errors propagate up the call stack to the DBus/UserDataAuth layer, a single PrimaryAction or a set of PossibleActions will be determined and transmitted as part of the DBus reply.
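
The text above does not spell out how the final recommendation is derived from the stack, so the following is only a minimal sketch under one assumption: a primary action anywhere in the stack supersedes all possible actions, and otherwise the possible actions of every layer are merged. The types ErrorNode and ResolvedActions, the function ResolveActions, and the kRetry value are hypothetical stand-ins for illustration, not the cryptohome classes.

```cpp
#include <optional>
#include <set>
#include <vector>

// Simplified stand-ins for the real enums in cryptohome/error/action.h.
enum class PrimaryAction { kIncorrectAuth };
enum class PossibleAction { kReboot, kRetry };

// Hypothetical model of one layer in the error stack.
struct ErrorNode {
  std::optional<PrimaryAction> primary;
  std::set<PossibleAction> possible;
};

// Hypothetical result: either one primary action or a set of possible ones.
struct ResolvedActions {
  std::optional<PrimaryAction> primary;
  std::set<PossibleAction> possible;
};

// Assumed policy (not confirmed by the source): the first primary action
// found wins and discards any possible actions; otherwise the possible
// actions of all layers are merged into one set.
ResolvedActions ResolveActions(const std::vector<ErrorNode>& stack) {
  ResolvedActions result;
  for (const ErrorNode& node : stack) {
    if (node.primary.has_value()) {
      result.primary = node.primary;
      result.possible.clear();
      return result;
    }
    result.possible.insert(node.possible.begin(), node.possible.end());
  }
  return result;
}
```

With this policy, a stack whose layers only carry kReboot and kRetry resolves to the set of both, while a stack containing kIncorrectAuth at any layer resolves to that single primary action.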
The error ID aims to be:

- Relatively stable across versions (as compared to file and line number).
- Not confusing to the user when displayed.
- Stackable so that there's context.
- Fine grained, i.e. tied to that specific error.
This is implemented as a dash-separated series of integers. For example: “42-2-17”.
Each of the integers in the ID is called a node. Each node usually corresponds to a frame in the C++ call stack. Every time MakeStatus<> is called, a node is generated. Also, TPM errors are appended to the ID as a node.
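
To make the ID format concrete, here is a small sketch that joins node values into the dash-separated form described above. JoinErrorId is a hypothetical helper written for this illustration; it is not a function in cryptohome, and it takes the nodes in whatever order the stack yields them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of cryptohome: joins the node values of an
// error stack into a dash-separated error ID.
std::string JoinErrorId(const std::vector<int>& nodes) {
  std::string id;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    if (i > 0) {
      id += '-';  // Nodes are separated by a single dash.
    }
    id += std::to_string(nodes[i]);
  }
  return id;
}
```

Calling JoinErrorId({42, 2, 17}) produces the example ID from the text, with one node per MakeStatus<> call in the chain.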
The numerical value of the node is the enum that is passed through CRYPTOHOME_ERR_LOC() to MakeStatus<>. These enums should be unique and are managed either manually or automatically in cryptohome/error/locations.h. Each enum should be named in a way that briefly describes what happened, and its value should not conflict with any value that was used previously. An enum that was previously in use can remain in the file and need not be removed. There's a tool that can check that each enum is not used multiple times.
Also, a special range of numerical values is defined for errors from the TPM, to separate them from the enums coming from locations.h.
To map the numerical value reported back to the enum or TPM error, see the section on the Error Location tool.
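
The "not used multiple times" check mentioned above can be pictured with the sketch below. It is not how location_db.py is implemented; FindReusedLocations is a hypothetical function, and it assumes the location names have already been collected from every CRYPTOHOME_ERR_LOC() call site.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical check, not the real tool: given the location name found at
// each CRYPTOHOME_ERR_LOC() call site, return the names that appear at more
// than one site, in alphabetical order.
std::vector<std::string> FindReusedLocations(
    const std::vector<std::string>& call_sites) {
  std::map<std::string, int> use_count;
  for (const std::string& name : call_sites) {
    ++use_count[name];
  }
  std::vector<std::string> reused;
  for (const auto& [name, count] : use_count) {
    if (count > 1) {
      reused.push_back(name);
    }
  }
  return reused;
}
```

An empty result means every location identifies exactly one call site, which is what keeps a node in the error ID tied to a single place in the code.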
CryptohomeError is stackable, and if there's an error, the CryptohomeError stack will contain recommended actions that advise the caller on what the situation is or what actions might resolve the issue. Furthermore, the stack of CryptohomeError can be used to produce the error ID mentioned above, which is unique to the error and behaves like a “smart stacktrace”, giving us more insight into how the error occurred.
During the transition period, CryptohomeError and related classes will take the legacy CryptohomeErrorCode, and when constructing the DBus reply, the related utility methods will insert the legacy CryptohomeErrorCode into the reply as before. Thus, there will be a period in which both the legacy CryptohomeErrorCode and the new information from the CryptohomeError class are present in the DBus reply.
If an error occurred, we can create it with:

    // With a set of possible actions:
    return MakeStatus<CryptohomeError>(
        CRYPTOHOME_ERR_LOC(kLocClassNameAndShortDescriptionOfError),
        ErrorActionSet({PossibleAction::kReboot}),
        user_data_auth::CryptohomeErrorCode::CRYPTOHOME_ERROR_MOUNT_MOUNT_POINT_BUSY);

    // With a single primary action:
    return MakeStatus<CryptohomeError>(
        CRYPTOHOME_ERR_LOC(kLocClassNameAndShortDescriptionOfError),
        ErrorActionSet(PrimaryAction::kIncorrectAuth),
        user_data_auth::CryptohomeErrorCode::CRYPTOHOME_ERROR_MOUNT_MOUNT_POINT_BUSY);

Here, kLocClassNameAndShortDescriptionOfError is a project-wide unique location ID. Callers can name and use the identifier directly at the call site, then invoke the following command before compiling to generate the declaration in cryptohome/error/locations.h:
(cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --update
The second parameter is a literal initializer list for the set of actions that cryptohomed recommends. For possible values of recommended actions, check out cryptohome/error/action.h. The initializer list should be wrapped in ErrorActionSet() so that the template can deduce its type. If the initializer list is empty, we can use NoErrorAction() as well.
Lastly, we supply the legacy CryptohomeErrorCode to the error so that the same error will be present in the DBus reply.
If an error occurred in a function that we've called and we want to wrap the error so that it bubbles up the call stack, we can:

    StatusChain<CryptohomeError> error = ActionThatMightResultInAnError();
    if (error) {
      // Logging and etc...
      return MakeStatus<CryptohomeError>(CRYPTOHOME_ERR_LOC(kLocXXX),
                                         NoErrorAction(), std::nullopt)
          .Wrap(std::move(error));
    }

Here, kLocXXX is a location ID of the same kind as above. NoErrorAction() can be replaced with any recommended actions that the current layer considers appropriate, and std::nullopt can be replaced with a CryptohomeErrorCode if the layer originally intended to return a legacy CryptohomeErrorCode.
When the error needs to carry more information than CryptohomeError can hold, custom errors can be used. For instance, if we need to check why something failed while the error is bubbling up the stack, so that we can consider retrying it immediately in cryptohomed, we'll need a field that records this information.
In these scenarios, it is recommended to implement a custom error class that inherits from CryptohomeErrorObj. For an example of how it's done, see MountError in cryptohome/error/cryptohome_mount_error.h.
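
As a rough, self-contained picture of the idea (not the actual MountError or CryptohomeErrorObj code), the sketch below uses a hypothetical BaseError with a derived RetryableError that carries one extra field, and a hypothetical ChainAllowsImmediateRetry that inspects the chain while the error bubbles up. All names here are assumptions made for illustration.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for the base error object; it holds a location
// value and optionally wraps an error from a lower layer.
class BaseError {
 public:
  explicit BaseError(int location, std::unique_ptr<BaseError> wrapped = nullptr)
      : location_(location), wrapped_(std::move(wrapped)) {}
  virtual ~BaseError() = default;

  int location() const { return location_; }
  const BaseError* wrapped() const { return wrapped_.get(); }

 private:
  int location_;
  std::unique_ptr<BaseError> wrapped_;
};

// Hypothetical custom error that carries extra information the base class
// cannot hold: whether an immediate retry is worth attempting.
class RetryableError : public BaseError {
 public:
  RetryableError(int location, bool can_retry)
      : BaseError(location), can_retry_(can_retry) {}

  bool can_retry() const { return can_retry_; }

 private:
  bool can_retry_;
};

// Walks the chain from the top and uses the first RetryableError found to
// decide whether to retry; chains without one are never retried.
bool ChainAllowsImmediateRetry(const BaseError* top) {
  for (const BaseError* e = top; e != nullptr; e = e->wrapped()) {
    if (const auto* retryable = dynamic_cast<const RetryableError*>(e)) {
      return retryable->can_retry();
    }
  }
  return false;
}
```

The point of deriving from the base class is that the custom error can still sit in the same chain as ordinary errors, while callers that care about the extra field can look for it explicitly.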
We need special care when dealing with TPM-related errors. Currently, we have an existing implementation, TPMErrorBase, which is generic and belongs to libhwsec. Therefore, it does not make sense for TPMErrorBase to be a derived class of CryptohomeError, which means TPMErrorBase cannot be part of the chain/stack of errors in this design.
To deal with that, we have a CryptohomeTPMError that contains the exact set of information that the TPMError contains. Furthermore, CryptohomeTPMError is a derived class of CryptohomeError, and thus can be part of the error chain. Note that we should only use CryptohomeTPMError to wrap a TPMErrorBase into a cryptohome error, and shouldn't construct a CryptohomeTPMError ourselves because it doesn't make sense to do so. In addition, you can't specify the location or error actions when constructing a CryptohomeTPMError because that information is derived from the TPMErrorBase.
In practice, to wrap a TPMErrorBase that we've received from libhwsec (or other classes that produce such an error) as a CryptohomeError so that it can bubble up the call stack, we can:

    StatusChain<TPMErrorBase> err = SomethingThatProduceTPMErrorBase();
    if (err) {
      auto converted = MakeStatus<CryptohomeTPMError>(err);
      return MakeStatus<CryptohomeError>(CRYPTOHOME_ERR_LOC(kLocXXX))
          .Wrap(std::move(converted));
    }

The error location is treated much like logging because it is primarily a diagnostic tool, so it is generally not expected to be tested in unit tests. However, for scenarios that matter to end users, we should test that the error actions returned are correct.
This can be done using the PrimaryActionIs and PossibleActionsInclude utility functions:

    EXPECT_TRUE(PrimaryActionIs(PrimaryAction::kIncorrectAuth));
    EXPECT_TRUE(PossibleActionsInclude(PossibleAction::kReboot));

Sometimes a StatusChain will be discarded because a retry follows, because the error is working as intended, or because we have a fallback action for that error. In those situations, we should dispose of the StatusChain with the Reap*() functions instead of simply letting it disappear. When we have a fallback action, we'll likely still want to monitor those errors because they are not intended. Use ReapAndReportError for those errors.
For instance:

    // The action may fail and it is expected.
    CryptohomeStatus status = ...;
    // The WAI error should be disposed of properly.
    ReapWorkingAsIntendedError(std::move(status));

    CryptohomeStatus status = ...;
    if (!status.ok()) {
      // Failed, we'll retry.
      // The previous error should be disposed of properly.
      ReapRetryError(std::move(status));
      status = ...;
    }

    CryptohomeStatus migration_status = ...;
    if (!migration_status.ok()) {
      // We don't want to fail the operation, but the previous
      // error should be reported.
      ReapAndReportError(std::move(migration_status), "PinMigrationError");
    }
    return pre_migration_status;

A tool is provided to manage the location enums used in CRYPTOHOME_ERR_LOC(). The tool should be run in cros_sdk.
The use of this tool is optional. Developers can choose to write the enums manually and disregard this tool. However, the tool will still automatically check that the use of enums is correct.
If you are using the tool to decode or look up an error code, it is recommended to sync your repository to at least the version at which the error code was generated. For instance, if you are debugging issues with logs from M100, it is recommended to sync your source code to M100 or newer. It is OK to be on a version newer than the one on which the error code was generated.
To generate the declarations for location enums used throughout the code base and update cryptohome/error/locations.h, run the following command:
(cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --update
The checks are run with unit test builds, i.e. FEATURES=test emerge-$BOARD cryptohome.
To run the checks manually:
(cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --check
If we have a location value that was found in UMA or in an error log, we can look it up with the tool, for example:

    (cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --lookup 5
    INFO:root:Using cryptohome source at: /mnt/host/source/src/platform2/cryptohome
    Value 5 is kLocChalCredDecryptSPKIPubKeyMismatch and can be found at:
    ./challenge_credentials/challenge_credentials_decrypt_operation.cc:94

If we have a stack of error locations and wish to decode it, we can run the tool:

    (cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --decode 5-6-7
    INFO:root:Using cryptohome source at: /mnt/host/source/src/platform2/cryptohome
    kLocChalCredDecryptSPKIPubKeyMismatch=5 @ ./challenge_credentials/challenge_credentials_decrypt_operation.cc:94
    kLocChalCredDecryptSaltProcessingFailed=6 @ ./challenge_credentials/challenge_credentials_decrypt_operation.cc:99
    kLocChalCredDecryptNoSalt=7 @ ./challenge_credentials/challenge_credentials_decrypt_operation.cc:112

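
Conceptually, decoding a stack amounts to splitting the ID on dashes and looking up each node. The sketch below illustrates that idea only; it is not the implementation of location_db.py. DecodeErrorId and the lookup table passed to it are hypothetical, and the function assumes a well-formed ID (std::stoi throws on an empty or non-numeric node).

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical decoder, not the real tool: splits a dash-separated error ID
// and maps each node to a location name using the supplied table.
// Nodes missing from the table are reported as "unknown(<value>)".
std::vector<std::string> DecodeErrorId(
    const std::string& id, const std::map<int, std::string>& names) {
  std::vector<std::string> decoded;
  std::istringstream stream(id);
  std::string node;
  while (std::getline(stream, node, '-')) {
    auto it = names.find(std::stoi(node));
    if (it != names.end()) {
      decoded.push_back(it->second);
    } else {
      decoded.push_back("unknown(" + node + ")");
    }
  }
  return decoded;
}
```

Given a table holding the three locations from the example output, decoding "5-6-7" returns their names in the same order as the nodes in the ID.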
It is possible to get frequent merge conflicts in error/locations.h. To resolve them, it is recommended to revert all changes in error/locations.h before the rebase, then re-run the tools after the rebase. This can be done by:
Reverting changes in error/locations.h:

    (cros_sdk) /mnt/host/source/src/platform2 $ common_ancestor=`git merge-base cros/main HEAD`; \
    cl_count=`git rev-list --count "${common_ancestor}"...HEAD`; \
    for i in `seq $(expr ${cl_count} + 1) -1 2`; do \
    GIT_SEQUENCE_EDITOR="sed -i -e '${i}i x git checkout HEAD^ -- \
    cryptohome/error/locations.h && git add --all && git commit --amend \
    --no-edit'" \
    git rebase -i "${common_ancestor}"; \
    done;

Then rebase as usual.
Then re-apply the changes by re-running the tools:

    (cros_sdk) /mnt/host/source/src/platform2 $ common_ancestor=`git merge-base cros/main HEAD`; \
    cl_count=`git rev-list --count "${common_ancestor}"...HEAD`; \
    for i in `seq 2 $(expr ${cl_count} + 1)`; do \
    GIT_SEQUENCE_EDITOR="sed -i -e '${i}i x \
    /mnt/host/source/src/platform2/cryptohome/error/tool/location_db.py \
    --update && git add --all && git commit --amend --no-edit'" \
    git rebase -i "${common_ancestor}"; \
    done;

Note that this technique should not be applied when cherry-picking across branches. See the section on cherry-picking across branches.
The location_db.py tool can also be used to decode a numeric TPM error, for example:

    (cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --decode-tpm 0x7004
    (cros_sdk) /mnt/host/source/src/platform2 $ ./cryptohome/error/tool/location_db.py --decode-tpm 5

Both hexadecimals and decimals are supported.
To ensure that the error location IDs reported in all branches are the same, do not run the location_db.py tool when cherry-picking across branches. Instead, manually resolve any merge conflicts, keeping the error IDs defined in locations.h consistent across branches.