Article Name: A Naturalistic Investigation of Trust, AI, and Intelligence Work
Authors: Stephen L. Dorton & Samantha B. Harper
Institutions: Human-Autonomy Interaction Laboratory, Sonalysts, Inc.
Abstract: Artificial Intelligence (AI) is often viewed as the means by which the intelligence community
will cope with increasing amounts of data. There are challenges in adoption, however, as outputs of such
systems may be difficult to trust for a variety of reasons. We conducted a naturalistic study using the Critical
Incident Technique (CIT) to identify which factors were present in incidents where trust in an AI technology
used in intelligence work (i.e., the collection, processing, analysis, and dissemination of intelligence) was
gained or lost. We found that explainability and performance of the AI were the most prominent factors in
responses; however, several other factors affected the development of trust. Further, most incidents
involved two or more trust factors, demonstrating that trust is a multifaceted phenomenon. We also
conducted a broader thematic analysis to identify other trends in the data. We found that trust in AI is often
affected by the interaction of other people with the AI (i.e. people who develop it or use its outputs), and
that involving end users in the development of the AI also affects trust. We provide an overview of key
findings, practical implications for design, and possible future areas for research.
Citation:
Dorton, S.L., & Harper, S.B. (2022). A naturalistic investigation of trust, AI, and intelligence work.
Journal of Cognitive Engineering and Decision Making, 16(4), 222-236.
https://doi.org/10.1177/15553434221103718
Intelligence is a complex and high-stakes work domain, concerned with the planning, collection,
processing, analysis, and dissemination of information to support decision making (Clark, 2014).
Intelligence professionals, especially analysts, are faced with numerous challenges to cognition, including
time pressure (Hoffman, et al., 2011), undefined starting and stopping points (Hoffman et al., 2011; Wong
& Kodagoda, 2015), a surplus of non-diagnostic and intentionally deceptive information (Trent, et al.,
2007), and numerous cognitive biases and pitfalls that adversely affect reasoning (Heuer, 2017).
In addition to these cognitive challenges, new sensors and open sources are providing a continuously
expanding amount of data for analysts to work with, exceeding what analysts can feasibly process and
exploit. Artificial Intelligence (AI) technologies (including machine learning) are commonly viewed as the
means to increase the speed and effectiveness of an analyst’s ability to glean insights from large and
complex datasets (McNeese, et al., 2016; Ackerman, 2021), with applications for intelligence analysis at
the tactical, operational, and strategic levels (Symon & Tarapore, 2015). This vision for AI-driven analysis
has been echoed in policy and other documentation across the Intelligence Community (IC) and the
Department of Defense (DoD) (ODNI, 2019; Lee, et al., 2018).
A concern, however, is that AI technologies are relatively prescriptive in nature (i.e. they are built upon a
relatively linear process of entering data, values, and weights), and are often designed, developed, and
fielded without considering the cognitive work of analysts (Moon & Hoffman, 2005). There is a growing
body of work suggesting that intelligence analysis is not a linear or prescriptive process of critical thinking,
but rather an iterative sensemaking process, where analysts use a variety of abductive, inductive, and
deductive reasoning strategies to make sense of disparate information (Hoffman, et al., 2012; Moon &
Hoffman, 2005; Moore, 2011; Wong, 2014; Wong & Kodagoda, 2015; Wong & Kodagoda, 2016; Gerber,
et al., 2016). This mismatch of rigid tools with more fluid cognitive processes results in tools being misused
or disused by analysts (Moon & Hoffman, 2005).
Recently, efforts have been made to leverage the intuition and robustness of human analysts with the
computational capacity of AI through novel workflows, visualizations, and interfaces (Kamaraj & Lee,
2021; Skarbez, et al., 2019). More specific to intelligence work, recent research has investigated novel
human-AI workflows, where the human analyst works with the AI to maximize the benefits of both agents
in analytic tasks such as authorship attribution (Dorton & Hall, 2021) and aerial collections planning
(Gutzwiller & Reeder, 2021). Vogel, et al. (2021) assessed the impact of AI on intelligence analysis,
although their focus was on the broader analytic culture, rather than analysis itself. Capiola et al. (2020)
examined the factors affecting how teams of intelligence professionals rapidly build trust in each other,
although their work was focused on human-human trust (not human-AI trust). There remains a gap in the
research of understanding how analysts gain and lose trust in AI in the context of challenging intelligence
work.
Explainable AI
While there are numerous factors affecting trust in AI, none have received as much recent attention as
explainability; therefore, we discuss it here as its own phenomenon, before exploring trust more broadly.
Explainable AI (XAI) refers to AI technology that can be easily understood, such that a human can interpret
why and how the technology arrived at a specific decision (Volz, et al., 2018; Michael, 2019). Stated more
casually, XAI aims to overcome the “black box” characterization typical of deep learning technologies
(Angelov, et al., 2021). Explanations may be characterized as global or local, where global explanations
are focused on how the system works in general, and local explanations provide insight as to why a
particular step or decision was made (Hoffman, et al., 2018). Further, there are various explanation methods,
such as contrastive explanations or counterfactual reasoning, which can be employed based on the context
of the application (Hoffman, et al., 2018; Pieters, 2020). Recent research has explored the required
components of an explanation or explainability (Baber, et al., 2021; Yang, Wang, & Deleris, 2021), and
has even developed a self-explanation scorecard (Muller, et al., 2021).
Explainability is an important factor in trusting AI systems, as it not only enables users to justify system
outputs and maintain better control of the system, but also enables discovery, or the general gain of
knowledge (Adadi & Berrada, 2018). Aside from these various benefits, Angelov, et al. (2021) argue that
explainability is critical simply because it allows users to evaluate risk, which drives the adoption (or not)
of AI in different high-stakes applications. This is evidenced by the calls for XAI in numerous high-stakes
fields such as medicine, autonomous driving systems, and air traffic control (Cadario, et al., 2021; Lorente,
et al., 2021; Xie, et al., 2021). Given this argument that explainability is crucial for overcoming opacity and
assessing risk, we assumed that explainability would also be a critical factor in gaining or losing trust in AI
in intelligence work, a high-stakes domain.
The following are factors or components of XAI technologies that have been identified in the literature
(Roth-Berghofer & Cassens, 2005; Sørmo, et al., 2005; Hagras, 2018):
Justification: The AI explains why the answer provided is a good answer.
Transparency: The AI explains how the system reached the answer (where system decisions are
explained in terms, formats, and languages that we can understand).
Conceptualization: The AI clarifies the meaning of concepts.
Learning: The AI teaches you about the domain.
Bias: The AI provides verification that decisions made based on the AI system were made fairly and
were not based on a biased view of the world.
Trust in AI
While much of the recent trust work on AI has been through the lens of explainability, there is a considerable
body of work on the broader area of trust in automation (e.g. alarms, robotics, and unmanned systems) to
consider. Trust has been defined by Lee & See (2004, p. 54), as “the attitude that an agent will help achieve
an individual’s goals in a situation characterized by uncertainty and vulnerability.” Trust is not a binary
phenomenon, but rather a spectrum with a considerable gray area between trust and distrust (Roff & Danks,
2018), where trust is calibrated over time based on interactions with the system (Schaefer, et al., 2016;
Rebensky, et al. 2021; Yang, Schemanske, & Searle, 2021). Trust (or, to be more specific, calibrated trust)
is important for effective human-AI teaming. Miscalibrated trust, in which the user places either too much
or too little trust in a system, can result in the user relying on the system for more than is intended (i.e.
misuse), or not using the system to its full capabilities (i.e. disuse) (Hoff & Bashir, 2015; Parasuraman &
Riley, 1997).
Trust in a given system is influenced by dozens of factors, including human-related factors (e.g. individual
traits and emotive factors), automation- or system-related factors (e.g. capabilities of the system), and
environmental factors (e.g. task characteristics) (Schaefer, et al., 2016; Dorton & Harper, 2021). While
some researchers frame trust as primarily a function of reliability or predictability (Roff & Danks, 2018),
understandability is also a prominent factor, as are factors such as goal congruence. We have synthesized a
set of trust factors from the literature which we expected to have an impact on trust in AI systems (Table
1).
Throughout the literature there are many cases where different terms were used to describe the same concept
or trust factor, and in turn, there were also cases where a single term was used by various sources to describe
different trust factors. For example, Jian et al. (2000) use “reliability” to describe something similar to goal
congruence, while others refer to it in a more statistical manner (e.g. Madsen & Gregor, 2000). Further, the
distinction between some factors can become blurry depending on the type or application of AI. For
example, the factors of performance and reliability are similar in cases where the AI performs a binary
classification problem (i.e. provides a yes or no answer). However, we distinguish performance and
reliability as being analogous to accuracy (the outputs are correct) and precision (the outputs are consistent
based on the inputs), respectively (Watson, 2019). To attempt to address these issues, we combined
synonyms that were deemed to be sufficiently conceptually proximal. For example, we combined
explainability, understandability, transparency, interpretability, and feedback, as these terms can all be used
somewhat interchangeably to describe the properties of an AI system that do not take a “black box”
approach (Angelov, et al., 2021).
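To make this distinction concrete, the following toy sketch (in Python, with hypothetical agents and track labels that are not drawn from the study data) contrasts an agent whose outputs are usually correct but inconsistent across repeated identical inputs with an agent whose outputs are perfectly consistent but incorrect:

```python
# Toy illustration of the distinction drawn above: "performance" as accuracy
# (the outputs are correct) versus "reliability" as precision/consistency (the
# same inputs yield the same outputs). All names and values are hypothetical.
import random

TRUTH = {"track_017": "threat", "track_042": "no_threat"}

def accurate_but_unreliable(track_id: str) -> str:
    # Usually correct, but stochastic: repeated queries on the same input can disagree.
    return TRUTH[track_id] if random.random() < 0.9 else "no_threat"

def reliable_but_inaccurate(track_id: str) -> str:
    # Deterministic (same output for the same input every time), but wrong for threats.
    return "no_threat"

# Consistency check: does the same input always yield the same output?
print(len({accurate_but_unreliable("track_017") for _ in range(50)}) == 1)  # usually False
print(len({reliable_but_inaccurate("track_017") for _ in range(50)}) == 1)  # True
```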
Table 1
Factors Affecting Trust
Factor | Synonyms | Summary Definition
Reputation | Transitive Trust | The agent has received endorsement or reviews from others (Siau & Wang, 2018; Roff & Danks, 2018).
Usability | Personal Attachment | The agent is easy to interact with, and/or enjoyable to work with (Siau & Wang, 2018; Madsen & Gregor, 2000; Balfe, et al., 2018).
Security | Privacy Protection | The importance of operational safety and data security to the agent (Siau & Wang, 2018).
Utility | - | The usefulness of the agent in completing a task (Siau & Wang, 2018).*
Goal Congruence | Shared Mental Model | The extent to which the agent's goals align with your own (Siau & Wang, 2018).
Reliability | Predictability | The agent is reliable and consistent in functioning over time (Madsen & Gregor, 2000; Muir & Moray, 1996; Sheridan, 1999; Balfe, et al., 2018).
Robustness | Error Tolerance | The agent is able to function under a variety of circumstances (Sheridan, 1999; Woods, 1996; Balfe, et al., 2018).
Explainability | Understandability, Transparency, Interpretability, Feedback | The extent to which you are able to understand what the agent is doing, why it is doing it, and how it is doing it (Balfe, et al., 2018; Angelov, et al., 2021).
Performance | Competence, Accuracy, Errors, False Alarms | The perceived ability of the agent to perform its tasks well (Balfe, et al., 2018; Madsen & Gregor, 2000).
Directability | Subordination | The degree to which the agent's actions are able to be modified or changed (Klein et al., 2004; Schaefer, et al., 2016).
* Yang, Schemanske, & Searle (2021) do not ascribe a specific term, but describe a similar concept where
a trust decrement after automation failure is larger when the outcome is undesirable, but smaller when the
human-automation team is still successful (i.e. the automation failure does not impede the ultimate outcome
of work).
Goals and Research Questions
As previously discussed, there are large bodies of work on trust in automation and on explainability of
AI, both supported primarily by theoretical and laboratory work. The naturalistic work on AI in
intelligence analysis has so far not focused on trust in AI specifically. Therefore, our overarching goal is to
fill this gap in the research by developing a naturalistic understanding of how various factors affect the gain
and loss of trust in AI in the high-stakes domain of intelligence work (planning, collections, analysis, etc.).
More specifically, we desire to investigate whether specific factors and frameworks from the literature align
with naturalistic findings regarding the complex sociotechnical system of humans, AI technologies, and
intelligence work. Although this research is fundamentally exploratory in nature, there are two high-level
research questions we aimed to answer:
Which factors (e.g. explainability) from the literature are commonly cited in incidents where trust
is gained or lost in AI, in the context of intelligence work?
What various sociotechnical phenomena exist when humans use AI for intelligence work?
Methods
We employed a naturalistic approach to achieve the aforementioned research goals, meaning that we
focused on how trust is actually gained or lost in the natural, context-rich environment of intelligence work.
The Naturalistic Decision Making (NDM) approach (and various naturalistic methods) was adopted after it
was found that statistical models and decision support systems did not improve decision making and/or
were not adopted for field use (Nemeth & Klein, 2010). Naturalistic approaches were developed to shift focus
from prescriptive accounts of behavior developed in the laboratory (i.e. how individuals should make decisions)
to behavior in real-world settings (i.e. how individuals actually make decisions). Although this effort is focused on trust rather
than decision making, it is still considered a naturalistic approach to understanding the phenomenology.
Data Collection
Data were collected using the Critical Incident Technique (CIT), a research method for collecting
retrospective reports of human behavior in incidents that meet specific criteria of interest (Flanagan, 1954).
This method is not only effective for exploratory and investigative analysis of extreme or atypical events,
but is also tremendously flexible, allowing researchers to adopt it for a wide variety of uses (Butterfield, et
al., 2005).
We developed a CIT template with a standard set of questions that were split into three passes.
The first pass consisted of questions to understand the background and context of the event itself (e.g. did
they have a choice in using the AI, did they receive training, and was the training sufficient?). The second
pass included having participants tell the story about the incident, and any follow-up questions from the
research team (based on the context of their story). The third and final pass consisted of retrospective
questions (e.g. did the incident change the way you worked with the AI, how much did you trust it before
and after the incident, how much do you trust it now?).
Each interview was conducted by two researchers with a single participant and generally took between 60
and 90 minutes to complete. Participants were asked to describe two incidents, with one incident involving
the gain or loss of trust with an AI technology, and the other incident involving the gain or loss of trust with
a human, for a total of two CIT responses. We documented responses in a CIT template form, whereby all
responses were transcribed within 24 hours of the interview. No audio recordings were collected because
of the need to protect the anonymity of participants, which is of utmost importance when working in the
intelligence domain (Vogel, et al., 2021). As we could not collect verbatim recordings, each incident was
transcribed by one researcher and then reviewed and edited by the other researcher, in accordance with best
practices to increase validity (Johnson, 1997).
Thematic Analysis
We used thematic analysis to analyze the data collected during CIT interviews, because the method is well
suited for identifying and describing themes in large amounts of qualitative data (Braun & Clarke, 2006;
Nowell, et al., 2017; King, 2004). We used an iterative thematic analysis approach similar to that of
Sherwood, et al. (2020), where we initially conducted a low-level thematic analysis to identify statements
of interest from the transcript, followed by a high-level thematic analysis to identify and define themes
based on these extracts. This was an iterative process where themes were cut, modified, or added as more
of the data were reviewed.
The authors then individually coded the transcribed responses for the presence or absence of the identified
factors (from the literature) and the themes (from the thematic analysis process). We regularly met to
compare batches of coded cases, and used a simple two-step argumentation process to resolve any
discrepancies in codes (Dorton, et al., 2019). This resulted in consensus being reached on all cases. Following
Klein and Jarosz (2011), we re-coded all cases for a given theme if at any point the definition
for the theme was updated to resolve any ambiguities. We then completed a final iteration of low-level
thematic analysis of each theme based on the cases mapped to each theme.
Although we came to consensus on all codes for all cases through the argumentation process, we used
Cohen's kappa (K) to measure the inter-rater agreement of the two coders in identifying whether a factor,
theme, or other phenomenon was present (1) or absent (0) in a case. Kappa measures the proportion of
agreement (values range from -1.00 to 1.00) while correcting for random chance, and is an appropriate
measure of agreement given the two coders and the nominal level data (Cohen, 1960; Tinsley & Weiss,
1975; Hallgren, 2012). Hallgren (2012) argues that Kappa values can indicate slight agreement (.00-.20),
fair agreement (.21-.40), moderate agreement (.41-.60), substantial agreement (.61-.80), or near perfect
agreement (.81-1.00); however, others have argued simply that Kappa values less than .40 indicate poor
agreement (Banerjee, et al., 1999).
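As a concrete illustration of the statistic, the following is a minimal sketch (in Python, using hypothetical presence/absence codes rather than the study data) of how Cohen's kappa corrects observed agreement for the agreement expected by chance:

```python
# Minimal sketch of Cohen's kappa for two coders' nominal (here, binary
# present/absent) codes over the same cases. The code vectors are hypothetical.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    assert len(coder_a) == len(coder_b) and len(coder_a) > 0
    n = len(coder_a)
    # Observed agreement: proportion of cases where the coders assigned the same code.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: expected from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical presence (1) / absence (0) codes for one factor across 10 cases.
coder_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
coder_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(coder_1, coder_2), 2))  # 0.6 for this toy data
```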
Participants and Dataset
We interviewed 29 current and former intelligence professionals, who had a total of 563 years (n = 28, M
= 20.11, SD = 11.26) of experience in intelligence, including not only analysis, but also planning,
collections, and decision making (military and/or policy) as a direct consumer of intelligence. This sample
was well above the recommended sample size for phenomenological research (Guest et al., 2006). As
shown in Table 2, the participants had experience in a broad variety of intelligence disciplines (INTs), with
the majority of experience in All-source Intelligence, Signals Intelligence (SIGINT), and Open Source
Intelligence (OSINT). These INTs included what Clark (2014) calls non-literal and literal intelligence
disciplines, where the collected intelligence does or does not require processing to have actionable value,
respectively.
Table 2
Participant Experience Across Different Intelligence Disciplines
Intelligence Discipline | n | M (years) | SD | Sum (years)
SIGINT: Signals Intelligence | 21 | 20.11 | 9.62 | 324
*ELINT: Electronic Intelligence | 3 | 17.33 | 14.19 | 52
*COMINT: Communications Intelligence | 1 | 25.00 | - | 25
All-Source | 20 | 13.18 | 11.35 | 346
OSINT: Open Source Intelligence | 13 | 13.38 | 10.08 | 174
MASINT: Measurement & Signature Intelligence | 6 | 15.17 | 12.75 | 91
ACINT: Acoustics Intelligence | 6 | 15.00 | 8.94 | 90
HUMINT: Human Intelligence | 4 | 10.50 | 9.88 | 42
GEOINT: Geospatial Intelligence | 3 | 13.33 | 10.41 | 40
IMINT: Imagery Intelligence | 2 | 11.00 | 1.41 | 22
* ELINT and COMINT are subsets of SIGINT.
Note. SD cannot be calculated where n = 1.
Not only did participants have a diverse set of experiences across the different INTs, but they also hailed from
a diverse set of organizations, each specializing in specific INTs, and/or in specific geopolitical or technical
domains. Participants had years of experience in different IC and DoD organizations, including, but not
limited to, the Central Intelligence Agency (CIA), Defense Intelligence Agency (DIA), National Security
Agency (NSA), Department of State (DOS), Defense Counterintelligence and Security Agency (DCSA),
Office of Naval Intelligence (ONI), National Air and Space Intelligence Center (NASIC), U.S. Navy, U.S.
Air Force, U.S. Marine Corps, and the U.S. Coast Guard.
Each participant provided one story about gaining or losing trust in AI in the context of intelligence (one
participant provided two), generating a rough set of 30 stories. Three stories were excluded from the dataset.
One was excluded because it was not directly relevant to intelligence, another because the technology
described in the incident failed to meet even a fairly broad characterization of AI (Diakopoulos, 2016), and the
third because the participant opted to tell a more general story about the evolution of analytic tradecraft. It
should be noted that two other stories failed to discuss a specific instance where trust was gained or lost in AI,
but still provided valuable insights about AI use within the context of intelligence work and were within scope
(i.e. they concerned AI and intelligence work). Thus, we excluded those two stories from the analysis of trust
factors (N = 25), but included them in the more general thematic analysis (N = 27).
Results
Factors Affecting Trust
We coded incidents for the presence or absence of each of the factors identified in Table 1, to determine
which factors were more prevalent in gaining or losing trust in AI. Table 3 provides an overview of these
results, including the count of how many times each factor was present in incidents (broken out by
directionality), and the reliability of coding (K) for each factor. These factors are hereafter reported
individually for the sake of clarity; however, it should be noted that they do not exist in isolation,
and commonly co-occur with other factors. The mean number of factors present in an AI story was 2.76
(SD = 1.01), where 23 incidents (92%) had two or more factors present. To protect the anonymity of
participants we parenthetically cite the incident number for each participant quote (e.g. CIT 9).
Table 3
Agreement (K) and Count of Trust Factors in CIT Responses (N = 25)
Factor | Agreement (K) | Gained Trust (+) | Lost Trust (-) | All Cases (+/-)
Explainability | .82 | 8 | 9 | 17
Performance | .23 | 7 | 6 | 13
Utility | .43 | 10 | 2 | 12
Robustness | .48 | 2 | 7 | 9
Usability | .60 | 2 | 4 | 6
Reliability | .51 | 2 | 4 | 6
Reputation | .24 | 1 | 2 | 3
Security | -* | 0 | 2 | 2
Directability | 1.00 | 0 | 1 | 1
Goal Congruence | -* | 0 | 0 | 0
* Kappa could not be calculated because inputs from at least one rater were constant.
Explainability was the most frequently present factor in gaining or losing trust in AI (n = 17, 68%). Of
those 17 cases, explainability was a factor in eight cases where trust was gained, and nine cases where trust
was lost. Generally, explainability of the AI increased trust because it allowed users to understand why
outputs were or were not correct, “You could look at the model and see the words and phrases it sorts for…
the model gave you some insights as to why it was flagged. This reinforced that it was capturing the right
things…” (CIT 15). Conversely, when AI lacked explainability, participants found that they could not make
sense of obviously correct or incorrect outcomes, or in some cases, apply correct outcomes to decision
making, “It will give you a GEO Plot with red, yellow, green… ‘and red means… uhhhh…’ We realized we
didn’t know what it meant… Don’t go there ever? Be alert if you do go there?” (CIT 8).
Performance was the second most common factor in AI incidents (n = 13, 52%), where it was present in
seven incidents where trust was gained, and six incidents where trust was lost. In addition to providing
utility in decision making, correct outputs from high-performance AI increased trust by confirming
intelligence from other sources, “The [parameter] matched what we expected… the sensor told us…
confirmed it was the [threat that intelligence had warned about]” (CIT 28). Similarly, incorrect outputs
(i.e. low performance AI) decreased trust by not allowing participants to make informed decisions, knowing
that the outputs were incorrect.
Utility was the third most common factor in AI incidents (n = 12, 48%). Simply put, participants gained
trust when the AI helped them do their job (n = 10), and lost trust when it did not (n = 2). Participants
disentangled utility from performance (i.e. accuracy of outputs), where trust was gained in even a
marginally-performing AI if it helped in at least some aspect of the job at hand, “If it found something for
me at all it was a huge positive, because it was for such a huge quantity of data that there was probably no
way for me to get it in a practical sense - I had nothing to lose by using the tool” (CIT 57). Similarly, trust
was lost in high-performing AI if it did not provide practical utility, “We made the best DIME and PMESII
models ever, they just won’t [ever] have enough data… [Organization] cancelled the program” (CIT 46).
(DIME = Diplomacy, Information, Military, and Economic; PMESII = Political, Military, Economic, Social,
Information, and Infrastructure.)
Robustness was another common factor (n = 9, 36%), which was present in only two stories about gaining
trust, and in seven stories about losing trust in AI. Participants reported robustness in AI manifesting in
terms of the AI being successful outside of its originally-intended use case, and by being able to
continuously learn and adapt based on new intelligence, “The system is able to learn new data… getting
data from intelligence on targets that are always changing” (CIT 36). Conversely, trust was commonly lost
in brittle AI (i.e. agents that lacked robustness) where the technology could only perform under a specific
set of circumstances that did not account for an adequate proportion of real-world events, “The tool is
designed to work at a certain range... They trained what is a LDA [Linear Discriminant Analysis] model on
data, validated it, then gave it data outside the boundaries of what it was trained on” (CIT 4).
Usability was a less common factor (n = 6, 24%), which was present in two incidents where trust was gained,
and four incidents where trust was lost. Usability did not appear to be the primary factor in gaining or losing
trust, but instead compounded the effects of other factors. For example, the ease of interacting with the
agent promoted use of the AI and competency development, which positively affected trust, “The UI [user
interface] was also very easy to use, which helped [learn the system]” (CIT 24). Conversely, if the AI was cumbersome
or otherwise unpleasant to interact with, the lack of usability promoted disuse of the system, preventing the
development or repair of trust, “I was spending hours per day combing through reports… It was hard to
find relevant information so I used it less frequently” (CIT 31).
Reliability was another relatively uncommon factor (n = 6, 24%), which was present in two incidents where
trust was gained, and four incidents where trust was lost. Interestingly, there was an observed duality in the
effects of reliability in misperforming AI, where reliably low performance contributed to both gaining and
losing trust. In some cases reliability helped build trust when model outputs were demonstrably wrong, “I
guess I’ve gained confidence because the algorithm consistently gives results that are imperfect” (CIT 55).
However, reliably poor performance also degraded trust in some cases, driving disuse of systems, “Is it
functioning properly or not? More often than not it wasn’t... It was kind of expected… so the alternative
was just [workaround]” (CIT 59).
Other Factors such as ‘Reputation’ (n = 3, 12%), ‘Safety and Security’ (n = 2, 8%) and ‘Directability’ (n
= 1, 4%) were also present, although infrequently. Reputation impacted trust when it was confirmed or
disconfirmed by firsthand experience with the AI, “Other people told me that all the reporting it spits out
is relevant… I didn’t have a lot of experience in this domain so I relied on it and trusted it more than an
experienced person would” (CIT 31). As one might assume, trust was lost when AI compromised the safety
of the participant and/or their colleagues, even in simulated training exercises, “If it was an actual [threat]
we would have been dead… you want the screening to be good enough to keep [threats] out of the kill
radius” (CIT 33). In one case, an inability to override or otherwise direct the AI based on the context of the
mission decreased trust, where the participant said the outputs of an AI-based GEOINT system were more
limited than features on their car that can be overridden when necessary, “It’s like [automated] brakes on
a car- it goes red for tall grass but I need to be able to bump into it to make the turn…” (CIT 8).
Themes Regarding Trust, AI, and Intelligence
Using the previously described thematic analysis process resulted in the identification of several themes.
The following is a list of these themes identified in the data, accompanied by a brief description.
Trust by Proxy: Trust was affected by how other humans interacted with the AI (i.e. beyond
reputation or endorsements).
Users Involved in Development: Trust was affected by the presence or absence of end users and/or
domain subject matter experts being involved in the development of the AI.
Reputation: Reputation or endorsements from peers affected trust in the agent. This is not limited
to reputation’s role in the incident itself, but also in calibrating trust before or after the incident.
Character: Trust was affected by the absence of character in the AI. This was not an issue of
anthropomorphism (i.e. whether the AI had humanlike features), but rather of the AI lacking the
character flaws of human colleagues (e.g. lying or self-promoting) and instead simply being a function
of its inputs.
Trust by Failure: Trust was gained in the agent despite it failing at some part of its task.
Asymmetric Feedback: The agent typically provided only one type of feedback (i.e. positive or
negative), which affected trust.
As shown in Table 4, the two most common themes were trust by proxy (n = 11) and users in development
(n = 10). Other themes were less prevalent, but are still reported, accompanied by more detailed descriptions
and exemplar quotes from different collected incidents.
Table 4
Agreement (K) and Count of General Themes in CIT Responses (N = 27)
Factor | Agreement (K) | Gained Trust (+) | Lost Trust (-) | All Cases (+/-)
Trust by Proxy* | .49 | 5 | 5 | 11
Users Inv. in Dev.* | .32 | 6 | 3 | 10
Reputation | .92 | 2 | 5 | 7
Character* | .68 | 3 | 1 | 5
Trust by Failure | .65 | 3 | 1 | 4
Asymmetric Feedback | .78 | 2 | 0 | 2
* Included a general case, so gained and lost trust cases do not add up to all cases.
Trust by Proxy was the most common theme found in incidents with AI (n = 11, 41%), which included
incidents where trust in the AI was affected by human behavior with, or input to the AI. In most cases (n =
9) participants lost trust in the AI because its performance was largely subject to the inputs from other
fallible humans, “the reason I put less faith in this stuff is that humans can train it on bad data with bad
features and then say it’s gold when the outcomes are bad… you can see how much human assumptions
can affect the model’s performance” (CIT 55). Similarly, participants lost trust in the AI because of how
other people misused the outputs of the AI, typically by using it in conditions for which it was not validated,
“I understand what it should be used for… I lost trust in people’s use of the tool” (CIT 11). Conversely,
trust was gained in AI in a few cases (n = 2) because of the role of humans affecting the inputs to the AI,
“A human verifying a nomination gives me much higher confidence than the algorithm feeding itself” (CIT 29).
Users Involved in Development was the second most common theme in incidents with AI (n = 10, 37%).
Participants reported gaining trust in the system because end users and/or domain experts were involved in
the development of the system (n = 6), “…we were lucky to have experts/former targeting officers, it was
helpful for the development of models” (CIT 55). Conversely, participants lost trust in the AI when they
knew end users were not involved in the development process (n = 3), “It was the mathematicians that
developed it, and they did not include the experts enough… the design requirement to develop the AI was
flawed” (CIT 10). In one case, the participant specifically highlighted the need for collaboration across the
technical and operational experts, “I think they should have cross-fertilized teams… [the analysts] have no
concept on the performance of the system or the guys developing the neural nets” (CIT 50).
Reputation was a recurring theme in several cases (n = 7, 26%) of gaining or losing trust in AI. While the
factor of reputation was limited to instances where a reputation affected trust during the incident, the theme
of reputation is more inclusive, and includes the role of reputation before, during, and after incidents where
trust was gained or lost. Reputation (as the more inclusive theme) was used to calibrate trust in two ways,
before and after the incident. Most commonly (n = 5), reputation was used to calibrate trust prior to the
incident in question, “I’ve heard anecdotal stories where people tell me it didn’t pick this up and it should
have” (CIT 44). Less commonly (n = 2), participants stated they would need corroboration or peer
endorsements from others in order to regain trust in an AI after a negative incident.
Character was present in some cases (n = 5, 19%), across incidents where trust was gained and lost, and in
a general story (i.e. one that did not fit the CIT template). Participants cited the fact that AI has no character
(unlike humans, who can have character flaws) as a factor in gaining trust in AI, “It’s a computer program
that does what we tell it to do… so we force our own goals on it” (CIT 28). Conversely, some participants
viewed the lack of character with indifference, “I don’t believe technology is good or bad, it’s really the
way I use it” (CIT 50).
Trust by Failure was a relatively uncommon theme (n = 4, 15%), and was exclusive to incidents with AI.
In half of these cases participants reported gaining trust in the AI because said failure allowed them to learn
the boundary conditions or limitations of the AI, “In a way, [it] increased my trust because I have a better
understanding of what challenges can occur” (CIT 50); or, as another participant stated, “I didn’t
necessarily lose trust in the system - I learned its limitations” (CIT 17). The other two cases involved gaining
trust because the system behaved consistently or predictably, despite low performance, “Yeah, abject failure
improved my trust in the [AI]… it demonstrated that [it] is doing what I told it to” (CIT 55).
Asymmetric Feedback was present in only two cases (7%), but merits discussion. In both cases
participants gained trust in an AI technology after it finally provided positive feedback (e.g. that it detected
and identified a certain threat), because it had not previously provided any negative feedback (e.g. status
updates confirming that it was working properly but had not detected anything), “We thought it was broken
because it never worked… When you were underway it was on all the time and didn’t spit anything out until it
received something” (CIT 53). Despite being a rarely occurring theme, this is important to note as
intelligence and national security work often involves using AI to detect rare but exceptionally grave
threats.
Discussion
We used the CIT to understand what factors affected trust in AI technologies in the context of intelligence
work. Additionally, we conducted thematic analysis of collected data to identify other sociotechnical
themes in human-AI interaction. This naturalistic approach allowed us to test findings from theory and
laboratory work, and to identify other factors to be considered when designing, employing, or otherwise integrating such
systems into intelligence work. As a result, we provided specific examples of how trust factors manifest in
incidents where intelligence professionals gain or lose trust in AI. Further, we have found evidence for
several high-level themes regarding trust, AI, and intelligence work, some of which are novel or not
reported elsewhere. These findings provide readers with terms and representations for various phenomena,
as well as real-world cases to point to when justifying research, requirements, and designs to AI developers,
project managers, and other engineers involved in the development of AI-based intelligence systems. After
conducting analysis of nearly 30 incidents, we were able to generate several key findings:
Explainability is, in fact, a critical factor in developing trustable AI. It played a role in nearly 70%
of all reported incidents, spread evenly across incidents of gaining and losing trust. In all 17 cases
where explainability was a factor, participants referred specifically to transparency (how the AI
generated the output, n = 13) and/or justification (why the output is a good output, n = 11). In
contrast, the aforementioned XAI concepts of conceptualization, learning, and bias were never cited
by participants in incidents where they gained or lost trust in AI.
Performance plays a role in gaining or losing trust in AI, but is separate from utility. We saw in
several instances that highly accurate AI lacked utility if it was not robust enough for use with real
world data, and conversely, intelligence professionals found poorly performing AI to be useful.
Knowing the strengths and weaknesses of AI, people will adapt their work accordingly, “As a
detection system it worked great - better than a human. The identification of the signal - not at all. It
was not accurate for the most part, it would require human inputs. It was a good tipper… it had
accurate [detection], but bad ID… so we passed the detections on for manual analysis…” (CIT 17).
Reliability or predictability was not a prominent factor in gaining or losing trust in AI. Despite
assertions from the literature (e.g. Roff & Danks, 2018), reliability was present in only six incidents
(24%).
It is difficult to separate trust in AI from trust in the greater human-AI sociotechnical system. The
trust by proxy theme was present in nearly half of reported incidents (n = 11, 41%). Trust in an AI
system can be affected by humans in two primary ways: The people who curate inputs to the system,
and the people who use the outputs of the system (sometimes erroneously).
Having users involved in the development of the AI had an impact on gaining or losing trust (n =
10, 37%). When users were not involved, trust was lost not only from a decreased shared mental
model between the intelligence professional and the AI; we also saw specific instances where
the developers struggled to operationalize the outputs of their models.
Trust factors and themes rarely occurred in isolation. As previously mentioned, all but two incidents
had two or more trust factors present (M = 2.76, SD = 1.01). Similarly, we conducted a two-tailed
point-biserial correlation on the theme codes (a computational sketch follows this list), which showed
a significant correlation between the Trust by Proxy and Character themes (r = .56, p < .01). Said
differently, people who believe AI to have no agency or character tended to also base their trust in
the AI at least partially on other people who interact with it. Both themes were present in five cases,
or 19% of the sample. These were incidents where participants described the AI lacking any character or agency, “It’s a
computer program that does what we tell it to do…” (CIT 28), and therefore their trust in the AI
was also affected by other humans who were developing the AI or putting data into it, “There’s a
lot of people who have accounts… you don’t know everyone who is putting info into it” (CIT 22).
Some trust factors, depending on their manifestation, were disproportionately reported in incidents
with gained or lost trust. For example, participants rarely gained trust because an AI was robust to
operational context (n = 2); however, they commonly lost trust when the AI was brittle to such
context (n = 7). Conversely, participants frequently noted when AI provided utility to complete
work (n = 10); however, a lack of utility was rarely reported in incidents where trust was lost (n =
2).
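As a methodological footnote to the correlation reported in the list above, when both variables are binary (0/1 theme codes), the point-biserial correlation reduces to a Pearson correlation on the indicators. The following is a minimal sketch (with hypothetical theme codes, not the study data) of the computation using SciPy:

```python
# Sketch of a two-tailed point-biserial correlation between two binary theme
# codes (e.g., Trust by Proxy and Character). The 0/1 vectors are hypothetical.
from scipy import stats

trust_by_proxy = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
character      = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

r, p = stats.pointbiserialr(trust_by_proxy, character)  # p is two-sided
print(f"r = {r:.2f}, p = {p:.3f}")
```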
This research was hindered somewhat by the inability to collect verbatim recordings of interviews, violating
the best practice of low inference descriptors (Johnson, 1997). Although this is not ideal, it has been
acknowledged that studying intelligence work often requires artificialities in scenarios and extra care to be
taken in protecting the identities of individuals in order to render it publishable in a public forum (Trent, et
al., 2007; Vogel, et al., 2021). We took several precautions to increase validity, including data triangulation
and investigator triangulation, and noted specifically when we were confident we captured specific quotes
verbatim (Johnson, 1997). Although it is far from systematic or exhaustive, we also briefed our results to
approximately 10 participants (one-third of the sample), who concurred with our interpretation of the data,
and registered no issues (Johnson, 1997; Butterfield et al., 2005). Further, we relied on what each participant
stated with minimal inference, rather than making broader inference on factors, likely resulting in
underreporting for some factors (e.g. safety and security was likely a factor in most cases; however,
participants only explicitly mentioned it as a concern in two cases). Finally, we focused on a relatively
narrow set of trust factors. While we believe our set of factors was relatively exhaustive for the domain of
study (i.e. it included enough factors such that some had few or no cases), it is less than half of the factors
identified from Schaefer et al. (2016). It is entirely possible that with longer interviews and/or survey
methods we could uncover the role of several factors that were not studied in this effort.
Another consideration is that we did not focus specifically on intelligence analysis, but rather intelligence
work more broadly (planning, collections, analysis, and decision making as a consumer of intelligence).
Further, we did not focus on AI use by a sample within a specific organization or INT. Intelligence
professionals in each organization and each INT likely use AI in different ways to achieve different goals
under different contexts. Thus, there are limits to the generalizability or predictive value of the different
counts or frequencies reported herein. For example, one cannot validly infer that utility (n = 12) is twice as
important as usability (n = 6) as a factor in developing trust in AI for intelligence work (i.e. because it
appeared twice as frequently). While some may argue that this highly inclusive dataset may dilute the
findings, we argue that it is a strength of this study, providing broader understanding of how trust in AI is
gained or lost across the broader intelligence cycle. As noted by Klein, et al. (2021), such a “wide net”
approach with minimal inclusion criteria is appropriate when data collection is opportunistic (in this case
participants with unclassified stories were difficult to come by) and when the objective of the study is more
exploratory in nature (e.g. not a more formal meta-analysis). This wide net and naturalistic approach has
enabled greater understanding of phenomena including decision making, insight, and explanations (Klein
et al., 2010; Klein & Jarosz, 2011; Klein, et al., 2021); so it stands to reason that it is also sufficient for
investigating trust.
This study also provided practical considerations for designers and integrators of AI systems for intelligence
work. The following are some recommendations based on the findings of this study:
Involve end users in the development of the AI. More specifically, end users and domain experts
should be engaged to determine suitable inputs and outputs for the system, define expectations for
performance, and to identify contextual factors that are likely to change in operational use. Doing
so should allow developers to design and develop AI that has greater robustness to changing
contexts, greater utility in nominal and off-nominal situations, and better explainability when
performance expectations are not met.
Involve developers in the training of end users. During collaborative development and/or through
training products delivered with the system, developers should clearly articulate capabilities and
limitations of the AI that may affect its performance, robustness, and reliability, as well as the
degree to which users were involved in the development of the AI. Understanding these limitations
will likely augment the “explainability” of the AI by allowing end users to interpret cues that
provide insights as to the behavior of the system. Knowing the extent to which users were involved
in the development of the AI (e.g. feature engineering and/or development of algorithms) will allow
users to understand the degree to which their mental model of the system matches its logic, and
also serves as a trust signaling function (Riasnow, et al., 2015).
Provide symmetric explainability. Explainability has been demonstrated to affect the development
of trust in AI. Additionally, we have seen that an ideal system should provide not only positive
feedback, but also negative feedback (e.g. status updates confirming it is functioning) in use cases
where the AI operates persistently.
Looking forward, the degree to which these findings can be generalized to other high- (medical, power
plant, etc.) and low-stakes (media recommender systems) domains is unclear. Similarly, it would be
interesting to analyze cases based on the type of AI they refer to (e.g. prioritizing, classification, association,
or filtering; Diakopoulos, 2016); however, more data would be required for an analysis at that level. Further,
we envision the development of an empirically-driven checklist or scale that can be used to determine the
readiness of an AI system for adoption into the intelligence enterprise. These possibilities serve to illustrate
that this study should be viewed as a stepping stone to numerous other paths of research.
Acknowledgements
This work was supported in part by the US Army Combat Capabilities Development Command
(DEVCOM) under Contract No. W56KGU-18-C-0045. The views, opinions, and/or findings contained in
this report are those of the author and should not be construed as an official Department of the Army
position, policy, or decision unless so designated by other documentation. This document was approved for
public release on 23 September 2021, Item No. A364.
The authors wish to thank Kelly Neville and Mark Pfaff for their helpful insights regarding research and
analytic methods.
References
Ackerman, R. K. (2021). AI offers to change every aspect of intelligence. AFCEA SIGNAL.
Adadi, A. & Berrada, M. (2018). Peeking inside the black box: A survey on explainable artificial
intelligence (XAI). IEEE Access, 6, 52138-52160.
https://doi.org/10.1109/ACCESS.2018.2870052
Angelov, P. P., Soares, E. A., Jiang, R., Arnold, N. I., & Atkinson, P. M. (2021). Explainable artificial
intelligence: an analytical review. WIREs Data Mining and Knowledge Discovery, 1424, 1-13.
https://doi.org/10.1002/widm.1424
Baber, C., McCormick, E., & Apperly, I. (2021). A human-centered process model for explainable AI.
Naturalistic Decision Making and Resilience Engineering Symposium 2021. Toulouse, France.
Balfe, N., Sharples, S., & Wilson, J. R. (2018). Understanding is key: An analysis of factors pertaining to
trust in a real-world automation system. Human Factors, 60(4), 477-495.
https://doi.org/10.1177/0018720818761256
Banerjee, M., Capozzoli, M., McSweeny, L., & Sinha, D. (1999). Beyond kappa: A review of interrater
agreement measures. Canadian Journal of Statistics, 27, 3-23. https://doi.org/10.2307/3315487
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology,
3, 77-101. https://doi.org/10.1191/1478088706qp063oa
Butterfield, L. D., Borgen, W. A., Amundson, N. E., Maglio, A. T. (2005). Fifty years of the critical
incident technique: 1954-2004 and beyond. Qualitative Research, 5(4), 475-497.
https://doi.org/10.1177/1468794105056924
Cadario, R., Longoni, C., & Morewedge, C. K. (2021). Understanding, explaining, and utilizing medical
artificial intelligence. Nature Human Behavior. https://doi.org/10.1038/s41562-021-01146-0.
Capiola, A., Baxter, H. C., Pfahler, M. D., Calhoun, C. S., & Bobko, P. (2020). Swift trust in ad hoc
teams: A cognitive task analysis of intelligence operators in multi-domain command and control
contexts. Journal of Cognitive Engineering and Decision Making, 14(3), 218-241.
https://doi.org/10.1177/1555343420943460
Clark, R. M. (2014). Intelligence collection. SAGE.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
Diakopoulos, N. (2016). Accountability in algorithmic decision making. Communications of the ACM,
59(2), 56-62. https://doi.org/10.1145/2844110
Dorton, S. L., Frommer, I. D., & Garrison, T. M. (2019). A theoretical model for assessing information
validity from multiple observers. 2019 IEEE Conference on Cognitive and Computational
Aspects of Situation Management (CogSIMA), 62-68.
https://doi.org/10.1109/COGSIMA.2019.8724242
Dorton, S. L. & Hall, R. A. (2021). Collaborative human-AI sensemaking for intelligence analysis. In H.
Degen & S. Ntoa (Eds.), Artificial intelligence in HCI, (pp. 185-201). Springer Nature.
https://doi.org/10.1007/978-3-030-77772-2_12
Dorton, S. L. & Harper, S. (2021). Trustable AI: A critical challenge for naval intelligence. Center for
International and Maritime Security (CIMSEC).
Flanagan, J. C. (1954). The Critical Incident Technique. Psychological Bulletin, 5, 327-358.
http://dx.doi.org/10.1037/h0061470
Gerber, M., Wong, B. L. W., & Kodagoda, N. (2016). How analysts think: Intuition, leap of faith, and
insight. Proceedings of the Human Factors and Ergonomics Society 2016 Annual Meeting, 60(1)
173-177. https://doi.org/10.1177/1541931213601039
Guest, G., Bunce, A., & Johnson, L. (2006). How many interviews are enough? An experiment with data
saturation and variability. Field Methods, 18(1), 59-82.
https://doi.org/10.1177/1525822X05279903
Gutzwiller, R. S. & Reeder, J. (2021). Dancing with algorithms: Interaction creates greater preference and
trust in machine-learned behavior. Human Factors, 63(5), 854-867.
https://doi.org/10.1177/0018720820903893
Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial.
Tutor Quant Methods Psychol., 8(1), 23-34.
Hagras, H. (2018). Toward human-understandable, explainable AI. In Computer, 51(9), 28-36.
https://doi.org/10.1109/MC.2018.3620965.
Heuer, R. (2017). Psychology of intelligence analysis. Echo Point Books.
Hoff, K. A., & Bashir, M. (2015). Trust in automation: Integrating empirical evidence on factors that
influence trust. Human Factors, 57(3), 407-434. https://doi.org/10.1177/0018720814547570
Hoffman, R., Klein, G., & Muller, S.T. (2018). Explaining Explanation for “explainable AI”. Proceedings
of the Human Factors and Ergonomics Society 2018 Annual Meeting, 62(1), 197-201.
https://doi.org/10.1177/1541931218621047
Hoffman, R. R, Henderson, S., Moon, B., Moore, D. T., & Litman, J. A. (2011). Reasoning difficulty in
analytical activity. Theoretical Issues in Ergonomics Science, 12(3), 225-240.
https://doi.org/10.1080/1464536X.2011.564484
Jian, J., Bisantz, A., Drury, C.G., Llinas, J. (2000). Foundations for an empirically determined scale of
trust in automated systems. International Journal of Cognitive Ergonomics, 4(1), 53-71.
https://doi.org/10.1207/S15327566IJCE0401_04
Johnson, R. B. (1997). Examining the validity structure of qualitative research. Education, 118(2), 282-
292.
Kamaraj, A. V. & Lee, J. D. (2021). Using machine learning to aid in data classification: Classifying
occupation compatibility with highly automated vehicles. Ergonomics in Design, 29(2), 4-12.
https://doi.org/10.1177/1064804620923193
King, N. (2004). Using templates in the thematic analysis of text. In Cassell, C. & Symon, G. (Eds.),
Essential guide to qualitative methods in organizational research (pp. 257-270). SAGE.
Klein, G., Hoffman, R., Mueller, S., & Newsome, E. (2021). Modeling the process by which people try to
explain complex things to others. Journal of Cognitive Engineering and Decision Making, 15(4),
213-232. https://doi.org/10.1177/15553434211045154
Klein, G. & Jarosz, A. (2011). A naturalistic study of insight. Journal of Cognitive Engineering and
Decision Making, 5(4), 335-351. https://doi.org/10.1177/1555343411427013
Klein, G., Calderwood, R., Clinton-Cirocco, A. (2010). Rapid decision making on the fire ground: The
original study, plus a postscript. Journal of Cognitive Engineering and Decision Making, 4(3),
186-209. https://doi.org/10.1518/155534310X12844000801203
Klein, G., Woods, D. D. Bradshaw, J. M., & Hoffman, R. (2004). Ten challenges for making automation
a “team player” in joint human-agent activity. Intelligent Systems, 19(6), 91-95.
https://doi.org/10.1109/MIS.2004.74
Lee, J., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors,
46(1), 50-80. https://doi.org/10.1518/hfes.46.1.50_30392
Lee, M., Valisetty, R., Breuer, A., Kirk, K., Panneton, B., & Brown, S. (2018). Current and future
applications of machine learning for the US Army (Report No. ARL-TR-8345). Aberdeen
Proving Ground, MD: US Army Research Laboratory.
Lorente, M. P., Lopez, E. M., Florez, L. A., Espino, A. L., Martinez, J. A. I, & de Miguel, A. S. (2021).
Explaining deep-learning-based driver models. Applied Science, 11(8).
https://doi.org/10.3390/app11083321
Madsen, M. & Gregor, S. (2000). Measuring human-computer trust. 11th Australasian Conference on
Information Systems, 53, 6-8.
McNeese, N.J., Hoffman, R.R., McNeese, M.D., Patterson, E.S., Cooke, N.J., & Klein, G. (2016). The
human factors of intelligence analysis. Proceedings of the 2016 International Annual Meeting of
the Human Factors and Ergonomics Society, 59(1), 130-134.
https://doi.org/10.1177/1541931215591027
Michael, N. (2019). Trustworthy AI: Why does it matter? National Defense.
Moon, B. M., & Hoffman, R. R. (2005). How might “transformational” technologies and concepts be
barriers to sensemaking in intelligence analysis? In J. M. C. Schraagen (Ed.), Proceedings of the
Seventh International Naturalistic Decision Making Conference. Amsterdam, The Netherlands,
June 2005.
Muir, B.M. & Moray, N. (1996). Trust in automation. Part II. Experimental studies of trust and human
intervention in a process control simulation. Ergonomics, 39(3), 429-460.
https://doi.org/10.1080/00140139608964474
Muller, S. T., Veinott, E. S., Hoffman, R. R., Klein, G., Alam, L., Mamun, T., & Clancey, W. J. (2021).
Principles of explanation in human-AI systems. AAAI 2021 Explainable Agency in Artificial
Intelligence Workshop.
Nemeth, C. & Klein, G. (2010). The naturalistic decision making perspective. Encyclopedia of
Operations Research and Management Science.
https://doi.org/10.1002/9780470400531.eorms0410.
Nowell, L. S., Norris, J. M., White, D. E., & Moules, N. J. (2017). Thematic analysis: Striving to meet
trustworthiness criteria. International Journal of Qualitative Methods, 16(1), 1-13.
https://doi.org/10.1177/1609406917733847
Office of the Director of National Intelligence (ODNI) (2019). National Intelligence Strategy of the
United States of America 2019. Retrieved from:
https://www.dni.gov/files/ODNI/documents/National_Intelligence_Strategy_2019.pdf
Parasuraman, R. & Riley, V. (1997). Humans and automation: Use, misuse, disuse, and abuse. Human
Factors, 39(2), 240-253. https://doi.org/10.1518/001872097778543886
Rebensky, S., Carmody, K., Ficke, C., Nguyen, D., Carroll, M., Wildman, J., & Thayer, A. (2021).
Whoops! Something went wrong: Errors, trust, and trust repair strategies in human-agent teaming.
In H. Degen & S. Ntoa (Eds.), Artificial intelligence in HCI (pp. 95-106). Springer Nature.
https://doi.org/10.1007/978-3-030-77772-2_7
Riasnow, T., Ye, H., & Goswami, S. (2015). Generating trust in online consumer reviews through
signaling: An experimental study. 48th Hawaii International Conference on System Sciences,
3307-3316. https://doi.org/10.1109/HICSS.2015.399
Roff, H. M., & Danks, D. (2018). “Trust but verify”: The difficulty of trusting autonomous weapons
systems. Journal of Military Ethics, 17(1), 2-20. https://doi.org/10.1080/15027570.2018.1481907
Rosala, M. (2020). The Critical Incident Technique in UX. Nielsen Norman Group. Retrieved from:
https://www.nngroup.com/articles/critical-incident-technique/
Roth-Berghofer, T. R., & Cassens, J. (2005). Mapping goals and kinds of explanations to the knowledge
containers of case-based reasoning systems. ICCBR 2005, 3630, 451-464.
Schaefer, K. E., Chen, J. Y. C., Szalma, J. L., & Hancock, P. A. (2016). A meta-analysis of factors
influencing the development of trust in automation: Implications for understanding autonomy in
future systems. Human Factors, 58(3), 377-400. https://doi.org/10.1177/0018720816634228
Sheridan, T. B. (1999). Human supervisory control. In A. P. Sage & W. B. Rouse (Eds.), Handbook of
systems engineering and management (pp. 645-690). Wiley & Sons.
Sherwood, S. M., Neville, K. J., McLean, A. L. M. T., Walwanis, M. M., & Bolton, A. E. (2020).
Integrating new technology into the complex system of air combat training. In H. A. H. Handley
& A. Tolk (Eds.), A framework of human systems engineering: Applications and case studies (pp.
185-204). Wiley. https://doi.org/10.1002/9781119698821.ch10
Siau, K., & Wang, W. (2018). Building trust in artificial intelligence, machine learning, and robotics.
Cutter Business Technology Journal, 31(2), 47-53.
Skarbez, R., Polys, N. F., Ogle, J. T., North, C., & Bowman, D. A. (2019). Immersive analytics: Theory and
research agenda. Frontiers in Robotics and AI, 6(82), 1-15.
https://doi.org/10.3389/frobt.2019.00082
Sørmo, F., Cassens, J., & Aamodt, A. (2005). Explanation in case-based reasoning: Perspectives and
goals. Artificial Intelligence Review, 24(2), 109-143. https://doi.org/10.1007/s10462-005-4607-7
Symon, P. B., & Tarapore, A. (2015). Defense intelligence analysis in the age of big data. Joint Force
Quarterly, 79(4), 4-11.
Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater reliability and agreement of subjective judgments.
Journal of Counseling Psychology, 22(4), 358-376.
Trent, S. A., Patterson, E. S., & Woods, D. D. (2007). Challenges for cognition in intelligence analysis.
Journal of Cognitive Engineering and Decision Making, 1(1), 75-97.
https://doi.org/10.1177/155534340700100104
Vogel, K. M., Reid, G., Kampe, C., & Jones, P. (2021). The impact of AI on intelligence analysis:
Tackling issues of collaboration, algorithmic transparency, accountability, and management.
Intelligence and National Security, 1-11. https://doi.org/10.1080/02684527.2021.1946952
Volz, V., Majchrzak, K., & Preuss, M. (2018). A social science-based approach to explanations for
(game) AI. 2018 IEEE Conference on Computational Intelligence and Games (CIG), 474-481.
https://doi.org/10.1109/CIG.2018.8490361
Watson, D. (2019). The rhetoric and reality of anthropomorphism in artificial intelligence. Minds and
Machines, 29, 417-440. https://doi.org/10.1007/S11023-019-09506-6
Wong, B. L. W. (2014). How analysts think (?): Early observations. 2014 IEEE Joint Intelligence and
Security Informatics Conference, 269-299. https://doi.org/10.1109/JISIC.2014.59
Wong, B. L. W. & Kodagoda, N. (2015). How analysts think: inference making strategies. Proceedings of
the Human Factors and Ergonomics Society Annual Meeting, 59(1), 269-273.
https://doi.org/10.1177/1541931215591055
Wong, B. L. W., & Kodagoda, N. (2016). How analysts think: Anchoring, laddering, and associations.
Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 60(1), 178-182.
https://doi.org/10.1177/1541931213601040
Woods, D. D. (1996). Decomposing automation: Apparent simplicity, real complexity. In R. Parasuraman
& M. Mouloua (Eds.), Automation technology and human performance (pp. 3-17). Lawrence
Erlbaum.
Xie, Y., Ponsakornsathien, N., Gardi, A., & Sabatini, R. (2021). Explanation of machine-learning
solutions in air traffic management. Aerospace, 8(8). https://doi.org/10.3390/aerospace8080224
Yang, L., Wang, H., & Deleris, L. (2021). What does it mean to explain? A user-centered study on AI
explainability. In H. Degen & S. Ntoa (Eds.), Artificial intelligence in HCI (pp. 107-121).
Springer Nature. https://doi.org/10.1007/978-3-030-77772-2_8
Yang, X. J., Schemanske, C., & Searle, C. (2021). Toward quantifying trust dynamics: How people adjust
their trust after moment-to-moment interaction with automation. Human Factors, 1-17.
https://doi.org/10.1177/0018720811034716