Health care simulation is a growing field that combines innovative technologies and adult learning theory to reproducibly train medical professionals in clinical skills and practices. A wide range of assessment tools is available to assess learners on taught skills and knowledge, and there is stakeholder interest in validating these tools. Reliable quantitative assessment is critical for high-stakes certification, such as licensing and board examinations. Evaluation in healthcare simulation has many aspects, ranging from educating new learners and training current professionals to the systematic review of programs to improve outcomes. Validation of assessment tools is essential to ensure that they measure what they are intended to measure and do so consistently. Validity refers to whether a measuring instrument measures what it is intended to measure. Reliability, which is part of the validity assessment, refers to the consistency or reproducibility of an assessment tool's results: the tool should yield the same results for the same type of learner every time it is used. In practice, actual healthcare delivery requires technical, analytical, and interpersonal skills, so assessment systems must be comprehensive, valid, and reliable enough to assess these elements along with critical knowledge and skills. Validating assessment tools for healthcare simulation education ensures that learners can demonstrate the integration of knowledge and skills in a realistic setting.
The assessment process itself influences curriculum development, as well as feedback and learning. Recent developments in psychometric theory and standard setting have been effective in assessing professionalism, communication, and procedural and clinical skills. Ideally, simulation developers should reflect on the purpose of the simulation to determine whether the focus will be on teaching or learning. If the focus is on teaching, then assessments should focus on performance criteria with exercises for a set of skill-based experiences; this assesses the teaching method's effectiveness in task training. Alternatively, if the focus of the simulation is to determine higher-order learning, then the assessment should be designed to measure multiple integrated abilities such as factual understanding, problem-solving, analysis, and synthesis. In general, multiple assessment methods are necessary to capture all relevant aspects of clinical competency. For higher-order cognitive assessment (knowledge, application, and synthesis of knowledge), context-based multiple-choice questions (MCQ), extended matching items, and short answer questions are appropriate. For the demonstration of skills mastery, a multi-station objective structured clinical examination (OSCE) is a viable option. Performance-based assessments such as the Mini-Clinical Evaluation Exercise (mini-CEX) and Direct Observation of Procedural Skills (DOPS) are appropriate and can have a positive effect on learner comprehension. Alternatively, for the advanced professional continuing learner, clinical work sampling and a portfolio or logbook may be used.
In an assessment, the developers select an assessment instrument with known characteristics. A wide range of assessment tools is currently available for assessing knowledge, application, and performance. The assessment materials are then created around learning objectives, and the developers directly control all aspects of delivery and assessment. The content should relate to the learning objectives, and the test should be comprehensive enough to produce reliable scores. This ensures that the performance is wholly attributable to the learner, not an artifact of curriculum planning or execution. Additionally, different versions of the assessment that are comparable in difficulty will permit comparisons among examinees and against standards.
Learner assessment is a wide-ranging decision-making process with implications beyond student achievement alone. It is also related to program evaluation and provides important information to determine program effectiveness. Valid and reliable assessments satisfy accreditation needs and contribute to student learning.
Validity and Reliability in Assessments
The intended purpose of an assessment is to determine whether an assessment tool gives valid and reliable results that are wholly attributable to the learner. An assessment tool has to give reproducible results, and after many trials, statistical analyses can identify areas of variation. This determines the effectiveness of the assessment tool. If there is a previously established ideal against which others are compared, then results from a novel assessment should be correlated with it; however, often there is no such "gold standard," and a comparison should instead be made to similar assessment tools.
Reliability relates to the uniformity of a measure. For example, a learner completing an assessment meant to measure the effectiveness of chest compressions should have approximately the same results each time the test is completed. Reliability can be estimated in several ways and will vary depending upon the type of assessment tool used.
- The Kuder-Richardson coefficient for two-answer questions and Cronbach's alpha for questions with more than two answers can be used to measure internal consistency: 2 to 3 questions are generated that measure the same concept, and the correlation among the answers is measured. Strong correlations, with a reliability estimate as close to 1 as possible (above 0.7), indicate high reliability, while weak correlations indicate that the assessment tool may not be reliable.
- Test-retest reliability is measured when an assessment tool is given to the same learners more than once, at different times, and under similar conditions. The correlation between the measurements at the different time points is then calculated. Parallel-form (alternate-form) reliability is similar to test-retest, except that a different form of the original assessment is given to learners in the subsequent evaluations; the concepts being tested are the same in both versions, but the phrasing differs. This approach is more conservative regarding reliability than Cronbach's alpha, but it requires at least two rounds of the assessment, whereas Cronbach's alpha can be calculated after a single assessment. Reliability is measured as a correlation with Pearson's r, a statistic that measures the linear correlation between two variables X and Y and takes values between +1 and −1. In general, a correlation coefficient of less than 0.3 signifies a weak correlation, 0.3 to 0.5 is moderate, and greater than 0.5 is strong.
- Interrater reliability is used to study the effect of different assessors using the same assessment tools. Consistency in rater scores relates to the level of inter-rater reliability of the assessment tool. It is estimated by Cohen's Kappa, which compares the proportion of actual agreement between raters to the proportion expected to agree by chance.
- Analysis of variance (ANOVA) can also be used, within generalizability theory, to generate a generalizability coefficient. This theory recognizes that multiple sources of error and true score variance exist and that measures may have different reliabilities in different situations. The method aims to quantify how much measurement error is attributable to each potential factor, such as differences in question phrasing, learner attributes, raters, or time between assessments, and thereby estimates the overall reliability of the results.
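As an illustration, the simpler reliability statistics above (internal consistency, test-retest correlation, and interrater agreement) can be computed directly. The following is a minimal pure-Python sketch; the formulas are standard, but any data passed in here would be invented for demonstration:

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """Internal consistency from a learners-by-items score matrix
    (one row of item scores per learner)."""
    k = len(items[0])                               # number of items
    columns = list(zip(*items))                     # scores per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - item_var_sum / total_var)

def pearson_r(x, y):
    """Test-retest (or parallel-form) correlation between two administrations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same learners."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)
```

By the rules of thumb above, an alpha above 0.7 and strong Pearson or kappa values would support a claim of reliability, while low values would flag the tool for revision.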
The validity of an assessment tool refers to how well the tool measures what it intends to measure. High reliability is not the only measure of efficacy for an assessment tool; other measures of validity are necessary to determine the integrity of the assessment approach. Determining validity requires evidence to support the use of the assessment tool in a particular context. The development of new tools is not always necessary, but the tools used must be appropriate for program activities, and reliability and validity should be reported, or references cited, for each assessment tool used. Validity evidence, as stipulated in the Standards for Educational and Psychological Testing and briefly outlined here, is necessary for building a validity argument to support the use of an assessment for a specific purpose. The five sources of evidence for validity are:
- Evidence for content validity is the "relationship between a test's content and the construct it is intended to measure." This refers to the themes, wording, and format of the items presented in an assessment tool. This should include analyses by independent subject matter experts (SME) regarding how adequately items represent the targeted domain. Additionally, if assessments that have been previously used in similar settings cannot be utilized, the development of new assessment tools should be based on established educational theories. If learners improve scores after receiving additional training, then this would add to the validity of the assessment tool.
- The response process involves the analysis of responses to the assessment, including the strategies and thought processes of individual learners. Analyzing the variance in response patterns between different types of learners may reveal sources of inconsistency that are irrelevant to the construct being measured.
- The internal structure of the assessment tool refers to "the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based." Evidence to support the internal structure of an assessment includes dimensionality, measurement invariance, and reliability. For example, an assessment that intends to report one composite score should be mostly unidimensional. Evidence for measurement invariance should include item characteristics that are comparable across demographics such as gender or race. Reliability, as previously discussed, should report that assessment outcomes are consistent throughout repeated test administrations.
- Evidence for the relation to other variables involves the statistical relationship between assessment scores and another measure relevant to the measured construct. A strongly positive relationship would indicate that the two measures capture the same construct, while a negligible relationship would be expected for measures that should be independent.
- Consequences refer to the effects of administering the assessment tool. In other words, consequences evidence assesses the impact, whether positive or negative, intended or unintended, of the assessment itself. This can either support or contest the soundness of score interpretations.
Issues of Concern
Threats to validity can weaken the validity of an assessment. These threats are alternative factors that affect a learner's performance but are unrelated to the knowledge or skills being assessed. Potential threats to validity include 1) low reliability, 2) misalignment of the assessment with the learning objectives, and 3) interface difficulty, such as inadequate instructions or a lack of computer skills for computerized assessments. To build validity evidence, all threats to validity should be addressed and eliminated.
Another question remains on how to determine reliability and validity in healthcare simulation, which in many cases can be subjective. To overcome subjectivity, the assessment tool should be designed to leave no room for assessor interpretation, such as by using a uniform rubric; however, even this would not remove all individual bias, and any residual bias should be acknowledged.
To develop a curriculum and subsequent assessment, a literature search should be performed to find previously developed measures for the outcomes of interest. If the assessment tool is to be modified for a particular setting or set of learners, describe the modifications and include support for how these changes improve suitability for the novel situation. Discussion of the adaptation is warranted if 1) previously characterized assessment tools are modified, 2) the assessment is used for a different setting, purpose, or set of learners, or 3) there is a different interpretation of the outcomes. Potential limitations of the new approach should also be disclosed and discussed. If a previously characterized assessment tool is used in the same setting, with the same types of learners, and for the same purpose, then citing the referenced literature is appropriate. This discussion should include whether the modifications are likely to affect the validity or reliability of the assessment tools. Additionally, the developers of novel assessment tools should describe the development process and present any data that would substantiate validity and reliability.
Developers should reflect on the purpose of the simulation to determine whether the focus will be on teaching or learning. If the focus is on teaching, then assessments should evaluate mastery of taught skills through performance criteria as well as weigh the teaching method's effectiveness in task training. Alternatively, if the focus is to determine higher-order learning, then the assessment should be designed to measure multiple integrated abilities such as factual understanding, problem-solving, analysis, and synthesis. This can be accomplished through performance assessments that use problem-solving experiences meant to draw on a longitudinal set of acquired skills. Higher-order learning assessments should follow established psychometric and adult learning theory to guide the design.
Assessment of theoretical knowledge in clinical clerkships is generally constructed around learner performance in examinations and assignments. Assessment of clinical competence, however, is more complicated and involves the degree to which theoretical knowledge is applied, decision-making skills, and the ability to act in a shifting environment. Clinical assessment requires the use of diverse measures such as journals, surveys, peer evaluation, self-assessments, and learner interviews. There is an inherent difficulty in clinical assessment because some degree of subjectivity and interrater bias will always exist. Integrating a rubric-based assessment tool with a rating scale, clearly defined performance criteria, and a detailed description of performance at each level helps to overcome such interrater bias. Assessment tools for clinical clerkships therefore need an emphasis on validity evidence that includes interrater agreement; for instance, observer ratings of resident performance would need to mitigate interrater bias to be validated. Learners should be presented with the assessment tool before the learning experience, during the description of the learning objectives, so that they are aware of expectations.
Traditionally, oral examinations have poor content validity and high inter-rater variability, and are inconsistent depending on the grading schema. This assessment tool is thus susceptible to bias and intrinsically unreliable. Alternatives to this assessment approach include short answer questions, extended matching items, key feature tests, OSCE, Mini-CEX, and DOPS. A short description of each follows:
The Short Answer Question (SAQ) or Modified Essay Question (MEQ) is an open-ended, semi-structured question format. The questions can incorporate clinical scenarios, and equivalent or higher test reliabilities can be reached with fewer items when compared to true/false questions.
The Extended Matching (EM) item is based on a single theme and is a written examination format similar to multiple-choice questions (MCQ). The key difference is that EM items can be used to assess clinical scenarios and diagnostic reasoning while maintaining an objectivity and consistency that MCQs are unlikely to provide.
Key-feature questions (KFQs) have been developed to assess clinical reasoning skills. Examinations using KFQs focus on the diagnosis and management of a clinical problem where learners are most likely to make errors. More than other methods, KFQs illuminate the strengths and limits of a learner's clinical competency, and this assessment tool is more likely than other forms of evaluation to differentiate between stronger and weaker candidates in the area of clinical reasoning.
The Objective Structured Clinical Examination (OSCE) involves multiple stations where each learner performs a defined task. This tool can be used to assess competency based on direct observation. Unlike a traditional clinical exam, the OSCE can evaluate areas critical to performance, such as communication skills and the ability to handle changing patient behaviors.
The Mini-Clinical Evaluation Exercise (Mini-CEX) is a rating scale in which an expert observes and rates the learner's performance. It was developed to assess six core competencies in residents: medical interviewing skills, physical examination skills, professionalism, clinical judgment, counseling skills, and organization and efficiency.
Similarly, Direct Observation of Procedural Skills (DOPS) is a structured rating scale for assessing and providing feedback on practical procedures. The competencies that are commonly evaluated are general knowledge about the procedure, informed consent, pre-procedure preparation, analgesia, technical ability, aseptic technique, and counseling and communication.
Assessment of Procedural Skills
Methods to evaluate procedural skills differ from those used in clinical clerkships, and this is where simulation is most commonly used as a teaching modality. There are two general approaches to measuring procedural skills: global rating scales (GRS) and checklists. The global rating scale is based on the Objective Structured Assessment of Technical Skills (OSATS) tool used to evaluate surgical residents' technical skills. It can be modified and validated for content, response process, and interrater reliability to evaluate learner performance for varied procedures and with standardized patients. While global rating scales are subjective, they provide flexibility when innovative approaches are required to assess decision-making skills, team management, and patient assessments.
A training checklist can determine if the learner appropriately prepares for and executes a task by checking off whether they independently and correctly perform the specified exercise. Given that procedures are likely to have sequential steps, checklists are appropriate for assessing technical skills. Checklists allow for a thorough, structured, and objective assessment of the modular skills for a procedure. While a detailed checklist may include more steps than a trainee can memorize, it serves as a useful instruction tool for guiding the learner through a complex technique.
Alternatively, the global rating scale allows a rater to evaluate the degree (on a 1 to 5 scale) to which a learner performs the steps in a given assessment exercise. Paired together, the checklist and the global rating scale form an effective tool for procedural skills assessment.
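To illustrate the pairing, a procedural assessment record might combine binary checklist items with 1-to-5 global ratings and summarize each separately. The item names and scores below are hypothetical, not a validated instrument:

```python
from statistics import mean

# Hypothetical record for one learner on one procedure.
checklist = {                      # binary: step performed independently?
    "verifies informed consent": True,
    "performs hand hygiene": True,
    "maintains aseptic technique": False,
    "disposes of sharps safely": True,
}
global_ratings = {                 # 1 = poor ... 5 = expert
    "economy of movement": 4,
    "instrument handling": 3,
    "flow of procedure": 4,
}

checklist_pct = 100 * sum(checklist.values()) / len(checklist)
grs_mean = mean(global_ratings.values())
print(f"Checklist: {checklist_pct:.0f}% of steps; GRS mean: {grs_mean:.1f}/5")
# prints: Checklist: 75% of steps; GRS mean: 3.7/5
```

Reporting the two summaries side by side preserves the checklist's step-by-step objectivity while the GRS captures overall fluency that a checklist alone would miss.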
Medical Decision Making and Leadership Development
Medical training requires a high level of cognitive function and confidence in the decision-making process. These abilities are fundamental to leadership; thus, there is an essential need to provide opportunities for clinicians to develop leadership abilities. Additionally, when clinicians influence how services are delivered, there is greater confidence that patient care remains the central function rather than being directed by external agendas.
Leadership training as part of the resident curriculum can significantly increase confidence in leadership skills in terms of alignment, communication, and integrity, which are tools that have been previously shown in business models to be essential for effective and efficient teams. Assessing these attributes with a pre- and post-administered scorecard survey can determine whether confidence in decision making and leadership development has improved with training. Previous studies on the implementation of leadership courses have shown that the experience is enjoyable and results in enhanced leadership skills for the participants.
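A pre- and post-administered scorecard of the kind described above reduces, at its simplest, to paired differences per participant. A minimal sketch with invented Likert-scale data:

```python
from statistics import mean, stdev

# Hypothetical 1-to-5 confidence ratings from the same residents
# before and after a leadership course (invented data).
pre  = [2, 3, 2, 3, 2, 4]
post = [4, 4, 3, 5, 3, 4]

gains = [b - a for a, b in zip(pre, post)]   # paired differences
print(f"Mean confidence gain: {mean(gains):.2f} points (SD {stdev(gains):.2f})")
# prints: Mean confidence gain: 1.17 points (SD 0.75)
```

In practice, these descriptive numbers would be accompanied by a paired significance test and an effect size before concluding that confidence improved with training.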
Another assessment tool for ethical decision making (EDM) stems from a modified Delphi method, through which a theoretical framework and a 35-statement self-assessment tool, the Ethical Decision-Making Climate Questionnaire (EDMCQ), were developed. The EDMCQ is meant to capture three EDM domains in healthcare: interdisciplinary collaboration and communication, leadership, and ethical environment. This assessment tool has subsequently been validated in 13 European countries and the USA, has been useful for EDM by clinicians, and contributes to the EDM climate of healthcare organizations.
Medical science is an ever-evolving field with advances in all aspects of patient care that stem from basic, translational, and clinical investigations. Clinicians, therefore, engage in lifelong learning to keep up with these changes throughout their professional lives. Continuing medical education (CME) is such a critical aspect that it is required in most states for medical license renewal. Involvement in CME may include a variety of learning experiences to keep the learner up to date in their area of expertise.
In recent decades, there has been a transformation of CME in the USA due to stakeholder concerns over the cost of healthcare, the frequency of medical errors, fragmentation of patient care, commercial influence, and the competence of healthcare professionals. The resulting recommendations from the Institute of Medicine have led to strategies to address these challenges. Five themes grounded in educational and politico-economic priorities for healthcare in the USA were motivators for these developments: 1) a shift from attendance and time-based credits to a metric that infers competence, 2) an increased focus on cross-professional competencies that foster coordinated care delivery, 3) integration of continuing professional development (CPD) with quality improvement and linking of evidence-based science to CPD, 4) a shift from disease-specific CPD to an expanded scope addressing complex population and public health issues, and 5) the standardization of continuing medical education competencies by measuring outcomes of participation and through performance improvement initiatives. The overall goals are improved effectiveness and efficiency of the healthcare system to meet the needs of patients.
Competency-based medical education (CBME) has become mainstream in medical education and assessment for the next generation of clinicians. Providing higher-quality care and reducing variation in healthcare delivery were significant motivators for the implementation of CBME, as multiple studies demonstrated systemic failures in healthcare improvement and indicated that residency training has a significant influence on future performance.
A conceptual framework for clinician assessment has been developed that seeks to address the issues of competence and performance in clinical settings. The seven-level outcomes framework progresses from participation (level 1) to the evaluation of the impact of changes in actual practice (level 7). Additionally, this framework can aid in the development of new CME assessment tools that are in agreement with the industry-wide paradigm shift in continuing professional development. The framework, fully described by Moore et al. 2009, is briefly outlined here:
- Level 1: Participation in CME determined through attendance records.
- Level 2: Learner satisfaction determined through questionnaires following the CME activity.
- Level 3: The level of declarative knowledge or procedural learning assessed by pre- and post-tests (objective) or self-reporting of knowledge gained (subjective).
- Level 4: Competence determined through observation (objective) or self-reporting.
- Level 5: The degree to which the participants performed learned skills in practice can be assessed objectively by observation of performance in a patient care setting or through patient chart studies.
- Level 6: The degree to which the health status of patients improves as a result of CME can be assessed through chart studies and administrative databases.
- Level 7: The effect of CME on public and community health can be assessed through epidemiological data or community surveys.
This framework was specifically developed to aid CME developers in assessing clinician competence, performance, and patient health status in an actual healthcare setting.
Pearls and Other Issues
Health care simulation is a growing field that combines innovative technologies and adult learning theory to train medical professionals in clinical skills and practices reproducibly.
The intended purpose of an assessment is to determine whether an assessment tool gives valid and reliable results that are wholly attributable to the learner.
Threats to validity can weaken the validity of an assessment.
To develop a curriculum and subsequent assessment, a literature search should take place to find previously developed measures for outcomes. If the assessment tool is to be modified for a particular setting or learners, describe the modifications and include support for how these changes improve suitability for the novel situation.
Assessment for theoretical knowledge in clinical clerkships is generally constructed around learner performance in examinations and assignments.
Methods to evaluate procedural skills will be different than those used in clinical clerkships where simulation is most commonly used as a teaching modality.
Medical training requires a high level of cognitive function and confidence in the decision-making process. These abilities are fundamental to being a leader; thus, there is an essential need to provide opportunities for clinicians to develop leadership abilities.
Medical science is an ever-evolving field with advances in all aspects of patient care that stem from basic, translational, and clinical investigations. Clinicians, therefore, engage in lifelong learning to keep up with these changes throughout their professional lives.
Providing higher-quality care and reducing variation in healthcare delivery were significant motivators for the implementation of CBME, as multiple studies demonstrated systemic failures in healthcare improvement and indicated that residency training has a significant influence on future performance.
Enhancing Healthcare Team Outcomes
It has become necessary to develop medicine as a cooperative science; the clinician, the specialist, and the laboratory workers uniting for the good of the patient, each assisting in the elucidation of the problem at hand, and each dependent upon the other for support. –William J. Mayo, Commencement speech at Rush Medical College, 1910
Patients receive safer, higher-quality care when a network of providers works as an effective healthcare team. Evidence regarding the effectiveness of team-training interventions has grown and includes patient outcomes, such as mortality and morbidity, and quality-of-care indices. Additionally, secondary outcomes include teamwork behaviors, knowledge, skills, and attitudes.
Simulation and classroom-based team training can improve teamwork processes such as communication, coordination, and cooperation, and these improvements correlate with improvements in patient safety outcomes. Team-training interventions are a practical approach that organizations can take to enhance team, and therefore patient, outcomes. These exercises have been shown to improve cognitive and affective outcomes, teamwork processes, and performance outcomes. The most effective healthcare team-training interventions were those reported to include organizational changes that support the teamwork environment and the transfer of such competencies into daily practice.