KLAUS D. KUBINGER
Multiple choice response formats are problematical as an item is often scored as solved simply because the test-taker is a lucky guesser. Instead of applying pertinent IRT models which take guessing effects into account, a pragmatic approach of re-conceptualizing multiple choice response formats to reduce the chance of lucky guessing is considered. This paper compares the free response format with two different multiple choice formats. A common multiple choice format with a single correct response option and five distractors (“1 of 6”) is used, as well as a multiple choice format with five response options, of which any number of the five is correct and the item is only scored as mastered if all the correct response options and none of the wrong ones are marked (“x of 5”). An experiment was designed, using pairs of items with exactly the same content but different response formats. 173 test-takers were randomly assigned to two test booklets of 150 items altogether. Rasch model analyses adduced a fitting item pool, after the deletion of 39 items. The resulting item difficulty parameters were used for the comparison of the different formats. The multiple choice format “1 of 6” differs significantly from “x of 5”, with a relative effect of 1.63, while the multiple choice format “x of 5” does not significantly differ from the free response format. Therefore, the lower degree of difficulty of items with the “1 of 6” multiple choice format is an indicator of relevant guessing effects. In contrast, the “x of 5” multiple choice format can be seen as an appropriate substitute for the free response format.
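The guessing argument in this abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a blind guesser who picks uniformly at random among all admissible marking patterns, and the function names are hypothetical. Whether an "x of 5" item may have zero correct options is not stated in the abstract, so both cases are exposed through a parameter.

```python
from fractions import Fraction


def guess_prob_single_choice(n_options: int) -> Fraction:
    """Chance of a blind guess being scored correct when exactly one
    of n_options is correct and exactly one option is marked."""
    return Fraction(1, n_options)


def guess_prob_multiple_mark(n_options: int, allow_none_correct: bool = False) -> Fraction:
    """Chance of a blind guess being scored correct when any subset of
    options may be correct and the whole pattern must match.

    Assumption: the guesser picks uniformly among all admissible
    marking patterns (2**n, minus the empty pattern unless allowed).
    """
    patterns = 2 ** n_options
    if not allow_none_correct:
        patterns -= 1  # exclude "nothing marked"
    return Fraction(1, patterns)


p_1of6 = guess_prob_single_choice(6)
p_xof5 = guess_prob_multiple_mark(5)
print(f"1 of 6: {p_1of6} (about {float(p_1of6):.3f})")
print(f"x of 5: {p_xof5} (about {float(p_xof5):.3f})")
```

Under these assumptions the chance of a lucky guess is roughly five times smaller for "x of 5" than for "1 of 6", which is consistent in direction with the reported difference in item difficulty.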
Liegl, Gregor; Gandek, Barbara; Fischer, H Felix; Bjorner, Jakob B; Ware, John E; Rose, Matthias; Fries, James F; Nolte, Sandra
Physical function (PF) is a core patient-reported outcome domain in clinical trials in rheumatic diseases. Frequently used PF measures have ceiling effects, leading to large sample size requirements and low sensitivity to change. In most of these instruments, the response category that indicates the highest PF level is the statement that one is able to perform a given physical activity without any limitations or difficulty. This study investigates whether using an item format with an extended response scale, allowing respondents to state that the performance of an activity is easy or very easy, increases the range of precise measurement of self-reported PF. Three five-item PF short forms were constructed from the Patient-Reported Outcomes Measurement Information System (PROMIS®) wave 1 data. All forms included the same physical activities but varied in item stem and response scale: format A ("Are you able to …"; "without any difficulty"/"unable to do"); format B ("Does your health now limit you …"; "not at all"/"cannot do"); format C ("How difficult is it for you to …"; "very easy"/"impossible"). Each short-form item was answered by 2217-2835 subjects. We evaluated unidimensionality and estimated a graded response model for the 15 short-form items and remaining 119 items of the PROMIS PF bank to compare item and test information for the short forms along the PF continuum. We then used simulated data for five groups with different PF levels to illustrate differences in scoring precision between the short forms using different item formats. Sufficient unidimensionality of all short-form items and the original PF item bank was supported. Compared to formats A and B, format C increased the range of reliable measurement by about 0.5 standard deviations on the positive side of the PF continuum of the sample, provided more item information, and was more useful in distinguishing known groups with above-average functioning. Using an item format with an extended
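As a rough illustration of why an extended response scale can add precision at the upper end of the continuum, the following sketch computes category probabilities and item information under Samejima's graded response model. The discrimination and threshold values are hypothetical and chosen only for illustration; they are not PROMIS parameter estimates, and the function names are assumptions.

```python
import math


def _cumulative(theta: float, a: float, thresholds: list[float]) -> list[float]:
    # P(X >= k) for each boundary, padded with 1.0 (lowest) and 0.0 (highest).
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_category_probs(theta: float, a: float, thresholds: list[float]) -> list[float]:
    """Category probabilities: differences of adjacent cumulative curves."""
    cum = _cumulative(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def grm_item_information(theta: float, a: float, thresholds: list[float]) -> float:
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = _cumulative(theta, a, thresholds)
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += dp * dp / p
    return info


# Hypothetical item: same activity, with and without an extra "very easy" threshold.
base_thresholds = [-2.0, -1.0, 0.0]
extended_thresholds = [-2.0, -1.0, 0.0, 1.5]
theta_high = 2.0  # a respondent with above-average functioning

info_base = grm_item_information(theta_high, 2.0, base_thresholds)
info_ext = grm_item_information(theta_high, 2.0, extended_thresholds)
print(f"information at theta={theta_high}: base={info_base:.3f}, extended={info_ext:.3f}")
```

With these made-up parameters, the extra upper threshold raises item information for high-functioning respondents, which mirrors the direction of the effect reported for format C.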
Large-scale state assessment programs use both multiple-choice and open-ended items on tests for accountability purposes. Certainly, there is an intuitive belief among some educators and policy makers that open-ended items measure something different than multiple-choice items. This study examined two item formats in custom-built, standards-based tests of achievement in Reading and Mathematics at grades 3-8. In this paper, we raise questions about the value of including open-ended items, given scoring costs, time constraints, and the higher probability of missing data from test-takers.
Learning progressions are used to describe how students' understanding of a topic progresses over time and to classify the progress of students into steps or levels. This study applies Item Response Theory (IRT) based methods to investigate how to design learning progression-based science assessments. The research questions of this study are: (1) how to use items in different formats to classify students into levels on the learning progression, (2) how to design a test to give good information about students' progress through the learning progression of a particular construct and (3) what characteristics of test items support their use for assessing students' levels. Data used for this study were collected from 1500 elementary and secondary school students during 2009-2010. The written assessment was developed in several formats, namely Constructed Response (CR), Ordered Multiple Choice (OMC) and Multiple True or False (MTF) items. The main findings of this study are as follows. The OMC, MTF and CR items might measure different components of the construct. A single construct explained most of the variance in students' performances. However, additional dimensions in terms of item format can explain a certain amount of the variance in student performance. Additional dimensions therefore need to be considered when we want to capture the differences in students' performances on different types of items targeting the understanding of the same underlying progression. Items in each item format need to be improved in certain ways to classify students more accurately into the learning progression levels. This study establishes some general steps that can be followed to design other learning progression-based tests as well. For example, first, the boundaries between levels on the IRT scale can be defined by using the means of the item thresholds across a set of good items. Second, items in multiple formats can be selected to achieve the information criterion at all
Mixed-format tests containing both multiple-choice (MC) items and constructed-response (CR) items are now widely used in many testing programs. Mixed-format tests often are considered to be superior to tests containing only MC items although the use of multiple item formats leads to measurement challenges in the context of equating conducted under…
Liou, Pey-Yan; Bulut, Okan
The purpose of this study was to examine eighth-grade students' science performance in terms of two test design components, item format, and cognitive domain. The portion of Taiwanese data came from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. The item difficulty analysis was initially applied to show the proportion of correct items. A regression-based cumulative link mixed modeling (CLMM) approach was further utilized to estimate the impact of item format, cognitive domain, and their interaction on the students' science scores. The results of the proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that the reasoning cognitive domain items were more difficult compared to the items in the applying and knowing domains. In terms of the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and the interaction term were included together, the directions and magnitudes of the predictors on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirical-based evidence for test developers, teachers, and stakeholders to be aware of the differential function of item format, cognitive domain, and their interaction in students' science performance.
Glas, Cornelis A.W.; Eggen, T.J.H.M.; Veldkamp, B.P.
Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a selected-response format. This chapter presents a short overview of how item response theory and generalizability theory were integrated to model such assessments. Further, the precision of the esti...
Haberman, Shelby J; Sinharay, Sandip; Chon, Kyong Hee
Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
Automatic item generation is the process of using item models to produce assessment tasks using computer technology. An item model is similar to a template that highlights the elements in the task that must be manipulated to produce new items. The purpose of our study is to describe an innovative method for generating large numbers of diverse and heterogeneous items along with their solutions and associated rationales to support formative feedback. We demonstrate the method by generating items in two diverse content areas, mathematics and nonverbal reasoning
Wan, Lei; Henly, George A.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats--the figural response (FR) and constructed response (CR) formats used in a K-12 computerized…
Wang, Wen-Chung; Chen, Hui-Fang; Jin, Kuan-Yu
Many scales contain both positively and negatively worded items. Reverse recoding of negatively worded items might not be enough for them to function as positively worded items do. In this study, we commented on the drawbacks of existing approaches to wording effect in mixed-format scales and used bi-factor item response theory (IRT) models to…
Wang, Wen-Chung; Shih, Ching-Lin
Three multiple indicators-multiple causes (MIMIC) methods, namely, the standard MIMIC method (M-ST), the MIMIC method with scale purification (M-SP), and the MIMIC method with a pure anchor (M-PA), were developed to assess differential item functioning (DIF) in polytomous items. In a series of simulations, it appeared that all three methods…
Guenole, Nigel; Brown, Anna A; Cooper, Andrew J
This article describes an investigation of whether Thurstonian item response modeling is a viable method for assessment of maladaptive traits. Forced-choice responses from 420 working adults to a broad-range personality inventory assessing six maladaptive traits were considered. The Thurstonian item response model's fit to the forced-choice data was adequate, while the fit of a counterpart item response model to responses to the same items but arranged in a single-stimulus design was poor. Monotrait heteromethod correlations indicated corresponding traits in the two formats overlapped substantially, although they did not measure equivalent constructs. A better goodness of fit and higher factor loadings for the Thurstonian item response model, coupled with a clearer conceptual alignment to the theoretical trait definitions, suggested that the single-stimulus item responses were influenced by biases that the independent clusters measurement model did not account for. Researchers may wish to consider forced-choice designs and appropriate item response modeling techniques such as Thurstonian item response modeling for personality questionnaire applications in industrial psychology, especially when assessing maladaptive traits. We recommend further investigation of this approach in actual selection situations and with different assessment instruments.
Assessing difference between classical test theory and item response theory methods in scoring primary four multiple choice objective test items. ... All research participants were ranked on the CTT number correct scores and the corresponding IRT item pattern scores from their performance on the PRISMADAT. Wilcoxon ...
Uto, Masaki; Ueno, Maomi
As an assessment method based on a constructivist approach, peer assessment has become popular in recent years. However, in peer assessment, a problem remains that reliability depends on the rater characteristics. For this reason, some item response models that incorporate rater parameters have been proposed. Those models are expected to improve…
Bruce, Bonnie; Fries, James F; Ambrosini, Debbie; Lingala, Bharathi; Gandek, Barbara; Rose, Matthias; Ware, John E
Physical function is a key component of patient-reported outcome (PRO) assessment in rheumatology. Modern psychometric methods, such as Item Response Theory (IRT) and Computerized Adaptive Testing, can materially improve measurement precision at the item level. We present the qualitative and quantitative item-evaluation process for developing the Patient Reported Outcomes Measurement Information System (PROMIS) Physical Function item bank. The process was stepwise: we searched extensively to identify extant Physical Function items and then classified and selectively reduced the item pool. We evaluated retained items for content, clarity, relevance and comprehension, reading level, and translation ease by experts and patient surveys, focus groups, and cognitive interviews. We then assessed items by using classic test theory and IRT, used confirmatory factor analyses to estimate item parameters, and graded response modeling for parameter estimation. We retained the 20 Legacy (original) Health Assessment Questionnaire Disability Index (HAQ-DI) and the 10 SF-36's PF-10 items for comparison. Subjects were from rheumatoid arthritis, osteoarthritis, and healthy aging cohorts (n = 1,100) and a national Internet sample of 21,133 subjects. We identified 1,860 items. After qualitative and quantitative evaluation, 124 newly developed PROMIS items composed the PROMIS item bank, which included revised Legacy items with good fit that met IRT model assumptions. Results showed that the clearest and best-understood items were simple, in the present tense, and straightforward. Basic tasks (like dressing) were more relevant and important versus complex ones (like dancing). Revised HAQ-DI and PF-10 items with five response options had higher item-information content than did comparable original Legacy items with fewer response options. IRT analyses showed that the Physical Function domain satisfied general criteria for unidimensionality with one-, two-, three-, and four-factor models
Tay, Louis; Huang, Qiming; Vermunt, Jeroen K.
In large-scale testing, the use of multigroup approaches is limited for assessing differential item functioning (DIF) across multiple variables as DIF is examined for each variable separately. In contrast, the item response theory with covariate (IRT-C) procedure can be used to examine DIF across multiple variables (covariates) simultaneously. To…
Trotman-Dickenson, D. I.
Describes some of the problems in writing data response items in economics for use by A Level and General Certificate of Secondary Education (GCSE) students. Examines the experience of two series of workshops on writing items, evaluating them and assessing responses from schools. Offers suggestions for producing packages of data response items as…
Missouri State Dept. of Elementary and Secondary Education, Jefferson City.
This assessment sample provides information on the Missouri Assessment Program (MAP) for grade 10 science. The sample consists of six items taken from the test booklet and scoring guides for the six items. The items assess ecosystems, mechanics, and data analysis. (MM)
Fleming, Danielle; Wilson, Mark; Ahlgrim-Delzell, Lynn
The Nonverbal Literacy Assessment (NVLA) is a literacy assessment designed for students with significant intellectual disabilities. The 218-item test was initially examined using confirmatory factor analysis. This method showed that the test worked as expected, but the items loaded onto a single factor. This article uses item response theory to…
Background: Previous studies have analyzed the psychometric properties of the World Health Organization Disability Assessment Schedule II (WHO-DAS II) using classical omnibus measures of scale quality. These analyses are sample dependent and do not model item responses as a function of the underlying trait level. The main objective of this study was to examine the effectiveness of the WHO-DAS II items and their options in discriminating between changes in the underlying disability level by means of item response analyses. We also explored differential item functioning (DIF) in men and women. Methods: The participants were 3615 adult general practice patients from 17 regions of Spain, with a first diagnosed major depressive episode. The 12-item WHO-DAS II was administered by the general practitioners during the consultation. We used a non-parametric item response method (kernel smoothing), implemented with the TestGraf software, to examine the effectiveness of each item (item characteristic curves) and their options (option characteristic curves) in discriminating between changes in the underlying disability level. We examined composite DIF to determine whether women had a higher probability than men of endorsing each item. Results: Item response analyses indicated that the twelve items forming the WHO-DAS II perform very well. All items were determined to provide good discrimination across varying standardized levels of the trait. The items also had option characteristic curves that showed good discrimination, given that each increasing option became more likely than the previous as a function of increasing trait level. No gender-related DIF was found on any of the items. Conclusions: All WHO-DAS II items were very good at assessing overall disability. Our results supported the appropriateness of the weights assigned to response option categories and showed an absence of gender differences in item functioning.
Feinberg, Richard A; Clauser, Amanda L
In graduate medical education, assessment results can effectively guide professional development when both assessment and feedback support a formative model. When individuals cannot directly access the test questions and responses, a way of using assessment results formatively is to provide item keyword feedback. The purpose of the following study was to investigate whether exposure to item keyword feedback aids in learner remediation. Participants included 319 trainees who completed a medical subspecialty in-training examination (ITE) in 2012 as first-year fellows, and then 1 year later in 2013 as second-year fellows. Performance on 2013 ITE items in which keywords were, or were not, exposed as part of the 2012 ITE score feedback was compared across groups based on the amount of time studying (preparation). For the same items common to both 2012 and 2013 ITEs, response patterns were analyzed to investigate changes in answer selection. Test takers who indicated greater amounts of preparation on the 2013 ITE did not perform better on the items in which keywords were exposed compared to those who were not exposed. The response pattern analysis substantiated overall growth in performance from the 2012 ITE. For items with incorrect responses on both attempts, examinees selected the same option 58% of the time. Results from the current study were unsuccessful in supporting the use of item keywords in aiding remediation. Unfortunately, the results did provide evidence of examinees retaining misinformation.
Cobern, William W.; Schuster, David; Adams, Betty; Skjold, Brandy Ann; Zeynep Muğaloğlu, Ebru; Bentz, Amy; Sparks, Kelly
A critical aspect of teacher education is gaining pedagogical content knowledge of how to teach science for conceptual understanding. Given the time limitations of college methods courses, it is difficult to touch on more than a fraction of the science topics potentially taught across grades K-8, particularly in the context of relevant pedagogies. This research and development work centers on constructing a formative assessment resource to help expose pre-service teachers to a greater number of science topics within teaching episodes using various modes of instruction. To this end, 100 problem-based, science pedagogy assessment items were developed via expert group discussions and pilot testing. Each item contains a classroom vignette followed by response choices carefully crafted to include four basic pedagogies (didactic direct, active direct, guided inquiry, and open inquiry). The brief but numerous items allow a substantial increase in the number of science topics that pre-service students may consider. The intention is that students and teachers will be able to share and discuss particular responses to individual items, or else record their responses to collections of items and thereby create a snapshot profile of their teaching orientations. Subsets of items were piloted with students in pre-service science methods courses, and the quantitative results of student responses were spread sufficiently to suggest that the items can be effective for their intended purpose.
Betancourt, Theresa S; Yang, Frances; Bolton, Paul; Normand, Sharon-Lise
This study aimed to refine a dimensional scale for measuring psychosocial adjustment in African youth using item response theory (IRT). A 60-item scale derived from qualitative data was administered to 667 war-affected adolescents (55% female). Exploratory factor analysis (EFA) determined the dimensionality of items based on goodness-of-fit indices. Items with loadings less than 0.4 were dropped. Confirmatory factor analysis (CFA) was used to confirm the scale's dimensionality found under the EFA. Item discrimination and difficulty were estimated using a graded response model for each subscale using weighted least squares means and variances. Predictive validity was examined through correlations between IRT scores (θ) for each subscale and ratings of functional impairment. All models were assessed using goodness-of-fit and comparative fit indices. Fisher's Information curves examined item precision at different underlying ranges of each trait. Original scale items were optimized and reconfigured into an empirically-robust 41-item scale, the African Youth Psychosocial Assessment (AYPA). Refined subscales assess internalizing and externalizing problems, prosocial attitudes/behaviors and somatic complaints without medical cause. The AYPA is a refined dimensional assessment of emotional and behavioral problems in African youth with good psychometric properties. Validation studies in other cultures are recommended. Copyright © 2014 John Wiley & Sons, Ltd.
Timmerby, Nina; Cosci, Fiammetta; Watson, Maggie; Csillag, Claudio; Schmitt, Florence; Steck, Barbara; Bech, Per; Thastum, Mikael
The Family Assessment Device (FAD) is a 60-item questionnaire widely used to evaluate self-reported family functioning. However, the factor structure as well as the number of items has been questioned. A shorter and more user-friendly version of the original FAD-scale, the 36-item FAD, has therefore previously been proposed, based on findings in a nonclinical population of adults. We aimed in this study to evaluate the brief 36-item version of the FAD in a clinical population. Data from a European multinational study, examining factors associated with levels of family functioning in adult cancer patients' families, were used. Both healthy and ill parents completed the 60-item version FAD. The psychometric analyses conducted were Principal Component Analysis and Mokken-analysis. A total of 564 participants were included. Based on the psychometric analysis we confirmed that the 36-item version of the FAD has robust psychometric properties and can be used in clinical populations. The present analysis confirmed that the 36-item version of the FAD (18 items assessing 'well-being' and 18 items assessing 'dysfunctional' family function) is a brief scale where the summed total score is a valid measure of the dimensions of family functioning. This shorter version of the FAD is, in accordance with the concept of 'measurement-based care', an easy to use scale that could be considered when the aim is to evaluate self-reported family functioning.
Byun, Yosep; Seong, Joohyun; Kim, Mingi; Park, Kyunghan; Yoon, Hyungkoo
In recent years, typhoons and localized extreme rainfall caused by abnormal climate conditions have increased in Korea. Accordingly, debris flow is becoming one of the most dangerous natural disasters. This study aimed to develop assessment items that can be used for investigating debris-flow damage. The Delphi method was applied to classify the realms of assessment items. As a result, 29 assessment items, classified into 6 groups, were determined.
Mallinckrodt, Brent; Tekie, Yacob T
The Working Alliance Inventory (WAI) has made great contributions to psychotherapy research. However, studies suggest the 7-point response format and 3-factor structure of the client version may have psychometric problems. This study used Rasch item response theory (IRT) to (a) improve WAI response format, (b) compare two brief 12-item versions (WAI-sr; WAI-s), and (c) develop a new 16-item Brief Alliance Inventory (BAI). Archival data from 1786 counseling center and community clients were analyzed. IRT findings suggested problems with crossed category thresholds. A rescoring scheme that combines neighboring responses to create 5- and 4-point scales sharply reduced these problems. Although subscale variance was reduced by 11-26%, rescoring yielded improved reliability and generally higher correlations with therapy process (session depth and smoothness) and outcome measures (residual gain symptom improvement). The 16-item BAI was designed to maximize "bandwidth" of item difficulty and preserve a broader range of WAI sensitivity than WAI-s or WAI-sr. Comparisons suggest the BAI performed better in several respects than the WAI-s or WAI-sr and equivalent to the full WAI on several performance indicators.
Cunningham, Timothy J.; Berkman, Lisa F.; Gortmaker, Steven L.; Kiefe, Catarina I.; Jacobs, David R.; Seeman, Teresa E.; Kawachi, Ichiro
The psychometric properties of instruments used to measure self-reported experiences of discrimination in epidemiologic studies are rarely assessed, especially regarding construct validity. The authors used 2000–2001 data from the Coronary Artery Risk Development in Young Adults (CARDIA) Study to examine differential item functioning (DIF) in 2 versions of the Experiences of Discrimination (EOD) Index, an index measuring self-reported experiences of racial/ethnic and gender discrimination. DIF may confound interpretation of subgroup differences. Large DIF was observed for 2 of 7 racial/ethnic discrimination items: White participants reported more racial/ethnic discrimination for the “at school” item, and black participants reported more racial/ethnic discrimination for the “getting housing” item. The large DIF by race/ethnicity in the index for racial/ethnic discrimination probably reflects item impact and is the result of valid group differences between blacks and whites regarding their respective experiences of discrimination. The authors also observed large DIF by race/ethnicity for 3 of 7 gender discrimination items. This is more likely to have been due to item bias. Users of the EOD Index must consider the advantages and disadvantages of DIF adjustment (omitting items, constructing separate measures, and retaining items). The EOD Index has substantial usefulness as an instrument that can assess self-reported experiences of discrimination. PMID:22038104
Tulsky, David S; Kisala, Pamela A; Victorson, David; Choi, Seung W; Gershon, Richard; Heinemann, Allen W; Cella, David
To develop a comprehensive, psychometrically sound, and conceptually grounded patient reported outcomes (PRO) measurement system for individuals with spinal cord injury (SCI). Individual interviews (n=44) and focus groups (n=65 individuals with SCI and n=42 SCI clinicians) were used to select key domains for inclusion and to develop PRO items. Verbatim items from other cutting-edge measurement systems (i.e. PROMIS, Neuro-QOL) were included to facilitate linkage and cross-population comparison. Items were field tested in a large sample of individuals with traumatic SCI (n=877). Dimensionality was assessed with confirmatory factor analysis. Local item dependence and differential item functioning were assessed, and items were calibrated using the item response theory (IRT) graded response model. Finally, computer adaptive tests (CATs) and short forms were administered in a new sample (n=245) to assess test-retest reliability and stability. A calibration sample of 877 individuals with traumatic SCI across five SCI Model Systems sites and one Department of Veterans Affairs medical center completed SCI-QOL items in interview format. We developed 14 unidimensional calibrated item banks and 3 calibrated scales across physical, emotional, and social health domains. When combined with the five Spinal Cord Injury--Functional Index physical function banks, the final SCI-QOL system consists of 22 IRT-calibrated item banks/scales. Item banks may be administered as CATs or short forms. Scales may be administered in a fixed-length format only. The SCI-QOL measurement system provides SCI researchers and clinicians with a comprehensive, relevant and psychometrically robust system for measurement of physical-medical, physical-functional, emotional, and social outcomes. All SCI-QOL instruments are freely available on Assessment CenterSM.
Pearson, Jennifer L; Hitchman, Sara C; Brose, Leonie S; Bauld, Linda; Glasser, Allison M; Villanti, Andrea C; McNeill, Ann; Abrams, David B; Cohen, Joanna E
A consistent approach using standardised items to assess e-cigarette use in both youth and adult populations will aid cross-survey and cross-national comparisons of the effect of e-cigarette (and tobacco) policies and improve our understanding of the population health impact of e-cigarette use. Focusing on adult behaviour, we propose a set of e-cigarette use items, discuss their utility and potential adaptation, and highlight e-cigarette constructs that researchers should avoid without further item development. Reliable and valid items will strengthen the emerging science and inform knowledge synthesis for policy-making. Building on informal discussions at a series of international meetings of 65 experts from 15 countries, the authors provide recommendations for assessing e-cigarette use behaviour, relative perceived harm, device type, presence of nicotine, flavours and reasons for use. We recommend items assessing eight core constructs: e-cigarette ever use, frequency of use and former daily use; relative perceived harm; device type; primary flavour preference; presence of nicotine; and primary reason for use. These items should be standardised or minimally adapted for the policy context and target population. Researchers should be prepared to update items as e-cigarette device characteristics change. A minimum set of e-cigarette items is proposed to encourage consensus around items to allow for cross-survey and cross-jurisdictional comparisons of e-cigarette use behaviour. These proposed items are a starting point. We recognise room for continued improvement, and welcome input from e-cigarette users and scientific colleagues. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Weidmer, Beverly A; Brach, Cindy; Hays, Ron D
The complexity of health information often exceeds patients' skills to understand and use it. The aim was to develop survey items assessing how well healthcare providers communicate health information. Domains and items for the Consumer Assessment of Healthcare Providers and Systems (CAHPS) Item Set for Addressing Health Literacy were identified through an environmental scan and input from stakeholders. The draft item set was translated into Spanish and pretested in both English and Spanish. The revised item set was field tested with a randomly selected sample of adult patients from 2 sites using mail and telephone data collection. Item-scale correlations, confirmatory factor analysis, and internal consistency reliability estimates were computed to assess how well the survey items performed and to identify composite measures. Finally, we regressed the CAHPS global rating of the provider item on the CAHPS core communication composite and the new health literacy composites. A total of 601 completed surveys were obtained (52% response rate). Two composite measures were identified: (1) Communication to Improve Health Literacy (16 items); and (2) How Well Providers Communicate About Medicines (6 items). These 2 composites were significantly uniquely associated with the global rating of the provider (communication to improve health literacy: PLiteracy composite accounted for 90% of the variance of the original 16-item composite. This study provides support for the reliability and validity of the CAHPS Item Set for Addressing Health Literacy. These items can serve to assess whether healthcare providers have communicated effectively with their patients and as a tool for quality improvement.
Watanabe, Yusuke; Madani, Amin; Ito, Yoichi M; Bilgic, Elif; McKendy, Katherine M; Feldman, Liane S; Fried, Gerald M; Vassiliou, Melina C
The extent to which each item assessed using the Global Operative Assessment of Laparoscopic Skills (GOALS) contributes to the total score remains unknown. The purpose of this study was to evaluate the level of difficulty and discriminative ability of each of the 5 GOALS items using item response theory (IRT). A total of 396 GOALS assessments for a variety of laparoscopic procedures over a 12-year time period were included. Threshold parameters of item difficulty and discrimination power were estimated for each item using IRT. The higher slope parameters seen with "bimanual dexterity" and "efficiency" are indicative of greater discriminative ability than "depth perception", "tissue handling", and "autonomy". IRT psychometric analysis indicates that the 5 GOALS items do not demonstrate uniform difficulty and discriminative power, suggesting that they should not be scored equally. "Bimanual dexterity" and "efficiency" seem to have stronger discrimination. Weighted scores based on these findings could improve the accuracy of assessing individual laparoscopic skills. Copyright © 2016 Elsevier Inc. All rights reserved.
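To make the roles of slope and threshold parameters concrete, here is a minimal sketch of category probabilities for one rating item. The abstract does not name the IRT model used; the graded response model, the function name `grm_category_probs`, and all parameter values below are assumptions chosen for illustration only.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a graded response model.

    theta      -- latent skill of the person being rated
    a          -- slope (discrimination) parameter of the item
    thresholds -- increasing threshold parameters b_1 < ... < b_{K-1}
    Returns a list of K probabilities, one per rating category.
    """
    # Cumulative probabilities P(X >= k), padded with 1.0 and 0.0 at the ends.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical 5-category items with identical thresholds but different slopes.
thresholds = [-2.0, -1.0, 0.0, 1.0]
low_slope = grm_category_probs(0.5, 0.8, thresholds)
high_slope = grm_category_probs(0.5, 2.5, thresholds)
```

With the same thresholds, the higher-slope item concentrates probability on fewer categories, which is one way to picture why such an item separates trainees of different skill more sharply.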
Lebedeva, Elena; Huang, Mei; Koski, Lisa
The Montreal Cognitive Assessment (MoCA) is a screening tool for mild cognitive impairment (MCI) in elderly individuals. We hypothesized that measurement error when using the new alternate MoCA versions to monitor change over time could be related to the use of items that are not of comparable difficulty to their corresponding originals of similar content. The objective of this study was to compare the difficulty of the alternate MoCA items to the original ones. Five selected items from alternate versions of the MoCA were included with items from the original MoCA administered adaptively to geriatric outpatients (N = 78). Rasch analysis was used to estimate the difficulty level of the items. None of the five items from the alternate versions matched the difficulty level of their corresponding original items. This study demonstrates the potential benefits of a Rasch analysis-based approach for selecting items during the process of development of parallel forms. The results suggest that better match of the items from different MoCA forms by their difficulty would result in higher sensitivity to changes in cognitive function over time.
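The consequence of mismatched item difficulty can be sketched with the dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty. The function name `rasch_prob` and the difficulty values are hypothetical and are not taken from the study.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model.

    theta -- person ability in logits
    b     -- item difficulty in logits
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical original item and a harder alternate-form item.
original_difficulty = 0.0
alternate_difficulty = 1.0

# The same person (no true change in ability) on both versions.
p_original = rasch_prob(0.0, original_difficulty)
p_alternate = rasch_prob(0.0, alternate_difficulty)
```

An unchanged person is less likely to pass the harder alternate item, so swapping forms can look like decline even when ability is stable.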
LeBouthillier, Daniel M; Thibodeau, Michel A; Alberts, Nicole M; Hadjistavropoulos, Heather D; Asmundson, Gordon J G
Individuals with medical conditions are likely to have elevated health anxiety; however, research has not demonstrated how medical status impacts response patterns on health anxiety measures. Measurement bias can undermine the validity of a questionnaire by overestimating or underestimating scores in groups of individuals. We investigated whether the Short Health Anxiety Inventory (SHAI), a widely-used measure of health anxiety, exhibits medical condition-based bias on item and subscale levels, and whether the SHAI subscales adequately assess the health anxiety continuum. Data were from 963 individuals with diabetes, breast cancer, or multiple sclerosis, and 372 healthy individuals. Mantel-Haenszel tests and item characteristic curves were used to classify the severity of item-level differential item functioning in all three medical groups compared to the healthy group. Test characteristic curves were used to assess scale-level differential item functioning and whether the SHAI subscales adequately assess the health anxiety continuum. Nine out of 14 items exhibited differential item functioning. Two items exhibited differential item functioning in all medical groups compared to the healthy group. In both Thought Intrusion and Fear of Illness subscales, differential item functioning was associated with mildly deflated scores in medical groups with very high levels of the latent traits. Fear of Illness items poorly discriminated between individuals with low and very low levels of the latent trait. While individuals with medical conditions may respond differentially to some items, clinicians and researchers can confidently use the SHAI with a variety of medical populations without concern of significant bias. Copyright © 2015 Elsevier Inc. All rights reserved.
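As a rough illustration of the Mantel-Haenszel approach to item-level DIF, the sketch below computes the common odds ratio across matched score strata and converts it to the delta scale. It assumes a dichotomised item, which simplifies the multi-category SHAI items; the function names and the counts are hypothetical.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched strata.

    strata -- list of (a, b, c, d) counts per matched score level:
              a = reference group endorsing, b = reference group not endorsing,
              c = focal group endorsing,     d = focal group not endorsing.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        num += a * d / n
        den += b * c / n
    return num / den

def mh_delta(odds_ratio):
    """Delta-scale DIF effect size: -2.35 * ln(odds ratio)."""
    return -2.35 * math.log(odds_ratio)

# Hypothetical counts in which both groups have equal odds in every stratum.
no_dif = [(10, 10, 5, 5), (20, 10, 10, 5)]
alpha_mh = mantel_haenszel_odds_ratio(no_dif)
```

An odds ratio near 1 (delta near 0) indicates that, after matching on total score, the two groups respond to the item alike.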
Gu, Chenyang; Gutman, Roee
The assessment of patients' functional status across the continuum of care requires a common patient assessment tool. However, assessment tools that are used in various health care settings differ and cannot be easily contrasted. For example, the Functional Independence Measure (FIM) is used to evaluate the functional status of patients who stay in inpatient rehabilitation facilities, the Minimum Data Set (MDS) is collected for all patients who stay in skilled nursing facilities, and the Outcome and Assessment Information Set (OASIS) is collected if they choose home health care provided by home health agencies. All three instruments or questionnaires include functional status items, but the specific items, rating scales, and instructions for scoring different activities vary between the different settings. We consider equating different health assessment questionnaires as a missing data problem, and propose a variant of predictive mean matching method that relies on Item Response Theory (IRT) models to impute unmeasured item responses. Using real data sets, we simulated missing measurements and compared our proposed approach to existing methods for missing data imputation. We show that, for all of the estimands considered, and in most of the experimental conditions that were examined, the proposed approach provides valid inferences, and generally has better coverages, relatively smaller biases, and shorter interval estimates. The proposed method is further illustrated using a real data set. © 2016, The International Biometric Society.
Alsadaawi, Abdullah Saleh
The Saudi National Assessment Centre administers the Computer Science Teacher Test for teacher certification. The aim of this study is to explore gender differences in candidates' scores, and investigate dimensionality, reliability, and differential item functioning using confirmatory factor analysis and item response theory. The confirmatory…
This study examines the degree of acquiescence present when the item and response formats of a summated rating scale are varied. It is often recommended that acquiescence response bias in rating scales may be controlled by using both positively and negatively worded items. Such items are generally worded in the Likert-type format of statements. The purpose of the study was to establish whether items in question format would result in a smaller degree of acquiescence than items worded as statements. The response format was also varied (five- and seven-point options) to determine whether this would influence the reliability and degree of acquiescence in the scales. A twenty-item Locus of Control (LC) questionnaire was used, but each item was complemented by its opposite, resulting in 40 items. The subjects, divided randomly into two groups, were second-year students who had to complete four versions of the questionnaire, plus a shortened version of Bass's scale for measuring acquiescence. The LC versions were questions or statements, each combined with a five- or seven-point response format. Partial counterbalancing was introduced by testing on two separate occasions, presenting the tests to the two groups in the opposite order. The degree of acquiescence was assessed by correlating the items with their opposites, and by correlating scores on each version with scores on the acquiescence questionnaire. No major differences were found between the various item and response formats in relation to acquiescence. Summary This investigation was carried out to determine whether the degree of acquiescence is influenced by the item and response format of a summated self-rating scale. It is often recommended that using both positively and negatively worded items in a questionnaire limits acquiescence. Such items are usually formulated as statements in the traditional Likert format. The aim of the investigation was to determine whether items…
Messinger, H B; Messinger, M I
Recently in this journal Peters and Murphy challenged the validity of factor analyses done on bimodal handedness data, suggesting instead that right- and left-handers be studied separately. But bimodality may be avoidable if attention is paid to Oldfield's questionnaire format and instructions for the subjects. Two characteristics appear crucial: a two-column LEFT-RIGHT format for the body of the instrument and what we call Oldfield's Admonition: not to indicate strong preference for a handedness item, such as write, unless "... the preference is so strong that you would never try to use the other hand unless absolutely forced to...". Attaining unimodality of an item distribution would seem to overcome the objections of Peters and Murphy. In a 1984 survey in Boston we used Oldfield's ten-item questionnaire exactly as published. This produced unimodal item distributions. With reflection of the five-point item scale and a logarithmic transformation, we achieved a degree of normalization for the items. Two surveys elsewhere, based on Oldfield's 20-item list but with changes in the questionnaire format and the instructions, yielded markedly different item distributions with peaks at each extreme and sometimes in the middle as well.
Ortega, Javier Virues; Iwata, Brian A.; Nogales-Gonzalez, Celia; Frades, Belen
We conducted 2 studies on reinforcer preference in patients with dementia. Results of preference assessments yielded differential selections by 14 participants. Unlike prior studies with individuals with intellectual disabilities, all participants showed a noticeable preference for leisure items over edible items. Results of a subsequent analysis…
The test of relational reasoning (TORR) is designed to assess the ability to identify complex patterns within visuospatial stimuli. The TORR is designed for use in school and university settings, and therefore, its measurement invariance across diverse groups is critical. In this investigation, a large sample, representative of a major university on key demographic variables, was collected, and the resulting data were analyzed using a multi-group, multidimensional item-response theory model-comparison procedure. No significant differential item functioning was found on any of the TORR items across any of the demographic groups of interest. This finding is interpreted as evidence of the cultural fairness of the TORR, and potential test-development choices that may have contributed to that cultural fairness are discussed.
MAHR, ALFRED D.; NEOGI, TUHINA; LAVALLEY, MICHAEL P.; DAVIS, JOHN C.; HOFFMAN, GARY S.; MCCUNE, W. JOSEPH; SPECKS, ULRICH; SPIERA, ROBERT F.; ST.CLAIR, E. WILLIAM; STONE, JOHN H.; MERKEL, PETER A.
Objective: To assess the Birmingham Vasculitis Activity Score for Wegener's Granulomatosis (BVAS/WG) with respect to its selection and weighting of items. Methods: This study used the BVAS/WG data from the Wegener's Granulomatosis Etanercept Trial. The scoring frequencies of the 34 predefined items and any “other” items added by clinicians were calculated. Using linear regression with generalized estimating equations in which the physician global assessment (PGA) of disease activity was the dependent variable, we computed weights for all predefined items. We also created variables for clinical manifestations frequently added as other items, and computed weights for these as well. We searched for the model that included the items and their generated weights yielding an activity score with the highest R² to predict the PGA. Results: We analyzed 2,044 BVAS/WG assessments from 180 patients; 734 assessments were scored during active disease. The highest R² with the PGA was obtained by scoring WG activity based on the following items: the 25 predefined items rated on ≥5 visits, the 2 newly created fatigue and weight loss variables, the remaining minor other and major other items, and a variable that signified whether new or worse items were present at a specific visit. The weights assigned to the items ranged from 1 to 21. Compared with the original BVAS/WG, this modified score correlated significantly more strongly with the PGA. Conclusion: This study suggests possibilities to enhance the item selection and weighting of the BVAS/WG. These changes may increase this instrument's ability to capture the continuum of disease activity in WG. PMID: 18512722
Karl W. Kosko
Quantitative Literacy (QL) has been described as the skill set an individual uses when interacting with the world in a quantitative manner. A necessary component of this interaction is communication. To this end, assessments of QL have included open-ended items as a means of including communicative aspects of QL. The present study sought to examine whether such open-ended items typically measured aspects of quantitative communication, as compared to mathematical communication, or mathematical skills. We focused on publicly released items and rubrics from four of the most widely referenced assessments: the Third International Mathematics and Science Study (TIMSS-95); the National Adult Literacy Survey (NALS; now the National Assessment of Adult Literacy, NAAL) in 1985 and 1992; the International Adult Literacy Skills (IALS) beginning in 1994; and the Program for International Student Assessment (PISA) beginning in 2000. We found that open-ended item rubrics in these QL assessments showed a strong tendency to assess answer-only responses. Therefore, while some open-ended items may have required certain levels of quantitative reasoning to find a solution, it is the solution rather than the reasoning that was often assessed.
Cappelleri, Joseph C.; Lundy, J. Jason; Hays, Ron D.
Introduction: The U.S. Food and Drug Administration’s patient-reported outcome (PRO) guidance document defines content validity as “the extent to which the instrument measures the concept of interest” (FDA, 2009, p. 12). “Construct validity is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity” (Strauss & Smith, 2009, p. 7). Hence both qualitative and quantitative information are essential in evaluating the validity of measures. Methods: We review classical test theory and item response theory approaches to evaluating PRO measures including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized “difficulty” (severity) order of items is represented by observed responses. Conclusion: Classical test theory and item response theory can be useful in providing a quantitative assessment of items and scales during the content validity phase of patient-reported outcome measures. Depending on the particular type of measure and the specific circumstances, either one or both approaches should be considered to help maximize the content validity of PRO measures. PMID: 24811753
Petscher, Yaacov; Mitchell, Alison M.; Foorman, Barbara R.
A growing body of literature suggests that response latency, the amount of time it takes an individual to respond to an item, may be an important factor to consider when using assessment data to estimate the ability of an individual. Considering that tests of passage and list fluency are being adapted to a computer administration format, it is…
Objectives: Students who perceived their learning environment positively are more likely to develop effective learning strategies, and adopt a deep learning approach. Currently, there is no validated instrument for measuring the educational environment of educational programs on respiratory care (RC). The aim of this study was to develop an instrument to measure students' perception of the RC educational environment. Materials and Methods: Based on the literature review and an assessment of content validity by multiple focus groups of RC educationalists, potential items of the instrument relevant to the RC educational environment construct were generated by the research group. The initial 71-item questionnaire was then field-tested on all students from the 3 RC programs in Saudi Arabia and was subjected to multi-trait scaling analysis. Cronbach's alpha was used to assess internal consistency reliabilities. Results: Two hundred and twelve students (100%) completed the survey. The initial instrument of 71 items was reduced to 65 across 5 scales. Convergent and discriminant validity assessment demonstrated that the majority of items correlated more highly with their intended scale than a competing one. Cronbach's alpha exceeded the standard criterion of >0.70 in all scales except one. There was no floor or ceiling effect for scale or overall score. Conclusions: This instrument is the first assessment tool developed to measure the RC educational environment. There was evidence of its good feasibility, validity, and reliability. This first validation of the instrument supports its use by RC students to evaluate educational environment.
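For readers unfamiliar with the internal consistency statistic reported above, the following is a minimal sketch of Cronbach's alpha computed from a respondents-by-items score matrix. The function name `cronbach_alpha` and the toy data are illustrative assumptions, not material from the study.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores.

    rows -- list of equal-length lists; rows[i][j] is respondent i's
            score on item j (at least two items and two respondents).
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: three items that rank four respondents identically.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(toy_scores)
```

Perfectly consistent items give an alpha of 1; the abstract's criterion of >0.70 is applied to each scale's alpha computed in this way from the real responses.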
Item response theory (IRT) is a framework for modeling and analyzing item response data. Item-level modeling gives IRT advantages over classical test theory. The fit of an item score pattern to an IRT model is a necessary condition that must be assessed for further use of item and models that best fit ...
Siskind, Theresa G.; Anderson, Lorin W.
The study was designed to examine the similarity of response options generated by different item writers using a systematic approach to item writing. The similarity of response options to student responses for the same item stems presented in an open-ended format was also examined. A non-systematic (subject matter expertise) approach and a…
Cheung, Felix; Lucas, Richard E
The present paper assessed the validity of single-item life satisfaction measures by comparing single-item measures to the Satisfaction with Life Scale (SWLS), a more psychometrically established measure. Two large samples from Washington (N = 13,064) and Oregon (N = 2,277) recruited by the Behavioral Risk Factor Surveillance System and a representative German sample (N = 1,312) recruited by the Germany Socio-Economic Panel were included in the present analyses. Single-item life satisfaction measures and the SWLS were correlated with theoretically relevant variables, such as demographics, subjective health, domain satisfaction, and affect. The correlations between the two life satisfaction measures and these variables were examined to assess the construct validity of single-item life satisfaction measures. Consistent across three samples, single-item life satisfaction measures demonstrated a substantial degree of criterion validity with the SWLS (zero-order r = 0.62-0.64; disattenuated r = 0.78-0.80). Patterns of statistical significance for correlations with theoretically relevant variables were the same across single-item measures and the SWLS. Single-item measures did not produce systematically different correlations compared to the SWLS (average difference = 0.001-0.005). The average absolute difference in the magnitudes of the correlations produced by single-item measures and the SWLS was very small (average absolute difference = 0.015-0.042). Single-item life satisfaction measures performed very similarly compared to the multiple-item SWLS. Social scientists would get virtually identical answers to substantive questions regardless of which measure they use.
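The gap between zero-order and disattenuated correlations follows from the classical correction for attenuation, which divides the observed correlation by the square root of the product of the two measures' reliabilities. The sketch below uses hypothetical reliabilities; the abstract does not report the values the authors used.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures.

    r_xy  -- observed (zero-order) correlation
    rel_x -- reliability of measure x, in (0, 1]
    rel_y -- reliability of measure y, in (0, 1]
    """
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values chosen only to show the direction of the correction.
corrected = disattenuate(0.6, 0.8, 0.8)
```

Because reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed one, which is why the disattenuated range in the abstract sits above the zero-order range.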
Biswas, Shubho Subrata; Jain, Vaishali; Agrawal, Vandana; Bindra, Maninder
Small group sessions are regarded as a more active and student-centered approach to learning. Item analysis provides objective evidence of whether such sessions improve comprehension and make the topic easier for students, in addition to assessing the relative benefit of the sessions to good versus poor performers. Self-assessment makes students aware of their deficiencies. Small group sessions can also help students develop the ability to self-assess. This study was carried out to assess the effect of small group sessions on item analysis and students' self-assessment. A total of 21 female and 29 male first-year medical students participated in a small group session on topics covered by didactic lectures two weeks earlier. It was preceded and followed by two multiple choice question (MCQ) tests, in which students were asked to self-assess their likely score. The MCQs used had been item-analyzed in a previous group and were chosen to have matching difficulty and discriminatory indices for the pre- and post-tests. The small group session improved the marks of both genders equally, but female performance was better. The session made the items easier, increasing the difficulty index significantly, but there was no significant alteration in the discriminatory index. There was overestimation in the self-assessment of both genders, but male overestimation was greater. The session improved the self-assessment of students in terms of expected marks and expectation of passing. The small group session improved the ability of students to self-assess their knowledge and increased the difficulty index of items, reflecting students' better performance.
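The two indices discussed above can be sketched in a few lines. Here the difficulty index is the proportion of students answering an item correctly (so a higher value means an easier item, matching the abstract's usage), and the discrimination index is the difference in that proportion between top and bottom scorers. The 27% group size and the function name `item_indices` are assumptions; the abstract does not state how the authors formed the groups.

```python
def item_indices(responses, item, fraction=0.27):
    """Classical difficulty and discrimination indices for one MCQ item.

    responses -- list of per-student lists of 0/1 item scores
    item      -- column index of the item to analyse
    fraction  -- share of students in the upper and lower groups (assumed)
    Returns (difficulty_index, discrimination_index).
    """
    n = len(responses)
    # Proportion of all students who answered the item correctly.
    difficulty = sum(r[item] for r in responses) / n
    # Rank students by total score, best first.
    ranked = sorted(responses, key=sum, reverse=True)
    g = max(1, round(n * fraction))
    upper = sum(r[item] for r in ranked[:g]) / g
    lower = sum(r[item] for r in ranked[-g:]) / g
    return difficulty, upper - lower

# Toy data: four students, three items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
difficulty, discrimination = item_indices(toy, 0)
```

Comparing these two numbers for matched items before and after a teaching session is the kind of contrast the study reports.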
Smith, Clifton L.; And Others
This document contains duties and tasks, multiple-choice test items, and other assessment techniques for Missouri's advanced marketing core curriculum. The core curriculum begins with a list of 13 suggested textbook resources. Next, nine duties with their associated tasks are given. Under each task appears one or more citations to appropriate…
Jose Luis Justícia (1), Eva Baró (2), Victoria Cardona (3), Pedro Guardia (4), Pedro Ojeda (5), José Maria Olaguíbel (6), José Maria Vega (7), Carmen Vidal (8)
(1) Medical Department, Stallergenes Ibérica, Barcelona, Spain; (2) Health Outcomes Research Department, 3D Health Research, Barcelona, Spain; (3) Hospital Vall d'Hebron, Barcelona, Spain; (4) Hospital Virgen Macarena, Sevilla, Spain; (5) Clínica de Asma y Alergia Dres. Ojeda, Madrid, Spain; (6) Complejo Hospitalario de Navarra, Pamplona, Spain; (7) Hospital Regional Universitario Carlos Haya, Málaga, Spain; (8) Complejo Hospitalario Universitario de Santiago, Santiago de Compostela, Spain
Background: Allergen-specific immunotherapy (SIT) is a treatment capable of modifying the natural course of allergy, so ensuring good adherence to SIT is fundamental. Until now, no instrument has been specifically developed to measure patient satisfaction with SIT, although its assessment could help us to better understand and improve treatment adherence and effectiveness. The aim of this study was to develop an instrument to measure adult patient satisfaction with SIT. Methods: Items were generated from a literature review, focus groups with allergic adult patients undergoing SIT, and a meeting with experts. Potential items were administered to allergic patients undergoing SIT in an observational, cross-sectional, multicenter study. Item reduction was based on quantitative and qualitative criteria. A preliminary assessment of feasibility, reliability, and validity of the retained items was performed. Results: An initial pool of 70 items was administered to 257 patients undergoing SIT. Fifty-four items were eliminated, resulting in a provisional instrument with 16 items. Factor analysis yielded four factors that were identified as perceived efficacy, activities and environment, cost-benefit balance, and overall satisfaction, explaining 74.8% of variance. Ceiling and floor effects were negligible for overall score. Overall score was…
Cheung, Felix; Lucas, Richard E.
Purpose: The present paper assessed the validity of single-item life satisfaction measures by comparing single-item measures to the Satisfaction with Life Scale (SWLS), a more psychometrically established measure. Methods: Two large samples from Washington (N=13,064) and Oregon (N=2,277) recruited by the Behavioral Risk Factor Surveillance System (BRFSS) and a representative German sample (N=1,312) recruited by the Germany Socio-Economic Panel (GSOEP) were included in the present analyses. Single-item life satisfaction measures and the SWLS were correlated with theoretically relevant variables, such as demographics, subjective health, domain satisfaction, and affect. The correlations between the two life satisfaction measures and these variables were examined to assess the construct validity of single-item life satisfaction measures. Results: Consistent across three samples, single-item life satisfaction measures demonstrated a substantial degree of criterion validity with the SWLS (zero-order r = 0.62 – 0.64; disattenuated r = 0.78 – 0.80). Patterns of statistical significance for correlations with theoretically relevant variables were the same across single-item measures and the SWLS. Single-item measures did not produce systematically different correlations compared to the SWLS (average difference = 0.001 – 0.005). The average absolute difference in the magnitudes of the correlations produced by single-item measures and the SWLS was very small (average absolute difference = 0.015 – 0.042). Conclusions: Single-item life satisfaction measures performed very similarly compared to the multiple-item SWLS. Social scientists would get virtually identical answers to substantive questions regardless of which measure they use. PMID: 24890827
The article provides an overview of goodness-of-fit assessment methods for item response theory (IRT) models. It is now possible to obtain accurate "p"-values of the overall fit of the model if bivariate information statistics are used. Several alternative approaches are described. As the validity of inferences drawn on the fitted model…
Cappelleri, Joseph C; Jason Lundy, J; Hays, Ron D
The US Food and Drug Administration's guidance for industry document on patient-reported outcomes (PRO) defines content validity as "the extent to which the instrument measures the concept of interest" (FDA, 2009, p. 12). According to Strauss and Smith (2009), construct validity "is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity" (p. 7). Hence, both qualitative and quantitative information are essential in evaluating the validity of measures. We review classical test theory and item response theory (IRT) approaches to evaluating PRO measures, including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized "difficulty" (severity) order of items is represented by observed responses. If a researcher has few qualitative data and wants to get preliminary information about the content validity of the instrument, then descriptive assessments using classical test theory should be the first step. As the sample size grows during subsequent stages of instrument development, confidence in the numerical estimates from Rasch and other IRT models (as well as those of classical test theory) would also grow. Classical test theory and IRT can be useful in providing a quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, the classical test theory and/or the IRT should be considered to help maximize the content validity of PRO measures. Copyright © 2014 Elsevier HS Journals, Inc. All rights reserved.
Doolittle, Allen E.; Cleary, T. Anne
Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Signed measures of differential item performance (DIP) were obtained for each item in the eight ACTM forms. DIP estimates were analyzed and a significant item category effect was found. (Author/LMO)
Fries, James F; Witter, James; Rose, Matthias; Cella, David; Khanna, Dinesh; Morgan-DeWitt, Esi
Patient-reported outcome (PRO) questionnaires record health information directly from research participants because observers may not accurately represent the patient perspective. Patient-reported Outcomes Measurement Information System (PROMIS) is a US National Institutes of Health cooperative group charged with bringing PRO to a new level of precision and standardization across diseases by item development and use of item response theory (IRT). With IRT methods, improved items are calibrated on an underlying concept to form an item bank for a "domain" such as physical function (PF). The most informative items can be combined to construct efficient "instruments" such as 10-item or 20-item PF static forms. Each item is calibrated on the basis of the probability that a given person will respond at a given level, and the ability of the item to discriminate people from one another. Tailored forms may cover any desired level of the domain being measured. Computerized adaptive testing (CAT) selects the best items to sharpen the estimate of a person's functional ability, based on prior responses to earlier questions. PROMIS item banks have been improved with experience from several thousand items, and are calibrated on over 21,000 respondents. In areas tested to date, PROMIS PF instruments are superior or equal to Health Assessment Questionnaire and Medical Outcome Study Short Form-36 Survey legacy instruments in clarity, translatability, patient importance, reliability, and sensitivity to change. Precise measures, such as PROMIS, efficiently incorporate patient self-report of health into research, potentially reducing research cost by lowering sample size requirements. The advent of routine IRT applications has the potential to transform PRO measurement.
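The calibration and adaptive selection described above can be sketched in a few lines. This is a minimal illustration only, assuming a dichotomous two-parameter logistic (2PL) model; PROMIS items are polytomous and the abstract does not specify the model used. The item names, parameter values, and function names are hypothetical.

```python
import math

def prob_2pl(theta, a, b):
    """Probability of a positive response under the 2PL model.

    theta: person's level on the domain, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Return the not-yet-administered item with maximum information at theta."""
    candidates = [name for name in bank if name not in administered]
    return max(candidates, key=lambda name: item_information(theta, *bank[name]))

# Hypothetical item bank: name -> (discrimination a, difficulty b)
bank = {
    "walk_block": (1.2, -1.0),
    "climb_stairs": (1.8, 0.0),
    "run_mile": (1.5, 1.5),
}

# With a provisional estimate of theta = 0.2, pick the most informative remaining item.
next_item = select_next_item(theta=0.2, bank=bank, administered={"walk_block"})
```

Maximum-information selection is only one of several possible CAT selection rules; it is used here because it is the simplest to state.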
Khadka, Jyoti; McAlinden, Colm; Gothwal, Vijaya K; Lamoureux, Ecosse L; Pesudovs, Konrad
To investigate the effect of rating scale designs (question formats and response categories) on item difficulty calibrations and assess the impact that rating scale differences have on overall vision-related activity limitation (VRAL) scores. Sixteen existing patient-reported outcome instruments (PROs) suitable for cataract assessment, with different rating scales, were self-administered by patients on a cataract surgery waiting list. A total of 226 VRAL items from these PROs in their native rating scales were included in an item bank and calibrated using Rasch analysis. Fifteen item/content areas (e.g., reading newspapers) appearing in at least three different PROs were identified. Within each content area, item calibrations were compared and their range calculated. Similarly, five PROs having at least three items in common with the Visual Function (VF-14) were compared in terms of average item measures. A total of 614 patients (mean age ± SD, 74.1 ± 9.4 years) participated. Items with the same content varied in their calibration by as much as two logits; "reading the small print" had the largest range (1.99 logits) followed by "watching TV" (1.60). Compared with the VF-14 (0.00 logits), the rating scale of the Visual Disability Assessment (1.13 logits) produced the most difficult items and the Cataract Symptom Scale (0.24 logits) produced the least difficult items. The VRAL item bank was suboptimally targeted to the ability level of the participants (2.00 logits). Rating scale designs have a significant effect on item calibrations. Therefore, constructing item banks from existing items in their native formats carries risks to face validity and transmission of problems inherent in existing instruments, such as poor targeting.
Glynn, Shawn M.
The Trends in International Mathematics and Science Study (TIMSS) is a comparative assessment of the achievement of students in many countries. In the present study, a rigorous independent evaluation was conducted of a representative sample of TIMSS science test items because item quality influences the validity of the scores used to inform…
Linn, Marcia C.; de Benedictis, Tina; Delucchi, Kevin; Harris, Abigail; Stage, Elizabeth
The National Assessment of Educational Progress Science Assessment has consistently revealed small gender differences on science content items but not on science inquiry items. This assessment differs from others in that respondents can choose "I don't know" rather than guessing. This paper examines explanations for the gender differences including (a) differential prior instruction, (b) differential response to uncertainty and use of the "I don't know" response, (c) differential response to figurally presented items, and (d) different attitudes towards science. Of these possible explanations, the first two received support. Females are more likely to use the "I don't know" response, especially for items with physical science content or masculine themes such as football. To ameliorate this situation we need more effective science instruction and more gender-neutral assessment items.
Wind, Stefanie A
The concept of invariant measurement is typically associated with Rasch measurement theory (Engelhard, 2013). Concerned with the appropriateness of the parametric transformation upon which the Rasch model is based, Mokken (1971) proposed a nonparametric procedure for evaluating the quality of social science measurement that is theoretically and empirically related to the Rasch model. Mokken's nonparametric procedure can be used to evaluate the quality of dichotomous and polytomous items in terms of the requirements for invariant measurement. Despite these potential benefits, the use of Mokken scaling to examine the properties of multiple-choice (MC) items in education has not yet been fully explored. A nonparametric approach to evaluating MC items is promising in that this approach facilitates the evaluation of assessments in terms of invariant measurement without imposing potentially inappropriate transformations. Using Rasch-based indices of measurement quality as a frame of reference, data from an eighth-grade physical science assessment are used to illustrate and explore Mokken-based techniques for evaluating the quality of MC items. Implications for research and practice are discussed.
Yau, David T W; Wong, May C M; Lam, K F; McGrath, Colman
Four-factor structure of the two 8-item short forms of Child Perceptions Questionnaire CPQ11-14 (RSF:8 and ISF:8) has been confirmed. However, the sum scores are typically reported in practice as a proxy of Oral health-related Quality of Life (OHRQoL), which implied a unidimensional structure. This study first assessed the unidimensionality of 8-item short forms of CPQ11-14. Item response theory (IRT) was employed to offer an alternative and complementary approach of validation and to overcome the limitations of classical test theory assumptions. A random sample of 649 12-year-old school children in Hong Kong was analyzed. Unidimensionality of the scale was tested by confirmatory factor analysis (CFA), principal component analysis (PCA) and local dependency (LD) statistic. Graded response model was fitted to the data. Contribution of each item to the scale was assessed by item information function (IIF). Reliability of the scale was assessed by test information function (TIF). Differential item functioning (DIF) across gender was identified by Wald test and expected score functions. Both CPQ11-14 RSF:8 and ISF:8 did not deviate much from the unidimensionality assumption. Results from CFA indicated acceptable fit of the one-factor model. PCA indicated that the first principal component explained >30 % of the total variation with high factor loadings for both RSF:8 and ISF:8. Almost all LD statistic items suggesting little contribution of information to the scale and item removal caused little practical impact. Comparing the TIFs, RSF:8 showed slightly better information than ISF:8. In addition to oral symptoms items, the item "Concerned with what other people think" demonstrated a uniform DIF (p Items related to oral symptoms were not informative to OHRQoL and deletion of these items is suggested. The impact of DIF across gender on the overall score was minimal. CPQ11-14 RSF:8 performed slightly better than ISF:8 in measurement precision. The 6-item short forms
Jabrayilov, Ruslan; Emons, Wilco H. M.; Sijtsma, Klaas
Clinical psychologists are advised to assess clinical and statistical significance when assessing change in individual patients. Individual change assessment can be conducted using either the methodologies of classical test theory (CTT) or item response theory (IRT). Researchers have been optimistic
Composite assessments aim to combine different aspects of a disease in a single score and are utilized in a variety of therapeutic areas. The data arising from these evaluations are inherently discrete with distinct statistical properties. This tutorial presents the framework of the item response theory (IRT) for the analysis of this data type in a pharmacometric context. The article considers both conceptual (terms and assumptions) and practical questions (modeling software, data requirements, and model building). PMID:29493119
Galindo-Garre, Francisca; Hidalgo, María Dolores; Guilera, Georgina; Pino, Oscar; Rojo, J Emilio; Gómez-Benito, Juana
The World Health Organization Disability Assessment Schedule II (WHO-DAS II) is a multidimensional instrument developed for measuring disability. It comprises six domains (getting around, self-care, getting along with others, life activities and participation in society). The main purpose of this paper is the evaluation of the psychometric properties for each domain of the WHO-DAS II with parametric and non-parametric Item Response Theory (IRT) models. A secondary objective is to assess whether the WHO-DAS II items within each domain form a hierarchy of invariantly ordered severity indicators of disability. A sample of 352 patients with a schizophrenia spectrum disorder is used in this study. The 36-item WHO-DAS II was administered during the consultation. Partial Credit and Mokken scale models are used to study the psychometric properties of the questionnaire. The psychometric properties of the WHO-DAS II scale are satisfactory for all the domains. However, we identify a few items that do not discriminate satisfactorily between different levels of disability and cannot be invariantly ordered in the scale. In conclusion, the WHO-DAS II can be used to assess overall disability in patients with schizophrenia, but some domains are too general to assess functionality in these patients because they contain items that are not applicable to this pathology. Copyright © 2014 John Wiley & Sons, Ltd.
Kurimoto, Shigeru; Suzuki, Mikako; Yamamoto, Michiro; Okui, Nobuyuki; Imaeda, Toshihiko; Hirata, Hitoshi
The purpose of this study is to develop a short and valid measure for upper extremity disorders and to assess the effect of attached illustrations in item reduction of a self-administered disability questionnaire while retaining psychometric properties. A validated questionnaire used to assess upper extremity disorders, the Hand20, was reduced to ten items using two item-reduction techniques. The psychometric properties of the abbreviated form, the Hand10, were evaluated on an independent sample that was used for the shortening process. Validity, reliability, and responsiveness of the Hand10 were retained in the item reduction process. It was possible that the use of explanatory illustrations attached to the Hand10 helped with its reproducibility. The illustrations for the Hand10 promoted text comprehension and motivation to answer the items. These changes resulted in high acceptability; more than 99.3% of patients, including 98.5% of elderly patients, could complete the Hand10 properly. The illustrations had favorable effects on the item reduction process and made it possible to retain precision of the instrument. The Hand10 is a reliable and valid instrument for individual-level applications with the advantage of being compact and broadly applicable, even in elderly individuals.
Pearson, Jennifer L; Hitchman, Sara C; Brose, Leonie S; Bauld, Linda; Glasser, Allison M; Villanti, Andrea C; McNeill, Ann; Abrams, David B; Cohen, Joanna E
Background: A consistent approach using standardized items to assess e-cigarette use in both youth and adult populations will aid cross-survey and cross-national comparisons of the effect of e-cigarette (and tobacco) policies and improve our understanding of the population health impact of e-cigarette use. Focusing on adult behavior, we propose a set of e-cigarette use items, discuss their utility and potential adaptation, and highlight e-cigarette constructs that researchers should avoid wit...
Oude Voshaar, Martijn A H; Ten Klooster, Peter M; Vonkeman, Harald E; van de Laar, Mart A F J
Traditional patient-reported physical function instruments often poorly differentiate patients with mild-to-moderate disability. We describe the development and psychometric evaluation of a generic item bank for measuring everyday activity limitations in outpatient populations. Seventy-two items generated from patient interviews and mapped to the International Classification of Functioning, Disability and Health (ICF) domestic life chapter were administered to 1128 adults representative of the Dutch population. The partial credit model was fitted to the item responses and evaluated with respect to its assumptions, model fit, and differential item functioning (DIF). Measurement performance of a computerized adaptive testing (CAT) algorithm was compared with the SF-36 physical functioning scale (PF-10). A final bank of 41 items was developed. All items demonstrated acceptable fit to the partial credit model and measurement invariance across age, sex, and educational level. Five- and ten-item CAT simulations were shown to have high measurement precision, which exceeded that of SF-36 physical functioning scale across the physical function continuum. Floor effects were absent for a 10-item empirical CAT simulation, and ceiling effects were low (13.5%) compared with SF-36 physical functioning (38.1%). CAT also discriminated better than SF-36 physical functioning between age groups, number of chronic conditions, and respondents with or without rheumatic conditions. The Rasch assessment of everyday activity limitations (REAL) item bank will hopefully prove a useful instrument for assessing everyday activity limitations. T-scores obtained using derived measures can be used to benchmark physical function outcomes against the general Dutch adult population.
Myszkowski, Nils; Storme, Martin; Tavani, Jean-Louis
Because of their length and objective of broad content coverage, very short scales can show limited internal consistency and structural validity. We argue that it is because their objectives may be better aligned with formative investigations than with reflective measurement methods that capitalize on content overlap. As proofs of concept of formative investigations of short scales, we investigate the Ten Item Personality Inventory (TIPI). In Study 1, we administered the TIPI and the Big Five Inventory (BFI) to 938 adults, and fitted a formative Multiple Indicator Multiple Causes model, which consisted of the TIPI items forming 5 latent variables, which in turn predicted the 5 BFI scores. These results were replicated in Study 2, on a sample of 759 adults, with, this time, the Revised NEO Personality Inventory (NEO-PI-R) as the external criterion. The models fit the data adequately, and moderate to strong significant effects (.37<|β|<.69, all p<.001) of all 5 latent formative variables on their corresponding BFI and NEOPI-R scores were observed. This study presents a formative approach that we propose to be more consistent with the aims of scales with broad content and short length like the TIPI. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.
Hill, D A; Guinea, A I; McCarthy, W H
An educator's view would be that formative assessment has an important role in the learning process. This study was carried out to obtain a student perspective of the place of formative assessment in the curriculum. Final-year medical students at Royal Prince Alfred Hospital took part in four teaching sessions, each structured to integrate teaching with assessment. Three assessment methods were used; the group objective structured clinical examination (G-OSCE), structured short answer (SSA) questions and a pre/post-test multiple choice questionnaire (MCQ). Teaching sessions were conducted on the subject areas of traumatology, the 'acute abdomen', arterial disorders and cancer. Fifty-five students, representing 83% of those who took part in the programme, responded to a questionnaire where they were asked to rate (on a 5-point Likert scale) their response to general questions about formative assessment and 13 specific questions concerning the comparative value of the three assessment modalities. Eighty-nine per cent of respondents felt that formative assessment should be incorporated into the teaching process. The SSA assessment was regarded as the preferred modality to reinforce previous teaching and test problem-solving skills. The MCQ was the least favoured assessment method. The effect size variable between the total scores for the SSA and MCQ was 0.64. The variable between G-OSCE and SSA/MCQ was 0.26 and 0.33 respectively. Formative assessment is a potentially powerful method to direct learning behaviour. Students should have input into the methods used.
Arbuckle, Robert; Clark, Marci; Harness, Jane; Bonner, Nicola; Scott, Jane; Draelos, Zoe; Rizer, Ronald; Yeh, Yating; Copley-Merriman, Kati
Developed using focus groups, the Oily Skin Self Assessment Scale (OSSAS) and Oily Skin Impact Scale (OSIS) are patient-reported outcome measures of oily facial skin. The aim of this study was to finalize the item-scale structure of the instruments and perform psychometric validation in adults with self-reported oily facial skin. The OSSAS and OSIS were administered to 202 adult subjects with oily facial skin in the United States. A subgroup of 152 subjects returned, 4 to 10 days later, for test–retest reliability evaluation. Of the 202 participants, 72.8% were female; 64.4% had self-reported nonsevere acne. Item reduction resulted in a 14-item OSSAS with Sensation (five items), Tactile (four items) and Visual (four items) domains, a single blotting item, and an overall oiliness item. The OSIS was reduced to two three-item domains assessing Annoyance and Self-Image. Confirmatory factor analysis supported the construct validity of the final item-scale structures. The OSSAS and OSIS scales had acceptable item convergent validity (item-scale correlations >0.40) and floor and ceiling effects (skin severity (P skin (P skin), as assessments of self-reported oily facial skin severity and its emotional impact, respectively.
Brese, Falk, Ed.
The goal for selecting the released set of test items was to have approximately 25% of each of the full item sets for mathematics content knowledge (MCK) and mathematics pedagogical content knowledge (MPCK) that would represent the full range of difficulty, content, and item format used in the TEDS-M study. The initial step in the selection was to…
Lou, Meei-Fang; Dai, Yu-Tzu; Huang, Guey-Shiun; Yu, Po-Jui
The purpose of the study was to identify the most efficient items from the Mini-Mental State Examination for assessment of cognitive function. The Mini-Mental State Examination is the most frequently used cognitive screening instrument. However, the Mini-Mental State Examination has been criticized for insensitivity to mild cognitive dysfunction, limited memory assessment and variability in level of difficulty of the individual items. This study used secondary data analysis. Item response theory two-parameter model was used to analyse the data from the admission assessment of mental status by the Mini-Mental State Examination for 801 patients. By using item response analysis, 16 items were selected from the original 30-item Mini-Mental State Examination. The 16 items included mainly the measures of orientation, recall and attention and calculation. The internal consistency of the 16-item Mini-Mental State Examination was 0.84. The proposed new cut-off point for the 16-item Mini-Mental State Examination was 11. The correct classification rate was 0.94, the sensitivity was 100% and the specificity was 97.4%, when compared with the original 30-item Mini-Mental State Examination from the cut-off point of 24. This new cut-off point was determined for the purpose of over-identifying patients at risk so as to ensure early detection of and prevention from the onset of cognitive disturbance. Only a few items are needed to describe the subject's cognitive status. Using item response theory analysis, the study found that the Mini-Mental State Examination could be simplified. Deleting the items with less variation makes this assessment tool not only shorter, easier to administer and less strenuous for respondents, but also enables one to maintain validity as a cognitive function test for clinical setting.
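The comparison of a shortened screen against the full instrument rests on simple classification arithmetic, which can be sketched as follows. This is an illustration under stated assumptions: scores strictly below a cut-off are treated as a positive screen (the abstract does not say whether the cut-offs are inclusive), the full 30-item score is taken as the reference, and the score lists and the function name `screening_metrics` are hypothetical, not data from the study.

```python
def screening_metrics(short_scores, full_scores, short_cutoff, full_cutoff):
    """Compare a short-form screen against the full-form reference.

    Assumption: a score strictly below the cut-off counts as a positive screen.
    Returns (sensitivity, specificity, correct classification rate).
    """
    tp = fp = tn = fn = 0
    for short, full in zip(short_scores, full_scores):
        reference_positive = full < full_cutoff
        test_positive = short < short_cutoff
        if reference_positive and test_positive:
            tp += 1
        elif reference_positive:
            fn += 1
        elif test_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical paired scores for six patients (16-item form, 30-item form).
short_scores = [15, 10, 12, 9, 16, 10]
full_scores = [28, 20, 26, 18, 29, 25]

sensitivity, specificity, accuracy = screening_metrics(
    short_scores, full_scores, short_cutoff=11, full_cutoff=24
)
```

A cut-off chosen to over-identify patients at risk, as the abstract describes, trades some specificity for higher sensitivity; the last hypothetical patient shows one such false positive.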
Chon, Kyong Hee; Lee, Won-Chan; Dunbar, Stephen B.
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed format data. The model fit indices used in this study include PARSCALE's G[superscript 2], Orlando and Thissen's S-X[superscript 2] and S-G[superscript 2], and Stone's chi[superscript 2*] and G[superscript 2*]. To investigate the…
van Diggelen, M.R.; Morgan, C.M.; Funk, M.; Bruns, M.
Formative assessment is a valuable aspect in teaching and learning, and is proven to be an effective learning method. There is evidence that adding formative assessment to your teaching increases students’ learning results (Black and Wiliam, 1998), but in practice many of the possibilities are left
Evans, Robert Harry; Clesham, Rose; Dolin, Jens
This chapter examines three different classroom teacher perspectives when using ASSIST-ME project formative assessment methods as described in the introductory chapter. The first ‘teacher perspective’ is about changes in teacher self-efficacies while using formative assessment methods as monitored by a pre- and post-teacher questionnaire. Teachers who tried the unfamiliar formative methods of assessment (see introductory book chapter for these methods) as well as their colleagues who did not were surveyed. The second ‘teacher perspective’ examines changes in teachers’ subjective theories while trying project-specific formative assessment methods in Czech Republic. Analyses are done through case studies and interviews. The final part of the chapter looks at teacher perspectives while using an Internet-based application to facilitate formative assessment. The teacher use of the application...
Assessment is an integral part of society and education, and for this reason it is important to know what you measure. This thesis is about explanatory item response modelling of an abstract reasoning assessment, with the objective to create a modern test design framework for automatic generation of valid and precalibrated items of abstract reasoning. Modern test design aims to strengthen the connections between the different components of a test, with a stress on strong theory, systematic it...
Item response theory (IRT) is a framework for modeling and analyzing item response ... data. Though, there is an argument that the evaluation of fit in IRT modeling has been ... National Council on Measurement in Education ... model data fit should be based on three types of ... prediction should be assessed through the.
Gierl, Mark J; Lai, Hollis; Turner, Simon R
Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these are efficient in measuring examinees' knowledge and skills across diverse content areas, multiple-choice items are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created the demand for large numbers of multiple-choice items. Our current approaches to item development cannot meet this demand. We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and we illustrate this approach by generating multiple-choice items for a medical licensure test in the content area of surgery. To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1248 multiple-choice items from one item model. Automatic item generation is a process that involves using models to generate items using computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines the content to generate new test items. By combining these outcomes, items can be generated automatically. © Blackwell Publishing Ltd 2012.
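The third stage, in which software combines content from an item model, can be illustrated with a small sketch. It assumes an item model is a stem template with substitutable elements and that every combination of element values is admissible; the stem, the element values, and the function name `generate_items` are hypothetical, and a real system would also apply the constraints from the cognitive model and generate the answer options.

```python
from itertools import product

# Hypothetical item model: a stem template plus the values each element may take.
stem = (
    "A {age}-year-old patient presents with {symptom} after {event}. "
    "What is the best next step?"
)
elements = {
    "age": [25, 60],
    "symptom": ["abdominal pain", "shortness of breath"],
    "event": ["surgery", "a fall"],
}

def generate_items(stem, elements):
    """Fill the stem with every combination of element values."""
    keys = list(elements)
    combos = product(*(elements[key] for key in keys))
    return [stem.format(**dict(zip(keys, combo))) for combo in combos]

# Two values for each of three elements give 2 * 2 * 2 = 8 item stems.
items = generate_items(stem, elements)
```

The multiplicative growth in the number of combinations is what allows a single item model to yield a large number of items, such as the 1248 reported in the abstract.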
van der Kleij, Fabienne
Formative assessment concerns any assessment that provides feedback that is intended to support learning and can be used by teachers and/or students. Computers could offer a solution to overcoming obstacles encountered in implementing formative assessment. For example, computer-based assessments
Mueller, Evelyn A; Bengel, Juergen; Wirtz, Markus A
This study aimed to develop a self-description assessment instrument to measure work performance in patients with musculoskeletal diseases. In terms of the International Classification of Functioning, Disability and Health (ICF), work performance is defined as the degree of meeting the work demands (activities) at the actual workplace (environment). To account for the fact that work performance depends on the work demands of the job, we strived to develop item banks that allow a flexible use of item subgroups depending on the specific work demands of the patients' jobs. Item development included the collection of work tasks from literature and content validation through expert surveys and patient interviews. The resulting 122 items were answered by 621 patients with musculoskeletal diseases. Exploratory factor analysis to ascertain dimensionality and Rasch analysis (partial credit model) for each of the resulting dimensions were performed. Exploratory factor analysis resulted in four dimensions, and subsequent Rasch analysis led to the following item banks: 'impaired productivity' (15 items), 'impaired cognitive performance' (18), 'impaired coping with stress' (13) and 'impaired physical performance' (low physical workload 20 items, high physical workload 10 items). The item banks exhibited person separation indices (reliability) between 0.89 and 0.96. The assessment of work performance adds the activities component to the more commonly employed participation component of the ICF-model. The four item banks can be adapted to specific jobs where necessary without losing comparability of person measures, as the item banks are based on Rasch analysis.
Ida Ayu Made Sri Widiastuti
This study investigated the challenges and opportunities of formative assessment in EFL classes. It made use of qualitative research design by using in-depth interviews to collect the required data. Three teachers and three students were involved as research participants in this study and they were intensively interviewed to get valid and reliable data regarding their understanding of formative assessment and the follow-up actions they took after implementing formative assessment. The results of this study showed that the English teachers were found not to take appropriate follow-up actions due to their low understanding of formative assessment. The teachers’ understanding could influence their ability in deciding the actions. This study indicates that EFL teachers need urgent further intensive training on the appropriate implementation of formative assessment and how follow-up actions should be integrated into classroom practices.
Kosko, Karl W.; Singh, Rashmi
Multiplicative reasoning is a key concept in elementary school mathematics. Item statistics reported by the National Assessment of Educational Progress (NAEP) assessment provide the best current indicator for how well elementary students across the U.S. understand this, and other concepts. However, beyond expert reviews and statistical analysis,…
This article follows the development of test items (see "Language Assessment Quarterly", Volume 3 Issue 1, pp. 71-79 for the article "Test and Item Specifications Development"), beginning with a review of test and item specifications, then proceeding to writing and editing of items, pretesting and analysis, and finally selection of an item for a…
In usability research, the difference between formative and reflective measurement models for the assessment of latent variables has been largely ignored. As a consequence, many usability scales are misspecified. This might result in reduced scale validity because of the elimination of important usability facets within the procedure of scale development. The aim of the current study was to develop a questionnaire for the evaluation of On-line store usability (UFOS-V2) that includes both a formative and a reflective scale. 378 subjects participated in a laboratory experimental study. Each participant visited two out of 35 On-line stores. The usability and intention to buy was assessed for both stores. In addition, actual purchase behaviour was observed by combining the subjects' reward with the decision to buy. In a two-construct PLS structural equation model the formative usability scale was used as a predictor for the reflective usability measure. Results indicate that the formative usability scale UFOS-V2f forms a valid set of items for the user-based assessment of online store usability. The reflective usability scale shows high internal consistency. Positive relationships to intention and decision to buy confirm high scale validity.
Fenwick, Eva K; Pesudovs, Konrad; Khadka, Jyoti; Rees, Gwyn; Wong, Tien Y; Lamoureux, Ecosse L
We are developing an item bank assessing the impact of diabetic retinopathy (DR) on quality of life (QoL) using a rigorous multi-staged process combining qualitative and quantitative methods. We describe here the first two qualitative phases: content development and item evaluation. After a comprehensive literature review, items were generated from four sources: (1) 34 previously validated patient-reported outcome measures; (2) five published qualitative articles; (3) eight focus groups and 18 semi-structured interviews with 57 DR patients; and (4) seven semi-structured interviews with diabetes or ophthalmic experts. Items were then evaluated during 3 stages, namely binning (grouping) and winnowing (reduction) based on key criteria and panel consensus; development of item stems and response options; and pre-testing of items via cognitive interviews with patients. The content development phase yielded 1,165 unique items across 7 QoL domains. After 3 sessions of binning and winnowing, items were reduced to a minimally representative set (n = 312) across 9 domains of QoL: visual symptoms; ocular surface symptoms; activity limitation; mobility; emotional; health concerns; social; convenience; and economic. After 8 cognitive interviews, 42 items were amended resulting in a final set of 314 items. We have employed a systematic approach to develop items for a DR-specific QoL item bank. The psychometric properties of the nine QoL subscales will be assessed using Rasch analysis. The resulting validated item bank will allow clinicians and researchers to better understand the QoL impact of DR and DR therapies from the patient's perspective.
We present a definition of the concept of formative assessment and its significance for modern education. We outline how the approach has developed in foreign studies, its further development, and the associated risks and the possibilities for reducing them. We discuss some techniques and examples of formative assessment. We investigate the relationship between formative and final evaluation, including the national curriculum levels.
Background: Clinical vignettes have been used widely to compare quality of clinical care and to assess variation in practice, but the effect of different response formats has not been extensively evaluated. Our objective was to compare three clinical vignette-based survey response formats: an open-ended questionnaire (A), a closed-ended (multiple-choice) questionnaire with deceptive response items mixed with correct items (B), and a closed-ended questionnaire with only correct items (C), in rheumatologists' pre-treatment assessment for tumor-necrosis-factor (TNF) blocker therapy. Methods: Study design: Prospective randomized study. Setting: Rheumatologists attending the 2004 French Society of Rheumatology meeting. Physicians were given a vignette describing the history of a fictitious woman with active rheumatoid arthritis, who was a candidate for therapy with TNF blocking agents, and then were randomized to receive questionnaire A, B, or C, each containing the same four questions but with different response formats, that asked about their pretreatment assessment. Measurements: Long (recommended items) and short (mandatory items) checklists were developed for pretreatment assessment for TNF-blocker therapy, and scores were expressed on the basis of responses to questionnaires A, B, and C as the percentage of respondents correctly choosing explicit items on these checklists. Statistical analysis: Comparison of the selected items using pairwise Chi-square tests with Bonferroni correction for variables with statistically significant differences. Results: Data for all surveys distributed (114 As, 118 Bs, and 118 Cs) were complete and available for analysis. The percentage of questionnaire A, B, and C respondents for whom data were correctly complete for the short checklist was 50.4%, 84.0% and 95.0%, respectively, and was 0%, 5.0% and 5.9%, respectively, for the long version. As an example, 65.8%, 85.7% and 95.8% of the respondents of A, B, and C
Gierl, Mark J.; Lai, Hollis
Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content-specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer…
Background: The purpose of this study was to investigate the use of item analysis to assess objectively the quality of items on the Calgary-Cambridge Communications OSCE checklist. Methods: A total of 150 first-year medical students were provided with extensive teaching on the use of the Calgary-Cambridge Guidelines for interviewing patients and participated in a final year-end 20-minute communication OSCE station. Students were grouped into either the upper half (50%) or lower half (50%) communication skills performance groups, and discrimination, difficulty and point biserial values were calculated for each checklist item. Results: The mean score on the 33-item communication checklist was 24.09 (SD = 4.46) and the internal reliability coefficient was α = 0.77. Although most of the items were found to have moderate (k = 12, 36%) or excellent (k = 10, 30%) discrimination values, there were 6 (18%) identified as ‘fair’ and 3 (9%) as ‘poor’. A post-examination review focused on item analysis findings resulted in an increase in checklist reliability (α = 0.80). Conclusions: Item analysis has been used with MCQ exams extensively. In this study, it was also found to be an objective and practical approach to use in evaluating the quality of a standardized OSCE checklist.
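The reliability coefficient and point-biserial values mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' analysis: the function names and the 0/1 checklist layout are assumptions made for illustration, alpha is computed with the standard Cronbach formula, and the point-biserial is taken as the Pearson correlation between an item and the uncorrected total score.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (hypothetical layout).

    scores: list of rows, one row per examinee, one numeric entry per item.
    """
    k = len(scores[0])
    # Sample variance of each item across examinees.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total (sum) score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def point_biserial(scores, item):
    """Pearson correlation between one 0/1 item and the total score.

    The total here still includes the item itself (uncorrected).
    """
    x = [row[item] for row in scores]
    y = [sum(row) for row in scores]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

A post-examination review of the kind described would typically recompute alpha after dropping or revising the items with the weakest item-total correlations.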
Polikoff, Morgan S.; May, Henry; Porter, Andrew C.; Elliott, Stephen N.; Goldring, Ellen; Murphy, Joseph
The Vanderbilt Assessment of Leadership in Education is a 360-degree assessment of the effectiveness of principals' learning-centered leadership behaviors. In this report, we present results from a differential item functioning (DIF) study of the assessment. Using data from a national field trial, we searched for evidence of DIF on school level,…
Samah, Mas Norbany binti Abu; Tajudin, Nor'ain binti Mohd
Formative assessment within school-based assessment (SBA) is implemented in schools as a move to improve the National Education Assessment System (NEAS). Formative assessment focuses on assessment for learning. Mathematics teachers use various types of formative assessment instruments, namely observation, questioning protocols, worksheets and quizzes. This study aims to help teachers improve their skills in formative assessment during the teaching and learning (t&l) of Mathematics. One mathematics teacher was chosen as the study participant. Data were collected using document analysis, observation and interviews. Data were analyzed narratively, and the analysis suggests that the assessments can help teachers implement PBS. Formative assessment is conducted to improve the skills of students in t&l effectively.
Yang, Ji Seung; Zheng, Xiaying
The purpose of this article is to introduce and review the capability and performance of the Stata item response theory (IRT) package that is available from Stata v.14, 2015. Using a simulated data set and a publicly available item response data set extracted from Programme of International Student Assessment, we review the IRT package from…
While the methodology used in developing test items can vary significantly, to ensure quality examinations, test items should be developed systematically. Test design and development is discussed in the DOE Guide to Good Practices for Design, Development, and Implementation of Examinations. This guide is intended to be a supplement by providing more detailed guidance on the development of specific test items. This guide addresses the development of written examination test items primarily. However, many of the concepts also apply to oral examinations, both in the classroom and on the job. This guide is intended to be used as guidance for the classroom and laboratory instructor or curriculum developer responsible for the construction of individual test items. This document focuses on written test items, but includes information relative to open-reference (open book) examination test items, as well. These test items have been categorized as short-answer, multiple-choice, or essay. Each test item format is described, examples are provided, and a procedure for development is included. The appendices provide examples for writing test items, a test item development form, and examples of various test item formats.
Sachse, Karoline A.; Haag, Nicole
Standard errors computed according to the operational practices of international large-scale assessment studies such as the Programme for International Student Assessment's (PISA) or the Trends in International Mathematics and Science Study (TIMSS) may be biased when cross-national differential item functioning (DIF) and item parameter drift are…
Alphs, Larry; Morlock, Robert; Coon, Cheryl; van Willigenburg, Arjen; Panagides, John
Objective. To assess the ability of mental health professionals to use the 4-item Negative Symptom Assessment instrument, derived from the Negative Symptom Assessment-16, to rapidly determine the severity of negative symptoms of schizophrenia. Design. Open participation. Setting. Medical education conferences. Participants. Attendees at two international psychiatry conferences. Measurements. Participants read a brief set of the 4-item Negative Symptom Assessment instructions and viewed a videotape of a patient with schizophrenia. Using the 1 to 6 4-item Negative Symptom Assessment severity rating scale, they rated four negative symptom items and the overall global negative symptoms. These ratings were compared with a consensus rating determination using frequency distributions and Chi-square tests for the proportion of participant ratings that were within one point of the expert rating. Results. More than 400 medical professionals (293 physicians, 50% with a European practice, and 55% who reported past utilization of schizophrenia ratings scales) participated. Between 82.1 and 91.1 percent of the 4-items and the global rating determinations by the participants were within one rating point of the consensus expert ratings. The differences between the percentage of participant rating scores that were within one point versus the percentage that were greater than one point different from those by the consensus experts was significant (pnegative symptoms using the 4-item Negative Symptom Assessment did not generally differ among the geographic regions of practice, the professional credentialing, or their familiarity with the use of schizophrenia symptom rating instruments. Conclusion. These findings suggest that clinicians from a variety of geographic practices can, after brief training, use the 4-item Negative Symptom Assessment effectively to rapidly assess negative symptoms in patients with schizophrenia.
Decision-makers in organizations providing continuing professional development (CPD) have identified the need for routine assessment of its impact on practice. We sought to develop a theory-based instrument for evaluating the impact of CPD activities on health professionals' clinical behavioral intentions. Our multipronged study had four phases. (1) We systematically reviewed the literature for instruments that used socio-cognitive theories to assess healthcare professionals' clinically-oriented behavioral intentions and/or behaviors; we extracted items relating to the theoretical constructs of an integrated model of healthcare professionals' behaviors and removed duplicates. (2) A committee of researchers and CPD decision-makers selected a pool of items relevant to CPD. (3) An international group of experts (n = 70) reached consensus on the most relevant items using electronic Delphi surveys. (4) We created a preliminary instrument with the items found most relevant and assessed its factorial validity, internal consistency and reliability (weighted kappa) over a two-week period among 138 physicians attending a CPD activity. Out of 72 potentially relevant instruments, 47 were analyzed. Of the 1218 items extracted from these, 16% were discarded as improperly phrased and 70% discarded as duplicates. Mapping the remaining items onto the constructs of the integrated model of healthcare professionals' behaviors yielded a minimum of 18 and a maximum of 275 items per construct. The partnership committee retained 61 items covering all seven constructs. Two iterations of the Delphi process produced consensus on a provisional 40-item questionnaire. Exploratory factorial analysis following test-retest resulted in a 12-item questionnaire. Cronbach's coefficients for the constructs varied from 0.77 to 0.85. A 12-item theory-based instrument for assessing the impact of CPD activities on health professionals' clinical behavioral intentions showed adequate validity and
Background. The purpose of this study was to evaluate the effectiveness of two methods of detecting differential item functioning (DIF) in the presence of multilevel data and polytomously scored items. The assessment of DIF with multilevel data (e.g., patients nested within hospitals, hospitals nested within districts) from large-scale assessment programs has received considerable attention, but very few studies have evaluated the effect of the hierarchical structure of data on DIF detection for polytomously scored items. Methods. Ordinal logistic regression (OLR) and hierarchical ordinal logistic regression (HOLR) were utilized to assess DIF in simulated and real multilevel polytomous data. Six factors (DIF magnitude, grouping variable, intraclass correlation coefficient, number of clusters, number of participants per cluster, and item discrimination parameter) with a fully crossed design were considered in the simulation study. Furthermore, data from the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 collected from 576 healthy school children were analyzed. Results. Overall, results indicate that both methods performed equivalently in terms of controlling Type I error and detection power rates. Conclusions. The current study showed a negligible difference between OLR and HOLR in detecting DIF with polytomously scored items in a hierarchical structure. Implications and considerations while analyzing real data were also discussed.
Rougas, Steven; Clyne, Brian; Cianciolo, Anna T; Chan, Teresa M; Sherbino, Jonathan; Yarris, Lalena M
NEGEA 2015 CONFERENCE ABSTRACT (EDITED): Measuring an Organization's Culture of Feedback: Can It Be Done? Steven Rougas and Brian Clyne. CONSTRUCT: This study sought to develop a construct for measuring formative feedback culture in an academic emergency medicine department. Four archetypes (Market, Adhocracy, Clan, Hierarchy) reflecting an organization's values with respect to focus (internal vs. external) and process (flexibility vs. stability and control) were used to characterize one department's receptiveness to formative feedback. The prevalence of residents' identification with certain archetypes served as an indicator of the department's organizational feedback culture. New regulations have forced academic institutions to implement wide-ranging changes to accommodate competency-based milestones and their assessment. These changes challenge residencies that use formative feedback from faculty as a major source of data for determining training advancement. Though various approaches have been taken to improve formative feedback to residents, there currently exists no tool to objectively measure the organizational culture that surrounds this process. Assessing organizational culture, commonly used in the business sector to represent organizational health, may help residency directors gauge their program's success in fostering formative feedback. The Organizational Culture Assessment Instrument (OCAI) is widely used, extensively validated, applicable to survey research, and theoretically based and may be modifiable to assess formative feedback culture in the emergency department. Using a modified Delphi technique and several iterations of focus groups amongst educators at one institution, four of the original six OCAI domains (which each contain 4 possible responses) were modified to create a 16-item Formative Feedback Culture Tool (FFCT) that was administered to 26 residents (response rate = 55%) at a single academic emergency medicine department. The mean
McDonough, Christine M; Jette, Alan M; Ni, Pengsheng; Bogusz, Kara; Marfeo, Elizabeth E; Brandt, Diane E; Chan, Leighton; Meterko, Mark; Haley, Stephen M; Rasch, Elizabeth K
To build a comprehensive item pool representing work-relevant physical functioning and to test the factor structure of the item pool. These developmental steps represent initial outcomes of a broader project to develop instruments for the assessment of function within the context of Social Security Administration (SSA) disability programs. Comprehensive literature review; gap analysis; item generation with expert panel input; stakeholder interviews; cognitive interviews; cross-sectional survey administration; and exploratory and confirmatory factor analyses to assess item pool structure. In-person and semistructured interviews and Internet and telephone surveys. Sample of SSA claimants (n=1017) and a normative sample of adults from the U.S. general population (n=999). Not applicable. Model fit statistics. The final item pool consisted of 139 items. Within the claimant sample, 58.7% were white; 31.8% were black; 46.6% were women; and the mean age was 49.7 years. Initial factor analyses revealed a 4-factor solution, which included more items and allowed separate characterization of: (1) changing and maintaining body position, (2) whole body mobility, (3) upper body function, and (4) upper extremity fine motor. The final 4-factor model included 91 items. Confirmatory factor analyses for the 4-factor models for the claimant and the normative samples demonstrated very good fit. Fit statistics for claimant and normative samples, respectively, were: Comparative Fit Index=.93 and .98; Tucker-Lewis Index=.92 and .98; and root mean square error approximation=.05 and .04. The factor structure of the physical function item pool closely resembled the hypothesized content model. The 4 scales relevant to work activities offer promise for providing reliable information about claimant physical functioning relevant to work disability. Copyright © 2013 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
The aim of this study was to evaluate differences in physical education students’ perception of an educational innovation based on formative and peer assessment through the blogosphere. The sample was made up of 253 students from two Spanish universities. Data were collected using a self-reported questionnaire, and t tests were employed to find differences among student groups. Results show significant differences in almost all of the items on which the students were questioned. Basque students were more satisfied with the assessment tool used than the Valencian students. Students found the blogosphere more active, meaningful, functional and motivating, and more conducive to collaborative learning, than other traditional evaluation methods. They also expressed disapproval of the demands on attendance, continuity and the greater effort required. In future, assessment criteria should be negotiated with the students at the very start of the course.
Mueller, Anne E; Segal, Daniel L; Gavett, Brandon; Marty, Meghan A; Yochim, Brian; June, Andrea; Coolidge, Frederick L
The Geriatric Anxiety Scale (GAS; Segal, D. L., June, A., Payne, M., Coolidge, F. L. and Yochim, B. (2010). Journal of Anxiety Disorders, 24, 709-714. doi:10.1016/j.janxdis.2010.05.002) is a self-report measure of anxiety that was designed to address unique issues associated with anxiety assessment in older adults. This study is the first to use item response theory (IRT) to examine the psychometric properties of a measure of anxiety in older adults. A large sample of older adults (n = 581; mean age = 72.32 years, SD = 7.64 years, range = 60 to 96 years; 64% women; 88% European American) completed the GAS. IRT properties were examined. The presence of differential item functioning (DIF) or measurement bias by age and sex was assessed, and a ten-item short form of the GAS (called the GAS-10) was created. All GAS items had discrimination parameters of 1.07 or greater. Items from the somatic subscale tended to have lower discrimination parameters than items on the cognitive or affective subscales. Two items were flagged for DIF, but the impact of the DIF was negligible. Women scored significantly higher than men on the GAS and its subscales. Participants in the young-old group (60 to 79 years old) scored significantly higher on the cognitive subscale than participants in the old-old group (80 years old and older). Results from the IRT analyses indicated that the GAS and GAS-10 have strong psychometric properties among older adults. We conclude by discussing implications and future research directions.
DiVall, Margarita V; Alston, Greg L; Bird, Eleanora; Buring, Shauna M; Kelley, Katherine A; Murphy, Nanci L; Schlesselman, Lauren S; Stowe, Cindy D; Szilagyi, Julianna E
This paper aims to increase understanding and appreciation of formative assessment and its role in improving student outcomes and the instructional process, while educating faculty on formative techniques readily adaptable to various educational settings. Included are a definition of formative assessment and the distinction between formative and summative assessment. Various formative assessment strategies to evaluate student learning in classroom, laboratory, experiential, and interprofessional education settings are discussed. The role of reflective writing and portfolios, as well as the role of technology in formative assessment, are described. The paper also offers advice for formative assessment of faculty teaching. In conclusion, the authors emphasize the importance of creating a culture of assessment that embraces the concept of 360-degree assessment in both the development of a student's ability to demonstrate achievement of educational outcomes and a faculty member's ability to become an effective educator.
A topic search of the Web of Science (WoS) database using the term “numeracy” produced a bibliography of 293 articles, reviews and editorial commentaries (Oct 2008). The citation graph of the bibliography clearly identifies five benchmark papers (1995-2001), four of which developed numeracy assessment instruments. Starting with the 80 papers that cite these benchmarks, we identified a set of 25 papers (1995-2008) in which the medical research community reports the development and/or application of health-numeracy assessments. In all we found 10 assessment instruments from which we have compiled a total of 48 assessment items. There are both general and context-specific tests, with the wide range in the latter illustrated by names such as the Diabetes Numeracy Test and the Asthma Numeracy Questionnaire. There is also a Medical Data Interpretation Test and a Subjective Numeracy Scale. Much of this literature discusses the validity and reliability of the test, and many papers include item-by-item results of the tests from when they were applied in the research reported in the papers. The research that used the tests was directed at exploring such subjects as the patients’ ability to evaluate risks and benefits in order to make informed decisions; to understand and carry out instructions in order to self-manage their medical conditions; and, in research settings, to understand what the researchers were asking in their assessments (e.g., quantified quality of life) that require comparison of numerical information. We present the collection of items as a potential resource for educators interested in numeracy assessments in context.
Bisby, J. A.; Burgess, N.
The formation of associations between items and their context has been proposed to rely on mechanisms distinct from those supporting memory for a single item. Although emotional experiences can profoundly affect memory, our understanding of how it interacts with different aspects of memory remains unclear. We performed three experiments to examine the effects of emotion on memory for items and their associations. By presenting neutral and negative items with background contexts, Experiment 1 ...
Eichenbaum, Alexander E; Marcus, David K; French, Brian F
This study examined item and scale functioning in the Psychopathic Personality Inventory-Revised (PPI-R) using an item response theory analysis. PPI-R protocols from 1,052 college student participants (348 male, 704 female) were analyzed. Analyses were conducted on the 131 self-report items comprising the PPI-R's eight content scales, using a graded response model. Scales collected a majority of their information about respondents possessing higher than average levels of the traits being measured. Each scale contained at least some items that evidenced limited ability to differentiate between respondents with differing levels of the trait being measured. Moreover, 80 items (61.1%) yielded significantly different responses between men and women presumably possessing similar levels of the trait being measured. Item performance was also influenced by the scoring format (directly scored vs. reverse-scored) of the items. Overall, the results suggest that the PPI-R, despite identifying psychopathic personality traits in individuals possessing high levels of those traits, may not identify these traits equally well for men and women, and scores are likely influenced by the scoring format of the individual item and scale.
Sadler, Philip M.; Sonnert, Gerhard; Coyle, Harold P.; Miller, Kelly A.
The psychometrically sound development of assessment instruments requires pilot testing of candidate items as a first step in gauging their quality, typically a time-consuming and costly effort. Crowdsourcing offers the opportunity for gathering data much more quickly and inexpensively than from most targeted populations. In a simulation of a…
Scherr, Rachel E.; Close, Hunter G.; McKagan, Sarah B.
The practice of proximal formative assessment - the continual, responsive attention to students' developing understanding as it is expressed in real time - depends on students' sharing their ideas with instructors and on teachers' attending to them. Rogerian psychology presents an account of the conditions under which proximal formative assessment may be promoted or inhibited: (1) Normal classroom conditions, characterized by evaluation and attention to learning targets, may present threats to students' sense of their own competence and value, causing them to conceal their ideas and reducing the potential for proximal formative assessment. (2) In contrast, discourse patterns characterized by positive anticipation and attention to learner ideas increase the potential for proximal formative assessment and promote self-directed learning. We present an analysis methodology based on these principles and demonstrate its utility for understanding episodes of university physics instruction.
This study aims to obtain a valid and reliable formative evaluation model of critical thinking. The method used was research and development, integrating Borg & Gall's model and Plomp's development model. The ten steps of Borg & Gall's model were condensed into five stages matching the stages of Plomp's model. The subjects in this study were 1,446 junior high school students in DIY, 14 mathematics teachers, and six experts. Content validity was established by expert judgment, empirical validity and reliability were assessed using factor loadings, item analysis used PCM 1PL, and the relationship between disposition and critical thinking skill was examined using structural equation modeling (SEM). The developed formative evaluation model is a procedural model. There are five aspects of critical thinking skill: mathematic reasoning, interpretation, analysis, evaluation, and inference, comprising 42 items in total. The validity of the critical thinking skill instrument is significant, as indicated by the lowest and highest factor loadings of 0.38 and 0.74, respectively, and the reliability of every aspect is in a good category. The average level of difficulty is 0.00 with a standard deviation of 0.45, which is in a good category. The peer assessment questionnaire of critical thinking disposition consists of seven aspects: truth-seeking, open-minded, analysis, systematic, self-confidence, inquisitiveness, and maturity, with 23 items. The validity of the critical thinking disposition instrument is significant, as indicated by the lowest and highest factor loadings of 0.66 and 0.76, respectively, and the reliability of every aspect is in a good category. Based on the analysis of the structural equation model, the model fits the data.
Dickenson, Tammiee S.; Gilmore, Joanna A.; Price, Karen J.; Bennett, Heather L.
This study evaluated the benefits of item enhancements applied to science-inquiry items for incorporation into an alternate assessment based on modified achievement standards for high school students. Six items were included in the cognitive lab sessions involving both students with and without disabilities. The enhancements (e.g., use of visuals,…
In this paper a modern item design framework for computer-based assessment, built on the Flash authoring environment, is introduced. Question design is discussed, with emphasis on the multimedia authoring environment used for item modeling. Item type templates are a structured means of collecting and storing item information that can be used to improve the efficiency and security of the innovative item design process. Templates can modernize item design and enhance and speed up the development process. Along with content creation, multimedia has vast potential for use in innovative testing. The introduced item design template is based on a taxonomy of innovative items, which have great potential for expanding the content areas and construct coverage of an assessment. The presented item design approach is based on two GUIs: one for question design based on the implemented item design templates and one for user interaction tracking and retrieval. The concept of user interfaces based on Flash technology is discussed, as well as the implementation of the item design forms with multimedia authoring. An innovative method for user interaction storage and retrieval, based on PHP extending Flash capabilities in the proposed framework, is also introduced.
Background: Multiple choice questions (MCQs) are frequently used to assess students in different educational streams for their objectivity and wide reach of coverage in less time. However, the MCQs to be used must be of quality, which depends upon their difficulty index (DIF I), discrimination index (DI) and distracter efficiency (DE). Objective: To evaluate MCQs or items and develop a pool of valid items by assessing with DIF I, DI and DE, and also to revise, store or discard items based on the obtained results. Settings: The study was conducted in a medical school of Ahmedabad. Materials and Methods: An internal examination in Community Medicine was conducted after 40 hours of teaching during 1st MBBS, which was attended by 148 out of 150 students. A total of 50 MCQs or items and 150 distractors were analyzed. Statistical Analysis: Data were entered and analyzed in MS Excel 2007; simple proportions, means, standard deviations and coefficients of variation were calculated and an unpaired t test was applied. Results: Out of 50 items, 24 had "good to excellent" DIF I (31 - 60%) and 15 had "good to excellent" DI (> 0.25). Mean DE was 88.6%, considered ideal/acceptable, and non-functional distractors (NFD) were only 11.4%. Mean DI was 0.14. Poor DI (< 0.15), with negative DI in 10 items, indicates poor preparedness of students and some issues with the framing of at least some of the MCQs. An increased proportion of NFDs (incorrect alternatives selected by < 5% of students) in an item decreases DE and makes the item easier. There were 15 items with 17 NFDs, while the rest of the items did not have any NFD, with a mean DE of 100%. Conclusion: The study emphasizes the selection of quality MCQs which truly assess knowledge and are able to differentiate students of different abilities in a correct manner.
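The three indices named in the abstract above can be sketched in a few lines. This is an illustrative sketch, not the study's Excel workflow: the function name, the answer-string layout, and the 27% upper/lower grouping are assumptions. Only the 5% cutoff for a non-functional distractor comes from the abstract.

```python
from collections import Counter


def item_analysis(responses, key, options="ABCD", group_frac=0.27, nfd_cutoff=0.05):
    """Difficulty index, discrimination index and distractor efficiency per item.

    responses: one answer string per student, one character per item (assumed layout).
    key: string of correct options, one character per item.
    group_frac: share of students in the upper and lower groups (assumption).
    nfd_cutoff: a distractor chosen by fewer than this share is non-functional.
    """
    n_students, n_items = len(responses), len(key)
    totals = [sum(r[i] == key[i] for i in range(n_items)) for r in responses]
    # Rank students by total score, highest first.
    order = sorted(range(n_students), key=lambda s: totals[s], reverse=True)
    g = max(1, round(group_frac * n_students))
    upper, lower = order[:g], order[-g:]

    results = []
    for i in range(n_items):
        counts = Counter(r[i] for r in responses)
        # Difficulty index: percentage of all students answering correctly.
        difficulty = 100.0 * counts[key[i]] / n_students
        # Discrimination index: proportion correct in upper minus lower group.
        p_upper = sum(responses[s][i] == key[i] for s in upper) / g
        p_lower = sum(responses[s][i] == key[i] for s in lower) / g
        # Distractor efficiency: share of distractors that are functional.
        distractors = [o for o in options if o != key[i]]
        nfd = sum(counts[o] / n_students < nfd_cutoff for o in distractors)
        de = 100.0 * (len(distractors) - nfd) / len(distractors)
        results.append({
            "difficulty": difficulty,
            "discrimination": p_upper - p_lower,
            "nfd": nfd,
            "de": de,
        })
    return results
```

With three distractors per item, as the 50 items and 150 distractors in the abstract imply, DE under this definition can only take the values 100%, 66.7%, 33.3% or 0%.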
Knol Dirk L
Background: For the Low Vision Quality Of Life questionnaire (LVQOL) it is unknown whether the psychometric properties are satisfactory when an item response theory (IRT) perspective is considered. This study evaluates some essential psychometric properties of the LVQOL questionnaire in an IRT model, and investigates differential item functioning (DIF). Methods: Cross-sectional data were used from an observational study among visually-impaired patients (n = 296). Calibration was performed for every dimension of the LVQOL in the graded response model. Item goodness-of-fit was assessed with the S-X2-test. DIF was assessed on relevant background variables (i.e. age, gender, visual acuity, eye condition, rehabilitation type and administration type) with likelihood-ratio tests for DIF. The magnitude of DIF was interpreted by assessing the largest difference in expected scores between subgroups. Measurement precision was assessed by presenting test information curves; reliability with the index of subject separation. Results: All items of the LVQOL dimensions fitted the model. There was significant DIF on several items. For two items the maximum difference between expected scores exceeded one point, and DIF was found on multiple relevant background variables. Item 1 'Vision in general' from the "Adjustment" dimension and item 24 'Using tools' from the "Reading and fine work" dimension were removed. Test information was highest for the "Reading and fine work" dimension. Indices for subject separation ranged from 0.83 to 0.94. Conclusions: The items of the LVQOL showed satisfactory item fit to the graded response model; however, two items were removed because of DIF. The adapted LVQOL with 21 items is DIF-free and therefore seems highly appropriate for use in heterogeneous populations of visually impaired patients.
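For readers unfamiliar with the graded response model used in the abstract above, the following sketch computes category probabilities for a single item under the standard logistic form of the model. The function name and parameter values are hypothetical and are not taken from the LVQOL calibration.

```python
from math import exp


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta: person trait level.
    a: item discrimination parameter.
    thresholds: increasing list of category threshold parameters b_1 < b_2 < ...
    Returns one probability per response category (len(thresholds) + 1 values).
    """
    # Cumulative probabilities P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cum = [1.0] + [1.0 / (1.0 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]
```

An item's expected score at a given trait level is the probability-weighted sum of its category scores. The abstract's DIF magnitude check compares such expected scores between subgroups calibrated with different item parameters.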
Ruth A. Childs
Matrix sampling of items, that is, division of a set of items into different versions of a test form, is used by several large-scale testing programs. Like other test designs, matrixed designs have both advantages and disadvantages. For example, testing time per student is less than if each student received all the items, but the comparability of student scores may decrease. Also, curriculum coverage is maintained, but reporting of scores becomes more complex. In this paper, matrixed designs are compared with more traditional designs in nine categories of costs: development costs, materials costs, administration costs, educational costs, scoring costs, reliability costs, comparability costs, validity costs, and reporting costs. In choosing among test designs, a testing program should examine the costs in light of its mandate(s), the content of the tests, and the financial resources available, among other considerations.
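As a rough illustration of what dividing an item set into different test forms can look like, here is a minimal Python sketch. It assumes the simplest possible design (non-overlapping forms of near-equal length, rotated across students); real matrixed designs often add common anchor items or balanced blocks, which are not shown. The helper names `build_matrix_forms` and `assign_forms` are hypothetical and do not come from the paper.

```python
import random


def build_matrix_forms(item_ids, n_forms, seed=0):
    """Split an item pool into n_forms non-overlapping forms (illustrative only)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    # Deal items round-robin so form lengths differ by at most one item.
    return [shuffled[i::n_forms] for i in range(n_forms)]


def assign_forms(student_ids, n_forms):
    """Rotate forms across students so each form is used about equally often."""
    return {student: index % n_forms for index, student in enumerate(student_ids)}


if __name__ == "__main__":
    forms = build_matrix_forms(range(1, 13), n_forms=3)
    for number, form in enumerate(forms):
        print(f"Form {number}: items {sorted(form)}")
    print(assign_forms(["s1", "s2", "s3", "s4"], n_forms=3))
```

Each student here sees only a third of the pool, which shortens testing time while the full pool is still covered across the class. The trade-off the abstract names follows directly: two students given different forms share no items, so their scores are harder to compare.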
Johnson, Matthew S.; Sinharay, Sandip
For complex educational assessments, there is an increasing use of "item families," which are groups of related items. However, calibration or scoring for such an assessment requires fitting models that take into account the dependence structure inherent among the items that belong to the same item family. C. Glas and W. van der Linden…
Lin, Jian-Wei; Lai, Yuan-Cheng
This paper harnesses collaborative annotations by students as learning feedback on online formative assessments to improve the learning achievements of students. Through the developed Web platform, students can conduct formative assessments, collaboratively annotate, and review historical records in a convenient way, while teachers can generate…
Hoffmann-Eßer, Wiebke; Siering, Ulrich; Neugebauer, Edmund A M; Brockhaus, Anne Catharina; McGauran, Natalie; Eikermann, Michaela
The AGREE II instrument is the most commonly used guideline appraisal tool. It includes 23 appraisal criteria (items) organized within six domains. AGREE II also includes two overall assessments (overall guideline quality, recommendation for use). Our aim was to investigate how strongly the 23 AGREE II items influence the two overall assessments. An online survey of authors of publications on guideline appraisals with AGREE II and guideline users from a German scientific network was conducted between 10th February 2015 and 30th March 2015. Participants were asked to rate the influence of the AGREE II items on a Likert scale (0 = no influence to 5 = very strong influence). The frequencies of responses and their dispersion were presented descriptively. Fifty-eight of the 376 persons contacted (15.4%) participated in the survey and the data of the 51 respondents with prior knowledge of AGREE II were analysed. Items 7-12 of Domain 3 (rigour of development) and both items of Domain 6 (editorial independence) had the strongest influence on the two overall assessments. In addition, Items 15-17 (clarity of presentation) had a strong influence on the recommendation for use. Great variations were shown for the other items. The main limitation of the survey is the low response rate. In guideline appraisals using AGREE II, items representing rigour of guideline development and editorial independence seem to have the strongest influence on the two overall assessments. In order to ensure a transparent approach to reaching the overall assessments, we suggest the inclusion of a recommendation in the AGREE II user manual on how to consider item and domain scores. For instance, the manual could include an a-priori weighting of those items and domains that should have the strongest influence on the two overall assessments. The relevance of these assessments within AGREE II could thereby be further specified.
Assessment can be used to stimulate and direct student learning. This refers to the formative function of assessment. Formative assessments contribute to learning by generating feedback. Here, feedback is conceptualised as information about learners' actual state of performance intended to modify
Shu, Lianghua; Schwarz, Richard D.
As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted…
Maddox, Bryan; Zumbo, Bruno D.; Tay-Lim, Brenda; Qu, Demin
This article explores the potential for ethnographic observations to inform the analysis of test item performance. In 2010, a standardized, large-scale adult literacy assessment took place in Mongolia as part of the United Nations Educational, Scientific and Cultural Organization Literacy Assessment and Monitoring Programme (LAMP). In a novel form…
Dowling, N Maritza; Bolt, Daniel M; Deng, Sien
When assessments are primarily used to measure change over time, it is important to evaluate items according to their sensitivity to change, specifically. Items that demonstrate good sensitivity to between-person differences at baseline may not show good sensitivity to change over time, and vice versa. In this study, we applied a longitudinal factor model of change to a widely used cognitive test designed to assess global cognitive status in dementia, and contrasted the relative sensitivity of items to change. Statistically nested models were estimated introducing distinct latent factors related to initial status differences between test-takers and within-person latent change across successive time points of measurement. Models were estimated using all available longitudinal item-level data from the Alzheimer's Disease Assessment Scale-Cognitive subscale, including participants representing the full-spectrum of disease status who were enrolled in the multisite Alzheimer's Disease Neuroimaging Initiative. Five of the 13 Alzheimer's Disease Assessment Scale-Cognitive items demonstrated noticeably higher loadings with respect to sensitivity to change. Attending to performance change on only these 5 items yielded a clearer picture of cognitive decline more consistent with theoretical expectations in comparison to the full 13-item scale. Items that show good psychometric properties in cross-sectional studies are not necessarily the best items at measuring change over time, such as cognitive decline. Applications of the methodological approach described and illustrated in this study can advance our understanding regarding the types of items that best detect fine-grained early pathological changes in cognition. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Mathysen, Danny G P; Aclimandos, Wagih; Roelant, Ella; Wouters, Kristien; Creuzot-Garcher, Catherine; Ringens, Peter J; Hawlina, Marko; Tassignon, Marie-José
To investigate whether introduction of item-response theory (IRT) analysis, in parallel to the 'traditional' statistical analysis methods available for performance evaluation of multiple T/F items as used in the European Board of Ophthalmology Diploma (EBOD) examination, has proved beneficial, and secondly, to study whether the overall assessment performance of the current written part of EBOD is sufficiently high (KR-20 ≥ 0.90) to be kept as examination format in future EBOD editions. 'Traditional' analysis methods for individual MCQ item performance comprise P-statistics, Rit-statistics and item discrimination, while overall reliability is evaluated through KR-20 for multiple T/F items. The additional set of statistical analysis methods for the evaluation of EBOD comprises mainly IRT analysis. These analysis techniques are used to monitor whether the introduction of negative marking for incorrect answers (since EBOD 2010) has a positive influence on the statistical performance of EBOD as a whole and its individual test items in particular. Item-response theory analysis demonstrated that item performance parameters should not be evaluated individually, but should be related to one another. Before the introduction of negative marking, the overall EBOD reliability (KR-20) was good though with room for improvement (EBOD 2008: 0.81; EBOD 2009: 0.78). After the introduction of negative marking, the overall reliability of EBOD improved significantly (EBOD 2010: 0.92; EBOD 2011: 0.91; EBOD 2012: 0.91). Although many statistical performance parameters are available to evaluate individual items, our study demonstrates that the overall reliability assessment remains the only crucial parameter to be evaluated allowing comparison. While individual item performance analysis is worthwhile to undertake as secondary analysis, drawing final conclusions seems to be more difficult. Performance parameters need to be related, as shown by IRT analysis. Therefore, IRT analysis has
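The KR-20 coefficient used as the reliability criterion in this abstract can be sketched in a few lines of Python. This is the textbook Kuder-Richardson formula 20 for items scored 0/1, KR-20 = k/(k-1) · (1 - Σ p_i q_i / σ²), and not necessarily the exact computation used for EBOD, where negative marking means item scores are not simply 0/1. Using the population variance (dividing by n) is an assumption; some texts divide by n - 1. The function name `kr20` is hypothetical.

```python
def kr20(scores):
    """Kuder-Richardson formula 20 for a students-by-items matrix of 0/1 scores."""
    n = len(scores)        # number of students
    k = len(scores[0])     # number of items

    # Variance of the total scores (population form, an assumption of this sketch).
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")

    # Sum over items of p * q, where p is the proportion correct and q = 1 - p.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n
        sum_pq += p * (1.0 - p)

    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


if __name__ == "__main__":
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(round(kr20(data), 3))
```

Because KR-20 is a single number for the whole test, it supports the year-to-year comparison the abstract describes, whereas item-level statistics have to be read together.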
Ní Fhloinn, Eabhnat; Carr, Michael
In this paper, we present a range of formative assessment types for engineering mathematics, including in-class exercises, homework, mock examination questions, table quizzes, presentations, critical analyses of statistical papers, peer-to-peer teaching, online assessments and electronic voting systems. We provide practical tips for the implementation of such assessments, with a particular focus on time or resource constraints and large class sizes, as well as effective methods of feedback. In addition, we consider the benefits of such formative assessments for students and staff.
Hung, Man; Voss, Maren W; Bounsanga, Jerry; Crum, Anthony B; Tyser, Andrew R
Clinical measurement. The psychometric properties of the PROMIS v1.2 UE item bank were tested on various samples prior to its release, but have not been fully evaluated among the orthopaedic population. This study assesses the performance of the UE item bank within the UE orthopaedic patient population. The UE item bank was administered to 1197 adult patients presenting to a tertiary orthopaedic clinic specializing in hand and UE conditions and was examined using traditional statistics and Rasch analysis. The UE item bank fits a unidimensional model (outfit MNSQ range from 0.64 to 1.70) and has adequate reliabilities (person = 0.84; item = 0.82) and local independence (item residual correlations range from -0.37 to 0.34). Only one item exhibits gender differential item functioning. Most items target low levels of function. The UE item bank is a useful clinical assessment tool. Additional items covering higher functions are needed to enhance validity. Supplemental testing is recommended for patients at higher levels of function until more high function UE items are developed. 2c. Copyright © 2016 Hanley & Belfus. Published by Elsevier Inc. All rights reserved.
Ueckert, Sebastian; Plan, Elodie L; Ito, Kaori; Karlsson, Mats O; Corrigan, Brian; Hooker, Andrew C
This work investigates improved utilization of ADAS-cog data (the primary outcome in Alzheimer's disease (AD) trials of mild and moderate AD) by combining pharmacometric modeling and item response theory (IRT). A baseline IRT model characterizing the ADAS-cog was built based on data from 2,744 individuals. Pharmacometric methods were used to extend the baseline IRT model to describe longitudinal ADAS-cog scores from an 18-month clinical study with 322 patients. Sensitivity of the ADAS-cog items in different patient populations as well as the power to detect a drug effect in relation to total score based methods were assessed with the IRT based model. IRT analysis was able to describe both total and item level baseline ADAS-cog data. Longitudinal data were also well described. Differences in the information content of the item level components could be quantitatively characterized and ranked for mild cognitive impairment and mild AD populations. Based on clinical trial simulations with a theoretical drug effect, the IRT method demonstrated a significantly higher power to detect drug effect compared to the traditional method of analysis. A combined framework of IRT and pharmacometric modeling permits a more effective and precise analysis than total score based methods and therefore increases the value of ADAS-cog data.
McElhiney, Danielle; Kang, Minsoo; Starkey, Chad; Ragan, Brian
The purpose of the study was to improve the immediate and delayed memory sections of the Standardized Assessment of Concussion (SAC) by identifying a list of more psychometrically sound items (words). A total of 200 participants with no history of concussion in the previous six months (aged 19.60 ± 2.20 years; N = 93 men, N = 107 women)…
Steca, Patrizia; Monzani, Dario; Greco, Andrea; Chiesi, Francesca; Primi, Caterina
This study is aimed at testing the measurement properties of the Life Orientation Test-Revised (LOT-R) for the assessment of dispositional optimism by employing item response theory (IRT) analyses. The LOT-R was administered to a large sample of 2,862 Italian adults. First, confirmatory factor analyses demonstrated the theoretical conceptualization of the construct measured by the LOT-R as a single bipolar dimension. Subsequently, IRT analyses for polytomous, ordered response category data were applied to investigate the items' properties. The equivalence of the items across gender and age was assessed by analyzing differential item functioning. Discrimination and severity parameters indicated that all items were able to distinguish people with different levels of optimism and adequately covered the spectrum of the latent trait. Additionally, the LOT-R appears to be gender invariant and, with minor exceptions, age invariant. Results provided evidence that the LOT-R is a reliable and valid measure of dispositional optimism. © The Author(s) 2014.
Bisby, James A.; Burgess, Neil
The formation of associations between items and their context has been proposed to rely on mechanisms distinct from those supporting memory for a single item. Although emotional experiences can profoundly affect memory, our understanding of how it interacts with different aspects of memory remains unclear. We performed three experiments to examine…
Fischer, H Felix; Wahl, Inka; Nolte, Sandra; Liegl, Gregor; Brähler, Elmar; Löwe, Bernd; Rose, Matthias
To investigate differential item functioning (DIF) of PROMIS Depression items between US and German samples we compared data from the US PROMIS calibration sample (n = 780), a German general population survey (n = 2,500) and a German clinical sample (n = 621). DIF was assessed in an ordinal logistic regression framework, with 0.02 as criterion for R²-change and 0.096 for Raju's non-compensatory DIF. Item parameters were initially fixed to the PROMIS Depression metric; we used plausible values to account for uncertainty in depression estimates. Only four items showed DIF. Accounting for DIF led to negligible effects for the full item bank as well as a post hoc simulated computer-adaptive test (German general population sample was considerably lower compared to the US reference value of 50. Overall, we found little evidence for language DIF between US and German samples, which could be addressed by either replacing the DIF items by items not showing DIF or by scoring the short form in German samples with the corrected item parameters reported. Copyright © 2016 John Wiley & Sons, Ltd.
Harpole, Jared K; Levinson, Cheri A; Woods, Carol M; Rodebaugh, Thomas L; Weeks, Justin W; Brown, Patrick J; Heimberg, Richard G; Menatti, Andrew R; Blanco, Carlos; Schneier, Franklin; Liebowitz, Michael
The Brief Fear of Negative Evaluation Scale (BFNE; Leary, Personality and Social Psychology Bulletin, 9, 371-375, 1983) assesses fear and worry about receiving negative evaluation from others. Rodebaugh et al. (Psychological Assessment, 16, 169-181, 2004) found that the BFNE is composed of a reverse-worded factor (BFNE-R) and straightforwardly-worded factor (BFNE-S). Further, they found the BFNE-S to have better psychometric properties and provide more information than the BFNE-R. Currently there is a lack of research regarding the measurement invariance of the BFNE-S across gender and ethnicity with respect to item thresholds. The present study uses item response theory (IRT) to test the BFNE-S for differential item functioning (DIF) related to gender and ethnicity (White, Asian, and Black). Six data sets consisting of clinical, community, and undergraduate participants were utilized (N = 2,109). The factor structure of the BFNE-S was confirmed using categorical confirmatory factor analysis, IRT model assumptions were tested, and the BFNE-S was evaluated for DIF. Item nine demonstrated significant non-uniform DIF between White and Black participants. No other items showed significant uniform or non-uniform DIF across gender or ethnicity. Results suggest the BFNE-S can be used reliably with men and women and Asian and White participants. More research is needed to understand the implications of using the BFNE-S with Black participants.
Obbarius, Nina; Fischer, Felix; Obbarius, Alexander; Nolte, Sandra; Liegl, Gregor; Rose, Matthias
To develop the first item bank to measure Stress Resilience (SR) in clinical populations. Qualitative item development resulted in an initial pool of 131 items covering a broad theoretical SR concept. These items were tested in n=521 patients at a psychosomatic outpatient clinic. Exploratory and Confirmatory Factor Analysis (CFA), as well as other state-of-the-art item analyses and IRT were used for item evaluation and calibration of the final item bank. Out of the initial item pool of 131 items, we excluded 64 items (54 factor loading .3, 2 non-discriminative Item Response Curves, 4 Differential Item Functioning). The final set of 67 items indicated sufficient model fit in CFA and IRT analyses. Additionally, a 10-item short form with high measurement precision (SE≤.32 in a theta range between -1.8 and +1.5) was derived. Both the SR item bank and the SR short form were highly correlated with an existing static legacy tool (Connor-Davidson Resilience Scale). The final SR item bank and 10-item short form showed good psychometric properties. When further validated, they will be ready to be used within a framework of Computer-Adaptive Tests for a comprehensive assessment of the Stress-Construct. Copyright © 2018. Published by Elsevier Inc.
Zhao, Gai; Bian, Yang; Li, Ming
To analyze the impact of passing items above the ceiling level in the gross motor subtest of the Peabody Developmental Motor Scales (PDMS-2) on its assessment results. The PDMS-2 subtests were administered to 124 children aged 1.2 to 71 months. In addition to the original scoring method, a new scoring method that includes items passed above the ceiling was developed. The standard scores and quotients of the two scoring methods were compared using the independent-samples t test. Only one child passed items above the ceiling in the stationary subtest, 19 children in the locomotion subtest, and 17 children in the visual-motor integration subtest. When the scores of these passed items were included in the raw scores, the total raw scores increased by 1-12 points, the standard scores by 0-1 points and the motor quotients by 0-3 points. The diagnostic classification changed in only two children. There was no significant difference between the two methods in motor quotients or in standard scores for the specific subtests (P>0.05). Passing items above the ceiling of the PDMS-2 is not a rare situation. It usually takes place in the locomotion subtest and the visual-motor integration subtest. Including these passed items in the scoring system does not make a significant difference in the standard scores of the subtests or the developmental motor quotients (DMQ), which supports the original setting of a ceiling established by failing to pass 3 items in a row. However, adding the items passed above the ceiling to the raw score will improve tracking of children's developmental trajectory and intervention effects.
Lee Hang, Desmond Mene; Bell, Beverley
In this commentary, we build on Xinying Yin and Gayle Buck's discussion by exploring the cultural practices which are integral to formative assessment, when it is viewed as a sociocultural practice. First we discuss the role of assessment and in particular oral and written formative assessments in both western and Samoan cultures, building on the…
Kean, Jacob; Brodke, Darrel S; Biber, Joshua; Gross, Paul
Item response theory has its origins in educational measurement and is now commonly applied in health-related measurement of latent traits, such as function and symptoms. This application is due in large part to gains in the precision of measurement attributable to item response theory and corresponding decreases in response burden, study costs, and study duration. The purpose of this paper is twofold: introduce basic concepts of item response theory and demonstrate this analytic approach in a worked example, a Rasch model (1PL) analysis of the Eating Assessment Tool (EAT-10), a commonly used measure for oropharyngeal dysphagia. The results of the analysis were largely concordant with previous studies of the EAT-10 and illustrate for brain impairment clinicians and researchers how IRT analysis can yield greater precision of measurement.
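For readers new to the model named in this abstract, here is a minimal Python sketch of the dichotomous Rasch (1PL) response function, P(X = 1 | θ, b) = 1 / (1 + e^{-(θ - b)}), together with the item information p(1 - p). It is shown for illustration only: the paper's worked example may use a polytomous Rasch variant for rated items, and the function names `rasch_probability` and `item_information` are hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a positive response under the dichotomous Rasch (1PL) model.

    theta -- person location on the latent trait (in logits)
    b     -- item difficulty (in logits)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at theta; it peaks where theta equals b."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)


if __name__ == "__main__":
    for theta in (-2.0, 0.0, 2.0):
        p = rasch_probability(theta, b=0.0)
        info = item_information(theta, b=0.0)
        print(f"theta={theta:+.1f}  P={p:.3f}  information={info:.3f}")
```

The information function shows where the precision gains mentioned in the abstract come from: an item is most informative for respondents whose trait level is close to its difficulty, so a well-targeted item set measures more precisely with fewer items.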
Key words: Continuous assessment, Formative assessment, Summative ... According to Teacher Education System .... research design involving both qualitative ..... Table 3: Students' Response to Items about Learning and Teaching-Load.
Prior formative assessment research has shown positive achievement gains when classes using formative assessment are compared to classes that do not. However, little is known about what, if any, benefits of formative assessment occur within a class. The purpose of this study was to investigate the achievement of the students in introductory calculus using formative assessment at the two different participation levels observed in class. Although there was no significant...
New South Wales Dept. of Education, Sydney (Australia).
As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…
Mitten, Carolyn; Jacobbe, Tim; Jacobbe, Elizabeth
Formative assessment is so important to inform teachers' planning. A discussion of the benefits of using technology to facilitate formative assessment explains how four primary school teachers adopted three different apps to make their formative assessment more meaningful and useful.
Mesic, Vanes; Muratovic, Hasnija
Large-scale assessments of student achievement in physics are often approached with an intention to discriminate students based on the attained level of their physics competencies. Therefore, for purposes of test design, it is important that items display an acceptable discriminatory behavior. To that end, it is recommended to avoid extraordinarily difficult and very easy items. Knowing the factors that influence physics item difficulty makes it possible to model the item difficulty even before the first pilot study is conducted. Thus, by identifying predictors of physics item difficulty, we can improve the test-design process. Furthermore, we get additional qualitative feedback regarding the basic aspects of student cognitive achievement in physics that are directly responsible for the obtained, quantitative test results. In this study, we conducted a secondary analysis of data that came from two large-scale assessments of student physics achievement at the end of compulsory education in Bosnia and Herzegovina. Foremost, we explored the concept of “physics competence” and performed a content analysis of 123 physics items that were included within the above-mentioned assessments. Thereafter, an item database was created. Items were described by variables which reflect some basic cognitive aspects of physics competence. For each of the assessments, Rasch item difficulties were calculated in separate analyses. In order to make the item difficulties from different assessments comparable, a virtual test equating procedure had to be implemented. Finally, a regression model of physics item difficulty was created. It has been shown that 61.2% of item difficulty variance can be explained by factors which reflect the automaticity, complexity, and modality of the knowledge structure that is relevant for generating the most probable correct solution, as well as by the divergence of required thinking and interference effects between intuitive and formal physics knowledge
Tutz, Gerhard; Berger, Moritz
A novel method for the identification of differential item functioning (DIF) by means of recursive partitioning techniques is proposed. We assume an extension of the Rasch model that allows for DIF being induced by an arbitrary number of covariates for each item. Recursive partitioning on the item level results in one tree for each item and leads to simultaneous selection of items and variables that induce DIF. For each item, it is possible to detect groups of subjects with different item difficulties, defined by combinations of characteristics that are not pre-specified. The way a DIF item is determined by covariates is visualized in a small tree and therefore easily accessible. An algorithm is proposed that is based on permutation tests. Various simulation studies, including the comparison with traditional approaches to identify items with DIF, show the applicability and the competitive performance of the method. Two applications illustrate the usefulness and the advantages of the new method.
Shifting from teacher-centred to student-centred practices requires teachers to understand strategies for interacting with students in science classes. Formative assessment strategies are a critical component of classroom interaction, in which teachers obtain information about student learning wherever possible. Traditionally, however, teachers ask questions and evaluate student responses without investigating student contributions to the classroom interaction. This qualitative study aimed at developing teachers’ knowledge of formative assessment strategies when teaching science-based inquiry in Saudi Arabia. Twelve teachers were observed teaching science, and details of one teacher’s formative assessment practices are presented in this study. A formative assessment framework that describes assessment conversations was used and modified to observe teachers’ assessment practices. An assessment conversation consists of four-step cycles, in which the teacher elicits information from students through questioning, the student responds, the teacher recognizes the student’s response, and the teacher then uses the information to develop further inquiry. Findings indicate that teachers ask questions and receive responses but rarely allow students to share their own ideas or discuss their own thinking. The study underlines the importance of integrating formative assessment strategies during scientific inquiry teaching for professional development as a way to increase student participation and allow opportunities for students’ inquiry in science classes.
Our objective was to evaluate the psychometric properties of a vegetable parenting practices scale using multidimensional polytomous item response modeling which enables assessing item fit to latent variables and the distributional characteristics of the items in comparison to the respondents. We al...
This paper reviews the literature about item response models for the subject level and aggregated level (group level). Group-level item response models (IRMs) are used in the United States in large-scale assessment programs such as the National Assessment of Educational Progress and the California
DEMURA, Shinichi; SATO, Susumu; YAMAJI, Shunsuke; KASUGA, Kosho; NAGASAWA, Yoshinori
We aimed to examine the validity of fall risk assessment items for the healthy community-dwelling elderly Japanese population. Participants were 1122 healthy elderly individuals aged 60 years and over (380 males and 742 females). The percentage who had experienced a fall was 15.8%. This study used fall experience and 50 fall risk assessment items representing the five risk factors (symptoms of falling, physical function, disease and physical symptom, environment, and behavior and character), ...
Bregnbak, David; Johansen, Jeanne D.; Hamann, Dathan
We recently evaluated and validated a diphenylcarbazide (DPC)-based screening spot test that can detect the release of chromium(VI) ions (≥0.5 ppm) from various metallic items and leather goods (1). We then screened a selection of metal screws, leather shoes, and gloves, as well as 50 earrings, and identified chromium(VI) release from one earring. In the present study, we used the DPC spot test to assess chromium(VI) release in a much larger sample of jewellery items (n=848), 160 (19%) of which had previously been shown to contain chromium when analysed with X-ray fluorescence spectroscopy (2)....
BACKGROUND: The World Health Organization Disability Assessment Schedule (WHODAS 2.0) measures disability due to health conditions including diseases, illnesses, injuries, mental or emotional problems, and problems with alcohol or drugs. METHOD: The 12 Item WHODAS 2.0 was used in the second Australian Survey of Mental Health and Well-being. We report the overall factor structure and the distribution of scores and normative data (means and SDs) for people with any physical disorder, any mental disorder and for people with neither. FINDINGS: A single second order factor justifies the use of the scale as a measure of global disability. People with mental disorders had high scores (mean 6.3, SD 7.1); people with physical disorders had lower scores (mean 4.3, SD 6.1). People with no disorder covered by the survey had low scores (mean 1.4, SD 3.6). INTERPRETATION: The provision of normative data from a population sample of adults will facilitate use of the WHODAS 2.0 12 item scale in clinical and epidemiological research.
New South Wales Dept. of Education, Sydney (Australia).
Benefiting from item preknowledge is a major type of fraudulent behavior during educational assessments. Belov suggested the posterior shift statistic for detection of item preknowledge and showed its performance to be better on average than that of seven other statistics for detection of item preknowledge for a known set of compromised items. Sinharay suggested a statistic based on the likelihood ratio test for detection of item preknowledge; the advantage of the statistic is that its null distribution is known. Results from simulated and real data and adaptive and nonadaptive tests are used to demonstrate that the Type I error rate and power of the statistic based on the likelihood ratio test are very similar to those of the posterior shift statistic. Thus, the statistic based on the likelihood ratio test appears promising in detecting item preknowledge when the set of compromised items is known.
Northwest Evaluation Association, 2016
If you are seeking greater student engagement and growth, you need to integrate high-impact formative assessment practices into daily instruction. Read the final article in our five-part series to find advice aimed at leaders determined to bring classroom formative assessment practices district wide. Learn: (1) what you MUST consider when…
Rauf, Ayesha; Shamim, Muhammad Shahid; Aly, Syed Moyn; Chundrigar, Tariq; Alam, Shams Nadeem
Formative assessment, described as "the process of appraising, judging or evaluating students' work or performance and using this to shape and improve students' competence", is generally missing from medical schools of Pakistan. Progressive institutions conduct "formative assessment" as a fleeting part of the curriculum by using various methods that may or may not include feedback to learners. The most important factor in the success of formative assessment is the quality of feedback, shown to have the maximum impact on student accomplishment. Inclusion of formative assessment into the curriculum and its implementation will require the following: Enabling Environment, Faculty and student Training, Role of Department of Medical Education (DME). Many issues can be predicted that may jeopardize the effectiveness of formative assessment including faculty resistance, lack of motivation from students and faculty and paucity of commitment from the top administration. For improvement in medical education in Pakistan, we need to develop a system considered worthy by national and international standards. This paper will give an overview of formative assessment, its implications and recommendations for implementation in medical institutes of Pakistan.
Martin, Christie; Lambert, Richard; Polly, Drew; Wang, Chuang; Pugalee, David
The purpose of this study was to examine the measurement properties of the Assessing Math Concepts (AMC) Anywhere Hiding and Ten Frame Assessments, formative assessments of primary students' number sense skills. Each assessment has two parts, where Part 1 is intended to cover foundational skills for Part 2. Part 1 includes manipulatives whereas Part 2 does not. Student data from 228 kindergarten through second grade teachers with a total of 3,666 students were analyzed using Rasch scaling. Data analyses indicated that when the two assessments were examined separately the intended order of item difficulty was clear. When the parts of both assessments were analyzed together, the items in Part 2 were not consistently more difficult than the items in Part 1. This suggests an alternative sequence of tasks in that students may progress from working with a specific number with manipulatives then without manipulatives rather than working with a variety of numbers with manipulatives before moving onto assessments without manipulatives.
Diao, Qi; van der Linden, Wim J.
Automated test assembly uses the methodology of mixed integer programming to select an optimal set of items from an item bank. Automated test-form generation uses the same methodology to optimally order the items and format the test form. From an optimization point of view, production of fully formatted test forms directly from the item pool using…
Junod Perron, Noëlle; Louis-Simonet, Martine; Cerutti, Bernard; Pfarrwaller, Eva; Sommer, Johanna; Nendaz, Mathieu
Introduction Medical students at the Faculty of Medicine, University of Geneva, Switzerland, have the opportunity to practice clinical skills with simulated patients during formative sessions in preparation for clerkships. These sessions are given in two formats: 1) direct observation of an encounter followed by verbal feedback (direct feedback) and 2) subsequent review of the videotaped encounter by both student and supervisor (video-based feedback). The aim of the study was to evaluate whether content and process of feedback differed between both formats. Methods In 2013, all second- and third-year medical students and clinical supervisors involved in formative sessions were asked to take part in the study. A sample of audiotaped feedback sessions involving supervisors who gave feedback in both formats were analyzed (content and process of the feedback) using a 21-item feedback scale. Results Forty-eight audiotaped feedback sessions involving 12 supervisors were analyzed (2 direct and 2 video-based sessions per supervisor). When adjusted for the length of feedback, there were significant differences in terms of content and process between both formats; the number of communication skills and clinical reasoning items addressed were higher in the video-based format (11.29 vs. 7.71, p=0.002 and 3.71 vs. 2.04, p=0.010, respectively). Supervisors engaged students more actively during the video-based sessions than during direct feedback sessions (self-assessment: 4.00 vs. 3.17, p=0.007; active problem-solving: 3.92 vs. 3.42, p=0.009). Students made similar observations and tended to consider that the video feedback was more useful for improving some clinical skills. Conclusion Video-based feedback facilitates discussion of clinical reasoning, communication, and professionalism issues while at the same time actively engaging students. Different time and conceptual frameworks may explain observed differences. The choice of feedback format should depend on the educational
Noëlle Junod Perron
Introduction: Medical students at the Faculty of Medicine, University of Geneva, Switzerland, have the opportunity to practice clinical skills with simulated patients during formative sessions in preparation for clerkships. These sessions are given in two formats: 1) direct observation of an encounter followed by verbal feedback (direct feedback) and 2) subsequent review of the videotaped encounter by both student and supervisor (video-based feedback). The aim of the study was to evaluate whether content and process of feedback differed between both formats. Methods: In 2013, all second- and third-year medical students and clinical supervisors involved in formative sessions were asked to take part in the study. A sample of audiotaped feedback sessions involving supervisors who gave feedback in both formats were analyzed (content and process of the feedback) using a 21-item feedback scale. Results: Forty-eight audiotaped feedback sessions involving 12 supervisors were analyzed (2 direct and 2 video-based sessions per supervisor). When adjusted for the length of feedback, there were significant differences in terms of content and process between both formats; the number of communication skills and clinical reasoning items addressed were higher in the video-based format (11.29 vs. 7.71, p=0.002 and 3.71 vs. 2.04, p=0.010, respectively). Supervisors engaged students more actively during the video-based sessions than during direct feedback sessions (self-assessment: 4.00 vs. 3.17, p=0.007; active problem-solving: 3.92 vs. 3.42, p=0.009). Students made similar observations and tended to consider that the video feedback was more useful for improving some clinical skills. Conclusion: Video-based feedback facilitates discussion of clinical reasoning, communication, and professionalism issues while at the same time actively engaging students. Different time and conceptual frameworks may explain observed differences. The choice of feedback format should depend on
New South Wales Dept. of Education, Sydney (Australia).
Penfield, Randall David
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of…
Susan C. Gillmor
This study explores a new item-writing framework for improving the validity of math assessment items. The authors transfer insights from Cognitive Load Theory (CLT), traditionally used in instructional design, to educational measurement. Fifteen multiple-choice math assessment items were modified using research-based strategies for reducing extraneous cognitive load. An experimental design with 222 middle-school students tested the effects of the reduced cognitive load items on student performance and anxiety. Significant findings confirm the main research hypothesis that reducing the cognitive load of math assessment items improves student performance. Three load-reducing item modifications are identified as particularly effective for reducing item difficulty: signalling important information, aesthetic item organization, and removing extraneous content. Load reduction was not shown to impact student anxiety. Implications for classroom assessment and future research are discussed.
Dopper, Sofia M.; Sjoer, Ellen
This article describes the possibilities offered by the online assessment system Etude to achieve the benefits of formative assessment. In order to find out the way this works in practice, we carried out an experiment with the use of Etude for formative assessment in the course on collaborative report writing. Results show that online formative…
Alonzo, Alicia C.
Learning progressions--particularly as defined and operationalized in science education--have significant potential to inform teachers' formative assessment practices. In this overview article, I lay out an argument for this potential, starting from definitions for "formative assessment practices" and "learning progressions"…
Andriessen, T.M.J.C.; Jong, B. de; Jacobs, B.; Werf, S.P. van der; Vos, P.E.
PRIMARY OBJECTIVE: To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). METHODS: Daily
Panjaitan, R. L.; Irawati, R.; Sujana, A.; Hanifah, N.; Djuanda, D.
In the literature on evaluation and test analysis, it is common to find calculations of both item validity and the item discrimination index (D), each with a different formula. Other sources, however, state that the item discrimination index can be obtained by calculating the correlation between the testee’s score on a particular item and the testee’s score on the overall test, which is in fact the same concept as item validity. Some research reports, especially undergraduate theses, tend to include both item validity and the item discrimination index in the instrument analysis. These concepts may overlap, since both reflect the quality of the test in measuring the examinees’ ability. In this paper, examples of data-processing results for item validity and the item discrimination index are compared. The paper discusses whether one of the two can represent both, or whether it is better to present both calculations in simple test analyses, especially in undergraduate theses that include test analyses.
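The two quantities contrasted in this abstract can be sketched side by side. This is a minimal illustration assuming dichotomously scored (0/1) items; the 27% upper and lower groups are a common convention rather than something the abstract specifies, and the function names and response data are hypothetical.

```python
from statistics import mean, pstdev


def discrimination_index(item, total, fraction=0.27):
    """Upper-lower index D: proportion correct in the top group minus the bottom group.

    `fraction` is the share of examinees in each extreme group (assumed 27%).
    """
    n = len(item)
    k = max(1, round(n * fraction))
    # Rank examinees by total test score, lowest first.
    order = sorted(range(n), key=lambda i: total[i])
    lower, upper = order[:k], order[-k:]
    p_upper = mean(item[i] for i in upper)
    p_lower = mean(item[i] for i in lower)
    return p_upper - p_lower


def item_total_correlation(item, total):
    """Pearson correlation between a 0/1 item score and the total score.

    For a dichotomous item this is the point-biserial correlation.
    The total here still includes the item itself (uncorrected).
    """
    mx, my = mean(item), mean(total)
    sx, sy = pstdev(item), pstdev(total)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either score has no variance
    cov = mean((x - mx) * (y - my) for x, y in zip(item, total))
    return cov / (sx * sy)


# Hypothetical data: one item's scores and total scores for ten examinees.
item_scores = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
total_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print("D =", discrimination_index(item_scores, total_scores))
print("r =", round(item_total_correlation(item_scores, total_scores), 3))
```

Both statistics rise when high scorers answer the item correctly more often than low scorers, which is why the two can appear redundant; D uses only the extreme groups, whereas the correlation uses every examinee.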
van der Kleij, Fabienne; Vermeulen, Jorine; Schildkamp, Kim; Eggen, Theodorus Johannes Hendrikus Maria
Recent research has highlighted the lack of a uniform definition of formative assessment, although its effectiveness is widely acknowledged. This paper addresses the theoretical differences and similarities amongst three approaches to formative assessment that are currently most frequently discussed
Errors that are related to some intrinsic property of the items measured are often encountered in nuclear material accounting. An example is the error in nondestructive assay measurements caused by uncorrected matrix effects. Nuclear material accounting requires for each materials type one measurement method for which bounds on these errors can be determined. If such a method is available, a second method might be used to reduce costs or to improve precision. If the measurement error for the first method is longer-tailed than Gaussian, then precision might be improved by measuring all items by both methods. 8 refs
Goh, Shaun K Y; Tham, Elaine K H; Magiati, Iliana; Sim, Litwee; Sanmugam, Shamini; Qiu, Anqi; Daniel, Mary L; Broekman, Birit F P; Rifkin-Graboi, Anne
The purpose of this study was to improve standardized language assessments among bilingual toddlers by investigating and removing the effects of bias due to unfamiliarity with cultural norms or a distributed language system. The Expressive and Receptive Bayley-III language scales were adapted for use in a multilingual country (Singapore). Differential item functioning (DIF) was applied to data from 459 two-year-olds without atypical language development. This involved investigating if the probability of success on each item varied according to language exposure while holding latent language ability, gender, and socioeconomic status constant. Associations with language, behavioral, and emotional problems were also examined. Five of 16 items showed DIF, 1 of which may be attributed to cultural bias and another to a distributed language system. The remaining 3 items favored toddlers with higher bilingual exposure. Removal of DIF items reduced associations between language scales and emotional and language problems, but improved the validity of the expressive scale from poor to good. Our findings indicate the importance of considering cultural and distributed language bias in standardized language assessments. We discuss possible mechanisms influencing performance on items favoring bilingual exposure, including the potential role of inhibitory processing.
Johnston, Marie; Dixon, Diane; Hart, Jo; Glidewell, Liz; Schröder, Carin; Pollard, Beth
In studies involving theoretical constructs, it is important that measures have good content validity and that there is not contamination of measures by content from other constructs. While reliability and construct validity are routinely reported, to date, there has not been a satisfactory, transparent, and systematic method of assessing and reporting content validity. In this paper, we describe a methodology of discriminant content validity (DCV) and illustrate its application in three studies. Discriminant content validity involves six steps: construct definition, item selection, judge identification, judgement format, single-sample test of content validity, and assessment of discriminant items. In three studies, these steps were applied to a measure of illness perceptions (IPQ-R) and control cognitions. The IPQ-R performed well with most items being purely related to their target construct, although timeline and consequences had small problems. By contrast, the study of control cognitions identified problems in measuring constructs independently. In the final study, direct estimation response formats for theory of planned behaviour constructs were found to have as good DCV as Likert format. The DCV method allowed quantitative assessment of each item and can therefore inform the content validity of the measures assessed. The methods can be applied to assess content validity before or after collecting data to select the appropriate items to measure theoretical constructs. Further, the data reported for each item in Appendix S1 can be used in item or measure selection. Statement of contribution What is already known on this subject? There are agreed methods of assessing and reporting construct validity of measures of theoretical constructs, but not their content validity. Content validity is rarely reported in a systematic and transparent manner. What does this study add? The paper proposes discriminant content validity (DCV), a systematic and transparent method
This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test’s distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan
This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
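For readers unfamiliar with the three-parameter logistic model named above, its response function can be sketched as follows. The symbols a (discrimination), b (difficulty), c (guessing) and theta (ability) follow the usual textbook notation and are not taken from the abstract; the form below omits the optional scaling constant, and the parameter values are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, average difficulty,
# and a guessing floor of 0.2 (e.g. a five-option multiple-choice item).
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta, round(p_3pl(theta, a=1.2, b=0.0, c=0.2), 3))
```

The guessing parameter c sets the lower asymptote of the curve: even examinees of very low ability answer correctly with probability close to c, and at theta = b the probability is halfway between c and 1.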
Jessen, Annika; Ho, Andrew D; Corrales, C Eduardo; Yueh, Bevan; Shin, Jennifer J
Objectives (1) To assess the 11-item Inner Effectiveness of Auditory Rehabilitation (Inner EAR) instrument with item response theory (IRT). (2) To determine whether the underlying latent ability could also be accurately represented by a subset of the items for use in high-volume clinical scenarios. (3) To determine whether the Inner EAR instrument correlates with pure tone thresholds and word recognition scores. Design IRT evaluation of prospective cohort data. Setting Tertiary care academic ambulatory otolaryngology clinic. Subjects and Methods Modern psychometric methods, including factor analysis and IRT, were used to assess unidimensionality and item properties. Regression methods were used to assess prediction of word recognition and pure tone audiometry scores. Results The Inner EAR scale is unidimensional, and items varied in their location and information. Information parameter estimates ranged from 1.63 to 4.52, with higher values indicating more useful items. The IRT model provided a basis for identifying 2 sets of items with relatively lower information parameters. Item information functions demonstrated which items added insubstantial value over and above other items and were removed in stages, creating an 8-item and a 3-item Inner EAR scale for more efficient assessment. The 8-item version accurately reflected the underlying construct. All versions correlated moderately with word recognition scores and pure tone averages. Conclusion The 11-, 8-, and 3-item versions of the Inner EAR scale have strong psychometric properties, and there is correlational validity evidence for the observed scores. Modern psychometric methods can help streamline care delivery by maximizing relevant information per item administered.
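How an item information function ranks items can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) model for a dichotomous item and treats the reported "information parameter" as the discrimination a; the abstract does not state which IRT model was fitted, so both are assumptions, and the location value b = 0 is hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of endorsing the item under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# The two extremes of the range reported in the abstract (1.63 and 4.52),
# both placed at a hypothetical location b = 0.
for a in (1.63, 4.52):
    peak = item_information(0.0, a, 0.0)  # information peaks at theta = b
    print(f"a = {a}: peak information = {peak:.3f}")
```

Under this model the peak information equals a squared divided by 4, so an item with a higher discrimination contributes far more measurement precision near its location, which is the rationale for dropping the lowest-information items first when shortening a scale.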
Previous research on the impact of text and formatting changes on test-item performance has produced mixed results. This matter is important because it is generally acknowledged that any change to an item requires that it be recalibrated. The present study investigated the effects of seven classes of stylistic changes on item difficulty, discrimination, and response time for a subset of 65 items that make up a standardized test for physician licensure completed by 31,918 examinees in 2012. One of two versions of each item (original or revised) was randomly assigned to examinees such that each examinee saw only two experimental items, with each item being administered to approximately 480 examinees. The stylistic changes had little or no effect on item difficulty or discrimination; however, one class of edits, changing an item from an open lead-in (incomplete statement) to a closed lead-in (direct question), did result in slightly longer response times. Data for nonnative speakers of English were analyzed separately with nearly identical results. These findings have implications for the conventional practice of repretesting (or recalibrating) items that have been subjected to minor editorial changes.
Raupach, Tobias; Hanneforth, Nathalie; Anders, Sven; Pukrop, Tobias; Th J ten Cate, Olle; Harendza, Sigrid
Interpretation of the electrocardiogram (ECG) is a core clinical skill that should be developed in undergraduate medical education. This study assessed whether small-group peer teaching is more effective than lectures in enhancing medical students' ECG interpretation skills. In addition, the impact of assessment format on study outcome was analysed. Two consecutive cohorts of Year 4 medical students (n=335) were randomised to receive either traditional ECG lectures or the same amount of small-group, near-peer teaching during a 6-week cardiorespiratory course. Before and after the course, written assessments of ECG interpretation skills were undertaken. Whereas this final assessment yielded a considerable amount of credit points for students in the first cohort, it was merely formative in nature for the second cohort. An unannounced retention test was applied 8 weeks after the end of the cardiovascular course. A significant advantage of near-peer teaching over lectures (effect size 0.33) was noted only in the second cohort, whereas, in the setting of a summative assessment, both teaching formats appeared to be equally effective. A summative instead of a formative assessment doubled the performance increase (Cohen's d 4.9 versus 2.4), mitigating any difference between teaching formats. Within the second cohort, the significant difference between the two teaching formats was maintained in the retention test (p=0.017). However, in both cohorts, a significant decrease in student performance was detected during the 8 weeks following the cardiovascular course. Assessment format appeared to be more powerful than choice of instructional method in enhancing student learning. The effect observed in the second cohort was masked by an overriding incentive generated by the summative assessment in the first cohort. This masking effect should be considered in studies assessing the effectiveness of different teaching methods.
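The effect sizes quoted in this abstract are Cohen's d values. As a rough sketch of how such a value is obtained, the function below uses the pooled-standard-deviation form for two independent groups; the abstract does not say which variant the authors used, and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical test scores for two teaching formats.
peer_teaching = [14, 16, 15, 18, 17]
lectures = [13, 15, 14, 16, 15]

print("d =", round(cohens_d(peer_teaching, lectures), 2))
```

Because d expresses the difference in units of standard deviation, it allows the gain under summative and formative assessment to be compared on one scale even when raw score ranges differ.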
Balsis, Steve; Choudhury, Tabina K; Geraci, Lisa; Benge, Jared F; Patrick, Christopher J
Alzheimer's disease (AD) affects neurological, cognitive, and behavioral processes. Thus, to accurately assess this disease, researchers and clinicians need to combine and incorporate data across these domains. This presents not only distinct methodological and statistical challenges but also unique opportunities for the development and advancement of psychometric techniques. In this article, we describe relatively recent research using item response theory (IRT) that has been used to make progress in assessing the disease across its various symptomatic and pathological manifestations. We focus on applications of IRT to improve scoring, test development (including cross-validation and adaptation), and linking and calibration. We conclude by describing potential future multidimensional applications of IRT techniques that may improve the precision with which AD is measured.
Bisby, James A; Burgess, Neil
The formation of associations between items and their context has been proposed to rely on mechanisms distinct from those supporting memory for a single item. Although emotional experiences can profoundly affect memory, our understanding of how it interacts with different aspects of memory remains unclear. We performed three experiments to examine the effects of emotion on memory for items and their associations. By presenting neutral and negative items with background contexts, Experiment 1 demonstrated that item memory was facilitated by emotional affect, whereas memory for an associated context was reduced. In Experiment 2, arousal was manipulated independently of the memoranda, by a threat of shock, whereby encoding trials occurred under conditions of threat or safety. Memory for context was equally impaired by the presence of negative affect, whether induced by threat of shock or a negative item, relative to retrieval of the context of a neutral item in safety. In Experiment 3, participants were presented with neutral and negative items as paired associates, including all combinations of neutral and negative items. The results showed both above effects: compared to a neutral item, memory for the associate of a negative item (a second item here, context in Experiments 1 and 2) is impaired, whereas retrieval of the item itself is enhanced. Our findings suggest that negative affect impairs associative memory while recognition of a negative item is enhanced. They support dual-processing models in which negative affect or stress impairs hippocampal-dependent associative memory while the storage of negative sensory/perceptual representations is spared or even strengthened.
Van der Kleij, Fabienne M.; Vermeulen, Jorine A.; Schildkamp, Kim; Eggen, Theo J. H. M.
Recent research has highlighted the lack of a uniform definition of formative assessment, although its effectiveness is widely acknowledged. This paper addresses the theoretical differences and similarities amongst three approaches to formative assessment that are currently most frequently discussed in educational research literature: data-based…
Shirley, Melissa L.; Irving, Karen E.
Formative assessment has been demonstrated to result in increased student achievement across a variety of educational contexts. When using formative assessment strategies, teachers engage students in instructional tasks that allow the teacher to uncover levels of student understanding so that the teacher may change instruction accordingly. Tools that support the implementation of formative assessment strategies are therefore likely to enhance student achievement. Connected classroom technologies (CCTs) include a family of devices that show promise in facilitating formative assessment. By promoting the use of interactive student tasks and providing both teachers and students with rapid and accurate data on student learning, CCT can provide teachers with necessary evidence for making instructional decisions about subsequent lessons. In this study, the experiences of four middle and high school science teachers in their first year of implementing the TI-Navigator™ system, a specific type of CCT, are used to characterize the ways in which CCT supports the goals of effective formative assessment. We present excerpts of participant interviews to demonstrate the alignment of CCT with several main phases of the formative assessment process. CCT was found to support implementation of a variety of instructional tasks that generate evidence of student learning for the teacher. The rapid aggregation and display of student learning evidence provided teachers with robust data on which to base subsequent instructional decisions.
Doubet, Kristina J.
A rural middle level school had stalled in its third year of a district-wide differentiation initiative. This article describes the way teachers and the leadership team engaged in collaborative practices to put a spotlight on formative assessment. Teachers learned to systematically gather formative assessment data from their students and to use…
Tucker, Joan S; Shadel, William G; Edelen, Maria Orlando; Stucky, Brian D; Li, Zhen; Hansen, Mark; Cai, Li
The positive emotional and sensory expectancies of cigarette smoking include improved cognitive abilities, positive affective states, and pleasurable sensorimotor sensations. This paper describes development of Positive Emotional and Sensory Expectancies of Smoking item banks that will serve to standardize the assessment of this construct among daily and nondaily cigarette smokers. Data came from daily (N = 4,201) and nondaily (N = 1,183) smokers who completed an online survey. To identify a unidimensional set of items, we conducted item factor analyses, item response theory analyses, and differential item functioning analyses. Additionally, we evaluated the performance of fixed-item short forms (SFs) and computer adaptive tests (CATs) to efficiently assess the construct. Eighteen items were included in the item banks (15 common across daily and nondaily smokers, 1 unique to daily, 2 unique to nondaily). The item banks are strongly unidimensional, highly reliable (reliability = 0.95 for both), and perform similarly across gender, age, and race/ethnicity groups. A SF common to daily and nondaily smokers consists of 6 items (reliability = 0.86). Results from simulated CATs indicated that, on average, fewer than 8 items are needed to assess the construct with adequate precision using the item banks. These analyses identified a new set of items that can assess the positive emotional and sensory expectancies of smoking in a reliable and standardized manner. Considerable efficiency in assessing this construct can be achieved by using the item bank SF, employing computer adaptive tests, or selecting subsets of items tailored to specific research or clinical purposes. © The Author 2014. Published by Oxford University Press on behalf of the Society for Research on Nicotine and Tobacco.
Wilbanks, Jammie T.
Research has been conducted on the effectiveness of formative assessments and on effectively teaching medical terminology; however, research had not been conducted on the use of formative assessments in a medical terminology course. A quantitative study was performed which captured data from a pretest, self-assessment, four module exams, and a…
The comparability of English, French and Dutch scores on the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F): an assessment of differential item functioning in patients with systemic sclerosis.
The Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) is commonly used to assess fatigue in rheumatic diseases, and has been shown to discriminate better across levels of the fatigue spectrum than other commonly used measures. The aim of this study was to assess the cross-language measurement equivalence of the English, French, and Dutch versions of the FACIT-F in systemic sclerosis (SSc) patients. The FACIT-F was completed by 871 English-speaking Canadian, 238 French-speaking Canadian and 230 Dutch SSc patients. Confirmatory factor analysis was used to assess the factor structure in the three samples. The Multiple-Indicator Multiple-Cause (MIMIC) model was utilized to assess differential item functioning (DIF), comparing English versus French and versus Dutch patient responses separately. A unidimensional factor model showed good fit in all samples. Comparing French versus English patients, statistically significant, but small-magnitude DIF was found for 3 of 13 items. French patients had 0.04 of a standard deviation (SD) lower latent fatigue scores than English patients and there was an increase of only 0.03 SD after accounting for DIF. For the Dutch versus English comparison, 4 items showed small, but statistically significant, DIF. Dutch patients had 0.20 SD lower latent fatigue scores than English patients. After correcting for DIF, there was a reduction of 0.16 SD in this difference. There was statistically significant DIF in several items, but the overall effect on fatigue scores was minimal. English, French and Dutch versions of the FACIT-F can be reasonably treated as having equivalent scoring metrics.
Formative assessment is a pedagogic practice that has been the subject of much research and debate as to how it can be used most effectively to deliver enhanced student learning in the higher education setting. Often described as a complex concept, it embraces activities that range from facilitating students' understanding of assessment standards to providing formative feedback on their work; from very informal opportunities of engaging in conversations to the very formal process of submitting drafts of work. This study aims to show how cultural historical activity theory can be used as a qualitative analysis framework to explore the complexities of formative assessment as it is used in higher education. The original data for the research were collected in 2008 by semi-structured interviews and analysed using a hermeneutic phenomenological approach. For this present paper three selected transcripts were re-examined, using a case study approach that sought to understand and compare the perceptions of five academic staff, from three distinct subject areas taught within a UK university. It is proposed that using activity theory can provide insight into the complexity of such experiences, about what teachers do and why, and the influence of the community in which they are situated. Individually, the cases from each subject area were analysed using activity theory, exploring how the mediating artefacts of formative assessment were used, the often implicit rules that governed their use, and the roles of teachers and students within the local subject community. The analysis also considered the influence each aspect of the unit of activity had on the other in understanding formative assessment practice. Subsequently the three subject cases were compared and contrasted. The findings illuminate a variety of practices, including how students and staff engage together in formative assessment activities and for some, how dialogue is used as one of the key tools
Wang, Jing-Jing; Chen, Tzu-An; Baranowski, Tom; Lau, Patrick W C
This study aimed to evaluate the psychometric properties of four self-efficacy scales (i.e., self-efficacy for fruit (FSE), vegetable (VSE), and water (WSE) intakes, and physical activity (PASE)) and to investigate their differences in item functioning across sex, age, and body weight status groups using item response modeling (IRM) and differential item functioning (DIF). Four self-efficacy scales were administered to 763 Hong Kong Chinese children (55.2% boys) aged 8-13 years. Classical test theory (CTT) was used to examine the reliability and factorial validity of the scales. IRM was conducted and DIF analyses were performed to assess the characteristics of item parameter estimates on the basis of children's sex, age and body weight status. All self-efficacy scales demonstrated adequate to excellent internal consistency reliability (Cronbach's α: 0.79-0.91). One FSE misfit item and one PASE misfit item were detected. Small DIF was found for all the scale items across children's age groups. Items with medium to large DIF were detected in different sex and body weight status groups, which will require modification. A Wright map revealed that items covered the range of the distribution of participants' self-efficacy for each scale except VSE. Several self-efficacy scales' items functioned differently by children's sex and body weight status. Additional research is required to modify the four self-efficacy scales to minimize these moderating influences for application.
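The internal consistency figures quoted above are Cronbach's α values. As a minimal sketch of how such a coefficient is computed from an item-score matrix (the function name and the toy data are illustrative assumptions, not taken from the study):

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(scores[0])
    item_columns = list(zip(*scores))           # one tuple of scores per item
    item_vars = [variance(col) for col in item_columns]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: three respondents, two items.
toy = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(toy)  # 2 * (1 - 2/3), i.e. about 0.667
```

Values in the 0.79-0.91 range reported above would conventionally be read as adequate to excellent, which matches the abstract's own wording.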
Proposal of a tool to assess the satisfaction of bank customers using Item Response Theory
Alceu Balbim Junior
This paper presents a measurement instrument for assessing the satisfaction of bank customers using Item Response Theory (IRT). Organizations that seek to remain competitive are constantly striving to satisfy their customers. Several studies have reported on the relationship between perceived quality, satisfaction, and loyalty. Satisfaction can be assessed through the quality perceived by customers, and the development of assessment tools should address specific features of the activity in question. Based on articles that assess the satisfaction of bank customers, this study proposes an assessment tool consisting of 29 items. The items were applied to 240 clients to assess their satisfaction with the bank with which they have the most dealings. Using Item Response Theory, the item parameters and the information curve were identified. Analysis of the items' degree of discrimination indicated that all of them are appropriate. The information curve showed the interval over which the instrument gives the best estimates of satisfaction levels. The study reports the mean satisfaction level of the sample and the concentration of customers at the different satisfaction levels of the scale.
Hartmeyer, Rikke; Stevenson, Matthew Peter; Bentsen, Peter
Background and purpose: Research in formative assessment often pays close attention to the strategies which can be used by teachers. However, less emphasis in the literature seems to have been paid to study the application of formative assessment designs in practice. In this paper, we argue that a formative assessment design that we call Eva-Mapping, which is developed on the principles of design-based research, can be a productive starting point for disseminating and further developing formative assessment practices in outdoor science teaching. Sample, design and methods: We conducted an evaluation of the design, based on video-elicited focus group interviews with two groups of experienced science teachers. Both groups consisted of teachers who taught science outside the classroom on a regular basis. These groups watched identical video sequences which were recorded during lessons in which teachers...
Best-worst scaling is a judgment format in which participants are presented with a set of items and have to choose the superior and inferior items in the set. Best-worst scaling generates a large quantity of information per judgment because each judgment allows for inferences about the rank value of all unjudged items. This property of best-worst scaling makes it a promising judgment format for research in psychology and natural language processing concerned with estimating the semantic properties of tens of thousands of words. A variety of different scoring algorithms have been devised in the previous literature on best-worst scaling. However, due to problems of computational efficiency, these scoring algorithms cannot be applied efficiently to cases in which thousands of items need to be scored. New algorithms are presented here for converting responses from best-worst scaling into item scores for thousands of items (many-item scoring problems). These scoring algorithms are validated through simulation and empirical experiments, and considerations related to noise, the underlying distribution of true values, and trial design are identified that can affect the relative quality of the derived item scores. The newly introduced scoring algorithms consistently outperformed scoring algorithms used in the previous literature on scoring many-item best-worst data.
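For orientation, the simplest scoring rule for best-worst data is a counting score: the number of times an item was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it appeared. The sketch below illustrates only that baseline; it is not one of the new many-item algorithms the abstract refers to, and the function name and trial format are assumptions made for the example.

```python
from collections import Counter


def best_worst_counts(trials):
    """Counting scores for best-worst scaling.

    trials: iterable of (items, best, worst) tuples, where `items` is the set
    shown on one trial and `best` / `worst` are the participant's choices.
    Returns {item: (times best - times worst) / times shown}.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in trials:
        seen.update(items)   # every item on the trial counts as one appearance
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical trials with four items each.
trials = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "e"),
    (("b", "c", "d", "e"), "b", "d"),
]
scores = best_worst_counts(trials)
# scores["a"] is 1.0 (always best), scores["d"] is -1.0 (always worst)
```

Scores lie in [-1, 1]; items that never win or lose, such as "c" here, land at 0.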
Hondrich, Annika Lena; Hertel, Silke; Adl-Amini, Katja; Klieme, Eckhard
The implementation of formative assessment strategies is challenging for teachers. We evaluated teachers' implementation fidelity of a curriculum-embedded formative assessment programme for primary school science education, investigating both material-supported, direct application and subsequent transfer. Furthermore, the relationship between…
Wang, Jing; Bao, Lei
Item response theory is a popular assessment method used in education. It rests on the assumption of a probability framework that relates students' innate ability and their performance on test questions. Item response theory transforms students' raw test scores into a scaled proficiency score, which can be used to compare results obtained with different test questions. The scaled score also addresses the issues of ceiling effects and guessing, which commonly exist in quantitative assessment. We used item response theory to analyze the force concept inventory (FCI). Our results show that item response theory can be useful for analyzing physics concept surveys such as the FCI and produces results about the individual questions and student performance that are beyond the capability of classical statistics. The theory yields detailed measurement parameters regarding the difficulty, discrimination features, and probability of correct guess for each of the FCI questions.
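The three measurement parameters named at the end of this abstract (difficulty, discrimination, probability of a correct guess) are those of the three-parameter logistic (3PL) model. A minimal sketch of its item response function follows; the parameter values are hypothetical and are not FCI estimates.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability that a student of ability `theta` answers correctly.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical five-option item, so a blind guess succeeds with c = 0.2.
p_mid = p_correct(theta=0.0, a=1.0, b=0.0, c=0.2)    # 0.2 + 0.8 * 0.5 = 0.6
p_low = p_correct(theta=-50.0, a=1.0, b=0.0, c=0.2)  # approaches the guessing floor 0.2
```

The lower asymptote is what lets the scaled score account for guessing: even a very weak student is credited with probability c rather than 0.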
Singh, Lochan; Varshney, Jay G; Agarwal, Tripti
Polycyclic aromatic hydrocarbons (PAHs) have emerged as an important contaminant group in a gamut of processed food groups like dairy, nuts, herbs, beverages, meat products etc. Different cooking processes and processing techniques like roasting, barbecuing, grilling, smoking, heating, drying, baking, ohmic-infrared cooking etc. contribute towards their formation. The level of PAHs depends on factors like distance from heat source, fuel used, level of processing, cooking durations and methods, whereas processes like reuse, conching, concentration, crushing and storage enhance the amount of PAHs in some food items. This review paper provides insight into the impact of dietary intake of PAHs, their levels and formation mechanism in processed food items, and possible interventions for prevention and reduction of PAH contamination. The gaps and future prospects have also been assessed. Copyright © 2015 Elsevier Ltd. All rights reserved.
Taylor, Kathryn L; Shelby, Rebecca A; Schwartz, Marc D; Ackerman, Josh; LaSalle, V Holland; Gelmann, Edward P; McGuire, Colleen
Although perceived risk is central to most theories of health behavior, there is little consensus on its measurement with regard to item wording, response set, or the number of items to include. In a methodological assessment of perceived risk, we assessed the impact of changing the order of three commonly used perceived risk items: quantitative personal risk, quantitative population risk, and comparative risk. Participants were 432 men and women enrolled in an ancillary study of the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Three groups of consecutively enrolled participants responded to the three items in one of three question orders. Results indicated that item order was related to the perceived risk ratings of both ovarian (P Perceptions of risk were significantly lower when the comparative rating was made first. The findings suggest that compelling participants to consider their own risk relative to the risk of others results in lower ratings of perceived risk. Although the use of multiple items may provide more information than when only a single method is used, different conclusions may be reached depending on the context in which an item is assessed.
The Comparability of English, French and Dutch Scores on the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F): An Assessment of Differential Item Functioning in Patients with Systemic Sclerosis
Kwakkenbos, Linda; Willems, Linda M.; Baron, Murray; Hudson, Marie; Cella, David; van den Ende, Cornelia H. M.; Thombs, Brett D.
Objective: The Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) is commonly used to assess fatigue in rheumatic diseases, and has shown to discriminate better across levels of the fatigue spectrum than other commonly used measures. The aim of this study was to assess the cross-language measurement equivalence of the English, French, and Dutch versions of the FACIT-F in systemic sclerosis (SSc) patients. Methods: The FACIT-F was completed by 871 English-speaking Canadian, 238 French-speaking Canadian and 230 Dutch SSc patients. Confirmatory factor analysis was used to assess the factor structure in the three samples. The Multiple-Indicator Multiple-Cause (MIMIC) model was utilized to assess differential item functioning (DIF), comparing English versus French and versus Dutch patient responses separately. Results: A unidimensional factor model showed good fit in all samples. Comparing French versus English patients, statistically significant, but small-magnitude DIF was found for 3 of 13 items. French patients had 0.04 of a standard deviation (SD) lower latent fatigue scores than English patients and there was an increase of only 0.03 SD after accounting for DIF. For the Dutch versus English comparison, 4 items showed small, but statistically significant, DIF. Dutch patients had 0.20 SD lower latent fatigue scores than English patients. After correcting for DIF, there was a reduction of 0.16 SD in this difference. Conclusions: There was statistically significant DIF in several items, but the overall effect on fatigue scores was minimal. English, French and Dutch versions of the FACIT-F can be reasonably treated as having equivalent scoring metrics. PMID: 24638101
Fayers, Tessa; Dolman, Peter J
To develop and test a user-friendly questionnaire for rapidly assessing quality of life (QOL) in thyroid eye disease (TED). A three-item questionnaire, the TED-QOL, was designed and compared to the 16-item Graves Ophthalmopathy (GO)-QOL and the nine-item GO-Quality of Life Scale (QLS). 100 patients with TED were administered all three questionnaires on two occasions. Results were compared to clinical severity scores (Vision, Inflammation, Strabismus, Appearance (VISA) classification). Main outcomes were construct and criterion validity, test-retest reliability, duration, comprehension and completion rates. TED-QOL correlated strongly with the other questionnaires for corresponding items (Pearson correlation: appearance 0.71, 0.62; functioning 0.69, 0.66; overall QOL 0.53). Test-retest analysis demonstrated good reliability for all three questionnaires (intraclass correlations: TED-QOL 0.81, 0.74, 0.87; GO-QOL 0.81, 0.82; GO-QLS 0.74, 0.86, 0.67). TED-QOL was significantly faster to complete (1.6 min vs GO-QOL 3.1 min, GO-QLS 2.7 min, p<0.0001) and had a higher completion rate (100% vs GO-QOL 78%, GO-QLS 94%). There was only moderate correlation between items on all three questionnaires and VISA scores. The TED-QOL is rapid and easy to complete and analyse and has similar validity and reliability to longer questionnaires. All questionnaires showed only moderate correlation with disease severity, emphasising the discrepancy between objective and subjective assessments and the importance of measuring both.
This column focuses on promoting learning through assessment. This month's issue describes using formative assessment probes to uncover several ways of thinking about the puzzling discovery of a marine fossil on top of a mountain.
Nunes, Sandra; Oliveira, Teresa; Oliveira, Amílcar
Item Response Theory (IRT) has become one of the most popular scoring frameworks for measurement data, frequently used in computerized adaptive testing, cognitively diagnostic assessment and test equating. According to Andrade et al. (2000), IRT can be defined as a set of mathematical models (Item Response Models - IRM) constructed to represent the probability of an individual giving the right answer to an item of a particular test. The number of Item Response Models available for measurement analysis has increased considerably in the last fifteen years due to increasing computer power and a demand for accuracy and more meaningful inferences grounded in complex data. The developments in modeling with Item Response Theory were related to developments in estimation theory, most remarkably Bayesian estimation with Markov chain Monte Carlo algorithms (Patz & Junker, 1999). The popularity of Item Response Theory has also led to numerous overviews in books and journals, and many connections between IRT and other statistical estimation procedures, such as factor analysis and structural equation modeling, have been made repeatedly (Van der Linden & Hambleton, 1997). As stated before, Item Response Theory covers a variety of measurement models, ranging from basic one-dimensional models for dichotomously and polytomously scored items and their multidimensional analogues to models that incorporate information about cognitive sub-processes which influence the overall item response process. The aim of this work is to introduce the main concepts associated with one-dimensional models of Item Response Theory, to specify the logistic models with one, two and three parameters, to discuss some properties of these models and to present the main estimation procedures.
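For reference, the one-, two- and three-parameter logistic models mentioned above are usually written as follows. The notation is an assumption chosen for this note, not the paper's own: $\theta_j$ is the ability of person $j$, and $b_i$, $a_i$, $c_i$ are the difficulty, discrimination and guessing parameters of item $i$. Some texts also include a scaling constant $D \approx 1.7$ in the exponent, omitted here.

```latex
\begin{align}
\text{1PL:}\quad P(U_{ij}=1 \mid \theta_j) &= \frac{1}{1+e^{-(\theta_j-b_i)}} \\
\text{2PL:}\quad P(U_{ij}=1 \mid \theta_j) &= \frac{1}{1+e^{-a_i(\theta_j-b_i)}} \\
\text{3PL:}\quad P(U_{ij}=1 \mid \theta_j) &= c_i + (1-c_i)\,\frac{1}{1+e^{-a_i(\theta_j-b_i)}}
\end{align}
```

The models are nested: setting $c_i = 0$ in the 3PL gives the 2PL, and additionally setting $a_i = 1$ gives the 1PL.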
Morean, Meghan E; Krishnan-Sarin, Suchitra; O'Malley, Stephanie S
Adolescent e-cigarette use (i.e., "vaping") likely confers risk for developing nicotine dependence. However, there have been no studies assessing e-cigarette nicotine dependence in youth. We evaluated the psychometric properties of the 4-item Patient-Reported Outcomes Measurement Information System Nicotine Dependence Item Bank for E-cigarettes (PROMIS-E) for assessing youth e-cigarette nicotine dependence and examined risk factors for experiencing stronger dependence symptoms. In 2017, 520 adolescent past-month e-cigarette users completed the PROMIS-E during a school-based survey (50.5% female, 84.8% White, 16.22[1.19] years old). Adolescents also reported on sex, grade, race, age at e-cigarette use onset, vaping frequency, nicotine e-liquid use, and past-month cigarette smoking. Analyses included conducting confirmatory factor analysis and examining the internal consistency of the PROMIS-E. Bivariate correlations and independent-samples t-tests were used to examine unadjusted relationships between e-cigarette nicotine dependence and the proposed risk factors. Regression models were run in which all potential risk factors were entered as simultaneous predictors of PROMIS-E scores. The single-factor structure of the PROMIS-E was confirmed and evidenced good internal consistency. Across models, larger PROMIS-E scores were associated with being in a higher grade, initiating e-cigarette use at an earlier age, vaping more frequently, using nicotine e-liquid (and higher nicotine concentrations), and smoking cigarettes. Adolescent e-cigarette users reported experiencing nicotine dependence, which was assessed using the psychometrically sound PROMIS-E. Experiencing stronger nicotine dependence symptoms was associated with characteristics that previously have been shown to confer risk for frequent vaping and tobacco cigarette dependence. Copyright © 2018 Elsevier B.V. All rights reserved.
Costa, Daniel S J; Asghari, Ali; Nicholas, Michael K
The Pain Self-Efficacy Questionnaire (PSEQ) is a 10-item instrument designed to assess the extent to which a person in pain believes s/he is able to accomplish various activities despite their pain. There is strong evidence for the validity and reliability of both the full-length PSEQ and a 2-item version. The purpose of this study is to further examine the properties of the PSEQ using an item response theory (IRT) approach. We used the two-parameter graded response model to examine the category probability curves, and location and discrimination parameters of the 10 PSEQ items. In item response theory, responses to a set of items are assumed to be probabilistically determined by a latent (unobserved) variable. In the graded-response model specifically, item response threshold (the value of the latent variable for which adjacent response categories are equally likely) and discrimination parameters are estimated for each item. Participants were 1511 mixed, chronic pain patients attending for initial assessment at a tertiary pain management centre. All items except item 7 ('I can cope with my pain without medication') performed well in IRT analysis, and the category probability curves suggested that participants used the 7-point response scale consistently. Items 6 ('I can still do many of the things I enjoy doing, such as hobbies or leisure activity, despite pain'), 8 ('I can still accomplish most of my goals in life, despite the pain') and 9 ('I can live a normal lifestyle, despite the pain') captured higher levels of the latent variable with greater precision. The results from this IRT analysis add to the body of evidence based on classical test theory illustrating the strong psychometric properties of the PSEQ. Despite the relatively poor performance of Item 7, its clinical utility warrants its retention in the questionnaire. The strong psychometric properties of the PSEQ support its use as an effective tool for assessing self-efficacy in people with pain
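To make the graded response model concrete: in its usual formulation, each item has one discrimination parameter and an ordered set of thresholds, each threshold being the latent-variable value at which the probability of responding in that category or higher is 0.5. Category probabilities are differences of adjacent cumulative curves. The sketch below assumes that standard formulation; the parameter values are hypothetical and are not PSEQ estimates.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Graded response model category probabilities for one item.

    theta: latent variable value; a: discrimination;
    thresholds: increasing list of m - 1 thresholds for an m-category item.
    Returns a list of m probabilities, lowest category first.
    """
    # Cumulative probabilities P(X >= k); P(X >= lowest) = 1 and P(X > highest) = 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical item on a 7-point response scale (six thresholds).
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-2, -1, 0, 1, 2, 3])
```

Plotting each entry of `probs` against a grid of `theta` values yields the category probability curves the abstract describes.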
Saltychev, Mikhail; Bärlund, Esa; Mattie, Ryan; McCormick, Zachary; Paltamaa, Jaana; Laimi, Katri
To assess the validity of the Finnish translation of the 12-item World Health Organization Disability Assessment Schedule (WHODAS 2.0). Cross-sectional cohort survey study. Physical and Rehabilitation Medicine outpatient university clinic. The 501 consecutive patients with chronic musculoskeletal pain. Exploratory factor analysis and a graded response model using item response theory analysis were used to assess the constructs and discrimination ability of WHODAS 2.0. The exploratory factor analysis revealed two retained factors with eigenvalues 5.15 and 1.04. Discrimination ability of all items was high or perfect, varying from 1.2 to 2.5. The difficulty levels of seven out of 12 items were shifted towards the elevated disability level. As a result, the entire test characteristic curve showed a shift towards higher levels of disability, placing it at the point of disability level of +1 (where 0 indicates the average level of disability within the sample). The present data indicate that the Finnish translation of the 12-item WHODAS 2.0 is a valid instrument for measuring restrictions of activity and participation among patients with chronic musculoskeletal pain.
Andriessen, Teuntje M J C; de Jong, Ben; Jacobs, Bram; van der Werf, Sieberen P; Vos, Pieter E
To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). Daily testing was performed in 64 consecutively admitted traumatic brain injured patients, 22 orthopedically injured patients and 26 healthy controls until criteria for resolution of PTA were reached. Subjects were randomly assigned to a test with visual or verbal stimuli. Short delay reproduction was tested after an interval of 3-5 minutes, long delay reproduction was tested after 24 hours. Sensitivity and specificity were calculated over the first 4 test days. The 3-word test showed higher sensitivity than the 3-picture test, while specificity of the two tests was equally high. Free recall was a more effortful task than recognition for both patients and controls. In patients, a longer delay between registration and recall resulted in a significant decrease in the number of items reproduced. Presence of PTA is best assessed with a memory test that incorporates the free recall of words after a long delay.
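Sensitivity and specificity, the two outcome measures compared above, come from a simple two-by-two classification table. A minimal sketch with made-up counts (none of these numbers come from the study):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true cases the test flags
    specificity = TN / (TN + FP): share of non-cases the test clears
    """
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts: 40 of 50 patients in PTA flagged, 45 of 48 controls cleared.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=3)
# sens is 0.8 and spec is 0.9375
```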
Mielenz, Thelma J; Callahan, Leigh F; Edwards, Michael C
Examine the feasibility of performing an item response theory (IRT) analysis on two of the Centers for Disease Control and Prevention health-related quality of life (CDC HRQOL) modules - the 4-item Healthy Days Core Module (HDCM) and the 5-item Healthy Days Symptoms Module (HDSM). Previous principal components analyses confirm that the two scales both assess a mix of mental (CDC-MH) and physical health (CDC-PH). The purpose is to conduct item response theory (IRT) analysis on the CDC-MH and CDC-PH scales separately. 2182 patients with self-reported or physician-diagnosed arthritis completed a cross-sectional survey including HDCM and HDSM items. Besides global health, the other 8 items ask the number of days that some statement was true; we chose to recode the data into 8 categories based on observed clustering. The IRT assumptions were assessed using confirmatory factor analysis and the data could be modeled using a unidimensional IRT model. The graded response model was used for IRT analyses and CDC-MH and CDC-PH scales were analyzed separately in flexMIRT. The IRT parameter estimates for the five-item CDC-PH all appeared reasonable. The three-item CDC-MH did not have reasonable parameter estimates. The CDC-PH scale is amenable to IRT analysis but the existing CDC-MH scale is not. We suggest either using the 4-item Healthy Days Core Module (HDCM) and the 5-item Healthy Days Symptoms Module (HDSM) as they currently stand or the CDC-PH scale alone if the primary goal is to measure physical health related HRQOL.
Ángel Vázquez Alonso
The scarce attention to assessment and evaluation in science education research has been especially harmful for Science-Technology-Society (STS) education, due to the dialectic, tentative, value-laden, and controversial nature of most STS topics. To overcome the methodological pitfalls of the STS assessment instruments used in the past, an empirically developed instrument (VOSTS, Views on Science-Technology-Society) has been suggested. Some methodological proposals, namely the multiple response models and the computing of a global attitudinal index, were suggested to improve the item implementation. The final step of these methodological proposals requires the categorization of STS statements. This paper describes the process of categorization through a scaling procedure ruled by a panel of experts, acting as judges, according to the body of knowledge from history, epistemology, and sociology of science. The statement categorization allows for the sound foundation of STS items, which is useful in educational assessment and science education research, and may also increase teachers’ self-confidence in the development of the STS curriculum for science classrooms.
Conejo, Ricardo; Garcia-Viñas, Juan Ignacio; Gastón, Aitor; Barros, Beatriz
Developing plant identification skills is an important part of the curriculum of any botany course in higher education. Frequent practice with dried and fresh plants is necessary to recognize the diversity of forms, states, and details that a species can present. We have developed a web-based assessment system for mobile devices that is able to pose appropriate questions according to the location of the student. A student's location can be obtained using the device position or by scanning a QR code attached to a dried plant sheet in a herbarium or to a fresh plant in an arboretum. The assessment questions are complemented with elaborated feedback that, according to the students' responses, provides indications of possible mistakes and correct answers. Three experiments were designed to measure the effectiveness of the formative assessment using dried and fresh plants. Three questionnaires were used to evaluate the system performance from the students' perspective. The results clearly indicate that formative assessment is objectively effective compared to traditional methods and that the students' attitudes towards the system were very positive.
Donati, Maria Anna; Chiesi, Francesca; Izzo, Viola A; Primi, Caterina
As there is a lack of evidence attesting to equivalent item functioning across genders for the instruments most commonly used to measure pathological gambling in adolescence, the present study aimed to test the gender invariance of the Gambling Behavior Scale for Adolescents (GBS-A), a new measurement tool to assess the severity of Gambling Disorder (GD) in adolescents. The equivalence of the items across genders was assessed by analyzing Differential Item Functioning within an Item Response Theory framework. The GBS-A was administered to 1,723 adolescents, and the graded response model was employed. The results attested the measurement equivalence of the GBS-A when administered to male and female adolescent gamblers. Overall, findings provided evidence that the GBS-A is an effective measurement tool of the severity of GD in male and female adolescents and that the scale was unbiased and able to reveal true gender differences. As such, the GBS-A can be profitably used in educational interventions and clinical treatments with young people.
Gierl, Mark J.; Lai, Hollis
Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates…
This research aims to develop a multiple-choice Web-based quiz-game-like formative assessment system, named GAM-WATA. The unique design of "Ask-Hint Strategy" turns the Web-based formative assessment into an online quiz game. "Ask-Hint Strategy" is composed of "Prune Strategy" and "Call-in Strategy".…
Gideon P. De Bruin
The factor analysis of items often produces spurious results in the sense that unidimensional scales appear multidimensional. This may be ascribed to failure in meeting the assumptions of linearity and normality on which factor analysis is based. Item response theory is explicitly designed for the modelling of the non-linear relations between ordinal variables and provides a strong alternative to the factor analysis of items. Items may also be combined in parcels that are more likely to satisfy the assumptions of factor analysis than do the items. The use of the Rasch rating scale model and the factor analysis of parcels is illustrated with data obtained with the Locus of Control Inventory. The results of these analyses are compared with the results obtained through the factor analysis of items. It is shown that the Rasch rating scale model and the factoring of parcels produce superior results to the factor analysis of items. Recommendations for the analysis of scales are made. Summary (translated from Afrikaans): The factor analysis of items often yields misleading results, especially in that unidimensional scales appear multidimensional. These results can often be attributed to a failure to meet the assumptions of linearity and normality on which factor analysis rests. Item response theory, which is explicitly designed for modelling the non-linear relations between ordinal items, offers an attractive alternative to the factor analysis of items. Items can also be grouped into parcels that are more likely to satisfy the assumptions of factor analysis than individual items. The use of the Rasch rating scale model and the factor analysis of parcels is demonstrated using data obtained with the Locus of Control Inventory. The results of these analyses are compared with the results obtained through a factor analysis of the individual items. The results indicate that the Rasch
Bentz, Amy Elizabeth
The process of formative assessment improves student understanding; however, the topic of formative assessment in preservice education has been severely neglected. Since a major goal of teacher education is to create reflective teaching professionals, preservice teachers should be provided an opportunity to critically reflect on the use of formative assessment in the classroom. Case method is an instructional methodology that allows learners to engage in and reflect on real-world situations. Case-based pedagogy can play an important role in enhancing preservice teachers' ability to reflect on teaching and learning by encouraging alternative ways of thinking about assessment. Although the literature on formative assessment and case methodology is extensive, the use of case method to explore the formative assessment process is, at best, sparse. The purpose of this study is to answer the following research questions: To what extent does the implementation of formative assessment cases in methods instruction influence preservice elementary science teachers' knowledge of formative assessment? What descriptive characteristics change between the preservice teachers' pre-case and post-case written reflection that would demonstrate learning had occurred? To investigate these questions, preservice teachers in an elementary methods course were asked to reflect on and discuss five cases. Pre/post-case data was analyzed. Results indicate that the preservice teachers modified their ideas to reflect the themes that were represented within the cases and modified their reflections to include specific ideas or examples taken directly from the case discussions. Comparing pre- and post-case reflections, the data supports a noted change in how the preservice teachers interpreted the case content. The preservice teachers began to evaluate the case content, question the lack of formative assessment concepts and strategies within the case, and apply formative assessment concepts and
Hong, Ickpyo; Velozo, Craig A; Li, Chih-Ying; Romero, Sergio; Gruber-Baldini, Ann L; Shulman, Lisa M
The aim of this study is to investigate the psychometrics of the Patient-Reported Outcomes Measurement Information System self-efficacy for managing daily activities item bank. The item pool was field tested on a sample of 1087 participants via internet (n = 250) and in-clinic (n = 837) surveys. All participants reported having at least one chronic health condition. The 35-item pool was investigated for dimensionality (confirmatory factor analysis, CFA, and exploratory factor analysis, EFA), item-total correlations, local independence, precision, and differential item functioning (DIF) across gender, race, ethnicity, age groups, data collection modes, and neurological chronic conditions (McFadden pseudo R² less than 10%). The item pool met two of the four CFA fit criteria (CFI = 0.952 and SRMR = 0.07). EFA found a dominant first factor (eigenvalue = 24.34), and the ratio of the first to the second eigenvalue was 12.4. The item pool demonstrated good item-total correlations (0.59-0.85) and acceptable internal consistency (Cronbach's alpha = 0.97). The item pool maintained its precision (reliability over 0.90) across a wide range of theta (3.70), and there was no significant DIF. The findings indicated the item pool has sound psychometric properties and the test items are eligible for development of computerized adaptive testing and short forms.
With reference to a questionnaire that aimed to assess the quality of life for dysarthric speakers, we investigate the usefulness of a model-based procedure for reducing the number of items. We propose a mixed cumulative logit model, which is known in the psychometrics literature as the graded response model: responses to different items are modelled as a function of individual latent traits and as a function of item characteristics, such as their difficulty and their discrimination power. We jointly model the discrimination and the difficulty parameters by using a k-component mixture of normal distributions. Mixture components correspond to disjoint groups of items. Items that belong to the same groups can be considered equivalent in terms of both difficulty and discrimination power. According to decision criteria, we select a subset of items such that the reduced questionnaire is able to provide the same information that the complete questionnaire provides. The model is estimated by using a Bayesian approach, and the choice of the number of mixture components is justified according to information criteria. We illustrate the proposed approach on the basis of data that are collected for 104 dysarthric patients by local health authorities in Lecce and in Milan. Copyright © 2014 John Wiley & Sons, Ltd.
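The cumulative logit structure described above can be illustrated with a minimal Python sketch. It assumes the common logistic form of the graded response model, with one discrimination parameter and ordered thresholds per item; the function name, parameter names, and example values are hypothetical, and the sketch does not include the mixture prior or the Bayesian estimation used in the study.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities for one item under a graded response model.

    theta          -- latent trait value of the respondent
    discrimination -- item discrimination (assumed positive)
    thresholds     -- increasing list of K-1 difficulty thresholds
    Returns a list of K probabilities, one per ordered response category.
    """
    # Cumulative probabilities P(X >= k | theta) for k = 1 .. K-1.
    cumulative = [
        1.0 / (1.0 + math.exp(-discrimination * (theta - b)))
        for b in thresholds
    ]
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0, then take differences.
    bounds = [1.0] + cumulative + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Illustrative values only: a four-category item answered by an average respondent.
probs = grm_category_probs(theta=0.0, discrimination=1.5, thresholds=[-1.0, 0.0, 1.0])
```

Two items with similar discrimination and thresholds give nearly identical category probabilities at every trait level, which is the sense in which items within one mixture component can be treated as interchangeable.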
Akkermans, Wies; Muraki, Eiji
For trinary partial credit items the shape of the item information and the item discrimination function is examined in relation to the item parameters. In particular, it is shown that these functions are unimodal if δ2 – δ1 < 4 ln 2 and bimodal otherwise. The locations and values of the maxima are
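The unimodality condition stated above can be checked numerically. The following Python sketch assumes the standard partial credit model with unit slope and scores 0, 1, 2, for which the item information equals the conditional variance of the item score; the function names and the grid settings are illustrative choices, not taken from the paper.

```python
import math


def pcm_information(theta, delta1, delta2):
    """Item information of a three-category partial credit item at theta."""
    # Log-weights of the categories 0, 1, 2 (unit slope assumed).
    logits = [0.0, theta - delta1, 2.0 * theta - delta1 - delta2]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(k * p for k, p in enumerate(probs))
    # Information = variance of the item score given theta.
    return sum(k * k * p for k, p in enumerate(probs)) - mean * mean


def count_modes(delta1, delta2, half_width=8.0, step=0.01):
    """Count strict local maxima of the information function on a grid."""
    n = int(round(half_width / step))
    centre = 0.5 * (delta1 + delta2)  # the function is symmetric about this point
    info = [pcm_information(centre + i * step, delta1, delta2) for i in range(-n, n + 1)]
    return sum(
        1
        for i in range(1, len(info) - 1)
        if info[i] > info[i - 1] and info[i] > info[i + 1]
    )


threshold = 4.0 * math.log(2.0)      # about 2.77
modes_below = count_modes(0.0, 2.0)  # delta2 - delta1 = 2, below the threshold
modes_above = count_modes(0.0, 4.0)  # delta2 - delta1 = 4, above the threshold
```

A grid search of this kind only illustrates the result for chosen parameter values; it is not a substitute for the analytical argument.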
Penfield, Randall D.; Myers, Nicholas D.; Wolfe, Edward W.
Measurement invariance in the partial credit model (PCM) can be conceptualized in several different but compatible ways. In this article the authors distinguish between three forms of measurement invariance in the PCM: step invariance, item invariance, and threshold invariance. Approaches for modeling these three forms of invariance are proposed,…
Boser, Judith A.; And Others
Different formats for four types of research items were studied for ease of computer data entry. The types were: (1) numeric response items; (2) individual multiple choice items; (3) multiple choice items with the same response items; and (4) card column indicator placement. Each of the 13 experienced staff members of a major university's Data…
Li, Xin; Xia, Bao-long; Li, Wei; Zhou, Qi
Pluripotent stem cells can be evaluated by pluripotency marker expression, embryoid body aggregation, teratoma formation, chimera contribution and, beyond these, tetraploid complementation. Whether iPS cells in general are functionally equivalent to normal ESCs is difficult to establish. Here, we present the detailed procedure for chimera formation and tetraploid complementation, the most stringent criterion for assessing pluripotency.
Martinková, Patrícia; Drabinová, Adéla; Liaw, Y.L.; Sanders, E.A.; McFarland, J.L.; Price, R.M.
Vol. 16, No. 2 (2017), article no. rm2. ISSN 1931-7913. R&D Projects: GA ČR GJ15-15856Y. Grant - others: NSF (US) DUE-1043443. Institutional support: RVO:67985807. Keywords: differential item functioning; fairness; conceptual assessments; concept inventory; undergraduate education; bias. Subject RIV: AM - Education. OECD field: Education, special (to gifted persons, those with learning disabilities). Impact factor: 3.930 (2016).
Hadley, Lindsay; Black, David; Welch, Jan; Reynolds, Peter; Penlington, Clare
Clinical leadership is considered essential for maintaining and improving patient care and safety in the UK, and is incorporated in the curriculum for all trainee doctors. Despite the growing focus on the importance of leadership, and the introduction of the Medical Leadership Competency Framework (MLCF) in the UK, leadership education for doctors in training is still in its infancy. Assessment is focused on clinical skills, and trainee doctors receive very little formal feedback on their leadership competencies. In this article we describe the approach taken by Health Education Kent, Sussex and Surrey (HEKSS) to raise the profile of leadership amongst doctors in training in the South Thames Foundation School (STFS). An annual structured formative assessment in leadership for each trainee has been introduced, supported by leadership education for both trainees and their supervisors in HEKSS trusts. We analysed over 500 of these assessments from the academic year 2012/13 for foundation doctors in HEKSS trusts, in order to assess the quality of the feedback. From the analysis, potential indicators of more effective formative assessments were identified. These may be helpful in improving the leadership education programme for future years. There is a wealth of evidence to highlight the importance and value of formative assessments; however, particularly for foundation doctors, these have typically been focused on assessing clinical capabilities. This HEKSS initiative encourages doctors to recognise leadership opportunities at the beginning of their careers, seeks to help them understand the importance of acquiring leadership skills and provides structured feedback to help them improve. Leadership education for doctors in training is still in its infancy. © 2015 John Wiley & Sons Ltd.
Chauncey, Penny Denyse
No Child Left Behind mandates utilizing summative assessment to measure schools' effectiveness. The problem is that summative assessment measures students' knowledge without depth of understanding. The goal of public education, however, is to prepare students to think critically at higher levels. The purpose of this study was to examine any difference between formative assessment incorporated in instruction as opposed to the usual, more summative methods in terms of attitudes and academic achievement of middle-school science students. Maslow's theory emphasizes that individuals must have basic needs met before they can advance to higher levels. Formative assessment enables students to master one level at a time. The research questions focused on whether statistically significant differences existed between classrooms using these two types of assessments on academic tests and an attitude survey. Using a quantitative quasi-experimental control-group design, data were obtained from a sample of 430 middle-school science students in 6 classes. One control and 2 experimental classes were assigned to each teacher. Results of the independent t tests revealed academic achievement was significantly greater for groups that utilized formative assessment. No significant difference in attitudes was noted. Recommendations include incorporating formative assessment results with the summative results. Findings from this study could contribute to positive social change by prompting educational stakeholders to examine local and state policies on curriculum as well as funding based on summative scores alone. Use of formative assessment can lead to improved academic success.
Toepoel, V.; Das, J.W.M.; van Soest, A.H.O.
This paper presents results from an experimental manipulation of one versus multiple-items per screen format in a Web survey. The purpose of the experiment was to find out if a questionnaire's format influences how respondents provide answers in online questionnaires and if this is depending on
Harris, C. J.; Penuel, W. R.; Haydel Debarger, A.; Blank, J. G.
An important purpose of formative assessment is to elicit student thinking to use in instruction to help all students learn and inform next steps in teaching. However, formative assessment practices are difficult to implement and thus present a formidable challenge for many science teachers. A critical need in geoscience education is a framework for providing teachers with real-time assessment tools as well as professional development to learn how to use formative assessment to improve instruction. Here, we describe a comprehensive support system, developed for our NSF-funded Contingent Pedagogies project, for addressing the challenge of helping teachers to use formative assessment to enhance student learning in middle school Earth Systems science. Our support system is designed to improve student understanding about the geosphere by integrating classroom network technology, interactive formative assessments, and contingent curricular activities to guide teachers from formative assessment to instructional decision-making and improved student learning. To accomplish this, we are using a new classroom network technology, Group Scribbles, in the context of an innovative middle-grades Earth Science curriculum called Investigating Earth Systems (IES). Group Scribbles, developed at SRI International, is a collaborative software tool that allows individual students to compose “scribbles” (i.e., drawings and notes), on “post-it” notes in a private workspace (a notebook computer) in response to a public task. They can post these notes anonymously to a shared, public workspace (a teacher-controlled large screen monitor) that becomes the centerpiece of group and class discussion. To help teachers implement formative assessment practices, we have introduced a key resource, called a teaching routine, to help teachers take advantage of Group Scribbles for more interactive assessments. Routine refers to a sequence of repeatable interactions that, over time, become
This paper reviews the literature about item response models for the subject level and aggregated level (group level). Group-level item response models (IRMs) are used in the United States in large-scale assessment programs such as the National Assessment of Educational Progress and the California Assessment Program. In the Netherlands, these…
Brodey, Benjamin B; Gonzalez, Nicole L; Elkin, Kathryn Ann; Sasiela, W Jordan; Brodey, Inger S
The computerized administration of self-report psychiatric diagnostic and outcomes assessments has risen in popularity. If results are similar enough across different administration modalities, then new administration technologies can be used interchangeably and the choice of technology can be based on other factors, such as convenience in the study design. An assessment based on item response theory (IRT), such as the Patient-Reported Outcomes Measurement Information System (PROMIS) depression item bank, offers new possibilities for assessing the effect of technology choice upon results. This study aimed to create equivalent halves of the PROMIS depression item bank and to use these halves to compare survey responses and user satisfaction among administration modalities (paper, mobile phone, or tablet) in a community mental health care population. The 28 PROMIS depression items were divided into 2 halves based on content and simulations with an established PROMIS response data set. A total of 129 participants were recruited from an outpatient public sector mental health clinic based in Memphis. All participants took both nonoverlapping halves of the PROMIS IRT-based depression items (Part A and Part B): once using paper and pencil, and once using either a mobile phone or tablet. An 8-cell randomization was done on technology used, order of technologies used, and order of PROMIS Parts A and B. Both Parts A and B were administered as fixed-length assessments and both were scored using published PROMIS IRT parameters and algorithms. All 129 participants received either Part A or B via paper assessment. Participants were also administered the opposite assessment, 63 using a mobile phone and 66 using a tablet. There was no significant difference in item response scores for Part A versus B. All 3 of the technologies yielded essentially identical assessment results and equivalent satisfaction levels. Our findings show that the PROMIS depression assessment can be divided into 2 equivalent
Thamsborg, Lise Laurberg Holst; Petersen, Morten Aa; Aaronson, Neil K
to 12 lack of appetite items. CONCLUSIONS: Phases 1-3 resulted in 12 lack of appetite candidate items. Based on a field testing (phase 4), the psychometric characteristics of the items will be assessed and the final item bank will be generated. This CAT item bank is expected to provide precise...
Liu, Chen-Wei; Wang, Wen-Chung
Examinee-selected item (ESI) design, in which examinees are required to respond to a fixed number of items in a given set, always yields incomplete data (i.e., when only the selected items are answered, data are missing for the others) that are likely non-ignorable in likelihood inference. Standard item response theory (IRT) models become infeasible when ESI data are missing not at random (MNAR). To solve this problem, the authors propose a two-dimensional IRT model that posits one unidimensional IRT model for observed data and another for nominal selection patterns. The two latent variables are assumed to follow a bivariate normal distribution. In this study, the mirt freeware package was adopted to estimate parameters. The authors conduct an experiment to demonstrate that ESI data are often non-ignorable and to determine how to apply the new model to the data collected. Two follow-up simulation studies are conducted to assess the parameter recovery of the new model and the consequences for parameter estimation of ignoring MNAR data. The results of the two simulation studies indicate good parameter recovery of the new model and poor parameter recovery when non-ignorable missing data were mistakenly treated as ignorable. © 2017 The British Psychological Society.
Panoz-Brown, Danielle; Corbin, Hannah E; Dalecki, Stefan J; Gentry, Meredith; Brotheridge, Sydney; Sluka, Christina M; Wu, Jie-En; Crystal, Jonathon D
Vivid episodic memories in people have been characterized as the replay of unique events in sequential order [1-3]. Animal models of episodic memory have successfully documented episodic memory of a single event (e.g., [4-8]). However, a fundamental feature of episodic memory in people is that it involves multiple events, and notably, episodic memory impairments in human diseases are not limited to a single event. Critically, it is not known whether animals remember many unique events using episodic memory. Here, we show that rats remember many unique events and the contexts in which the events occurred using episodic memory. We used an olfactory memory assessment in which new (but not old) odors were rewarded using 32 items. Rats were presented with 16 odors in one context and the same odors in a second context. To attain high accuracy, the rats needed to remember item in context because each odor was rewarded as a new item in each context. The demands on item-in-context memory were varied by assessing memory with 2, 3, 5, or 15 unpredictable transitions between contexts, and item-in-context memory survived a 45 min retention interval challenge. When the memory of item in context was put in conflict with non-episodic familiarity cues, rats relied on item in context using episodic memory. Our findings suggest that rats remember multiple unique events and the contexts in which these events occurred using episodic memory and support the view that rats may be used to model fundamental aspects of human cognition. Copyright © 2016 Elsevier Ltd. All rights reserved.
Clarence D. Kreiter
Objectives: Insufficient attention has been given to how information from computer-based clinical case simulations is presented, collected, and scored. Research is needed on how best to design such simulations to acquire valid performance assessment data that can act as useful feedback for educational applications. This report describes a study of a new simulation format with design features aimed at improving both its formative assessment feedback and educational function. Methods: Case simulation software (LabCAPS) was developed to target a highly focused and well-defined measurement goal with a response format that allowed objective scoring. Data from an eight-case computer-based performance assessment administered in a pilot study to 13 second-year medical students were analyzed using classical test theory and generalizability analysis. In addition, a similar analysis was conducted on an administration in a less controlled setting, but to a much larger sample (n=143), within a clinical course that utilized two random case subsets from a library of 18 cases. Results: Classical test theory case-level item analysis of the pilot assessment yielded an average case discrimination of 0.37, and all eight cases were positively discriminating (range=0.11–0.56). Classical test theory coefficient alpha and the decision study showed the eight-case performance assessment to have an observed reliability of α=G=0.70. The decision study further demonstrated that a G=0.80 could be attained with approximately 3 h and 15 min of testing. The less-controlled educational application within a large medical class produced a somewhat lower reliability for eight cases (G=0.53). Students gave high ratings to the logic of the simulation interface, its educational value, and to the fidelity of the tasks. Conclusions: LabCAPS software shows the potential to provide formative assessment of medical students’ skill at diagnostic test ordering and to provide valid feedback to
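The decision-study projection reported above can be approximated with a short Python sketch. It assumes a single-facet persons-by-cases design in which only the number of cases changes, so that the generalizability coefficient follows the Spearman-Brown form; the abstract does not state the design actually used, and the function names are hypothetical. The testing time per case is not given in the abstract, so the sketch works in numbers of cases rather than hours.

```python
def single_case_coefficient(g_observed, n_observed):
    """Back out the coefficient for one case from the observed n-case coefficient."""
    return g_observed / (n_observed - (n_observed - 1) * g_observed)


def project_coefficient(g_observed, n_observed, n_new):
    """Projected coefficient if the assessment used n_new cases instead."""
    g_one = single_case_coefficient(g_observed, n_observed)
    return n_new * g_one / (1.0 + (n_new - 1) * g_one)


def cases_needed(g_observed, n_observed, g_target):
    """Number of cases (possibly fractional) needed to reach g_target."""
    g_one = single_case_coefficient(g_observed, n_observed)
    return g_target * (1.0 - g_one) / (g_one * (1.0 - g_target))


# Values from the pilot study: G = 0.70 with eight cases, target G = 0.80.
n_for_target = cases_needed(0.70, 8, 0.80)
```

Under these assumptions the target of G = 0.80 corresponds to between 13 and 14 cases; converting that to testing time would require the average time per case, which is an assumption the reader would have to supply.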
Saint-Maurice, Pedro F; Welk, Gregory J; Bartee, R Todd; Heelan, Kate
This study tests calibration models to re-scale context-specific physical activity (PA) items to accelerometer-derived PA. A total of 195 children in grades 4-12 wore an Actigraph monitor and completed the Physical Activity Questionnaire (PAQ) one week later. The relative time spent in moderate-to-vigorous PA (MVPA%) obtained from the Actigraph at recess, PE, lunch, after-school, evening and weekend was matched with the respective item score obtained from the PAQ. Item scores from 145 participants were calibrated against objective MVPA% using multiple linear regression with age and sex as additional predictors. Predicted minutes of MVPA for school, out-of-school and total week were tested in the remaining sample (n = 50) using equivalence testing. The results showed that PAQ β-weights ranged from 0.06 (lunch) to 4.94 (PE) MVPA% (P PAQ and accelerometer MVPA at school and out-of-school ranged from -15.6 to +3.8 min and the PAQ was within 10-15% of accelerometer-measured activity. This study demonstrated that context-specific items can be calibrated to predict minutes of MVPA in groups of youth during in- and out-of-school periods.
Based on theories of assessment as well as on the pedagogical and administrative advantages Computer Assisted Assessment (CAA) has to offer in foreign language learning, the study presented in this paper examines how computers can facilitate the formative assessment of EFL learners and enhance their feeling of responsibility towards monitoring their progress. The subjects of the study were twenty-five 14-year-old students attending the third class of a State Gymnasium in Greece. The instruments utilized were questionnaires on motivation and learning styles, three quizzes designed with the software Hot Potatoes, a self-assessment questionnaire and an evaluation questionnaire showing the subjects’ attitudes towards the experience of using computers for assessment purposes. After reviewing formative assessment, CAA and how these two can be combined, the paper focuses on the description of the three class quizzes used in the study. Information from the questionnaires filled in by students, combined with the results of the quizzes, shows how computers can be used to provide the continuous, ongoing measurement of students’ progress needed for formative assessment. The results are also used to show how students and teachers can benefit from formative CAA and the extent to which this kind of assessment could be applicable in the Greek state school reality.
Tsubakita, Takashi; Kawazoe, Nobuo; Kasano, Eri
Health literacy predicts health outcomes. Despite concerns surrounding the health of Japanese young adults, to date there has been no objective assessment of health literacy in this population. This study aimed to develop a Functional Health Literacy Scale for Young Adults (funHLS-YA) based on item response theory. Each item in the scale requires participants to choose the most relevant term from 3 choices in relation to a target item, thus assessing objective rather than perceived health literacy. The 20-item scale was administered to 1816 university students and 1751 responded. Cronbach's α coefficient was .73. Difficulty and discrimination parameters of each item were estimated, resulting in the exclusion of 1 item. Some items showed different difficulty parameters for male and female participants, reflecting that some aspects of health literacy may differ by gender. The current 19-item version of funHLS-YA can reliably assess the objective health literacy of Japanese young adults.
Draaijer, S.; Hartog, R.J.M.
A set of design patterns for digital item types has been developed in response to challenges identified in various projects by teachers in higher education. The goal of the projects in question was to design and develop formative and summative tests, and to develop interactive learning material in
Whitelock, Denise M.
e-Assessment is being advocated in the UK as our way of introducing a more personalised learning agenda throughout the Higher Education sector. This paper discusses the findings from two projects where formative e-assessment has contributed to students taking more control of their own learning. One study set out to provide further insights into the role of electronic formative assessment and to point the way forward to new assessment practices, capitalising on a range of open source tools. Th...
Individuals with knee impairments identify items in need of clarification in the Patient Reported Outcomes Measurement Information System (PROMIS®) pain interference and physical function item banks - a qualitative study.
Lynch, Andrew D; Dodds, Nathan E; Yu, Lan; Pilkonis, Paul A; Irrgang, James J
The content and wording of the Patient Reported Outcome Measurement Information System (PROMIS) Physical Function and Pain Interference item banks have not been qualitatively assessed by individuals with knee joint impairments. The purpose of this investigation was to identify items in the PROMIS Physical Function and Pain Interference Item Banks that are irrelevant, unclear, or otherwise difficult to respond to for individuals with impairment of the knee and to suggest modifications based on cognitive interviews. Twenty-nine individuals with knee joint impairments qualitatively assessed items in the Pain Interference and Physical Function Item Banks in a mixed-methods cognitive interview. Field notes were analyzed to identify themes and frequency counts were calculated to identify items not relevant to individuals with knee joint impairments. Issues with clarity were identified in 23 items in the Physical Function Item Bank, resulting in the creation of 43 new or modified items, typically changing words within the item to be clearer. Interpretation issues included whether or not the knee joint played a significant role in overall health and age/gender differences in items. One quarter of the original items (31 of 124) in the Physical Function Item Bank were identified as irrelevant to the knee joint. All 41 items in the Pain Interference Item Bank were identified as clear, although individuals without significant pain substituted other symptoms which interfered with their life. The Physical Function Item Bank would benefit from additional items that are relevant to individuals with knee joint impairments and, by extension, to other lower extremity impairments. Several issues in clarity were identified that are likely to be present in other patient cohorts as well.
Jonasen, Tanja Svarre; Lunn, Tine Bieber; Helle, Tina
Background: The aim of this paper is to provide the reader with an overall impression of the stepwise user-centred design approach including the specific methods used and lessons learned when transforming paper-based assessment forms into a prototype app, taking the Housing Enabler as an example... Results: The design iterations resulted in the development of a Housing Enabler prototype app. The prototype app has several features and options that are new compared with the original paper-based Housing Enabler assessment form. These new features include a user friendly overview of the assessment form...; easy navigation by swiping back and forth between items; onsite data analysis; and ranking of the accessibility score, photo documentation and a data export facility. Conclusion: Based on the presented stepwise approach, a high-fidelity Housing Enabler prototype app was successfully developed...
This paper deals with the right to learn in school-type education and considers assessment as an assurance of teaching and learning quality. It deals with the current evaluation processes and discriminatory misconceptions of merely summative assessments, which tend to qualify students. This text evaluates the punitive bias of meritocratic grading of learning and argues that only formative assessment can ensure the right to learn.
Herman, Joan; Osmundson, Ellen; Dai, Yunyun; Ringstaff, Cathy; Timms, Michael
This exploratory study of elementary school science examines questions central to policy, practice and research on formative assessment: What is the quality of teachers' content-pedagogical and assessment knowledge? What is the relationship between teacher knowledge and assessment practice? What is the relationship between teacher knowledge,…
Rijmen, Frank; Jeon, Minjeong; von Davier, Matthias; Rabe-Hesketh, Sophia
Second-order item response theory models have been used for assessments consisting of several domains, such as content areas. We extend the second-order model to a third-order model for assessments that include subdomains nested in domains. Using a graphical model framework, it is shown how the model does not suffer from the curse of…
von Davier, Matthias
Utilizing technology for automated item generation is not a new idea. However, test items used in commercial testing programs or in research are still predominantly written by humans, in most cases by content experts or professional item writers. Human experts are a limited resource and testing agencies incur high costs in the process of continuous renewal of item banks to sustain testing programs. Using algorithms instead holds the promise of providing unlimited resources for this crucial part of assessment development. The approach presented here deviates in several ways from previous attempts to solve this problem. In the past, automatic item generation relied either on generating clones of narrowly defined item types such as those found in language-free intelligence tests (e.g., Raven's progressive matrices) or on an extensive analysis of task components and derivation of schemata to produce items with pre-specified variability that are hoped to have predictable levels of difficulty. It is somewhat unlikely that researchers utilizing these previous approaches would look at the proposed approach with favor; however, recent applications of machine learning show success in solving tasks that seemed impossible for machines not too long ago. The proposed approach uses deep learning to implement probabilistic language models, not unlike what Google Brain and Amazon Alexa use for language processing and generation.
Kopittke, Peter M.; Wehr, J. Bernhard; Menzies, Neal W.
Soil science students are required to apply knowledge from a range of disciplines to unfamiliar scenarios to solve complex problems. To encourage deep learning (with student performance an indicator of learning), a formative assessment exercise was introduced to a second-year soil science subject. For the formative assessment exercise, students…
Introduction. Clinical clerkships, typically situated in environments lacking educational structure, form the backbone of undergraduate medical training. The imperative to develop strategies that enhance learning in this context is apparent. This study explored the impact of longitudinal bedside formative assessment on ...
Falk, Carl F.; Cai, Li
We present a logistic function of a monotonic polynomial with a lower asymptote, allowing additional flexibility beyond the three-parameter logistic model. We develop a maximum marginal likelihood-based approach to estimate the item parameters. The new item response model is demonstrated on math assessment data from a state, and a computationally…
Fukuhara, Hirotaka; Kamata, Akihito
A differential item functioning (DIF) detection method for testlet-based data was proposed and evaluated in this study. The proposed DIF model is an extension of a bifactor multidimensional item response theory (MIRT) model for testlets. Unlike traditional item response theory (IRT) DIF models, the proposed model takes testlet effects into…
Prins, Frans; Sluijsmans, Dominique; Kirschner, Paul A.; Strijbos, Jan Willem
In this case study our aim was to gain more insight into the possibilities of qualitative formative peer assessment in a computer supported collaborative learning (CSCL) environment. An approach was chosen in which peer assessment was operationalized in assessment assignments and assessment tools that
Petersen, Morten Aa; Groenvold, Mogens; Bjorner, Jakob B.; Aaronson, Neil; Conroy, Thierry; Cull, Ann; Fayers, Peter; Hjermstad, Marianne; Sprangers, Mirjam; Sullivan, Marianne
In cross-national comparisons based on questionnaires, accurate translations are necessary to obtain valid results. Differential item functioning (DIF) analysis can be used to test whether translations of items in multi-item scales are equivalent to the original. In data from 10,815 respondents
Marfeo, Elizabeth E; Ni, Pengsheng; Haley, Stephen M; Jette, Alan M; Bogusz, Kara; Meterko, Mark; McDonough, Christine M; Chan, Leighton; Brandt, Diane E; Rasch, Elizabeth K
To develop a broad set of claimant-reported items to assess behavioral health functioning relevant to the Social Security disability determination processes, and to evaluate the underlying structure of behavioral health functioning for use in development of a new functional assessment instrument. Cross-sectional. Community. Item pools of behavioral health functioning were developed, refined, and field tested in a sample of persons applying for Social Security disability benefits (N=1015) who reported difficulties working because of mental or both mental and physical conditions. None. Social Security Administration Behavioral Health (SSA-BH) measurement instrument. Confirmatory factor analysis (CFA) specified that a 4-factor model (self-efficacy, mood and emotions, behavioral control, social interactions) had the optimal fit with the data and was also consistent with our hypothesized conceptual framework for characterizing behavioral health functioning. When the items within each of the 4 scales were tested in CFA, the fit statistics indicated adequate support for characterizing behavioral health as a unidimensional construct along these 4 distinct scales of function. This work represents a significant advance both conceptually and psychometrically in assessment methodologies for work-related behavioral health. The measurement of behavioral health functioning relevant to the context of work requires the assessment of multiple dimensions of behavioral health functioning. Specifically, we identified a 4-factor model solution that represented key domains of work-related behavioral health functioning. These results guided the development and scale formation of a new SSA-BH instrument. Copyright © 2013 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.
Pradana, O. R. Y.; Sujadi, I.; Pramudya, I.
Geometry draws on abstract thinking ability, so few students understand the material well. The learning model therefore plays a crucial role in improving student achievement, and a poorly suited model causes difficulties for students. This study provides a quantitative account of the Think Pair Share learning model combined with formative assessment, tested on junior high school students. The research uses a quantitative pretest-posttest design with a control group and an experimental group. ANOVA and the Scheffé test were used to analyse the effectiveness of this learning model. The findings show that student achievement in geometry increased significantly with Think Pair Share using formative assessment, probably because this approach makes students more active during learning. The authors hope that Think Pair Share with formative assessment will prove useful to teachers and will be applied by teachers around the world, especially for geometry.
Research purpose: This study assesses the Differential Item Functioning (DIF) of the Utrecht Work Engagement Scale (UWES-17) for different South African cultural groups in a South African company. Motivation for the study: Organisations are using the UWES-17 more and more in South Africa to assess work engagement. Therefore, research evidence from psychologists or assessment practitioners on its DIF across different cultural groups is necessary. Research design, approach and method: The researchers conducted a Secondary Data Analysis (SDA) on the UWES-17 sample (n = 2429) that they obtained from a cross-sectional survey undertaken in a South African Information and Communication Technology (ICT) sector company (n = 24 134). Quantitative item data on the UWES-17 scale enabled the authors to address the research question. Main findings: The researchers found uniform and/or non-uniform DIF on five of the vigour items, four of the dedication items and two of the absorption items. This also showed possible Differential Test Functioning (DTF) on the vigour and dedication dimensions. Practical/managerial implications: Based on the DIF, the researchers suggested that organisations should not use the UWES-17 comparatively for different cultural groups or employment decisions in South Africa. Contribution/value add: The study provides evidence on DIF and possible DTF for the UWES-17. However, it also raises questions about possible interaction effects that need further investigation.
Assessment forms an important part of instruction. Assessment that aims to support learning is known as formative assessment, and it contributes to students' learning gains and motivation. However, teachers rarely use assessment formatively to aid their students' learning. Thus reviewing the factors that limit or support teachers' practices of…
Assessment integrates the teaching and learning process and always has room for discussion in educational processes, requiring technical preparation and observation capacity from those involved. According to Perrenoud (2014), assessment for learning is a mediator in the process of curriculum construction and is closely related to the management of learning by the students. Assessment methods occupy a very important space in pedagogical practices, since assessment cannot be an act that expresses only a quantitative and formal concept. In Distance Education (DE), formative assessment also needs to be prioritized, avoiding traditional evaluation performed through multiple-choice tests with self-correction. The use of diaries in Distance Education maintains the focus on the evaluation process and not only on the product, serving as a permanent orientation of learning for both the teacher and the student, who jointly assume reciprocal commitments. This article presents an experiment conducted with diaries on an undergraduate course offered by Universidade Aberta do Brasil (UAB) as a means of formative assessment in Distance Education.
.... IOD evaluates the impacts of nonavailability of secondary items on the life cycle supportability of AMCOM weapon systems and evaluates the producibility of secondary items for war reserve requirements...
Purcell, Bernice M.
Formative assessment is considered to be an evaluation technique that informs the instructor of the level of student learning, giving evidence when it may be necessary for the instructor to make a change in delivery based upon the results. Several theories of formative assessment exist, all which propound the importance of feedback to the student.…
This paper describes the steps taken to eliminate two of the items in a Test of Figural Analogies (TFA). The main guidelines of psychometric analysis concerning Classical Test Theory (CTT) and Item Response Theory (IRT) are explained. The item elimination process was based on both the study of the CTT difficulty and discrimination index, and the unidimensionality analysis. The a, b, and c parameters of the Three Parameter Logistic Model of IRT were also considered for this purpose, as well as the assessment of each item fitting this model. The unfavourable characteristics of a group of TFA items are detailed, and decisions leading to their possible elimination are discussed.
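The quantities named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names and the toy data are assumptions made for illustration, the discrimination index is taken to be the point-biserial correlation between item score and total score (one common choice), and the 3PL curve is written without the optional scaling constant D.

```python
import math

def three_pl(theta, a, b, c):
    """3PL probability of a correct response.
    a: discrimination, b: difficulty, c: lower asymptote (pseudo-guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def ctt_difficulty(item_scores):
    """CTT difficulty index: proportion of test-takers answering correctly (1)."""
    return sum(item_scores) / len(item_scores)

def ctt_discrimination(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total score."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((x - mean_i) * (t - mean_t) for x, t in zip(item_scores, total_scores))
    var_i = sum((x - mean_i) ** 2 for x in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / math.sqrt(var_i * var_t)

# Hypothetical toy data: four test-takers, one item, and their total scores.
item = [0, 0, 1, 1]
totals = [1, 2, 3, 4]
difficulty = ctt_difficulty(item)
discrimination = ctt_discrimination(item, totals)
p_correct = three_pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```

An item with a very low or negative discrimination value, or a 3PL lower asymptote far above the chance level, would be a typical candidate for the kind of review the abstract describes.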
Khorramdel, Lale; Kubinger, Klaus D; Uitz, Alexander
An experiment was conducted to investigate the effects of item order and questionnaire content on faking good or intentional response distortion. It was hypothesized that intentional response distortion would either increase towards the end of a long questionnaire, as learning effects might make it easier to adjust responses to a faking good schema, or decrease because applicants' will to distort responses is reduced if the questionnaire lasts long enough. Furthermore, it was hypothesized that certain types of questionnaire content are especially vulnerable to response distortion. Eighty-four pre-selected pilot applicants filled out a questionnaire consisting of 516 items including items from the NEO five factor inventory (NEO FFI), NEO personality inventory revised (NEO PI-R) and business-focused inventory of personality (BIP). The positions of the items were varied within the applicant sample to test if responses are affected by item order, and applicants' response behaviour was additionally compared to that of volunteers. Applicants reported significantly higher mean scores than volunteers, and results provide some evidence of decreased faking tendencies towards the end of the questionnaire. Furthermore, it could be demonstrated that lower variances or standard deviations in combination with appropriate (often higher) mean scores can serve as an indicator for faking tendencies in group comparisons, even if effects are not significant. © 2013 International Union of Psychological Science.
Thomas A. Stewart
This study supports the work of Black and Wiliam (1998), who demonstrated that when teachers effectively utilize formative assessment strategies, student learning increases significantly. However, the researchers also found a “poverty of practice” among teachers, in that few fully understood how to implement classroom formative assessment. This qualitative case study examined a series of voluntary workshops offered at one middle school designed to address this poverty of practice. Data were gathered via semi-structured interviews. These research questions framed the study: (1) What role did a professional learning community structure play in shaping workshop participants’ perceived effectiveness of a voluntary formative assessment initiative? (2) How did this initiative affect workshop participants’ perceptions of their knowledge of formative assessment and differentiation strategies? (3) How did it affect workshop participants’ perceptions of their abilities to teach others about formative assessment and differentiated instruction? (4) How did it affect school-wide use of classroom-level strategies? Results indicated that teacher workshop participants experienced a growth in their capacity to use and teach others various formative assessment strategies, and even non-participating teachers reported greater use of formative assessment in their own instruction. Workshop participants and non-participating teachers perceived little growth in the area of differentiation of instruction, which contradicted some administrator perceptions.
...) The amount of the total bill assessed as a franchise fee and the identity of the franchising authority... fees and costs itemized pursuant to this section. (c) Local franchising authorities may adopt...
Ried, L Douglas
Collaboration and implementation of a minimum, standardized set of core global educational and professional competencies seems appropriate given the expanding international evolution of pharmacy practice. However, winnowing down hundreds of competencies from a plethora of local, national and international competency frameworks to select the most highly preferred to be included in the core set is a daunting task. The objective of this paper is to describe a combination of strategies used to ascertain the most highly preferred items among a large number of disparate items. In this case, the items were >100 educational and professional competencies that might be incorporated as the core components of new and existing competency frameworks. Panelists (n = 30) from the European Union (EU) and United States (USA) were chosen to reflect a variety of practice settings. Each panelist completed two electronic surveys. The first survey presented competencies in a Likert-type format and the second survey presented many of the same competencies in an ipsative/forced choice format. Item mean scores were calculated for each competency, the competencies were ranked, and non-parametric statistical tests were used to ascertain the consistency in the rankings achieved by the two strategies. This exploratory study presented over 100 competencies to the panelists in the beginning. The two methods provided similar results, as indicated by the significant correlation between the rankings (Spearman's rho = 0.30, P < 0.09). A two-step strategy using Likert-type and ipsative/forced choice formats in sequence, appears to be useful in a situation where a clear preference is required from among a large number of choices. The ipsative/forced choice format resulted in some differences in the competency preferences because the panelists could not rate them equally by design. While this strategy was used for the selection of professional educational competencies in this exploratory study, it is
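The consistency check described above, comparing two rankings of the same competencies, can be sketched as a Spearman rank correlation. This is only an illustration under stated assumptions: the helper names and the scores are hypothetical, no attempt is made to reproduce the reported rho, and ties receive average ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical mean scores for five competencies under the two survey formats.
likert_means = [4.6, 4.1, 3.9, 3.2, 2.8]
forced_choice_means = [0.9, 0.7, 0.8, 0.3, 0.4]
rho = spearman_rho(likert_means, forced_choice_means)
```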
Emons, Wilco H M; Meijer, Rob R; Denollet, Johan
Individuals with increased levels of both negative affectivity (NA) and social inhibition (SI)-referred to as type-D personality-are at increased risk of adverse cardiac events. We used item response theory (IRT) to evaluate NA, SI, and type-D personality as measured by the DS14. The objectives of this study were (a) to evaluate the relative contribution of individual items to the measurement precision at the cutoff to distinguish type-D from non-type-D personality and (b) to investigate the comparability of NA, SI, and type-D constructs across the general population and clinical populations. Data from representative samples including 1316 respondents from the general population, 427 respondents diagnosed with coronary heart disease, and 732 persons suffering from hypertension were analyzed using the graded response IRT model. In Study 1, the information functions obtained in the IRT analysis showed that (a) all items had highest measurement precision around the cutoff and (b) items are most informative at the higher end of the scale. In Study 2, the IRT analysis showed that measurements were fairly comparable across the general population and clinical populations. The DS14 adequately measures NA and SI, with highest reliability in the trait range around the cutoff. The DS14 is a valid instrument to assess and compare type-D personality across clinical groups.
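As a rough illustration of the graded response model mentioned above, the following sketch computes category probabilities for one polytomous item. The function name and parameter values are assumptions chosen for the example, not estimates from the DS14 data.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.
    thresholds: increasing b_1 < ... < b_m, giving m + 1 ordered categories.
    P(X >= k) is a logistic curve in theta; P(X = k) is the difference
    between adjacent cumulative curves."""
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

# Hypothetical five-category item (four thresholds) evaluated at theta = 0.5.
probs = grm_category_probs(theta=0.5, a=1.5, thresholds=[-1.0, 0.0, 1.0, 2.0])
```

Item information functions such as those discussed in the abstract are derived from these category curves; an item is most informative in the trait range where its thresholds lie.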
Laura Kelly, Crispin Jenkinson, Sarah Dummett, Jill Dawson, Ray Fitzpatrick, David Morley
Health Services Research Unit, Nuffield Department of Population Health, University of Oxford, Oxford, UK. Purpose: The Oxford Participation and Activities Questionnaire is a patient-reported outcome measure in development that is grounded on the World Health Organization International Classification of Functioning, Disability, and Health (ICF). The study reported here aimed to inform and generate an item pool for the new measure, which is specifically designed for the assessment of participation and activity in patients experiencing a range of health conditions. Methods: Items were informed through in-depth interviews conducted with 37 participants spanning a range of conditions. Interviews aimed to identify how their condition impacted their ability to participate in meaningful activities. Conditions included arthritis, cancer, chronic back pain, diabetes, motor neuron disease, multiple sclerosis, Parkinson's disease, and spinal cord injury. Transcripts were analyzed using the framework method. Statements relating to ICF themes were recast as questionnaire items and shown for review to an expert panel. Cognitive debrief interviews (n=13) were used to assess items for face and content validity. Results: ICF themes relevant to activities and participation in everyday life were explored, and a total of 222 items formed the initial item pool. This item pool was refined by the research team and 28 generic items were mapped onto all nine chapters of the ICF construct, detailing activity and participation. Cognitive interviewing confirmed the questionnaire instructions, items, and response options were acceptable to participants. Conclusion: Using a clear conceptual basis to inform item generation, 28 items have been identified as suitable to undergo further psychometric testing. A large-scale postal survey will follow in order to refine the instrument further and
Offerdahl, Erika G; Montplaisir, Lisa
Formative assessment has long been identified as a critical element to teaching for conceptual development in science. It is therefore important for university instructors to have an arsenal of formative assessment tools at their disposal which enable them to effectively uncover and diagnose all students' thinking, not just the most vocal or assertive. We illustrate the utility of one type of formative assessment prompt (reading question assignment) in producing high-quality evidence of student thinking (student-generated reading questions). Specifically, we characterized student assessment data using three distinct analytic frames to exemplify their effectiveness in diagnosing student learning in relationship to three sample learning outcomes. Our data will be useful for university faculty, particularly those engaged in teaching upper-level biochemistry courses and their prerequisites, as they provide an alternative mechanism for uncovering and diagnosing student understanding. © 2013 by The International Union of Biochemistry and Molecular Biology.
DeLeon, Iser G.; Iwata, Brian A.
A study of seven adults with profound developmental disabilities compared methods for presenting stimuli during reinforcer-preference assessments. It found that a multiple-stimulus format in which selections were made without replacement may share the advantages of a paired-stimulus format and a multiple-stimulus format with replacement, while…
Yao, Lihua; Schwarz, Richard D.
Multidimensional item response theory (IRT) models have been proposed for better understanding the dimensional structure of data or to define diagnostic profiles of student learning. A compensatory multidimensional two-parameter partial credit model (M-2PPC) for constructed-response items is presented that is a generalization of those proposed to…
Salathé, Cornelia Rolli; Trippolini, Maurizio Alen; Terribilini, Livio Claudio; Oliveri, Michael; Elfering, Achim
Purpose To develop a multidimensional scale to assess psychosocial beliefs, the Yellow Flag Questionnaire (YFQ), aimed at guiding interventions for workers with chronic musculoskeletal (MSK) pain. Methods Phase 1 consisted of item selection based on literature search, item development and expert consensus rounds. In phase 2, items were reduced by calculating a quality score per item, using structural equation modeling and confirmatory factor analysis on data from 666 workers. In phase 3, Cronbach's α and Pearson correlation coefficients were computed to compare the YFQ score with disability, anxiety, depression and self-efficacy, based on data from 253 injured workers. Regressions of YFQ total score on disability, anxiety, depression and self-efficacy were calculated. Results After phase 1, the YFQ included 116 items and 15 domains. Applying the item quality criteria in phase 2 reduced the total to 48 items. Factor analysis with structural equation modeling confirmed 32 items in seven domains: activity, work, emotions, harm & blame, diagnosis beliefs, co-morbidity and control. Cronbach's α was 0.91 for the total score and between 0.49 and 0.81 for the seven domain scores. Correlations of the YFQ total score with disability, anxiety, depression and self-efficacy were .58, .66, .73 and -.51, respectively. After controlling for age and gender, the YFQ total score explained between 27% and 53% of the variance (R2) in disability, anxiety, depression and self-efficacy. Conclusions The YFQ, a multidimensional screening scale, is recommended for assessing psychosocial beliefs of workers with chronic MSK pain. Further evaluation of measurement properties such as test-retest reliability, responsiveness and prognostic validity is warranted.
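Cronbach's α, reported above for the total and domain scores, can be computed as in this minimal sketch. The function name and the response matrix are hypothetical; sample variances (n - 1 denominator) are assumed.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items score matrix.
    rows: one list of item scores per respondent, all of equal length k >= 2."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical responses of five people to a three-item scale.
responses = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 5, 4],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(responses)
```

Values near 1 indicate that the items vary together; a domain with few or heterogeneous items tends to give a lower α, which is one possible reading of the wide range reported for the seven domain scores.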
In this interview, Dr. Marcia Tate discusses her work and focuses on critical issues in brain based learning, and the need for both formative and summative assessment. Tangential issues such as grade retention and response to intervention are also discussed. It is hoped that this interview will assist teachers in the instructional and learning process and aid in both formative and summative assessment.
The term formative assessment is often used to describe a type of assessment. The purpose of this paper is to challenge the use of this phrase given that formative assessment as a noun phrase ignores the well-established understanding that it is a process more than an object. A model that combines content, context, and strategies is presented as one way to view the process nature of assessing formatively. The alternate phrase formative use of assessment information is suggested as a more appropriate way to describe how content, context, and strategies can be used together in order to close the gap between where a student is performing currently and the intended learning goal.
Based on geochemical characteristics of radioelements and the theory of facieology, the author describes the characteristics of the distribution of U, Th and K in sedimentary formations and the relationship between their combined parameters MA and MB and uranium mineralization in geological formations. The ranges of MA and MB in uraniferous geological formations, used to assess four different levels of uranium mineralization in regional investigation, are obtained by comparing the combined parameters MA and MB in geological formations with different levels of mineralization. This provides experience for quantitatively assessing uranium prospects in geological formations using a multi-parameter model of radioelements.
This dissertation includes three studies that analyze a new set of assessment tasks developed by the Learning Progressions in Middle School Science (LPS) Project. These assessment tasks were designed to measure science content knowledge on the structure of matter domain and scientific argumentation, while following the goals from the Next Generation Science Standards (NGSS). The three studies focus on the evidence available for the success of this design and its implementation, generally labelled as "validity" evidence. I use explanatory item response models (EIRMs) as the overarching framework to investigate these assessment tasks. These models can be useful when gathering validity evidence for assessments as they can help explain student learning and group differences. In the first study, I explore the dimensionality of the LPS assessment by comparing the fit of unidimensional, between-item multidimensional, and Rasch testlet models to see which is most appropriate for this data. By applying multidimensional item response models, multiple relationships can be investigated, and in turn, allow for a more substantive look into the assessment tasks. The second study focuses on person predictors through latent regression and differential item functioning (DIF) models. Latent regression models show the influence of certain person characteristics on item responses, while DIF models test whether one group is differentially affected by specific assessment items, after conditioning on latent ability. Finally, the last study applies the linear logistic test model (LLTM) to investigate whether item features can help explain differences in item difficulties.
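For orientation, the Rasch model and the linear logistic test model (LLTM) named above are usually written as follows. The notation ($\theta_p$ for person ability, $\beta_i$ for item difficulty, $q_{ik}$ for the weight of item feature $k$ in item $i$, $\eta_k$ for the effect of feature $k$) is a common textbook convention assumed here, not taken from the dissertation.

```latex
% Rasch model: probability that person p answers item i correctly
P(X_{pi} = 1 \mid \theta_p) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}

% LLTM: item difficulty decomposed into a weighted sum of K item-feature effects
\beta_i = \sum_{k=1}^{K} q_{ik}\, \eta_k
```

Under the LLTM, the number of difficulty parameters drops from one per item to one per feature, which is what allows item features to "explain" differences in item difficulty.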
Wang, Bo; Sun, Bukuan
The current study examined whether the effect of post-encoding emotional arousal on item memory extends to reality-monitoring source memory and, if so, whether the effect depends on emotionality of learning stimuli and testing format. In Experiment 1, participants encoded neutral words and imagined or viewed their corresponding object pictures. Then they watched a neutral, positive, or negative video. The 24-hour delayed test showed that emotional arousal had little effect on both item memory and reality-monitoring source memory. Experiment 2 was similar except that participants encoded neutral, positive, and negative words and imagined or viewed their corresponding object pictures. The results showed that positive and negative emotional arousal induced after encoding enhanced consolidation of item memory, but not reality-monitoring source memory, regardless of emotionality of learning stimuli. Experiment 3, identical to Experiment 2 except that participants were tested only on source memory for all the encoded items, still showed that post-encoding emotional arousal had little effect on consolidation of reality-monitoring source memory. Taken together, regardless of emotionality of learning stimuli and regardless of testing format of source memory (conjunction test vs. independent test), the facilitatory effect of post-encoding emotional arousal on item memory does not generalize to reality-monitoring source memory.
Peterson, Alexander C; Sutherland, Jason M; Liu, Guiping; Crump, R Trafford; Karimuddin, Ahmer A
The Fecal Incontinence Quality of Life Scale (FIQL) is a commonly used patient-reported outcome measure for fecal incontinence, often used in clinical trials, yet has not been validated in English since its initial development. This study uses modern methods to thoroughly evaluate the psychometric characteristics of the FIQL and its potential for differential functioning by gender. This study analyzed prospectively collected patient-reported outcome data from a sample of patients prior to colorectal surgery. Patients were recruited from 14 general and colorectal surgeons in Vancouver Coastal Health hospitals in Vancouver, Canada. Confirmatory factor analysis was used to assess construct validity. Item response theory was used to evaluate test reliability, describe item-level characteristics, identify local item dependence, and test for differential functioning by gender. 236 patients were included for analysis, with mean age 58 and approximately half female. Factor analysis failed to identify the lifestyle, coping, depression, and embarrassment domains, suggesting lack of construct validity. Items demonstrated low difficulty, indicating that the test has the highest reliability among individuals who have low quality of life. Five items are suggested for removal or replacement. Differential test functioning was minimal. This study has identified specific improvements that can be made to each domain of the Fecal Incontinence Quality of Life Scale and to the instrument overall. Formatting, scoring, and instructions may be simplified, and items with higher difficulty developed. The lifestyle domain can be used as is. The embarrassment domain should be significantly revised before use.
M. L. Oliveira
Introduction and objectives: Apps can be designed to provide usage data, and most of them do. These data are usually used to map users' interests and to deliver more effective ads that are more likely to result in clicks and sales. We applied some of these metrics to understand how they can be used to map students' behavior and to promote formative assessment using educational software. The purpose of formative assessment is to monitor student learning in order to provide ongoing feedback that instructors and students can use to improve the teaching and learning process. It thus aims to help both students and instructors identify strengths, as well as weaknesses that need to be addressed. This study aimed to describe the potential of educational apps in the formative assessment process. Material and Methods: We implemented assessment tools embedded in three apps (ARMET, The Cell and 3D Class) used to teach (1) metabolic pathways; (2) the scale of cellular structures; and (3) concepts from techniques used in a Biochemistry Lab course. The implemented tools make it possible to verify on which topics mistakes recurred, the total number of mistakes, which questions students most often answered correctly, how long they took to perform the activity, and other relevant information. Results and conclusion: Educational apps can provide transparent and coherent evaluation metrics that enable instructors to systematize more consistent criteria and indicators, reducing the subjectivity of the formative assessment process and the time spent on preparing, tabulating and analysing assessment data. This approach allows instructors to understand better where students struggle and to give them more effective feedback. It also helps instructors plan interventions to help students perform better and achieve the learning objectives.
Alexander K. Volkov
Modern approaches to assessing the efficiency of aviation security screeners are analyzed, and certain drawbacks are considered. The main drawback is the difficulty of implementing ICAO recommendations on accounting for shadow X-ray image complexity factors during the preparation of aviation security screeners and the evaluation of how efficiently they detect prohibited items. X-ray image-based factors are the specific properties of the X-ray image that influence screeners' ability to detect prohibited items. The most important complexity factors are: geometric characteristics of a prohibited item; view difficulty of prohibited items; superposition of prohibited items by other objects in the bag; bag content complexity; and the color similarity of prohibited and ordinary items in the luggage. A one-dimensional two-parameter IRT model and a related criterion of screener qualification are suggested. Within the suggested model, the probabilistic detection characteristics of screeners are treated as functions of parameters such as the difference between the level of qualification and the level of X-ray image complexity, and the difference between the screeners' responsibility and the structure of their professional knowledge. On the basis of this model, two characteristic functions can be considered: first, the characteristic function of qualification level, which describes a screener's competency in interpreting X-ray images across multiple complexity levels; second, the characteristic function of X-ray image complexity, which describes the range of competency of screeners with various training levels in interpreting an X-ray image of a given complexity level. The suggested complex criterion for assessing the level of screener qualification makes it possible to evaluate his or
Vaccarino, Anthony L; Evans, Kenneth R; Sills, Terrence L; Kalali, Amir H
Although diagnostically dissociable, anxiety is strongly co-morbid with depression. To examine further the clinical symptoms of anxiety in major depressive disorder (MDD), a non-parametric item response analysis on "blinded" data from four pharmaceutical company clinical trials was performed on the Hamilton Anxiety Rating Scale (HAMA) across levels of depressive severity. The severity of depressive symptoms was assessed using the 17-item Hamilton Depression Rating Scale (HAMD). HAMA and HAMD measures were supplied for each patient on each of two post-screen visits (n=1,668 observations). Option characteristic curves were generated for all 14 HAMA items to determine the probability of scoring a particular option on the HAMA in relation to the total HAMD score. Additional analyses were conducted using Pearson's product-moment correlations. Results showed that anxiety-related symptomatology generally increased as a function of overall depressive severity, though there were clear differences between individual anxiety symptoms in their relationship with depressive severity. In particular, anxious mood, tension, insomnia, difficulties in concentration and memory, and depressed mood were found to discriminate over the full range of HAMD scores, increasing continuously with increases in depressive severity. By contrast, many somatic-related symptoms, including muscular, sensory, cardiovascular, respiratory, gastro-intestinal, and genito-urinary were manifested primarily at higher levels of depression and did not discriminate well at lower HAMD scores. These results demonstrate anxiety as a core feature of depression, and the relationship between anxiety-related symptoms and depression should be considered in the assessment of depression and evaluation of treatment strategies and outcome.
Distance education is a discipline that offers solutions to some important educational problems. Distance education contributes to solving problems such as inequality of opportunity, lifelong education, and the implementation of a series of individual and social goals, and it can contribute to and benefit from educational technology and self-learning. In distance education, methods of measurement and assessment must be consistent with the objectives and content of teaching. A major interest of formative assessment is determining the students’ level of learning for each behavior in the unit of interest. In summative assessment, students’ performance is measured more broadly, across several units, than in formative assessment. A computerized adaptive test (CAT) is a test administered by computer in which the presentation of each item and the decision to stop are determined dynamically, based on the student’s answers and his/her estimated knowledge level. In CAT applications, students do not take the same test. Although the number and properties of items differ between students, the precision of measurement in positioning students on an ability or achievement continuum improves in CAT applications. In CAT applications, the questions answered by a student depend on the student’s ability or learning level. In item response theory, there are several models for estimating a student’s ability level, such as the three-parameter logistic model. Cheating in exams or other academic assignments can be defined as using resources that are not permitted or having someone else take the exam or complete the assignment. Some precautions must be taken against cheating, such as live proctoring, web cams, and a plagiarism detection program.
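To make the adaptive mechanism in the abstract above concrete, the following is a minimal Python sketch of the three-parameter logistic (3PL) model together with one common item-selection rule, maximum information at the current ability estimate. The function names, the item pool, and the choice of selection rule are illustrative assumptions and are not taken from the abstract.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct answer.

    theta: ability, a: discrimination, b: difficulty, c: guessing floor.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct_3pl(theta, a, b, c)
    return a ** 2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def pick_next_item(theta, items, administered):
    """Return the index of the most informative item not yet administered.

    Returns None when every item has already been used.
    """
    candidates = [i for i in range(len(items)) if i not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda i: item_information_3pl(theta, *items[i]))

# Hypothetical pool of (a, b, c) triples: easy, medium, and hard items.
HYPOTHETICAL_POOL = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]

if __name__ == "__main__":
    # For a student currently estimated at theta = 0, ask which item comes next.
    print(pick_next_item(0.0, HYPOTHETICAL_POOL, administered=set()))
```

Because the chosen item depends on the running ability estimate, two students with different answer histories would generally see different items, which is the sense in which CAT test-takers "do not take the same test".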
Brown, David W., Ed.; Sewell, Jeffrey J., Ed.
This document consists of test items which are applicable to biology courses throughout Australia (irrespective of course materials used); assess key concepts within course statement (for both core and optional studies); assess a wide range of cognitive processes; and are relevant to current biological concepts. These items are arranged under…
Shelton, Angi; Smith, Andrew; Wiebe, Eric; Behrle, Courtney; Sirkin, Ruth; Lester, James
Formative assessment strategies are used to direct instruction by establishing where learners' understanding is, how it is developing, informing teachers and students alike as to how they might get to their next set of goals of conceptual understanding. For the science classroom, one rich source of formative assessment data about scientific…
Cardamone, Caroline N.; Abbott, Jonathan E.; Rayyan, Saif; Seaton, Daniel T.; Pawl, Andrew; Pritchard, David E.
Item response theory is useful in both the development and evaluation of assessments and in computing standardized measures of student performance. In item response theory, individual parameters (difficulty, discrimination) for each item or question are fit by item response models. These parameters provide a means for evaluating a test and offer a better measure of student skill than a raw test score, because each skill calculation considers not only the number of questions answered correctly, but the individual properties of all questions answered. Here, we present the results from an analysis of the Mechanics Baseline Test given at MIT during 2005-2010. Using the item parameters, we identify questions on the Mechanics Baseline Test that are not effective in discriminating between MIT students of different abilities. We show that a limited subset of the highest quality questions on the Mechanics Baseline Test returns accurate measures of student skill. We compare student skills as determined by item response theory to the more traditional measurement of the raw score and show that a comparable measure of learning gain can be computed.
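The claim that an IRT skill estimate reflects which questions were answered, not only how many, can be illustrated with a short Python sketch. It assumes a two-parameter logistic model and a simple grid-search maximum-likelihood estimate; the item parameters and function names are hypothetical and are not the Mechanics Baseline Test values or the authors' fitting procedure.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct answer (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at skill level theta."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_2pl(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_skill(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood skill estimate on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination, difficulty) pairs.
ITEMS = [(0.5, -1.0), (1.0, 0.0), (2.0, 1.0)]

# Two students with the same raw score (2 of 3) but different patterns.
skill_missed_hard = estimate_skill(ITEMS, [1, 1, 0])
skill_missed_easy = estimate_skill(ITEMS, [0, 1, 1])

if __name__ == "__main__":
    print(skill_missed_hard, skill_missed_easy)
```

Under these assumed parameters the student who answered the hard, highly discriminating item correctly receives the higher estimate, even though both raw scores are equal.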
Using a case study approach, the dissertation provides the notions and practices of formative assessment in Bhutanese Secondary Schools. It includes the teachers’ understanding of the practice of student-centered teaching and learning, which is regarded as a precondition for effective formative assessment. It also takes account of those features of formative assessment which are much more favored by students and teachers in the case study schools....
Krasne, Sally; Wimmers, Paul F; Relan, Anju; Drake, Thomas A
Formative assessments are systematically designed instructional interventions to assess and provide feedback on students' strengths and weaknesses in the course of teaching and learning. Despite their known benefits to student attitudes and learning, medical school curricula have been slow to integrate such assessments into the curriculum. This study investigates how performance on two different modes of formative assessment relate to each other and to performance on summative assessments in an integrated, medical-school environment. Two types of formative assessment were administered to 146 first-year medical students each week over 8 weeks: a timed, closed-book component to assess factual recall and image recognition, and an un-timed, open-book component to assess higher order reasoning including the ability to identify and access appropriate resources and to integrate and apply knowledge. Analogous summative assessments were administered in the ninth week. Models relating formative and summative assessment performance were tested using Structural Equation Modeling. Two latent variables underlying achievement on formative and summative assessments could be identified; a "formative-assessment factor" and a "summative-assessment factor," with the former predicting the latter. A latent variable underlying achievement on open-book formative assessments was highly predictive of achievement on both open- and closed-book summative assessments, whereas a latent variable underlying closed-book assessments only predicted performance on the closed-book summative assessment. Formative assessments can be used as effective predictive tools of summative performance in medical school. Open-book, un-timed assessments of higher order processes appeared to be better predictors of overall summative performance than closed-book, timed assessments of factual recall and image recognition.
Babiar, Tasha Calvert
Traditionally, women and minorities have not been fully represented in science and engineering. Numerous studies have attributed these differences to gaps in science achievement as measured by various standardized tests. Rather than describe mean group differences in science achievement across multiple cultures, this study focused on an in-depth item-level analysis across two countries: Spain and the United States. This study investigated eighth-grade gender differences on science items across the two countries. A secondary purpose of the study was to explore the nature of gender differences using the many-faceted Rasch Model as a way to estimate gender DIF. A secondary analysis of data from the Third International Mathematics and Science Study (TIMSS) was used to address three questions: 1) Does gender DIF in science achievement exist? 2) Is there a relationship between gender DIF and characteristics of the science items? 3) Do the relationships between item characteristics and gender DIF in science items replicate across countries? Participants included 7,087 eighth-grade students from the United States and 3,855 students from Spain who participated in TIMSS. The Facets program (Linacre and Wright, 1992) was used to estimate gender DIF. The results of the analysis indicate that the content of the item seemed to be related to gender DIF. The analysis also suggests that there is a relationship between gender DIF and item format. No pattern of gender DIF related to cognitive demand was found. The general pattern of gender DIF was similar across the two countries used in the analysis. The strength of item-level analysis as opposed to group mean difference analysis is that gender differences can be detected at the item level, even when no mean differences can be detected at the group level.
Partington, Susan N; Menzies, Tim J; Colburn, Trina A; Saelens, Brian E; Glanz, Karen
The community food environment may contribute to obesity by influencing food choice. Store and restaurant audits are increasingly common methods for assessing food environments, but are time consuming and costly. A valid, reliable brief measurement tool is needed. The purpose of this study was to develop and validate reduced-item food environment audit tools for stores and restaurants. Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed in 820 stores and 1,795 restaurants in West Virginia, San Diego, and Seattle. Data mining techniques (correlation-based feature selection and linear regression) were used to identify survey items highly correlated with total survey scores and produce reduced-item audit tools that were subsequently validated against full NEMS surveys. Regression coefficients were used as weights that were applied to reduced-item tool items to generate scores comparable to full NEMS surveys. Data were collected and analyzed in 2008-2013. The reduced-item tools included eight items for grocery, ten for convenience, seven for variety, and five for other stores; and 16 items for sit-down, 14 for fast casual, 19 for fast food, and 13 for specialty restaurants, amounting to 10% of the full NEMS-S and 25% of the full NEMS-R. There were no significant differences in median scores for varying types of retail food outlets when compared to the full survey scores. Median in-store audit time was reduced 25%-50%. Reduced-item audit tools can reduce the burden and complexity of large-scale or repeated assessments of the retail food environment without compromising measurement quality. Copyright © 2015 American Journal of Preventive Medicine. Published by Elsevier Inc. All rights reserved.
Alexander J Millner
Suicide is a leading cause of death worldwide. Although research has made strides in better defining suicidal behaviors, there has been less focus on accurate measurement. Currently, the widespread use of self-report, single-item questions to assess suicide ideation, plans and attempts may contribute to measurement problems and misclassification. We examined the validity of single-item measurement and the potential for statistical errors. Over 1,500 participants completed an online survey containing single-item questions regarding a history of suicidal behaviors, followed by questions with more precise language, multiple response options and narrative responses to examine the validity of single-item questions. We also conducted simulations to test whether common statistical tests are robust against the degree of misclassification produced by the use of single items. We found that 11.3% of participants who endorsed a single-item suicide attempt measure engaged in behavior that would not meet the standard definition of a suicide attempt. Similarly, 8.8% of those who endorsed a single-item measure of suicide ideation endorsed thoughts that would not meet standard definitions of suicide ideation. Statistical simulations revealed that this level of misclassification substantially decreases statistical power and increases the likelihood of false conclusions from statistical tests. Providing a wider range of response options for each item reduced the misclassification rate by approximately half. Overall, the use of single-item, self-report questions to assess the presence of suicidal behaviors leads to misclassification, increasing the likelihood of statistical decision errors. Improving the measurement of suicidal behaviors is critical to increase understanding and prevention of suicide.
Methodological quality of diagnostic accuracy studies on non-invasive coronary CT angiography: influence of QUADAS (Quality Assessment of Diagnostic Accuracy Studies included in systematic reviews) items on sensitivity and specificity
Schueler, Sabine; Walther, Stefan; Schuetz, Georg M. [Humboldt-Universitaet zu Berlin, Freie Universitaet Berlin, Charite Medical School, Department of Radiology, Berlin (Germany); Schlattmann, Peter [University Hospital of Friedrich Schiller University Jena, Department of Medical Statistics, Informatics, and Documentation, Jena (Germany); Dewey, Marc [Humboldt-Universitaet zu Berlin, Freie Universitaet Berlin, Charite Medical School, Department of Radiology, Berlin (Germany); Charite, Institut fuer Radiologie, Berlin (Germany)
To evaluate the methodological quality of diagnostic accuracy studies on coronary computed tomography (CT) angiography using the QUADAS (Quality Assessment of Diagnostic Accuracy Studies included in systematic reviews) tool. Each QUADAS item was individually defined to adapt it to the special requirements of studies on coronary CT angiography. Two independent investigators analysed 118 studies using 12 QUADAS items. Meta-regression and pooled analyses were performed to identify possible effects of methodological quality items on estimates of diagnostic accuracy. The overall methodological quality of coronary CT studies was merely moderate. They fulfilled a median of 7.5 out of 12 items. Only 9 of the 118 studies fulfilled more than 75% of possible QUADAS items. One QUADAS item ("Uninterpretable Results") showed a significant influence (P = 0.02) on estimates of diagnostic accuracy with "no fulfilment" increasing specificity from 86 to 90%. Furthermore, pooled analysis revealed that each QUADAS item that is not fulfilled has the potential to change estimates of diagnostic accuracy. The methodological quality of studies investigating the diagnostic accuracy of non-invasive coronary CT is only moderate and was found to affect the sensitivity and specificity. An improvement is highly desirable because good methodology is crucial for adequately assessing imaging technologies.
Wiener-Ogilvie, Sharon; Begg, Drummond
Clinical skill assessment (CSA) has been an integral part of the Royal College of General Practitioners' membership examination (MRCGP) since 2008. It is an expensive, high-stakes examination with first-time pass rates ranging from 76.4 to 81.3. In this paper we describe the South East Scotland Deanery, NHS Education Scotland, pilot of a formative clinical skills assessment (fCSA) using the principles of formative assessment and OSCE. The purpose of the study was to assess the acceptability of the fCSA and to examine whether trainees identified during the fCSA as 'at risk of failing the MRCGP CSA exam' are more likely to fail the MRCGP CSA exam later on in the year. Trainees were assessed in four clinical skills stations under exam conditions. After each station they were given verbal feedback, and subsequently both the trainee and their trainer received written feedback. We assessed the value of the exercise through written feedback from trainees and trainers. Each trainee's performance in the fCSA was triangulated with trainer assessment to identify 'flagged trainees'. We compared flagged and non-flagged trainees' performance in the MRCGP CSA. Both trainees and trainers rated the fCSA highly. Overall, 97% of non-flagged trainees had passed the RCGP CSA exam by May of that year, compared with 80% of flagged trainees (P = 0.005). Trainers and trainees rated the fCSA as excellent and useful. We were able to demonstrate that the fCSA can be used to identify those trainees likely to fail the RCGP CSA. Contrary to reservations about the potential to demoralise trainees, the fCSA was viewed as a useful and positive experience by both trainees and trainers. In addition, we suggest that feedback from the fCSA was useful in triggering appropriate educational interventions. Early intervention with trainees who are predicted to fail the CSA has the potential to reduce deaneries' overall fail rate. Preventing one trainee failure could save over £30 000.
Learning assessments are subject of discussion both in their theoretical and practical approaches. The process of measuring learning in physics by high school students, either qualitatively or quantitatively, is one in which it should be possible to identify not only the concepts and contents students failed to achieve but also the reasons for the failure. We propose that students’ video production offers a very effective formative assessment tool to teachers: as a formative assessment, it produces information that allows the understanding of where and when the learning process succeeded or failed, of identifying, as a subject or as a group, the deficiencies or misunderstandings related to the theme under analysis and their interpretation by students, and it provides also a different kind of assessment, related to some other life skills, such as ability to carry on a project till its conclusion and to work cooperatively. In this paper, we describe the use of videos produced by high school students as an assessment resource. The students were asked to prepare a short video, which was then presented to the whole group and discussed. The videos reveal aspects of students’ difficulties that usually do not appear in formal assessments such as tests and questionnaires. After the use of the videos as a component of classroom assessments and the use of the discussions to rethink learning activities in the group, the videos were analysed and classified in various categories. This analysis showed a strong correlation between the technical quality of the video and the content quality of the students’ argumentation. Also, it was shown that the students do not prepare their video based on quick and easy production; they usually choose forms of video production that require careful planning and implementation, and this reflects directly on the overall quality of the video and of the learning process.
Tylka, Tracy L; Wood-Barcalow, Nichole L
Considered a positive body image measure, the 13-item Body Appreciation Scale (BAS; Avalos, Tylka, & Wood-Barcalow, 2005) assesses individuals' acceptance of, favorable opinions toward, and respect for their bodies. While the BAS has accrued psychometric support, we improved it by rewording certain BAS items (to eliminate sex-specific versions and body dissatisfaction-based language) and developing additional items based on positive body image research. In three studies, we examined the reworded, newly developed, and retained items to determine their psychometric properties among college and online community (Amazon Mechanical Turk) samples of 820 women and 767 men. After exploratory factor analysis, we retained 10 items (five original BAS items). Confirmatory factor analysis upheld the BAS-2's unidimensionality and invariance across sex and sample type. Its internal consistency, test-retest reliability, and construct (convergent, incremental, and discriminant) validity were supported. The BAS-2 is a psychometrically sound positive body image measure applicable for research and clinical settings. Copyright © 2014 Elsevier Ltd. All rights reserved.
Pandito, R. H.; Haris, A.; Zainal, R. M.; Riyanto, A.
The Ngimbang formation at the Rihen field of the Northeast Java Basin has been assessed to identify its hydrocarbon potential by analyzing the passive seismic response over the proven reservoir zone and proposing a tectonic evolution model. In petroleum exploration in the Northeast Java Basin, the importance of the Ngimbang formation cannot be overemphasized. The East Java Basin is well known as one of the mature hydrocarbon-producing basins in Indonesia. Stratigraphically, the basin is composed of several formations, from oldest to youngest: the basement, Ngimbang, Kujung, Tuban, Ngrayong, Wonocolo, Kawengan and Lidah formations. All of these formations have proven to be hydrocarbon producers. The Ngrayong formation, which is geologically dominated by channels, has become a producing formation. The Kujung formation, which is known for its reef build-ups, has produced more than 102 million barrels of oil. The Ngimbang formation has so far not been comprehensively assessed in terms of its role as a source rock and a reservoir. In 2013, one exploratory well was drilled in the Ngimbang formation and showed a gas discovery, indicated by a Drill Stem Test (DST) reading of more than 22 MMSCFD of gas. This discovery opens a new prospect for exploring the Ngimbang formation.
Sideridis, Georgios; Padeliadu, Susana
The purpose of the present studies was to provide the means to create brief versions of instruments that can aid the diagnosis and classification of students with learning disabilities and comorbid disorders (e.g., attention-deficit/hyperactivity disorder). A sample of 1,108 students with and without a diagnosis of learning disabilities took part in study 1. Using information from modern theory methods (i.e., the Rasch model), a scale was created that included fewer than one third of the original battery items designed to assess reading skills. This best item synthesis was then evaluated for its predictive and criterion validity with a valid external reading battery (study 2). Using a sample of 232 students with and without learning disabilities, results indicated that the brief version of the scale was equally effective as the original scale in predicting reading achievement. Analysis of the content of the brief scale indicated that the best item synthesis involved items from cognition, motivation, strategy use, and advanced reading skills. It is suggested that multiple psychometric criteria be employed in evaluating the psychometric adequacy of scales used for the assessment and identification of learning disabilities and comorbid disorders.
Background: Lack of assessment and feedback based on observation is one of the most serious deficiencies in current medical education practice. Formative assessment strategies in postgraduate education can be effective when they are integral to the learning process. Seminars and journal club presentations are integral to postgraduate education in all medical institutions. Methods: This study was done to assess a structured tool for evaluation of seminars and journal clubs by postgraduates in Community Medicine (as part of formative assessment) based on rater reliability and efficacy of feedback. Results: The scale, which has five domains (justification for the topic or the journal article, presentation skills, slide preparation, slide content, and discussion), had high inter-rater reliability, with an intraclass correlation coefficient of 0.861 (95% CI 0.632 to 0.958, p < 0.001). There was a significant improvement of the students over three journal club presentations in four out of five domains. Conclusions: This study has shown that the use of rating scales during seminar and journal club presentations, when combined with feedback, can be an effective tool in formative assessment, thereby supporting and enhancing the learning process.
Cisterna Alburquerque, Dante Igor
This study describes and analyzes the experiences of two high-school chemistry teachers who participated in a team-based professional development program to learn about and enact formative assessment in their classrooms. The overall purpose of this study is to explain how participation in this professional development influenced both teachers' classroom enactment of formative assessment practices. This study focuses on 1) teachers' participation in the professional development program, 2) teachers' enactment of formative assessment, and 3) factors that enabled or hindered enactment of formative assessment. Drawing on cultural-historical activity theory (CHAT) and using evidence from teacher lessons, teacher interviews, and professional development meetings as data sources, this single embedded case study analyzes how these two teachers, who participated in the same learning team and have similar characteristics (i.e., teaching in the same school, teaching the same courses and population of students, and using the same materials), differentially used the professional development learning about formative assessment as mediating tools to improve their classroom instruction. The learning team experience contributed to both teachers' development of a better understanding of formative assessment (especially in recognizing that their current grading and assessment practices were not appropriate to promote student learning) and to the co-creation of artifacts to gather evidence of students' ideas. Although both teachers demonstrated understanding about how formative assessment may serve to promote student learning and had a set of tools available for formative assessment use, they did not enact these tools in the same way. One teacher appropriated formative assessment as a mediating tool to verify whether the students were following her explanations, and to check whether the students were able to provide the correct response. The other teacher used the mediating tool to promote
Peterson, Dwight J; Naveh-Benjamin, Moshe
An important yet unresolved question regarding visual working memory (VWM) relates to whether or not binding processes within VWM require additional attentional resources compared with processing solely the individual components comprising these bindings. Previous findings indicate that binding of surface features (e.g., colored shapes) within VWM is not demanding of resources beyond what is required for single features. However, it is possible that other types of binding, such as the binding of complex, distinct items (e.g., faces and scenes), in VWM may require additional resources. In 3 experiments, we examined VWM item-item binding performance under no load, articulatory suppression, and backward counting using a modified change detection task. Binding performance declined to a greater extent than single-item performance under higher compared with lower levels of concurrent load. The findings from each of these experiments indicate that processing item-item bindings within VWM requires a greater amount of attentional resources compared with single items. These findings also highlight an important distinction between the role of attention in item-item binding within VWM and previous studies of long-term memory (LTM) where declines in single-item and binding test performance are similar under divided attention. The current findings provide novel evidence that the specific type of binding is an important determining factor regarding whether or not VWM binding processes require attention. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Bialleck, Katharina A.; Schaal, Hans-Peter; Kranz, Thorsten A.; Fell, Juergen; Elger, Christian E.; Axmacher, Nikolai
During reinforcement learning, dopamine release shifts from the moment of reward consumption to the time point when the reward can be predicted. Previous studies provide consistent evidence that reward-predicting cues enhance long-term memory (LTM) formation of these items via dopaminergic projections to the ventral striatum. However, it is less clear whether memory for items that do not precede a reward but are directly associated with reward consumption is also facilitated. Here, we investigated this question in an fMRI paradigm in which LTM for reward-predicting and neutral cues was compared to LTM for items presented during consumption of reliably predictable as compared to less predictable rewards. We observed activation of the ventral striatum and enhanced memory formation during reward anticipation. During processing of less predictable as compared to reliably predictable rewards, the ventral striatum was activated as well, but items associated with less predictable outcomes were remembered worse than items associated with reliably predictable outcomes. Processing of reliably predictable rewards activated the ventromedial prefrontal cortex (vmPFC), and vmPFC BOLD responses were associated with successful memory formation of these items. Taken together, these findings show that consumption of reliably predictable rewards facilitates LTM formation and is associated with activation of the vmPFC. PMID:21326612
Schmiemann, Philipp; Nehm, Ross H.; Tornabene, Robyn E.
Understanding how situational features of assessment tasks impact reasoning is important for many educational pursuits, notably the selection of curricular examples to illustrate phenomena, the design of formative and summative assessment items, and determination of whether instruction has fostered the development of abstract schemas divorced from particular instances. The goal of our study was to employ an experimental research design to quantify the degree to which situational features impact inferences about participants' understanding of Mendelian genetics. Two participant samples from different educational levels and cultural backgrounds (high school, n = 480; university, n = 444; Germany and USA) were used to test for context effects. A multi-matrix test design was employed, and item packets differing in situational features (e.g., plant, animal, human, fictitious) were randomly distributed to participants in the two samples. Rasch analyses of participant scores from both samples produced good item fit, person reliability, and item reliability and indicated that the university sample displayed stronger performance on the items compared to the high school sample. We found, surprisingly, that in both samples, no significant differences in performance occurred among the animal, plant, and human item contexts, or between the fictitious and "real" item contexts. In the university sample, we were also able to test for differences in performance between genders, among ethnic groups, and by prior biology coursework. None of these factors had a meaningful impact upon performance or context effects. Thus some, but not all, types of genetics problem solving or item formats are impacted by situational features.
Dolin, Jens; Evans, Robert Harry
This chapter suggests the use of formative assessment in inquiry lessons as a helpful source of positive personal capacity beliefs for both teachers and students. The challenge most commonly experienced when first using inquiry learning methods is that pupils and even teachers become uncertain of their abilities to use inquiry and ‘give up’ on it. With the use of formative assessment combined with conscious efforts to increase self-efficacy among students, teachers can help provide students with the confidence and motivation to engage in inquiry methods. Such student engagement can in turn affirm teachers’ inquiry teaching efforts and raise the likelihood that they will continue to improve them. We see inquiry methods as the motor for changing teacher practice and formative assessment methods combined with capacity beliefs as the fuel that keeps the motor running. The central position of the chapter is how...
Purpose: This article addresses questions related to the formative assessment of the preparatory traineeship in the professional life of Physical Education teachers. In general, in the first training program, the traineeship is an integral part of training. It offers future teachers a vital opportunity to gain practical experience in a real environment, given that formative evaluation is a process in which cooperative teachers collect evidence from trainees to make decisions about their knowledge and skills, to guide their own instructional activities and to control their behavior. Accordingly, this study set out to explore the practices of Tunisian cooperative teachers in relation to formative assessment. Material: We conducted the research using a questionnaire distributed among 96 cooperative teachers in different educational institutions located in the Greater Tunis region. During the 2015-2016 school year, the questionnaire data were analyzed statistically using frequencies and percentages. Results: The analysis revealed a range of formative assessment practices among cooperative teachers. In particular, each teacher acknowledged the value of guiding and encouraging students’ self-assessment, so that they could lead their students to assume a share of the evaluative activity. Conclusion: Both theoretical and practical implications of these findings are discussed, and some recommendations are made for future practice.
Toker, Turker; Green, Kathy
The least squares distance method (LSDM) was used in a cognitive diagnostic analysis of TIMSS (Trends in International Mathematics and Science Study) items administered to 4,498 8th-grade students from seven geographical regions of Turkey, extending analysis of attributes from content to process and skill attributes. Logit item positions were…
Arapovic-Johansson, B; Wåhlin, C; Kwak, L; Björklund, C; Jensen, I
Given the prevalence of work stress-related ill-health in the Western world, it is important to find cost-effective, easy-to-use and valid measures which can be used both in research and in practice. The aim was to examine the validity and reliability of the single-item stress question (SISQ), distributed weekly by short message service (SMS) and used for measurement of work-related stress. The convergent validity was assessed through associations between the SISQ and subscales of the Job Demand-Control-Support model, the Effort-Reward Imbalance model and scales measuring depression, exhaustion and sleep. The predictive validity was assessed using SISQ data collected through SMS. The reliability was analysed by the test-retest procedure. Correlations between the SISQ and all the subscales except for job strain and esteem reward were significant, ranging from -0.186 to 0.627. The SISQ could also predict sick leave, depression and exhaustion at 12-month follow-up. The analysis on reliability revealed a satisfactory stability with a weighted kappa between 0.804 and 0.868. The SISQ, administered through SMS, can be used for the screening of stress levels in a working population. © The Author 2017. Published by Oxford University Press on behalf of the Society of Occupational Medicine. All rights reserved.
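The weighted kappa used above as a test-retest statistic can be illustrated with a short sketch. This is only an illustration under stated assumptions: the abstract does not say which weighting scheme was applied, so linear disagreement weights are assumed by default (power=2 gives quadratic weights), and the 5-point response scale and the two sets of ratings are hypothetical, not data from the study.

```python
def weighted_kappa(first, second, categories, power=1):
    """Cohen's weighted kappa for paired ordinal ratings.

    power=1 uses linear disagreement weights, power=2 quadratic weights.
    The choice of weights is an assumption; the abstract does not state it.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {category: i for i, category in enumerate(categories)}
    n = len(first)

    # Observed joint proportions of (first rating, second rating).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += weight * observed[i][j]
            chance_disagreement += weight * row[i] * col[j]

    if chance_disagreement == 0.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - observed_disagreement / chance_disagreement


# Hypothetical stress ratings from the same eight people in two consecutive weeks.
week_1 = [1, 2, 2, 3, 4, 5, 3, 2]
week_2 = [1, 2, 3, 3, 4, 4, 3, 2]
kappa = weighted_kappa(week_1, week_2, categories=[1, 2, 3, 4, 5])
print(f"weighted kappa = {kappa:.3f}")
```

Weighting matters for ordinal scales because a one-step shift between weeks is penalized less than a shift across the whole scale, which an unweighted kappa would treat identically.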
van Halem, Nicolette; Goei, Sui Lin; Akkerman, Sanne F.
Purpose: The purpose of this paper is to evaluate the extent of systematic examination of students’ educational (support) needs by teachers participating in lesson study (LS) meetings within a framework of formative assessment (FA). Design/methodology/approach: The study took place in the context of
Holman, Rebecca; Glas, Cornelis A.W.
A model-based procedure for assessing the extent to which missing data can be ignored and handling non-ignorable missing data is presented. The procedure is based on item response theory modelling. As an example, the approach is worked out in detail in conjunction with item response data modelled
The aim of this study was to determine the impact of feedback in formative assessment on learning process activity and students' learning outcomes in chemistry. The method used in this study was quasi-experimental research with a non-equivalent control group design. The results showed that the application of feedback in formative assessment has a positive impact on students' learning process activity. Students become more enthusiastic, motivated, and active in the learning process. Thus it can be concluded that feedback in formative assessment has a positive impact on learning process activity in forming habits of mind.
Cardoso, Monique Herrera; Capellini, Simone Aparecida
The aim was to perform a cross-cultural adaptation of the Detailed Assessment of Speed of Handwriting 17+ (DASH 17+) for Brazilians. The method comprised evaluation of (1) conceptual and item equivalence and (2) semantic equivalence, with the assistance of four translators, and application of a pilot study to 36 students. (1) The concepts and items are equivalent in the British and Brazilian cultures. (2) Adaptations were made concerning the English language pangram used in copying tasks and selection of the lower-case, cursive handwriting in the alphabet-writing task. Application of the pilot study verified acceptability and understanding of the proposed tasks by the students. The Brazilian Portuguese version of the DASH 17+ was presented after finalization of the conceptual, item and semantic equivalence of the instrument. Further studies on psychometric properties should be conducted with the purpose of measuring the speed of handwriting in youngsters and adults with greater reliability and validity to the procedure.
Hohensinn, Christine; Kubinger, Klaus D.
In aptitude and achievement tests, different response formats are usually used. A fundamental distinction must be made between the class of multiple-choice formats and the constructed response formats. Previous studies have examined the impact of different response formats applying traditional statistical approaches, but these influences can also…
Victor Manuel López Pastor
The aim of this article is three-fold: (a) to present an example of best practices in formative assessment in university instruction, offering three different methods of learning and assessment to pass a subject; (b) to analyze differences in academic performance depending on the method of learning and assessment chosen; (c) to consider professors' and students' evaluation of these assessment methods, as well as to analyze the workload these methods entail for both students and professors. The design is based on a single case study. The study analyzes the results obtained in a third-year course at the University of Valladolid (Spain) that participated in an ECTS pilot program. Data were collected during academic year 2009-10. The total number of registered students was 77. This paper describes the procedure to develop a formative assessment system and collect data, as well as the main techniques to obtain and analyze data. Findings indicate that there are important differences in student academic performance depending on the learning and assessment method employed in an academic course. Courses using formative and ongoing assessment result in significantly higher student academic performance than courses using other learning and assessment methods. Lastly, empirical data suggest that the workload is in line with the ECTS (European Credit Transfer System), and is not excessive for the professor. However, students' subjective perception is that this method involves a heavier workload. These findings may be important, given the current process of convergence towards the new Degrees and ECTS credit system, and the need to adapt these degrees and credits to students’ real workload.
Kim, Stella H; Strutt, Adriana M; Olabarrieta-Landa, Laiene; Lequerica, Anthony H; Rivera, Diego; De Los Reyes Aragon, Carlos Jose; Utria, Oscar; Arango-Lasprilla, Juan Carlos
The Boston Naming Test (BNT) is a widely used measure of confrontation naming ability that has been criticized for its questionable construct validity for non-English speakers. This study investigated item difficulty and construct validity of the Spanish version of the BNT to assess cultural and linguistic impact on performance. Subjects were 1298 healthy Spanish speaking adults from Colombia. They were administered the 60- and 15-item Spanish version of the BNT. A Rasch analysis was computed to assess dimensionality, item hierarchy, targeting, reliability, and item fit. Both versions of the BNT satisfied requirements for unidimensionality. Although internal consistency was excellent for the 60-item BNT, order of difficulty did not increase consistently with item number and there were a number of items that did not fit the Rasch model. For the 15-item BNT, a total of 5 items changed position on the item hierarchy with 7 poor fitting items. Internal consistency was acceptable. Construct validity of the BNT remains a concern when it is administered to non-English speaking populations. Similar to previous findings, the order of item presentation did not correspond with increasing item difficulty, and both versions were inadequate at assessing high naming ability.
Jorge Luíz Clemente Gomes
This work aims to organize pre-defined items that affect the students’ answers when assessing the Electrotechnical Technology Course / PROEJA. The research was carried out from October 2011 to December 2012 with questionnaires administered to 1st to 6th period students. At campus Campos Centro, “Technical Visits” and “Internship” presented high levels of importance and low satisfaction, while “Personal Realization” and “Professional Achievement” presented high levels of relevance and satisfaction. At campus Itaperuna, “Job opportunities” and “Professional Achievement” presented high levels of relevance and satisfaction. The items “Faculty” and “New Technologies” presented high importance but low satisfaction. The research aims at improving the quality of the course.
There are several advantages to creating multimedia item types and applying computer-based adaptive testing in education. First is the capability to motivate learning by making the learners feel more engaged and in an interactive environment. Second is a better concept representation, which is not possible in conventional multiple-choice tests. Third is the advantage of individualized curriculum design, rather than a curriculum designed for an average student. Fourth is a good choice of the next question, associated with the appropriate difficulty level based on a student's response to the current question. However, many issues need to be addressed when achieving these goals, including: (a) the large number of item types required to represent the current multiple-choice questions in multimedia formats, (b) the criterion used to determine the difficulty level of a multimedia question item, and (c) the methodology applied to the question selection process for individual students. In this paper, we propose a multimedia item shell design that not only reduces the number of item types required, but also computes difficulty level of an item automatically. The concept of question seed is introduced to make content creation more cost-effective. The proposed item shell framework facilitates efficient communication between user responses at the client, and the scoring agents integrated with a student ability assessor at the server. We also describe approaches for automatically estimating difficulty level of questions, and discuss preliminary evaluation of multimedia item types by students.
Solano-Flores, Guillermo; Wang, Chao; Shade, Chelsey
We examined multimodality (the representation of information in multiple semiotic modes) in the context of international test comparisons. Using Programme for International Student Assessment (PISA) 2009 data, we examined the correlation of the difficulty of science items and the complexity of their illustrations. We observed statistically…
This article presents a guide to the development of formative assessments for school librarians participating in professional learning communities (PLC). It describes librarians' reading of assigned books, meeting with their PLCs, and incorporation of learned strategies in their daily instruction. Central library service readers' regular visits to…
Sadler, D. Royce
Discusses the nature and function of formative assessment in the development of students' expertise for evaluating the quality of their own work. Highlights include the transition from teacher-supplied feedback to learner self-monitoring; qualitative judgments; communicating standards to students; multicriterion judgments; and implications for the…
Cropley, David; Cropley, Arthur
Computer-assisted assessment (CAA) is problematic when it comes to fostering creativity, because in educational thinking the essence of creativity is not finding the correct answer but generating novelty. The idea of "functional" creativity provides rubrics that can serve as the basis for forms of CAA leading to either formative or…
Quercia, F.; D'Alessandro, M.; Saltelli, A.
The probabilistic code LISA (Long term Isolation Safety Assessment) has been used to assess the risk related to the disposal of alpha waste in a geological formation. The code has been modified to take into account waste form properties and leaching processes pertinent to alpha waste produced at fuel reprocessing plants. The exercise refers to a repository in a deep clay formation located at Harwell (U.K.) where some hydrogeological data were available. Radionuclide migration through repository and geological barriers has been simulated together with biosphere contamination. Results of the assessment are presented as dose rate (or risk) distributions; a sensitivity analysis on input parameters has been performed.
Dikken, Jeroen; Hoogerduijn, Jita G; Kruitwagen, Cas; Schuurmans, Marieke J
To assess the content validity and psychometric characteristics of the Knowledge about Older Patients Quiz (KOP-Q), which measures nurses' knowledge regarding older hospitalized adults and their certainty regarding this knowledge. Cross-sectional. Content validity: general hospitals. Psychometric characteristics: nursing school and general hospitals in the Netherlands. Content validity: 12 nurse specialists in geriatrics. Psychometric characteristics: 107 first-year and 78 final-year bachelor of nursing students, 148 registered nurses, and 20 nurse specialists in geriatrics. Content validity: The nurse specialists rated each item of the initial KOP-Q (52 items) on relevance. Ratings were used to calculate Item-Content Validity Index and average Scale-Content Validity Index (S-CVI/ave) scores. Items with insufficient content validity were removed. Psychometric characteristics: Ratings of students, nurses, and nurse specialists were used to test for differential item functioning (DIF) and unidimensionality before item characteristics (discrimination and difficulty) were examined using Item Response Theory. Finally, norm references were calculated and nomological validity was assessed. Content validity: Forty-three items remained after assessing content validity (S-CVI/ave = 0.90). Psychometric characteristics: Of the 43 items, two demonstrating ceiling effects and 11 distorting ability estimates (DIF) were subsequently excluded. Item characteristics were assessed for the remaining 30 items, all of which demonstrated good discrimination and difficulty parameters. Knowledge was positively correlated with certainty about this knowledge. The final 30-item KOP-Q is a valid, psychometrically sound, comprehensive instrument that can be used to assess the knowledge of nursing students, hospital nurses, and nurse specialists in geriatrics regarding older hospitalized adults. It can identify knowledge and certainty deficits for research purposes or serve as a tool in educational
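The content validity indices named above follow a simple calculation, sketched here for illustration. The sketch assumes the common convention of a 4-point relevance scale on which ratings of 3 or 4 count as "relevant"; the abstract does not state the scale or the retention cutoff, so the threshold, the cutoff of 0.78, the item names, and the ratings below are all hypothetical.

```python
def item_cvi(expert_ratings, relevant_from=3):
    """I-CVI: share of experts who rate the item as relevant.

    relevant_from=3 assumes a 4-point relevance scale (an assumption).
    """
    if not expert_ratings:
        raise ValueError("an item needs at least one expert rating")
    relevant = sum(1 for rating in expert_ratings if rating >= relevant_from)
    return relevant / len(expert_ratings)


def scale_cvi_ave(ratings_by_item, relevant_from=3):
    """S-CVI/Ave: mean of the I-CVI values across all items."""
    values = [item_cvi(r, relevant_from) for r in ratings_by_item.values()]
    return sum(values) / len(values)


# Hypothetical relevance ratings from four experts for three items.
ratings = {
    "item_01": [4, 4, 3, 4],
    "item_02": [4, 2, 3, 1],
    "item_03": [3, 4, 4, 2],
}

cutoff = 0.78  # hypothetical retention cutoff, not taken from the abstract
retained = {name: r for name, r in ratings.items() if item_cvi(r) >= cutoff}

for name, r in ratings.items():
    print(name, item_cvi(r))
print("S-CVI/Ave over retained items:", scale_cvi_ave(retained))
```

Computing S-CVI/Ave over the retained items only mirrors the order of steps in the abstract, where items with insufficient content validity are removed before the scale-level value is reported.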
McKay, J; Murphy, D J; Bowie, P; Schmuck, M-L; Lough, M; Eva, K W
To establish the content validity and specific aspects of reliability for an assessment instrument designed to provide formative feedback to general practitioners (GPs) on the quality of their written analysis of a significant event. Content validity was quantified by application of a content validity index. Reliability testing involved a nested design, with 5 cells, each containing 4 assessors, rating 20 unique significant event analysis (SEA) reports (10 each from experienced GPs and GPs in training) using the assessment instrument. The variance attributable to each identified variable in the study was established by analysis of variance. Generalisability theory was then used to investigate the instrument's ability to discriminate among SEA reports. Content validity was demonstrated with at least 8 of 10 experts endorsing all 10 items of the assessment instrument. The overall G coefficient for the instrument was moderate to good (G>0.70), indicating that the instrument can provide consistent information on the standard achieved by the SEA report. There was moderate inter-rater reliability (G>0.60) when four raters were used to judge the quality of the SEA. This study provides the first steps towards validating an instrument that can provide educational feedback to GPs on their analysis of significant events. The key area identified to improve instrument reliability is variation among peer assessors in their assessment of SEA reports. Further validity and reliability testing should be carried out to provide GPs, their appraisers and contractual bodies with a validated feedback instrument on this aspect of the general practice quality agenda.
Davies, Jenifer; Ecclestone, Kathryn
In contrast to theoretical and empirical insights from research into formative assessment in compulsory schooling, understanding the relationship between formative assessment, motivation and learning in vocational education has been a topic neglected by researchers. The Improving Formative Assessment project (IFA) addresses this gap, using a…
Linden, Michael; Dymke, Tina; Hüttner, Susanne; Schnaubelt, Sabine
The first item of any psychopathological assessment is "general impression". There is some research under the heading of "impression formation" which shows that the outer appearance of a person determines how that person is perceived by others and how others react. Impression formation is an important factor in social interaction. This is of special importance in mental disorders, which may express themselves in a distorted impression formation. As impression formation is by and large an emotional process, measurement can be done by adjective lists. An example is the bipolar MED rating scale. Such lists can be used in therapy to help patients and therapists to understand the problem and initiate modifications. A special group intervention in occupational therapy is described. Results suggest that impression formation is quite objective, that self- and observer judgments coincide and that therapy can help to adopt a less irritating outer appearance. © Georg Thieme Verlag KG Stuttgart · New York.
Sirota, Miroslav; Juanchich, Marie
The Cognitive Reflection Test, measuring intuition inhibition and cognitive reflection, has become extremely popular because it reliably predicts reasoning performance, decision-making, and beliefs. Across studies, the response format of CRT items sometimes differs, based on the assumed construct equivalence of tests with open-ended versus multiple-choice items (the equivalence hypothesis). Evidence and theoretical reasons, however, suggest that the cognitive processes measured by these response formats and their associated performances might differ (the nonequivalence hypothesis). We tested the two hypotheses experimentally by assessing the performance in tests with different response formats and by comparing their predictive and construct validity. In a between-subjects experiment (n = 452), participants answered stem-equivalent CRT items in an open-ended, a two-option, or a four-option response format and then completed tasks on belief bias, denominator neglect, and paranormal beliefs (benchmark indicators of predictive validity), as well as on actively open-minded thinking and numeracy (benchmark indicators of construct validity). We found no significant differences between the three response formats in the numbers of correct responses, the numbers of intuitive responses (with the exception of the two-option version, which had a higher number than the other tests), and the correlational patterns of the indicators of predictive and construct validity. All three test versions were similarly reliable, but the multiple-choice formats were completed more quickly. We speculate that the specific nature of the CRT items helps build construct equivalence among the different response formats. We recommend using the validated multiple-choice version of the CRT presented here, particularly the four-option CRT, for practical and methodological reasons. Supplementary materials and data are available at https://osf.io/mzhyc/.
In this commentary, I interpret Xinying Yin and Gayle Ann Buck's collaborative action research from a social-cultural perspective. Classroom implementation of formative assessment is viewed as interaction between this assessment method and the local learning culture. I first identify Yin and Buck's definition of the formative assessment, and then…
Smedema, Susan Miller; Ruiz, Derek; Mohr, Michael J.
Purpose: To evaluate the factorial and concurrent validity and internal consistency reliability of the World Health Organization Disability Assessment Schedule 2.0 (WHODAS 2.0) 12-item version in persons with spinal cord injuries. Method: Two hundred forty-seven adults with spinal cord injuries completed an online survey consisting of the WHODAS…
Hung, Pi-Hsia; Lin, Yu-Fen; Hwang, Gwo-Jen
Ubiquitous computing and mobile technologies provide a new perspective for designing innovative outdoor learning experiences. The purpose of this study is to propose a formative assessment design for integrating PDAs into ecology observations. Three learning activities were conducted in this study. An action research approach was applied to…
Tsang, Siny; Schmidt, Karen M.; Vincent, Gina M.; Salekin, Randall T.; Moretti, Marlene M.; Odgers, Candice L.
This study used an item response theory (IRT) model and a large adolescent sample of justice involved youth (N = 1,007, 38% female) to examine the item functioning of the Psychopathy Checklist – Youth Version (PCL: YV). Items that were most discriminating (or most sensitive to changes) of the latent trait (thought to be psychopathy) among adolescents included “Glibness/superficial charm”, “Lack of remorse”, and “Need for stimulation”, whereas items that were least discriminating included “Pathological lying”, “Failure to accept responsibility”, and “Lacks goals.” The items “Impulsivity” and “Irresponsibility” were the most likely to be rated high among adolescents, whereas “Parasitic lifestyle”, and “Glibness/superficial charm” were the most likely to be rated low. Evidence of differential item functioning (DIF) on four of the 13 items was found between boys and girls. “Failure to accept responsibility” and “Impulsivity” were endorsed more frequently to describe adolescent girls than boys at similar levels of the latent trait, and vice versa for “Grandiose sense of self-worth” and “Lacks goals.” The DIF findings suggest that four PCL: YV items function differently between boys and girls. PMID:25580672
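What it means for an item to be more or less discriminating can be shown with a small sketch. The abstract does not name the specific IRT model, so the code below uses a dichotomous two-parameter logistic (2PL) function purely as a simplified stand-in; the parameter values and item labels are hypothetical and are not estimates from the study.

```python
import math


def two_pl(theta, discrimination, difficulty):
    """Probability of endorsing an item under a 2PL model (illustrative stand-in)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical parameters: same difficulty, different discrimination.
items = {
    "high_discrimination_item": {"discrimination": 2.0, "difficulty": 0.5},
    "low_discrimination_item": {"discrimination": 0.6, "difficulty": 0.5},
}

# Compare how much the endorsement probability changes between two trait levels.
low_theta, high_theta = -0.5, 1.5
for name, params in items.items():
    change = two_pl(high_theta, **params) - two_pl(low_theta, **params)
    print(f"{name}: change in probability = {change:.3f}")
```

For the same shift in the latent trait, the item with the larger discrimination parameter shows the larger change in endorsement probability, which is the sense in which such items are "most sensitive to changes" in the trait.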
Rosenbluth, Glenn; Burman, Natalie J; Ranji, Sumant R; Boscardin, Christy K
Improving the quality of health care and education has become a mandate at all levels within the medical profession. While several published quality improvement (QI) assessment tools exist, all have limitations in addressing the range of QI projects undertaken by learners in undergraduate medical education, graduate medical education, and continuing medical education. We developed and validated a tool to assess QI projects with learner engagement across the educational continuum. After reviewing existing tools, we interviewed local faculty who taught QI to understand how learners were engaged and what these faculty wanted in an ideal assessment tool. We then developed a list of competencies associated with QI, established items linked to these competencies, revised the items using an iterative process, and collected validity evidence for the tool. The resulting Multi-Domain Assessment of Quality Improvement Projects (MAQIP) rating tool contains 9 items, with criteria that may be completely fulfilled, partially fulfilled, or not fulfilled. Interrater reliability was 0.77. Untrained local faculty were able to use the tool with minimal guidance. The MAQIP is a 9-item, user-friendly tool that can be used to assess QI projects at various stages and to provide formative and summative feedback to learners at all levels.
van der Linden, Willem J.; Vos, Hendrik J.; Chang, Lei
In judgmental standard setting experiments, it may be difficult to specify subjective probabilities that adequately take the properties of the items into account. As a result, these probabilities are not consistent with each other in the sense that they do not refer to the same borderline level of
McLaughlin, T.; Yan, Z.
This article is a review of literature on online formative assessment (OFA). It includes a narrative summary that synthesizes the research on the diverse delivery methods of OFA, as well as the empirical literature regarding the strong psychological benefits and limitations. Online formative assessment can be delivered using many traditional…
Beltran, Alicia; Knight Sepulveda, Karina; Watson, Kathy; Baranowski, Tom; Baranowski, Janice; Islam, Noemi; Missaghian, Mariam
Objective: Assess how 8- to 13-year-old children categorized and labeled food items for possible use as part of a food search strategy in a computerized 24-hour dietary recall. Design: A set of 62 cards with pictures and names of food items from 18 professionally defined food groups was sorted by each child into piles of similar food items.…
Kim, Sara; Brock, Douglas M; Hess, Brian J; Holmboe, Eric S; Gallagher, Thomas H; Lipner, Rebecca S; Mazor, Kathleen M
Little is known about the best approaches and format for measuring physicians' communication skills in an online environment. This study examines the reliability and validity of scores from two Web-based communication skill assessment formats. We created two online communication skill assessment formats: (a) MCQ (multiple-choice questions) consisting of video-based multiple-choice questions; (b) multi-format including video-based multiple-choice questions with rationales, Likert-type scales, and free text responses of what physicians would say to a patient. We randomized 100 general internists to each test format. Peer and patient ratings collected via the American Board of Internal Medicine (ABIM) served as validity sources. Seventy-seven internists completed the tests (MCQ: 38; multi-format: 39). The adjusted reliability was 0.74 for both formats. Excellent communicators, as based on their peer and patient ratings, performed slightly better on both tests than adequate communicators, though this difference was not statistically significant. Physicians in both groups rated test format innovative (4.2 out of 5.0). The acceptable reliability and participants' overall positive experiences point to the value of ongoing research into rigorous Web-based communication skills assessment. With efficient and reliable scoring, the Web offers an important way to measure and potentially enhance physicians' communication skills. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
Bai, Mei; Dixon, Jane K
The purpose of this study was to reexamine the factor pattern of the 12-item Functional Assessment of Chronic Illness Therapy-Spiritual Well-Being Scale (FACIT-Sp-12) using exploratory factor analysis in people newly diagnosed with advanced cancer. Principal components analysis (PCA) and 3 common factor analysis methods were used to explore the factor pattern of the FACIT-Sp-12. Factorial validity was assessed in association with quality of life (QOL). Principal factor analysis (PFA), iterative PFA, and maximum likelihood suggested retrieving 3 factors: Peace, Meaning, and Faith. Both Peace and Meaning positively related to QOL, whereas only Peace uniquely contributed to QOL. This study supported the 3-factor model of the FACIT-Sp-12. Suggestions for revision of items and further validation of the identified factor pattern were provided.
Monajemi, Alireza; Yaghmaei, Minoo
Most contemporary clinical reasoning tests typically assess non-automatic thinking. Therefore, a test is needed to measure automatic reasoning or pattern recognition, which has been largely neglected in clinical reasoning tests. The Puzzle Test (PT) is dedicated to assessing automatic clinical reasoning in routine situations. This test was first introduced in 2009 by Monajemi et al. in the Olympiad for Medical Sciences Students. PT is an item format that has gained acceptance in medical education, but no detailed guidelines exist for this test's format, construction and scoring. In this article, a format is described and the steps to prepare and administer valid and reliable PTs are presented. PT examines a specific clinical reasoning task: pattern recognition. PT does not replace other clinical reasoning assessment tools. However, it complements them in strategies for assessing comprehensive clinical reasoning.
Kleinman, Marjorie; Teresi, Jeanne A
Measures of the magnitude and impact of differential item functioning (DIF) at the item and scale levels, respectively, are presented and reviewed in this paper. Most measures are based on item response theory models. Magnitude refers to item level effect sizes, whereas impact refers to differences between groups at the scale score level. Reviewed are magnitude measures based on group differences in the expected item scores and impact measures based on differences in the expected scale scores. The similarities among these indices are demonstrated. Various software packages are described that provide magnitude and impact measures, and new software is presented that computes all of the available statistics conveniently in one program with explanations of their relationships to one another.
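The distinction between item-level magnitude and scale-level impact can be illustrated with a minimal sketch. It assumes dichotomous items under a two-parameter logistic model, which the abstract does not specify; the function names and all parameter values are hypothetical and are not taken from the software described above. Magnitude is computed as the mean group difference in expected item score, and impact as the mean group difference in expected scale score.

```python
import math

def p_correct(theta, a, b):
    """Expected score of a 0/1 item under a 2PL model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_dif_magnitude(thetas, ref_params, focal_params):
    """Mean difference (focal minus reference) in expected item score,
    averaged over a set of trait values."""
    a_r, b_r = ref_params
    a_f, b_f = focal_params
    diffs = [p_correct(t, a_f, b_f) - p_correct(t, a_r, b_r) for t in thetas]
    return sum(diffs) / len(diffs)

def expected_scale_score(theta, items):
    """Expected sum score: the sum of the expected item scores."""
    return sum(p_correct(theta, a, b) for a, b in items)

def scale_impact(thetas, ref_items, focal_items):
    """Mean difference (focal minus reference) in expected scale score."""
    diffs = [expected_scale_score(t, focal_items) - expected_scale_score(t, ref_items)
             for t in thetas]
    return sum(diffs) / len(diffs)

# Hypothetical (slope, difficulty) pairs estimated separately per group.
REF_ITEMS = [(1.0, 0.0), (1.2, -0.5), (0.8, 0.7)]
FOCAL_ITEMS = [(1.0, 0.5), (1.2, -0.5), (0.8, 0.4)]
THETAS = [-2.0, -1.0, 0.0, 1.0, 2.0]
```

Because the expected scale score is the sum of the expected item scores, the impact computed this way equals the sum of the item-level magnitudes. This is one simple sense in which the two kinds of index are related, and it means item-level effects of opposite sign can partly cancel at the scale level.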
Wati, F.; Sinaga, P.; Priyandoko, D.
The Programme for International Student Assessment (PISA) assesses students’ science literacy in real-life contexts and a wide variety of situations. Therefore, the results do not provide adequate information for teachers to probe students’ science literacy, because the range of materials taught at schools depends on the curriculum used. This study investigates how junior high school students in Indonesia solve PISA test items. Data were collected by administering PISA test items from the greenhouse unit to 36 ninth-grade students. Students’ answers were analyzed qualitatively for each item based on the competence tested in the problem. How students answer a problem exhibits their ability in a particular competence, which is influenced by a number of factors: students’ unfamiliarity with test construction, low reading performance, difficulty connecting the available information to the question, and limited ability to express their ideas effectively and legibly. In response, selected PISA test items can be used in line with the topic being taught to familiarize students with science literacy.
Adaptive screening for depression--recalibration of an item bank for the assessment of depression in persons with mental and somatic diseases and evaluation in a simulated computer-adaptive test environment.
Forkmann, Thomas; Kroehne, Ulf; Wirtz, Markus; Norra, Christine; Baumeister, Harald; Gauggel, Siegfried; Elhan, Atilla Halil; Tennant, Alan; Boecker, Maren
This study simulated computer-adaptive testing based on the Aachen Depression Item Bank (ADIB), which was developed for the assessment of depression in persons with somatic diseases. Prior to computer-adaptive test simulation, the ADIB was newly calibrated. Recalibration was performed in a sample of 161 patients treated for a depressive syndrome, 103 patients from cardiology, and 103 patients from otorhinolaryngology (mean age 44.1, SD=14.0; 44.7% female) and was cross-validated in a sample of 117 patients undergoing rehabilitation for cardiac diseases (mean age 58.4, SD=10.5; 24.8% women). Unidimensionality of the item bank was checked and a Rasch analysis was performed that evaluated local dependency (LD), differential item functioning (DIF), item fit and reliability. CAT simulation was conducted with the total sample and additional simulated data. Recalibration resulted in a strictly unidimensional item bank with 36 items, showing good Rasch model fit (item fit residualsLD. CAT simulation revealed that 13 items on average were necessary to estimate depression in the range of -2 to +2 logits when terminating at SE≤0.32 and 4 items if using SE≤0.50. Receiver Operating Characteristics analysis showed that θ estimates based on the CAT algorithm have good criterion validity with regard to depression diagnoses (Area Under the Curve≥.78 for all cut-off criteria). The recalibration of the ADIB succeeded and the simulation studies conducted suggest that it has good screening performance in the samples investigated and that it may reasonably add to the improvement of depression assessment. © 2013.
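A CAT simulation with a standard-error stopping rule, as described above, can be sketched as follows. This is a minimal illustration and not the authors' procedure. It assumes dichotomous Rasch items, a standard normal prior with expected a posteriori (EAP) scoring on a fixed grid, and selection of the unused item whose difficulty is closest to the current estimate. The 121-item bank and all names (`BANK`, `simulate_cat`) are hypothetical. Item counts from this sketch are therefore not comparable to the 13 and 4 items reported in the abstract.

```python
import math
import random

GRID = [-4.0 + 0.1 * k for k in range(81)]  # quadrature points for theta

def rasch_p(theta, b):
    """Rasch probability of endorsing an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(responses):
    """Posterior mean and SD of theta under a N(0, 1) prior.
    responses is a list of (difficulty, score) pairs with score 0 or 1."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalised prior density
        for b, x in responses:
            p = rasch_p(t, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(true_theta, bank, se_stop, rng):
    """Administer items until the posterior SD is <= se_stop or the bank is empty.
    Returns (theta estimate, standard error, number of items used)."""
    remaining = list(bank)
    responses = []
    theta, se = eap(responses)
    while remaining and se > se_stop:
        # Under the Rasch model the most informative item is the one
        # whose difficulty is closest to the current theta estimate.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        x = 1 if rng.random() < rasch_p(true_theta, b) else 0
        responses.append((b, x))
        theta, se = eap(responses)
    return theta, se, len(responses)

# Hypothetical bank: 121 difficulties evenly spaced from -3 to +3 logits.
BANK = [-3.0 + 0.05 * k for k in range(121)]
```

For example, `simulate_cat(0.5, BANK, 0.32, random.Random(0))` runs one simulated respondent. If the trait variance is scaled to 1, a standard error of 0.32 corresponds to a reliability of roughly 1 − 0.32² ≈ 0.90, and 0.50 to roughly 0.75, which is why the looser rule needs fewer items.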
Zuiker, Steven; Reid Whitaker, J.
This paper describes the 5E+I/A inquiry model and reports a case study of one curricular enactment by a US fifth-grade classroom. A literature review establishes the model's conceptual adequacy with respect to longstanding research related to both the 5E inquiry model and multiple, incremental innovations of it. As a collective line of research, the review highlights a common emphasis on formative assessment, at times coupled either with differentiated instruction strategies or with activities that target the generalization of learning. The 5E+I/A model contributes a multi-level assessment strategy that balances formative and summative functions of multiple forms of assessment in order to support classroom participation while still attending to individual achievement. The case report documents the enactment of a weeklong 5E+I/A curricular design as a preliminary account of the model's empirical adequacy. A descriptive and analytical narrative illustrates variable ways that multi-level assessment makes student thinking visible and pedagogical decision-making more powerful. In light of both, it also documents productive adaptations to a flexible curricular design and considers future research to advance this collective line of inquiry.
David M Condon
These data were collected to evaluate the structure of personality constructs in the temperament domain. In the context of modern personality theory, these constructs are typically construed in terms of the Big Five (Conscientiousness, Agreeableness, Neuroticism, Openness, and Extraversion), though several additional constructs were included here. Approximately 24,000 individuals were administered random subsets of 696 items from 92 public-domain personality scales using the Synthetic Aperture Personality Assessment method between December 8, 2013 and July 26, 2014. The data are available in rdata format and are accompanied by documentation stored as a text file. Re-use potential includes many types of structural and correlational analyses of personality.
Gifford, Katherine A; Liu, Dandan; Romano, Raymond; Jones, Richard N; Jefferson, Angela L
Subjective cognitive decline (SCD) may indicate unhealthy cognitive changes, but no standardized SCD measurement exists. This pilot study aims to identify reliable SCD questions. 112 cognitively normal (NC, 76±8 years, 63% female), 43 mild cognitive impairment (MCI; 77±7 years, 51% female), and 33 diagnostically ambiguous participants (79±9 years, 58% female) were recruited from a research registry and completed 57 self-report SCD questions. Psychometric methods were used for item-reduction. Factor analytic models assessed unidimensionality of the latent trait (SCD); 19 items were removed with extreme response distribution or trait-fit. Item response theory (IRT) provided information about question utility; 17 items with low information were dropped. Post-hoc simulation using computerized adaptive test (CAT) modeling selected the most commonly used items (n=9 of 21 items) that represented the latent trait well (r=0.94) and differentiated NC from MCI participants (F(1,146)=8.9, p=0.003). Item response theory and computerized adaptive test modeling identified nine reliable SCD items. This pilot study is a first step toward refining SCD assessment in older adults. Replication of these findings and validation with Alzheimer's disease biomarkers will be an important next step for the creation of a SCD screener.
Lesser, Lenard I; Wu, Leslie; Matthiessen, Timothy B; Luft, Harold S
To develop a technology-based method for evaluating the nutritional quality of chain-restaurant menus to increase the efficiency and lower the cost of large-scale data analysis of food items. Using a Modified Nutrient Profiling Index (MNPI), we assessed chain-restaurant items from the MenuStat database with a process involving three steps: (i) testing 'extreme' scores; (ii) crowdsourcing to analyse fruit, nut and vegetable (FNV) amounts; and (iii) analysis of the ambiguous items by a registered dietitian. In applying the approach to assess 22 422 foods, only 3566 could not be scored automatically based on MenuStat data and required further evaluation to determine healthiness. Items for which there was low agreement between trusted crowd workers, or where the FNV amount was estimated to be >40 %, were sent to a registered dietitian. Crowdsourcing was able to evaluate 3199, leaving only 367 to be reviewed by the registered dietitian. Overall, 7 % of items were categorized as healthy. The healthiest category was soups (26 % healthy), while desserts were the least healthy (2 % healthy). An algorithm incorporating crowdsourcing and a dietitian can quickly and efficiently analyse restaurant menus, allowing public health researchers to analyse the healthiness of menu items.
Anders Jönsson; David Rosenlund; Fredrik Alvén
The purpose of this study is to investigate the validity of using multiple-choice (MC) items as a complement to constructed-response (CR) items when making decisions about student performance on reasoning tasks. CR items from a national test in physics have been reformulated into MC items and students’ reasoning skills have been analyzed in two substudies. In the first study, 12 students answered the MC items and were asked to explain their answers orally. In the second study, 102 students fr...
Schmitt, Frederick A; Aarsland, Dag; Brønnick, Kolbjørn S; Meng, Xiangyi; Tekin, Sibel; Olin, Jason T
Rivastigmine has been shown to improve cognition in patients with Parkinson's disease dementia (PDD). To further explore the impact of anticholinesterase therapy on PDD, Alzheimer's Disease Assessment Scale-cognitive subscale (ADAS-cog) items were assessed in a retrospective analysis of a 24-week, double-blind, placebo-controlled trial of rivastigmine. Mean changes from baseline at week 24 were calculated for ADAS-cog item scores and for 3 cognitive domain scores. A total of 362 patients were randomized to 3 to 12 mg/d rivastigmine capsules and 179 to placebo. Patients with PDD receiving rivastigmine improved versus placebo on items: word recall, following commands, ideational praxis, remembering test instructions, and comprehension of spoken language (P ADAS-cog is sensitive to broad cognitive changes in PDD. Overall, rivastigmine was associated with improvements on individual cognitive items and general cognitive domains.
Van Geert, Eline; Orhon, Altan; Cioca, Iulia A; Mamede, Rui; Golušin, Slobodan; Hubená, Barbora; Morillo, Daniel
Self-report personality questionnaires, traditionally offered in a graded-scale format, are widely used in high-stakes contexts such as job selection. However, job applicants may intentionally distort their answers when filling in these questionnaires, undermining the validity of the test results. Forced-choice questionnaires are allegedly more resistant to intentional distortion compared to graded-scale questionnaires, but they generate ipsative data. Ipsativity violates the assumptions of classical test theory, distorting the reliability and construct validity of the scales, and producing interdependencies among the scores. This limitation is overcome in the current study by using the recently developed Thurstonian item response theory model. As online testing in job selection contexts is increasing, the focus will be on the impact of intentional distortion on personality questionnaire data collected online. The present study intends to examine the effect of three different variables on intentional distortion: (a) test format (graded-scale versus forced-choice); (b) culture, as data will be collected in three countries differing in their attitudes toward intentional distortion (the United Kingdom, Serbia, and Turkey); and (c) cognitive ability, as a possible predictor of the ability to choose the more desirable responses. Furthermore, we aim to integrate the findings using a comprehensive model of intentional distortion. In the Anticipated Results section, three main aspects are considered: (a) the limitations of the manipulation, theoretical approach, and analyses employed; (b) practical implications for job selection and for personality assessment in a broader sense; and (c) suggestions for further research.
Conway, Lauryn; Widjaja, Elysa; Smith, Mary Lou
The current study investigated the psychometric properties of a single-item quality of life (QOL) measure, the Global Quality of Life in Childhood Epilepsy question (G-QOLCE), in children with drug-resistant epilepsy. Data came from the Impact of Pediatric Epilepsy Surgery on Health-Related Quality of Life Study (PESQOL), a multicenter prospective cohort study (n = 118) with observations collected at baseline and at 6 months of follow-up on children aged 4-18 years. QOL was measured with the QOLCE-76 and KIDSCREEN-27. The G-QOLCE was an overall QOL question derived from the QOLCE-76. Construct validity and reliability were assessed with Spearman's correlation and intraclass correlation coefficient (ICC). Responsiveness was examined through distribution-based and anchor-based methods. The G-QOLCE showed moderate (r ≥ 0.30) to strong (r ≥ 0.50) correlations with composite scores, and most subscales of the QOLCE-76 and KIDSCREEN-27 at baseline and 6-month follow-up. The G-QOLCE had moderate test-retest reliability (ICC range: 0.49-0.72) and was able to detect clinically important change in patients' QOL (standardized response mean: 0.38; probability of change: 0.65; Guyatt's responsiveness statistics: 0.62 and 0.78). Caregiver anxiety and family functioning contributed most strongly to G-QOLCE scores over time. Results offer promising preliminary evidence regarding the validity, reliability, and responsiveness of the proposed single-item QOL measure. The G-QOLCE is a potentially useful tool that can be feasibly administered in a busy clinical setting to evaluate clinical status and impact of treatment outcomes in pediatric epilepsy.
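Of the responsiveness statistics listed above, the standardized response mean (SRM) is the simplest to state: the mean change score divided by the standard deviation of the change scores. A minimal sketch follows; the function name and the scores are invented for illustration and are unrelated to the PESQOL data.

```python
import statistics

def standardized_response_mean(baseline, follow_up):
    """Mean of the paired change scores divided by their sample SD."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)

# Invented paired scores for four respondents.
BASELINE = [1, 2, 3, 4]
FOLLOW_UP = [2, 4, 4, 5]
```

Because the denominator is the variability of change rather than of the baseline scores, the SRM is a distribution-based index and can differ noticeably from an effect size computed against the baseline standard deviation.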
Chen, Cheng-Te; Chen, Yu-Lan; Lin, Yu-Ching; Hsieh, Ching-Lin; Tzeng, Jeng-Yi
Objective: The purpose of this study was to construct a computerized adaptive test (CAT) for measuring self-care performance (the CAT-SC) in children with developmental disabilities (DD) aged from 6 months to 12 years in a content-inclusive, precise, and efficient fashion. Methods: The study was divided into 3 phases: (1) item bank development, (2) item testing, and (3) a simulation study to determine the stopping rules for the administration of the CAT-SC. A total of 215 caregivers of children with DD were interviewed with the 73-item CAT-SC item bank. An item response theory model was adopted for examining the construct validity to estimate item parameters after investigation of the unidimensionality, equality of slope parameters, item fitness, and differential item functioning (DIF). In the last phase, the reliability and concurrent validity of the CAT-SC were evaluated. Results: The final CAT-SC item bank contained 56 items. The stopping rules suggested were (a) reliability coefficient greater than 0.9 or (b) 14 items administered. The results of simulation also showed that 85% of the estimated self-care performance scores would reach a reliability higher than 0.9 with a mean test length of 8.5 items, and the mean reliability for the rest was 0.86. Administering the CAT-SC could reduce the number of items administered by 75% to 84%. In addition, self-care performances estimated by the CAT-SC and the full item bank were very similar to each other (Pearson r = 0.98). Conclusion: The newly developed CAT-SC can efficiently measure self-care performance in children with DD whose performances are comparable to those of TD children aged from 6 months to 12 years as precisely as the whole item bank. The item bank of the CAT-SC has good reliability and a unidimensional self-care construct, and the CAT can estimate self-care performance with less than 25% of the items in the item bank. Therefore, the CAT-SC could be useful for measuring self-care performance in children with
Haag, Nicole; Heppt, Birgit; Roppelt, Alexander; Stanat, Petra
In large-scale assessment studies, language minority students typically obtain lower test scores in mathematics than native speakers. Although this performance difference was related to the linguistic complexity of test items in some studies, other studies did not find linguistically demanding math items to be disproportionally more difficult for…
Elliott, Stephen N.; Kettler, Ryan J.; Beddow, Peter A.; Kurz, Alexander; Compton, Elizabeth; McGrath, Dawn; Bruen, Charles; Hinton, Kent; Palmer, Porter; Rodriguez, Michael C.; Bolt, Daniel; Roach, Andrew T.
This study investigated the effects of using modified items in achievement tests to enhance accessibility. An experiment determined whether tests composed of modified items would reduce the performance gap between students eligible for an alternate assessment based on modified achievement standards (AA-MAS) and students not eligible, and the…
Kisala, Pamela A; Tulsky, David S; Pace, Natalie; Victorson, David; Choi, Seung W; Heinemann, Allen W
To develop a calibrated item bank and computer adaptive test (CAT) to assess the effects of stigma on health-related quality of life in individuals with spinal cord injury (SCI). Grounded-theory based qualitative item development methods, large-scale item calibration field testing, confirmatory factor analysis, and item response theory (IRT)-based psychometric analyses. Five SCI Model System centers and one Department of Veterans Affairs medical center in the United States. Adults with traumatic SCI. SCI-QOL Stigma Item Bank. A sample of 611 individuals with traumatic SCI completed 30 items assessing SCI-related stigma. After 7 items were iteratively removed, factor analyses confirmed a unidimensional pool of items. Graded Response Model IRT analyses were used to estimate slopes and thresholds for the final 23 items. The SCI-QOL Stigma item bank is unique not only in the assessment of SCI-related stigma but also in the inclusion of individuals with SCI in all phases of its development. Use of confirmatory factor analytic and IRT methods provides flexibility and precision of measurement. The item bank may be administered as a CAT or as a 10-item fixed-length short form and can be used for research and clinical applications.
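To show what the slopes and thresholds of a Graded Response Model encode, here is a minimal sketch of the category probabilities for one polytomous item. It uses the standard logistic form of the model. The function name and the parameter values are hypothetical and are not estimates from the SCI-QOL Stigma item bank.

```python
import math

def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under the Graded Response Model.
    thresholds must be in ascending order; an item with m thresholds
    has m + 1 ordered response categories."""
    # Cumulative probabilities P(X >= k), padded with P(X >= 0) = 1
    # and P(X >= m + 1) = 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cum.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical item: slope 1.5 and three thresholds (four response categories).
EXAMPLE_PROBS = grm_category_probs(0.0, 1.5, [-1.0, 0.0, 1.0])
```

A larger slope makes the cumulative curves steeper, so the item separates respondents near its thresholds more sharply. The thresholds locate where on the trait continuum each higher response category becomes more likely than not.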
Sharkey, Nancy S.; Murnane, Richard J.
A growing number of school districts in the United States are introducing formative assessment systems to measure student skills in core subjects throughout the year. The underlying logic is that providing teachers with timely information on student skills will enable them to improve instruction and better prepare students to excel on high-stakes,…
In this commentary, I interpret Xinying Yin and Gayle Ann Buck's collaborative action research from a social-cultural perspective. Classroom implementation of formative assessment is viewed as interaction between this assessment method and the local learning culture. I first identify Yin and Buck's definition of formative assessment, and then analyze the role of formative assessment in the change of local learning culture. Based on the practice of Yin and Buck, I emphasize the significance of their "bottom up" strategy to the teachers' epistemological change. I believe that this strategy may provide practicable solutions to current Chinese educational problems as well as a means for science educators to shift toward systematic professional development.
Berk, Eric J. Vanden; Lohman, David F.; Cassata, Jennifer Coyne
Assessing the construct relevance of mental test results continues to present many challenges, and it has proven to be particularly difficult to assess the construct relevance of verbal items. This study was conducted to gain a better understanding of the conceptual sources of verbal item difficulty using a unique approach that integrates…
Multani, Namita; Rudzicz, Frank; Wong, Wing Yiu Stephanie; Namasivayam, Aravind Kumar; van Lieshout, Pascal
Purpose: Random item generation (RIG) involves central executive functioning. Measuring aspects of random sequences can therefore provide a simple method to complement other tools for cognitive assessment. We examine the extent to which RIG relates to specific measures of cognitive function, and whether those measures can be estimated using RIG…
Jing Jing, Ma
One of the key aims of formative assessment in higher education is to enable students to become self-regulated learners (Nicol & Macfarlane-Dick, 2006). Based on Nicol and Macfarlane-Dick's (2006) framework, this exploratory study investigates which formative assessment practices proposed by them were used by one college EFL writing teacher to…
Crins, Martine H P; Terwee, Caroline B; Klausch, Thomas; Smits, Niels; de Vet, Henrica C W; Westhovens, Rene; Cella, David; Cook, Karon F; Revicki, Dennis A; van Leeuwen, Jaap; Boers, Maarten; Dekker, Joost; Roorda, Leo D
The objective of this study was to assess the psychometric properties of the Dutch-Flemish Patient-Reported Outcomes Measurement Information System (PROMIS) Physical Function item bank in Dutch patients with chronic pain. A bank of 121 items was administered to 1,247 Dutch patients with chronic pain. Unidimensionality was assessed by fitting a one-factor confirmatory factor analysis and evaluating resulting fit statistics. Items were calibrated with the graded response model and its fit was evaluated. Cross-cultural validity was assessed by testing items for differential item functioning (DIF) based on language (Dutch vs. English). Construct validity was evaluated by calculating correlations between scores on the Dutch-Flemish PROMIS Physical Function measure and scores on generic and disease-specific measures. Results supported the Dutch-Flemish PROMIS Physical Function item bank's unidimensionality (Comparative Fit Index = 0.976, Tucker Lewis Index = 0.976) and model fit. Item thresholds targeted a wide range of the physical function construct (threshold-parameters range: -4.2 to 5.6). Cross-cultural validity was good, as only four items showed DIF for language and their impact on item scores was minimal. Physical Function scores were strongly associated with scores on all other measures (all correlations ≤ -0.60 as expected). The Dutch-Flemish PROMIS Physical Function item bank exhibited good psychometric properties. Development of a computer adaptive test based on the large bank is warranted. Copyright © 2017 Elsevier Inc. All rights reserved.
Nikolic, Sasha; Stirling, David; Ros, Montserrat
Obtaining oral communication competency is an important skill for engineering students to prepare them for interacting and working in any professional setting. For engineers, it is also important to be able to present technical information to non-technical audiences. To ensure oral competency, a non-graded formative assessment approach using video with self- and peer assessment was introduced into a final-year engineering thesis course. A low workload approach was used due to growing student numbers and higher pressures on academic staff. A quasi-experimental design was used to investigate the differences between traditional delivery, self-assessment and combined self-assessment with peer feedback. The study found that the formative models were seen by students to help develop their presentation skills. However, the results showed no significant improvement compared to the traditional method. This could be due to previous presentation practice within the degree or, more probably, to the lack of incentive for weaker students to engage and improve due to the ungraded nature of the activity.
Takashima, A.; Jensen, O.; Oostenveld, R.; Maris, E.G.G.; Coevering, M. van de; Fernandez, G.S.E.
The aim of the present study was to investigate the spatio-temporal characteristics of the neural correlates of declarative memory formation as assessed by the subsequent memory effect, i.e. the difference in encoding activity between subsequently remembered and subsequently forgotten items.
van den Berg, M.; Harskamp, E.G.; Suhre, C.J.M.
In the last two decades Dutch primary school students scored below expectation in international mathematics tests. An explanation for this may be that teachers fail to adequately assess their students’ understanding of learning goals and provide timely feedback. To improve the teachers’ formative
Muhammad Tufail Chandio
English is a second language (L2) in Sindh, Pakistan. Most of the public sector schools in Sindh teach English as a subject rather than a language. Besides, they do not distinguish between generic pedagogy and distinctive approaches used for teaching English as a first language (L1) and second language (L2). In addition, the erroneous traditional assessment focuses only on writing and reading skills, while the listening and speaking skills of L2 remain excluded. There is a great emphasis on summative assessments, which contribute to a qualification; however, formative assessments, which provide timely and continuous appraisal and feedback, remain ignored. Summative assessment employs only paper-and-pencil-based tests, while other current means of alternative assessment, such as self-assessment, peer-assessment, and portfolio assessment, have not yet been incorporated or explored. Teaching English as a subject rather than as a language, employing summative rather than formative assessment, depending on paper-and-pencil-based tests, and not using the alternative modes of assessment are some of the questions this study will deal with. The study under discussion suggests that current approaches employed for teaching English are misplaced, as these take a subject teaching approach rather than a language teaching approach. It also argues for a paradigm shift from a product to a process approach to assessment by administering modern alternative assessments.
Lea Bonner, C; Staton, April G; Naro, Patricia B; McCullough, Elizabeth; Lynn Stevenson, T; Williamson, Margaret; Sheffield, Melody C; Miller, Mindi; Fetterman, James W; Fan, Shirley; Momary, Kathryn M
Experiential pharmacy preceptors should provide formative and summative feedback during a learning experience. Preceptors are required to provide colleges and schools of pharmacy with assessments or evaluations of students' performance. Students and experiential programs value on-time completion of midpoint evaluations by preceptors. The objective of this study was to determine the number of on-time electronically documented formative midpoint evaluations completed by preceptors during advanced pharmacy practice experiences (APPEs). Compliance rates of on-time electronically documented formative midpoint evaluations were reviewed by the Office of Experiential Education of a five-member consortium during the two-year study period prior to the adoption of Standards 2016. Pearson chi-square test and generalized linear models were used to determine if statistically significant differences were present. Average midpoint compliance rates for the two-year research period were 40.7% and 41% respectively. No statistical significance was noted comparing compliance rates for year one versus year two. However, statistical significance was present when comparing compliance rates between schools during year two. Feedback from students and preceptors pointed to the need for brief formal midpoint evaluations that require minimal time to complete, user friendly experiential management software, and methods for documenting verbal feedback through student self-reflection. Additional education and training to both affiliate and faculty preceptors on the importance of written formative feedback at midpoint is critical to remaining in compliance with Standards 2016. Copyright © 2017 Elsevier Inc. All rights reserved.
Melki Hasan; S. Bouzid Mohamed; Haweni Aymen; Fadhloun Mourad; Mrayeh Meher; Souissi Nizar
Purpose: This article is based on questions related to the formative assessment of the preparatory traineeship in the professional life of Physical Education teachers. In general, in the first training program, the traineeship represents an integral part of training. In this sense, the traineeship offers a vital opportunity for future teachers to gain practical experience in the real environment, given that formative evaluation is a process of collecting evidence from trainees by cooperative teac...
Drechsel, Barbara; Carstensen, Claus; Prenzel, Manfred
This paper focuses on interest in science as one of the attitudinal aspects of scientific literacy. Large-scale data from the Programme for International Student Assessment (PISA) 2006 are analysed in order to describe student interest more precisely. So far the analyses have provided a general indicator of interest, aggregated over all contexts and contents in the science test. With its innovative approach PISA embeds interest items within the cognitive test unit and its contents and contexts. The main difference from conventional interest measures is that in most questionnaires, a relatively small number of interest items cover broad fields of contents and contexts. The science units represent a number of systematically differentiated scientific contexts and contents. The units' stimulus texts allow for concrete descriptions of relevant content aspects, applications, and contexts. In the analyses, multidimensional item response models are applied in order to disentangle student interest. The results indicate that multidimensional models fit the data. A two-dimensional model separating interest into two different knowledge of science dimensions described in the PISA science framework is further analysed with respect to gender, performance differences, and country. The findings give a comprehensive description of students' interest in science. The paper deals with methodological problems and describes requirements of the test construction for further assessments. The results are discussed with regard to their significance for science education.
Sabel, Jaime L.; Forbes, Cory T.; Zangori, Laura
To support elementary students' learning of core, standards-based life science concepts highlighted in the Next Generation Science Standards, prospective elementary teachers should develop an understanding of life science concepts and learn to apply their content knowledge in instructional practice to craft elementary science learning environments grounded in students' thinking. To do so, teachers must learn to use high-leverage instructional practices, such as formative assessment, to engage students in scientific practices and connect instruction to students' ideas. However, teachers may not understand formative assessment or possess sufficient science content knowledge to effectively engage in related instructional practices. To address these needs, we developed and conducted research within an innovative course for preservice elementary teachers built upon two pillars: life science concepts and formative assessment. An embedded mixed methods study was used to evaluate the effect of the intervention on preservice teachers' (n = 49) content knowledge and ability to engage in formative assessment practices for science. Findings showed that increased life science content knowledge over the semester helped preservice teachers engage more productively in anticipating and evaluating students' ideas, but not in identifying effective instructional strategies to respond to those ideas.
Byram, Jessica N; Seifert, Mark F; Brooks, William S; Fraser-Cotlin, Laura; Thorp, Laura E; Williams, James M; Wilson, Adam B
With integrated curricula and multidisciplinary assessments becoming more prevalent in medical education, there is a continued need for educational research to explore the advantages, consequences, and challenges of integration practices. This retrospective analysis investigated the number of items needed to reliably assess anatomical knowledge in the context of gross anatomy and histology. A generalizability analysis was conducted on gross anatomy and histology written and practical examination items that were administered in a discipline-based format at Indiana University School of Medicine and in an integrated fashion at the University of Alabama School of Medicine and Rush University Medical College. Examination items were analyzed using a partially nested design s×(i:o) in which items were nested within occasions (i:o) and crossed with students (s). A reliability standard of 0.80 was used to determine the minimum number of items needed across examinations (occasions) to make reliable and informed decisions about students' competence in anatomical knowledge. Decision study plots are presented to demonstrate how the number of items per examination influences the reliability of each administered assessment. Using the example of a curriculum that assesses gross anatomy knowledge over five summative written and practical examinations, the results of the decision study estimated that 30 and 25 items would be needed on each written and practical examination to reach a reliability of 0.80, respectively. This study is particularly relevant to educators who may question whether the amount of anatomy content assessed in multidisciplinary evaluations is sufficient for making judgments about the anatomical aptitude of students. Anat Sci Educ 10: 109-119. © 2016 American Association of Anatomists.
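The decision-study logic described in this abstract can be illustrated with a minimal sketch. It assumes the standard relative generalizability coefficient for a s×(i:o) design, in which relative error variance is the student-by-occasion component divided by the number of occasions plus the residual component divided by the total number of items. The function names and the variance components below are hypothetical, chosen only for illustration; they are not values reported by the study.

```python
def g_coefficient(var_s, var_so, var_res, n_items, n_occasions):
    """Relative G coefficient for a students x (items:occasions) design.

    var_s   : student (universe-score) variance
    var_so  : student-by-occasion interaction variance
    var_res : student-by-item-within-occasion variance, confounded with error
    """
    rel_error = var_so / n_occasions + var_res / (n_items * n_occasions)
    return var_s / (var_s + rel_error)


def min_items(var_s, var_so, var_res, n_occasions, target=0.80, max_items=500):
    """Smallest number of items per occasion that reaches the target, or None."""
    for n_items in range(1, max_items + 1):
        if g_coefficient(var_s, var_so, var_res, n_items, n_occasions) >= target:
            return n_items
    return None


# Hypothetical variance components (NOT taken from the study).
components = dict(var_s=0.010, var_so=0.004, var_res=0.200)

# Items needed per examination when knowledge is assessed over five occasions.
needed = min_items(n_occasions=5, **components)
print(needed)
```

Scanning upward from one item mirrors how a decision-study plot is read: reliability rises with the number of items per examination, and the first value that crosses the 0.80 standard is the minimum.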
Eutalia Aparecida Candido de Araujo
The concern with measuring psychological traits is an old one, and many studies and proposed methods have been developed to achieve this goal. Among these proposals, Item Response Theory (IRT) stands out; it initially emerged to address limitations of Classical Test Theory, which is still widely used today in the measurement of psychological traits. The main point of IRT is that it takes each item into account individually rather than relying on total scores; therefore, the findings do not depend only on the test or questionnaire, but on each item that composes it. This article presents this theory, which revolutionized measurement theory.
Kang, Hyeon-Ah; Su, Ya-Hui; Chang, Hua-Hua
A monotone relationship between a true score (τ) and a latent trait level (θ) has been a key assumption for many psychometric applications. The monotonicity property in dichotomous response models is evident as a result of a transformation via a test characteristic curve. Monotonicity in polytomous models, in contrast, is not immediately obvious because item response functions are determined by a set of response category curves, which are conceivably non-monotonic in θ. The purpose of the present note is to demonstrate strict monotonicity in ordered polytomous item response models. Five models that are widely used in operational assessments are considered for proof: the generalized partial credit model (Muraki, 1992, Applied Psychological Measurement, 16, 159), the nominal model (Bock, 1972, Psychometrika, 37, 29), the partial credit model (Masters, 1982, Psychometrika, 47, 147), the rating scale model (Andrich, 1978, Psychometrika, 43, 561), and the graded response model (Samejima, 1972, A general model for free-response data (Psychometric Monograph no. 18). Psychometric Society, Richmond). The study asserts that the item response functions in these models strictly increase in θ and thus there exists strict monotonicity between τ and θ under certain specified conditions. This conclusion validates the practice of customarily using τ in place of θ in applied settings and provides theoretical grounds for one-to-one transformations between the two scales. © 2018 The British Psychological Society.
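As a numerical illustration of the monotonicity claim (not a substitute for the proof in the note), the sketch below computes the expected item score under the generalized partial credit model and checks that it increases over a grid of θ values. The item parameters are hypothetical. The true score τ would be the sum of such expected scores over all items on a test.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities P_0..P_m under the generalized partial credit model."""
    # Cumulative logits: 0 for category 0, then running sums of a * (theta - b_v).
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, a, thresholds):
    """Expected item score: sum over categories of k * P_k(theta)."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, thresholds)))


# Hypothetical item with four categories; the step parameters are
# deliberately disordered to show that ordering is not required here.
a, thresholds = 1.2, [-1.0, 0.5, 0.0]

grid = [x / 10 for x in range(-40, 41)]
scores = [expected_score(t, a, thresholds) for t in grid]
is_increasing = all(s1 < s2 for s1, s2 in zip(scores, scores[1:]))
print(is_increasing)
```

Under this model the derivative of the expected score with respect to θ equals the discrimination times the variance of the category score, which is positive whenever a > 0. That is why individual category curves can be non-monotonic while the expected score still increases.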
Kazman, Josh B; Scott, Jonathan M; Deuster, Patricia A
The limitations for self-reporting of dietary patterns are widely recognised as a major vulnerability of FFQ and the dietary screeners/scales derived from FFQ. Such instruments can yield inconsistent results to produce questionable interpretations. The present article discusses the value of psychometric approaches and standards in addressing these drawbacks for instruments used to estimate dietary habits and nutrient intake. We argue that a FFQ or screener that treats diet as a 'latent construct' can be optimised for both internal consistency and the value of the research results. Latent constructs, a foundation for item response theory (IRT)-based scales (e.g. Patient Reported Outcomes Measurement Information System) are typically introduced in the design stage of an instrument to elicit critical factors that cannot be observed or measured directly. We propose an iterative approach that uses such modelling to refine FFQ and similar instruments. To that end, we illustrate the benefits of psychometric modelling by using items and data from a sample of 12 370 Soldiers who completed the 2012 US Army Global Assessment Tool (GAT). We used factor analysis to build the scale incorporating five out of eleven survey items. An IRT-driven assessment of response category properties indicates likely problems in the ordering or wording of several response categories. Group comparisons, examined with differential item functioning (DIF), provided evidence of scale validity across each Army sub-population (sex, service component and officer status). Such an approach holds promise for future FFQ.
Deane, Richard P
Team Objective Structured Bedside Assessment (TOSBA) is a learning approach in which a team of medical students undertake a set of structured clinical tasks with real patients in order to reach a diagnosis and formulate a management plan and receive immediate feedback on their performance from a facilitator. TOSBA was introduced as formative assessment to an 8-week undergraduate teaching programme in Obstetrics and Gynaecology (O&G) in 2013/14. Each student completed 5 TOSBA sessions during the rotation. The aim of the study was to evaluate TOSBA as a teaching method to provide formative assessment for medical students during their clinical rotation. The research questions were: Does TOSBA improve clinical, communication and/or reasoning skills? Does TOSBA provide quality feedback?
Gierl, Mark J; Lai, Hollis
Computerised assessment raises formidable challenges because it requires large numbers of test items. Automatic item generation (AIG) can help address this test development problem because it yields large numbers of new items both quickly and efficiently. To date, however, the quality of the items produced using a generative approach has not been evaluated. The purpose of this study was to determine whether automatic processes yield items that meet standards of quality that are appropriate for medical testing. Quality was evaluated firstly by subjecting items created using both AIG and traditional processes to rating by a four-member expert medical panel using indicators of multiple-choice item quality, and secondly by asking the panellists to identify which items were developed using AIG in a blind review. Fifteen items from the domain of therapeutics were created in three different experimental test development conditions. The first 15 items were created by content specialists using traditional test development methods (Group 1 Traditional). The second 15 items were created by the same content specialists using AIG methods (Group 1 AIG). The third 15 items were created by a new group of content specialists using traditional methods (Group 2 Traditional). These 45 items were then evaluated for quality by a four-member panel of medical experts and were subsequently categorised as either Traditional or AIG items. Three outcomes were reported: (i) the items produced using traditional and AIG processes were comparable on seven of eight indicators of multiple-choice item quality; (ii) AIG items can be differentiated from Traditional items by the quality of their distractors, and (iii) the overall predictive accuracy of the four expert medical panellists was 42%. Items generated by AIG methods are, for the most part, equivalent to traditionally developed items from the perspective of expert medical reviewers. While the AIG method produced comparatively fewer plausible
Ingham, Gerard; Fry, Jennifer; Morgan, Simon; Ward, Bernadette
Workplace-based formative assessments using consultation observation are currently conducted during the Australian general practice training program. Assessment reliability is improved by using multiple assessment methods. The aim of this study was to explore experiences of general practice medical educator assessors and registrars (trainees) when adding random case analysis to direct observation (ARCADO) during formative workplace-based assessments. A sample of general practice medical educators and matched registrars were recruited. Following the ARCADO workplace assessment, semi-structured qualitative interviews were conducted. The data was analysed thematically. Ten registrars and eight medical educators participated. Four major themes emerged - formative versus summative assessment; strengths (acceptability, flexibility, time efficiency, complementarity and authenticity); weaknesses (reduced observation and integrity risks); and contextual factors (variation in assessment content, assessment timing, registrar-medical educator relationship, medical educator's approach and registrar ability). ARCADO is a well-accepted workplace-based formative assessment perceived by registrars and assessors to be valid and flexible. The use of ARCADO enabled complementary insights that would not have been achieved with direct observation alone. Whilst there are some contextual factors to be considered in its implementation, ARCADO appears to have utility as formative assessment and, subject to further evaluation, high-stakes assessment.
Hemert, Dianne A. van; Baerveldt, Chris; Vermande, Marjolijn
A method is presented for evaluating the presence and size of cross-cultural item biases. The examined items concern parental support and family cohesion in a Likert-type questionnaire for adolescents in The Netherlands. Each evaluated item has two versions, a collectivist and an individualistic one, that measure the same theoretical construct. The standardized difference between the score means of the item versions, called the ?e score, gives an indication of the cultural bias of the item. As...
Aybek, Eren Can; Demirtasli, R. Nukhet
This article aims to provide a theoretical framework for computerized adaptive tests (CAT) and item response theory models for polytomous items. Besides that, it aims to introduce the simulation and live CAT software to the related researchers. Computerized adaptive test algorithm, assumptions of item response theory models, nominal response…
Development of a Web-Based Formative Assessment Model to Improve Students' Understanding of Physics Concepts. Abstract: There are two approaches to learning assessment, formative and summative. Formative assessment is appropriate because it involves students directly in the learning process and can improve their conceptual understanding. Limited class time makes this process difficult, so a formative assessment model that works both online and offline and provides rapid feedback to teachers and students is needed. The goal of this research is to produce a web-based formative assessment model for physics. The study used a research and development design for the formative assessment model. A questionnaire was used for product validation, covering the textbook, the pre- and post-learning quiz instruments, and the web product. The quantitative analysis shows that the developed product is valid without any revision. Based on the qualitative data, the product was revised following comments and suggestions from the expert validators, teachers, and students. Product testing shows that the formative assessment model may improve students' conceptual comprehension. Key Words: formative assessment model, students' conceptual comprehension of physics, web-based
Mellenbergh, Gideon J.; van der Linden, Wim J.
Three item selection methods for criterion-referenced tests are examined: the classical theory of item difficulty and item-test correlation; the latent trait theory of item characteristic curves; and a decision-theoretic approach for optimal item selection. Item contribution to the standardized expected utility of mastery testing is discussed. (CM)
Cher Wong, Cheow
Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like…
Meijer, R.R.; Egberink, I.J.L.; Emons, Wilco H.M.; Sijtsma, Klaas
We illustrate the usefulness of person-fit methodology for personality assessment. For this purpose, we use person-fit methods from item response theory. First, we give a nontechnical introduction to existing person-fit statistics. Second, we analyze data from Harter's (1985) Self-Perception Profile
Mark, Kristen P; Herbenick, Debby; Fortenberry, J Dennis; Sanders, Stephanie; Reece, Michael
This study was designed to systematically compare and contrast the psychometric properties of three scales developed to measure sexual satisfaction and a single-item measure of sexual satisfaction. The Index of Sexual Satisfaction (ISS), Global Measure of Sexual Satisfaction (GMSEX), and the New Sexual Satisfaction Scale-Short (NSSS-S) were compared to one another and to a single-item measure of sexual satisfaction. Conceptualization of the constructs, distribution of scores, internal consistency, convergent validity, test-retest reliability, and factor structure were compared between the measures. A total of 211 men and 214 women completed the scales and a measure of relationship satisfaction, with 33% (n = 139) of the sample reassessed two months later. All scales demonstrated appropriate distribution of scores and adequate internal consistency. The GMSEX, NSSS-S, and the single-item measure demonstrated convergent validity. Test-retest reliability was demonstrated by the ISS, GMSEX, and NSSS-S, but not the single-item measure. Taken together, the GMSEX received the strongest psychometric support in this sample for a unidimensional measure of sexual satisfaction and the NSSS-S received the strongest psychometric support in this sample for a bidimensional measure of sexual satisfaction.
Bichi, Ado Abdu; Hafiz, Hadiza; Bello, Samira Abdullahi
High-stakes testing is used for the purposes of providing results that have important consequences. Validity is the cornerstone upon which all measurement systems are built. This study applied the Item Response Theory principles to analyse Northwest University Kano Post-UTME Economics test items. The developed fifty (50) economics test items was…
Kujačić Momčilo D.
Delivery of postal items is the last phase in the postal conveyance process. This phase accounts for up to 57% of the total cost of conveying postal items. In order to reduce the costs of the delivery phase, postal organizations apply different methods and techniques. Legal and technological regulations, along with various restrictions on the selection and deployment of employees, influence the choice of appropriate methods. The principle of availability of the universal postal service is also an essential factor in defining the optimal model. This paper proposes a model for assessing and planning the number of employees in the delivery service of the observed postal operator, with respect to the principles of productivity and the accessibility constraints of the universal postal service. It also analyzes the impact of daily fluctuations in the number of full-time employees and the possibility of hiring part-time workers on days with increased traffic volume in the delivery of items, when items from large customers are usually delivered.
Scheuneman, Janice Dowd; Gerritz, Kalle
Differential item functioning (DIF) methodology for revealing sources of item difficulty and performance characteristics of different groups was explored. A total of 150 Scholastic Aptitude Test items and 132 Graduate Record Examination general test items were analyzed. DIF was evaluated for males and females and Blacks and Whites. (SLD)
Locating an item on an achievement continuum (item mapping) is well-established in technical work for educational/psychological assessment. Applications of item mapping may be found in criterion-referenced (CR) testing (or scale anchoring, Beaton and Allen, 1992; Huynh, 1994, 1998a, 2000a, 2000b, 2006), computer-assisted testing, test form assembly, and in standard setting methods based on ordered test booklets. These methods include the bookmark standard setting originally used for the CTB/TerraNova tests (Lewis, Mitzel, Green, and Patz, 1999), the item descriptor process (Ferrara, Perie, and Johnson, 2002) and a similar process described by Wang (2003) for multiple-choice licensure and certification examinations. While item response theory (IRT) models such as the Rasch and two-parameter logistic (2PL) models traditionally place a binary item at its location, Huynh has argued in the cited papers that such mapping may not be appropriate in selecting items for CR interpretation and scale anchoring.
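To make the idea of item mapping concrete, the sketch below locates a binary item on the θ scale under a two-parameter logistic model for a chosen response probability (RP) criterion. With RP = 0.50 the mapped point is simply the item's difficulty b, which corresponds to the traditional placement mentioned above; other criteria, such as the 0.67 value often associated with bookmark standard setting, shift the mapped point upward. The function names and item parameters are hypothetical, and the logistic form is written without the 1.7 scaling constant. This illustrates the general technique, not Huynh's specific proposal.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (Rasch when a = 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mapped_location(a, b, rp=0.5):
    """Theta at which the probability of success equals the RP criterion."""
    if not 0.0 < rp < 1.0:
        raise ValueError("rp must lie strictly between 0 and 1")
    return b + math.log(rp / (1.0 - rp)) / a


# Hypothetical item with unit discrimination and difficulty 0.3.
loc_50 = mapped_location(a=1.0, b=0.3, rp=0.50)  # equals b
loc_67 = mapped_location(a=1.0, b=0.3, rp=0.67)  # higher on the scale than b
print(loc_50, loc_67)
```

The choice of RP criterion therefore changes where an item sits on the achievement continuum, which is why the criterion matters for criterion-referenced interpretation and scale anchoring.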
MacCann, Robert G.; Stanley, Gordon
An item banking method that does not use Item Response Theory (IRT) is described. This method provides a comparable grading system across schools that would be suitable for low-stakes testing. It uses the Angoff standard-setting method to obtain item ratings that are stored with each item. An example of such a grading system is given, showing how…
The relevance of negative symptoms across the diagnostic spectrum of the psychoses remains uncertain. The purpose of this study was to report on prevalence of item and subscale level negative symptoms across the first episode psychosis (FEP) diagnostic spectrum in an epidemiological sample, and to ascertain whether items and subscales were more prevalent in a schizophrenia spectrum diagnoses group compared to an 'all other psychotic diagnoses' group. We measured negative symptoms in 330 patients presenting with FEP using the Scale for Assessment of Negative Symptoms (SANS), and ascertained diagnosis using the Structured Clinical Interview for DSM IV. Prevalence of SANS items and subscales were tabulated across all psychotic diagnoses, and logistic regression analysis determined which items and subscales were predictive of schizophrenia spectrum diagnoses. SANS items were most prevalent in schizophrenia spectrum conditions but frequently presented in other FEP diagnoses, particularly substance induced psychotic disorder and Major Depressive Disorder. Brief psychotic disorder and bipolar disorders had low levels of negative symptoms. SANS items and subscales which significantly predicted schizophrenia spectrum diagnoses, were also frequently present in some of the other psychotic diagnoses. Conclusions: SANS items have high prevalence in FEP, and while commonest in schizophrenia spectrum conditions are not restricted to this diagnostic subgroup.
Lee, Woo-Yeol; Cho, Sun-Joo; McGugin, Rankin W; Van Gulick, Ana Beth; Gauthier, Isabel
The Vanderbilt Expertise Test for cars (VETcar) is a test of visual learning for contemporary car models. We used item response theory to assess the VETcar and in particular used differential item functioning (DIF) analysis to ask if the test functions the same way in laboratory versus online settings and for different groups based on age and gender. An exploratory factor analysis found evidence of multidimensionality in the VETcar, although a single dimension was deemed sufficient to capture the recognition ability measured by the test. We selected a unidimensional three-parameter logistic item response model to examine item characteristics and subject abilities. The VETcar had satisfactory internal consistency. A substantial number of items showed DIF at a medium effect size for test setting and for age group, whereas gender DIF was negligible. Because online subjects were on average older than those tested in the lab, we focused on the age groups to conduct a multigroup item response theory analysis. This revealed that most items on the test favored the younger group. DIF could be more the rule than the exception when measuring performance with familiar object categories, therefore posing a challenge for the measurement of either domain-general visual abilities or category-specific knowledge.
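For readers unfamiliar with the three-parameter logistic model named in this abstract, the sketch below evaluates its item response function and shows how a difference in item difficulty between two groups, one simple form of uniform DIF, separates their response curves. All parameter values and the group labels are hypothetical and are not estimates from the VETcar study.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """3PL item response function.

    a : discrimination, b : difficulty, c : lower asymptote (guessing)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: at theta == b the probability is halfway between c and 1.
p_at_b = p_correct_3pl(theta=0.0, a=1.5, b=0.0, c=0.2)

# Illustrative uniform DIF: the same item is assumed to be harder for one group.
theta = 0.5
p_younger = p_correct_3pl(theta, a=1.5, b=0.0, c=0.2)
p_older = p_correct_3pl(theta, a=1.5, b=0.6, c=0.2)
dif_gap = p_younger - p_older  # positive: the item favours the younger group here
print(p_at_b, dif_gap)
```

Two test-takers with the same ability but different group membership have different success probabilities in this example, which is the situation a DIF analysis is designed to detect.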
Background: Patients receiving complementary and alternative medicine (CAM) therapies often report shifts in well-being that go beyond resolution of the original presenting symptoms. We undertook a research program to develop and evaluate a patient-centered outcome measure to assess the multidimensional impacts of CAM therapies, utilizing a novel mixed methods approach that relied upon techniques from the fields of anthropology and psychometrics. This tool would have broad applicability, both for CAM practitioners to measure shifts in patients' states following treatments, and conventional clinical trial researchers needing validated outcome measures. The US Food and Drug Administration has highlighted the importance of valid and reliable measurement of patient-reported outcomes in the evaluation of conventional medical products. Here we describe Phase I of our research program, the iterative process of content identification, item development and refinement, and response format selection. Cognitive interviews and psychometric evaluation are reported separately. Methods: From a database of patient interviews (n = 177) from six diverse CAM studies, 150 interviews were identified for secondary analysis in which individuals spontaneously discussed unexpected changes associated with CAM. Using ATLAS.ti, we identified common themes and language to inform questionnaire item content and wording. Respondents' language was often richly textured, but item development required a stripping down of language to extract essential meaning and minimize potential comprehension barriers across populations. Through an evocative card sort interview process, we identified those items most widely applicable and covering standard psychometric domains. We developed, pilot-tested, and refined the format, yielding a questionnaire for cognitive interviews and psychometric evaluation. Results: The resulting questionnaire contained 18 items, in visual analog scale format
Background. The didactic approach to teaching physiology in our university has traditionally included the delivery of lectures to large groups, illustrating concepts and referencing recommended textbooks. Importantly, at undergraduate level, our assessments demand a level of application of physiological mechanisms to recognised pathophysiological conditions. Objective. To bridge the gap between lectured material and the application of physiological concepts to pathophysiological conditions, we developed a technological tool approach that augments traditional teaching. Methods. Our e-learning initiative, eQuip, is a custom-built e-learning platform specifically created to align question types included in the program to be similar to those used in current assessments. We describe our formative e-learning system and present preliminary results after the first year of introduction, reporting on the performances and perceptions of 2nd-year physiology students. Results. Students who made use of eQuip for at least three of the teaching blocks achieved significantly better results than those who did not use the program (p=0.0032). Questionnaire feedback was positive with regard to the administration processes and usefulness of eQuip. Students reported particularly liking the ease of access to information; however, <60% of them felt that eQuip motivated them to learn. Conclusion. These results are consistent with the literature, which shows that students who made use of an online formative assessment tool performed better in summative assessment tasks. Despite the improved performance of students, the questionnaire results showed that student motives for using online learning tools indicated that they lack self-directed learning skills and seek easy access to information.
Edelen, Maria O; Tucker, Joan S; Shadel, William G; Stucky, Brian D; Cai, Li
The aim of the PROMIS® Smoking Initiative is to develop, evaluate, and standardize item banks to assess cigarette smoking behavior and biopsychosocial constructs associated with smoking for both daily and non-daily smokers. We used qualitative methods to develop the item pool (following the PROMIS® approach: e.g., literature search, "binning and winnowing" of items, and focus groups and cognitive interviews to finalize wording and format), and quantitative methods (e.g., factor analysis) to develop the item banks. We considered a total of 1622 extant items, and 44 new items for inclusion in the smoking item banks. A final set of 277 items representing 11 conceptual domains was selected for field testing in a national sample of smokers. Using data from 3021 daily smokers in the field test, an iterative series of exploratory factor analyses and project team discussions resulted in six item banks: Positive Consequences of Smoking (40 items), Smoking Dependence/Craving (55 items), Health Consequences of Smoking (26 items), Psychosocial Consequences of Smoking (37 items), Coping Aspects of Smoking (30 items), and Social Factors of Smoking (23 items). Inclusion of a smoking domain in the PROMIS® framework will standardize measurement of key smoking constructs using state-of-the-art psychometric methods, and make them widely accessible to health care providers, smoking researchers and the large community of researchers using PROMIS® who might not otherwise include an assessment of smoking in their design. Next steps include reducing the number of items in each domain, conducting confirmatory analyses, and duplicating the process for non-daily smokers. Copyright © 2012 Elsevier Ltd. All rights reserved.
The subject of the investigation is the translation of neologisms and culture-bound items based on the first chapter of the third book of The Witcher Saga, entitled Baptism of Fire. The analyzed fragment abounds in neologisms and nomenclature; therefore, the processes of word formation are briefly described. Furthermore, some of Hejwowski's (2009, pp. 76–83) procedures are cited to present methods of dealing with the creativity resulting from word formation processes. It is shown that a translator, when translating culture-bound items, is not always able to find an equivalent in the target language and may try either to describe a certain phenomenon or to use a literal translation. The way in which neologisms are coined in a fictional novel may differ from the coinage of words in the standard language; nevertheless, the word formation processes are the same as in Standard English or Standard Polish. Moreover, there is still little evidence of what makes a borrowed word catch on in the standard language.
Previous studies have reported conflicting findings on whether item repetition has beneficial or detrimental effects on source memory. To reconcile such contradictions, we investigated whether the degree of pre-exposure of items can be a potential modulating factor. The experimental procedures spanned two consecutive days. On Day 1, participants were exposed to a set of unfamiliar faces. On Day 2, the same faces presented on the previous day were used again for half of the participants, whereas novel faces were used for the other half. Day 2 procedures consisted of three successive phases: item repetition, source association, and source memory test. In the item repetition phase, half of the face stimuli were repeatedly presented while participants were making male/female judgments. During the source association phase, both the repeated and the unrepeated faces appeared in one of the four locations on the screen. Finally, participants were tested on the location in which a given face was presented during the previous phase and reported the confidence of their memory. Source memory accuracy was measured as the percentage of correct non-guess trials. As a result, we found a significant interaction between prior exposure and repetition. Repetition impaired source memory when the items had been pre-exposed on Day 1, while it led to greater accuracy for novel ones. These results show that pre-experimental exposure can modulate the effects of repetition on associative binding between an item and its contextual information, suggesting that pre-existing representation and novelty signal interact to form new episodic memory.
Spencer, Mercedes; Cho, Sun-Joo; Cutting, Laurie E
In the current study, we examined the dimensionality of the 16-item Card Sorting subtest of the Delis-Kaplan Executive Functioning System assessment in a sample of 264 native English-speaking children between the ages of 9 and 15 years. We also tested for measurement invariance for these items across age and gender groups using item response theory (IRT). Results of the exploratory factor analysis indicated that a two-factor model that distinguished between verbal and perceptual items provided the best fit to the data. Although the items demonstrated measurement invariance across age groups, measurement invariance was violated for gender groups, with two items demonstrating differential item functioning for males and females. Multigroup analysis using all 16 items indicated that the items were more effective for individuals whose IRT scale scores were relatively high. A single-group explanatory IRT model using 14 non-differential item functioning items showed that for perceptual ability, females scored higher than males and that scores increased with age for both males and females; for verbal ability, the observed increase in scores across age differed for males and females. The implications of these findings are discussed.
Forrest, Christopher B; Devine, Janine; Bevans, Katherine B; Becker, Brandon D; Carle, Adam C; Teneralli, Rachel E; Moon, JeanHee; Tucker, Carole A; Ravens-Sieberer, Ulrike
To describe the psychometric evaluation and item response theory calibration of the PROMIS Pediatric Life Satisfaction item banks, child-report, and parent-proxy editions. A pool of 55 life satisfaction items was administered to 1992 children 8-17 years old and 964 parents of children 5-17 years old. Analyses included descriptive statistics, reliability, factor analysis, differential item functioning, and assessment of construct validity. Thirteen items were deleted because of poor psychometric performance. An 8-item short form was administered to a national sample of 996 children 8-17 years old, and 1294 parents of children 5-17 years old. The combined sample (2988 children and 2258 parents) was used in item response theory (IRT) calibration analyses. The final item banks were unidimensional, the items were locally independent, and the items were free from impactful differential item functioning. The 8-item and 4-item short form scales showed excellent reliability, convergent validity, and discriminant validity. Life satisfaction decreased with declining socio-economic status, presence of a special health care need, and increasing age for girls, but not boys. After IRT calibration, we found that 4- and 8-item short forms had a high degree of precision (reliability) across a wide range (>4 SD units) of the latent variable. The PROMIS Pediatric Life Satisfaction item banks and their short forms provide efficient, precise, and valid assessments of life satisfaction in children and youth.
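The link between precision and reliability mentioned in this abstract can be made concrete with a small sketch. It assumes the common IRT convention that the latent variable is scaled to unit variance, so that reliability is approximately 1 - SE²; the function name and the example standard errors are illustrative and do not come from the study.

```python
def reliability_from_se(se, theta_variance=1.0):
    """Approximate reliability implied by a standard error of theta.

    Assumption: theta has variance `theta_variance` (1.0 by default),
    so reliability is taken as 1 - SE^2 / variance.
    """
    return 1.0 - (se ** 2) / theta_variance


# Illustrative standard errors (hypothetical values, not study results).
for se in (0.2, 0.3, 0.5):
    print(f"SE = {se:.2f} -> reliability ~ {reliability_from_se(se):.2f}")
```

Under this convention a smaller standard error at a given trait level translates directly into higher reliability at that level, which is why precision is often reported across the range of the latent variable rather than as a single coefficient.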
Polly, Drew; Wang, Chuang; Martin, Christie; Lambert, Richard G.; Pugalee, David K.; Middleton, Catharina Win
This study examined primary grades students' achievement on number sense tasks administered through an Internet-based formative assessment tool, Assessing Math Concepts Anywhere. Data were analyzed from 2,357 students in teachers' classrooms who had participated in a year-long professional development program on mathematics formative assessment,…
Peters, Michele; Potter, Caroline M; Kelly, Laura; Hunter, Cheryl; Gibbons, Elizabeth; Jenkinson, Crispin; Coulter, Angela; Forder, Julien; Towers, Ann-Marie; A'Court, Christine; Fitzpatrick, Ray
To identify the main issues of importance when living with long-term conditions to refine a conceptual framework for informing the item development of a patient-reported outcome measure for long-term conditions. Semi-structured qualitative interviews (n=48) were conducted with people living with at least one long-term condition. Participants were recruited through primary care. The interviews were transcribed verbatim and analyzed by thematic analysis. The analysis served to refine the conceptual framework, based on reviews of the literature and stakeholder consultations, for developing candidate items for a new measure for long-term conditions. Three main organizing concepts were identified: impact of long-term conditions, experience of services and support, and self-care. The findings helped to refine a conceptual framework, leading to the development of 23 items that represent issues of importance in long-term conditions. The 23 candidate items formed the first draft of the measure, currently named the Long-Term Conditions Questionnaire. The aim of this study was to refine the conceptual framework and develop items for a patient-reported outcome measure for long-term conditions, including single and multiple morbidities and physical and mental health conditions. Qualitative interviews identified the key themes for assessing outcomes in long-term conditions, and these underpinned the development of the initial draft of the measure. These initial items will undergo cognitive testing to refine the items prior to further validation in a survey.
Thyssen, Jacob P; Menné, Torkil; Johansen, Jeanne D
Nickel allergy is prevalent as assessed by epidemiological studies. In an attempt to further identify and characterize sources that may result in nickel allergy and dermatitis, we analysed items identified by nickel-allergic dermatitis patients as causative of nickel dermatitis by using the dimethylglyoxime (DMG) test. Dermatitis patients with nickel allergy of current relevance were identified over a 2-year period in a tertiary referral patch test centre. When possible, their work tools and personal items were examined with the DMG test. Among 95 nickel-allergic dermatitis patients, 70 (73.7%) had metallic items investigated for nickel release. A total of 151 items were investigated, and 66 (43.7%) gave positive DMG test reactions. Objects were nearly all purchased or acquired after the introduction of the EU Nickel Directive. Only one object had been inherited, and only two objects had been purchased outside of Denmark. DMG testing is valuable as a screening test for nickel release and should be used to identify relevant exposures in nickel-allergic patients. Mainly consumer items, but also work tools used in an occupational setting, released nickel in dermatitis patients. This study confirmed 'risk items' from previous studies, including mobile phones.
Robins' Single-item Self-esteem Inventory was compared with a single item from the Coopersmith Self-esteem. Although a new scoring format was used, there was good evidence of cross-validation in 83 current and former psychiatric patients who completed Harvey's adapted measure of stigma felt and experienced by users of mental health services. Scores on the two single-item self-esteem measures correlated .76 (p self-esteem in users of mental health services.
Yoon Soo Park
This study investigates the impact of item parameter drift (IPD) on parameter and ability estimation when the underlying measurement model fits a mixture distribution, thereby violating the item invariance property of unidimensional item response theory (IRT) models. An empirical study was conducted to demonstrate the occurrence of both IPD and an underlying mixture distribution using real-world data. Twenty-one trended anchor items from the 1999, 2003, and 2007 administrations of Trends in International Mathematics and Science Study (TIMSS) were analyzed using unidimensional and mixture IRT models. TIMSS treats trended anchor items as invariant over testing administrations and uses pre-calibrated item parameters based on unidimensional IRT. However, empirical results showed evidence of two latent subgroups with IPD. Results showed changes in the distribution of examinee ability between latent classes over the three administrations. A simulation study was conducted to examine the impact of IPD on the estimation of ability and item parameters, when data have underlying mixture distributions. Simulations used data generated from a mixture IRT model and estimated using unidimensional IRT. Results showed that data reflecting IPD using mixture IRT model led to IPD in the unidimensional IRT model. Changes in the distribution of examinee ability also affected item parameters. Moreover, drift with respect to item discrimination and distribution of examinee ability affected estimates of examinee ability. These findings demonstrate the need to caution and evaluate IPD using a mixture IRT framework to understand its effect on item parameters and examinee ability.
Park, Yoon Soo; Lee, Young-Sun; Xing, Kuan
This study investigates the impact of item parameter drift (IPD) on parameter and ability estimation when the underlying measurement model fits a mixture distribution, thereby violating the item invariance property of unidimensional item response theory (IRT) models. An empirical study was conducted to demonstrate the occurrence of both IPD and an underlying mixture distribution using real-world data. Twenty-one trended anchor items from the 1999, 2003, and 2007 administrations of Trends in International Mathematics and Science Study (TIMSS) were analyzed using unidimensional and mixture IRT models. TIMSS treats trended anchor items as invariant over testing administrations and uses pre-calibrated item parameters based on unidimensional IRT. However, empirical results showed evidence of two latent subgroups with IPD. Results also showed changes in the distribution of examinee ability between latent classes over the three administrations. A simulation study was conducted to examine the impact of IPD on the estimation of ability and item parameters, when data have underlying mixture distributions. Simulations used data generated from a mixture IRT model and estimated using unidimensional IRT. Results showed that data reflecting IPD using mixture IRT model led to IPD in the unidimensional IRT model. Changes in the distribution of examinee ability also affected item parameters. Moreover, drift with respect to item discrimination and distribution of examinee ability affected estimates of examinee ability. These findings demonstrate the need to caution and evaluate IPD using a mixture IRT framework to understand its effects on item parameters and examinee ability.
Choe, Edison M.; Kern, Justin L.; Chang, Hua-Hua
Despite common operationalization, measurement efficiency of computerized adaptive testing should not only be assessed in terms of the number of items administered but also the time it takes to complete the test. To this end, a recent study introduced a novel item selection criterion that maximizes Fisher information per unit of expected response…
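As a rough illustration of the kind of criterion this abstract describes, the sketch below picks the item with the largest Fisher information per unit of expected time. It assumes that the truncated phrase refers to expected response time, uses a two-parameter logistic (2PL) model, and treats the `expected_time` values as known; the function names and the item pool are hypothetical and are not taken from the study.

```python
import math


def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_item(theta, items):
    """Return the item maximizing information per unit of expected time."""
    return max(
        items,
        key=lambda it: info_2pl(theta, it["a"], it["b"]) / it["expected_time"],
    )


# Hypothetical pool: item B is more informative but much slower to answer.
pool = [
    {"name": "A", "a": 1.0, "b": 0.0, "expected_time": 60.0},
    {"name": "B", "a": 2.0, "b": 0.0, "expected_time": 300.0},
]
chosen = select_item(0.0, pool)
```

With this pool, a purely information-based rule would prefer item B at theta = 0, whereas the time-weighted rule prefers item A, which is the trade-off such a criterion is meant to capture.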
Extended an Item Response Theory (IRT) method for detection of differential item functioning to the partial credit model and applied the method to simulated data using a stepwise procedure. Then applied the stepwise DIF analysis based on the multiple-group partial credit model to writing trend data from the National Assessment of Educational…
Falk, Carl F.; Cai, Li
We present a logistic function of a monotonic polynomial with a lower asymptote, allowing additional flexibility beyond the three-parameter logistic model. We develop a maximum marginal likelihood based approach to estimate the item parameters. The new item response model is demonstrated on math assessment data from a state, and a computationally…
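For orientation, the following sketch contrasts the standard three-parameter logistic (3PL) response function with a logistic function of a polynomial that keeps a lower asymptote. It is only an illustration of the general idea: the cubic form, the coefficient values, and the function names are assumptions, and the caller is responsible for choosing coefficients that make the polynomial monotonic; it is not the authors' parameterization or estimation method.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_3pl(theta, a, b, c):
    """3PL: lower asymptote c, discrimination a, difficulty b."""
    return c + (1.0 - c) * logistic(a * (theta - b))


def p_poly(theta, coeffs, c):
    """Logistic of a polynomial m(theta) = sum(coeffs[k] * theta**k),
    with lower asymptote c. Monotonicity of m is assumed, not checked."""
    m = sum(coef * theta ** k for k, coef in enumerate(coeffs))
    return c + (1.0 - c) * logistic(m)


# A linear polynomial (-a*b, a) reproduces the 3PL curve exactly.
linear = (-1.2 * 0.5, 1.2)
# A cubic with derivative 1 + 1.5 * theta**2 > 0 is monotonic everywhere.
cubic = (0.0, 1.0, 0.0, 0.5)
```

The linear case shows that the 3PL is a special case of the more flexible family, while higher-order terms let the curve bend in ways a single slope parameter cannot.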
Beretvas, S. Natasha; Cawthon, Stephanie W.; Lockhart, L. Leland; Kaye, Alyssa D.
This pedagogical article is intended to explain the similarities and differences between the parameterizations of two multilevel measurement model (MMM) frameworks. The conventional two-level MMM that includes item indicators and models item scores (Level 1) clustered within examinees (Level 2) and the two-level cross-classified MMM (in which item…
Bose, Jayakumar; Rengel, Zed
Adult learners are already involved in the process of self-regulation; hence, higher education institutions should focus on strengthening students' self-regulatory skills. Self-regulation can be facilitated through formative assessment. This paper proposes a model formative assessment strategy that would complement existing university teaching,…
Irene Fernández Monsalve
During language comprehension, semantic contextual information is used to generate expectations about upcoming items. This has been commonly studied through the N400 event-related potential (ERP), as a measure of facilitated lexical retrieval. However, the associative relationships in multi-word expressions (MWE) may enable the generation of a categorical expectation, leading to lexical retrieval before target word onset. Processing of the target word would thus reflect a target-identification mechanism, possibly indexed by a P3 ERP component. However, given their time overlap (200-500 ms post-stimulus onset), differentiating between N400/P3 ERP responses (averaged over multiple linguistically variable trials) is problematic. In the present study, we analyzed EEG data from a previous experiment, which compared ERP responses to highly expected words that were placed either in a MWE or a regular non-fixed compositional context, and to low predictability controls. We focused on oscillatory dynamics and regression analyses, in order to dissociate between the two contexts by modeling the electrophysiological response as a function of item-level parameters. A significant interaction between word position and condition was found in the regression model for power in a theta range (~7-9 Hz), providing evidence for the presence of qualitative differences between conditions. Power levels within this band were lower for MWE than compositional contexts when the target word appeared later on in the sentence, confirming that in the former lexical retrieval would have taken place before word onset. On the other hand, gamma-power (~50-70 Hz) was also modulated by predictability of the item in all conditions, which is interpreted as an index of a similar 'matching' sub-step for both types of contexts, binding an expected representation and the external input.
Mindyarto, B. N.; Nugroho, S. E.; Linuwih, S.
Computer-based testing has created the demand for large numbers of items. This paper discusses the production of cohesive physics testlets using automatic item generation concepts and procedures. The testlets were composed by restructuring physics problems to reveal deeper understanding of the underlying physical concepts by inserting a qualitative question and its scientific reasoning question. A template-based testlet generator was used to generate the testlet variants. Using this methodology, 1248 testlet variants were effectively generated from 25 testlet templates. Some issues related to the effective application of the generated physics testlets in practical assessments were discussed.
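A template-based generator of the kind mentioned in this abstract can be sketched in a few lines. The example below is a minimal illustration, assuming a template is a text with named slots and that variants are formed from every combination of slot values; the template wording, the slot names, and the `generate_variants` helper are hypothetical and are not the authors' tool.

```python
from itertools import product
from string import Template


def generate_variants(template, slots):
    """Fill every combination of slot values into the template text."""
    names = list(slots)
    tmpl = Template(template)
    return [
        tmpl.substitute(dict(zip(names, combo)))
        for combo in product(*(slots[name] for name in names))
    ]


# Hypothetical testlet template: a qualitative question plus a reasoning question.
template = (
    "A cart of mass $mass kg is pushed with a net force of $force N. "
    "(a) Does its speed change? "
    "(b) Explain your reasoning using Newton's second law."
)
slots = {"mass": [2, 4, 5], "force": [10, 20]}
variants = generate_variants(template, slots)
```

The number of variants is the product of the slot sizes (here 3 × 2 = 6), which is how a small set of templates can yield a large pool of testlets.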
Online Formative Assessment (OFA) improves EFL students’ reading comprehension, enabling them to perform better in reading comprehension tests. To lend support to this claim, a quasi-experimental study was conducted in Mashhad, Iran. 48 female lower intermediate EFL students took part in this study. Participants were assigned to control and treatment groups. Participants in both groups received a formative assessment program lasting for 10 sessions. Formative assessment in the treatment group was conducted by the site itself, and participants in the control group were assessed by the teacher. It was found that participants in the treatment group significantly outperformed those in the control group. This finding indicated OFA as an effective learning tool in EFL reading comprehension classrooms.
Martine H P Crins
The Patient-Reported Outcomes Measurement Information System (PROMIS) is a universally applicable set of instruments, including item banks, short forms and computer adaptive tests (CATs), measuring patient-reported health across different patient populations. PROMIS CATs are highly efficient and the use in practice is considered feasible with little administration time, offering standardized and routine patient monitoring. Before an item bank can be used as CAT, the psychometric properties of the item bank have to be examined. Therefore, the objective was to assess the psychometric properties of the Dutch-Flemish PROMIS Physical Function item bank (DF-PROMIS-PF) in Dutch patients receiving physical therapy. Cross-sectional study. 805 patients >18 years, who received any kind of physical therapy in primary care in the past year, completed the full DF-PROMIS-PF (121 items). Unidimensionality was examined by Confirmatory Factor Analysis, and local dependence and monotonicity were evaluated. A Graded Response Model was fitted. Construct validity was examined with correlations between DF-PROMIS-PF T-scores and scores on two legacy instruments (SF-36 Health Survey Physical Functioning scale [SF36-PF10] and the Health Assessment Questionnaire Disability-Index [HAQ-DI]). Reliability (standard errors of theta) was assessed. The results for unidimensionality were mixed (scaled CFI = 0.924, TLI = 0.923, RMSEA = 0.045, 1st factor explained 61.5% of variance). Some local dependence was found (8.2% of item pairs). The item bank showed a broad coverage of the physical function construct (threshold-parameters range: -4.28 to 2.33) and good construct validity (correlation with SF36-PF10 = 0.84 and HAQ-DI = -0.85). Furthermore, the DF-PROMIS-PF showed greater reliability over a broader score-range than the SF36-PF10 and HAQ-DI. The psychometric properties of the DF-PROMIS-PF item bank are sufficient. The DF-PROMIS-PF can now be used as short forms or CAT to measure the level of
Methods. Our e-learning initiative, eQuip, is a custom-built e-learning platform specifically created to align question types included in the program to be similar to those used in current assessments. We describe our formative e-learning system and present preliminary results after the first year of introduction, reporting on the ...
Chen, Qiuxian; Kettle, Margaret; Klenowski, Val; May, Lyn
Formative assessment is increasingly being implemented through policy initiatives in Chinese educational contexts. As an approach to assessment, formative assessment derives many of its key principles from Western contexts, notably through the work of scholars in the UK, the USA and Australia. The question for this paper is the ways that formative…
Research purpose: This article reports on the process of identifying items for, and provides a quantitative evaluation of, the South African Personality Inventory (SAPI) items. Motivation for the study: The study intended to develop an indigenous and psychometrically sound personality instrument that adheres to the requirements of South African legislation and excludes cultural bias. Research design, approach and method: The authors used a cross-sectional design. They measured the nine SAPI clusters identified in the qualitative stage of the SAPI project in 11 separate quantitative studies. Convenience sampling yielded 6735 participants. Statistical analysis focused on the construct validity and reliability of items. The authors eliminated items that showed poor performance, based on common psychometric criteria, and selected the best performing items to form part of the final version of the SAPI. Main findings: The authors developed 2573 items from the nine SAPI clusters. Of these, 2268 items were valid and reliable representations of the SAPI facets. Practical/managerial implications: The authors developed a large item pool. It measures personality in South Africa. Researchers can refine it for the SAPI. Furthermore, the project illustrates an approach that researchers can use in projects that aim to develop culturally-informed psychological measures. Contribution/value-add: Personality assessment is important for recruiting, selecting and developing employees. This study contributes to the current knowledge about the early processes researchers follow when they develop a personality instrument that measures personality fairly in different cultural groups, as the SAPI does.
Black and Wiliam (1998a, 1998b) demonstrate that formative assessment is one of the most effective strategies for promoting student learning. Since the publication of their reviews, formative assessment has gained increasing international prominence in both policy and practice. However, despite this early innovation, the theory and practice of…
Background. Clinical skills training in the clinical skills laboratory (CSL) environment forms an important part of the undergraduate medical curriculum. These skills are better demonstrated than described. A lack of direct observation and feedback given to medical students performing these skills has been reported. Without feedback, errors are uncorrected, good performance is not reinforced and clinical competence is minimally achieved. Objectives. To explore the perceptions of 3rd-year medical students and their clinical teachers about formative clinical assessment feedback in the CSL setting. Methods. Questionnaires with open- and closed-ended questions were administered to 3rd-year medical students and their clinical skills teachers. Quantitative data were statistically analysed while qualitative data were thematically analysed. Results. Five clinical teachers and 183 medical students participated. Average scores for the items varied between 1.87 and 5.00 (1: negative to 5: positive). The majority of students reported that feedback informed them of their competence level and learning needs, and motivated them to improve their skills and participation in patient-centred learning activities. Teachers believed that they provided sufficient and balanced feedback. Some students were concerned about the lack of standardised and structured assessment criteria and variation in teacher feedback. No statistical difference (p<0.05) was found between the mean item ratings based on demographic and academic background. Conclusion. Most teachers and students were satisfied with the feedback given and received, respectively. Structured and balanced criterion-referenced feedback processes, together with feedback training workshops for staff and students, are recommended to enhance feedback practice quality in the CSL. Limited clinical staff in the CSL was noted as a concern.
Foos, Paul W; Goolkasian, Paula
Format effects refer to lower recall of printed words from working memory when compared to spoken words or pictures. These effects have been attributed to an attenuation of attention to printed words. The present experiment compares younger and older adults' recall of three or six items presented as pictures, spoken words, printed words, and alternating case WoRdS. The latter stimuli have been shown to increase attention to printed words and, thus, reduce format effects. The question of interest was whether these stimuli would also reduce format effects for older adults whose working memory capacity has fewer attentional resources to allocate. Results showed that older adults performed as well as younger adults with three items but less well with six and that format effects were reduced for both age groups, but more for young, when alternating case words were used. Other findings regarding executive control of working memory are discussed. The obtained differences support models of reduced capacity in older adult working memory.
Chrisinger, Benjamin W
The Supplemental Nutrition Assistance Program (SNAP, formerly known as food stamps) is the federal government's largest form of food assistance, and a frequent focus of political and scholarly debate. Previous discourse in the public health community and recent proposals in state legislatures have suggested limiting the use of SNAP benefits on unhealthy food items, such as sugar-sweetened beverages (SSBs). This paper identifies two possible underlying motivations for item restriction, health and morals, and analyzes the level of empirical support for claims about the current state of the program, as well as expectations about how item restriction would change participant outcomes. It also assesses how item restriction would reduce individual agency of low-income individuals, and identifies mechanisms by which this may adversely affect program participants. Finally, this paper offers alternative policies to promote healthier purchasing and eating among SNAP participants that can be pursued without reducing individual agency. Health advocates and officials must more fully weigh the attendant risks of implementing SNAP item restrictions, including the reduction of individual agency of a vulnerable population.
Rusman, Ellen; Martínez-Monés, Alejandra; Tasouris, Christodoulos; Economou, Anastasia
Workshop participants will learn to: Understand the reasons behind the shift from assessment of learning to assessment for learning; Differentiate between the objectives of formative and summative assessment; Distinguish between different formative eAssessment methods; Understand the benefits
Hobin, Erin; Lebenbaum, Michael; Rosella, Laura; Hammond, David
To assess the availability, location, and format of nutrition information in fast-food chain restaurants in Ontario. Nutrition information in restaurants was assessed using an adapted version of the Nutrition Environment Measures Study for Restaurants (NEMS-R). Two raters independently visited 50 restaurants, 5 outlets of each of the top-10 fast-food chain restaurants in Canada. The locations of the restaurants were randomly selected within the Waterloo, Wellington, and Peel regions in Ontario, Canada. Descriptive results are presented for the proportion of restaurants presenting nutrition information by location (e.g., brochure), format (e.g., use of symbols), and then by type of restaurant (e.g., quick take-away, full-service). Overall, 96.0% (n = 48) of the restaurants had at least some nutrition information available in the restaurant. However, no restaurant listed calorie information for all items on menu boards or menus, and only 14.0% (n = 7) of the restaurants posted calorie information and 26.0% (n = 13) of restaurants posted other nutrients (e.g., total fat) for at least some items on menu boards or menus. The majority of the fast-food chain restaurants included in our study provided at least some nutrition information in restaurants; however, very few restaurants made nutrition information readily available for consumers on menu boards and menus.
Gloria, R. Y.; Sudarmin, S.; Wiyanto; Indriyanti, D. R.
Habits of mind are intelligent thinking dispositions that every individual needs to have, and it takes effort to form them as expected. A behavior can be formed by continuous practice; therefore the student's habits of mind can also be formed and trained. One effort that can be used to encourage the formation of habits of mind is a formative assessment strategy with the stages of UbD (Understanding by Design), and a study needs to be done to prove it. This study aims to determine the contribution of formative assessment to the value of habits of mind owned by prospective teachers. The method used is a quantitative method with a quasi-experimental design. To determine the effectiveness of formative assessment with UbD stages on the formation of habits of mind, a correlation test and regression analysis were conducted on the formative assessment questionnaire consisting of three components, i.e. feedback, peer assessment and self assessment, and habits of mind. The result of the research shows that of the three components of formative assessment, only the feedback component does not show correlation to students’ habits of mind (r = 0.323), while the peer assessment component (r = 0.732) and the self assessment component (r = 0.625) both indicate correlation. From the regression test, the overall component of the formative assessment contributed to the habits of mind at 57.1%. From the result of the research, it can be concluded that the formative assessment with UbD stages is effective and contributes to forming the student's habits of mind; the formative assessment components that contributed the most are the peer assessment and self assessment. The greatest contribution goes to the Thinking interdependently category.
Ayotte, Brian J; Trivedi, Ranak; Bosworth, Hayden B
Health-related knowledge is an important component in the self-management of chronic illnesses. The objective of this study was to more accurately assess racial differences in hypertension knowledge by using a latent variable modeling approach that controlled for sociodemographic factors and accounted for measurement issues in the assessment of hypertension knowledge. Cross-sectional data from 1,177 participants (45% African American; 35% female) were analyzed using a multiple indicator multiple causes (MIMIC) modeling approach. Available sociodemographic data included race, education, sex, financial status, and age. All participants completed six items on a hypertension knowledge questionnaire. Overall, the final model suggested that females, Whites, and patients with at least a high school diploma had higher latent knowledge scores than males, African Americans, and patients with less than a high school diploma, respectively. The model also detected differential item functioning (DIF) based on race for two of the items. Specifically, the error rate for African Americans was lower than would be expected given the lower level of latent knowledge on the items, on the questions related to: (a) the association between high blood pressure and kidney disease, and (b) the increased risk African Americans have for developing hypertension. Not accounting for DIF resulted in the difference between Whites and African Americans being underestimated. These results are discussed in the context of the need for careful measurement of health-related constructs, and how measurement-related issues can result in an inaccurate estimation of racial differences in hypertension knowledge.
Irwin, Debra E; Gross, Heather E; Stucky, Brian D; Thissen, David; DeWitt, Esi Morgan; Lai, Jin Shei; Amtmann, Dagmar; Khastou, Leyla; Varni, James W; DeWalt, Darren A
Pediatric self-report should be considered the standard for measuring patient reported outcomes (PRO) among children. However, circumstances exist when the child is too young, cognitively impaired, or too ill to complete a PRO instrument and a proxy-report is needed. This paper describes the development process including the proxy cognitive interviews and large-field-test survey methods and sample characteristics employed to produce item parameters for the Patient Reported Outcomes Measurement Information System (PROMIS) pediatric proxy-report item banks. The PROMIS pediatric self-report items were converted into proxy-report items before undergoing cognitive interviews. These items covered six domains (physical function, emotional distress, social peer relationships, fatigue, pain interference, and asthma impact). Caregivers (n = 25) of children between the ages of 5 and 17 years provided qualitative feedback on proxy-report items to assess any major issues with these items. From May 2008 to March 2009, the large-scale survey enrolled children ages 8-17 years to complete the self-report version and caregivers to complete the proxy-report version of the survey (n = 1548 dyads). Caregivers of children ages 5 to 7 years completed the proxy report survey (n = 432). In addition, caregivers completed other proxy instruments, PedsQL™ 4.0 Generic Core Scales Parent Proxy-Report version, PedsQL™ Asthma Module Parent Proxy-Report version, and KIDSCREEN Parent-Proxy-52. Item content was well understood by proxies and did not require item revisions, but some proxies clearly noted that determining an answer on behalf of their child was difficult for some items. Dyads and caregivers of children ages 5-17 years old were enrolled in the large-scale testing. The majority were female (85%), married (70%), Caucasian (64%) and had at least a high school education (94%). Approximately 50% had children with a chronic health condition, primarily asthma, which was diagnosed or treated within 6
David Thissen, a professor in the Department of Psychology and Neuroscience, Quantitative Program at the University of North Carolina, has consulted and served on technical advisory committees for assessment programs that use item response theory (IRT) over the past couple decades. He has come to the conclusion that there are usually two purposes…
The present study used a cohort-sequential design to examine developmental changes in children's ability to bind items in memory during early and middle childhood. Three cohorts of children (aged 4, 6, or 8 years) were followed longitudinally for three years. Each year, children completed a source memory paradigm assessing memory for items and binding. Results suggest linear increases in memory for individual items (facts or sources) between 4 and 10 years of age, but that memory for correct ...
Peterson, Euguenia; Siadat, M. Vali
The purpose of this study is to examine the effects of the implementation of formative assessment on student achievement in elementary algebra classes at Richard J. Daley College in Chicago, IL. The formative assessment is defined in this case as frequent, cumulative, time-restricted, multiple-choice quizzes with immediate constructive feedback.…
Hemert, Dianne A. van; Baerveldt, Chris; Vermande, Marjolijn
A method is presented for evaluating the presence and size of cross-cultural item biases. The examined items concern parental support and family cohesion in a Likert-type questionnaire for adolescents in The Netherlands. Each evaluated item has two versions, a collectivist and an individualistic one,
Background: Web-based formative assessment tools have become widely recognized in medical education as valuable resources for self-directed learning. Objectives: To explore the educational value of formative assessment using online quizzes for kidney pathology learning in our renal pathophysiology course. Methods: Students were given unrestricted and optional access to quizzes. Performance on quizzed and non-quizzed materials of those who used (‘quizzers’) and did not use the tool (‘non-quizzers’) was compared. Frequency of tool usage was analyzed and satisfaction surveys were utilized at the end of the course. Results: In total, 82.6% of the students used quizzes. The greatest usage was observed on the day before the final exam. Students repeated interactive and more challenging quizzes more often. Average means between final exam scores for quizzed and unrelated materials were almost equal for ‘quizzers’ and ‘non-quizzers’, but ‘quizzers’ performed significantly better than ‘non-quizzers’ on both quizzed (p=0.001) and non-quizzed (p=0.024) topics. In total, 89% of surveyed students thought quizzes improved their learning experience in this course. Conclusions: Our new computer-assisted learning tool is popular, and although its use can predict the final exam outcome, it does not provide strong evidence for direct improvement in academic performance. Students who chose to use quizzes did well on all aspects of the final exam and most commonly used quizzes to practice for the final exam. Our efforts to revitalize the course material and promote learning by adding interactive online formative assessments improved students’ learning experience overall.
Mehmet Aydeniz; Aybuke Pabuccu
This study investigated the effects of formative assessment strategies on students’ conceptual understanding in a freshmen college chemistry course in Turkey. Our sample consists of 96 students; 27 males, 69 females. The formative assessment strategies such as reflection on exams, and collective problem solving sessions were used throughout the course. Data were collected through pre and post-test methodology. The findings reveal that the formative assessment strategies used in this study led...
Walker, Timothy J; Tullar, Jessica M; Diamond, Pamela M; Kohl, Harold W; Amick, Benjamin C
Purpose To evaluate factorial validity, scale reliability, test-retest reliability, convergent validity, and discriminant validity of the 8-item Work Limitations Questionnaire (WLQ) among employees from a public university system. Methods A secondary analysis using de-identified data from employees who completed an annual Health Assessment between the years 2009-2015 tested research aims. Confirmatory factor analysis (CFA) (n = 10,165) tested the latent structure of the 8-item WLQ. Scale reliability was determined using a CFA-based approach while test-retest reliability was determined using the intraclass correlation coefficient. Convergent/discriminant validity was tested by evaluating relations between the 8-item WLQ with health/performance variables for convergent validity (health-related work performance, number of chronic conditions, and general health) and demographic variables for discriminant validity (gender and institution type). Results A 1-factor model with three correlated residuals demonstrated excellent model fit (CFI = 0.99, TLI = 0.99, RMSEA = 0.03, and SRMR = 0.01). The scale reliability was acceptable (0.69, 95% CI 0.68-0.70) and the test-retest reliability was very good (ICC = 0.78). Low-to-moderate associations were observed between the 8-item WLQ and the health/performance variables while weak associations were observed between the demographic variables. Conclusions The 8-item WLQ demonstrated sufficient reliability and validity among employees from a public university system. Results suggest the 8-item WLQ is a usable alternative for studies when the more comprehensive 25-item WLQ is not available.
Hamodi, Carolina; López-Pastor, Víctor Manuel; López-Pastor, Ana Teresa
The aim of this article is to analyse whether having experience of formative assessment during their initial teacher education courses (ITE) influences graduates' subsequent practice as teachers. That is, if the assessment methods that university students are subject to during their learning process are then actually employed by them during their…
Background: The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) is a widely used patient reported outcome in osteoarthritis. An important, but frequently overlooked, aspect of validating health outcome measures is to establish if items exhibit differential item functioning (DIF). That is, if respondents have the same underlying level of an attribute, does the item give the same score in different subgroups or is it biased towards one subgroup or another. The aim of the study was to explore DIF in the Likert format WOMAC for the first time in a UK osteoarthritis population with respect to demographic, social, clinical and psychological factors. Methods: The sample comprised a community sample of 763 people with osteoarthritis who participated in the Somerset and Avon Survey of Health. The WOMAC was explored for DIF by gender, age, social deprivation, social class, employment status, distress, body mass index and clinical factors. Ordinal regression models were used to identify DIF items. Results: After adjusting for age, two items were identified for the physical functioning subscale as having DIF with age identified as the DIF factor for 2 items, gender for 1 item and body mass index for 1 item. For the WOMAC pain subscale, for people with hip osteoarthritis one item was identified with age-related DIF. The impact of the DIF items rarely had a significant effect on the conclusions of group comparisons. Conclusions: Overall, the WOMAC performed well with only a small number of DIF items identified. However, as DIF items were identified for the WOMAC physical functioning subscale it would be advisable to analyse data taking into account the possible impact of the DIF items when weight, gender or especially age effects, are the focus of interest in UK-based osteoarthritis studies. Similarly for the WOMAC pain subscale in people with hip osteoarthritis it would be worthwhile to analyse data taking into account the
Rose, Matthias; Bjorner, Jakob B; Gandek, Barbara; Bruce, Bonnie; Fries, James F; Ware, John E
To document the development and psychometric evaluation of the Patient-Reported Outcomes Measurement Information System (PROMIS) Physical Function (PF) item bank and static instruments. The items were evaluated using qualitative and quantitative methods. A total of 16,065 adults answered item subsets (n>2,200/item) on the Internet, with oversampling of the chronically ill. Classical test and item response theory methods were used to evaluate 149 PROMIS PF items plus 10 Short Form-36 and 20 Health Assessment Questionnaire-Disability Index items. A graded response model was used to estimate item parameters, which were normed to a mean of 50 (standard deviation [SD]=10) in a US general population sample. The final bank consists of 124 PROMIS items covering upper, central, and lower extremity functions and instrumental activities of daily living. In simulations, a 10-item computerized adaptive test (CAT) eliminated floor and decreased ceiling effects, achieving higher measurement precision than any comparable length static tool across four SDs of the measurement range. Improved psychometric properties were transferred to the CAT's superior ability to identify differences between age and disease groups. The item bank provides a common metric and can improve the measurement of PF by facilitating the standardization of patient-reported outcome measures and implementation of CATs for more efficient PF assessments over a larger range. Copyright © 2014. Published by Elsevier Inc.
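The norming step in the abstract above (item parameters scaled to a mean of 50 and an SD of 10 in a US general population sample) corresponds to the usual linear T-score transformation. A minimal sketch, assuming the latent score theta is already standardized (mean 0, SD 1) in the reference population; the function name is hypothetical and is not taken from PROMIS software:

```python
def theta_to_t_score(theta: float, mean: float = 50.0, sd: float = 10.0) -> float:
    """Map a standardized latent score (theta) onto a T-score metric.

    Assumption: theta has mean 0 and SD 1 in the reference population,
    so the mapping is the plain linear transformation mean + sd * theta.
    """
    return mean + sd * theta


if __name__ == "__main__":
    # A respondent one SD below the reference mean lands at T = 40.
    print(theta_to_t_score(-1.0))
```

On this metric, a 10-point difference corresponds to one SD in the reference population, which is what makes scores from different instruments calibrated to the same bank comparable.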
Background: A potential problem of low-stakes large-scale assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC) is low test-taking engagement. The present study pursued two goals in order to better understand conditioning factors of test-taking disengagement: First, a model-based approach was used to investigate whether item indicators of disengagement constitute a continuous latent person variable by domain. Second, the effects of person and item characteristics were jointly tested using explanatory item response models. Methods: Analyses were based on the Canadian sample of Round 1 of the PIAAC, with N = 26,683 participants completing test items in the domains of literacy, numeracy, and problem solving. Binary item disengagement indicators were created by means of item response time thresholds. Results: The results showed that disengagement indicators define a latent dimension by domain. Disengagement increased with lower educational attainment, lower cognitive skills, and when the test language was not the participant’s native language. Gender did not exert any effect on disengagement, while age had a positive effect for problem solving only. An item’s location in the second of two assessment modules was positively related to disengagement, as was item difficulty. The latter effect was negatively moderated by cognitive skill, suggesting that poor test-takers are especially likely to disengage with more difficult items. Conclusions: The negative effect of cognitive skill, the positive effect of item difficulty, and their negative interaction effect support the assumption that disengagement is the outcome of individual expectations about success (informed disengagement).
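The abstract above states only that binary disengagement indicators were derived from item response time thresholds. A minimal sketch of that idea follows; the rule that a response faster than the item's threshold counts as disengaged, the function name, and all numbers are illustrative assumptions, not the study's actual procedure or data:

```python
from typing import List, Sequence


def disengagement_indicators(
    response_times: Sequence[float], thresholds: Sequence[float]
) -> List[int]:
    """Return 1 where a response was faster than the item's threshold, else 0.

    Assumption: one threshold per item (in seconds), and a response time
    strictly below the threshold is treated as disengaged responding.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need exactly one threshold per item")
    return [1 if rt < th else 0 for rt, th in zip(response_times, thresholds)]


if __name__ == "__main__":
    # Hypothetical response times and thresholds for four items.
    times = [2.1, 35.0, 4.9, 60.2]
    cutoffs = [5.0, 5.0, 3.0, 8.0]
    print(disengagement_indicators(times, cutoffs))
```

The resulting 0/1 vectors are the kind of item-level data that an item response model can then treat as indicators of a latent disengagement dimension.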
Bevans, Katherine B; Meltzer, Lisa J; De La Motte, Anna; Kratchman, Amy; Viél, Dominique; Forrest, Christopher B
To develop the Patient Reported Outcome Measurement Information System (PROMIS) Pediatric Sleep Health item pool and evaluate its content validity. Participants included 8 expert sleep clinician-researchers, 64 children ages 8-17 years, and 54 parents of children ages 5-17 years. We started with item concepts and expressions from the PROMIS Sleep Disturbance and Sleep Related Impairment adult measures. Additional pediatric sleep health concepts were generated by expert (n = 8), child (n = 28), and parent (n = 33) concept elicitation interviews and a systematic review of existing pediatric sleep health questionnaires. Content validity of the item pool was evaluated with item translatability review, readability analysis, and child (n = 36) and parent (n = 21) cognitive interviews. The final pediatric Sleep Health item pool includes 43 items that assess sleep disturbance (children's capacity to fall and stay asleep, sleep quality, dreams, and parasomnias) and sleep-related impairments (daytime sleepiness, low energy, difficulty waking up, and the impact of sleep and sleepiness on cognition, affect, behavior, and daily activities). Items are translatable and relevant and well understood by children ages 8-17 and parents of children ages 5-17. Rigorous qualitative procedures were used to develop and evaluate the content validity of the PROMIS Pediatric Sleep Health item pool. Once the item pool's psychometric properties are established, the scales will be useful for measuring children's subjective experiences of sleep.
Shou, Yiyun; Sellbom, Martin; Xu, Jing
There is cumulative evidence for the cross-cultural validity of the Triarchic Psychopathy Measure (TriPM; Patrick, 2010) among non-Western populations. Recent studies using correlational and regression analyses show promising construct validity of the TriPM in Chinese samples. However, little is known about the efficiency of items in TriPM in assessing the proposed latent traits. The current study evaluated the psychometric properties of the Chinese TriPM at the item level using item response theory analyses. It also examined the measurement invariance of the TriPM between the Chinese and the U.S. student samples by applying differential item functioning analyses under the item response theory framework. The results supported the unidimensional nature of the Disinhibition and Meanness scales. Both scales had a greater level of precision in the respective underlying constructs at the positive ends. The two scales, however, had several items that were weakly associated with their respective latent traits in the Chinese student sample. Boldness, on the other hand, was found to be multidimensional, and reflected a more normally distributed range of variation. The examination of measurement bias via differential item functioning analyses revealed that a number of items of the TriPM were not equivalent across the Chinese and the U.S. samples. Some modification and adaptation of items might be considered for improving the precision of the TriPM for Chinese participants. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Svicher, Andrea; Cosci, Fiammetta; Giannini, Marco; Pistelli, Francesco; Fagerström, Karl
The Fagerström Test for Cigarette Dependence (FTCD) and the Heaviness of Smoking Index (HSI) are the gold standard measures to assess cigarette dependence. However, FTCD reliability and factor structure have been questioned and HSI psychometric properties are in need of further investigation. The present study examined the psychometric properties of the FTCD and the HSI via Item Response Theory. The study was a secondary analysis of data collected in 862 Italian daily smokers. Confirmatory factor analysis was run to evaluate the dimensionality of FTCD. A Graded Response Model was applied to FTCD and HSI to verify the fit to the data. Both item and test functioning were analyzed and item statistics, Test Information Function, and scale reliabilities were calculated. Mokken Scale Analysis was applied to estimate homogeneity and Loevinger's coefficients were calculated. The FTCD showed unidimensionality and homogeneity for most of the items and for the total score. It also showed high sensitivity and good reliability from medium to high levels of cigarette dependence, although problems related to some items (i.e., items 3 and 5) were evident. HSI had good homogeneity, adequate item functioning, and high reliability from medium to high levels of cigarette dependence. Significant Differential Item Functioning was found for items 1, 4, 5 of the FTCD and for both items of HSI. HSI seems highly recommended in clinical settings addressed to heavy smokers while FTCD would be better used in smokers with a level of cigarette dependence ranging between low and high. Copyright © 2017 Elsevier Ltd. All rights reserved.
Meusen-Beekman, Kelly; Joosten-ten Brinke, Desirée; Boshuizen, Els
This article presents the results of a formative assessment intervention in writing assignments in sixth grade. We examined whether formative assessments would improve self-regulation, motivation and self-efficacy among sixth graders, and whether differential effects exist between formative
Lin, Chung-Ying; Griffiths, Mark D; Pakpour, Amir H
Background and aims: Research examining problematic mobile phone use has increased markedly over the past 5 years and has been related to "no mobile phone phobia" (so-called nomophobia). The 20-item Nomophobia Questionnaire (NMP-Q) is the only instrument that assesses nomophobia with an underlying theoretical structure and robust psychometric testing. This study aimed to confirm the construct validity of the Persian NMP-Q using Rasch and confirmatory factor analysis (CFA) models. Methods: After ensuring the linguistic validity, Rasch models were used to examine the unidimensionality of each Persian NMP-Q factor among 3,216 Iranian adolescents and CFAs were used to confirm its four-factor structure. Differential item functioning (DIF) and multigroup CFA were used to examine whether males and females interpreted the NMP-Q similarly, including item content and NMP-Q structure. Results: Each factor was unidimensional according to the Rasch findings, and the four-factor structure was supported by CFA. Two items did not quite fit the Rasch models (Item 14: "I would be nervous because I could not know if someone had tried to get a hold of me;" Item 9: "If I could not check my smartphone for a while, I would feel a desire to check it"). No DIF items were found across gender and measurement invariance was supported in multigroup CFA across gender. Conclusions: Due to the satisfactory psychometric properties, it is concluded that the Persian NMP-Q can be used to assess nomophobia among adolescents. Moreover, NMP-Q users may compare its scores between genders in the knowledge that there are no score differences contributed by different understandings of NMP-Q items.
This study developed a game-based formative assessment, called tic-tac-toe quiz for single-player version (TRIS-Q-SP), in an energy education e-learning system. This assessment game combined tic-tac-toe with online assessment, and revised the rule of tic-tac-toe for stimulating students to use online formative assessment actively. Additionally, to…
Hougaard, Jens Leth; Moulin, Hervé
We ask how to share the cost of finitely many public goods (items) among users with different needs: some smaller subsets of items are enough to serve the needs of each user, yet the cost of all items must be covered, even if this entails inefficiently paying for redundant items. Typical examples are network connectivity problems when an existing (possibly inefficient) network must be maintained. We axiomatize a family of cost ratios based on simple liability indices, one for each agent and for each item, measuring the relative worth of this item across agents, and generating cost allocation rules additive in costs.
Lewis, Crystal G; Herman, Keith C; Huang, Francis L; Stormont, Melissa; Grossman, Caroline; Eddy, Colleen; Reinke, Wendy M
This study examined the benefit of utilizing one-item academic and one-item behavior readiness teacher-rated screeners at the beginning of the school year to predict end-of-school year outcomes for middle school students. The Middle School Academic and Behavior Readiness (M-ABR) screeners were developed to provide an efficient and effective way to assess readiness in students. Participants included 889 students in 62 middle school classrooms in an urban Missouri school district. Concurrent validity with the M-ABR items and other indicators of readiness in the fall were evaluated using Pearson product-moment correlation coefficients, with the academic readiness item having medium to strong correlations with other baseline academic indicators (r=±0.56 to 0.91) and the behavior readiness item having low to strong correlations with baseline behavior items (r=±0.20 to 0.79). Next, the predictive validity of the M-ABR items was analyzed with hierarchical linear regressions using end-of-year outcomes as the dependent variable. The academic and behavior readiness items demonstrated adequate validity for all outcomes with moderate effects (β=±0.31 to 0.73 for academic outcomes and β=±0.24 to 0.59 for behavioral outcomes) after controlling for baseline demographics. Even after controlling for baseline scores, the M-ABR items predicted unique variance in almost all outcome variables. Four conditional probability indices were calculated to obtain an optimal cut score, to determine ready vs. not ready, for both single-item M-ABR scales. The cut point of "fair" yielded the most acceptable values for the indices. The odds ratios (OR) of experiencing negative outcomes given a "fair" or lower readiness rating (2 or below on the M-ABR screeners) at the beginning of the year were significant and strong for all outcomes (OR=2.29 to OR=14.46), except for internalizing problems. These findings suggest promise for using single readiness items to screen for varying negative end
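The odds ratios reported in the abstract above compare the odds of a negative outcome for students rated at or below the cut point with the odds for students rated above it. A minimal sketch of the standard 2x2 odds ratio follows; the function name and the counts are hypothetical and are not the study's data, and the study may have estimated its odds ratios from a regression model instead:

```python
def odds_ratio(
    low_with_outcome: int,
    low_without_outcome: int,
    high_with_outcome: int,
    high_without_outcome: int,
) -> float:
    """Odds ratio from a 2x2 table.

    "low" = rated at or below the cut point, "high" = rated above it.
    OR = (a / b) / (c / d) = (a * d) / (b * c).
    """
    if low_without_outcome == 0 or high_with_outcome == 0:
        raise ZeroDivisionError("odds ratio undefined when b or c is zero")
    return (low_with_outcome * high_without_outcome) / (
        low_without_outcome * high_with_outcome
    )


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 low-rated and 10 of 100 high-rated
    # students experience the negative outcome.
    print(odds_ratio(30, 70, 10, 90))
```

An odds ratio above 1 means the negative outcome is more likely, in odds terms, among students with the lower readiness rating.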
Armour, Cherie; Shevlin, Mark
The factor structure of posttraumatic stress disorder (PTSD) currently used by the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), has received limited support. A four-factor dysphoria model is widely supported. However, the dysphoria factor of this model has been hailed as a nonspecific factor of PTSD. The present study investigated the specificity of the dysphoria factor within the dysphoria model by conducting a confirmatory factor analysis while statistically controlling for the variance attributable to depression. The sample consisted of 429 individuals who met the diagnostic criteria for PTSD in the National Comorbidity Survey. The results concluded that there was no significant attenuation in any of the PTSD items. This finding is pertinent given several proposals for the removal of dysphoric items from the diagnostic criteria set of PTSD in the upcoming DSM-5.
Woods, D.D.; Roth, E.M.; Pople, H. Jr.
This paper describes a dynamic simulation capability for modeling how people form intentions to act in nuclear power plant emergency situations. This modeling tool, Cognitive Environment Simulation or CES, was developed based on techniques from artificial intelligence. It simulates the cognitive processes that determine situation assessment and intention formation. It can be used to investigate analytically what situations and factors lead to intention failures, what actions follow from intention failures (e.g. errors of omission, errors of commission, common mode errors), the ability to recover from errors or additional machine failures, and the effects of changes in the NPP person machine system. One application of the CES modeling environment is to enhance the measurement of the human contribution to risk in probabilistic risk assessment studies. (author)
BACKGROUND: Integration of information streams into a unitary representation is an important task of our cognitive system. Within working memory, the medial temporal lobe (MTL) has been conceptually linked to the maintenance of bound representations. In a previous fMRI study, we have shown that the MTL is indeed more active during working-memory maintenance of spatial associations as compared to non-spatial associations or single items. There are two explanations for this result: either the mere presence of the spatial component activates the MTL, or the MTL is recruited to bind associations between neurally non-overlapping representations. METHODOLOGY/PRINCIPAL FINDINGS: The current fMRI study investigates this issue further by directly comparing intrinsic intra-item binding (object/colour), extrinsic intra-item binding (object/location), and inter-item binding (object/object). The three binding conditions resulted in differential activation of brain regions. Specifically, we show that the MTL is important for establishing extrinsic intra-item associations and inter-item associations, in line with the notion that binding of information processed in different brain regions depends on the MTL. CONCLUSIONS/SIGNIFICANCE: Our findings indicate that different forms of working-memory binding rely on specific neural structures. In addition, these results extend previous reports indicating that the MTL is implicated in working-memory maintenance, challenging the classic distinction between short-term and long-term memory systems.
BACKGROUND: Pharmacology is the toughest subject in the II MBBS syllabus. Students have to memorise a lot about drug names and classifications. We conduct internal assessment exams after completion of each system. The failure rate exceeds 60% in the internal assessments conducted during the first six months of the II MBBS course. AIM: To assess the formative assessment pattern followed in our institution using the students’ feedback and to modify the pattern according to that feedback. SETTINGS & DESIGN: Prospective observational study conducted at the Department of Pharmacology, Government Sivagangai Medical College, Sivagangai, Tamil Nadu. MATERIALS AND METHODS: A questionnaire was prepared and distributed to the 300 students of Government Sivagangai Medical College and feedback was collected. The data collected were analysed in Microsoft Excel 2007. RESULTS: Feedback was received from 274 students. Most (80%) of the students wanted to attend the tests in all systems. A monthly assessment test was preferred by 47% of the students. Finishing tests before holidays was preferred by 57% of the students. Most (56%) of the students preferred tests lasting 1 hour. The multiple choice question (MCQ) type, which is not a routine question pattern, was preferred by 43%. Only 7% preferred viva. Recall-type questions were preferred by 41% of the students. CONCLUSION: In our institution, internal assessment is conducted in line with the students’ preferences. As the feedback largely matches the pattern generally followed, we will add MCQs in the forthcoming tests. Application-type questions will carry more marks than recall-type questions.
Pilkonis, Paul A; Kim, Yookyung; Yu, Lan; Morse, Jennifer Q
The Adult Attachment Ratings (AAR) include 3 scales for anxious, ambivalent attachment (excessive dependency, interpersonal ambivalence, and compulsive care-giving), 3 for avoidant attachment (rigid self-control, defensive separation, and emotional detachment), and 1 for secure attachment. The scales include items (ranging from 6-16 in their original form) scored by raters using a 3-point format (0 = absent, 1 = present, and 2 = strongly present) and summed to produce a total score. Item response theory (IRT) analyses were conducted with data from 414 participants recruited from psychiatric outpatient, medical, and community settings to identify the most informative items from each scale. The IRT results allowed us to shorten the scales to 5-item versions that are more precise and easier to rate because of their brevity. In general, the effective range of measurement for the scales was 0 to +2 SDs for each of the attachment constructs; that is, from average to high levels of attachment problems. Evidence for convergent and discriminant validity of the scales was investigated by comparing them with the Experiences of Close Relationships-Revised (ECR-R) scale and the Kobak Attachment Q-sort. The best consensus among self-reports on the ECR-R, informant ratings on the ECR-R, and expert judgments on the Q-sort and the AAR emerged for anxious, ambivalent attachment. Given the good psychometric characteristics of the scale for secure attachment, however, this measure alone might provide a simple alternative to more elaborate procedures for some measurement purposes. Conversion tables are provided for the 7 scales to facilitate transformation from raw scores to IRT-calibrated (theta) scores.
Asim, Alice E.; Ekuri, Emmanuel E.; Eni, Eni I.
Large class size is an issue in testing at all levels of Education. As a panacea to this, multiple choice test formats have become very popular. This case study was designed to diagnose pre-service teachers' competency in constructing questions (IQT); direct questions (DQT); and best answer (BAT) varieties of multiple choice items. Subjects were 88…
As the largest international study ever undertaken, the Trends in International Mathematics and Science Study (TIMSS) has been held as a benchmark to measure U.S. student performance in the global context. In-depth analyses of the TIMSS project are conducted in this study to examine key issues of the comparative investigation: (1) item flaws in mathematics…
Abdin, Edimansyah; Sagayadevan, Vathsala; Vaingankar, Janhavi Ajit; Picco, Louisa; Chong, Siow Ann; Subramaniam, Mythily
The validity of the CAGE using item response theory (IRT) has not yet been examined in the older adult population. This study aims to investigate the psychometric properties of the CAGE using both non-parametric and parametric IRT models, assess whether there is any differential item functioning (DIF) by age, gender and ethnicity and examine the measurement precision at the cut-off scores. We used data from the Well-being of the Singapore Elderly study to conduct Mokken scaling analysis (MSA), dichotomous Rasch and 2-parameter logistic IRT models. The measurement precision at the cut-off scores was evaluated using classification accuracy (CA) and classification consistency (CC). The MSA showed the overall scalability H index was 0.459, indicating a medium performing instrument. All items were found to be homogeneous, measuring the same construct and able to discriminate well between respondents with high levels of the construct and the ones with lower levels. The item discrimination ranged from 1.07 to 6.73 while the item difficulty ranged from 0.33 to 2.80. Significant DIF was found for two items across ethnic groups. More than 90% (CC and CA ranged from 92.5% to 94.3%) of the respondents were consistently and accurately classified by the CAGE cut-off scores of 2 and 3. The current study provides new evidence on the validity of the CAGE from the IRT perspective. This study provides valuable information on each item in the assessment of the overall severity of alcohol problems and the precision of the cut-off scores in the older adult population.
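The discrimination and difficulty values quoted in the abstract above are the two parameters of the 2-parameter logistic (2PL) model. A minimal sketch of the standard 2PL item response function follows; the function name is hypothetical, and the parameter values in the usage example are simply the range endpoints quoted in the abstract, not a pairing reported for any actual CAGE item:

```python
import math


def p_endorse_2pl(theta: float, discrimination: float, difficulty: float) -> float:
    """Probability of endorsing an item under the 2PL model.

    P(X = 1 | theta) = 1 / (1 + exp(-a * (theta - b))),
    with a = discrimination and b = difficulty.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


if __name__ == "__main__":
    # Illustrative only: a weakly and a strongly discriminating item,
    # evaluated at a trait level of theta = 1.0.
    for a, b in [(1.07, 0.33), (6.73, 2.80)]:
        print(a, b, p_endorse_2pl(1.0, a, b))
```

At theta equal to the item's difficulty the endorsement probability is exactly 0.5, and a larger discrimination makes the curve steeper around that point.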
Classic studies of visual short-term memory (VSTM) found that presenting memory items either sequentially or simultaneously does not affect recognition accuracy of the remembered items. Other studies also suggest that capacity of VSTM benefits from formation of bound object-based representations leading to no cost of remembering multi-feature items. According to these ideas, we aimed to examine the role of temporal and featural separation of memory items in VSTM change detection, (1) if sample items are separated across different temporal moments and (2) if across different feature dimensions. In a series of change detection experiments, we asked participants to report a change between a sample and a test display with a brief delay in between. In experiment 1, the sample items were split into two sets with a different onset time. In experiment 2, the sample items were split across two different feature dimensions (e.g., half color and half orientation). The change detection accuracy in Experiment 1 showed no substantial drop when the memory items were separated into two onset groups compared to simultaneous onset. The accuracy did not drop either when the features of sample items were split across two different feature groups compared to when they were not split. The results indicate that temporal and featural separation of VWM items does not play a significant role for VSTM-based change detection.
For a randomly renewed item the probability distributions of the time to failure and of the duration of down time and the expectations of these random variables are determined. Moreover, it is shown that the same theory applies to randomly checked items with exponential probability distribution of life such as electronic items. The case of periodic renewals is treated as an example.
Rosales, Roberto S; Martin-Hidalgo, Yolanda; Reboso-Morales, Luis; Atroshi, Isam
The purpose of this study was to assess the reliability and construct validity of the Spanish version of the 6-item carpal tunnel syndrome (CTS) symptoms scale (CTS-6). In this cross-sectional study 40 patients diagnosed with CTS based on clinical and neurophysiologic criteria completed the standard Spanish versions of the CTS-6 and the disabilities of the arm, shoulder and hand (QuickDASH) scales on two occasions with a 1-week interval. Internal-consistency reliability was assessed with the Cronbach alpha coefficient and test-retest reliability with the intraclass correlation coefficient, two way random effect model and absolute agreement definition (ICC2,1). Cross-sectional precision was analyzed with the Standard Error of the Measurement (SEM). Longitudinal precision for test-retest reliability coefficient was assessed with the Standard Error of the Measurement difference (SEMdiff) and the Minimal Detectable Change at 95 % confidence level (MDC95). For assessing construct validity it was hypothesized that the CTS-6 would have a strong positive correlation with the QuickDASH, analyzed with the Pearson correlation coefficient (r). The standard Spanish version of the CTS-6 presented a Cronbach alpha of 0.81 with a SEM of 0.3. Test-retest reliability showed an ICC of 0.85 with a SEMdiff of 0.36 and a MDC95 of 0.7. The correlation between CTS-6 and the QuickDASH was concordant with the a priori formulated construct hypothesis (r = 0.69). CONCLUSIONS: The standard Spanish version of the 6-item CTS symptoms scale showed good internal consistency, test-retest reliability and construct validity for outcomes assessment in CTS. The CTS-6 will be useful to clinicians and researchers in Spanish speaking parts of the world. The use of standardized outcome measures across countries also will facilitate comparison of research results in carpal tunnel syndrome.
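The relation between the SEMdiff and MDC95 values reported above can be checked with the conventional formula MDC95 = 1.96 × SEMdiff. A minimal sketch, assuming the authors used that conventional definition (the abstract does not state the formula); the function name is hypothetical:

```python
def mdc95(sem_diff: float, z: float = 1.96) -> float:
    """Minimal detectable change at the 95% confidence level.

    Assumption: the conventional definition MDC95 = z * SEMdiff with
    z = 1.96, where SEMdiff is the standard error of the difference
    between two measurements.
    """
    return z * sem_diff


if __name__ == "__main__":
    # With the SEMdiff of 0.36 quoted in the abstract, this gives about
    # 0.71, in line with the reported MDC95 of 0.7.
    print(mdc95(0.36))
```

Under this reading, a change in an individual patient's CTS-6 score smaller than about 0.7 points cannot be distinguished from measurement error at the 95% level.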
Gaskin, Cadeyrn J; Lambert, Sylvie D; Bowe, Steven J; Orellana, Liliana
Sample selection can substantially affect the solutions generated using exploratory factor analysis. Validation studies of the 12-item World Health Organization (WHO) Disability Assessment Schedule 2.0 (WHODAS 2.0) have generally involved samples in which substantial proportions of people had no, or minimal, disability. With the WHODAS 2.0 oriented towards measuring disability across six life domains (cognition, mobility, self-care, getting along, life activities, and participation in society), performing factor analysis with samples of people with disability may be more appropriate. We determined the influence of the sampling strategy on (a) the number of factors extracted and (b) the factor structure of the WHODAS 2.0. Using data from adults aged 50+ from the six countries in Wave 1 of the WHO's longitudinal Study on global AGEing and adult health (SAGE), we repeatedly selected samples (n = 750) using two strategies: (1) simple random sampling that reproduced nationally representative distributions of WHODAS 2.0 summary scores for each country (i.e., positively skewed distributions with many zero scores indicating the absence of disability), and (2) stratified random sampling with weights designed to obtain approximately symmetric distributions of summary scores for each country (i.e., predominantly including people with varying degrees of disability). Samples with skewed distributions typically produced one-factor solutions, except for the two countries with the lowest percentages of zero scores, in which the majority of samples produced two factors. Samples with approximately symmetric distributions generally produced two- or three-factor solutions. In the two-factor solutions, the getting along domain items loaded on one factor (commonly with a cognition domain item), with the remaining items loading on a second factor. In the three-factor solutions, the getting along and self-care domain items loaded separately on two factors and three other domains
Janson, David C.
This descriptive study is addressed to policy-makers, textbook publishers, teachers, principals, and curriculum directors. It compares the assessment practices of ten elementary teachers over a period of 11 weeks with Ohio's fourth and sixth grade science Proficiency Tests. Results show that the teachers' assessment practices were not aligned with Ohio's Proficiency Test. The tests used in the participants' classrooms contained a disproportionate number of items characterized as low-level in terms of their cognitive function. Classroom test items generally fell into three categories: true/false, completion, and matching. The remaining items were predominantly low-level multiple-choice items requiring simple recall of information. The teachers in this study showed a heavy reliance on the packaged assessments that accompanied their adopted textbook series, with little use of teacher-designed instruments. This differs from the findings of previous researchers who reported that most teacher assessments were done with teacher-made tests. The lack of alignment between classroom tests and Ohio's Proficiency Test is a concern because previous researchers and the teachers in this study believe that aligning classroom tests with high-stakes assessment improves student performance. Other research shows that teachers teach what they test, suggesting that the curriculum would be better aligned with State expectations if classroom tests were more in line with the proficiency tests. This study found that textbooks and their assessment packages are not aligned to most state standards and that teachers need help developing better assessments. The results of this study suggest directions school administrators might take to facilitate inservice training for current teachers and could be helpful to textbook publishers as well as educators serving on adoption committees. Since high-stakes testing of students in the nation's public schools and school accountability seem destined to remain a
Levine, Stephen Z; Rabinowitz, Jonathan; Rizopoulos, Dimitris
The adequacy of the Positive and Negative Syndrome Scale (PANSS) items in measuring symptom severity in schizophrenia was examined using Item Response Theory (IRT). Baseline PANSS assessments were analyzed from two multi-center clinical trials of antipsychotic medication in chronic schizophrenia (n=1872). Generally, the results showed that the PANSS (a) item ratings discriminated symptom severity best for the negative symptoms; (b) has an excess of "Severe" and "Extremely severe" rating options; and (c) assessments are more reliable at medium than very low or high levels of symptom severity. Analysis also showed that the detection of statistically and non-statistically significant differences in treatment were highly similar for the original and IRT-modified PANSS. In clinical trials of chronic schizophrenia, the PANSS appears to require the following modifications: fewer rating options, adjustment of 'Lack of judgment and insight', and improved severe symptom assessment. 2011 Elsevier Ltd. All rights reserved.
Holden, Libby; Lee, Christina; Hockey, Richard; Ware, Robert S; Dobson, Annette J
This study aimed to validate a 6-item 1-factor global measure of social support developed from the Medical Outcomes Study Social Support Survey (MOS-SSS) for use in large epidemiological studies. Data were obtained from two large population-based samples of participants in the Australian Longitudinal Study on Women's Health. The two cohorts were aged 53-58 and 28-33 years at data collection (N = 10,616 and 8,977, respectively). Items selected for the 6-item 1-factor measure were derived from the factor structure obtained from unpublished work using an earlier wave of data from one of these cohorts. Descriptive statistics, including polychoric correlations, were used to describe the abbreviated scale. Cronbach's alpha was used to assess internal consistency and confirmatory factor analysis to assess scale validity. Concurrent validity was assessed using correlations between the new 6-item version and established 19-item version, and other concurrent variables. In both cohorts, the new 6-item 1-factor measure showed strong internal consistency and scale reliability. It had excellent goodness-of-fit indices, similar to those of the established 19-item measure. Both versions correlated similarly with concurrent measures. The 6-item 1-factor MOS-SSS measures global functional social support with fewer items than the established 19-item measure.
The present study used a cohort-sequential design to examine developmental changes in children's ability to bind items in memory during early and middle childhood. Three cohorts of children (aged 4, 6, or 8 years) were followed longitudinally for 3 years. Each year, children completed a source memory paradigm assessing memory for items and…
T. O. Tolstykh
As the economy digitalizes and becomes more closely integrated with policy and society, questions about the formation and development of the corporate culture of a learning organization are of particular relevance. The digital transformation of business dictates the need for learning organizations that create and preserve knowledge to emerge and develop. Because the question of how to assess the efficiency of these formation and development processes remains open, the proposed research is important. Most scholars regard corporate culture as the most important internal resource of an organization, able to provide it with stability in a crisis and to give impetus to its development and its transition to qualitatively different levels of the life cycle. This position assumes that a strong corporate culture should be aimed at building a learning organization that can adapt quickly to changes in the external and internal environment. This article examines the assessment of the efficiency of corporate culture and shows that, in addition to empirical and sociological methods and a qualitative approach to evaluation, an investment approach is acceptable. This option arises when target-oriented and project management methods are used together, which allows the formation and development of corporate culture to be carried out systematically. The assessment should be applied to the software development activities and (or) the development of the corporate culture of a learning organization. To support these conclusions, the economic efficiency of a program for forming the corporate culture of a learning organization is calculated using agricultural companies as an example. Net discounted income, the net present value of the project, the profitability index, project profitability, and the payback period are calculated. This confirms the social and economic effects of the proposed program on the formation of corporate culture of independent
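The investment indicators listed in this abstract can be illustrated with a short calculation. This is a minimal sketch using textbook definitions of net present value, profitability index, and simple payback period; the cash flows and the discount rate are hypothetical and are not taken from the article, and the function names are illustrative.

```python
def npv(rate, investment, inflows):
    """Net present value: discounted inflows (periods 1..n) minus the initial investment."""
    discounted = sum(cf / (1.0 + rate) ** t for t, cf in enumerate(inflows, start=1))
    return discounted - investment

def profitability_index(rate, investment, inflows):
    """Present value of inflows divided by the initial investment."""
    return (npv(rate, investment, inflows) + investment) / investment

def payback_period(investment, inflows):
    """First period in which cumulative undiscounted inflows cover the investment, else None."""
    cumulative = 0.0
    for t, cf in enumerate(inflows, start=1):
        cumulative += cf
        if cumulative >= investment:
            return t
    return None

# Hypothetical program: invest 100 now, receive 60 at the end of each of two periods
print(round(npv(0.10, 100.0, [60.0, 60.0]), 2))
print(round(profitability_index(0.10, 100.0, [60.0, 60.0]), 3))
print(payback_period(100.0, [60.0, 60.0]))
```

A positive net present value and a profitability index above 1 would indicate, under these definitions, that the hypothetical program recovers more than its discounted cost.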
Irwin Debra E
Background: Pediatric self-report should be considered the standard for measuring patient reported outcomes (PRO) among children. However, circumstances exist when the child is too young, cognitively impaired, or too ill to complete a PRO instrument and a proxy-report is needed. This paper describes the development process, including the proxy cognitive interviews, the large-field-test survey methods, and the sample characteristics, employed to produce item parameters for the Patient Reported Outcomes Measurement Information System (PROMIS) pediatric proxy-report item banks. Methods: The PROMIS pediatric self-report items were converted into proxy-report items before undergoing cognitive interviews. These items covered six domains (physical function, emotional distress, social peer relationships, fatigue, pain interference, and asthma impact). Caregivers (n = 25) of children aged 5 to 17 years provided qualitative feedback on proxy-report items to assess any major issues with these items. From May 2008 to March 2009, the large-scale survey enrolled children ages 8-17 years to complete the self-report version and caregivers to complete the proxy-report version of the survey (n = 1548 dyads). Caregivers of children ages 5 to 7 years completed the proxy-report survey (n = 432). In addition, caregivers completed other proxy instruments: PedsQL™ 4.0 Generic Core Scales Parent Proxy-Report version, PedsQL™ Asthma Module Parent Proxy-Report version, and KIDSCREEN Parent-Proxy-52. Results: Item content was well understood by proxies and did not require item revisions, but some proxies clearly noted that determining an answer on behalf of their child was difficult for some items. Dyads and caregivers of children ages 5-17 years old were enrolled in the large-scale testing. The majority were female (85%), married (70%), Caucasian (64%) and had at least a high school education (94%). Approximately 50% had children with a chronic health condition, primarily
Kelly D. Meusen-Beekman
Fostering self-regulated learning (SRL) has become increasingly important at various educational levels. Most studies on SRL have been conducted in higher education. The present literature study aims toward understanding self-regulation processes of students in primary and secondary education. We explored the development of young students’ self-regulation from a theoretical perspective. In addition, effective characteristics for an intervention to develop young students’ self-regulation were examined, as well as the possibilities of implementing formative assessments in primary education to develop self-regulation. The results show that SRL can be supported in both primary and secondary education. However, at both school levels, differences were found, regarding the theoretical background of the training and the type of instructed strategy. Studies so far suggest avenues toward formative assessment, which seems to be a unifying theory of instruction that improves the learning process by developing self-regulation among students. But gaps in knowledge about the impact of formative assessments on the development of SRL strategies among primary school students require further exploration.
Papadomichelaki, Xenia; Mentzas, Gregoris
A critical element in the evolution of e-governmental services is the development of sites that better serve the citizens’ needs. To deliver superior service quality, we must first understand how citizens perceive and evaluate online citizen service. This involves defining what e-government service quality is, identifying its underlying dimensions, and determining how it can be conceptualized and measured. In this article we conceptualise an e-government service quality model (e-GovQual) and then we develop, refine, validate, confirm and test a multiple-item scale for measuring e-government service quality for public administration sites where citizens seek either information or services.
Covitt, Beth A.; Gunckel, Kristin L.; Caplan, Bess; Syswerda, Sara
While learning progressions (LPs) hold promise as instructional tools, researchers are still in the early stages of understanding how teachers use LPs in formative assessment practices. We report on a study that assessed teachers' proficiency in using a LP for student ideas about hydrologic systems. Research questions were: (a) what were teachers'…
Alphs, Larry; Morlock, Robert; Coon, Cheryl; Cazorla, Pilar; Szegedi, Armin; Panagides, John
The 16-item Negative Symptom Assessment (NSA-16) scale is a validated tool for evaluating negative symptoms of schizophrenia. The psychometric properties and predictive power of a four-item version (NSA-4) were compared with the NSA-16. Baseline data from 561 patients with predominant negative symptoms of schizophrenia who participated in two identically designed clinical trials were evaluated. Ordered logistic regression analysis of ratings using NSA-4 and NSA-16 were compared with ratings using several other standard tools to determine predictive validity and construct validity. Internal consistency and test-retest reliability were also analyzed. NSA-16 and NSA-4 scores were both predictive of scores on the NSA global rating (odds ratio = 0.83-0.86) and the Clinical Global Impressions-Severity scale (odds ratio = 0.91-0.93). NSA-16 and NSA-4 showed high correlation with each other (Pearson r = 0.85), similar high correlation with other measures of negative symptoms (demonstrating convergent validity), and lesser correlations with measures of other forms of psychopathology (demonstrating divergent validity). NSA-16 and NSA-4 both showed acceptable internal consistency (Cronbach α, 0.85 and 0.64, respectively) and test-retest reliability (intraclass correlation coefficient, 0.87 and 0.82). This study demonstrates that NSA-4 offers accuracy comparable to the NSA-16 in rating negative symptoms in patients with schizophrenia. Copyright © 2011 John Wiley & Sons, Ltd.
Romero-Martín, M. Rosario; Castejón-Oliva, Francisco-Javier; López-Pastor, Víctor-Manuel; Fraile-Aranda, Antonio
The purpose of this study is to analyze the perception of students, graduates, and lecturers in relation to systems of formative and shared assessment and to the acquisition of teaching competences regarding communication and the use of Information and Communications Technology (ICT) in initial teacher education (ITE) on degrees in Primary…
... 17 Commodity and Securities Exchanges 3 2010-04-01 2010-04-01 false Inclusion of items, differentiation between items and answers, omission of instructions. 260.7a-16 Section 260.7a-16 Commodity and... INDENTURE ACT OF 1939 Formal Requirements § 260.7a-16 Inclusion of items, differentiation between items and...
Wilt, Joshua; Revelle, William
Personality psychology is concerned with affect (A), behavior (B), cognition (C) and desire (D), and personality traits have been defined conceptually as abstractions used to either explain or summarize coherent ABC (and sometimes D) patterns over time and space. However, this conceptual definition of traits has not been reflected in their operationalization, possibly resulting in theoretical and practical limitations to current trait inventories. Thus, the goal of this project was to determine the affective, behavioral, cognitive and desire (ABCD) components of Big-Five personality traits. The first study assessed the ABCD content of items measuring Big-Five traits in order to determine the ABCD composition of traits and identify items measuring relatively high amounts of only one ABCD content. The second study examined the correlational structure of scales constructed from items assessing ABCD content via a large, web-based study. An assessment of Big-Five traits that delineates ABCD components of each trait is presented, and the discussion focuses on how this assessment builds upon current approaches of assessing personality. PMID:26279606
Sekely, Angela; Taylor, Graeme J; Bagby, R Michael
The Toronto Structured Interview for Alexithymia (TSIA) was developed to provide a structured interview method for assessing alexithymia. One drawback of this instrument is the amount of time it takes to administer and score. The current study used item response theory (IRT) methods to analyze data from a large heterogeneous multi-language sample (N = 842) to investigate whether a subset of items could be selected to create a short version of the instrument. Samejima's (1969) graded response model was used to fit the item responses. Items providing maximum information were retained in the short model, resulting in the elimination of 12 items from the original 24 items. Despite the 50% reduction in the number of items, 65.22% of the information was retained. Further studies are needed to validate the short version. A short version of the TSIA is potentially of practical value to clinicians and researchers with time constraints. Copyright © 2018. Published by Elsevier B.V.
Garcia-Campayo, Javier; Navarro-Gil, Mayte; Andrés, Eva; Montero-Marin, Jesús; López-Artal, Lorena; Demarzo, Marcelo Marcos Piva
Self-compassion is a key psychological construct for assessing clinical outcomes in mindfulness-based interventions. The aim of this study was to validate the Spanish versions of the long (26 item) and short (12 item) forms of the Self-Compassion Scale (SCS). The translated Spanish versions of both subscales were administered to two independent samples: Sample 1 was comprised of university students (n = 268) who were recruited to validate the long form, and Sample 2 was comprised of Aragon Health Service workers (n = 271) who were recruited to validate the short form. In addition to SCS, the Mindful Attention Awareness Scale (MAAS), the State-Trait Anxiety Inventory-Trait (STAI-T), the Beck Depression Inventory (BDI) and the Perceived Stress Questionnaire (PSQ) were administered. Construct validity, internal consistency, test-retest reliability and convergent validity were tested. The Confirmatory Factor Analysis (CFA) of the long and short forms of the SCS confirmed the original six-factor model in both scales, showing goodness of fit. Cronbach's α for the 26 item SCS was 0.87 (95% CI = 0.85-0.90) and ranged between 0.72 and 0.79 for the 6 subscales. Cronbach's α for the 12-item SCS was 0.85 (95% CI = 0.81-0.88) and ranged between 0.71 and 0.77 for the 6 subscales. The long (26-item) form of the SCS showed a test-retest coefficient of 0.92 (95% CI = 0.89-0.94). The Intraclass Correlation (ICC) for the 6 subscales ranged from 0.84 to 0.93. The short (12-item) form of the SCS showed a test-retest coefficient of 0.89 (95% CI: 0.87-0.93). The ICC for the 6 subscales ranged from 0.79 to 0.91. The long and short forms of the SCS exhibited a significant negative correlation with the BDI, the STAI and the PSQ, and a significant positive correlation with the MAAS. The correlation between the total score of the long and short SCS form was r = 0.92. The Spanish versions of the long (26-item) and short (12-item) forms of the SCS are valid and
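Cronbach's α, the internal-consistency statistic reported throughout this abstract, can be computed directly from an item-score matrix. The sketch below assumes the standard formula α = k/(k-1) * (1 - sum of item variances / variance of total scores); the response data are invented for illustration and are unrelated to the SCS samples.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical responses of five people to a three-item scale
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
    [3, 3, 4],
]
print(round(cronbach_alpha(responses), 2))
```

The function needs at least two items and at least two respondents, and the total scores must not all be equal, otherwise the variance in the denominator is zero.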
Chan, An-Wen; Tetzlaff, Jennifer M; Altman, Douglas G; Laupacis, Andreas; Gøtzsche, Peter C; Krle A-Jerić, Karmela; Hrobjartsson, Asbjørn; Mann, Howard; Dickersin, Kay; Berlin, Jesse A; Dore, Caroline J; Parulekar, Wendy R; Summerskill, William S M; Groves, Trish; Schulz, Kenneth F; Sox, Harold C; Rockhold, Frank W; Rennie, Drummond; Moher, David
The protocol of a clinical trial serves as the foundation for study planning, conduct, reporting, and appraisal. However, trial protocols and existing protocol guidelines vary greatly in content and quality. This article describes the systematic development and scope of SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) 2013, a guideline for the minimum content of a clinical trial protocol. The 33-item SPIRIT checklist applies to protocols for all clinical trials and focuses on content rather than format. The checklist recommends a full description of what is planned; it does not prescribe how to design or conduct a trial. By providing guidance for key content, the SPIRIT recommendations aim to facilitate the drafting of high-quality protocols. Adherence to SPIRIT would also enhance the transparency and completeness of trial protocols for the benefit of investigators, trial participants, patients, sponsors, funders, research ethics committees or institutional review boards, peer reviewers, journals, trial registries, policymakers, regulators, and other key stakeholders.
van Rooij, Antonius J; Van Looy, Jan; Billieux, Joël
Some people have serious problems controlling their Internet and video game use. The DSM-5 now includes a proposal for 'Internet Gaming Disorder' (IGD) as a condition in need of further study. Various studies aim to validate the proposed diagnostic criteria for IGD and multiple new scales have been introduced that cover the suggested criteria. Using a structured approach, we demonstrate that IGD might be better interpreted as a formative construct, as opposed to the current practice of conceptualizing it as a reflective construct. Incorrectly approaching a formative construct as a reflective one causes serious problems in scale development, including: (i) incorrect reliance on item-to-total scale correlation to exclude items and incorrectly relying on indices of inter-item reliability that do not fit the measurement model (e.g., Cronbach's α); (ii) incorrect interpretation of composite or mean scores that assume all items are equal in contributing value to a sum score; and (iii) biased estimation of model parameters in statistical models. We show that these issues are impacting current validation efforts through two recent examples. A reinterpretation of IGD as a formative construct has broad consequences for current validation efforts and provides opportunities to reanalyze existing data. We discuss three broad implications for current research: (i) composite latent constructs should be defined and used in models; (ii) item exclusion and selection should not rely on item-to-total scale correlations; and (iii) existing definitions of IGD should be enriched further. © 2016 The Authors. Psychiatry and Clinical Neurosciences © 2016 Japanese Society of Psychiatry and Neurology.
Sahin, Alper; Anil, Duygu
This study investigates the effects of sample size and test length on item-parameter estimation in test development utilizing three unidimensional dichotomous models of item response theory (IRT). For this purpose, a real language test comprised of 50 items was administered to 6,288 students. Data from this test was used to obtain data sets of…
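The abstract does not name the three unidimensional dichotomous models. As an illustration only, the sketch below assumes they are the familiar one-, two-, and three-parameter logistic models (1PL, 2PL, 3PL), which can all be written as one item response function; the parameter names follow common IRT notation and are not taken from the study.

```python
import math

def irt_prob(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: person ability, b: item difficulty,
    a: item discrimination (a = 1 gives a 1PL-style curve),
    c: lower asymptote for guessing (c = 0 gives the 2PL model).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A person whose ability equals the item difficulty
print(irt_prob(0.0, 0.0))                # 1PL/2PL: 0.5
print(irt_prob(0.0, 0.0, a=2.0, c=0.2))  # 3PL with a guessing floor of 0.2
```

The more parameters a model estimates per item, the more response data it typically needs, which is one reason sample size and test length matter for parameter estimation.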
Abdallah, Basem; Ditzel, Nicholas; Kassem, Moustapha
In vivo assessment of bone formation (osteogenesis) potential by isolated cells is an important method for analysis of cells and factors controlling bone formation. Currently, cell implantation mixed with hydroxyapatite/tricalcium phosphate in an open system (subcutaneous implantation) in immun...
Takano, Ken-ichi; Hasegawa, Naoko; Hirose, Ayako; Hayase, Ken-ichi
For the past five years, CRIEPI has continued its efforts to develop and apply a 'safety assessment system' that makes it possible to measure the safety level of an organization. This report describes the framework of the system, the assessment results and their reliability, and the relation between the labor accident rate at the site and the total safety index (TSI), which can be obtained by principal factor analysis. The safety assessment in this report is based on a questionnaire survey of employees. The format and the concrete questions were developed using existing literature, including organizational assessment tools. The tailored questionnaire format comprised 124 items. The assessment results could be considered a good indicator of the safety level of the organization, its safety management, and the safety awareness of employees. (author)
Amber D. Dumford
As society’s needs for quantitative skills become more prevalent, college graduates require quantitative skills regardless of their career choices. Therefore, it is important that institutions assess students’ engagement in quantitative activities during college. This study chronicles the process taken by the National Survey of Student Engagement (NSSE) to develop items that measure students’ participation in quantitative reasoning (QR) activities. On the whole, findings across the quantitative and qualitative analyses suggest good overall properties for the developed QR items. The items show great promise to explore and evaluate the frequency with which college students participate in QR-related activities. Each year, hundreds of institutions across the United States and Canada participate in NSSE, and, with the addition of these new items on the core survey, every participating institution will have information on this topic. Our hope is that these items will spur conversations on campuses about students’ use of quantitative reasoning activities.
Williams, Stacy A. S.; Stenglein, Katherine
In order for school psychologists to effectively work with teachers, it is important to understand not only the context in which they work, but to understand how educators consider and subsequently use data. Therefore, the purpose of this article is to examine how formative assessments are conceptualized in teacher training and pedagogical…
Hamane, Ryoso; Itoh, Toshiya; Tomita, Kouhei
When a store sells items to customers, the store wishes to determine the prices of the items to maximize its profit. Intuitively, if the store sells the items with low (resp. high) prices, the customers buy more (resp. fewer) items, which provides less profit to the store. So it would be hard for the store to decide the prices of items. Assume that the store has a set V of n items and there is a set E of m customers who wish to buy those items, and also assume that each item i ∈ V has the production cost d_i and each customer e_j ∈ E has the valuation v_j on the bundle e_j ⊆ V of items. When the store sells an item i ∈ V at the price r_i, the profit for the item i is p_i = r_i - d_i. The goal of the store is to decide the price of each item to maximize its total profit. We refer to this maximization problem as the item pricing problem. In most previous work, the item pricing problem was considered under the assumption that p_i ≥ 0 for each i ∈ V; however, Balcan et al. [In Proc. of WINE, LNCS 4858, 2007] introduced the notion of a “loss-leader,” and showed that the seller can get more total profit in the case that p_i < 0 is allowed than in the case that p_i < 0 is not allowed. In this paper, we derive approximation preserving reductions among several item pricing problems and show that all of them have algorithms with good approximation ratio.
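The objective of the item pricing problem can be made concrete by evaluating the total profit of one fixed price vector. The sketch below assumes the usual single-minded purchase rule, which the abstract does not spell out: a customer buys the whole bundle exactly when its total price does not exceed the customer's valuation. Item names, costs, prices, and valuations are hypothetical.

```python
def total_profit(costs, prices, customers):
    """Total profit of a price vector.

    costs, prices: dicts mapping item -> production cost d_i / price r_i.
    customers: list of (bundle, valuation) pairs, bundle being a set of items.
    Assumed rule: a customer buys the bundle iff its total price <= valuation.
    """
    profit = 0.0
    for bundle, valuation in customers:
        bundle_price = sum(prices[i] for i in bundle)
        if bundle_price <= valuation:
            # Per-item profit p_i = r_i - d_i, which may be negative (loss leader)
            profit += sum(prices[i] - costs[i] for i in bundle)
    return profit

costs = {"a": 1.0, "b": 2.0}
prices = {"a": 3.0, "b": 3.0}
customers = [({"a"}, 4.0), ({"a", "b"}, 5.0), ({"b"}, 3.0)]
print(total_profit(costs, prices, customers))  # 3.0
```

Here the first and third customers buy (profits 2 and 1) while the second finds the bundle price of 6 above the valuation of 5. An optimization algorithm would search over price vectors, with or without the constraint that every per-item profit is nonnegative.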
Welch, Karen P.
Formative assessment has been identified as an effective pedagogical practice in the field of education, where teachers and students engage daily in an interactive process to gather evidence of the students' proficiency of a specific learning goal. The evidence collected by the teacher and a student during the formative assessment process allows…
Park, Mihwa; Liu, Xiufeng; Smith, Erica; Waight, Noemi
This study reports the effect of computer models as formative assessment on high school students' understanding of the nature of models. Nine high school teachers integrated computer models and associated formative assessments into their yearlong high school chemistry course. A pre-test and post-test of students' understanding of the nature of…
Petersen, Morten Aa.; Gamper, Eva-Maria; Costantini, Anna
of the widely used EORTC Quality of Life questionnaire (QLQ-C30). STUDY DESIGN AND SETTING: On the basis of literature search and evaluations by international samples of experts and cancer patients, 38 candidate items were developed. The psychometric properties of the items were evaluated in a large...... international sample of cancer patients. This included evaluations of dimensionality, item response theory (IRT) model fit, differential item functioning (DIF), and of measurement precision/statistical power. RESULTS: Responses were obtained from 1,023 cancer patients from four countries. The evaluations showed...... that 24 items could be included in a unidimensional IRT model. DIF did not seem to have any significant impact on the estimation of EF. Evaluations indicated that the CAT measure may reduce sample size requirements by up to 50% compared to the QLQ-C30 EF scale without reducing power. CONCLUSION...
Describes a method for assessing the quality of translations based on item response theory (IRT). Results from the IRT technique with French and Chinese versions of a scale measuring individualism-collectivism for samples of 250 U.S., 357 French, and 290 Chinese undergraduates show how several biased items are detected. (SLD)
Sheldon, Signy; Levine, Brian
During autobiographical memory retrieval, the medial temporal lobes (MTL) relate together multiple event elements, including object (within-item relations) and context (item-context relations) information, to create a cohesive memory. There is consistent support for a functional specialization within the MTL according to these relational processes, much of which comes from recognition memory experiments. In this study, we compared brain activation patterns associated with retrieving within-item relations (i.e., associating conceptual and sensory-perceptual object features) and item-context relations (i.e., spatial relations among objects) with respect to naturalistic autobiographical retrieval. We developed a novel paradigm that cued participants to retrieve information about past autobiographical events, non-episodic within-item relations, and non-episodic item-context relations with the perceptuomotor aspects of retrieval equated across these conditions. We used multivariate analysis techniques to extract common and distinct patterns of activity among these conditions within the MTL and across the whole brain, both in terms of spatial and temporal patterns of activity. The anterior MTL (perirhinal cortex and anterior hippocampus) was preferentially recruited for generating within-item relations later in retrieval whereas the posterior MTL (posterior parahippocampal cortex and posterior hippocampus) was preferentially recruited for generating item-context relations across the retrieval phase. These findings provide novel evidence for functional specialization within the MTL with respect to naturalistic memory retrieval. © 2015 Wiley Periodicals, Inc.
Paz, Sylvia H; Spritzer, Karen L; Reise, Steven P; Hays, Ron D
About 70% of Latinos, 5 years old or older, in the United States speak Spanish at home. Measurement equivalence of the PROMIS® pain interference (PI) item bank by language of administration (English versus Spanish) has not been evaluated. A sample of 527 adult Spanish-speaking Latinos completed the Spanish version of the 41-item PROMIS® pain interference item bank. We evaluated dimensionality, monotonicity and local independence of the Spanish-language items. Then we evaluated differential item functioning (DIF) using ordinal logistic regression with item response theory scores estimated from DIF-free "anchor" items. One of the 41 items in the Spanish version of the PROMIS® PI item bank was identified as having significant uniform DIF. English- and Spanish-speaking subjects with the same level of pain interference responded differently to 1 of the 41 items in the PROMIS® PI item bank. This item was not retained due to proprietary issues. The original English language item parameters can be used when estimating PROMIS® PI scores.
Stevenson, Claire E.; Heiser, Willem J.; Resing, Wilma C. M.
Multiple-choice (MC) analogy items are often used in cognitive assessment. However, in dynamic testing, where the aim is to provide insight into potential for learning and the learning process, constructed-response (CR) items may be of benefit. This study investigated whether training with CR or MC items leads to differences in the strategy…
Ariel, A.; van der Linden, Willem J.; Veldkamp, Bernard P.
Item-pool management requires a balancing act between the input of new items into the pool and the output of tests assembled from it. A strategy for optimizing item-pool management is presented that is based on the idea of a periodic update of an optimal blueprint for the item pool to tune item
Areiza Restrepo Hugo Nelson
This article presents a partial report of a small qualitative research study that explored the students’ views of their learning during and after the implementation of formative procedures such as self-assessment, feedback, and conferences. The article also includes their perceptions about this implementation. The research was carried out with a group of students of English enrolled in an extension program of a Colombian public university. The results showed that formative assessment helped these learners to be aware of their communicative competence and to perceive the situations in which they developed this awareness; it also enabled them to experience success in their learning. Also, learners identified the purposes of this kind of assessment and perceived formative assessment as a transparent procedure.
Uhrskov Sørensen, Lisbeth; Foldspang, Anders; Gulmann, Nils Christian
psychiatrist. The two assessments were mutually blinded. Multiple conditional forward logistic regression was used to select the items that most strongly predicted organic disorder as assessed by the psychiatrist. The weighted score had significantly better validity parameters, performed better on a receiver...
Fermino Fernandes Sisto
This study sought evidence of construct validity by analyzing differential item functioning in a measure of aggressiveness. The participants were 445 college students of both genders enrolled in Engineering, Computing and Psychology courses. The 81-item aggressiveness scale was administered collectively, in the classroom, to the students who consented to participate in the study. The items of the instrument were analyzed by means of the Rasch model. Twenty-eight items presented differential item functioning: 15 were characterized as typical for females and 13 for males. The reliability coefficients were 0.99 for the items and 0.86 for the persons. It was concluded that aggressiveness can be measured separately on the basis of gender.
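As an illustration of the Rasch-based reasoning in this abstract, the following is a minimal sketch and not the authors' analysis. It assumes the dichotomous form of the Rasch model (the abstract does not state the response format), and the function names and the two group-specific difficulty values are hypothetical.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch model.

    theta is the person's trait level and difficulty is the item location,
    both in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def dif_contrast(difficulty_group_a: float, difficulty_group_b: float) -> float:
    """Difference in item difficulty (logits) when the item is calibrated per group."""
    return difficulty_group_a - difficulty_group_b


# Hypothetical difficulties for one item, estimated separately for each gender.
b_female, b_male = -0.40, 0.35

contrast = dif_contrast(b_female, b_male)
# Two respondents with the same trait level (theta = 0) but different groups.
p_female = rasch_probability(0.0, b_female)
p_male = rasch_probability(0.0, b_male)

print(f"DIF contrast: {contrast:.2f} logits")
print(f"P(endorse | female) = {p_female:.3f}, P(endorse | male) = {p_male:.3f}")
```

An item shows differential functioning when respondents with the same trait level have different endorsement probabilities depending on group membership, which is what a non-zero contrast expresses here.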
Granziol, Umberto; Spoto, Andrea; Vidotto, Giulio
The nonverbal behavior (NVB) of people diagnosed with schizophrenia consistently interacts with their symptoms during the assessment. Previous studies frequently observed such an interaction when a prevalence of negative symptoms occurred. Nonetheless, a list of NVBs linked to negative symptoms needs to be defined. Furthermore, a list of items that can exhaustively assess such NVBs is still needed. The present study aims to introduce both lists by using the Formal Psychological Assessment. A deep analysis was performed on both the scientific literature and the DSM-5 for constructing the set of nonverbal behaviors; similarly, an initial list of 138 items investigating the behaviors was obtained from instruments used to assess schizophrenia. The Formal Psychological Assessment was then applied to reduce the preliminary list. A final list of 23 items necessary and sufficient to investigate the NVBs emerged. The list also allowed us to analyze specific relations among items. The present study shows how it is possible to deepen a patient's negative symptomatology, starting with the relations between items and the NVBs they investigate. Finally, this study examines the advantages and clinical implications of defining an assessment tool based on the found list of items. Copyright © 2017 John Wiley & Sons, Ltd.
Scott, Neil W; Fayers, Peter M; Aaronson, Neil K
Differential item functioning (DIF) analyses are commonly used to evaluate health-related quality of life (HRQoL) instruments. There is, however, a lack of consensus as to how to assess the practical impact of statistically significant DIF results....
Radiological assessments of the disposal of radioactive waste in evaporite formations, principally halite, have been reviewed. These assessments were carried out in the USA, the Netherlands, Denmark and West Germany. The general nature of evaporite formations in the UK is discussed and comments are given on the broad relevance of the assessments to the potential disposal of radioactive waste in UK evaporite formations. (author)
Solheim, Elisabeth; Plathe, Hilde Syvertsen; Eide, Hilde
Clinical skills training is an important part of nurses' education programmes. Clinical skills are complex. A common understanding of what characterizes clinical skills and learning outcomes needs to be established. The aim of the study was to develop and evaluate a new reflection and feedback tool for formative assessment. The study has a descriptive quantitative design. The participants were 129 students at the end of the first year of a Bachelor degree in nursing. After high-fidelity simulation, data were collected using a questionnaire with 19 closed-ended and 2 open-ended questions. The tool stimulated peer assessment, and enabled students to be more thorough in what to assess as an observer in clinical skills. The tool provided a structure for self-assessment and made visible items that are important to be aware of in clinical skills. This article adds to simulation literature and provides a tool that is useful in enhancing peer learning, which is essential for nurses in practice. The tool has potential for enabling students to learn about reflection and developing skills for guiding others in practice after they have graduated. Copyright © 2017 Elsevier Ltd. All rights reserved.
Wolfson, Julia A; Moran, Alyssa J; Jarlenski, Marian P; Bleich, Sara N
Consuming too much sodium is associated with increased risk for cardiovascular disease, and restaurant foods are a primary source of sodium. This study assessed recent trends in sodium content of menu items in U.S. chain restaurants. Data from 21,557 menu items in 66 top-earning chain restaurants available from 2012 to 2016 were obtained from the MenuStat project and analyzed in 2017. Generalized linear models were used to examine changes in calorie-adjusted, per-item sodium content of menu items offered in all years (2012-2016) and items offered in 2012 only compared with items newly introduced in 2013, 2014, 2015, and 2016. Overall, calorie-adjusted sodium content in newly introduced menu items declined by 104 mg from 2012 to 2016 (prestaurant type; sodium content, particularly for main course items, was high. Sodium declined by 83 mg in fast food restaurants, 19 mg in fast casual restaurants, and 163 mg in full service restaurants. Sodium in appetizer and side items newly introduced in 2016 increased by 266 mg compared with items on the menu in 2012 only (prestaurants. However, sodium content of core and new menu items remain high, and reductions are inconsistent across menu categories and restaurant types. Copyright © 2018 American Journal of Preventive Medicine. Published by Elsevier Inc. All rights reserved.
Chu, Man-Wai; Fung, Karen
Canadian students experience many different assessments throughout their schooling (O'Connor 2011). There are many benefits to using a variety of assessment types, item formats, and science-based performance tasks in the classroom to measure the many dimensions of science education. Although using a variety of assessments is beneficial, it is unclear exactly what types, format, and tasks are used in Canadian science classrooms. Additionally, since assessments are often administered to help improve student learning, this study identified assessments that may improve student learning as measured using achievement scores on a standardized test. Secondary analyses of the students' and teachers' responses to the questionnaire items asked in the Pan-Canadian Assessment Program were performed. The results of the hierarchical linear modeling analyses indicated that both students and teachers identified teacher-developed classroom tests or quizzes as the most common types of assessments used. Although this ranking was similar across the country, statistically significant differences in terms of the assessments that are used in science classrooms among the provinces were also identified. The investigation of which assessment best predicted student achievement scores indicated that minds-on science performance-based tasks significantly explained 4.21% of the variance in student scores. However, mixed results were observed between the student and teacher responses towards tasks that required students to choose their own investigation and design their own experience or investigation. Additionally, teachers that indicated that they conducted more demonstrations of an experiment or investigation resulted in students with lower scores.
Edward K. Chang
Discussion: Performance on weekly formative assessments was predictive of final exam scores. Struggling medical students will benefit from extra cumulative practice exams while students who are excelling do not need extra practice.
Prem Senthil, Mallika; Khadka, Jyoti; De Roach, John; Lamey, Tina; McLaren, Terri; Campbell, Isabella; Fenwick, Eva K; Lamoureux, Ecosse L; Pesudovs, Konrad
Our understanding of the coping strategies used by people with visual impairment to manage stress related to visual loss is limited. This study aims to develop a sophisticated coping instrument in the form of an item bank implemented via Computerised adaptive testing (CAT) for hereditary retinal diseases. Items on coping were extracted from qualitative interviews with patients which were supplemented by items from a literature review. A systematic multi-stage process of item refinement was carried out followed by expert panel discussion and cognitive interviews. The final coping item bank had 30 items. Rasch analysis was used to assess the psychometric properties. A CAT simulation was carried out to estimate an average number of items required to gain precise measurement of hereditary retinal disease-related coping. One hundred eighty-nine participants answered the coping item bank (median age = 58 years). The coping scale demonstrated good precision and targeting. The standardised residual loadings for items revealed six items grouped together. Removal of the six items reduced the precision of the main coping scale and worsened the variance explained by the measure. Therefore, the six items were retained within the main scale. Our CAT simulation indicated that, on average, less than 10 items are required to gain a precise measurement of coping. This is the first study to develop a psychometrically robust coping instrument for hereditary retinal diseases. CAT simulation indicated that on an average, only four and nine items were required to gain measurement at moderate and high precision, respectively.
Stanton, Kenneth C.
The purposes of this study were to conduct an exploratory study of the status quo of engineering faculty motivation for and engagement in formative assessment, and to conduct a preliminary validation of a motivational model, based in self-determination theory, that explains relationships between these variables. To do so, a survey instrument was first developed and validated, in accordance with a process prescribed in the literature, that measured individual engineering faculty members' mo...
M K Joshi
Background: Despite an increasing emphasis on workplace-based assessment (WPBA) during medical training, the existing assessment system largely relies on summative assessment while formative assessment is less valued. Various tools have been described for WPBA, mini-clinical evaluation exercise (mini-CEX) being one of them. Mini-CEX is well accepted in Western countries; however, reports of its use in India are scarce. We conducted this study to assess acceptability and feasibility of mini-CEX as a formative assessment tool for WPBA of surgical postgraduate students in an Indian setting. Methods: Faculty members and 2nd year surgical residents were sensitized toward mini-CEX and requisite numbers of exercises were conducted. The difficulties during conduction of these exercises were identified, recorded, and appropriate measures were taken to address them. At the conclusion, the opinion of residents and faculty members regarding their experience with mini-CEX was taken using a questionnaire. The results were analyzed using simple statistical tools. Results: Nine faculty members out of 11 approached participated in the study (81.8%). All 16 2nd year postgraduate surgical residents participated (100%). Sixty mini-CEX were conducted over 7 months. Each resident underwent 3–5 encounters. The mean time taken by the assessor for observation was 12.3 min (8–30 min) while the mean feedback time was 4.2 min (3–10 min). The faculty reported good overall satisfaction with mini-CEX and found it acceptable as a formative assessment tool. Three faculty members (33.3%) reported mini-CEX as more time-consuming while 2 (22.2%) found it difficult to carry out the exercises often. All residents accepted mini-CEX and most of them reported good to high satisfaction with the exercises conducted. Conclusions: Mini-CEX is well accepted by residents and faculty as a formative assessment tool. It is feasible to utilize mini-CEX for WPBA of postgraduate students of surgery.
Arce-Ferrer, Alvaro J.; Bulut, Okan
This study examines separate and concurrent approaches to combine the detection of item parameter drift (IPD) and the estimation of scale transformation coefficients in the context of the common item nonequivalent groups design with the three-parameter item response theory equating. The study uses real and synthetic data sets to compare the two…
Cho, Sun-Joo; Wilmer, Jeremy; Herzmann, Grit; McGugin, Rankin; Fiset, Daniel; Van Gulick, Ana E.; Ryan, Katie; Gauthier, Isabel
We evaluated the psychometric properties of the Cambridge face memory test (CFMT; Duchaine & Nakayama, 2006). First, we assessed the dimensionality of the test with a bi-factor exploratory factor analysis (EFA). This EFA analysis revealed a general factor and three specific factors clustered by targets of CFMT. However, the three specific factors appeared to be minor factors that can be ignored. Second, we fit a unidimensional item response model. This item response model showed that the CFMT items could discriminate individuals at different ability levels and covered a wide range of the ability continuum. We found the CFMT to be particularly precise for a wide range of ability levels. Third, we implemented item response theory (IRT) differential item functioning (DIF) analyses for each gender group and two age groups (Age ≤ 20 versus Age > 21). This DIF analysis suggested little evidence of consequential differential functioning on the CFMT for these groups, supporting the use of the test to compare older to younger, or male to female, individuals. Fourth, we tested for a gender difference on the latent facial recognition ability with an explanatory item response model. We found a significant but small gender difference on the latent ability for face recognition, which was higher for women than men by 0.184, at age mean 23.2, controlling for linear and quadratic age effects. Finally, we discuss the practical considerations of the use of total scores versus IRT scale scores in applications of the CFMT. PMID:25642930
Tassé, Marc J.; Schalock, Robert L.; Thissen, David; Balboni, Giulia; Bersani, Henry, Jr.; Borthwick-Duffy, Sharon A.; Spreat, Scott; Widaman, Keith F.; Zhang, Dalun; Navas, Patricia
The Diagnostic Adaptive Behavior Scale (DABS) was developed using item response theory (IRT) methods and was constructed to provide the most precise and valid adaptive behavior information at or near the cutoff point of making a decision regarding a diagnosis of intellectual disability. The DABS initial item pool consisted of 260 items. Using IRT…
JOSEPH P. EIMICKE
The aims of this paper are to present findings related to differential item functioning (DIF) in the Patient Reported Outcome Measurement Information System (PROMIS) depression item bank, and to discuss potential threats to the validity of results from studies of DIF. The 32 depression items studied were modified from several widely used instruments. DIF analyses of gender, age and education were performed using a sample of 735 individuals recruited by a survey polling firm. DIF hypotheses were generated by asking content experts to indicate whether or not they expected DIF to be present, and the direction of the DIF with respect to the studied comparison groups. Primary analyses were conducted using the graded item response model (for polytomous, ordered response category data) with likelihood ratio tests of DIF, accompanied by magnitude measures. Sensitivity analyses were performed using other item response models and approaches to DIF detection. Despite some caveats, the items that are recommended for exclusion or for separate calibration were "I felt like crying" and "I had trouble enjoying things that I used to enjoy." The item, "I felt I had no energy," was also flagged as evidencing DIF, and recommended for additional review. On the one hand, false DIF detection (Type 1 error) was controlled to the extent possible by ensuring model fit and purification. On the other hand, power for DIF detection might have been compromised by several factors, including sparse data and small sample sizes. Nonetheless, practical and not just statistical significance should be considered. In this case the overall magnitude and impact of DIF was small for the groups studied, although impact was relatively large for some individuals.
Objective. We examined the diagnostic value of subjective memory complaints (SMCs) assessed with a single item in a large cross-sectional cohort consisting of families with autosomal dominant Alzheimer’s disease (ADAD) participating in the Dominantly Inherited Alzheimer Network (DIAN). Methods. The baseline sample of 183 mutation carriers (MCs) and 117 noncarriers (NCs) was divided according to Clinical Dementia Rating (CDR) scale into preclinical (CDR 0; MCs: n=107; NCs: n=109), early symptomatic (CDR 0.5; MCs: n=48; NCs: n=8), and dementia stage (CDR ≥ 1; MCs: n=28; NCs: n=0). These groups were subdivided by the presence or absence of SMCs. Results. At CDR 0, SMCs were present in 12.1% of MCs and 9.2% of NCs (P=0.6). At CDR 0.5, SMCs were present in 66.7% of MCs and 62.5% of NCs (P=1.0). At CDR ≥ 1, SMCs were present in 96.4% of MCs. SMCs in MCs were significantly associated with CDR, logical memory scores, Geriatric Depression Scale, education, and estimated years to onset. Conclusions. The present study shows that SMCs assessed by a single-item scale have no diagnostic value to identify preclinical ADAD in asymptomatic individuals. These results demonstrate the need of further improvement of SMC measures that should be examined in large clinical trials.
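The carrier versus noncarrier comparisons of SMC frequency in this abstract can be illustrated with a small worked example. This is a sketch under stated assumptions: the abstract does not name the test used, so Fisher's exact test is assumed here, and the cell counts are reconstructed from the reported percentages and group sizes (13 of 107 and 10 of 109 at CDR 0; 32 of 48 and 5 of 8 at CDR 0.5) and are not taken from the paper. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the observed one.
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p_x = prob(x)
        if p_x <= observed * (1 + 1e-9):
            p_value += p_x
    return min(p_value, 1.0)


# Counts reconstructed from the reported percentages (an assumption, see above).
# Rows: mutation carriers, noncarriers. Columns: SMC present, SMC absent.
p_cdr0 = fisher_exact_two_sided(13, 94, 10, 99)
p_cdr05 = fisher_exact_two_sided(32, 16, 5, 3)

print(f"CDR 0:   p = {p_cdr0:.2f}")
print(f"CDR 0.5: p = {p_cdr05:.2f}")
```

Under these assumptions both p-values come out far above 0.05, consistent with the abstract's conclusion that a single-item SMC measure does not separate carriers from noncarriers before dementia onset.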
Sevigny, Jeffrey J; Peng, Yahong; Liu, Lian; Lines, Christopher R
We explored the association of Alzheimer's disease (AD) Assessment Scale (ADAS-Cog) item scores with AD severity using cross-sectional and longitudinal data from the same study. Post hoc analyses were performed using placebo data from a 12-month trial of patients with mild-to-moderate AD (N =281 randomized, N =209 completed). Baseline distributions of ADAS-Cog item scores by Mini-Mental State Examination (MMSE) score and Clinical Dementia Rating (CDR) sum of boxes score (measures of dementia severity) were estimated using local and nonparametric regressions. Mixed-effect models were used to characterize ADAS-Cog item score changes over time by dementia severity (MMSE: mild =21-26, moderate =14-20; global CDR: mild =0.5-1, moderate =2). In the cross-sectional analysis of baseline ADAS-Cog item scores, orientation was the most sensitive item to differentiate patients across levels of cognitive impairment. Several items showed a ceiling effect, particularly in milder AD. In the longitudinal analysis of change scores over 12 months, orientation was the only item with noticeable decline (8%-10%) in mild AD. Most items showed modest declines (5%-20%) in moderate AD.
... DEPARTMENT OF DEFENSE Defense Acquisition Regulations System Commercial Item Handbook AGENCY.... SUMMARY: DoD has updated its Commercial Item Handbook. The purpose of the Handbook is to help acquisition personnel develop sound business strategies for procuring commercial items. DoD is seeking industry input on...
Mazefsky, Carla A; Yu, Lan; White, Susan W; Siegel, Matthew; Pilkonis, Paul A
Individuals with autism spectrum disorder (ASD) often present with prominent emotion dysregulation that requires treatment but can be difficult to measure. The Emotion Dysregulation Inventory (EDI) was created using methods developed by the Patient-Reported Outcomes Measurement Information System (PROMIS ® ) to capture observable indicators of poor emotion regulation. Caregivers of 1,755 youth with ASD completed 66 candidate EDI items, and the final 30 items were selected based on classical test theory and item response theory (IRT) analyses. The analyses identified two factors: (a) Reactivity, characterized by intense, rapidly escalating, sustained, and poorly regulated negative emotional reactions, and (b) Dysphoria, characterized by anhedonia, sadness, and nervousness. The final items did not show differential item functioning (DIF) based on gender, age, intellectual ability, or verbal ability. Because the final items were calibrated using IRT, even a small number of items offers high precision, minimizing respondent burden. IRT co-calibration of the EDI with related measures demonstrated its superiority in assessing the severity of emotion dysregulation with as few as seven items. Validity of the EDI was supported by expert review, its association with related constructs (e.g., anxiety and depression symptoms, aggression), higher scores in psychiatric inpatients with ASD compared to a community ASD sample, and demonstration of test-retest stability and sensitivity to change. In sum, the EDI provides an efficient and sensitive method to measure emotion dysregulation for clinical assessment, monitoring, and research in youth with ASD of any level of cognitive or verbal ability. Autism Res 2018. © 2018 International Society for Autism Research, Wiley Periodicals, Inc. This paper describes a new measure of poor emotional control called the Emotion Dysregulation Inventory (EDI). Caregivers of 1,755 youth with ASD completed candidate items, and advanced statistical
Martin, Christie S.; Polly, Drew; Wang, Chuang; Lambert, Richard G.; Pugalee, David K.
This study examined the influence of professional development on elementary school teachers' perceptions of and use of an internet-based formative assessment tool focused on students' number sense skills. Data sources include teacher-participants' pre and post survey, open ended response on post survey, use of the assessment tool and their written…
Daniels, Brian; Volpe, Robert J.; Briesch, Amy M.; Gadow, Kenneth D.
Direct behavior rating (DBR) represents a feasible method for monitoring student behavior in the classroom; however, limited work to date has focused on the use of multi-item scales. The purposes of the study were to examine the (a) dependability of data obtained from a multi-item DBR designed to assess peer conflict and (b) treatment sensitivity…
Fernandez Carratala, L.
Purchasing safety-related spare items with manufacturer certifications that maintain the original qualification of the destination equipment is increasingly difficult. The main reasons, beyond the natural evolution of the technology applied to newly manufactured components, are the discontinuation of nuclear-specific production lines and the shift of manufacturers' quality systems from nuclear codes and standards to conventional industry standards. To address this problem, various dedication processes have been implemented for many years to verify whether a commercial-grade item is acceptable for use in safety-related applications. Likewise, because of its particular position regarding spare part supplies, which come mainly from markets other than the American one, C.N. Trillo has developed a methodology called Spare Items Validation. This methodology, originally based on dedication processes, is not a single process but a group of coordinated processes involving engineering, quality and management activities. These are performed on the spare item itself, its design control, its fabrication and its supply, to allow its use in destinations with specific requirements. The scope of application is not limited to safety-related items; it also covers components of complex design, high cost, or relevance to plant reliability. The implementation at C.N. Trillo has mainly been carried out by merging, modifying and making the most of processes and activities that were already being performed in the company. (Author)
This research examined the influence formative self-assessment had on first/second year community college student self-regulatory practices. Previous research has shown that the ability to regulate one's learning activities can improve performance in college classes, and it has long been known that the use of formative assessment improves…
Morton, David A; Colbert-Getz, Jorie M
The flipped classroom (FC) model has emerged as an innovative solution to improve student-centered learning. However, studies measuring student performance of material in the FC relative to the lecture classroom (LC) have shown mixed results. An aim of this study was to determine if the disparity in results of prior research is due to level of cognition (low or high) needed to perform well on the outcome, or course assessment. This study tested the hypothesis that (1) students in a FC would perform better than students in a LC on an assessment requiring higher cognition and (2) there would be no difference in performance for an assessment requiring lower cognition. To test this hypothesis the performance of 28 multiple choice anatomy items that were part of a final examination were compared between two classes of first year medical students at the University of Utah School of Medicine. Items were categorized as requiring knowledge (low cognition), application, or analysis (high cognition). Thirty hours of anatomy content was delivered in LC format to 101 students in 2013 and in FC format to 104 students in 2014. Mann Whitney tests indicated FC students performed better than LC students on analysis items, U = 4243.00, P = 0.030, r = 0.19, but there were no differences in performance between FC and LC students for knowledge, U = 5002.00, P = 0.720 or application, U = 4990.00, P = 0.700, items. The FC may benefit retention when students are expected to analyze material. Anat Sci Educ 10: 170-175. © 2016 American Association of Anatomists.
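The effect size r reported alongside each Mann Whitney U in this abstract is commonly derived from the normal approximation as r = |Z| / sqrt(N). The sketch below shows that conversion; it is an assumption that the authors used this formula, the function name is hypothetical, no ties correction is applied, and the inputs are made-up values, not the study's data.

```python
import math


def mann_whitney_effect_size(u: float, n1: int, n2: int):
    """Return (z, r) for a Mann-Whitney U statistic via the normal approximation.

    No ties correction is applied, so results can differ slightly from
    statistical packages when the data contain tied ranks.
    """
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    r = abs(z) / math.sqrt(n1 + n2)
    return z, r


# Hypothetical example: two groups of 10 with complete separation (U = 0).
z_sep, r_sep = mann_whitney_effect_size(0, 10, 10)
# Hypothetical example: U exactly at its expected value, i.e. no group difference.
z_null, r_null = mann_whitney_effect_size(50, 10, 10)

print(f"complete separation: z = {z_sep:.2f}, r = {r_sep:.3f}")
print(f"no difference:       z = {z_null:.2f}, r = {r_null:.3f}")
```

Plugging the reported U values and the full class sizes into this function need not reproduce the reported r exactly, since the abstract does not state the sample size entering each test or whether a ties correction was used.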
Nunes, Andreia; Limpo, Teresa; Lima, César F.; Castro, São Luís
The importance of quickly assessing personality traits in many studies prompted the development of brief scales such as the Ten-Item Personality Inventory (TIPI), a measure of five personality traits (extraversion, agreeableness, conscientiousness, emotional stability, and openness). In the current study, we present the Portuguese version of TIPI and examine its psychometric properties, based on a sample of 333 Portuguese adults aged 18 to 65 years. The results revealed reliability coefficients similar to the original version (α = 0.39–0.72), very good 4-week test–retest reliability (n = 81, rs > 0.71), expected factorial structure, high convergent validity with the Big-Five Inventory (rs > 0.60), and correlations with self-esteem, affect, and aggressiveness similar to those found with standard measures of personality traits. Overall, our findings suggest that the Portuguese TIPI is a reliable and valid alternative to longer measures: it offers a promising tool for research contexts in which the available time for personality assessment is highly limited. PMID:29674989