WorldWideScience

Sample records for multiple-choice test items

  1. Using automatic item generation to create multiple-choice test items.

    Gierl, Mark J; Lai, Hollis; Turner, Simon R

    2012-08-01

    Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these are efficient in measuring examinees' knowledge and skills across diverse content areas, multiple-choice items are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created the demand for large numbers of multiple-choice items. Our current approaches to item development cannot meet this demand. We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and we illustrate this approach by generating multiple-choice items for a medical licensure test in the content area of surgery. To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1248 multiple-choice items from one item model. Automatic item generation is a process that involves using models to generate items using computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines the content to generate new test items. By combining these outcomes, items can be generated automatically. © Blackwell Publishing Ltd 2012.
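
    The third stage, generating items from an item model, can be sketched as a systematic combination of model elements. A minimal sketch (all names and content below are hypothetical, not the authors' actual cognitive or item models):

```python
from itertools import product

# Hypothetical item model: a stem with placeholders whose values come
# from a cognitive model (names and values are illustrative only).
stem = ("A {age}-year-old patient presents with {finding} after {context}. "
        "What is the most appropriate next step?")

elements = {
    "age": ["25", "45", "70"],
    "finding": ["acute abdominal pain", "a palpable mass"],
    "context": ["a fall", "elective surgery"],
}

def generate_items(stem, elements):
    """Systematically combine element values to instantiate the item model."""
    keys = list(elements)
    for values in product(*(elements[k] for k in keys)):
        yield stem.format(**dict(zip(keys, values)))

items = list(generate_items(stem, elements))
print(len(items))  # 3 * 2 * 2 = 12 generated stems
```

    With richer models (more elements, constraints between them, and generated distractors), the same combinatorial idea scales to the hundreds or thousands of items the abstract describes.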

  2. Item difficulty of multiple choice tests dependent on different item response formats – An experiment in fundamental research on psychological assessment

    Kubinger, Klaus D.

    2007-12-01

    Multiple choice response formats are problematic, as an item is often scored as solved simply because the test-taker is a lucky guesser. Instead of applying pertinent IRT models which take guessing effects into account, a pragmatic approach of re-conceptualizing multiple choice response formats to reduce the chance of lucky guessing is considered. This paper compares the free response format with two different multiple choice formats: a common multiple choice format with a single correct response option and five distractors (“1 of 6”), and a multiple choice format with five response options, of which any number may be correct, where the item is only scored as mastered if all the correct response options and none of the wrong ones are marked (“x of 5”). An experiment was designed using pairs of items with exactly the same content but different response formats. 173 test-takers were randomly assigned to two test booklets of 150 items altogether. Rasch model analyses adduced a fitting item pool after the deletion of 39 items. The resulting item difficulty parameters were used for the comparison of the different formats. The multiple choice format “1 of 6” differs significantly from “x of 5”, with a relative effect of 1.63, while the multiple choice format “x of 5” does not significantly differ from the free response format. Therefore, the lower degree of difficulty of items with the “1 of 6” multiple choice format is an indicator of relevant guessing effects. In contrast, the “x of 5” multiple choice format can be seen as an appropriate substitute for the free response format.
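
    The guessing rates that motivate the “x of 5” format can be made concrete with a small calculation (a sketch; the 1/32 figure assumes a blind guesser marks each of the five options independently at random):

```python
from fractions import Fraction

# "1 of 6": one correct option among six; a blind guess picks it with p = 1/6.
p_1_of_6 = Fraction(1, 6)

# "x of 5": any subset of the five options may be correct, and the item is
# scored as mastered only if the exact pattern is reproduced. A blind guesser
# marking each option at random matches all five decisions with p = (1/2)^5.
p_x_of_5 = Fraction(1, 2) ** 5

print(p_1_of_6)  # 1/6
print(p_x_of_5)  # 1/32
```

    The roughly fivefold drop in the probability of a lucky full credit is consistent with the paper's finding that “x of 5” items behave like free-response items.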

  3. Evaluating the Psychometric Characteristics of Generated Multiple-Choice Test Items

    Gierl, Mark J.; Lai, Hollis; Pugh, Debra; Touchie, Claire; Boulais, André-Philippe; De Champlain, André

    2016-01-01

    Item development is a time- and resource-intensive process. Automatic item generation integrates cognitive modeling with computer technology to systematically generate test items. To date, however, items generated using cognitive modeling procedures have received limited use in operational testing situations. As a result, the psychometric…

  4. Science Library of Test Items. Volume Twenty-Two. A Collection of Multiple Choice Test Items Relating Mainly to Skills.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  5. Science Library of Test Items. Volume Eighteen. A Collection of Multiple Choice Test Items Relating Mainly to Chemistry.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  6. Science Library of Test Items. Volume Twenty. A Collection of Multiple Choice Test Items Relating Mainly to Physics, 1.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  7. Science Library of Test Items. Volume Seventeen. A Collection of Multiple Choice Test Items Relating Mainly to Biology.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  8. Science Library of Test Items. Volume Nineteen. A Collection of Multiple Choice Test Items Relating Mainly to Geology.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  9. Item and test analysis to identify quality multiple choice questions (MCQs) from an assessment of medical students of Ahmedabad, Gujarat

    Sanju Gajjar

    2014-01-01

    Background: Multiple choice questions (MCQs) are frequently used to assess students in different educational streams for their objectivity and wide reach of coverage in less time. However, the MCQs used must be of quality, which depends upon their difficulty index (DIF I), discrimination index (DI) and distractor efficiency (DE). Objective: To evaluate MCQs or items and develop a pool of valid items by assessing them with DIF I, DI and DE, and to revise, store or discard items based on the obtained results. Settings: The study was conducted in a medical school of Ahmedabad. Materials and Methods: An internal examination in Community Medicine was conducted after 40 hours of teaching during the 1st MBBS, which was attended by 148 out of 150 students. A total of 50 MCQs or items and 150 distractors were analyzed. Statistical Analysis: Data were entered and analyzed in MS Excel 2007; simple proportions, means, standard deviations and coefficients of variation were calculated, and the unpaired t test was applied. Results: Out of 50 items, 24 had "good to excellent" DIF I (31–60%) and 15 had "good to excellent" DI (> 0.25). Mean DE was 88.6%, considered ideal/acceptable, and non-functional distractors (NFDs) made up only 11.4%. Mean DI was 0.14. Poor DI (< 0.15), with negative DI in 10 items, indicates poor preparedness of students and some issues with the framing of at least some of the MCQs. An increased proportion of NFDs (incorrect alternatives selected by < 5% of students) in an item decreases its DE and makes the item easier. There were 15 items with 17 NFDs, while the remaining items did not have any NFD and had a mean DE of 100%. Conclusion: The study emphasizes the selection of quality MCQs which truly assess knowledge and are able to differentiate students of different abilities in a correct manner.
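
    The three indices named in this abstract can be computed directly from raw responses. A sketch with hypothetical data (function names and cut-offs are illustrative: DIF I as percentage correct, DI from upper/lower 27% score groups, and a distractor treated as functional when chosen by at least 5% of examinees):

```python
def difficulty_index(correct_flags):
    """DIF I: percentage of examinees answering the item correctly."""
    return 100.0 * sum(correct_flags) / len(correct_flags)

def discrimination_index(correct_flags, total_scores, frac=0.27):
    """DI: proportion correct in the top score group minus the bottom group,
    with groups formed from the upper and lower `frac` of total scores."""
    n = max(1, int(len(total_scores) * frac))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    return (sum(correct_flags[i] for i in high) / n
            - sum(correct_flags[i] for i in low) / n)

def distractor_efficiency(option_counts, correct_option, n_examinees):
    """DE: share of distractors that are functional (chosen by >= 5%)."""
    distractors = [c for opt, c in option_counts.items() if opt != correct_option]
    functional = sum(1 for c in distractors if c / n_examinees >= 0.05)
    return 100.0 * functional / len(distractors)

# Hypothetical item attempted by 20 examinees:
flags = [0] * 8 + [1] * 12            # 12 correct -> DIF I = 60%
scores = list(range(20))              # total test scores, aligned with flags
counts = {"A": 12, "B": 5, "C": 3, "D": 0}  # "A" is the key; "D" is an NFD
print(difficulty_index(flags))                  # 60.0
print(discrimination_index(flags, scores))      # 1.0
print(round(distractor_efficiency(counts, "A", 20), 1))  # 66.7
```

    With 2 of 3 distractors functional, DE is 66.7%; by the abstract's cut-offs this item would be flagged for revision of the dead distractor "D".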

  10. Dynamic Testing of Analogical Reasoning in 5- to 6-Year-Olds: Multiple-Choice versus Constructed-Response Training Items

    Stevenson, Claire E.; Heiser, Willem J.; Resing, Wilma C. M.

    2016-01-01

    Multiple-choice (MC) analogy items are often used in cognitive assessment. However, in dynamic testing, where the aim is to provide insight into potential for learning and the learning process, constructed-response (CR) items may be of benefit. This study investigated whether training with CR or MC items leads to differences in the strategy…

  11. Dynamic Testing of Analogical Reasoning in 5- to 6-Year-Olds : Multiple-Choice Versus Constructed-Response Training Items

    Stevenson, C.E.; Heiser, W.J.; Resing, W.C.M.

    2016-01-01

    Multiple-choice (MC) analogy items are often used in cognitive assessment. However, in dynamic testing, where the aim is to provide insight into potential for learning and the learning process, constructed-response (CR) items may be of benefit. This study investigated whether training with CR or MC

  12. Science Library of Test Items. Volume Twenty-One. A Collection of Multiple Choice Test Items Relating Mainly to Physics, 2.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  13. A Diagnostic Study of Pre-Service Teachers' Competency in Multiple-Choice Item Development

    Asim, Alice E.; Ekuri, Emmanuel E.; Eni, Eni I.

    2013-01-01

    Large class size is an issue in testing at all levels of education. As a panacea to this, multiple choice test formats have become very popular. This case study was designed to diagnose pre-service teachers' competency in constructing questions (IQT); direct questions (DQT); and best answer (BAT) varieties of multiple choice items. Subjects were 88…

  14. To Show or Not to Show: The Effects of Item Stems and Answer Options on Performance on a Multiple-Choice Listening Comprehension Test

    Yanagawa, Kozo; Green, Anthony

    2008-01-01

    The purpose of this study is to examine whether the choice between three multiple-choice listening comprehension test formats results in any difference in listening comprehension test performance. The three formats entail (a) allowing test takers to preview both the question stem and answer options prior to listening; (b) allowing test takers to…

  15. Multiple-choice test of energy and momentum concepts

    Singh, Chandralekha; Rosengrant, David

    2016-01-01

    We investigate student understanding of energy and momentum concepts at the level of introductory physics by designing and administering a 25-item multiple choice test and conducting individual interviews. We find that most students have difficulty in qualitatively interpreting basic principles related to energy and momentum and in applying them in physical situations.

  16. The positive and negative consequences of multiple-choice testing.

    Roediger, Henry L; Marsh, Elizabeth J

    2005-09-01

    Multiple-choice tests are commonly used in educational settings but with unknown effects on students' knowledge. The authors examined the consequences of taking a multiple-choice test on a later general knowledge test in which students were warned not to guess. A large positive testing effect was obtained: Prior testing of facts aided final cued-recall performance. However, prior testing also had negative consequences. Prior reading of a greater number of multiple-choice lures decreased the positive testing effect and increased production of multiple-choice lures as incorrect answers on the final test. Multiple-choice testing may inadvertently lead to the creation of false knowledge.

  17. Evaluating the quality of medical multiple-choice items created with automated processes.

    Gierl, Mark J; Lai, Hollis

    2013-07-01

    Computerised assessment raises formidable challenges because it requires large numbers of test items. Automatic item generation (AIG) can help address this test development problem because it yields large numbers of new items both quickly and efficiently. To date, however, the quality of the items produced using a generative approach has not been evaluated. The purpose of this study was to determine whether automatic processes yield items that meet standards of quality that are appropriate for medical testing. Quality was evaluated firstly by subjecting items created using both AIG and traditional processes to rating by a four-member expert medical panel using indicators of multiple-choice item quality, and secondly by asking the panellists to identify which items were developed using AIG in a blind review. Fifteen items from the domain of therapeutics were created in three different experimental test development conditions. The first 15 items were created by content specialists using traditional test development methods (Group 1 Traditional). The second 15 items were created by the same content specialists using AIG methods (Group 1 AIG). The third 15 items were created by a new group of content specialists using traditional methods (Group 2 Traditional). These 45 items were then evaluated for quality by a four-member panel of medical experts and were subsequently categorised as either Traditional or AIG items. Three outcomes were reported: (i) the items produced using traditional and AIG processes were comparable on seven of eight indicators of multiple-choice item quality; (ii) AIG items can be differentiated from Traditional items by the quality of their distractors, and (iii) the overall predictive accuracy of the four expert medical panellists was 42%. Items generated by AIG methods are, for the most part, equivalent to traditionally developed items from the perspective of expert medical reviewers. While the AIG method produced comparatively fewer plausible…

  18. On the Equivalence of Constructed-Response and Multiple-Choice Tests.

    Traub, Ross E.; Fisher, Charles W.

    Two sets of mathematical reasoning and two sets of verbal comprehension items were cast into each of three formats--constructed response, standard multiple-choice, and Coombs multiple-choice--in order to assess whether tests with identical content but different formats measure the same attribute, except for possible differences in error variance…

  19. Evaluation of five guidelines for option development in multiple-choice item-writing.

    Martínez, Rafael J; Moreno, Rafael; Martín, Irene; Trigo, M Eva

    2009-05-01

    This paper evaluates certain guidelines for writing multiple-choice test items. The analysis of the responses of 5013 subjects to 630 items from 21 university classroom achievement tests suggests that an option should not differ in terms of heterogeneous content because such error has a slight but harmful effect on item discrimination. This also occurs with the "None of the above" option when it is the correct one. In contrast, results do not show the supposedly negative effects of a different-length option, the use of specific determiners, or the use of the "All of the above" option, which not only decreases difficulty but also improves discrimination when it is the correct option.

  20. Optimizing Multiple-Choice Tests as Learning Events

    Little, Jeri Lynn

    2011-01-01

    Although generally used for assessment, tests can also serve as tools for learning--but different test formats may not be equally beneficial. Specifically, research has shown multiple-choice tests to be less effective than cued-recall tests in improving the later retention of the tested information (e.g., see meta-analysis by Hamaker, 1986),…

  21. Optimizing multiple-choice tests as tools for learning.

    Little, Jeri L; Bjork, Elizabeth Ligon

    2015-01-01

    Answering multiple-choice questions with competitive alternatives can enhance performance on a later test, not only on questions about the information previously tested, but also on questions about related information not previously tested; in particular, on questions about information pertaining to the previously incorrect alternatives. In the present research, we assessed a possible explanation for this pattern: When multiple-choice questions contain competitive incorrect alternatives, test-takers are led to retrieve previously studied information pertaining to all of the alternatives in order to discriminate among them and select an answer, with such processing strengthening later access to information associated with both the correct and incorrect alternatives. Supporting this hypothesis, we found enhanced performance on a later cued-recall test for previously nontested questions when their answers had previously appeared as competitive incorrect alternatives in the initial multiple-choice test, but not when they had previously appeared as noncompetitive alternatives. Importantly, however, competitive alternatives were not more likely than noncompetitive alternatives to be intruded as incorrect responses, indicating that a general increased accessibility for previously presented incorrect alternatives could not be the explanation for these results. The present findings, replicated across two experiments (one in which corrective feedback was provided during the initial multiple-choice testing, and one in which it was not), thus strongly suggest that competitive multiple-choice questions can trigger beneficial retrieval processes for both tested and related information, and the results have implications for the effective use of multiple-choice tests as tools for learning.

  22. Are Faculty Predictions or Item Taxonomies Useful for Estimating the Outcome of Multiple-Choice Examinations?

    Kibble, Jonathan D.; Johnson, Teresa

    2011-01-01

    The purpose of this study was to evaluate whether multiple-choice item difficulty could be predicted either by a subjective judgment by the question author or by applying a learning taxonomy to the items. Eight physiology faculty members teaching an upper-level undergraduate human physiology course consented to participate in the study. The…

  23. Complement or Contamination: A Study of the Validity of Multiple-Choice Items when Assessing Reasoning Skills in Physics

    Anders Jönsson; David Rosenlund; Fredrik Alvén

    2017-01-01

    The purpose of this study is to investigate the validity of using multiple-choice (MC) items as a complement to constructed-response (CR) items when making decisions about student performance on reasoning tasks. CR items from a national test in physics have been reformulated into MC items and students’ reasoning skills have been analyzed in two substudies. In the first study, 12 students answered the MC items and were asked to explain their answers orally. In the second study, 102 students fr...

  24. Impact of Answer-Switching Behavior on Multiple-Choice Test Scores in Higher Education

    Ramazan BAŞTÜRK

    2011-06-01

    The multiple-choice format is one of the most popular selected-response item formats used in educational testing. Researchers have shown that the multiple-choice test is a useful vehicle for student assessment in core university subjects that usually have large student numbers. Even though educators, test experts and different test resources maintain the idea that the first answer should be retained, many researchers have argued that this position is not supported by empirical findings. The main question of this study is how answer-switching behavior affects multiple-choice test scores. Additionally, gender differences and the relationship between the number of answer switches and item parameters (item difficulty and item discrimination) were investigated. The participants in this study consisted of 207 upper-level College of Education students from mid-sized universities. A midterm exam consisting of 20 multiple-choice questions was used. According to the results of this study, answer-switching behavior statistically significantly increases test scores. On the other hand, there is no significant gender difference in answer-switching behavior. Additionally, there is a significant negative relationship between answer-switching behavior and item difficulty.

  25. Multiple-Choice versus Constructed-Response Tests in the Assessment of Mathematics Computation Skills.

    Gadalla, Tahany M.

    The equivalence of multiple-choice (MC) and constructed response (discrete) (CR-D) response formats as applied to mathematics computation at grade levels two to six was tested. The difference between total scores from the two response formats was tested for statistical significance, and the factor structure of items in both response formats was…

  26. The "None of the Above" Option in Multiple-Choice Testing: An Experimental Study

    DiBattista, David; Sinnige-Egger, Jo-Anne; Fortuna, Glenda

    2014-01-01

    The authors assessed the effects of using "none of the above" as an option in a 40-item, general-knowledge multiple-choice test administered to undergraduate students. Examinees who selected "none of the above" were given an incentive to write the correct answer to the question posed. Using "none of the above" as the…

  27. Multiple Choice Testing and the Retrieval Hypothesis of the Testing Effect

    Sensenig, Amanda E.

    2010-01-01

    Taking a test often leads to enhanced later memory for the tested information, a phenomenon known as the "testing effect". This memory advantage has been reliably demonstrated with recall tests but not multiple choice tests. One potential explanation for this finding is that multiple choice tests do not rely on retrieval processes to the same…

  28. Use of flawed multiple-choice items by the New England Journal of Medicine for continuing medical education.

    Stagnaro-Green, Alex S; Downing, Steven M

    2006-09-01

    Physicians in the United States are required to complete a minimum number of continuing medical education (CME) credits annually. The goal of CME is to ensure that physicians maintain their knowledge and skills throughout their medical career. The New England Journal of Medicine (NEJM) provides its readers with the opportunity to obtain weekly CME credits. Deviation from established item-writing principles may result in a decrease in validity evidence for tests. This study evaluated the quality of 40 NEJM MCQs using the standard evidence-based principles of effective item writing. Each multiple-choice item reviewed had at least three item flaws, with a mean of 5.1 and a range of 3 to 7. The results of this study demonstrate that the NEJM uses flawed MCQs in its weekly CME program.

  29. Item Analysis of Multiple Choice Questions at the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain

    Deena Kheyami

    2018-04-01

    Objectives: The current study aimed to carry out a post-validation item analysis of multiple choice questions (MCQs) in medical examinations in order to evaluate correlations between item difficulty, item discrimination and distractor effectiveness, so as to determine whether questions should be included, modified or discarded. In addition, the optimal number of options per MCQ was analysed. Methods: This cross-sectional study was performed in the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain. A total of 800 MCQs and 4,000 distractors were analysed between November 2013 and June 2016. Results: The mean difficulty index ranged from 36.70–73.14%. The mean discrimination index ranged from 0.20–0.34. The mean distractor efficiency ranged from 66.50–90.00%. Of the items, 48.4%, 35.3%, 11.4%, 3.9% and 1.1% had zero, one, two, three and four nonfunctional distractors (NFDs), respectively. Using three or four rather than five options in each MCQ resulted in 95% or 83.6% of items having zero NFDs, respectively. The distractor efficiency was 91.87%, 85.83% and 64.13% for difficult, acceptable and easy items, respectively (P < 0.005). Distractor efficiency was 83.33%, 83.24% and 77.56% for items with excellent, acceptable and poor discrimination, respectively (P < 0.005). The average Kuder-Richardson formula 20 reliability coefficient was 0.76. Conclusion: A considerable number of the MCQ items were within acceptable ranges. However, some items needed to be discarded or revised. Using three or four rather than five options in MCQs is recommended to reduce the number of NFDs and improve the overall quality of the examination.
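
    The Kuder-Richardson formula 20 coefficient reported above can be reproduced from a 0/1 score matrix. A minimal sketch with made-up data, using KR-20 = k/(k-1) * (1 - sum(p_j * q_j) / var(total)) with the population variance of total scores:

```python
def kr20(score_matrix):
    """KR-20 reliability for a matrix of 0/1 scores
    (rows = examinees, columns = items)."""
    n = len(score_matrix)
    k = len(score_matrix[0])
    # Sum of item variances p * q, where p is the proportion correct.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n
        pq_sum += p * (1 - p)
    # Population variance of examinees' total scores.
    totals = [sum(row) for row in score_matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq_sum / var)

# Hypothetical 4-examinee, 3-item matrix:
X = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(X), 3))  # 0.75
```

    A coefficient of 0.76, as in the study, indicates acceptable internal consistency for a classroom-style examination.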

  30. Examining the Psychometric Quality of Multiple-Choice Assessment Items using Mokken Scale Analysis.

    Wind, Stefanie A

    The concept of invariant measurement is typically associated with Rasch measurement theory (Engelhard, 2013). Concerned with the appropriateness of the parametric transformation upon which the Rasch model is based, Mokken (1971) proposed a nonparametric procedure for evaluating the quality of social science measurement that is theoretically and empirically related to the Rasch model. Mokken's nonparametric procedure can be used to evaluate the quality of dichotomous and polytomous items in terms of the requirements for invariant measurement. Despite these potential benefits, the use of Mokken scaling to examine the properties of multiple-choice (MC) items in education has not yet been fully explored. A nonparametric approach to evaluating MC items is promising in that this approach facilitates the evaluation of assessments in terms of invariant measurement without imposing potentially inappropriate transformations. Using Rasch-based indices of measurement quality as a frame of reference, data from an eighth-grade physical science assessment are used to illustrate and explore Mokken-based techniques for evaluating the quality of MC items. Implications for research and practice are discussed.
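
    Loevinger's scalability coefficient, a building block of Mokken's nonparametric procedure, can be sketched for a single item pair (a minimal illustration with made-up 0/1 data, not the full Mokken analysis; H is 1 minus the ratio of observed Guttman errors to the errors expected under marginal independence):

```python
def pairwise_h(x, y):
    """Loevinger's H for two dichotomous items (lists of 0/1 scores).
    A Guttman error is passing the harder item while failing the easier one."""
    n = len(x)
    # Order so that `easy` is the item with the higher proportion correct.
    if sum(x) >= sum(y):
        easy, hard = x, y
    else:
        easy, hard = y, x
    observed = sum(1 for e, h in zip(easy, hard) if h == 1 and e == 0)
    p_easy_wrong = 1 - sum(easy) / n
    p_hard_right = sum(hard) / n
    expected = n * p_easy_wrong * p_hard_right
    return 1 - observed / expected

# Hypothetical responses from 20 examinees; exactly one Guttman error.
a = [1] * 15 + [0] * 5                       # easier item, 15/20 correct
b = [1] * 7 + [0] * 8 + [1] + [0] * 4        # harder item, 8/20 correct
print(round(pairwise_h(a, b), 3))  # 0.5
```

    In Mokken scaling, item-pair and item-level H values above conventional cut-offs (commonly 0.3) are taken as evidence that the items form a scale consistent with invariant ordering.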

  31. Analyzing Multiple-Choice Questions by Model Analysis and Item Response Curves

    Wattanakasiwich, P.; Ananta, S.

    2010-07-01

    In physics education research, the main goal is to improve physics teaching so that most students understand physics conceptually and are able to apply concepts in solving problems. Many multiple-choice instruments have therefore been developed to probe students' conceptual understanding of various topics. Two techniques, model analysis and item response curves, were used to analyze students' responses to the Force and Motion Conceptual Evaluation (FMCE). For this study, FMCE data from more than 1000 students at Chiang Mai University were collected over the past three years. With model analysis, we can obtain students' alternative knowledge and the probabilities that students use such knowledge in a range of equivalent contexts. Model analysis consists of two algorithms: the concentration factor and model estimation. This paper only presents results from using the model estimation algorithm to obtain a model plot. The plot helps to identify whether a class model state is in the misconception region. An item response curve (IRC), derived from item response theory, plots the percentage of students selecting a particular choice against their total score. The pros and cons of both techniques are compared and discussed.
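
    The item response curve described here is simply the proportion of test-takers choosing each option, binned by total score. A sketch with hypothetical response data (the data layout, one chosen option letter per student, is an assumption for illustration):

```python
from collections import defaultdict

def item_response_curve(choices, totals, options="ABCDE"):
    """For one item, return {option: {total_score: fraction choosing it}}.
    `choices` holds each student's selected option; `totals` their test scores."""
    counts = defaultdict(lambda: defaultdict(int))
    n_at_score = defaultdict(int)
    for choice, total in zip(choices, totals):
        counts[choice][total] += 1
        n_at_score[total] += 1
    return {
        opt: {s: counts[opt][s] / n for s, n in sorted(n_at_score.items())}
        for opt in options
    }

# Hypothetical item: high scorers pick the key "B", low scorers the lure "A".
choices = ["A", "A", "B", "A", "B", "B"]
totals  = [ 10,  10,  10,  20,  20,  20]
curve = item_response_curve(choices, totals, options="AB")
print(curve["B"])  # fraction choosing the key rises with total score
```

    Plotting each option's curve against total score reveals which distractors attract weaker students, which is the diagnostic use the abstract describes.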

  32. Quantitative Analysis of Complex Multiple-Choice Items in Science Technology and Society: Item Scaling

    Ángel Vázquez Alonso

    2005-05-01

    The scarce attention to assessment and evaluation in science education research has been especially harmful for Science-Technology-Society (STS) education, due to the dialectic, tentative, value-laden, and controversial nature of most STS topics. To overcome the methodological pitfalls of the STS assessment instruments used in the past, an empirically developed instrument (VOSTS, Views on Science-Technology-Society) has been suggested. Some methodological proposals, namely multiple response models and the computing of a global attitudinal index, were suggested to improve item implementation. The final step of these methodological proposals requires the categorization of STS statements. This paper describes the process of categorization through a scaling procedure ruled by a panel of experts, acting as judges, according to the body of knowledge from the history, epistemology, and sociology of science. The statement categorization allows for a sound foundation of STS items, which is useful in educational assessment and science education research, and may also increase teachers' self-confidence in the development of the STS curriculum for science classrooms.

  33. Feedback-related brain activity predicts learning from feedback in multiple-choice testing.

    Ernst, Benjamin; Steinhauser, Marco

    2012-06-01

    Different event-related potentials (ERPs) have been shown to correlate with learning from feedback in decision-making tasks and with learning in explicit memory tasks. In the present study, we investigated which ERPs predict learning from corrective feedback in a multiple-choice test, which combines elements from both paradigms. Participants worked through sets of multiple-choice items of a Swahili-German vocabulary task. Whereas the initial presentation of an item required the participants to guess the answer, corrective feedback could be used to learn the correct response. Initial analyses revealed that corrective feedback elicited components related to reinforcement learning (FRN), as well as to explicit memory processing (P300) and attention (early frontal positivity). However, only the P300 and early frontal positivity were positively correlated with successful learning from corrective feedback, whereas the FRN was even larger when learning failed. These results suggest that learning from corrective feedback crucially relies on explicit memory processing and attentional orienting to corrective feedback, rather than on reinforcement learning.

  34. Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing.

    Butler, Andrew C; Roediger, Henry L

    2008-04-01

    Multiple-choice tests are used frequently in higher education without much consideration of the impact this form of assessment has on learning. Multiple-choice testing enhances retention of the material tested (the testing effect); however, unlike other tests, multiple-choice can also be detrimental because it exposes students to misinformation in the form of lures. The selection of lures can lead students to acquire false knowledge (Roediger & Marsh, 2005). The present research investigated whether feedback could be used to boost the positive effects and reduce the negative effects of multiple-choice testing. Subjects studied passages and then received a multiple-choice test with immediate feedback, delayed feedback, or no feedback. In comparison with the no-feedback condition, both immediate and delayed feedback increased the proportion of correct responses and reduced the proportion of intrusions (i.e., lure responses from the initial multiple-choice test) on a delayed cued recall test. Educators should provide feedback when using multiple-choice tests.

  15. Test of understanding of vectors: A reliable multiple-choice vector concept test

    Barniol, Pablo; Zavala, Genaro

    2014-06-01

    In this article we discuss the findings of our research on students' understanding of vector concepts in problems without physical context. First, we develop a complete taxonomy of the most frequent errors made by university students when learning vector concepts. This study is based on the results of several test administrations of open-ended problems in which a total of 2067 students participated. Using this taxonomy, we then designed a 20-item multiple-choice test [Test of understanding of vectors (TUV)] and administered it in English to 423 students who were completing the required sequence of introductory physics courses at a large private Mexican university. We evaluated the test's content validity, reliability, and discriminatory power. The results indicate that the TUV is a reliable assessment tool. We also conducted a detailed analysis of the students' understanding of the vector concepts evaluated in the test. The TUV is included in the Supplemental Material as a resource for other researchers studying vector learning, as well as instructors teaching the material.

  16. Will a Short Training Session Improve Multiple-Choice Item-Writing Quality by Dental School Faculty? A Pilot Study.

    Dellinges, Mark A; Curtis, Donald A

    2017-08-01

    Faculty members are expected to write high-quality multiple-choice questions (MCQs) in order to accurately assess dental students' achievement. However, most dental school faculty members are not trained to write MCQs. Extensive faculty development programs have been used to help educators write better test items. The aim of this pilot study was to determine if a short workshop would result in improved MCQ item-writing by dental school faculty at one U.S. dental school. A total of 24 dental school faculty members who had previously written MCQs were randomized into a no-intervention group and an intervention group in 2015. Six previously written MCQs were randomly selected from each of the faculty members and given an item quality score. The intervention group participated in a training session of one-hour duration that focused on reviewing standard item-writing guidelines to improve in-house MCQs. The no-intervention group did not receive any training but did receive encouragement and an explanation of why good MCQ writing was important. The faculty members were then asked to revise their previously written questions, and these were given an item quality score. The item quality scores for each faculty member were averaged, and the difference from pre-training to post-training scores was evaluated. The results showed a significant difference between pre-training and post-training MCQ difference scores for the intervention group (p=0.04). This pilot study provides evidence that the training session of short duration was effective in improving the quality of in-house MCQs.

  17. Force Concept Inventory-based multiple-choice test for investigating students’ representational consistency

    Pasi Nieminen

    2010-08-01

    This study investigates students’ ability to interpret multiple representations consistently (i.e., representational consistency) in the context of the force concept. For this purpose we developed the Representational Variant of the Force Concept Inventory (R-FCI), which makes use of nine items from the 1995 version of the Force Concept Inventory (FCI). These original FCI items were redesigned using various representations (such as motion map, vectorial, and graphical), yielding 27 multiple-choice items concerning four central concepts underpinning the force concept: Newton’s first, second, and third laws, and gravitation. We provide some evidence for the validity and reliability of the R-FCI; this analysis is limited to the student population of one Finnish high school. The students took the R-FCI at the beginning and at the end of their first high school physics course. We found that students’ (n=168) representational consistency (whether scientifically correct or not) varied considerably depending on the concept. On average, representational consistency and scientifically correct understanding increased during the instruction, although in the post-test only a few students performed consistently both in terms of representations and scientifically correct understanding. We also compared students’ (n=87) results on the R-FCI and the FCI, and found that they correlated quite well.

  18. Decision making under internal uncertainty: the case of multiple-choice tests with different scoring rules.

    Bereby-Meyer, Yoella; Meyer, Joachim; Budescu, David V

    2003-02-01

    This paper assesses framing effects on decision making with internal uncertainty, i.e., partial knowledge, by focusing on examinees' behavior in multiple-choice (MC) tests with different scoring rules. In two experiments participants answered a general-knowledge MC test that consisted of 34 solvable and 6 unsolvable items. Experiment 1 studied two scoring rules involving Positive (only gains) and Negative (only losses) scores. Although answering all items was the dominating strategy for both rules, the results revealed a greater tendency to answer under the Negative scoring rule. These results are in line with the predictions derived from Prospect Theory (PT) [Econometrica 47 (1979) 263]. The second experiment studied two scoring rules, which allowed respondents to exhibit partial knowledge. Under the Inclusion-scoring rule the respondents mark all answers that could be correct, and under the Exclusion-scoring rule they exclude all answers that might be incorrect. As predicted by PT, respondents took more risks under the Inclusion rule than under the Exclusion rule. The results illustrate that the basic process that underlies choice behavior under internal uncertainty and especially the effect of framing is similar to the process of choice under external uncertainty and can be described quite accurately by PT. Copyright 2002 Elsevier Science B.V.

  19. Exploring problem solving strategies on multiple-choice science items: Comparing native Spanish-speaking English Language Learners and mainstream monolinguals

    Kachchaf, Rachel Rae

    The purpose of this study was to compare how English language learners (ELLs) and monolingual English speakers solved multiple-choice items administered with and without a new form of testing accommodation---vignette illustration (VI). By incorporating theories from second language acquisition, bilingualism, and sociolinguistics, this study was able to gain more accurate and comprehensive input into the ways students interacted with items. This mixed methods study used verbal protocols to elicit the thinking processes of 36 native Spanish-speaking ELLs and 36 native English-speaking non-ELLs when solving multiple-choice science items. Results from both qualitative and quantitative analyses show that ELLs used a wider variety of actions oriented to making sense of the items than non-ELLs. In contrast, non-ELLs used more problem solving strategies than ELLs. There were no statistically significant differences in student performance based on the interaction of presence of illustration and linguistic status or the main effect of presence of illustration. However, there were significant differences based on the main effect of linguistic status. An interaction between the characteristics of the students, the items, and the illustrations indicates considerable heterogeneity in the ways in which students from both linguistic groups think about and respond to science test items. The results of this study speak to the need for more research involving ELLs in the process of test development to create test items that do not require ELLs to carry out significantly more actions to make sense of the item than monolingual students.

  20. Constructing a multiple choice test to measure elementary school teachers' Pedagogical Content Knowledge of technology education.

    Rohaan, E.J.; Taconis, R.; Jochems, W.M.G.

    2009-01-01

    This paper describes the construction and validation of a multiple choice test to measure elementary school teachers' Pedagogical Content Knowledge of technology education. Pedagogical Content Knowledge is generally accepted to be a crucial domain of teacher knowledge and is, therefore, an important

  1. Item Analysis in Introductory Economics Testing.

    Tinari, Frank D.

    1979-01-01

    Computerized analysis of multiple choice test items is explained. Examples of item analysis applications in the introductory economics course are discussed with respect to three objectives: to evaluate learning; to improve test items; and to help improve classroom instruction. Problems, costs and benefits of the procedures are identified. (JMD)

  2. Comparing Item Performance on Three- Versus Four-Option Multiple Choice Questions in a Veterinary Toxicology Course.

    Royal, Kenneth; Dorman, David

    2018-06-09

    The number of answer options is an important element of multiple-choice questions (MCQs). Many MCQs contain four or more options despite the limited literature suggesting that there is little to no benefit beyond three options. The purpose of this study was to evaluate item performance on 3-option versus 4-option MCQs used in a core curriculum course in veterinary toxicology at a large veterinary medical school in the United States. A quasi-experimental, crossover design was used in which students in each class were randomly assigned to take one of two versions (A or B) of two major exams. Both the 3-option and 4-option MCQs resulted in similar psychometric properties. The findings of our study support earlier research in other medical disciplines and settings that likewise concluded there was no significant change in the psychometric properties of 3-option MCQs when compared with traditional MCQs with four or more options.

  3. Comparison between three option, four option and five option multiple choice question tests for quality parameters: A randomized study.

    Vegada, Bhavisha; Shukla, Apexa; Khilnani, Ajeetkumar; Charan, Jaykaran; Desai, Chetna

    2016-01-01

    Most academic teachers use four or five options per item of multiple choice question (MCQ) tests for formative and summative assessment. The optimal number of options per MCQ item is a matter of considerable debate among academic teachers in various educational fields, and there is a scarcity of published literature on the optimum number of options in the field of medical education. The aim was to compare three-option, four-option, and five-option MCQ tests on the quality parameters of reliability, validity, item analysis, distracter analysis, and time analysis. Participants were 3rd-semester M.B.B.S. students, divided randomly into three groups. Each group was randomly given one test set with three, four, or five options per item. Following the marking of the multiple choice tests, the participants' option selections were analyzed, and the three option formats were compared on mean marks, mean time, validity, reliability, facility value, discrimination index, point-biserial value, and distracter analysis. Students scored more (P = 0.000) and took less time (P = 0.009) to complete the three-option test than the four-option and five-option tests. Facility value was higher (P = 0.004) in the three-option group than in the four- and five-option groups. There was no significant difference between the three groups in validity, reliability, or item discrimination. Nonfunctioning distracters were more common in the four- and five-option groups than in the three-option group. Assessment based on three-option MCQs can thus be preferred over four-option and five-option MCQs.

  4. Do Students Behave Rationally in Multiple Choice Tests? Evidence from a Field Experiment

    María Paz Espinosa; Javier Gardeazabal

    2013-01-01

    A disadvantage of multiple choice tests is that students have incentives to guess. To discourage guessing, it is common to use scoring rules that either penalize wrong answers or reward omissions. In psychometrics, penalty and reward scoring rules are considered equivalent. However, experimental evidence indicates that students behave differently under penalty or reward scoring rules. These differences have been attributed to the different framing (penalty versus reward). In this paper, we mo...

  5. The Relationship of Deep and Surface Study Approaches on Factual and Applied Test-Bank Multiple-Choice Question Performance

    Yonker, Julie E.

    2011-01-01

    With the advent of online test banks and large introductory classes, instructors have often turned to textbook publisher-generated multiple-choice question (MCQ) exams in their courses. Multiple-choice questions are often divided into categories of factual or applied, thereby implicating levels of cognitive processing. This investigation examined…

  6. Analysis of Multiple Choice Tests Designed by Faculty Members of Kermanshah University of Medical Sciences

    Reza Pourmirza Kalhori

    2013-12-01

    Dear Editor, Multiple choice tests are the most common objective tests in medical education, used to assess individual knowledge, recall, recognition, and problem solving abilities. One component of testing is the post-test analysis. This component includes, first, qualitative analysis of the taxonomy of questions based on Bloom’s educational objectives and the percentage of questions with no structural problems; and second, quantitative analysis of reliability (KR-20) and of the difficulty and differentiation indices (1). This descriptive-analytical study aimed to qualitatively and quantitatively investigate the multiple-choice tests of the faculty members at Kermanshah University of Medical Sciences in 2009-2010. The sample size comprised 156 tests. Data were analyzed by SPSS-16 software using the t-test, chi-squared test, ANOVA, and Tukey multiple comparison tests. The means of reliability (KR-20), difficulty index, and discrimination index were 0.68 (±0.31), 0.56 (±0.15), and 0.21 (±0.15), respectively, which were acceptable. An analysis of the tests at Mashad University of Medical Sciences indicated that the mean reliability of the tests was 0.72, that 52.2% of the tests had an inappropriate difficulty index, and that 49.2% of the tests did not have an acceptable differentiation index (2). Comparison of the tests at Kermanshah University of Medical Sciences with tests for the anatomy, physiology, biochemistry, genetics, statistics, and behavioral sciences courses at a Malaysian Faculty of Medicine (3) and with tests at an Argentinian Faculty of Medicine (4) showed that while the difficulty index was acceptable in all three universities, the differentiation indices in the Malaysian and Argentinian medical faculties were higher than that at Kermanshah University of Medical Sciences. The means for questions with no structural flaws in all tests, taxonomy I, taxonomy II, and taxonomy III were 73.88% (±14.88), 34.65% (±15.78), 41.34% (± 13
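    The post-test statistics this letter reports can all be computed from a matrix of dichotomous (0/1) item scores. The following is a minimal illustrative sketch, not taken from the letter and using hypothetical toy data, of KR-20 reliability, the difficulty (facility) index, and an upper-lower differentiation (discrimination) index:

    ```python
    # Illustrative sketch: post-test item analysis for dichotomously
    # scored (0/1) multiple-choice data. All data below are hypothetical.

    def item_difficulty(scores, item):
        """Proportion of examinees answering the item correctly (facility)."""
        return sum(row[item] for row in scores) / len(scores)

    def kr20(scores):
        """Kuder-Richardson Formula 20 reliability for 0/1 items."""
        k = len(scores[0])                     # number of items
        n = len(scores)                        # number of examinees
        p = [item_difficulty(scores, i) for i in range(k)]
        pq = sum(pi * (1 - pi) for pi in p)    # sum of item variances
        totals = [sum(row) for row in scores]
        mean = sum(totals) / n
        var = sum((t - mean) ** 2 for t in totals) / n  # total-score variance
        return (k / (k - 1)) * (1 - pq / var)

    def differentiation(scores, item, frac=0.27):
        """Upper-lower index: facility in the top group minus the bottom group."""
        ranked = sorted(scores, key=sum, reverse=True)
        g = max(1, int(len(scores) * frac))
        upper = sum(row[item] for row in ranked[:g]) / g
        lower = sum(row[item] for row in ranked[-g:]) / g
        return upper - lower

    scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
    print(round(kr20(scores), 3))  # ≈ 0.8 for this toy matrix
    ```

    A KR-20 near 0.7 or above, a difficulty index near the middle of its range, and a differentiation index above roughly 0.2 are the conventional benchmarks against which the values in the letter are judged acceptable.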

  7. ACER Chemistry Test Item Collection. ACER Chemtic Year 12.

    Australian Council for Educational Research, Hawthorn.

    The chemistry test item bank contains 225 multiple-choice questions suitable for diagnostic and achievement testing; a three-page teacher's guide; an answer key with item facilities; an answer sheet; and a 45-item sample achievement test. Although written for the new grade 12 chemistry course in Victoria, Australia, the items are widely applicable.…

  8. Australian Chemistry Test Item Bank: Years 11 & 12. Volume 1.

    Commons, C., Ed.; Martin, P., Ed.

    Volume 1 of the Australian Chemistry Test Item Bank, consisting of two volumes, contains nearly 2000 multiple-choice items related to the chemistry taught in Year 11 and Year 12 courses in Australia. Items which were written during 1979 and 1980 were initially published in the "ACER Chemistry Test Item Collection" and in the "ACER…

  9. Identification of Misconceptions through Multiple Choice Tasks at Municipal Chemistry Competition Test

    Dušica D Milenković

    2016-01-01

    In this paper, the level of conceptual understanding of chemical contents among seventh-grade students who participated in the municipal Chemistry competition in Novi Sad, Serbia, in 2013 has been examined. Tests from the municipal chemistry competition were used as the measuring instrument, with only the multiple choice tasks considered and analyzed. The level of conceptual understanding of the tested chemical contents was determined from the frequency with which correct answers were chosen, identifying areas of satisfactory conceptual understanding, areas of roughly adequate performance, areas of inadequate performance, and areas of quite inadequate performance. The analysis of misconceptions, on the other hand, was based on the analysis of distractors. The results showed that a satisfactory level of conceptual understanding or roughly adequate performance characterized the majority of contents, which was expected since only the best students, who took part in the contest, were surveyed. However, the analysis also identified a large number of misconceptions. In most cases, these misconceptions were related to an inability to distinguish elements, compounds, and homogeneous and heterogeneous mixtures. In addition, students were shown to be unfamiliar with the crystal structure of diamond and with metric prefixes. The obtained results indicate insufficient visualization of the submicroscopic level in school textbooks, imprecise use of chemical language by teachers, and imprecise use of language in chemistry textbooks.

  10. Effect of response format on cognitive reflection: Validating a two- and four-option multiple choice question version of the Cognitive Reflection Test.

    Sirota, Miroslav; Juanchich, Marie

    2018-03-27

    The Cognitive Reflection Test, measuring intuition inhibition and cognitive reflection, has become extremely popular because it reliably predicts reasoning performance, decision-making, and beliefs. Across studies, the response format of CRT items sometimes differs, based on the assumed construct equivalence of tests with open-ended versus multiple-choice items (the equivalence hypothesis). Evidence and theoretical reasons, however, suggest that the cognitive processes measured by these response formats and their associated performances might differ (the nonequivalence hypothesis). We tested the two hypotheses experimentally by assessing the performance in tests with different response formats and by comparing their predictive and construct validity. In a between-subjects experiment (n = 452), participants answered stem-equivalent CRT items in an open-ended, a two-option, or a four-option response format and then completed tasks on belief bias, denominator neglect, and paranormal beliefs (benchmark indicators of predictive validity), as well as on actively open-minded thinking and numeracy (benchmark indicators of construct validity). We found no significant differences between the three response formats in the numbers of correct responses, the numbers of intuitive responses (with the exception of the two-option version, which had a higher number than the other tests), and the correlational patterns of the indicators of predictive and construct validity. All three test versions were similarly reliable, but the multiple-choice formats were completed more quickly. We speculate that the specific nature of the CRT items helps build construct equivalence among the different response formats. We recommend using the validated multiple-choice version of the CRT presented here, particularly the four-option CRT, for practical and methodological reasons. Supplementary materials and data are available at https://osf.io/mzhyc/.

  11. Effectiveness of Guided Multiple Choice Objective Questions Test on Students' Academic Achievement in Senior School Mathematics by School Location

    Igbojinwaekwu, Patrick Chukwuemeka

    2015-01-01

    This study investigated, using pretest-posttest quasi-experimental research design, the effectiveness of guided multiple choice objective questions test on students' academic achievement in Senior School Mathematics, by school location, in Delta State Capital Territory, Nigeria. The sample comprised 640 Students from four coeducation secondary…

  12. Electronics. Criterion-Referenced Test (CRT) Item Bank.

    Davis, Diane, Ed.

    This document contains 519 criterion-referenced multiple choice and true or false test items for a course in electronics. The test item bank is designed to work with both the Vocational Instructional Management System (VIMS) and the Vocational Administrative Management System (VAMS) in Missouri. The items are grouped into 15 units covering the…

  13. Assessing difference between classical test theory and item ...

    Assessing difference between classical test theory and item response theory methods in scoring primary four multiple choice objective test items. ... All research participants were ranked on the CTT number correct scores and the corresponding IRT item pattern scores from their performance on the PRISMADAT. Wilcoxon ...
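    The contrast this record draws can be illustrated with a small sketch (hypothetical items and responses, not the PRISMADAT data): under classical test theory the score is simply the number correct, whereas an IRT pattern score, here a 2PL maximum-likelihood ability estimate found by grid search, also weighs which items were answered correctly.

    ```python
    # Hypothetical sketch of CTT number-correct vs. IRT pattern scoring.
    # Under a 2PL model, two examinees with the same number correct can
    # receive different ability estimates because item discriminations
    # weight the response pattern.
    import math

    def p_correct(theta, a, b):
        """2PL probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def irt_pattern_score(responses, a, b):
        """Maximum-likelihood theta estimate via a simple grid search."""
        def loglik(theta):
            return sum(
                math.log(p_correct(theta, ai, bi)) if r == 1
                else math.log(1 - p_correct(theta, ai, bi))
                for r, ai, bi in zip(responses, a, b)
            )
        grid = [g / 100.0 for g in range(-400, 401)]  # theta in [-4, 4]
        return max(grid, key=loglik)

    a = [0.5, 1.0, 2.0]      # item discriminations (hypothetical)
    b = [-1.0, 0.0, 1.0]     # item difficulties (hypothetical)
    x = [1, 1, 0]            # missed the most discriminating item
    y = [0, 1, 1]            # missed the least discriminating item
    assert sum(x) == sum(y)  # identical CTT number-correct scores
    print(irt_pattern_score(x, a, b), irt_pattern_score(y, a, b))  # differ
    ```

    Under the Rasch model (all discriminations equal), the number correct is a sufficient statistic for ability, so CTT and IRT rankings can diverge only when, as in the 2PL sketch above, items differ in discrimination.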

  14. ACER Chemistry Test Item Collection (ACER CHEMTIC Year 12 Supplement).

    Australian Council for Educational Research, Hawthorn.

    This publication contains 317 multiple-choice chemistry test items related to topics covered in the Victorian (Australia) Year 12 chemistry course. It allows teachers access to a range of items suitable for diagnostic and achievement purposes, supplementing the ACER Chemistry Test Item Collection--Year 12 (CHEMTIC). The topics covered are: organic…

  15. Examining the Prediction of Reading Comprehension on Different Multiple-Choice Tests

    Andreassen, Rune; Braten, Ivar

    2010-01-01

    In this study, 180 Norwegian fifth-grade students with a mean age of 10.5 years were administered measures of word recognition skills, strategic text processing, reading motivation and working memory. Six months later, the same students were given three different multiple-choice reading comprehension measures. Based on three forced-order…

  16. Retrieval practice with short-answer, multiple-choice, and hybrid tests.

    Smith, Megan A; Karpicke, Jeffrey D

    2014-01-01

    Retrieval practice improves meaningful learning, and the most frequent way of implementing retrieval practice in classrooms is to have students answer questions. In four experiments (N=372) we investigated the effects of different question formats on learning. Students read educational texts and practised retrieval by answering short-answer, multiple-choice, or hybrid questions. In hybrid conditions students first attempted to recall answers in short-answer format, then identified answers in multiple-choice format. We measured learning 1 week later using a final assessment with two types of questions: those that could be answered by recalling information verbatim from the texts and those that required inferences. Practising retrieval in all format conditions enhanced retention, relative to a study-only control condition, on both verbatim and inference questions. However, there were little or no advantages of answering short-answer or hybrid format questions over multiple-choice questions in three experiments. In Experiment 4, when retrieval success was improved under initial short-answer conditions, there was an advantage of answering short-answer or hybrid questions over multiple-choice questions. The results challenge the simple conclusion that short-answer questions always produce the best learning, due to increased retrieval effort or difficulty, and demonstrate the importance of retrieval success for retrieval-based learning activities.

  17. Validation of a Standardized Multiple-Choice Multicultural Competence Test: Implications for Training, Assessment, and Practice

    Gillem, Angela R.; Bartoli, Eleonora; Bertsch, Kristin N.; McCarthy, Maureen A.; Constant, Kerra; Marrero-Meisky, Sheila; Robbins, Steven J.; Bellamy, Scarlett

    2016-01-01

    The Multicultural Counseling and Psychotherapy Test (MCPT), a measure of multicultural counseling competence (MCC), was validated in 2 phases. In Phase 1, the authors administered 451 test items derived from multicultural guidelines in counseling and psychology to 32 multicultural experts and 30 nonexperts. In Phase 2, the authors administered the…

  18. Student certainty answering misconception question: study of Three-Tier Multiple-Choice Diagnostic Test in Acid-Base and Solubility Equilibrium

    Ardiansah; Masykuri, M.; Rahardjo, S. B.

    2018-04-01

    Students’ concept comprehension in a three-tier multiple-choice diagnostic test is related to their confidence level, which in turn reflects certainty and self-efficacy. The purpose of this research was to find out how certain students were when answering a misconception test. This quantitative-qualitative study counted students’ confidence levels. The participants were 484 students studying the acid-base and solubility equilibrium topics. Data were collected using a thirty-question three-tier multiple-choice (3TMC) instrument and a student questionnaire. The findings showed that item #6, on calculating the pH of an ultra-dilute solution, gave the highest misconception percentage together with high student confidence. Further findings were that (1) students’ tendency to choose the misconception answer increased with item number, (2) students’ certainty decreased over the course of the 3TMC, and (3) students’ self-efficacy and achievement were related. The findings suggest some implications and limitations for further research.

  19. Test of Understanding of Vectors: A Reliable Multiple-Choice Vector Concept Test

    Barniol, Pablo; Zavala, Genaro

    2014-01-01

    In this article we discuss the findings of our research on students' understanding of vector concepts in problems without physical context. First, we develop a complete taxonomy of the most frequent errors made by university students when learning vector concepts. This study is based on the results of several test administrations of open-ended…

  20. Examining Gender DIF on a Multiple-Choice Test of Mathematics: A Confirmatory Approach.

    Ryan, Katherine E.; Fan, Meichu

    1996-01-01

    Results for 3,244 female and 3,033 male junior high school students from the Second International Mathematics Study show that applied items in algebra, geometry, and computation were easier for males but arithmetic items were differentially easier for females. Implications of these findings for assessment and instruction are discussed. (SLD)

  1. Measuring the Consistency in Change in Hepatitis B Knowledge among Three Different Types of Tests: True/False, Multiple Choice, and Fill in the Blanks Tests.

    Sahai, Vic; Demeyere, Petra; Poirier, Sheila; Piro, Felice

    1998-01-01

    The recall of information about Hepatitis B demonstrated by 180 seventh graders was tested with three test types: (1) short-answer; (2) true/false; and (3) multiple-choice. Short answer testing was the most reliable. Suggestions are made for the use of short-answer tests in evaluating student knowledge. (SLD)

  2. Post-Graduate Student Performance in "Supervised In-Class" vs. "Unsupervised Online" Multiple Choice Tests: Implications for Cheating and Test Security

    Ladyshewsky, Richard K.

    2015-01-01

    This research explores differences in multiple choice test (MCT) scores in a cohort of post-graduate students enrolled in a management and leadership course. A total of 250 students completed the MCT in either a supervised in-class paper and pencil test or an unsupervised online test. The only statistically significant difference between the nine…

  3. Force Concept Inventory-Based Multiple-Choice Test for Investigating Students' Representational Consistency

    Nieminen, Pasi; Savinainen, Antti; Viiri, Jouni

    2010-01-01

    This study investigates students' ability to interpret multiple representations consistently (i.e., representational consistency) in the context of the force concept. For this purpose we developed the Representational Variant of the Force Concept Inventory (R-FCI), which makes use of nine items from the 1995 version of the Force Concept Inventory…

  4. The Prevalence of Multiple-Choice Testing in Registered Nurse Licensure-Qualifying Nursing Education Programs in New York State.

    Birkhead, Susan; Kelman, Glenda; Zittel, Barbara; Jatulis, Linnea

    The aim of this study was to describe nurse educators' use of multiple-choice questions (MCQs) in testing in registered nurse licensure-qualifying nursing education programs in New York State. This study was a descriptive correlational analysis of data obtained from surveying 1,559 nurse educators; 297 educators from 61 institutions responded (response rate [RR] = 19 percent), yielding a final cohort of 200. MCQs were reported to comprise a mean of 81 percent of questions on a typical test. Baccalaureate program respondents were equally likely to use MCQs as associate degree program respondents (p > .05) but were more likely to report using other methods of assessing student achievement to construct course grades (p < .01). Both groups reported little use of alternate format-type questions. Respondent educators reported substantial reliance upon the use of MCQs, corroborating the limited data quantifying the prevalence of use of MCQ tests in licensure-qualifying nursing education programs.

  5. "None of the above" as a correct and incorrect alternative on a multiple-choice test: implications for the testing effect.

    Odegard, Timothy N; Koen, Joshua D

    2007-11-01

    Both positive and negative testing effects have been demonstrated with a variety of materials and paradigms (Roediger & Karpicke, 2006b). The present series of experiments replicate and extend the research of Roediger and Marsh (2005) with the addition of a "none-of-the-above" response option. Participants (n=32 in both experiments) read a set of passages, took an initial multiple-choice test, completed a filler task, and then completed a final cued-recall test (Experiment 1) or multiple-choice test (Experiment 2). Questions were manipulated on the initial multiple-choice test by adding a "none-of-the-above" response alternative (choice "E") that was incorrect ("E" Incorrect) or correct ("E" Correct). The results from both experiments demonstrated that the positive testing effect was negated when the "none-of-the-above" alternative was the correct response on the initial multiple-choice test, but was still present when the "none-of-the-above" alternative was an incorrect response.

  6. Mixed-Format Test Score Equating: Effect of Item-Type Multidimensionality, Length and Composition of Common-Item Set, and Group Ability Difference

    Wang, Wei

    2013-01-01

    Mixed-format tests containing both multiple-choice (MC) items and constructed-response (CR) items are now widely used in many testing programs. Mixed-format tests often are considered to be superior to tests containing only MC items although the use of multiple item formats leads to measurement challenges in the context of equating conducted under…

  7. Australian Chemistry Test Item Bank: Years 11 and 12. Volume 2.

    Commons, C., Ed.; Martin, P., Ed.

    The second volume of the Australian Chemistry Test Item Bank, consisting of two volumes, contains nearly 2000 multiple-choice items related to the chemistry taught in Year 11 and Year 12 courses in Australia. Items which were written during 1979 and 1980 were initially published in the "ACER Chemistry Test Item Collection" and in the…

  8. Memory-Context Effects of Screen Color in Multiple-Choice and Fill-In Tests

    Prestera, Gustavo E.; Clariana, Roy; Peck, Andrew

    2005-01-01

In this experimental study, 44 undergraduates completed five computer-based instructional lessons and either two multiple-choice tests or two fill-in-the-blank tests. Color-coded borders were displayed during the lesson, adjacent to the screen text and illustrations. In the experimental condition, corresponding border colors were shown at posttest.…

  9. Approaches to data analysis of multiple-choice questions

    Lin Ding; Robert Beichner

    2009-01-01

    This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics education research. We minimize mathematics, instead placing emphasis on data interpretation using these approaches.
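Of the five approaches listed, classical test theory is the simplest to sketch in code. The following is a minimal, illustrative computation of two classical item statistics, the difficulty index (proportion correct) and the point-biserial discrimination; the toy response matrix and function names are invented for illustration:

```python
# Illustrative classical test theory item statistics on a toy data set.
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

def item_stats(matrix):
    """Return a (difficulty, point-biserial) pair for each item."""
    n = len(matrix)
    totals = [sum(row) for row in matrix]          # total score per examinee
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5
    stats = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        p = sum(col) / n                           # difficulty index
        hi = [t for c, t in zip(col, totals) if c == 1]
        lo = [t for c, t in zip(col, totals) if c == 0]
        if not hi or not lo or sd_t == 0:
            r_pb = 0.0                             # undefined; report 0
        else:
            r_pb = ((sum(hi) / len(hi) - sum(lo) / len(lo)) / sd_t) \
                   * (p * (1 - p)) ** 0.5          # point-biserial correlation
        stats.append((p, r_pb))
    return stats

print([round(p, 2) for p, _ in item_stats(responses)])  # item difficulties
```

With five examinees these numbers are only illustrative; real item analyses require far larger samples.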

  10. Approaches to Data Analysis of Multiple-Choice Questions

    Ding, Lin; Beichner, Robert

    2009-01-01

    This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics…

  11. Advanced Marketing Core Curriculum. Test Items and Assessment Techniques.

    Smith, Clifton L.; And Others

    This document contains duties and tasks, multiple-choice test items, and other assessment techniques for Missouri's advanced marketing core curriculum. The core curriculum begins with a list of 13 suggested textbook resources. Next, nine duties with their associated tasks are given. Under each task appears one or more citations to appropriate…

  12. Mechanical Waves Conceptual Survey: Its Modification and Conversion to a Standard Multiple-Choice Test

    Barniol, Pablo; Zavala, Genaro

    2016-01-01

    In this article we present several modifications of the mechanical waves conceptual survey, the most important test to date that has been designed to evaluate university students' understanding of four main topics in mechanical waves: propagation, superposition, reflection, and standing waves. The most significant changes are (i) modification of…

  13. Performance of Men and Women on Multiple-Choice and Constructed-Response Tests for Beginning Teachers. Research Report. ETS RR-04-48

    Livingston, Samuel A.; Rupp, Stacie L.

    2004-01-01

    Some previous research results imply that women tend to perform better, relative to men, on constructed-response (CR) tests than on multiple-choice (MC) tests in the same subjects. An analysis of data from several tests used in the licensing of beginning teachers supported this hypothesis, to varying degrees, in most of the tests investigated. The…

  14. Multiple-Choice Testing Using Immediate Feedback--Assessment Technique (IF AT®) Forms: Second-Chance Guessing vs. Second-Chance Learning?

    Merrel, Jeremy D.; Cirillo, Pier F.; Schwartz, Pauline M.; Webb, Jeffrey A.

    2015-01-01

    Multiple choice testing is a common but often ineffective method for evaluating learning. A newer approach, however, using Immediate Feedback Assessment Technique (IF AT®, Epstein Educational Enterprise, Inc.) forms, offers several advantages. In particular, a student learns immediately if his or her answer is correct and, in the case of an…

  15. Evolution of a Test Item

    Spaan, Mary

    2007-01-01

    This article follows the development of test items (see "Language Assessment Quarterly", Volume 3 Issue 1, pp. 71-79 for the article "Test and Item Specifications Development"), beginning with a review of test and item specifications, then proceeding to writing and editing of items, pretesting and analysis, and finally selection of an item for a…

  16. Making the Most of Multiple Choice

    Brookhart, Susan M.

    2015-01-01

    Multiple-choice questions draw criticism because many people perceive they test only recall or atomistic, surface-level objectives and do not require students to think. Although this can be the case, it does not have to be that way. Susan M. Brookhart suggests that multiple-choice questions are a useful part of any teacher's questioning repertoire…

  17. An Explanatory Item Response Theory Approach for a Computer-Based Case Simulation Test

    Kahraman, Nilüfer

    2014-01-01

    Problem: Practitioners working with multiple-choice tests have long utilized Item Response Theory (IRT) models to evaluate the performance of test items for quality assurance. The use of similar applications for performance tests, however, is often encumbered due to the challenges encountered in working with complicated data sets in which local…

  18. catcher: A Software Program to Detect Answer Copying in Multiple-Choice Tests Based on Nominal Response Model

    Kalender, Ilker

    2012-01-01

catcher is a software program designed to compute the ω (omega) index, a common statistical index for the identification of collusion (cheating) among examinees taking an educational or psychological test. It requires (a) responses and (b) ability estimations of individuals, and (c) item parameters to make computations and outputs the results of…

  19. Guide to good practices for the development of test items

    NONE

    1997-01-01

While the methodology used in developing test items can vary significantly, test items should be developed systematically to ensure quality examinations. Test design and development is discussed in the DOE Guide to Good Practices for Design, Development, and Implementation of Examinations. The present guide supplements that document by providing more detailed guidance on the development of specific test items. It primarily addresses written examination test items, although many of the concepts also apply to oral examinations, both in the classroom and on the job. It is intended as guidance for the classroom and laboratory instructor or curriculum developer responsible for the construction of individual test items. The document focuses on written test items, but also includes information on open-reference (open-book) examination test items. These test items have been categorized as short-answer, multiple-choice, or essay. Each test item format is described, examples are provided, and a procedure for development is included. The appendices provide examples for writing test items, a test item development form, and examples of various test item formats.

  20. Approaches to data analysis of multiple-choice questions

    Lin Ding

    2009-09-01

This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics education research. We minimize mathematics, instead placing emphasis on data interpretation using these approaches.

  1. Analysis Test of Understanding of Vectors with the Three-Parameter Logistic Model of Item Response Theory and Item Response Curves Technique

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-01-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming…

  2. Test of Achievement in Quantitative Economics for Secondary Schools: Construction and Validation Using Item Response Theory

    Eleje, Lydia I.; Esomonu, Nkechi P. M.

    2018-01-01

    A Test to measure achievement in quantitative economics among secondary school students was developed and validated in this study. The test is made up 20 multiple choice test items constructed based on quantitative economics sub-skills. Six research questions guided the study. Preliminary validation was done by two experienced teachers in…

  3. Differential Weighting of Items to Improve University Admission Test Validity

    Eduardo Backhoff Escudero

    2001-05-01

This paper evaluates different ways to increase university admission test criterion-related validity by differentially weighting test items. We compared four methods of weighting multiple-choice items of the Basic Skills and Knowledge Examination (EXHCOBA): (1) penalizing incorrect responses by a constant factor; (2) weighting incorrect responses, considering the levels of error; (3) weighting correct responses, considering the item's difficulty, based on classical measurement theory; and (4) weighting correct responses, considering the item's difficulty, based on item response theory. Results show that none of these methods increased the instrument's predictive validity, although they did improve its concurrent validity. It was concluded that it is appropriate to score the test by simply adding up correct responses.
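Two of the four weighting schemes compared above are easy to illustrate. The sketch below shows (1) a classical correction-for-guessing formula that penalizes incorrect responses by a constant factor, and (3) a difficulty weighting in which a correct answer on item j earns 1 − p_j, where p_j is the proportion of examinees answering item j correctly. The penalty constant and function names are assumptions for illustration, not the EXHCOBA scoring rules:

```python
# Sketch of two differential-weighting schemes (illustrative, not EXHCOBA's).

def formula_score(correct, wrong, k=4):
    """Classical correction-for-guessing score for k-option items:
    right minus wrong/(k-1), so random guessing expects zero gain."""
    return correct - wrong / (k - 1)

def difficulty_weighted_score(item_scores, difficulties):
    """Sum of (1 - p_j) over items answered correctly, where p_j is the
    proportion of examinees who answered item j correctly; harder items
    therefore contribute more to the total."""
    return sum(1 - p for s, p in zip(item_scores, difficulties) if s == 1)

print(formula_score(30, 12, k=4))  # 30 right, 12 wrong -> 26.0
```

Under scheme (1), an examinee with 30 right and 12 wrong on 4-option items scores 30 − 12/3 = 26; under scheme (3), correct answers on easy items (high p_j) add little, while correct answers on hard items add nearly a full point.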

  4. A multiple choice testing program coupled with a year-long elective experience is associated with improved performance on the internal medicine in-training examination.

    Mathis, Bradley R; Warm, Eric J; Schauer, Daniel P; Holmboe, Eric; Rouan, Gregory W

    2011-11-01

The Internal Medicine In-Training Exam (IM-ITE) assesses the content knowledge of internal medicine trainees. Many programs use the IM-ITE to counsel residents, to create individual remediation plans, and to make fundamental programmatic and curricular modifications. To assess the association between a multiple-choice testing program administered during 12 consecutive months of ambulatory and inpatient elective experience and IM-ITE percentile scores in third post-graduate year (PGY-3) categorical residents. Retrospective cohort study. One hundred and four categorical internal medicine residents. Forty-five residents in the 2008 and 2009 classes participated in the study group, and the 59 residents in the three classes that preceded the use of the testing program, 2005-2007, served as controls. A comprehensive, elective-rotation-specific, multiple-choice testing program and a separate board review program, both administered during a continuous long-block elective experience during the twelve months between the second post-graduate year (PGY-2) and PGY-3 in-training examinations. We analyzed the change in median individual percent-correct and percentile scores between the PGY-1 and PGY-2 IM-ITE and between the PGY-2 and PGY-3 IM-ITE in both control and study cohorts. For our main outcome measure, we compared the change in median individual percentile rank between the control and study cohorts between the PGY-2 and the PGY-3 IM-ITE testing opportunities. After experiencing the educational intervention, the study group demonstrated a statistically significant increase of 8.5 percentile points in median individual IM-ITE percentile score between the PGY-2 and PGY-3 examinations, consistent with the testing program being associated with improved IM-ITE performance.

  5. Multiple-Choice Exams and Guessing: Results from a One-Year Study of General Chemistry Tests Designed to Discourage Guessing

    Campbell, Mark L.

    2015-01-01

Multiple-choice exams, while widely used, are necessarily imprecise due to the contribution to the final student score from guessing. This past year at the United States Naval Academy, the construction and grading scheme for the department-wide general chemistry multiple-choice exams were revised with the goal of decreasing the contribution of…

  6. Psychometrics of Multiple Choice Questions with Non-Functioning Distracters: Implications to Medical Education.

    Deepak, Kishore K; Al-Umran, Khalid Umran; AI-Sheikh, Mona H; Dkoli, B V; Al-Rubaish, Abdullah

    2015-01-01

The functionality of distracters in a multiple-choice question plays a very important role. We examined the frequency and impact of functioning and non-functioning distracters on the psychometric properties of 5-option items in clinical disciplines. We analyzed item statistics of 1115 multiple-choice questions from 15 summative assessments of undergraduate medical students and classified the items into five groups by their number of non-functioning distracters. We analyzed the effect of varying degrees of non-functionality, ranging from 0 to 4, on test reliability, difficulty index, discrimination index, and point-biserial correlation. The non-functionality of distracters inversely affected the test reliability and quality of items in a predictable manner. The non-functioning distracters made the items easier and lowered the discrimination index significantly. Three non-functional distracters in a 5-option MCQ significantly affected all psychometric properties, whereas items with fewer non-functioning distracters remained psychometrically as effective as 5-option items. Our study reveals that a multiple-choice question with 3 functional options provides the lowermost limit of item format that has adequate psychometric properties. Tests containing items with fewer functioning options have significantly lower reliability. Distracter function analysis and revision of non-functioning distracters can serve as important methods to improve the psychometrics and reliability of assessment.
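Distracter-function analysis of the kind described above is often operationalized with a selection-frequency cutoff. A common rule of thumb, used here as an assumption rather than necessarily the authors' criterion, flags a distracter as non-functioning if fewer than 5% of examinees choose it:

```python
from collections import Counter

def nonfunctioning_distractors(choices, key, options="ABCDE", threshold=0.05):
    """Return the distracters chosen by fewer than `threshold` of examinees.

    choices:  list of selected options, one per examinee (e.g. 'A'..'E').
    key:      the correct option, which is excluded from the check.
    """
    n = len(choices)
    counts = Counter(choices)
    distractors = set(options) - {key}
    return sorted(d for d in distractors if counts.get(d, 0) / n < threshold)

# Hypothetical item: 'B' is the key; 'D' and 'E' attract almost nobody.
picks = ["B"] * 60 + ["A"] * 20 + ["C"] * 18 + ["D"] * 2
print(nonfunctioning_distractors(picks, key="B"))  # -> ['D', 'E']
```

Flagged distracters would then be rewritten or dropped, which is the revision step the abstract recommends.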

  7. Gender and Ethnicity Differences in Multiple-Choice Testing. Effects of Self-Assessment and Risk-Taking Propensity

    1993-05-01

correctness of the response provides some advantages. They are: 1. Increased reliability of the test; 2. Examinees pay more attention to the multiple…their choice of test date. Each sign-up sheet was divided into four cells: non-Hispanic males and females, and Hispanic males and females.…certain prestige and financial rewards; or entering a conservatory of music for advanced training with a well-known pianist. Mr. H realizes that even

  8. A singular choice for multiple choice

    Frandsen, Gudmund Skovbjerg; Schwartzbach, Michael Ignatieff

    2006-01-01

    How should multiple choice tests be scored and graded, in particular when students are allowed to check several boxes to convey partial knowledge? Many strategies may seem reasonable, but we demonstrate that five self-evident axioms are sufficient to determine completely the correct strategy. We ...

  9. Selecting Items for Criterion-Referenced Tests.

    Mellenbergh, Gideon J.; van der Linden, Wim J.

    1982-01-01

    Three item selection methods for criterion-referenced tests are examined: the classical theory of item difficulty and item-test correlation; the latent trait theory of item characteristic curves; and a decision-theoretic approach for optimal item selection. Item contribution to the standardized expected utility of mastery testing is discussed. (CM)

  10. Multiple-Choice and Short-Answer Exam Performance in a College Classroom

    Funk, Steven C.; Dickson, K. Laurie

    2011-01-01

    The authors experimentally investigated the effects of multiple-choice and short-answer format exam items on exam performance in a college classroom. They randomly assigned 50 students to take a 10-item short-answer pretest or posttest on two 50-item multiple-choice exams in an introduction to personality course. Students performed significantly…

  11. Effects of Reducing the Cognitive Load of Mathematics Test Items on Student Performance

    Susan C. Gillmor

    2015-01-01

This study explores a new item-writing framework for improving the validity of math assessment items. The authors transfer insights from Cognitive Load Theory (CLT), traditionally used in instructional design, to educational measurement. Fifteen multiple-choice math assessment items were modified using research-based strategies for reducing extraneous cognitive load. An experimental design with 222 middle-school students tested the effects of the reduced-cognitive-load items on student performance and anxiety. Significant findings confirm the main research hypothesis that reducing the cognitive load of math assessment items improves student performance. Three load-reducing item modifications are identified as particularly effective for reducing item difficulty: signalling important information, aesthetic item organization, and removing extraneous content. Load reduction was not shown to impact student anxiety. Implications for classroom assessment and future research are discussed.

  12. Social attribution test--multiple choice (SAT-MC) in schizophrenia: comparison with community sample and relationship to neurocognitive, social cognitive and symptom measures.

    Bell, Morris D; Fiszdon, Joanna M; Greig, Tamasine C; Wexler, Bruce E

    2010-09-01

    This is the first report on the use of the Social Attribution Task - Multiple Choice (SAT-MC) to assess social cognitive impairments in schizophrenia. The SAT-MC was originally developed for autism research, and consists of a 64-second animation showing geometric figures enacting a social drama, with 19 multiple choice questions about the interactions. Responses from 85 community-dwelling participants and 66 participants with SCID confirmed schizophrenia or schizoaffective disorders (Scz) revealed highly significant group differences. When the two samples were combined, SAT-MC scores were significantly correlated with other social cognitive measures, including measures of affect recognition, theory of mind, self-report of egocentricity and the Social Cognition Index from the MATRICS battery. Using a cut-off score, 53% of Scz were significantly impaired on SAT-MC compared with 9% of the community sample. Most Scz participants with impairment on SAT-MC also had impairment on affect recognition. Significant correlations were also found with neurocognitive measures but with less dependence on verbal processes than other social cognitive measures. Logistic regression using SAT-MC scores correctly classified 75% of both samples. Results suggest that this measure may have promise, but alternative versions will be needed before it can be used in pre-post or longitudinal designs. (c) 2009 Elsevier B.V. All rights reserved.

  13. [Continuing medical education: how to write multiple choice questions].

    Soler Fernández, R; Méndez Díaz, C; Rodríguez García, E

    2013-06-01

Evaluating professional competence in medicine is a difficult but indispensable task because it makes it possible to evaluate, at different times and from different perspectives, the extent to which the knowledge, skills, and values required for exercising the profession have been acquired. Tests based on multiple choice questions have been and continue to be among the most useful tools for objectively evaluating learning in medicine. When these tests are well designed and correctly used, they can stimulate learning and even measure higher cognitive skills. Designing a multiple choice test is a difficult task that requires knowledge of the material to be tested and of the methodology of test preparation, as well as time to prepare the test. The aim of this article is to review what can be evaluated through multiple choice tests, the rules and guidelines that should be taken into account when writing multiple choice questions, the different formats that can be used, the most common errors in elaborating multiple choice tests, and how to analyze the results of the test to verify its quality. Copyright © 2012 SERAM. Published by Elsevier España. All rights reserved.

  14. Multiple-choice pretesting potentiates learning of related information.

    Little, Jeri L; Bjork, Elizabeth Ligon

    2016-10-01

    Although the testing effect has received a substantial amount of empirical attention, such research has largely focused on the effects of tests given after study. The present research examines the effect of using tests prior to study (i.e., as pretests), focusing particularly on how pretesting influences the subsequent learning of information that is not itself pretested but that is related to the pretested information. In Experiment 1, we found that multiple-choice pretesting was better for the learning of such related information than was cued-recall pretesting or a pre-fact-study control condition. In Experiment 2, we found that the increased learning of non-pretested related information following multiple-choice testing could not be attributed to increased time allocated to that information during subsequent study. Last, in Experiment 3, we showed that the benefits of multiple-choice pretesting over cued-recall pretesting for the learning of related information persist over 48 hours, thus demonstrating the promise of multiple-choice pretesting to potentiate learning in educational contexts. A possible explanation for the observed benefits of multiple-choice pretesting for enhancing the effectiveness with which related nontested information is learned during subsequent study is discussed.

  15. Evaluating multiple-choice exams in large introductory physics courses

    Gary Gladding

    2006-07-01

The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, Graduate Record Examination, and the Force Concept Inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study, the reliability and validity of scores from multiple-choice exams written for and administered in the large introductory physics courses at the University of Illinois, Urbana-Champaign were investigated. The reliability of exam scores over the course of a semester results in approximately a 3% uncertainty in students' total semester exam score. This semester test score uncertainty yields an uncertainty in the students' assigned letter grade that is less than 1/3 of a letter grade. To study the validity of exam scores, a subset of students were ranked independently based on their multiple-choice score, graded explanations, and student interviews. The ranking of these students based on their multiple-choice score was found to be consistent with the ranking assigned by physics instructors based on the students' written explanations (r > 0.94 at the 95% confidence level) and oral interviews (r = 0.94, +0.06/−0.09).
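The score reliability this abstract quantifies is commonly estimated with coefficient alpha (equivalent to KR-20 for dichotomously scored items). A minimal sketch on a small persons-by-items matrix, with invented toy data:

```python
def cronbach_alpha(matrix):
    """Coefficient alpha for a persons x items score matrix; reduces to
    KR-20 when items are scored 0/1. Population variances (divide by n)
    are used throughout."""
    n = len(matrix)
    k = len(matrix[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var([row[j] for row in matrix]) for j in range(k))
    total_var = var([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy 4-examinee, 3-item matrix (illustration only; real reliability
# estimates need many examinees and items).
matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
print(cronbach_alpha(matrix))  # -> 0.75
```

Higher alpha means a smaller relative uncertainty in total scores, which is the quantity the abstract reports as roughly 3% over a semester.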

  16. Binomial test models and item difficulty

    van der Linden, Willem J.

    1979-01-01

    In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty. In this paper these conditions are examined for both a deterministic and a stochastic conception of item responses. It appears that they are more restrictive than is generally

  17. Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-12-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
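The three-parameter logistic model fitted above has a simple closed form: the probability of a correct response is P(θ) = c + (1 − c)/(1 + e^(−Da(θ − b))), with discrimination a, difficulty b, and pseudo-guessing c. A minimal sketch follows; the scaling constant D = 1.7 and the example parameter values are conventional choices, not values from the TUV analysis:

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability theta, for an item
    with discrimination a, difficulty b, and pseudo-guessing c. D = 1.7 is
    the conventional normal-ogive scaling constant."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Item response curve behavior: low-ability examinees approach the guessing
# floor c, high-ability examinees approach 1, and the curve passes through
# (1 + c) / 2 exactly at theta = b.
for theta in (-3.0, 0.0, 3.0):
    print(round(p_correct(theta, a=1.2, b=0.0, c=0.2), 3))
```

Plotting such curves per response option, rather than per item, gives the simplified item response curves (IRC) used in the abstract to judge distractor effectiveness.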

  18. Development and Application of a Two-Tier Multiple-Choice Diagnostic Test for High School Students' Understanding of Cell Division and Reproduction

    Sesli, Ertugrul; Kara, Yilmaz

    2012-01-01

    This study involved the development and application of a two-tier diagnostic test for measuring students' understanding of cell division and reproduction. The instrument development procedure had three general steps: defining the content boundaries of the test, collecting information on students' misconceptions, and instrument development.…

  19. Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

    Suttida Rakkapao

    2016-10-01

This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.

  20. Senior high school students’ need analysis of Three-Tier Multiple Choice (3TMC) diagnostic test about acid-base and solubility equilibrium

    Ardiansah; Masykuri, M.; Rahardjo, S. B.

    2018-05-01

Students’ conceptual understanding is the foundation on which related understanding is built; students, however, often hold their own conceptions. Through this needs analysis, we elicit students’ need for a 3TMC diagnostic test to measure their conceptions of acid-base chemistry and solubility equilibrium. The research was done with a mixed method, analyzing questionnaires both quantitatively and qualitatively by descriptive means. The subjects were 96 students from 4 senior high schools and 4 chemistry teachers, chosen by a random sampling technique. Data were gathered with a questionnaire of 10 questions for students and 28 questions for teachers. The results showed that 97% of students stated that the development of this instrument is needed. In addition, the questionnaire revealed several problems, including learning activity, teachers’ tests, and guessing. In conclusion, it is necessary to develop a 3TMC instrument that can diagnose and measure students’ conceptions of acid-base chemistry and solubility equilibrium.

  1. Algorithmic test design using classical item parameters

    van der Linden, Willem J.; Adema, Jos J.

Two optimization models for the construction of tests with a maximal value of coefficient alpha are given. Both models have a linear form and can be solved by using a branch-and-bound algorithm. The first model assumes an item bank calibrated under the Rasch model and can be used, for instance,
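The paper formulates maximal-alpha test construction as a linear model solved by branch-and-bound. As a much simpler stand-in (not the authors' algorithm), a greedy heuristic can grow a subtest by repeatedly adding the bank item that most increases coefficient alpha; all data and names below are invented for illustration:

```python
import itertools

def alpha(matrix, items):
    """Coefficient alpha of the subtest formed by column indices `items`
    of a persons x items score matrix (population variances)."""
    k = len(items)

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    total_var = var([sum(row[j] for j in items) for row in matrix])
    if k < 2 or total_var == 0:
        return float("-inf")  # alpha undefined for these subtests
    item_vars = sum(var([row[j] for row in matrix]) for j in items)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def greedy_select(matrix, length):
    """Greedy maximal-alpha subtest: seed with the best pair of items,
    then add one item at a time."""
    n_items = len(matrix[0])
    chosen = list(max(itertools.combinations(range(n_items), 2),
                      key=lambda pair: alpha(matrix, list(pair))))
    remaining = [j for j in range(n_items) if j not in chosen]
    while len(chosen) < length:
        best = max(remaining, key=lambda j: alpha(matrix, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)

# Items 0 and 1 are perfectly correlated, item 2 is near-noise, so the
# greedy heuristic should pair items 0 and 1.
bank = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
print(greedy_select(bank, 2))  # -> [0, 1]
```

Unlike branch-and-bound, this heuristic carries no optimality guarantee; it only illustrates the objective the models in the paper optimize exactly.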

  2. Bayesian item selection criteria for adaptive testing

    van der Linden, Willem J.

    1996-01-01

    R.J. Owen (1975) proposed an approximate empirical Bayes procedure for item selection in adaptive testing. The procedure replaces the true posterior by a normal approximation with closed-form expressions for its first two moments. This approximation was necessary to minimize the computational

  3. Using Multiple-Choice Questions to Evaluate In-Depth Learning of Economics

    Buckles, Stephen; Siegfried, John J.

    2006-01-01

    Multiple-choice questions are the basis of a significant portion of assessment in introductory economics courses. However, these questions, as found in course assessments, test banks, and textbooks, often fail to evaluate students' abilities to use and apply economic analysis. The authors conclude that multiple-choice questions can be used to…

  4. Item calibration in incomplete testing designs

    Norman D. Verhelst

    2011-01-01

This study discusses the justifiability of item parameter estimation in incomplete testing designs in item response theory. Marginal maximum likelihood (MML) as well as conditional maximum likelihood (CML) procedures are considered in three commonly used incomplete designs: random incomplete, multistage testing, and targeted testing designs. Mislevy and Sheehan (1989) have shown that in incomplete designs the justifiability of MML can be deduced from Rubin's (1976) general theory on inference in the presence of missing data. Their results are recapitulated and extended to more situations. In this study it is shown that for CML estimation the justification must be established in an alternative way, by considering the neglected part of the complete likelihood. The problems with incomplete designs are not generally recognized in practical situations. This is due to the stochastic nature of the incomplete designs, which is not taken into account in standard computer algorithms. For that reason, incorrect uses of standard MML and CML algorithms are discussed.

  5. Science Library of Test Items. Volume Three. Mastery Testing Programme. Introduction and Manual.

    New South Wales Dept. of Education, Sydney (Australia).

    A set of short tests aimed at measuring student mastery of specific skills in the natural sciences are presented with a description of the mastery program's purposes, development, and methods. Mastery learning, criterion-referenced testing, and the scope of skills to be tested are defined. Each of the multiple choice tests for grades 7 through 10…

  6. Comedy workshop: an enjoyable way to develop multiple-choice questions.

    Droegemueller, William; Gant, Norman; Brekken, Alvin; Webb, Lynn

    2005-01-01

To describe an innovative method of developing multiple-choice items for a board certification examination. The development of appropriate multiple-choice items is more of an art than a science. The comedy-workshop format for developing questions for a certification examination is similar to the process used by comedy writers composing scripts for television shows. This group format dramatically diminishes the frustrations faced by an individual question writer attempting to create items. The vast majority of our comedy-workshop participants enjoy and prefer this format. It provides an ideal environment in which to teach and blend the talents of inexperienced and experienced question writers. This is a descriptive article in which we suggest an innovative process in the art of creating multiple-choice items for a high-stakes examination.

  7. Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items

    Gierl, Mark J.; Lai, Hollis

    2013-01-01

    Changes to the design and development of our educational assessments are resulting in an unprecedented demand for a large and continuous supply of content-specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer…

  8. The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory

    Sahin, Alper; Anil, Duygu

    2017-01-01

    This study investigates the effects of sample size and test length on item-parameter estimation in test development utilizing three unidimensional dichotomous models of item response theory (IRT). For this purpose, a real language test comprised of 50 items was administered to 6,288 students. Data from this test was used to obtain data sets of…

  9. Measuring University students' understanding of the greenhouse effect - a comparison of multiple-choice, short answer and concept sketch assessment tools with respect to students' mental models

    Gold, A. U.; Harris, S. E.

    2013-12-01

    The greenhouse effect comes up in most discussions about climate and is a key concept related to climate change. Existing studies have shown that students and adults alike lack a detailed understanding of this important concept or hold misconceptions about it. We studied the effectiveness of different interventions on university-level students' understanding of the greenhouse effect. Introductory-level science students were tested for their prior knowledge of the greenhouse effect using validated multiple-choice questions, short answers, and concept sketches. All students participated in a common lesson about the greenhouse effect and were then randomly assigned to one of two lab groups. One group explored an existing simulation about the greenhouse effect (PhET lesson) and the other group worked with absorption spectra of different greenhouse gases (Data lesson) to deepen their understanding of the greenhouse effect. After the lab lesson, all students completed the same assessment, which included multiple-choice, short-answer, and concept-sketch items. In all, 164 students completed the assessments: 76 completed the PhET lesson, 77 completed the Data lesson, and 11 missed the contrasting lesson. In this presentation we compare students' multiple-choice answers, short answers, and concept sketches, and explore how well each assessment type represents students' knowledge. We also identify items that indicate the level of understanding of the greenhouse effect, as measured by the correspondence of student answers to an expert mental model and expert responses. Preliminary analysis shows that students who produce concept sketches close to expert drawings also choose correct multiple-choice answers; however, correct multiple-choice answers do not necessarily indicate that a student will produce expert-like concept sketches. Multiple-choice questions that require detailed

  10. Evaluation of Northwest University, Kano Post-UTME Test Items Using Item Response Theory

    Bichi, Ado Abdu; Hafiz, Hadiza; Bello, Samira Abdullahi

    2016-01-01

    High-stakes testing is used for the purpose of providing results that have important consequences. Validity is the cornerstone upon which all measurement systems are built. This study applied Item Response Theory principles to analyse Northwest University Kano Post-UTME Economics test items. The fifty (50) developed economics test items were…

  11. Chemistry and biology by new multiple choice

    Seo, Hyeong Seok; Kim, Seong Hwan

    2003-02-01

    This book is divided into two parts. The first part is about chemistry, covering the science of matter, atomic structure and the periodic law, chemical bonding and intermolecular forces, states of matter and solutions, chemical reactions, and organic compounds. The second part describes biology, covering molecules and cells, energy in cells and chemical synthesis, molecular biology and heredity, animal function, plant function, and evolution and ecology. The book explains chemistry and biology through new multiple-choice questions.

  12. Computerized Adaptive Test (CAT) Applications and Item Response Theory Models for Polytomous Items

    Aybek, Eren Can; Demirtasli, R. Nukhet

    2017-01-01

    This article aims to provide a theoretical framework for computerized adaptive tests (CAT) and item response theory models for polytomous items. It also aims to introduce simulation and live CAT software to interested researchers. Computerized adaptive test algorithms, assumptions of item response theory models, nominal response…

  13. Can multiple-choice questions simulate free-response questions?

    Lin, Shih-Yin; Singh, Chandralekha

    2016-01-01

    We discuss a study to evaluate the extent to which free-response questions can be approximated by multiple-choice equivalents. Two carefully designed research-based multiple-choice questions were transformed into a free-response format and administered on the final exam in a calculus-based introductory physics course. The original multiple-choice questions were administered on the final exam in another, similar introductory physics course. Findings suggest that carefully designed multiple-choice...

  14. Modeling Local Item Dependence in Cloze and Reading Comprehension Test Items Using Testlet Response Theory

    Baghaei, Purya; Ravand, Hamdollah

    2016-01-01

    In this study the magnitudes of local dependence generated by cloze test items and reading comprehension items were compared and their impact on parameter estimates and test precision was investigated. An advanced English as a foreign language reading comprehension test containing three reading passages and a cloze test was analyzed with a…

  15. Item Response Theory Models for Performance Decline during Testing

    Jin, Kuan-Yu; Wang, Wen-Chung

    2014-01-01

    Sometimes, test-takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to…

  16. Genetic Algorithms for Multiple-Choice Problems

    Aickelin, Uwe

    2010-04-01

    This thesis investigates the use of problem-specific knowledge to enhance a genetic algorithm approach to multiple-choice optimisation problems. It shows that such information can significantly enhance performance, but that the choice of information and the way it is included are important factors for success. Two multiple-choice problems are considered. The first is constructing a feasible nurse roster that considers as many requests as possible. In the second problem, shops are allocated to locations in a mall subject to constraints while maximising the overall income. Genetic algorithms are chosen for their well-known robustness and ability to solve large and complex discrete optimisation problems. However, a survey of the literature reveals room for further research into generic ways to include constraints in a genetic algorithm framework. Hence, the main theme of this work is to balance feasibility and cost of solutions. In particular, co-operative co-evolution with hierarchical sub-populations, repair schemes that exploit problem structure, and indirect genetic algorithms with self-adjusting decoder functions are identified as promising approaches. The research starts by applying standard genetic algorithms to the problems and explaining the failure of such approaches due to epistasis. To overcome this, problem-specific information is added in a variety of ways, some of which are designed to increase the number of feasible solutions found whilst others are intended to improve the quality of such solutions. As well as a theoretical discussion of the underlying reasons for using each operator, extensive computational experiments are carried out on a variety of data. These show that the indirect approach relies less on problem structure and hence is easier to implement and superior in solution quality.
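The repair-scheme idea can be sketched with a toy example (our own construction with made-up weights and values, not the thesis code): a steady-state genetic algorithm for a tiny multiple-choice knapsack, where a greedy repair step restores feasibility after crossover and mutation.

```python
import random

# Hypothetical data: one option must be chosen per group, subject to a weight cap.
WEIGHTS = [[3, 5, 7], [2, 4, 6], [1, 8, 9]]
VALUES = [[6, 9, 12], [4, 7, 11], [2, 13, 15]]
CAPACITY = 14

def weight(sol):
    return sum(WEIGHTS[g][c] for g, c in enumerate(sol))

def fitness(sol):
    """Total value of the chosen options; solutions are repaired before scoring."""
    return sum(VALUES[g][c] for g, c in enumerate(sol))

def repair(sol):
    """Greedily downgrade the heaviest choice until the capacity constraint holds."""
    sol = list(sol)
    while weight(sol) > CAPACITY:
        g = max(range(len(sol)), key=lambda i: WEIGHTS[i][sol[i]])
        sol[g] = max(0, sol[g] - 1)
    return sol

def evolve(generations=100, pop_size=20, seed=0):
    """Steady-state GA: crossover + mutation, repair, replace the worst member."""
    rng = random.Random(seed)
    pop = [repair([rng.randrange(3) for _ in WEIGHTS]) for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)
        cut = rng.randrange(1, len(WEIGHTS))
        child = repair(a[:cut] + b[cut:])
        if rng.random() < 0.2:  # occasional mutation of one group's choice
            g = rng.randrange(len(WEIGHTS))
            child[g] = rng.randrange(3)
            child = repair(child)
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        if fitness(child) > fitness(pop[worst]):
            pop[worst] = child
    return max(pop, key=fitness)
```

Because every candidate passes through `repair`, the population contains only feasible rosters, which is the essence of balancing feasibility against solution cost.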

  17. An Investigation of Item Type in a Standards-Based Assessment.

    Liz Hollingworth

    2007-12-01

    Large-scale state assessment programs use both multiple-choice and open-ended items on tests for accountability purposes. Certainly, there is an intuitive belief among some educators and policy makers that open-ended items measure something different than multiple-choice items. This study examined two item formats in custom-built, standards-based tests of achievement in Reading and Mathematics at grades 3-8. In this paper, we raise questions about the value of including open-ended items, given scoring costs, time constraints, and the higher probability of missing data from test-takers.

  18. Evaluating multiple-choice exams in large introductory physics courses

    Gary Gladding; Tim Stelzer; Michael Scott

    2006-01-01

    The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, graduate record examination, and the force concept inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study, the reliability and validity of scores from multiple-choice exams written for and administered in the large introductory physics courses at the Unive...

  19. The Role of Item Feedback in Self-Adapted Testing.

    Roos, Linda L.; And Others

    1997-01-01

    The importance of item feedback in self-adapted testing was studied by comparing feedback and no feedback conditions for computerized adaptive tests and self-adapted tests taken by 363 college students. Results indicate that item feedback is not necessary to realize score differences between self-adapted and computerized adaptive testing. (SLD)

  20. Effect of Differential Item Functioning on Test Equating

    Kabasakal, Kübra Atalay; Kelecioglu, Hülya

    2015-01-01

    This study examines the effect of differential item functioning (DIF) items on test equating through multilevel item response models (MIRMs) and traditional IRMs. The performances of three different equating models were investigated under 24 different simulation conditions, and the variables whose effects were examined included sample size, test…

  1. Computerized adaptive testing item selection in computerized adaptive learning systems

    Eggen, Theodorus Johannes Hendrikus Maria; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item selection methods traditionally developed for computerized adaptive testing (CAT) are explored for their usefulness in item-based computerized adaptive learning (CAL) systems. While in CAT Fisher information-based selection is optimal, for recovering learning populations in CAL systems item

  2. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis

    Mohammed Ahmed M

    2009-07-01

    Background: Four- or five-option multiple choice questions (MCQs) are the standard in health-science disciplines, both on certification-level examinations and on in-house developed tests. Previous research has shown, however, that few MCQs have three or four functioning distractors. The purpose of this study was to investigate non-functioning distractors in teacher-developed tests in one nursing program in an English-language university in Hong Kong. Methods: Using item-analysis data, we assessed the proportion of non-functioning distractors on a sample of seven test papers administered to undergraduate nursing students. A total of 514 items were reviewed, including 2056 options (1542 distractors and 514 correct responses). Non-functioning options were defined as ones that were chosen by fewer than 5% of examinees and those with a positive option discrimination statistic. Results: The proportion of items containing 0, 1, 2, and 3 functioning distractors was 12.3%, 34.8%, 39.1%, and 13.8% respectively. Overall, items contained an average of 1.54 (SD = 0.88) functioning distractors. Only 52.2% (n = 805) of all distractors were functioning effectively and 10.2% (n = 158) had a choice frequency of 0. Items with more functioning distractors were more difficult and more discriminating. Conclusion: The low frequency of items with three functioning distractors in the four-option items in this study suggests that teachers have difficulty developing plausible distractors for most MCQs. Test items should consist of as many options as is feasible given the item content and the number of plausible distractors; in most cases this would be three. Item analysis results can be used to identify and remove non-functioning distractors from MCQs that have been used in previous tests.

  3. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis.

    Tarrant, Marie; Ware, James; Mohammed, Ahmed M

    2009-07-07

    Four- or five-option multiple choice questions (MCQs) are the standard in health-science disciplines, both on certification-level examinations and on in-house developed tests. Previous research has shown, however, that few MCQs have three or four functioning distractors. The purpose of this study was to investigate non-functioning distractors in teacher-developed tests in one nursing program in an English-language university in Hong Kong. Using item-analysis data, we assessed the proportion of non-functioning distractors on a sample of seven test papers administered to undergraduate nursing students. A total of 514 items were reviewed, including 2056 options (1542 distractors and 514 correct responses). Non-functioning options were defined as ones that were chosen by fewer than 5% of examinees and those with a positive option discrimination statistic. The proportion of items containing 0, 1, 2, and 3 functioning distractors was 12.3%, 34.8%, 39.1%, and 13.8% respectively. Overall, items contained an average of 1.54 (SD = 0.88) functioning distractors. Only 52.2% (n = 805) of all distractors were functioning effectively and 10.2% (n = 158) had a choice frequency of 0. Items with more functioning distractors were more difficult and more discriminating. The low frequency of items with three functioning distractors in the four-option items in this study suggests that teachers have difficulty developing plausible distractors for most MCQs. Test items should consist of as many options as is feasible given the item content and the number of plausible distractors; in most cases this would be three. Item analysis results can be used to identify and remove non-functioning distractors from MCQs that have been used in previous tests.
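The screening rule described above — a distractor is non-functioning if fewer than 5% of examinees chose it, or if choosing it correlates positively with total score — can be sketched as follows (our own illustration; function and variable names are ours):

```python
from statistics import mean

def option_discrimination(chose_option, total_scores):
    """Point-biserial correlation between choosing the option (0/1) and total score.
    A positive value means higher scorers tend to pick this option."""
    n = len(chose_option)
    mx, my = mean(chose_option), mean(total_scores)
    cov = sum((x - mx) * (y - my) for x, y in zip(chose_option, total_scores)) / n
    sx = (sum((x - mx) ** 2 for x in chose_option) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in total_scores) / n) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def functioning_distractors(responses, key, total_scores, options="ABCD"):
    """Distractors chosen by >= 5% of examinees with negative discrimination."""
    n = len(responses)
    out = set()
    for opt in options:
        if opt == key:
            continue
        chose = [1 if r == opt else 0 for r in responses]
        if sum(chose) / n >= 0.05 and option_discrimination(chose, total_scores) < 0:
            out.add(opt)
    return out
```

Run over the item-analysis data, this flags the options worth keeping; everything else is a candidate for removal or revision.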

  4. Criteria for eliminating items of a Test of Figural Analogies

    Diego Blum

    2013-12-01

    This paper describes the steps taken to eliminate two of the items in a Test of Figural Analogies (TFA). The main guidelines of psychometric analysis under Classical Test Theory (CTT) and Item Response Theory (IRT) are explained. The item elimination process was based on both the study of the CTT difficulty and discrimination indices and on a unidimensionality analysis. The a, b, and c parameters of the Three Parameter Logistic Model of IRT were also considered for this purpose, as was the assessment of each item's fit to this model. The unfavourable characteristics of a group of TFA items are detailed, and decisions leading to their possible elimination are discussed.
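The a, b, and c parameters mentioned above enter the Three Parameter Logistic Model through its item response function; a minimal sketch of the standard model (ours, not the paper's code):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL item response function: P(correct | theta) = c + (1 - c) / (1 + exp(-a(theta - b))).
    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b the probability sits halfway between the guessing floor c and 1; as ability decreases, it approaches c rather than 0, which is why the c parameter matters for multiple-choice items.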

  5. Detection of differential item functioning using Lagrange multiplier tests

    Glas, Cornelis A.W.

    1996-01-01

    In this paper it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or C. R. Rao's efficient score test. The test is presented in the framework of a number of item response theory (IRT) models such as the Rasch model, the one-parameter logistic model, the

  6. A person fit test for IRT models for polytomous items

    Glas, Cornelis A.W.; Dagohoy, A.V.

    2007-01-01

    A person fit test based on the Lagrange multiplier test is presented for three item response theory models for polytomous items: the generalized partial credit model, the sequential model, and the graded response model. The test can also be used in the framework of multidimensional ability

  7. Algorithms for computerized test construction using classical item parameters

    Adema, Jos J.; van der Linden, Willem J.

    1989-01-01

    Recently, linear programming models for test construction were developed. These models were based on the information function from item response theory. In this paper another approach is followed. Two 0-1 linear programming models for the construction of tests using classical item and test

  8. Validation and Structural Analysis of the Kinematics Concept Test

    Lichtenberger, A.; Wagner, C.; Hofer, S. I.; Stem, E.; Vaterlaus, A.

    2017-01-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students' conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part…

  9. Evaluating Multiple-Choice Exams in Large Introductory Physics Courses

    Scott, Michael; Stelzer, Tim; Gladding, Gary

    2006-01-01

    The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, graduate record examination, and the force concept inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study,…

  10. Benford's Law: textbook exercises and multiple-choice testbanks.

    Slepkov, Aaron D; Ironside, Kevin B; DiBattista, David

    2015-01-01

    Benford's Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford's Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford's Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford's Law, the testbank is nonetheless secure against such a Benford's attack for banal reasons.
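The expected frequencies quoted above follow from the Benford distribution P(d) = log10(1 + 1/d); a quick sketch for checking a dataset's leading digits (our own illustration, not the authors' code):

```python
import math

def benford_expected(d):
    """Expected relative frequency of leading digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):e}"[0])

def digit_frequencies(data):
    """Observed relative frequencies of leading digits 1-9 in a dataset of nonzero numbers."""
    counts = {d: 0 for d in range(1, 10)}
    for x in data:
        counts[leading_digit(x)] += 1
    return {d: c / len(data) for d, c in counts.items()}
```

Comparing `digit_frequencies` of a testbank's answers against `benford_expected` is exactly the kind of conformity check the study performs.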

  11. Benford's Law: textbook exercises and multiple-choice testbanks.

    Aaron D Slepkov

    Benford's Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford's Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford's Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford's Law, the testbank is nonetheless secure against such a Benford's attack for banal reasons.

  12. Procedures for Selecting Items for Computerized Adaptive Tests.

    Kingsbury, G. Gage; Zara, Anthony R.

    1989-01-01

    Several classical approaches and alternative approaches to item selection for computerized adaptive testing (CAT) are reviewed and compared. The study also describes procedures for constrained CAT that may be added to classical item selection approaches to allow them to be used for applied testing. (TJH)

  13. Detecting Test Tampering Using Item Response Theory

    Wollack, James A.; Cohen, Allan S.; Eckerly, Carol A.

    2015-01-01

    Test tampering, especially on tests for educational accountability, is an unfortunate reality, necessitating that the state (or its testing vendor) perform data forensic analyses, such as erasure analyses, to look for signs of possible malfeasance. Few statistical approaches exist for detecting fraudulent erasures, and those that do largely do not…

  14. Item selection and ability estimation adaptive testing

    Pashley, Peter J.; van der Linden, Wim J.; van der Linden, Willem J.; Glas, Cornelis A.W.; Glas, Cees A.W.

    2010-01-01

    The last century saw a tremendous progression in the refinement and use of standardized linear tests. The first administered College Board exam occurred in 1901 and the first Scholastic Assessment Test (SAT) was given in 1926. Since then, progressively more sophisticated standardized linear tests

  15. Quantitative penetration testing with item response theory

    Pieters, W.; Arnold, F.; Stoelinga, M.I.A.

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management,

  16. Quantitative Penetration Testing with Item Response Theory

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle Ida Antoinette

    2014-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  17. Quantitative penetration testing with item response theory

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  18. Group differences in the heritability of items and test scores

    Wicherts, J.M.; Johnson, W.

    2009-01-01

    It is important to understand potential sources of group differences in the heritability of intelligence test scores. On the basis of a basic item response model we argue that heritabilities which are based on dichotomous item scores normally do not generalize from one sample to the next. If groups

  19. Mathematical-programming approaches to test item pool design

    Veldkamp, Bernard P.; van der Linden, Willem J.; Ariel, A.

    2002-01-01

    This paper presents an approach to item pool design that has the potential to improve on the quality of current item pools in educational and psychological testing and hence to increase both measurement precision and validity. The approach consists of the application of mathematical programming

  20. Item Response Theory Modeling of the Philadelphia Naming Test

    Fergadiotis, Gerasimos; Kellough, Stacey; Hula, William D.

    2015-01-01

    Purpose: In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating…

  1. Performance on large-scale science tests: Item attributes that may impact achievement scores

    Gordon, Janet Victoria

    Significant differences in achievement among ethnic groups persist on the eighth-grade science Washington Assessment of Student Learning (WASL). The WASL measures academic performance in science using both scenario and stand-alone question types. Previous research suggests that presenting target items connected to an authentic context, like scenario question types, can increase science achievement scores, especially in underrepresented groups, and thus help to close the achievement gap. The purpose of this study was to identify significant differences in performance between gender and ethnic subgroups by question type on the 2005 eighth-grade science WASL. MANOVA and ANOVA were used to examine relationships between gender and ethnic subgroups as independent variables and achievement scores on scenario and stand-alone question types as dependent variables. MANOVA revealed no significant effects for gender, suggesting that the 2005 eighth-grade science WASL was gender neutral. However, there were significant effects for ethnicity. ANOVA revealed significant effects for ethnicity and for the ethnicity by gender interaction in both question types. Effect sizes were negligible for the ethnicity by gender interaction. Large effect sizes between ethnicities on scenario question types became moderate to small effect sizes on stand-alone question types. This indicates that the score advantage the higher performing subgroups had over the lower performing subgroups was not as large on stand-alone question types as on scenario question types. A further comparison examined performance on multiple-choice items only within both question types. Similar achievement patterns between ethnicities emerged; however, achievement patterns between genders changed in boys' favor. Scenario question types appeared to register differences between ethnic groups to a greater degree than stand-alone question types. These differences may be attributable to individual differences in cognition

  2. Multiple choice exams of medical knowledge with open books and web access? A validity study

    O'Neill, Lotte; Simonsen, Eivind Ortind; Knudsen, Ulla Breth

    2015-01-01

    Background: Open book tests have been suggested to lower test anxiety and promote deeper learning strategies. In the Aarhus University medical program, one-quarter of the curriculum assesses students' medical knowledge with 'open book, open web' (OBOW) multiple choice examinations. We found little existing...

  3. A Two-Tier Multiple Choice Questions to Diagnose Thermodynamic Misconception of Thai and Laos Students

    Kamcharean, Chanwit; Wattanakasiwich, Pornrat

    The objective of this study was to diagnose misconceptions of Thai and Lao students in thermodynamics using a two-tier multiple-choice test. Two-tier multiple-choice questions consist of a first tier, a content-based question, and a second tier, a reasoning-based question. Data on student understanding were collected using 10 two-tier multiple-choice questions. The Thai participants were first-year students (N = 57) taking a fundamental physics course at Chiang Mai University in 2012. The Lao participants were high school students in Grade 11 (N = 57) and Grade 12 (N = 83) at Muengnern high school in Xayaboury province, Lao PDR. Most students answered the content-tier questions correctly but chose incorrect answers for the reason-tier questions. When further investigating their incorrect reasons, we found misconceptions similar to those reported in previous studies, such as incorrectly relating pressure to temperature when presented with multiple variables.

  4. Industrial Arts Test Development, Book III. Resource Items for Graphics Technology, Power Technology, Production Technology.

    New York State Education Dept., Albany.

    This booklet is designed to assist teachers in developing examinations for classroom use. It is a collection of 955 objective test questions, mostly multiple choice, for industrial arts students in the three areas of graphics technology, power technology, and production technology. Scoring keys are provided. There are no copyright restrictions,…

  5. Item response times in computerized adaptive testing

    Lutz F. Hornke

    2000-01-01

    Computerized adaptive tests (CATs) provide scores and, at the same time, item response times. Research into the additional meaning that can be extracted from response-time information is of particular interest. Data were available from 5912 young people who took a computerized adaptive test. Earlier studies report longer response times when answers are incorrect; this result was replicated in the present, larger study. Nevertheless, mean item response times for wrong and correct answers do not support an interpretation different from that obtained with trait levels, nor do they correlate differently with a number of ability tests. Whether response times should be interpreted on the same dimension the CAT measures or on other dimensions is discussed. Since the early 1930s, response times have been regarded as indicators of personality traits to be distinguished from the traits measured by test scores. This idea is discussed, and arguments for and against it are offered. More recent model-based approaches are also presented. Whether additional diagnostic information can be obtained from a CAT with detailed, scheduled data collection remains an open question.

  6. Item response theory analysis of the mechanics baseline test

    Cardamone, Caroline N.; Abbott, Jonathan E.; Rayyan, Saif; Seaton, Daniel T.; Pawl, Andrew; Pritchard, David E.

    2012-02-01

    Item response theory is useful both in the development and evaluation of assessments and in computing standardized measures of student performance. In item response theory, individual parameters (difficulty, discrimination) for each item or question are fit by item response models. These parameters provide a means for evaluating a test and offer a better measure of student skill than a raw test score, because each skill calculation considers not only the number of questions answered correctly, but also the individual properties of all questions answered. Here, we present the results from an analysis of the Mechanics Baseline Test given at MIT during 2005-2010. Using the item parameters, we identify questions on the Mechanics Baseline Test that are not effective in discriminating between MIT students of different abilities. We show that a limited subset of the highest quality questions on the Mechanics Baseline Test returns accurate measures of student skill. We compare student skills as determined by item response theory to the more traditional measurement of the raw score and show that a comparable measure of learning gain can be computed.
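The item parameters described above can be illustrated with the two-parameter logistic model commonly used in such analyses (a sketch of the standard model, not the authors' code): the probability of a correct response and the Fisher information an item contributes at each ability level.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: P(correct) for ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p).
    Items with low discrimination contribute little information,
    which is how weakly discriminating questions are identified."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)
```

Information peaks at theta = b and grows with a squared, so dropping low-a items costs little measurement precision — consistent with the finding that a subset of high-quality questions suffices.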

  7. Detection of differential item functioning using Lagrange multiplier tests

    Glas, Cornelis A.W.

    1998-01-01

    In the present paper it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or Rao’s efficient score test. The test is presented in the framework of a number of IRT models such as the Rasch model, the OPLM, the 2-parameter logistic model, the

  8. [A factor analysis method for contingency table data with unlimited multiple choice questions].

    Toyoda, Hideki; Haiden, Reina; Kubo, Saori; Ikehara, Kazuya; Isobe, Yurie

    2016-02-01

    The purpose of this study is to propose a method of factor analysis for analyzing contingency tables developed from the data of unlimited multiple-choice questions. This method assumes that the element of each cell of the contingency table has a binomial distribution, and a factor analysis model is applied to the logit of the selection probability. A scree plot and WAIC are used to decide the number of factors, and the standardized residual, the standardized difference between the sample proportion and the predicted proportion, is used to select items. The proposed method was applied to real product impression research data on advertised chips and energy drinks. The results of the analysis showed that this method can be used in conjunction with the conventional factor analysis model and that the extracted factors were fully interpretable, suggesting the usefulness of the proposed method in studies of psychology using unlimited multiple-choice questions.
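
The logit transform at the core of the proposed model can be sketched as follows; the contingency-table counts are made up for illustration:

```python
import math

def logit(p):
    """Log-odds transform applied to each cell's selection probability."""
    return math.log(p / (1.0 - p))

# Hypothetical contingency table: rows = products, columns = impression
# words; each cell counts respondents (out of n) who ticked that word
# for that product in an unlimited multiple-choice question.
n = 200
counts = [[150, 30, 90],
          [40, 120, 60]]

# Selection probabilities and their logits -- the quantities a factor
# analysis model would be fitted to in the approach described above.
logits = [[logit(c / n) for c in row] for row in counts]
```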

  9. Dual processing theory and experts' reasoning: exploring thinking on national multiple-choice questions.

    Durning, Steven J; Dong, Ting; Artino, Anthony R; van der Vleuten, Cees; Holmboe, Eric; Schuwirth, Lambert

    2015-08-01

    An ongoing debate exists in the medical education literature regarding the potential benefits of pattern recognition (non-analytic reasoning), actively comparing and contrasting diagnostic options (analytic reasoning), or a combination of the two. Studies have not, however, explicitly explored faculty's thought processes while tackling clinical problems through the lens of dual process theory to inform this debate. Further, these thought processes have not been studied in relation to the difficulty of the task or other potential mediating influences such as personal factors and fatigue, which could be influenced by factors such as sleep deprivation. We therefore sought to determine which reasoning process(es) faculty used when answering clinically oriented multiple-choice questions (MCQs) and whether these processes differed on the dual process theory characteristics of accuracy, reading time, and answering time, as well as psychometrically determined item difficulty and sleep deprivation. We used a think-aloud procedure to explore faculty's thought processes while they took these MCQs, coding the think-aloud data by reasoning process (analytic, non-analytic, guessing, or a combination of processes) as well as word count, number of stated concepts, reading time, answering time, and accuracy. We also included questions regarding the amount of work in the recent past. We then conducted statistical analyses to examine the associations between these measures, such as correlations between the frequencies of reasoning processes and item accuracy and difficulty. We also observed the total frequencies of the different reasoning processes for correctly and incorrectly answered questions. Regardless of whether the questions were classified as 'hard' or 'easy', non-analytical reasoning led to the correct answer more often than to an incorrect answer. Significant correlations were found between self-reported recent hours worked and think-aloud word count

  10. THE MULTIPLE CHOICE PROBLEM WITH INTERACTIONS BETWEEN CRITERIA

    Luiz Flavio Autran Monteiro Gomes

    2015-12-01

    Full Text Available ABSTRACT An important problem in Multi-Criteria Decision Analysis arises when one must select at least two alternatives at the same time. This can be denoted as a multiple choice problem. In other words, instead of evaluating each of the alternatives separately, they must be combined into groups of n alternatives, where n ≥ 2. When the multiple choice problem must be solved under multiple criteria, the result is a multi-criteria, multiple choice problem. In this paper, it is shown through examples how this problem can be tackled on a bipolar scale. The Choquet integral is used in this paper to take care of interactions between criteria. A numerical application example is conducted using data from SEBRAE-RJ, a non-profit private organization that has the mission of promoting competitiveness, sustainable development and entrepreneurship in the state of Rio de Janeiro, Brazil. The paper closes with suggestions for future research.

  11. A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating

    Michalis P Michaelides

    2010-10-01

    Full Text Available Many studies have investigated the topic of change or drift in item parameter estimates in the context of Item Response Theory. Content effects, such as instructional variation and curricular emphasis, as well as context effects, such as the wording, position, or exposure of an item have been found to impact item parameter estimates. The issue becomes more critical when items with estimates exhibiting differential behavior across test administrations are used as common for deriving equating transformations. This paper reviews the types of effects on IRT item parameter estimates and focuses on the impact of misbehaving or aberrant common items on equating transformations. Implications relating to test validity and the judgmental nature of the decision to keep or discard aberrant common items are discussed, with recommendations for future research into more informed and formal ways of dealing with misbehaving common items.
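
One standard way to derive such an equating transformation from common-item parameter estimates is mean-sigma linking; the sketch below uses invented difficulty estimates, not data from this paper:

```python
# Mean-sigma linking: derive a linear transformation from common-item
# difficulty (b) estimates obtained on two administrations (forms X, Y).
from statistics import mean, pstdev

b_form_x = [-1.2, -0.4, 0.1, 0.8, 1.5]   # common-item difficulties, old form
b_form_y = [-0.9, -0.1, 0.5, 1.3, 2.1]   # same items re-estimated on new form

A = pstdev(b_form_y) / pstdev(b_form_x)   # slope of the linear link
B = mean(b_form_y) - A * mean(b_form_x)   # intercept

def to_form_y_scale(b_x):
    """Place a form-X difficulty estimate on the form-Y scale."""
    return A * b_x + B

# Large residuals after linking are one way to flag the misbehaving or
# aberrant common items this review is concerned with.
residuals = [by - to_form_y_scale(bx) for bx, by in zip(b_form_x, b_form_y)]
```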

  12. A Review of the Effects on IRT Item Parameter Estimates with a Focus on Misbehaving Common Items in Test Equating.

    Michaelides, Michalis P

    2010-01-01

    Many studies have investigated the topic of change or drift in item parameter estimates in the context of item response theory (IRT). Content effects, such as instructional variation and curricular emphasis, as well as context effects, such as the wording, position, or exposure of an item have been found to impact item parameter estimates. The issue becomes more critical when items with estimates exhibiting differential behavior across test administrations are used as common for deriving equating transformations. This paper reviews the types of effects on IRT item parameter estimates and focuses on the impact of misbehaving or aberrant common items on equating transformations. Implications relating to test validity and the judgmental nature of the decision to keep or discard aberrant common items are discussed, with recommendations for future research into more informed and formal ways of dealing with misbehaving common items.

  13. Relative Merits of Four Methods for Scoring Cloze Tests.

    Brown, James Dean

    1980-01-01

    Describes study comparing merits of exact answer, acceptable answer, clozentropy and multiple choice methods for scoring tests. Results show differences among reliability, mean item facility, discrimination and usability, but not validity. (BK)

  14. Assessing Differential Item Functioning on the Test of Relational Reasoning

    Denis Dumas

    2018-03-01

    Full Text Available The test of relational reasoning (TORR is designed to assess the ability to identify complex patterns within visuospatial stimuli. The TORR is designed for use in school and university settings, and therefore, its measurement invariance across diverse groups is critical. In this investigation, a large sample, representative of a major university on key demographic variables, was collected, and the resulting data were analyzed using a multi-group, multidimensional item-response theory model-comparison procedure. No significant differential item functioning was found on any of the TORR items across any of the demographic groups of interest. This finding is interpreted as evidence of the cultural fairness of the TORR, and potential test-development choices that may have contributed to that cultural fairness are discussed.

  15. Feasibility of a multiple-choice mini mental state examination for chronically critically ill patients.

    Miguélez, Marta; Merlani, Paolo; Gigon, Fabienne; Verdon, Mélanie; Annoni, Jean-Marie; Ricou, Bara

    2014-08-01

    Following treatment in an ICU, up to 70% of chronically critically ill patients present neurocognitive impairment that can have negative effects on their quality of life, daily activities, and return to work. The Mini Mental State Examination is a simple, widely used tool for neurocognitive assessment. Although of interest when evaluating ICU patients, the current version is restricted to patients who are able to speak. This study aimed to evaluate the feasibility of a visual, multiple-choice Mini Mental State Examination for ICU patients who are unable to speak. The multiple-choice Mini Mental State Examination and the standard Mini Mental State Examination were compared across three different speaking populations. The interrater and intrarater reliabilities of the multiple-choice Mini Mental State Examination were tested on both intubated and tracheostomized ICU patients. Mixed 36-bed ICU and neuropsychology department in a university hospital. Twenty-six healthy volunteers, 20 neurological patients, 46 ICU patients able to speak, and 30 intubated or tracheostomized ICU patients. None. Multiple-choice Mini Mental State Examination results correlated satisfactorily with standard Mini Mental State Examination results in all three speaking groups: healthy volunteers: intraclass correlation coefficient = 0.43 (95% CI, -0.18 to 0.62); neurology patients: 0.90 (95% CI, 0.82-0.95); and ICU patients able to speak: 0.86 (95% CI, 0.70-0.92). The interrater and intrarater reliabilities were good (0.95 [0.87-0.98] and 0.94 [0.31-0.99], respectively). In all populations, a Bland-Altman analysis showed systematically higher scores using the multiple-choice Mini Mental State Examination. Administration of the multiple-choice Mini Mental State Examination to ICU patients was straightforward and produced exploitable results comparable to those of the standard Mini Mental State Examination. It should be of interest for the assessment and monitoring of the neurocognitive

  16. Comparison between Two Assessment Methods; Modified Essay Questions and Multiple Choice Questions

    Assadi S.N.* MD

    2015-09-01

    Full Text Available Aims Using the best assessment methods is an important factor in the educational development of health students. Modified essay questions and multiple choice questions are two prevalent methods of assessing students. The aim of this study was to compare the modified essay questions and multiple choice questions methods in occupational health engineering and work laws courses. Materials & Methods This semi-experimental study was performed during 2013 to 2014 on occupational health students of Mashhad University of Medical Sciences. The class of the occupational health and work laws course in 2013 was considered as group A and the class of 2014 as group B. Each group had 50 students. The group A students were assessed by the modified essay questions method and the group B students by the multiple choice questions method. Data were analyzed in SPSS 16 software by paired T test and odds ratio. Findings The mean grade of the occupational health and work laws course was 18.68±0.91 in group A (modified essay questions) and 18.78±0.86 in group B (multiple choice questions), which was not significantly different (t=-0.41; p=0.684). The mean grades of the chemical chapter (p<0.001) in occupational health engineering and of the harmful work law (p<0.001) and other (p=0.015) chapters in work laws were significantly different between the two groups. Conclusion The modified essay questions and multiple choice questions methods have nearly the same value for assessing students in the occupational health engineering and work laws courses.

  17. IRT-Estimated Reliability for Tests Containing Mixed Item Formats

    Shu, Lianghua; Schwarz, Richard D.

    2014-01-01

    As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted…

  18. Correcting Grade Deflation Caused by Multiple-Choice Scoring.

    Baranchik, Alvin; Cherkas, Barry

    2000-01-01

    Presents a study involving three sections of pre-calculus (n=181) at four-year college where partial credit scoring on multiple-choice questions was examined over an entire semester. Indicates that grades determined by partial credit scoring seemed more reflective of both the quantity and quality of student knowledge than grades determined by…

  19. Using the Multiple Choice Procedure to Measure College Student Gambling

    Butler, Leon Harvey

    2010-01-01

    Research suggests that gambling is similar to addictive behaviors such as substance use. In the current study, gambling was investigated from a behavioral economics perspective. The Multiple Choice Procedure (MCP) with gambling as the target behavior was used to assess for relative reinforcing value, the effect of alternative reinforcers, and…

  20. Multiple choice questions as a tool for assessment in medical ...

    Methods For this review, a PubMed online search was carried out for English language ... Advantages and disadvantages of MCQs in medical education are ... multiple-choice question meets many of the educational requirements for an assessment method. The use of automation for grading and low costs make it a viable ...

  1. Multiple choice questions in electronics and electrical engineering

    DAVIES, T J

    2013-01-01

    A unique compendium of over 2000 multiple choice questions for students of electronics and electrical engineering. This book is designed for the following City and Guilds courses: 2010, 2240, 2320, 2360. It can also be used as a resource for practice questions for any vocational course.

  2. Bayes Factor Covariance Testing in Item Response Models.

    Fox, Jean-Paul; Mulder, Joris; Sinharay, Sandip

    2017-12-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning the underlying covariance structure are evaluated using (fractional) Bayes factor tests. The support for a unidimensional factor (i.e., assumption of local independence) and differential item functioning are evaluated by testing the covariance components. The posterior distribution of common covariance components is obtained in closed form by transforming latent responses with an orthogonal (Helmert) matrix. This posterior distribution is defined as a shifted-inverse-gamma, thereby introducing a default prior and a balanced prior distribution. Based on that, an MCMC algorithm is described to estimate all model parameters and to compute (fractional) Bayes factor tests. Simulation studies are used to show that the (fractional) Bayes factor tests have good properties for testing the underlying covariance structure of binary response data. The method is illustrated with two real data studies.
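
The Helmert transformation mentioned in the abstract is easy to illustrate: an orthonormal Helmert matrix diagonalizes a compound symmetry covariance matrix, separating one "common" component from n-1 equal residual components. A sketch with arbitrary variance values:

```python
import numpy as np

def helmert(n):
    """Orthonormal Helmert matrix: the first row is the constant contrast;
    the remaining rows are mutually orthogonal contrasts."""
    H = np.zeros((n, n))
    H[0] = 1.0 / np.sqrt(n)
    for i in range(1, n):
        H[i, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i, i] = -i / np.sqrt(i * (i + 1))
    return H

# Compound symmetry: common variance on the diagonal, common covariance off it.
n, var, cov = 4, 1.0, 0.3
cs = cov * np.ones((n, n)) + (var - cov) * np.eye(n)

H = helmert(n)
diag = H @ cs @ H.T   # diagonal: var + (n-1)*cov once, then var - cov repeated
```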

  3. Reducing the number of options on multiple-choice questions: response time, psychometrics and standard setting.

    Schneid, Stephen D; Armour, Chris; Park, Yoon Soo; Yudkowsky, Rachel; Bordage, Georges

    2014-10-01

    Despite significant evidence supporting the use of three-option multiple-choice questions (MCQs), these are rarely used in written examinations for health professions students. The purpose of this study was to examine the effects of reducing four- and five-option MCQs to three-option MCQs on response times, psychometric characteristics, and absolute standard setting judgements in a pharmacology examination administered to health professions students. We administered two versions of a computerised examination containing 98 MCQs to 38 Year 2 medical students and 39 Year 3 pharmacy students. Four- and five-option MCQs were converted into three-option MCQs to create two versions of the examination. Differences in response time, item difficulty and discrimination, and reliability were evaluated. Medical and pharmacy faculty judges provided three-level Angoff (TLA) ratings for all MCQs for both versions of the examination to allow the assessment of differences in cut scores. Students answered three-option MCQs an average of 5 seconds faster than they answered four- and five-option MCQs (36 seconds versus 41 seconds; p = 0.008). There were no significant differences in item difficulty and discrimination, or test reliability. Overall, the cut scores generated for three-option MCQs using the TLA ratings were 8 percentage points higher (p = 0.04). The use of three-option MCQs in a health professions examination resulted in a time saving equivalent to the completion of 16% more MCQs per 1-hour testing period, which may increase content validity and test score reliability, and minimise construct under-representation. The higher cut scores may result in higher failure rates if an absolute standard setting method, such as the TLA method, is used. The results from this study provide a cautious indication to health professions educators that using three-option MCQs does not threaten validity and may strengthen it by allowing additional MCQs to be tested in a fixed amount

  4. Applying modern psychometric techniques to melodic discrimination testing: Item response theory, computerised adaptive testing, and automatic item generation.

    Harrison, Peter M C; Collins, Tom; Müllensiefen, Daniel

    2017-06-15

    Modern psychometric theory provides many useful tools for ability testing, such as item response theory, computerised adaptive testing, and automatic item generation. However, these techniques have yet to be integrated into mainstream psychological practice. This is unfortunate, because modern psychometric techniques can bring many benefits, including sophisticated reliability measures, improved construct validity, avoidance of exposure effects, and improved efficiency. In the present research we therefore use these techniques to develop a new test of a well-studied psychological capacity: melodic discrimination, the ability to detect differences between melodies. We calibrate and validate this test in a series of studies. Studies 1 and 2 respectively calibrate and validate an initial test version, while Studies 3 and 4 calibrate and validate an updated test version incorporating additional easy items. The results support the new test's viability, with evidence for strong reliability and construct validity. We discuss how these modern psychometric techniques may also be profitably applied to other areas of music psychology and psychological science in general.

  5. Practical Usage of Multiple-Choice Questions as Part of Learning and Self-Evaluation

    Paula Kangasniemi

    2016-12-01

    Full Text Available The poster describes how multiple-choice questions can be a part of learning, not only of assessment. We often think of the role of questions only as testing the student's skills. We have tested how questions can be part of learning in our web-based course on information retrieval at Lapland University. In web-based learning there is a need for high-quality mediators. Mediators are learning promoters which trigger, support, and amplify learning. Mediators can be human mediators or tool mediators. Tool mediators are, for example, tests, tutorials, guides, and diaries. Multiple-choice questions can also be learning promoters which select, interpret, and amplify objects for learning. What do you have to take into account when preparing multiple-choice questions as mediators? First you have to prioritize teaching objectives: what must be known and what should be known. Based on our experience with contact learning, you can assess which things students have problems with and need more guidance on. The most important addition to the questions is feedback during practice. Whether an answer is wrong or right is not the point; what matters is that the feedback on the answers guides students on how to search. The questions promote students' self-regulation and self-evaluation. Feedback can be verbal, a screenshot, or a video. We have added verbal feedback for every question, as well as some screenshots and eight videos, in our web-based course.

  6. Graded Multiple Choice Questions: Rewarding Understanding and Preventing Plagiarism

    Denyer, G. S.; Hancock, D.

    2002-08-01

    This paper describes an easily implemented method that allows the generation and analysis of graded multiple-choice examinations. The technique, which uses standard functions in user-end software (Microsoft Excel 5+), can also produce several different versions of an examination that can be employed to prevent the reward of plagiarism. The manuscript also discusses the advantages of having a graded marking system for the elimination of ambiguities, use in multi-step calculation questions, and questions that require extrapolation or reasoning. The advantages of the scrambling strategy, which maintains the same question order, are discussed with reference to student equity. The system provides a non-confrontational mechanism for dealing with cheating in large-class multiple-choice examinations, as well as providing a reward for problem solving over surface learning.

  7. The Technical Quality of Test Items Generated Using a Systematic Approach to Item Writing.

    Siskind, Theresa G.; Anderson, Lorin W.

    The study was designed to examine the similarity of response options generated by different item writers using a systematic approach to item writing. The similarity of response options to student responses for the same item stems presented in an open-ended format was also examined. A non-systematic (subject matter expertise) approach and a…

  8. Using Tests as Learning Opportunities.

    Foos, Paul W.; Fisher, Ronald P.

    1988-01-01

    A study involving 105 undergraduates assessed the value of testing as a means of increasing, rather than simply monitoring, learning. Results indicate that fill-in-the-blank and items requiring student inferences were more effective, respectively, than multiple-choice tests and verbatim items in furthering student learning. (TJH)

  9. MCQ testing in higher education: Yes, there are bad items and invalid scores—A case study identifying solutions

    Brown, Gavin

    2017-01-01

    This is a lecture given at Umea University, Sweden in September 2017. It is based on the published study: Brown, G. T. L., & Abdulnabi, H. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education: Assessment, Testing, & Applied Measurement, 2(24). doi:10.3389/feduc.2017.00024

  10. Development of a lack of appetite item bank for computer-adaptive testing (CAT)

    Thamsborg, Lise Laurberg Holst; Petersen, Morten Aa; Aaronson, Neil K

    2015-01-01

    to 12 lack of appetite items. CONCLUSIONS: Phases 1-3 resulted in 12 lack of appetite candidate items. Based on field testing (phase 4), the psychometric characteristics of the items will be assessed and the final item bank will be generated. This CAT item bank is expected to provide precise

  11. Delayed, but not immediate, feedback after multiple-choice questions increases performance on a subsequent short-answer, but not multiple-choice, exam: evidence for the dual-process theory of memory.

    Sinha, Neha; Glass, Arnold Lewis

    2015-01-01

    Three experiments, two performed in the laboratory and one embedded in a college psychology lecture course, investigated the effects of immediate versus delayed feedback following a multiple-choice exam on subsequent short answer and multiple-choice exams. Performance on the subsequent multiple-choice exam was not affected by the timing of the feedback on the prior exam; however, performance on the subsequent short answer exam was better following delayed than following immediate feedback. This was true regardless of the order in which immediate versus delayed feedback was given. Furthermore, delayed feedback only had a greater effect than immediate feedback on subsequent short answer performance following correct, confident responses on the prior exam. These results indicate that delayed feedback cues a student's prior response and increases subsequent recollection of that response. The practical implication is that delayed feedback is better than immediate feedback during academic testing.

  12. An emotional functioning item bank of 24 items for computerized adaptive testing (CAT) was established

    Petersen, Morten Aa.; Gamper, Eva-Maria; Costantini, Anna

    2016-01-01

    of the widely used EORTC Quality of Life questionnaire (QLQ-C30). STUDY DESIGN AND SETTING: On the basis of literature search and evaluations by international samples of experts and cancer patients, 38 candidate items were developed. The psychometric properties of the items were evaluated in a large international sample of cancer patients. This included evaluations of dimensionality, item response theory (IRT) model fit, differential item functioning (DIF), and of measurement precision/statistical power. RESULTS: Responses were obtained from 1,023 cancer patients from four countries. The evaluations showed that 24 items could be included in a unidimensional IRT model. DIF did not seem to have any significant impact on the estimation of EF. Evaluations indicated that the CAT measure may reduce sample size requirements by up to 50% compared to the QLQ-C30 EF scale without reducing power. CONCLUSION...

  13. Effects of Differential Item Functioning on Examinees' Test Performance and Reliability of Test

    Lee, Yi-Hsuan; Zhang, Jinming

    2017-01-01

    Simulations were conducted to examine the effect of differential item functioning (DIF) on measurement consequences such as total scores, item response theory (IRT) ability estimates, and test reliability in terms of the ratio of true-score variance to observed-score variance and the standard error of estimation for the IRT ability parameter. The…
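
The reliability definition used here, the ratio of true-score variance to observed-score variance, can be checked with a quick simulation; the variance choices below are arbitrary and unrelated to the paper's DIF conditions:

```python
import random

random.seed(7)

# Reliability as var(true) / var(observed), estimated by simulation.
n_examinees = 20000
true_sd, error_sd = 1.0, 0.5

true_scores = [random.gauss(0.0, true_sd) for _ in range(n_examinees)]
observed = [t + random.gauss(0.0, error_sd) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Expected value: 1.0 / (1.0 + 0.5**2) = 0.8
reliability = variance(true_scores) / variance(observed)
```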

  14. Stochastic order in dichotomous item response models for fixed tests, adaptive tests, or multiple abilities

    van der Linden, Willem J.

    1995-01-01

    Dichotomous item response theory (IRT) models can be viewed as families of stochastically ordered distributions of responses to test items. This paper explores several properties of such distributions. The focus is on the conditions under which stochastic order in families of conditional

  15. An Effect Size Measure for Raju's Differential Functioning for Items and Tests

    Wright, Keith D.; Oshima, T. C.

    2015-01-01

    This study established an effect size measure for differential functioning for items and tests' noncompensatory differential item functioning (NCDIF). The Mantel-Haenszel parameter served as the benchmark for developing NCDIF's effect size measure for reporting moderate and large differential item functioning in test items. The effect size of…

  16. Effects of Repeated Testing on Short- and Long-Term Memory Performance across Different Test Formats

    Stenlund, Tova; Sundström, Anna; Jonsson, Bert

    2016-01-01

    This study examined whether practice testing with short-answer (SA) items benefits learning over time compared to practice testing with multiple-choice (MC) items, and rereading the material. More specifically, the aim was to test the hypotheses of "retrieval effort" and "transfer appropriate processing" by comparing retention…

  17. Distance teaching using self-marking multiple choice questions.

    Poore, P

    1987-01-01

    In Papua New Guinea health extension officers receive a 3-year course of training in college, followed by a period of in-service training in hospital. They are then posted to a health center, where they are in charge of all health services within their district. While the health extension officers received an excellent basic training, and were provided with books and appropriate, locally produced texts, they often spent months or even years after graduation in remote rural health centers with little communication from colleagues. This paper describes an attempt to improve communication, and to provide support inexpensively by post. Multiple choice questions, with a system for self-marking, were sent by post to rural health workers. Multiple choice questions are used in the education system in Papua New Guinea, and all health extension officers are familiar with the technique. The most obvious and immediate result was the great enthusiasm shown by the majority of health staff involved. In this way a useful exchange of correspondence was established. With this exchange of information and recognition of each other's problems, the quality of patient care must improve.

  18. A simple and fast item selection procedure for adaptive testing

    Veerkamp, W.J.J.; Veerkamp, Wim J.J.; Berger, Martijn; Berger, Martijn P.F.

    1994-01-01

    Items with the highest discrimination parameter values in a logistic item response theory (IRT) model do not necessarily give maximum information. This paper shows which discrimination parameter values (as a function of the guessing parameter and the distance between person ability and item
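
The point of this record, that the highest-discrimination item need not yield maximum information at a given ability, can be sketched with 2PL item information and a hypothetical three-item bank:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical item bank of (a, b) pairs and a provisional ability estimate.
bank = [(0.8, 0.1), (1.5, 1.4), (1.2, -0.2)]
theta_hat = 0.0

# Maximum-information selection: the highest-a item (a = 1.5) loses here
# because its difficulty (b = 1.4) lies far from theta_hat.
best = max(bank, key=lambda ab: info_2pl(theta_hat, *ab))
```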

  19. Development of a Mechanical Engineering Test Item Bank to promote learning outcomes-based education in Japanese and Indonesian higher education institutions

    Jeffrey S. Cross

    2017-11-01

    Full Text Available Following on the 2008-2012 OECD Assessment of Higher Education Learning Outcomes (AHELO) feasibility study of civil engineering, a mechanical engineering learning outcomes assessment working group was established in Japan within the National Institute of Education Research (NIER), which became the Tuning National Center for Japan. The purpose of the project is to develop, among engineering faculty members, common understandings of engineering learning outcomes through the collaborative process of test item development, scoring, and sharing of results. By substantiating abstract-level learning outcomes into concrete-level learning outcomes that are attainable and assessable, and through measuring and comparing students' achievement of learning outcomes, it is anticipated that faculty members will be able to draw practical implications for educational improvement at the program and course levels. The development of a mechanical engineering test item bank began with test item development workshops, which led to a series of trial tests and then to a large-scale test implementation in 2016 involving 348 first-semester master's students in 9 institutions in Japan, using both multiple choice questions designed to measure mastery of basic and engineering sciences and a constructed-response task designed to measure "how well students can think like an engineer." The same set of test items was translated from Japanese into English and Indonesian and used to measure achievement of learning outcomes at Indonesia's Institut Teknologi Bandung (ITB) on 37 rising fourth-year undergraduate students. This paper highlights how learning outcomes assessment can effectively facilitate learning outcomes-based education, by documenting the experience of Japanese and Indonesian mechanical engineering faculty members engaged in the NIER Test Item Bank project. First published online: 30 November 2017

  20. An Alternative to the 3PL: Using Asymmetric Item Characteristic Curves to Address Guessing Effects

    Lee, Sora; Bolt, Daniel M.

    2018-01-01

    Both the statistical and interpretational shortcomings of the three-parameter logistic (3PL) model in accommodating guessing effects on multiple-choice items are well documented. We consider the use of a residual heteroscedasticity (RH) model as an alternative, and compare its performance to the 3PL with real test data sets and through simulation…
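    The 3PL model this record critiques has a well-known item response function in which the pseudo-guessing parameter acts as a lower asymptote; a minimal sketch with illustrative parameter values:

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) probability of a correct response.

    a: discrimination, b: difficulty, c: pseudo-guessing lower asymptote.
    Parameter values used below are illustrative, not from the study.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For a very low-ability examinee the success probability approaches c,
# which is the guessing interpretation the abstract refers to.
print(round(p_3pl(-10.0, 1.2, 0.0, 0.25), 3))  # close to 0.25
print(round(p_3pl(0.0, 1.2, 0.0, 0.25), 3))    # at theta = b: c + (1-c)/2 = 0.625
```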

  1. Detection of person misfit in computerized adaptive tests with polytomous items

    van Krimpen-Stoop, Edith; Meijer, R.R.

    2000-01-01

    Item scores that do not fit an assumed item response theory model may cause the latent trait value to be estimated inaccurately. For computerized adaptive tests (CAT) with dichotomous items, several person-fit statistics for detecting nonfitting item score patterns have been proposed. Both for

  2. Uncertainties in the Item Parameter Estimates and Robust Automated Test Assembly

    Veldkamp, Bernard P.; Matteucci, Mariagiulia; de Jong, Martijn G.

    2013-01-01

    Item response theory parameters have to be estimated, and because of the estimation process, they do have uncertainty in them. In most large-scale testing programs, the parameters are stored in item banks, and automated test assembly algorithms are applied to assemble operational test forms. These algorithms treat item parameters as fixed values,…

  3. Bayes factor covariance testing in item response models

    Fox, J.P.; Mulder, J.; Sinharay, Sandip

    2017-01-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning

  4. Bayes Factor Covariance Testing in Item Response Models

    Fox, Jean-Paul; Mulder, Joris; Sinharay, Sandip

    2017-01-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning

  5. Project Physics Tests 1, Concepts of Motion.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 1 are presented in this booklet, consisting of 70 multiple-choice and 20 problem-and-essay questions. Concepts of motion are examined with respect to velocities, acceleration, forces, vectors, Newton's laws, and circular motion. Suggestions are made regarding the time to allot for answering some items. Besides…

  6. Creating a Database for Test Items in National Examinations (pp ...

    Nekky Umera

    different programmers create files and application programs over a long period. .... In theory or essay questions, alternative methods of solving problems are explored and ... Unworthy items are those that do not focus on the central concept or.

  7. Item response theory analysis of the life orientation test-revised: age and gender differential item functioning analyses.

    Steca, Patrizia; Monzani, Dario; Greco, Andrea; Chiesi, Francesca; Primi, Caterina

    2015-06-01

    This study is aimed at testing the measurement properties of the Life Orientation Test-Revised (LOT-R) for the assessment of dispositional optimism by employing item response theory (IRT) analyses. The LOT-R was administered to a large sample of 2,862 Italian adults. First, confirmatory factor analyses demonstrated the theoretical conceptualization of the construct measured by the LOT-R as a single bipolar dimension. Subsequently, IRT analyses for polytomous, ordered response category data were applied to investigate the items' properties. The equivalence of the items across gender and age was assessed by analyzing differential item functioning. Discrimination and severity parameters indicated that all items were able to distinguish people with different levels of optimism and adequately covered the spectrum of the latent trait. Additionally, the LOT-R appears to be gender invariant and, with minor exceptions, age invariant. Results provided evidence that the LOT-R is a reliable and valid measure of dispositional optimism. © The Author(s) 2014.

  8. The Effect of English Language on Multiple Choice Question Scores of Thai Medical Students.

    Phisalprapa, Pochamana; Muangkaew, Wayuda; Assanasen, Jintana; Kunavisarut, Tada; Thongngarm, Torpong; Ruchutrakool, Theera; Kobwanthanakun, Surapon; Dejsomritrutai, Wanchai

    2016-04-01

    Universities in Thailand are preparing for Thailand's integration into the ASEAN Economic Community (AEC) by increasing the number of tests given in English. English is not the native language of Thailand; differences in English proficiency may affect scores among test-takers even when their subject knowledge is comparable, and may thus falsely represent the knowledge level of the test-taker. Objective: to study the impact of English-language multiple choice test questions on the test scores of medical students. The final examination of fourth-year medical students completing the internal medicine rotation contains 120 multiple choice questions (MCQ), written in Thai and English at a ratio of 3:1. Individual scores on tests taken in both languages were collected, the effect of English language on MCQ performance was analyzed, and individual MCQ scores were compared with each student's English language proficiency and grade point average (GPA). Two hundred ninety-five fourth-year medical students were enrolled. The mean percentage MCQ scores in Thai and English were significantly different (65.0 ± 8.4 and 56.5 ± 12.4, respectively; p < …). The correlation of English MCQ scores with English proficiency was fair (Spearman's correlation coefficient = 0.41, p < …), and scores were lower in English than in Thai. Students were classified into six grade categories (A, B+, B, C+, C, and D+), which cumulatively measured total internal medicine rotation performance plus the final examination score. MCQ scores from the Thai-language examination were more closely correlated with total course grades than were the scores from the English-language examination (Spearman's correlation coefficient = 0.73, p < …). The mean English proficiency score was very high, at 3.71 ± 0.35 of a possible 4.00. Mean student GPA was 3.40 ± 0.33 of a possible 4.00. English-language MCQ examination scores were more highly associated with GPA than with English language proficiency. The use of English-language multiple choice question tests may decrease scores

  9. International Semiotics: Item Difficulty and the Complexity of Science Item Illustrations in the PISA-2009 International Test Comparison

    Solano-Flores, Guillermo; Wang, Chao; Shade, Chelsey

    2016-01-01

    We examined multimodality (the representation of information in multiple semiotic modes) in the context of international test comparisons. Using Programme for International Student Assessment (PISA) 2009 data, we examined the correlation of the difficulty of science items and the complexity of their illustrations. We observed statistically…

  10. The detection of cheating in multiple choice examinations

    Richmond, Peter; Roehner, Bertrand M.

    2015-10-01

    Cheating in examinations is acknowledged by an increasing number of organizations to be widespread. We examine two different approaches to assess their effectiveness at detecting anomalous results, suggestive of collusion, using data taken from a number of multiple-choice examinations organized by the UK Radio Communication Foundation. Analysis of student-pair overlaps of correct answers is shown to give results consistent with more orthodox statistical correlations, for which confidence limits (as opposed to the less familiar "Bonferroni method") can be used. A simulation approach is also developed which confirms the interpretation of the empirical approach. The simulation builds on a construction of exchangeable correlated binaries: the variables X_i = (1 − U_i) Y_i + U_i Z form a system of symmetric dependent binary (0, 1; p) variables whose correlation matrix is constant, ρ_ij = r; the proof is given in the paper. Two remarks: the expression "symmetric variables" reflects the fact that all X_i play the same role ("exchangeable variables" is often used with the same meaning), and the correlation matrix has only positive elements, as imposed by the symmetry condition, since a negative ρ_12 would violate the symmetry requirement. The paper then takes up whether this construction yields a unique set of X_i or only one among many.
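    The exchangeable-binary construction quoted in this abstract can be simulated directly; a sketch assuming the standard Bernoulli(√r) mixing probability for U_i (the snippet does not state the mixing probability explicitly):

```python
import random

def sample_exchangeable_binaries(n_vars, p, r, rng):
    """One draw of X_1..X_n with X_i = (1 - U_i) * Y_i + U_i * Z.

    Y_i, Z ~ Bernoulli(p) independent; U_i ~ Bernoulli(sqrt(r)), so any
    pair (X_i, X_j) shares the common Z with probability r and the
    pairwise correlation is r.  (The sqrt(r) mixing probability is an
    assumption here, not taken from the abstract.)
    """
    z = 1 if rng.random() < p else 0
    xs = []
    for _ in range(n_vars):
        u = 1 if rng.random() < r ** 0.5 else 0
        y = 1 if rng.random() < p else 0
        xs.append(z if u else y)
    return xs

rng = random.Random(7)
p, r, reps = 0.3, 0.25, 200_000
draws = [sample_exchangeable_binaries(2, p, r, rng) for _ in range(reps)]
mean = sum(x1 for x1, _ in draws) / reps
cov = sum((x1 - mean) * (x2 - mean) for x1, x2 in draws) / reps
var = p * (1 - p)  # theoretical Bernoulli variance
print(cov / var)   # empirical rho, close to r = 0.25
```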

  11. Multiple Choice Knapsack Problem: example of planning choice in transportation.

    Zhong, Tao; Young, Rhonda

    2010-05-01

    Transportation programming, a process of selecting projects for funding given budget and other constraints, is becoming more complex as a result of new federal laws, local planning regulations, and increased public involvement. This article describes the use of an integer programming tool, Multiple Choice Knapsack Problem (MCKP), to provide optimal solutions to transportation programming problems in cases where alternative versions of projects are under consideration. In this paper, optimization methods for use in the transportation programming process are compared and then the process of building and solving the optimization problems is discussed. The concepts about the use of MCKP are presented and a real-world transportation programming example at various budget levels is provided. This article illustrates how the use of MCKP addresses the modern complexities and provides timely solutions in transportation programming practice. While the article uses transportation programming as a case study, MCKP can be useful in other fields where a similar decision among a subset of the alternatives is required. Copyright 2009 Elsevier Ltd. All rights reserved.
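    The MCKP itself (choose exactly one alternative version per project, maximizing benefit within a budget) admits a compact dynamic-programming sketch; the project costs and benefits below are made up for illustration:

```python
def solve_mckp(groups, budget):
    """Multiple-Choice Knapsack via dynamic programming over spent budget.

    groups: list of lists of (cost, value) alternatives; exactly one
    alternative must be chosen from each group.  Returns the best
    (total_value, chosen_indices) pair, or None if infeasible.
    """
    # best[spent] = (value, choices) achievable at that exact total cost
    best = {0: (0, [])}
    for group in groups:
        nxt = {}
        for spent, (val, picks) in best.items():
            for idx, (cost, gain) in enumerate(group):
                b = spent + cost
                if b > budget:
                    continue
                cand = (val + gain, picks + [idx])
                if b not in nxt or cand[0] > nxt[b][0]:
                    nxt[b] = cand
        best = nxt
        if not best:
            return None  # no alternative of this group fits the budget
    return max(best.values(), key=lambda t: t[0])

# Three hypothetical projects, each with cheap/expensive versions (cost, benefit):
projects = [[(2, 4), (5, 9)], [(3, 5), (6, 8)], [(1, 2), (4, 7)]]
value, choice = solve_mckp(projects, budget=12)
print(value, choice)  # 21 [1, 0, 1]
```

    Under a budget of 12 the optimum mixes versions: the expensive first and third projects with the cheap second one, which is exactly the "alternative versions" trade-off the article describes.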

  12. The "Sniffin' Kids" test--a 14-item odor identification test for children.

    Valentin A Schriever

    Tools for measuring olfactory function in adults have been well established. Although studies have shown that olfactory impairment in children may occur as a consequence of a number of diseases or head trauma, to date no consensus on how to evaluate the sense of smell in children exists in Europe. The aim of the study was to develop a modified "Sniffin' Sticks" odor identification test, the "Sniffin' Kids" test, for use in children. In this study, 537 children between 6-17 years of age were included. Fourteen odors, which were identified at a high rate by children, were selected from the "Sniffin' Sticks" 16-item odor identification test. Normative data for the 14-item "Sniffin' Kids" odor identification test were obtained. The test was validated by including a group of congenitally anosmic children. Results show that the "Sniffin' Kids" test is able to discriminate between normosmia and anosmia with a cutoff value of >7 points on the odor identification test. In addition, the test-retest reliability was investigated in a group of 31 healthy children and shown to be ρ = 0.44. With the 14-item "Sniffin' Kids" odor identification test we present a valid and reliable test for measuring olfactory function in children between the ages of 6 and 17 years.

  13. Can Free-Response Questions Be Approximated by Multiple-Choice Equivalents?

    Lin, Shih-Yin; Singh, Chandralekha

    2016-01-01

    We discuss a study to evaluate the extent to which free-response questions can be approximated by multiple-choice equivalents. Two carefully designed research-based multiple-choice questions were transformed into a free-response format and administered on the final exam in a calculus-based introductory physics course. The original multiple-choice questions were administered in another, similar introductory physics course on the final exam. Our findings suggest that carefully designed multiple...

  14. Qualitätsverbesserung von MC Fragen [Quality assurance of Multiple Choice Questions

    Rotthoff, Thomas

    2006-08-01

    Because graded examinations previously played no role at the German medical faculties, there was no need to consistently reflect on question quality in written examinations. The new national legislation for medical education (Approbationsordnung) now prescribes certification-relevant, faculty-internal examinations. Structures and processes that could lead to an improvement in question quality are still lacking. To reflect the differing performance of students, test questions of differing difficulty and good discrimination are necessary. For an interdisciplinary examination of fourth-year undergraduate students at the University Hospital Duesseldorf, new multiple-choice (MC) questions were to be developed that are application-oriented, clearly formulated, and largely free of formal errors. Implementation took the form of workshops on the construction of MC questions and the appointment of an interdisciplinary review committee. It was shown that author training facilitates and accelerates the review process for the committee, and that a review process is reflected in a high practical orientation of the items. Prospectively, high-quality questions created through a review process and psychometrically analyzed could be loaded into inter-university databases, reducing the initial expenditure of time. The interdisciplinary composition of the review committee offers the possibility of intensified discussion of the content and relevance of the questions.

  15. Modeling Incorrect Responses to Multiple-Choice Items with Multilinear Formula Score Theory.

    1987-08-01


  16. The Effect of Error in Item Parameter Estimates on the Test Response Function Method of Linking.

    Kaskowitz, Gary S.; De Ayala, R. J.

    2001-01-01

    Studied the effect of item parameter estimation for computation of linking coefficients for the test response function (TRF) linking/equating method. Simulation results showed that linking was more accurate when there was less error in the parameter estimates, and that 15 or 25 common items provided better results than 5 common items under both…

  17. Statistical Indexes for Monitoring Item Behavior under Computer Adaptive Testing Environment.

    Zhu, Renbang; Yu, Feng; Liu, Su

    A computerized adaptive test (CAT) administration usually requires a large supply of items with accurately estimated psychometric properties, such as item response theory (IRT) parameter estimates, to ensure the precision of examinee ability estimation. However, an estimated IRT model of a given item in any given pool does not always correctly…

  18. Development of Test Items Related to Selected Concepts Within the Scheme the Particle Nature of Matter.

    Doran, Rodney L.; Pella, Milton O.

    The purpose of this study was to develop test items with a minimum reading demand for use with pupils at grade levels two through six. An item was judged acceptable if it satisfied at least four of six criteria. Approximately 250 students in grades 2-6 participated in the study. Half of the students were given instruction to develop…

  19. Projective Item Response Model for Test-Independent Measurement

    Ip, Edward Hak-Sing; Chen, Shyh-Huei

    2012-01-01

    The problem of fitting unidimensional item-response models to potentially multidimensional data has been extensively studied. The focus of this article is on response data that contains a major dimension of interest but that may also contain minor nuisance dimensions. Because fitting a unidimensional model to multidimensional data results in…

  20. Augmenting Fellow Education Through Spaced Multiple-Choice Questions.

    Barsoumian, Alice E; Yun, Heather C

    2018-01-01

    The San Antonio Uniformed Services Health Education Consortium Infectious Disease Fellowship program historically included a monthly short-answer and multiple-choice quiz. The intent was to ensure medical knowledge in relevant content areas that may not be addressed through clinical rotations, such as operationally relevant infectious disease. After completion, the quiz was discussed in a small group with faculty. Over time, faculty noted increasing dissatisfaction with the activity. Spaced-interval education is useful for the retention of medical knowledge and skills by medical students and residents; its use in infectious disease fellow education has not been described. To improve the quiz experience, we assessed the introduction of a spaced education curriculum in our program. A pre-intervention survey was distributed to assess the monthly quiz with Likert-scale and open-ended questions. A multiple-choice question spaced education curriculum was created using the Qstream® platform in 2011. Faculty development on question writing was conducted. Two questions were delivered every 2 d. Incorrectly and correctly answered questions were repeated after 7 and 13 d, respectively. Questions needed to be answered correctly twice to be retired. Fellow satisfaction was assessed at semi-annual fellowship reviews over 5 yr and by a one-time repeat survey. The pre-intervention survey of six fellows indicated dissatisfaction with the time commitment of the monthly quiz (median Likert score of 2; mean 6.5 h to complete), neutral perceived utility, but satisfaction with knowledge retention (Likert score 4). Eighteen fellows over 5 yr participated in the spaced education curriculum. Three quizzes with 20, 39, and 48 questions were designed. Seventeen percent of questions addressed operationally relevant topics. Fifty-nine percent of questions were answered correctly on first attempt, improving to a 93% correct-answer rate at the end of the analysis. Questions were attempted 2,999 times
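    The repetition schedule described above (missed questions repeated after 7 days, correct ones after 13, retirement after two correct answers) can be sketched as a small scheduling rule:

```python
import datetime as dt

def next_review(history, answered_correctly, today):
    """Spaced-repetition rule from the curriculum described above:
    a missed question returns after 7 days, a correct one after 13 days,
    and a question answered correctly twice is retired (returns None).
    `history` is the list of prior outcomes (True = correct).
    """
    outcomes = history + [answered_correctly]
    if outcomes.count(True) >= 2:
        return None  # retired
    delay = 13 if answered_correctly else 7
    return today + dt.timedelta(days=delay)

today = dt.date(2024, 1, 1)
print(next_review([], False, today))            # 2024-01-08
print(next_review([False], True, today))        # 2024-01-14
print(next_review([False, True], True, today))  # None -> retired
```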

  1. An empirical comparison of Item Response Theory and Classical Test Theory

    Špela Progar

    2008-11-01

    Based on nonlinear models between the measured latent variable and the item response, item response theory (IRT) enables independent estimation of item and person parameters and local estimation of measurement error. These properties of IRT are also its main theoretical advantages over classical test theory (CTT). Empirical evidence, however, has often failed to discover consistent differences between IRT and CTT parameters and between invariance measures of CTT and IRT parameter estimates. In this empirical study, a real data set from the Third International Mathematics and Science Study (TIMSS 1995) was used to address the following questions: (1) How comparable are CTT- and IRT-based item and person parameters? (2) How invariant are CTT- and IRT-based item parameters across different participant groups? (3) How invariant are CTT- and IRT-based item and person parameters across different item sets? The findings indicate that the CTT and IRT item/person parameters are very comparable, that the CTT and IRT item parameters show similar invariance properties when estimated across different groups of participants, that the IRT person parameters are more invariant across different item sets, and that the CTT item parameters are at least as invariant across different item sets as the IRT item parameters. The results furthermore demonstrate that, with regard to the invariance property, IRT item/person parameters are in general empirically superior to CTT parameters, but only if the appropriate IRT model is used for modelling the data.
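    For the flavor of such a CTT/IRT comparison, here is a toy sketch contrasting the CTT item difficulty (proportion correct) with a crude logit-based Rasch-style difficulty; this shortcut is for illustration only and is not the joint estimation procedure used in the study:

```python
import math

def ctt_difficulty(responses):
    """CTT item difficulty: proportion of examinees answering correctly."""
    return sum(responses) / len(responses)

def rasch_difficulty_proxy(p_correct):
    """Crude Rasch-style difficulty: negative logit of the CTT p-value.
    Real IRT estimation fits person and item parameters jointly; this
    proxy just shows the monotone correspondence between the scales."""
    return -math.log(p_correct / (1.0 - p_correct))

# Hypothetical scored responses (1 = correct) for three items:
items = {"easy": [1, 1, 1, 0, 1], "medium": [1, 0, 1, 0, 1], "hard": [0, 0, 1, 0, 0]}
for name, resp in items.items():
    p = ctt_difficulty(resp)
    print(name, p, round(rasch_difficulty_proxy(p), 2))  # harder item -> higher b
```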

  2. Pursuing the Qualities of a "Good" Test

    Coniam, David

    2014-01-01

    This article examines the issue of the quality of teacher-produced tests, limiting itself in the current context to objective, multiple-choice tests. The article investigates a short, two-part 20-item English language test. After a brief overview of the key test qualities of reliability and validity, the article examines the two subtests in terms…

  3. Modeling differential item functioning with group-specific item parameters: A computerized adaptive testing application

    Makransky, Guido; Glas, Cornelis A.W.

    2013-01-01

    Many important decisions are made based on the results of tests administered under different conditions in the fields of educational and psychological testing. Inaccurate inferences are often made if the property of measurement invariance (MI) is not assessed across these conditions. The importance

  4. Generalizability theory and item response theory

    Glas, Cornelis A.W.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a

  5. Using Module Analysis for Multiple Choice Responses: A New Method Applied to Force Concept Inventory Data

    Brewe, Eric; Bruun, Jesper; Bearden, Ian G.

    2016-01-01

    We describe "Module Analysis for Multiple Choice Responses" (MAMCR), a new methodology for carrying out network analysis on responses to multiple choice assessments. This method is used to identify modules of non-normative responses which can then be interpreted as an alternative to factor analysis. MAMCR allows us to identify conceptual…

  6. Learning Physics Teaching through Collaborative Design of Conceptual Multiple-Choice Questions

    Milner-Bolotin, Marina

    2015-01-01

    Increasing student engagement through Electronic Response Systems (clickers) has been widely researched. Its success largely depends on the quality of multiple-choice questions used by instructors. This paper describes a pilot project that focused on the implementation of online collaborative multiple-choice question repository, PeerWise, in a…

  7. Using a Classroom Response System to Improve Multiple-Choice Performance in AP[R] Physics

    Bertrand, Peggy

    2009-01-01

    Participation in rigorous high school courses such as Advanced Placement (AP[R]) Physics increases the likelihood of college success, especially for students who are traditionally underserved. Tackling difficult multiple-choice exams should be part of any AP program because well-constructed multiple-choice questions, such as those on AP exams and…

  8. The Answering Process for Multiple-Choice Questions in Collaborative Learning: A Mathematical Learning Model Analysis

    Nakamura, Yasuyuki; Nishi, Shinnosuke; Muramatsu, Yuta; Yasutake, Koichi; Yamakawa, Osamu; Tagawa, Takahiro

    2014-01-01

    In this paper, we introduce a mathematical model for collaborative learning and the answering process for multiple-choice questions. The collaborative learning model is inspired by the Ising spin model and the model for answering multiple-choice questions is based on their difficulty level. An intensive simulation study predicts the possibility of…

  9. Teaching Critical Thinking without (Much) Writing: Multiple-Choice and Metacognition

    Bassett, Molly H.

    2016-01-01

    In this essay, I explore an exam format that pairs multiple-choice questions with required rationales. In a space adjacent to each multiple-choice question, students explain why or how they arrived at the answer they selected. This exercise builds the critical thinking skill known as metacognition, thinking about thinking, into an exam that also…

  10. Application of Item Response Theory to Tests of Substance-related Associative Memory

    Shono, Yusuke; Grenard, Jerry L.; Ames, Susan L.; Stacy, Alan W.

    2015-01-01

    A substance-related word association test (WAT) is one of the commonly used indirect tests of substance-related implicit associative memory and has been shown to predict substance use. This study applied an item response theory (IRT) modeling approach to evaluate the psychometric properties of the alcohol- and marijuana-related WATs and their items among 775 ethnically diverse at-risk adolescents. After examining the IRT assumptions, item fit, and differential item functioning (DIF) across gender and age groups, the original 18 WAT items were reduced to 14 and 15 items in the alcohol- and marijuana-related WATs, respectively. Thereafter, unidimensional one- and two-parameter logistic models (1PL and 2PL models) were fitted to the revised WAT items. The results demonstrated that both alcohol- and marijuana-related WATs have good psychometric properties. These results were discussed in light of the framework of a unified concept of construct validity (Messick, 1975, 1989, 1995). PMID:25134051

  11. Above-Level Test Item Functioning across Examinee Age Groups

    Warne, Russell T.; Doty, Kristine J.; Malbica, Anne Marie; Angeles, Victor R.; Innes, Scott; Hall, Jared; Masterson-Nixon, Kelli

    2016-01-01

    "Above-level testing" (also called "above-grade testing," "out-of-level testing," and "off-level testing") is the practice of administering to a child a test that is designed for an examinee population that is older or in a more advanced grade. Above-level testing is frequently used to help educators design…

  12. Psychometric aspects of item mapping for criterion-referenced interpretation and bookmark standard setting.

    Huynh, Huynh

    2010-01-01

    Locating an item on an achievement continuum (item mapping) is well-established in technical work for educational/psychological assessment. Applications of item mapping may be found in criterion-referenced (CR) testing (or scale anchoring, Beaton and Allen, 1992; Huynh, 1994, 1998a, 2000a, 2000b, 2006), computer-assisted testing, test form assembly, and in standard setting methods based on ordered test booklets. These methods include the bookmark standard setting originally used for the CTB/TerraNova tests (Lewis, Mitzel, Green, and Patz, 1999), the item descriptor process (Ferrara, Perie, and Johnson, 2002) and a similar process described by Wang (2003) for multiple-choice licensure and certification examinations. While item response theory (IRT) models such as the Rasch and two-parameter logistic (2PL) models traditionally place a binary item at its location, Huynh has argued in the cited papers that such mapping may not be appropriate in selecting items for CR interpretation and scale anchoring.
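    Huynh's point about item location can be made concrete: under a 2PL model, mapping an item at a response probability (RP) criterion higher than 0.5 shifts its mapped location above its difficulty b. A sketch (RP = 0.67 is an illustrative choice, common in bookmark-style procedures):

```python
import math

def mapped_location(a, b, rp=0.67):
    """Theta at which a 2PL item reaches response probability `rp`.

    Solves rp = 1 / (1 + exp(-a * (theta - b))) for theta.  With rp = 0.5
    this is just the item's b; criterion-referenced mapping often uses a
    higher rp, which places the item above b by log(rp/(1-rp))/a.
    """
    return b + math.log(rp / (1.0 - rp)) / a

print(mapped_location(1.0, 0.0, 0.5))           # 0.0: the classic IRT location
print(round(mapped_location(1.0, 0.0, 0.67), 3))  # shifted above b
```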

  13. Evaluating an Automated Number Series Item Generator Using Linear Logistic Test Models

    Bao Sheng Loe

    2018-04-01

    This study investigates the item properties of a newly developed Automatic Number Series Item Generator (ANSIG). The foundation of the ANSIG is based on five hypothesised cognitive operators. Thirteen item models were developed using the numGen R package, and eleven were evaluated in this study. The 16-item ICAR (International Cognitive Ability Resource) short-form ability test was used to evaluate construct validity. The Rasch model and two Linear Logistic Test Models (LLTM) were employed to estimate and predict the item parameters. Results indicate that a single factor determines performance on tests composed of items generated by the ANSIG. Under the LLTM approach, all the cognitive operators were significant predictors of item difficulty. Moderate to high correlations were evident between the number series items and the ICAR test scores, with high correlation found for the ICAR letter-numeric-series type items, suggesting adequate nomothetic span. Extended cognitive research is, nevertheless, essential for the automatic generation of an item pool with predictable psychometric properties.
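    The LLTM decomposition behind this kind of analysis expresses each item's Rasch difficulty as a weighted sum of the cognitive operators it involves; in standard notation (the symbols here are generic, not taken from the paper):

```latex
% LLTM: item difficulty as a linear function of cognitive-operator weights
\beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c, \qquad
P(X_{pi}=1 \mid \theta_p) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}
```

    where q_ik indicates whether operator k is required by item i, η_k is that operator's difficulty weight, and c is a normalization constant; significant η_k estimates are what the abstract means by operators "predicting item difficulty."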

  14. Secondary Psychometric Examination of the Dimensional Obsessive-Compulsive Scale: Classical Testing, Item Response Theory, and Differential Item Functioning.

    Thibodeau, Michel A; Leonard, Rachel C; Abramowitz, Jonathan S; Riemann, Bradley C

    2015-12-01

    The Dimensional Obsessive-Compulsive Scale (DOCS) is a promising measure of obsessive-compulsive disorder (OCD) symptoms but has received minimal psychometric attention. We evaluated the utility and reliability of DOCS scores. The study included 832 students and 300 patients with OCD. Confirmatory factor analysis supported the originally proposed four-factor structure. DOCS total and subscale scores exhibited good to excellent internal consistency in both samples (α = .82 to α = .96). Patient DOCS total scores reduced substantially during treatment (t = 16.01, d = 1.02). DOCS total scores discriminated between students and patients (sensitivity = 0.76, 1 - specificity = 0.23). The measure did not exhibit gender-based differential item functioning as tested by Mantel-Haenszel chi-square tests. Expected response options for each item were plotted as a function of item response theory and demonstrated that DOCS scores incrementally discriminate OCD symptoms ranging from low to extremely high severity. Incremental differences in DOCS scores appear to represent unbiased and reliable differences in true OCD symptom severity. © The Author(s) 2014.
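    The Mantel-Haenszel DIF screen used in this study rests on a common odds ratio pooled across matched score strata (the chi-square test then checks its significance); a sketch with made-up counts:

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is (a, b, c, d): reference group correct/incorrect,
    focal group correct/incorrect.  A value near 1 suggests no uniform
    DIF.  The counts below are illustrative, not from the DOCS study.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Three hypothetical total-score strata for one item, by gender group:
strata = [(30, 10, 28, 12), (20, 20, 18, 22), (10, 30, 9, 31)]
print(round(mantel_haenszel_or(strata), 2))  # 1.22, close to 1 -> little DIF
```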

  15. A Comparison of Multidimensional Item Selection Methods in Simple and Complex Test Designs

    Eren Halil ÖZBERK

    2017-03-01

    In contrast with previous studies, this study employed various test designs (simple and complex) that allow evaluation of the overall ability score estimations across multiple real test conditions. In this study, four factors were manipulated, namely the test design, the number of items per dimension, the correlation between dimensions, and the item selection method. Using the generated item and ability parameters, dichotomous item responses were generated by using the M3PL compensatory multidimensional IRT model with specified correlations. MCAT composite ability score accuracy was evaluated using the absolute bias (ABSBIAS), the correlation, and the root mean square error (RMSE) between true and estimated ability scores. The results suggest that the multidimensional test structure, the number of items per dimension, and the correlation between dimensions had significant effects on the item selection methods for the overall score estimations. For the simple-structure test design, the V1 item selection method yielded the lowest absolute bias estimations for both long and short tests when estimating overall scores. As the model becomes more complex, the KL item selection method performed better than the other two item selection methods.
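    The recovery criteria named in this abstract (absolute bias, RMSE, and the correlation between true and estimated ability) are straightforward to compute; a sketch with illustrative numbers:

```python
import math

def recovery_stats(true_thetas, est_thetas):
    """Score-recovery summaries of the kind used above: mean absolute
    bias, RMSE, and Pearson correlation between true and estimated
    abilities.  The toy ability values below are made up."""
    n = len(true_thetas)
    diffs = [e - t for t, e in zip(true_thetas, est_thetas)]
    absbias = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mt = sum(true_thetas) / n
    me = sum(est_thetas) / n
    cov = sum((t - mt) * (e - me) for t, e in zip(true_thetas, est_thetas))
    st = math.sqrt(sum((t - mt) ** 2 for t in true_thetas))
    se = math.sqrt(sum((e - me) ** 2 for e in est_thetas))
    return absbias, rmse, cov / (st * se)

true_t = [-1.0, -0.5, 0.0, 0.5, 1.0]
est_t = [-0.9, -0.6, 0.1, 0.4, 1.2]
absbias, rmse, r = recovery_stats(true_t, est_t)
print(round(absbias, 2), round(rmse, 3), round(r, 3))
```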

  16. Optimizing the Use of Response Times for Item Selection in Computerized Adaptive Testing

    Choe, Edison M.; Kern, Justin L.; Chang, Hua-Hua

    2018-01-01

    Despite common operationalization, measurement efficiency of computerized adaptive testing should not only be assessed in terms of the number of items administered but also the time it takes to complete the test. To this end, a recent study introduced a novel item selection criterion that maximizes Fisher information per unit of expected response…

  17. Applications of NLP Techniques to Computer-Assisted Authoring of Test Items for Elementary Chinese

    Liu, Chao-Lin; Lin, Jen-Hsiang; Wang, Yu-Chun

    2010-01-01

    The authors report an implemented environment for computer-assisted authoring of test items and provide a brief discussion about the applications of NLP techniques for computer assisted language learning. Test items can serve as a tool for language learners to examine their competence in the target language. The authors apply techniques for…

  18. A Method for Generating Educational Test Items That Are Aligned to the Common Core State Standards

    Gierl, Mark J.; Lai, Hollis; Hogan, James B.; Matovinovic, Donna

    2015-01-01

    The demand for test items far outstrips the current supply. This increased demand can be attributed, in part, to the transition to computerized testing, but, it is also linked to dramatic changes in how 21st century educational assessments are designed and administered. One way to address this growing demand is with automatic item generation.…

  19. Relationships among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models

    Kohli, Nidhi; Koran, Jennifer; Henn, Lisa

    2015-01-01

    There are well-defined theoretical differences between the classical test theory (CTT) and item response theory (IRT) frameworks. It is understood that in the CTT framework, person and item statistics are test- and sample-dependent. This is not the perception with IRT. For this reason, the IRT framework is considered to be theoretically superior…

  20. Strategies for Controlling Item Exposure in Computerized Adaptive Testing with the Generalized Partial Credit Model

    Davis, Laurie Laughlin

    2004-01-01

    Choosing a strategy for controlling item exposure has become an integral part of test development for computerized adaptive testing (CAT). This study investigated the performance of six procedures for controlling item exposure in a series of simulated CATs under the generalized partial credit model. In addition to a no-exposure control baseline…

  1. Effects of Using Modified Items to Test Students with Persistent Academic Difficulties

    Elliott, Stephen N.; Kettler, Ryan J.; Beddow, Peter A.; Kurz, Alexander; Compton, Elizabeth; McGrath, Dawn; Bruen, Charles; Hinton, Kent; Palmer, Porter; Rodriguez, Michael C.; Bolt, Daniel; Roach, Andrew T.

    2010-01-01

    This study investigated the effects of using modified items in achievement tests to enhance accessibility. An experiment determined whether tests composed of modified items would reduce the performance gap between students eligible for an alternate assessment based on modified achievement standards (AA-MAS) and students not eligible, and the…

  2. Latent Trait Theory Applications to Test Item Bias Methodology. Research Memorandum No. 1.

    Osterlind, Steven J.; Martois, John S.

    This study discusses latent trait theory applications to test item bias methodology. A real data set is used in describing the rationale and application of the Rasch probabilistic model item calibrations across various ethnic group populations. A high school graduation proficiency test covering reading comprehension, writing mechanics, and…

  3. Test Score Equating Using Discrete Anchor Items versus Passage-Based Anchor Items: A Case Study Using "SAT"® Data. Research Report. ETS RR-14-14

    Liu, Jinghua; Zu, Jiyun; Curley, Edward; Carey, Jill

    2014-01-01

    The purpose of this study is to investigate the impact of discrete anchor items versus passage-based anchor items on observed score equating using empirical data.This study compares an "SAT"® critical reading anchor that contains more discrete items proportionally, compared to the total tests to be equated, to another anchor that…

  4. Piecewise Polynomial Fitting with Trend Item Removal and Its Application in a Cab Vibration Test

    Wu Ren

    2018-01-01

    The trend item of a long-term vibration signal is difficult to remove. This paper proposes a piecewise integration method to remove trend items. Examples of direct integration without trend item removal, global integration after piecewise polynomial fitting with trend item removal, and direct integration after piecewise polynomial fitting with trend item removal were simulated. The results showed that direct integration of the fitted piecewise polynomial provided greater acceleration and displacement precision than the other two integration methods. A vibration test was then performed on a special equipment cab. The results indicated that direct integration by piecewise polynomial fitting with trend item removal was highly consistent with the measured signal data. However, the direct integration method without trend item removal resulted in signal distortion. The proposed method can help with frequency domain analysis of vibration signals and modal parameter identification for such equipment.
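    A minimal sketch of the idea described above: fit a low-order (here linear) polynomial to each piece, subtract it to remove the trend item, then integrate directly, assuming a uniformly sampled signal. The segment count and fit order are illustrative choices, not the paper's settings:

```python
def detrend_piecewise(signal, dt, n_segments=4):
    """Remove the trend item from a sampled signal by fitting a
    least-squares straight line to each of `n_segments` pieces and
    subtracting it, then integrate the residual with the trapezoid
    rule."""
    n = len(signal)
    detrended = []
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    for s, e in zip(bounds, bounds[1:]):
        seg = signal[s:e]
        m = len(seg)
        xs = range(m)
        x_mean = (m - 1) / 2.0
        y_mean = sum(seg) / m
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, seg))
        sxx = sum((x - x_mean) ** 2 for x in xs)
        slope = sxy / sxx if sxx else 0.0
        detrended.extend(y - (y_mean + slope * (x - x_mean))
                         for x, y in zip(xs, seg))
    # trapezoid-rule integration of the detrended signal
    integral = [0.0]
    for y0, y1 in zip(detrended, detrended[1:]):
        integral.append(integral[-1] + 0.5 * (y0 + y1) * dt)
    return detrended, integral
```

    On a signal that is pure trend, the detrended residual and its integral are both near zero, which is the distortion-avoiding behavior the paper reports.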

  5. Science Library of Test Items. Volume Eight. Mastery Testing Program. Series 3 & 4 Supplements to Introduction and Manual.

    New South Wales Dept. of Education, Sydney (Australia).

    Continuing a series of short tests aimed at measuring student mastery of specific skills in the natural sciences, this supplementary volume includes teachers' notes, a users' guide and inspection copies of test items 27 to 50. Answer keys and test scoring statistics are provided. The items are designed for grades 7 through 10, and a list of the…

  6. Grade 9 Pilot Test. Mathematics. June 1988 = 9e Annee Test Pilote. Mathematiques. Juin 1988.

    Alberta Dept. of Education, Edmonton.

    This pilot test for ninth grade mathematics is written in both French and English. The test consists of 75 multiple-choice items. Students are given 90 minutes to complete the examination and the use of a calculator is highly recommended. The test content covers a wide range of mathematical topics including: decimals; exponents; arithmetic word…

  7. A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing.

    van Rijn, Peter W; Ali, Usama S

    2017-05-01

    We compare three modelling frameworks for accuracy and speed of item responses in the context of adaptive testing. The first framework is based on modelling scores that result from a scoring rule that incorporates both accuracy and speed. The second framework is the hierarchical modelling approach developed by van der Linden (2007, Psychometrika, 72, 287) in which a regular item response model is specified for accuracy and a log-normal model for speed. The third framework is the diffusion framework in which the response is assumed to be the result of a Wiener process. Although the three frameworks differ in the relation between accuracy and speed, one commonality is that the marginal model for accuracy can be simplified to the two-parameter logistic model. We discuss both conditional and marginal estimation of model parameters. Models from all three frameworks were fitted to data from a mathematics and spelling test. Furthermore, we applied a linear and adaptive testing mode to the data off-line in order to determine differences between modelling frameworks. It was found that a model from the scoring rule framework outperformed a hierarchical model in terms of model-based reliability, but the results were mixed with respect to correlations with external measures. © 2017 The British Psychological Society.

  8. Differential Item Functioning (DIF) among Spanish-Speaking English Language Learners (ELLs) in State Science Tests

    Ilich, Maria O.

    Psychometricians and test developers evaluate standardized tests for potential bias against groups of test-takers by using differential item functioning (DIF). English language learners (ELLs) are a diverse group of students whose native language is not English. While they are still learning the English language, they must take their standardized tests for their school subjects, including science, in English. In this study, linguistic complexity was examined as a possible source of DIF that may result in test scores that confound science knowledge with a lack of English proficiency among ELLs. Two years of fifth-grade state science tests were analyzed for evidence of DIF using two DIF methods, the Simultaneous Item Bias Test (SIBTest) and logistic regression. The tests presented a unique challenge in that the test items were grouped into testlets: groups of items referring to a scientific scenario that measure knowledge of different science content or skills. Very large samples of 10,256 students in 2006 and 13,571 students in 2007 were examined. Half of each sample was composed of Spanish-speaking ELLs; the balance comprised native English speakers. The two DIF methods agreed about the items that favored non-ELLs and the items that favored ELLs. Logistic regression effect sizes were all negligible, while SIBTest flagged items with low to high DIF. A decrease in socioeconomic status and Spanish-speaking ELL diversity may have led to inconsistent SIBTest effect sizes for items used in both testing years. The DIF results for the testlets suggested that ELLs lacked sufficient opportunity to learn science content. The DIF results further suggest that constructed-response test items requiring the student to draw a conclusion about a scientific investigation or to plan a new investigation tended to favor ELLs.
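    The intuition behind the SIBTest statistic used above is a weighted difference in proportion correct between matched reference and focal groups; a crude sketch follows. The published SIBTest also applies a regression correction to the matching scores, which is omitted here:

```python
def sibtest_beta(ref, foc):
    """Crude SIBTest-style effect size: the weighted difference in
    proportion correct between reference and focal groups, matched
    on total score over the remaining items.  `ref` and `foc` map
    each matching score to (n_correct, n_total).  Positive values
    favour the reference group."""
    beta, weight = 0.0, 0
    for k in set(ref) & set(foc):
        rc, rn = ref[k]
        fc, fn = foc[k]
        w = rn + fn
        beta += w * (rc / rn - fc / fn)
        weight += w
    return beta / weight if weight else 0.0
```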

  9. Quantitative penetration testing with item response theory (extended version)

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle Ida Antoinette

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management,

  10. Differential item functioning analysis of the Vanderbilt Expertise Test for cars.

    Lee, Woo-Yeol; Cho, Sun-Joo; McGugin, Rankin W; Van Gulick, Ana Beth; Gauthier, Isabel

    2015-01-01

    The Vanderbilt Expertise Test for cars (VETcar) is a test of visual learning for contemporary car models. We used item response theory to assess the VETcar and in particular used differential item functioning (DIF) analysis to ask if the test functions the same way in laboratory versus online settings and for different groups based on age and gender. An exploratory factor analysis found evidence of multidimensionality in the VETcar, although a single dimension was deemed sufficient to capture the recognition ability measured by the test. We selected a unidimensional three-parameter logistic item response model to examine item characteristics and subject abilities. The VETcar had satisfactory internal consistency. A substantial number of items showed DIF at a medium effect size for test setting and for age group, whereas gender DIF was negligible. Because online subjects were on average older than those tested in the lab, we focused on the age groups to conduct a multigroup item response theory analysis. This revealed that most items on the test favored the younger group. DIF could be more the rule than the exception when measuring performance with familiar object categories, therefore posing a challenge for the measurement of either domain-general visual abilities or category-specific knowledge.
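    The unidimensional three-parameter logistic model selected in this study defines the item characteristic curve sketched below; the DIF-gap helper is an illustrative addition for reading multigroup results, not the authors' procedure:

```python
import math

def p3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item characteristic curve:
    a guessing floor c plus a scaled logistic in ability theta,
    P(theta) = c + (1 - c) / (1 + exp(-1.7*a*(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def dif_gap(theta, params_group1, params_group2):
    """Gap between two groups' response curves at matched ability;
    a nonzero gap at the same theta is the signature of DIF, e.g.
    items whose curves sit higher for the younger group."""
    return p3pl(theta, *params_group1) - p3pl(theta, *params_group2)
```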

  11. The effects of linguistic modification on ESL students' comprehension of nursing course test items.

    Bosher, Susan; Bowles, Melissa

    2008-01-01

    Recent research has indicated that language may be a source of construct-irrelevant variance for non-native speakers of English, or English as a second language (ESL) students, when they take exams. As a result, exams may not accurately measure knowledge of nursing content. One accommodation often used to level the playing field for ESL students is linguistic modification, a process by which the reading load of test items is reduced while the content and integrity of the item are maintained. Research on the effects of linguistic modification has been conducted on examinees in the K-12 population, but is just beginning in other areas. This study describes the collaborative process by which items from a pathophysiology exam were linguistically modified and subsequently evaluated for comprehensibility by ESL students. Findings indicate that in a majority of cases, modification improved examinees' comprehension of test items. Implications for test item writing and future research are discussed.

  12. Factor Structure and Reliability of Test Items for Saudi Teacher Licence Assessment

    Alsadaawi, Abdullah Saleh

    2017-01-01

    The Saudi National Assessment Centre administers the Computer Science Teacher Test for teacher certification. The aim of this study is to explore gender differences in candidates' scores, and investigate dimensionality, reliability, and differential item functioning using confirmatory factor analysis and item response theory. The confirmatory…

  13. Testing for Nonuniform Differential Item Functioning with Multiple Indicator Multiple Cause Models

    Woods, Carol M.; Grimm, Kevin J.

    2011-01-01

    In extant literature, multiple indicator multiple cause (MIMIC) models have been presented for identifying items that display uniform differential item functioning (DIF) only, not nonuniform DIF. This article addresses, for apparently the first time, the use of MIMIC models for testing both uniform and nonuniform DIF with categorical indicators. A…

  14. A Feedback Control Strategy for Enhancing Item Selection Efficiency in Computerized Adaptive Testing

    Weissman, Alexander

    2006-01-01

    A computerized adaptive test (CAT) may be modeled as a closed-loop system, where item selection is influenced by trait level ([theta]) estimation and vice versa. When discrepancies exist between an examinee's estimated and true [theta] levels, nonoptimal item selection is a likely result. Nevertheless, examinee response behavior consistent with…

  15. Australian Biology Test Item Bank, Years 11 and 12. Volume II: Year 12.

    Brown, David W., Ed.; Sewell, Jeffrey J., Ed.

    This document consists of test items which are applicable to biology courses throughout Australia (irrespective of course materials used); assess key concepts within course statement (for both core and optional studies); assess a wide range of cognitive processes; and are relevant to current biological concepts. These items are arranged under…

  16. Australian Biology Test Item Bank, Years 11 and 12. Volume I: Year 11.

    Brown, David W., Ed.; Sewell, Jeffrey J., Ed.

    This document consists of test items which are applicable to biology courses throughout Australia (irrespective of course materials used); assess key concepts within course statement (for both core and optional studies); assess a wide range of cognitive processes; and are relevant to current biological concepts. These items are arranged under…

  17. What Does a Verbal Test Measure? A New Approach to Understanding Sources of Item Difficulty.

    Berk, Eric J. Vanden; Lohman, David F.; Cassata, Jennifer Coyne

    Assessing the construct relevance of mental test results continues to present many challenges, and it has proven to be particularly difficult to assess the construct relevance of verbal items. This study was conducted to gain a better understanding of the conceptual sources of verbal item difficulty using a unique approach that integrates…

  18. The Prediction of Item Parameters Based on Classical Test Theory and Latent Trait Theory

    Anil, Duygu

    2008-01-01

    In this study, the prediction power of the item characteristics based on the experts' predictions on conditions try-out practices cannot be applied was examined for item characteristics computed depending on classical test theory and two-parameters logistic model of latent trait theory. The study was carried out on 9914 randomly selected students…

  19. Development of an item bank for computerized adaptive test (CAT) measurement of pain

    Petersen, Morten Aa.; Aaronson, Neil K; Chie, Wei-Chu

    2016-01-01

    PURPOSE: Patient-reported outcomes should ideally be adapted to the individual patient while maintaining comparability of scores across patients. This is achievable using computerized adaptive testing (CAT). The aim here was to develop an item bank for CAT measurement of the pain domain as measured… were obtained from 1103 cancer patients from five countries. Psychometric evaluations showed that 16 items could be retained in a unidimensional item bank. Evaluations indicated that use of the CAT measure may reduce sample size requirements by 15-25% compared to using the QLQ-C30 pain scale. CONCLUSIONS: We have established an item bank of 16 items suitable for CAT measurement of pain. While being backward compatible with the QLQ-C30, the new item bank will significantly improve measurement precision of pain. We recommend initiating CAT measurement by screening for pain using the two original QLQ…

  20. Análise de itens de uma prova de raciocínio estatístico Analysis of items of a statistical reasoning test

    Claudette Maria Medeiros Vendramini

    2004-12-01

    This study analyzed the 18 multiple-choice questions of a test on basic concepts of statistics under both classical and modern test theory. The test was taken by 325 undergraduate students, randomly selected from the areas of the human, exact, and health sciences. The analysis indicated that the test is predominantly unidimensional and that the items are better fitted by the three-parameter model. The indices of difficulty, discrimination, and biserial correlation show acceptable values. It is suggested that new items be added to the test in order to establish reliability and validity for the educational context and to reveal the statistical reasoning of undergraduate students when reading representations of statistical data.
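    The classical indices reported in this record (difficulty, discrimination, and an item-total correlation standing in for the biserial) can be computed from a scored 0/1 response matrix. A sketch with illustrative conventions (upper/lower 27% groups, uncorrected point-biserial):

```python
import math

def item_statistics(responses):
    """Classical item analysis for a 0/1 scored matrix
    (rows = examinees, columns = items): difficulty (proportion
    correct), discrimination (upper minus lower 27% group
    difficulty), and the point-biserial item-total correlation,
    uncorrected for the item's own contribution to the total."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(0.27 * len(responses)))
    lower, upper = order[:k], order[-k:]
    stats = []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = sum(col) / len(col)
        disc = (sum(responses[i][j] for i in upper)
                - sum(responses[i][j] for i in lower)) / k
        mt = sum(totals) / len(totals)
        st = math.sqrt(sum((t - mt) ** 2 for t in totals) / len(totals))
        mp = sum(t for t, x in zip(totals, col) if x) / max(1, sum(col))
        rpb = ((mp - mt) / st * math.sqrt(p / (1 - p))
               if 0 < p < 1 and st else 0.0)
        stats.append((p, disc, rpb))
    return stats
```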

  1. Explanatory item response modelling of an abstract reasoning assessment: A case for modern test design

    Helland, Fredrik

    2016-01-01

    Assessment is an integral part of society and education, and for this reason it is important to know what you measure. This thesis is about explanatory item response modelling of an abstract reasoning assessment, with the objective to create a modern test design framework for automatic generation of valid and precalibrated items of abstract reasoning. Modern test design aims to strengthen the connections between the different components of a test, with a stress on strong theory, systematic it...

  2. A comparison of discriminant logistic regression and Item Response Theory Likelihood-Ratio Tests for Differential Item Functioning (IRTLRDIF) in polytomous short tests.

    Hidalgo, María D; López-Martínez, María D; Gómez-Benito, Juana; Guilera, Georgina

    2016-01-01

    Short scales are typically used in the social, behavioural and health sciences. This is relevant since test length can influence whether items showing DIF are correctly flagged. This paper compares the relative effectiveness of discriminant logistic regression (DLR) and IRTLRDIF for detecting DIF in polytomous short tests. A simulation study was designed in which test length, sample size, DIF amount and number of item response categories were manipulated, and Type I error and power were evaluated. IRTLRDIF and DLR yielded Type I error rates close to the nominal level in no-DIF conditions. Under DIF conditions, Type I error rates were affected by test length, DIF amount, degree of test contamination, sample size and number of item response categories. DLR showed a higher Type I error rate than IRTLRDIF. Power rates were affected by DIF amount and sample size, but not by test length. DLR achieved higher power rates than IRTLRDIF in very short tests, although its inflated Type I error rate means this result cannot be relied upon. Test length had an important impact on the Type I error rate. IRTLRDIF and DLR showed low power rates in short tests and with small sample sizes.

  3. Item response theory, computerized adaptive testing, and PROMIS: assessment of physical function.

    Fries, James F; Witter, James; Rose, Matthias; Cella, David; Khanna, Dinesh; Morgan-DeWitt, Esi

    2014-01-01

    Patient-reported outcome (PRO) questionnaires record health information directly from research participants because observers may not accurately represent the patient perspective. Patient-reported Outcomes Measurement Information System (PROMIS) is a US National Institutes of Health cooperative group charged with bringing PRO to a new level of precision and standardization across diseases by item development and use of item response theory (IRT). With IRT methods, improved items are calibrated on an underlying concept to form an item bank for a "domain" such as physical function (PF). The most informative items can be combined to construct efficient "instruments" such as 10-item or 20-item PF static forms. Each item is calibrated on the basis of the probability that a given person will respond at a given level, and the ability of the item to discriminate people from one another. Tailored forms may cover any desired level of the domain being measured. Computerized adaptive testing (CAT) selects the best items to sharpen the estimate of a person's functional ability, based on prior responses to earlier questions. PROMIS item banks have been improved with experience from several thousand items, and are calibrated on over 21,000 respondents. In areas tested to date, PROMIS PF instruments are superior or equal to Health Assessment Questionnaire and Medical Outcome Study Short Form-36 Survey legacy instruments in clarity, translatability, patient importance, reliability, and sensitivity to change. Precise measures, such as PROMIS, efficiently incorporate patient self-report of health into research, potentially reducing research cost by lowering sample size requirements. The advent of routine IRT applications has the potential to transform PRO measurement.

  4. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory

    Tomasz Smoleń

    2015-08-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf test, the Raven test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.

  5. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory.

    Smolen, Tomasz; Chuderski, Adam

    2015-01-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf-test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf-test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf-test, the Raven-test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.

  6. Construction of Valid and Reliable Test for Assessment of Students

    Osadebe, P. U.

    2015-01-01

    The study was carried out to construct a valid and reliable test in Economics for secondary school students. Two research questions guided the establishment of validity and reliability for the Economics Achievement Test (EAT), a multiple-choice objective test of 100 items with five options each. A sample of 1000 students was randomly…

  7. Understanding Test-Takers' Perceptions of Difficulty in EAP Vocabulary Tests: The Role of Experiential Factors

    Oruç Ertürk, Nesrin; Mumford, Simon E.

    2017-01-01

    This study, conducted by two researchers who were also multiple-choice question (MCQ) test item writers at a private English-medium university in an English as a foreign language (EFL) context, was designed to shed light on the factors that influence test-takers' perceptions of difficulty in English for academic purposes (EAP) vocabulary, with the…

  8. Comparison of Classical Test Theory and Item Response Theory in Individual Change Assessment

    Jabrayilov, Ruslan; Emons, Wilco H. M.; Sijtsma, Klaas

    2016-01-01

    Clinical psychologists are advised to assess clinical and statistical significance when assessing change in individual patients. Individual change assessment can be conducted using either the methodologies of classical test theory (CTT) or item response theory (IRT). Researchers have been optimistic

  9. Item Response Theory analysis of Fagerström Test for Cigarette Dependence.

    Svicher, Andrea; Cosci, Fiammetta; Giannini, Marco; Pistelli, Francesco; Fagerström, Karl

    2018-02-01

    The Fagerström Test for Cigarette Dependence (FTCD) and the Heaviness of Smoking Index (HSI) are the gold standard measures for assessing cigarette dependence. However, the FTCD's reliability and factor structure have been questioned, and the HSI's psychometric properties are in need of further investigation. The present study examined the psychometric properties of the FTCD and the HSI via Item Response Theory. The study was a secondary analysis of data collected from 862 Italian daily smokers. Confirmatory factor analysis was run to evaluate the dimensionality of the FTCD. A Graded Response Model was applied to the FTCD and the HSI to verify the fit to the data. Both item and test functioning were analyzed, and item statistics, the Test Information Function, and scale reliabilities were calculated. Mokken Scale Analysis was applied to estimate homogeneity, and Loevinger's coefficients were calculated. The FTCD showed unidimensionality and homogeneity for most of the items and for the total score. It also showed high sensitivity and good reliability from medium to high levels of cigarette dependence, although problems related to some items (i.e., items 3 and 5) were evident. The HSI had good homogeneity, adequate item functioning, and high reliability from medium to high levels of cigarette dependence. Significant Differential Item Functioning was found for items 1, 4, and 5 of the FTCD and for both items of the HSI. The HSI seems highly recommended in clinical settings addressing heavy smokers, while the FTCD would be better used in smokers whose level of cigarette dependence ranges between low and high. Copyright © 2017 Elsevier Ltd. All rights reserved.
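    The Graded Response Model applied in this study gives each response category a probability via differences of adjacent boundary curves; a minimal sketch, assuming increasing thresholds:

```python
import math

def grm_probs(theta, a, thresholds):
    """Graded Response Model category probabilities for one
    polytomous item.  P*(k) = 1 / (1 + exp(-a*(theta - b_k))) is
    the probability of responding in category k or above; adjacent
    differences of these boundary curves give the probability of
    each category.  `thresholds` must be strictly increasing."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b)))
                      for b in thresholds] + [0.0]
    return [hi - lo for hi, lo in zip(p_star, p_star[1:])]
```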

  10. Multiple-Choice Exams: An Obstacle for Higher-Level Thinking in Introductory Science Classes

    Stanger-Hall, Kathrin F.

    2012-01-01

    Learning science requires higher-level (critical) thinking skills that need to be practiced in science classes. This study tested the effect of exam format on critical-thinking skills. Multiple-choice (MC) testing is common in introductory science courses, and students in these classes tend to associate memorization with MC questions and may not see the need to modify their study strategies for critical thinking, because the MC exam format has not changed. To test the effect of exam format, I used two sections of an introductory biology class. One section was assessed with exams in the traditional MC format, the other section was assessed with both MC and constructed-response (CR) questions. The mixed exam format was correlated with significantly more cognitively active study behaviors and a significantly better performance on the cumulative final exam (after accounting for grade point average and gender). There was also less gender-bias in the CR answers. This suggests that the MC-only exam format indeed hinders critical thinking in introductory science classes. Introducing CR questions encouraged students to learn more and to be better critical thinkers and reduced gender bias. However, student resistance increased as students adjusted their perceptions of their own critical-thinking abilities. PMID:22949426

  11. Generalizability theory and item response theory

    Glas, Cornelis A.W.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a selected-response format. This chapter presents a short overview of how item response theory and generalizability theory were integrated to model such assessments. Further, the precision of the esti...

  12. Three Modeling Applications to Promote Automatic Item Generation for Examinations in Dentistry.

    Lai, Hollis; Gierl, Mark J; Byrne, B Ellen; Spielman, Andrew I; Waldschmidt, David M

    2016-03-01

    Test items created for dentistry examinations are often individually written by content experts. This approach to item development is expensive because it requires the time and effort of many content experts but yields relatively few items. The aim of this study was to describe and illustrate how items can be generated using a systematic approach. Automatic item generation (AIG) is an alternative method that allows a small number of content experts to produce large numbers of items by integrating their domain expertise with computer technology. This article describes and illustrates how three modeling approaches to item content (item cloning, cognitive modeling, and image-anchored modeling) can be used to generate large numbers of multiple-choice test items for examinations in dentistry. Test items can be generated by combining the expertise of two content specialists with technology supported by AIG. A total of 5,467 new items were created during this study. From substitution of item content, to modeling appropriate responses based upon a cognitive model of correct responses, to generating items linked to specific graphical findings, AIG has the potential for meeting increasing demands for test items. Further, the methods described in this study can be generalized and applied to many other item types. Future research applications for AIG in dental education are discussed.

  13. Overcoming the effects of differential skewness of test items in scale construction

    Johann M. Schepers

    2004-10-01

    The principal objective of the study was to develop a procedure for overcoming the effects of differential skewness of test items in scale construction. It was shown that the degree of skewness of test items places an upper limit on the correlations between the items, regardless of the content of the items. If the items are ordered in terms of skewness, the resulting intercorrelation matrix forms a simplex or a pseudo-simplex. Factoring such a matrix results in a multiplicity of factors, most of which are artifacts. A procedure for overcoming this problem was demonstrated with items from the Locus of Control Inventory (Schepers, 1995). The analysis was based on a sample of 1662 first-year university students.
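    The upper limit that skewness places on inter-item correlations can be made concrete for dichotomous items: the maximum attainable phi coefficient depends only on the two items' endorsement rates, not on their content. A small sketch (the rates below are illustrative, not taken from the study):

```python
import math

def phi_max(p1, p2):
    """Upper bound on the Pearson (phi) correlation between two 0/1 items
    with endorsement rates p1 <= p2. Attained when everyone who endorses
    the rarer item also endorses the commoner one; depends only on the
    marginals, i.e. on item skewness."""
    assert 0 < p1 <= p2 < 1
    return math.sqrt(p1 * (1 - p2) / ((1 - p1) * p2))

# Oppositely skewed items (hypothetical rates): their correlation can
# never exceed 0.25, however similar the item content is.
print(phi_max(0.2, 0.8))  # 0.25

# Equally skewed items face no such cap:
print(phi_max(0.2, 0.2))  # 1.0
```

This is why ordering items by skewness produces the simplex-like correlation structure the abstract describes: adjacent items (similar skewness) can correlate highly, while distant items cannot.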

  14. An Effective Multimedia Item Shell Design for Individualized Education: The Crome Project

    Irene Cheng

    2008-01-01

    There are several advantages to creating multimedia item types and applying computer-based adaptive testing in education. First is the capability to motivate learning by making learners feel more engaged in an interactive environment. Second is a better concept representation, which is not possible in conventional multiple-choice tests. Third is the advantage of individualized curriculum design, rather than a curriculum designed for an average student. Fourth is a good choice of the next question, with an appropriate difficulty level based on a student's response to the current question. However, many issues need to be addressed to achieve these goals, including: (a) the large number of item types required to represent the current multiple-choice questions in multimedia formats, (b) the criterion used to determine the difficulty level of a multimedia question item, and (c) the methodology applied to the question selection process for individual students. In this paper, we propose a multimedia item shell design that not only reduces the number of item types required, but also computes the difficulty level of an item automatically. The concept of a question seed is introduced to make content creation more cost-effective. The proposed item shell framework facilitates efficient communication between user responses at the client and the scoring agents integrated with a student ability assessor at the server. We also describe approaches for automatically estimating the difficulty level of questions, and discuss a preliminary evaluation of multimedia item types by students.

  15. Does Educator Training or Experience Affect the Quality of Multiple-Choice Questions?

    Webb, Emily M; Phuong, Jonathan S; Naeger, David M

    2015-10-01

    Physicians receive little training in proper multiple-choice question (MCQ) writing methods. Well-constructed MCQs follow rules which ensure that a question tests what it is intended to test. Questions that break these rules are described as "flawed." We examined whether the prevalence of flawed questions differed significantly between those with and without prior training in question writing, and between those with different levels of educator experience. We assessed 200 unedited MCQs from a question bank for our senior medical student radiology elective: an equal number of questions (50) were written by faculty with previous training in MCQ writing, other faculty, residents, and medical students. Questions were scored independently by two readers for the presence of 11 distinct flaws described in the literature. Questions written by faculty with MCQ writing training had significantly fewer errors: a mean of 0.4 errors per question compared to a mean of 1.5-1.7 errors per question for the other groups. Educator experience alone had no effect on the frequency of flaws; faculty without dedicated training, residents, and students performed similarly. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.

  16. Validity of the ISUOG basic training test

    Hillerup, Niels Emil; Tabor, Ann; Konge, Lars

    2018-01-01

    A certain level of theoretical knowledge is required when performing basic obstetrical and gynecological ultrasound. To assess the adequacy of trainees' basic theoretical knowledge, the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) has developed a theoretical test of 49 Multiple Choice Questionnaire (MCQ) items for their basic training courses.

  17. Project Physics Tests 4, Light and Electromagnetism.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 4 are presented in this booklet. Included are 70 multiple-choice and 22 problem-and-essay questions. Concepts of light and electromagnetism are examined on charges, reflection, electrostatic forces, electric potential, speed of light, electromagnetic waves and radiations, Oersted's and Faraday's work,…

  18. A Comparison of Procedures for Content-Sensitive Item Selection in Computerized Adaptive Tests.

    Kingsbury, G. Gage; Zara, Anthony R.

    1991-01-01

    This simulation investigated two procedures that reduce differences between paper-and-pencil testing and computerized adaptive testing (CAT) by making CAT content sensitive. Results indicate that the price in terms of additional test items of using constrained CAT for content balancing is much smaller than that of using testlets. (SLD)

  19. Using Set Covering with Item Sampling to Analyze the Infeasibility of Linear Programming Test Assembly Models

    Huitzing, Hiddo A.

    2004-01-01

    This article shows how set covering with item sampling (SCIS) methods can be used in the analysis and preanalysis of linear programming models for test assembly (LPTA). LPTA models can construct tests, fulfilling a set of constraints set by the test assembler. Sometimes, no solution to the LPTA model exists. The model is then said to be…

  20. Development of abbreviated eight-item form of the Penn Verbal Reasoning Test.

    Bilker, Warren B; Wierzbicki, Michael R; Brensinger, Colleen M; Gur, Raquel E; Gur, Ruben C

    2014-12-01

    The ability to reason with language is a highly valued cognitive capacity that correlates with IQ measures and is sensitive to damage in language areas. The Penn Verbal Reasoning Test (PVRT) is a 29-item computerized test for measuring abstract analogical reasoning abilities using language. The full test can take over half an hour to administer, which limits its applicability in large-scale studies. We previously described a procedure for abbreviating a clinical rating scale and a modified procedure for reducing tests with a large number of items. Here we describe the application of the modified method to reducing the number of items in the PVRT to a parsimonious subset of items that accurately predicts the total score. As in our previous reduction studies, a split sample is used for model fitting and validation, with cross-validation to verify results. We find that an 8-item scale predicts the total 29-item score well, achieving a correlation of .9145 for the reduced form for the model fitting sample and .8952 for the validation sample. The results indicate that a drastically abbreviated version, which cuts administration time by more than 70%, can be safely administered as a predictor of PVRT performance. © The Author(s) 2014.
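    The validation step described above, checking how well a short-form score tracks the full-scale total on held-out data, reduces to a simple correlation. A minimal illustration on a hypothetical response matrix (the 6-item test and the choice of short-form items are made up for the sketch; they are not the PVRT's):

```python
def pearson(x, y):
    """Plain Pearson correlation, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

# Each row: one examinee's 0/1 responses to a hypothetical 6-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
short_form = [r[0] + r[3] for r in responses]   # items 1 and 4 as a 2-item short form
full_scale = [sum(r) for r in responses]
print(round(pearson(short_form, full_scale), 3))  # 0.937
```

In the actual study, the item subset is chosen by model fitting on one half-sample and the correlation is then confirmed on the other half, guarding against selecting items that only fit the first sample by chance.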

  2. Item Response Theory Analyses of the Cambridge Face Memory Test (CFMT)

    Cho, Sun-Joo; Wilmer, Jeremy; Herzmann, Grit; McGugin, Rankin; Fiset, Daniel; Van Gulick, Ana E.; Ryan, Katie; Gauthier, Isabel

    2014-01-01

    We evaluated the psychometric properties of the Cambridge face memory test (CFMT; Duchaine & Nakayama, 2006). First, we assessed the dimensionality of the test with a bi-factor exploratory factor analysis (EFA). This EFA revealed a general factor and three specific factors clustered by targets of the CFMT. However, the three specific factors appeared to be minor factors that can be ignored. Second, we fit a unidimensional item response model. This item response model showed that the CFMT items could discriminate individuals at different ability levels and covered a wide range of the ability continuum. We found the CFMT to be particularly precise for a wide range of ability levels. Third, we implemented item response theory (IRT) differential item functioning (DIF) analyses for each gender group and two age groups (Age ≤ 20 versus Age > 21). This DIF analysis suggested little evidence of consequential differential functioning on the CFMT for these groups, supporting the use of the test to compare older to younger, or male to female, individuals. Fourth, we tested for a gender difference on the latent facial recognition ability with an explanatory item response model. We found a significant but small gender difference on the latent ability for face recognition, which was higher for women than men by 0.184, at age mean 23.2, controlling for linear and quadratic age effects. Finally, we discuss the practical considerations of the use of total scores versus IRT scale scores in applications of the CFMT. PMID:25642930

  3. The impact of two multiple-choice question formats on the problem-solving strategies used by novices and experts.

    Coderre, Sylvain P; Harasym, Peter; Mandin, Henry; Fick, Gordon

    2004-11-05

    Pencil-and-paper examination formats, and specifically the standard five-option multiple-choice question, have often been questioned as a means of assessing higher-order clinical reasoning or problem solving. This study first investigated whether two paper formats with differing numbers of alternatives (standard five-option and extended-matching questions) can test problem-solving abilities. Second, the impact of the number of alternatives on psychometric properties and problem-solving strategies was examined. Think-aloud protocols were collected to determine the problem-solving strategies used by experts and non-experts in answering Gastroenterology questions across the two pencil-and-paper formats. The two formats demonstrated equal ability to test problem-solving abilities, while the number of alternatives did not significantly affect the psychometric properties or the problem-solving strategies utilized. These results support the notion that well-constructed multiple-choice questions can in fact test higher-order clinical reasoning. Furthermore, it can be concluded that in testing clinical reasoning, the question stem, or content, remains more important than the number of alternatives.

  4. Benford’s Law: Textbook Exercises and Multiple-Choice Testbanks

    Slepkov, Aaron D.; Ironside, Kevin B.; DiBattista, David

    2015-01-01

    Benford’s Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford’s Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford’s Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford’s Law, the testbank is nonetheless secure against such a Benford’s attack for banal reasons. PMID:25689468
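    The quoted leading-digit frequencies, 30.1% for digit 1 and 4.6% for digit 9, follow directly from Benford's logarithmic distribution, P(d) = log10(1 + 1/d), which a few lines of code can verify:

```python
import math

# Expected leading-digit frequencies under Benford's Law.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(d, f"{p:.1%}")
# digit 1 -> 30.1%, digit 9 -> 4.6%; the nine probabilities sum to 1
# because the terms telescope to log10(10).
```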

  5. Pushing Critical Thinking Skills With Multiple-Choice Questions: Does Bloom's Taxonomy Work?

    Zaidi, Nikki L Bibler; Grob, Karri L; Monrad, Seetha M; Kurtz, Joshua B; Tai, Andrew; Ahmed, Asra Z; Gruppen, Larry D; Santen, Sally A

    2018-06-01

    Medical school assessments should foster the development of higher-order thinking skills to support clinical reasoning and a solid foundation of knowledge. Multiple-choice questions (MCQs) are commonly used to assess student learning, and well-written MCQs can support learner engagement in higher levels of cognitive reasoning such as application or synthesis of knowledge. Bloom's taxonomy has been used to identify MCQs that assess students' critical thinking skills, with evidence suggesting that higher-order MCQs support a deeper conceptual understanding of scientific process skills. Similarly, clinical practice also requires learners to develop higher-order thinking skills that include all of Bloom's levels. Faculty question writers and examinees may approach the same material differently based on varying levels of knowledge and expertise, and these differences can influence the cognitive levels being measured by MCQs. Consequently, faculty question writers may perceive that certain MCQs require higher-order thinking skills to process the question, whereas examinees may only need to employ lower-order thinking skills to render a correct response. Likewise, seemingly lower-order questions may actually require higher-order thinking skills to respond correctly. In this Perspective, the authors describe some of the cognitive processes examinees use to respond to MCQs. The authors propose that various factors affect both the question writer and examinee's interaction with test material and subsequent cognitive processes necessary to answer a question.

  6. Test-retest reliability of Eurofit Physical Fitness items for children with visual impairments

    Houwen, Suzanne; Visscher, Chris; Hartman, Esther; Lemmink, Koen A. P. M.

    The purpose of this study was to examine the test-retest reliability of physical fitness items from the European Test of Physical Fitness (Eurofit) for children with visual impairments. A sample of 21 children, ages 6-12 years, that were recruited from a special school for children with visual

  7. The Relative Importance of Persons, Items, Subtests, and Languages to TOEFL Test Variance.

    Brown, James Dean

    1999-01-01

    Explored the relative contributions to Test of English as a Foreign Language (TOEFL) score dependability of various numbers of persons, items, subtests, languages, and their various interactions. Sampled 15,000 test takers, 1000 each from 15 different language backgrounds. (Author/VWL)

  8. Power and Sample Size Calculations for Logistic Regression Tests for Differential Item Functioning

    Li, Zhushan

    2014-01-01

    Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for the power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model.…
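    For intuition, the simplest case, a two-sided 1-df Wald test of a single DIF coefficient, reduces to the familiar normal-approximation power formula. The sketch below is that textbook special case, not the paper's exact derivations, and the effect size and variance figures are hypothetical:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def wald_power(beta, se, alpha=0.05):
    """Two-sided power of a 1-df Wald test of a DIF coefficient beta whose
    estimator has standard error se, by the usual normal approximation."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    delta = beta / se
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)

def required_n(beta, sigma, power=0.80, alpha=0.05):
    """Sample size such that SE = sigma / sqrt(n) yields the target power
    (sigma is the assumed per-observation scaling of the standard error)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(power)
    return ((z_crit + z_pow) * sigma / beta) ** 2

# Hypothetical numbers: a DIF log-odds of 0.4 with sigma = 5.
n = required_n(beta=0.4, sigma=5, power=0.80)
print(round(n))                                  # 1226 examinees for 80% power
print(round(wald_power(0.4, 5 / n ** 0.5), 2))   # 0.8 at that sample size
```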

  9. Use of differential item functioning (DIF) analysis for bias analysis in test construction

    Marié De Beer

    2004-10-01

    Where differential item functioning (DIF) procedures for item analysis based on item response theory (IRT) are used during test construction, item characteristic curves for the same item can be plotted for different subgroups. These curves indicate how each item functions for the different subgroups at different ability levels. DIF is indicated by the area between the curves. DIF was used in the construction of the Learning Potential Computerised Adaptive Test (LPCAT) to identify items that showed bias with respect to gender, culture, language, or level of training. Items that exceeded a predetermined level of DIF were omitted from the final item bank, regardless of which subgroup was advantaged or disadvantaged. The process and results of the DIF analysis are discussed.

  10. Fostering a student's skill for analyzing test items through an authentic task

    Setiawan, Beni; Sabtiawan, Wahyu Budi

    2017-08-01

    Analyzing test items is a skill that must be mastered by prospective teachers in order to determine the quality of the test questions they have written. The main aim of this research was to describe the effectiveness of an authentic task in fostering students' skill at analyzing test items, covering validity, reliability, item discrimination index, level of difficulty, and distractor functioning. The participants were students of the science education study program, science and mathematics faculty, Universitas Negeri Surabaya, enrolled in the assessment course. The research design was a one-group posttest design. The treatment was an authentic task in which the students developed test items and then analyzed the items like professional assessors using Microsoft Excel and Anates software. The data obtained were analyzed descriptively: the students' skill levels were presented and then related to theories and previous empirical studies. The research showed that the task helped the students acquire these skills. Thirty-one students got a perfect score for the analysis, five students achieved 97% mastery, two students had 92% mastery, and another two students got 89% and 79% mastery. The implication of this finding is that when students are given authentic tasks that require them to perform like professionals, they are more likely to achieve professional skills by the end of the course.
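    Two of the indices the students compute, item difficulty and the upper-lower discrimination index, can be sketched for dichotomous items. The response matrix and the 27% grouping fraction below are illustrative (27% is the conventional Kelley cut, not a value from the study):

```python
def difficulty(item_col):
    """Proportion correct: p near 1 = easy item, near 0 = hard item."""
    return sum(item_col) / len(item_col)

def discrimination(item_col, totals, frac=0.27):
    """Upper-lower discrimination index D = p_upper - p_lower, comparing
    the top and bottom fraction of examinees ranked by total score."""
    k = max(1, round(frac * len(totals)))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:k], order[-k:]
    p_low = sum(item_col[i] for i in lower) / k
    p_high = sum(item_col[i] for i in upper) / k
    return p_high - p_low

responses = [  # 6 examinees x 4 items, hypothetical; 1 = correct
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
totals = [sum(r) for r in responses]
for j in range(4):
    col = [r[j] for r in responses]
    print(f"item {j}: p = {difficulty(col):.2f}, D = {discrimination(col, totals):.2f}")
```

Items with D near 0 (or negative) fail to separate strong from weak examinees and are candidates for revision, which is exactly the judgment the authentic task asks students to make.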

  11. Using response-time constraints in item selection to control for differential speededness in computerized adaptive testing

    van der Linden, Willem J.; Scrams, David J.; Schnipke, Deborah L.

    2003-01-01

    This paper proposes an item selection algorithm that can be used to neutralize the effect of time limits in computer adaptive testing. The method is based on a statistical model for the response-time distributions of the test takers on the items in the pool that is updated each time a new item has

  12. Overview of Classical Test Theory and Item Response Theory for Quantitative Assessment of Items in Developing Patient-Reported Outcome Measures

    Cappelleri, Joseph C.; Lundy, J. Jason; Hays, Ron D.

    2014-01-01

    Introduction The U.S. Food and Drug Administration’s patient-reported outcome (PRO) guidance document defines content validity as “the extent to which the instrument measures the concept of interest” (FDA, 2009, p. 12). “Construct validity is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity” (Strauss & Smith, 2009, p. 7). Hence both qualitative and quantitative information are essential in evaluating the validity of measures. Methods We review classical test theory and item response theory approaches to evaluating PRO measures including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized “difficulty” (severity) order of items is represented by observed responses. Conclusion Classical test theory and item response theory can be useful in providing a quantitative assessment of items and scales during the content validity phase of patient-reported outcome measures. Depending on the particular type of measure and the specific circumstances, either one or both approaches should be considered to help maximize the content validity of PRO measures. PMID:24811753
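    Two of the classical-test-theory checks listed above, response-category frequencies and floor/ceiling effects, are straightforward to compute. A sketch on a hypothetical 5-point, 3-item PRO scale (the >15% flagging threshold is a common rule of thumb, not from this paper):

```python
from collections import Counter

responses = [  # rows = respondents, columns = three items scored 1..5 (hypothetical)
    [1, 1, 1], [1, 2, 1], [3, 3, 2], [5, 5, 5],
    [4, 5, 4], [2, 2, 3], [1, 1, 1], [5, 5, 5],
]
n_items, lo, hi = 3, 1, 5

# Frequency of responses to each category of each item.
for j in range(n_items):
    counts = Counter(r[j] for r in responses)
    print(f"item {j} category frequencies:", dict(sorted(counts.items())))

# Floor/ceiling effects: share of respondents at the scale's extremes.
scores = [sum(r) for r in responses]
floor = sum(s == lo * n_items for s in scores) / len(scores)
ceiling = sum(s == hi * n_items for s in scores) / len(scores)
print(f"floor {floor:.0%}, ceiling {ceiling:.0%}")  # >15% at either end is often flagged
```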

  13. Science Literacy: How do High School Students Solve PISA Test Items?

    Wati, F.; Sinaga, P.; Priyandoko, D.

    2017-09-01

    The Programme for International Student Assessment (PISA) assesses students' science literacy in real-life contexts and a wide variety of situations. The results therefore do not provide adequate information for teachers to probe students' science literacy, because the range of material taught at school depends on the curriculum used. This study aims to investigate how junior high school students in Indonesia solve PISA test items. Data were collected by administering the PISA test items of the greenhouse unit to 36 students of 9th grade. Students' answers were analyzed qualitatively for each item based on the competence tested in the problem. The way students answer a problem exhibits their ability in the particular competence, which is influenced by a number of factors: students' unfamiliarity with the test construction, weak reading performance, difficulty connecting the available information to the question, and limited ability to express their ideas effectively and in an easy-to-read way. Accordingly, selected PISA test items can be used alongside the teaching topics taught to familiarize students with science literacy.

  14. FormScanner: Open-Source Solution for Grading Multiple-Choice Exams

    Young, Chadwick; Lo, Glenn; Young, Kaisa; Borsetta, Alberto

    2016-01-01

    The multiple-choice exam remains a staple for many introductory physics courses. In the past, people have graded these by hand or even with flaming needles. Today, one usually grades the exams with a form scanner that utilizes optical mark recognition (OMR). Several companies provide these scanners and particular forms, such as the eponymous…

  15. Consequences the extensive use of multiple-choice questions might have on student's reasoning structure

    Raduta, C. M.

    2013-01-01

    Learning physics is a context dependent process. I consider a broader interdisciplinary problem of where differences in understanding and reasoning arise. I suggest the long run effects a multiple choice based learning system as well as society cultural habits and rules might have on student reasoning structure.

  16. Student-Generated Content: Enhancing Learning through Sharing Multiple-Choice Questions

    Hardy, Judy; Bates, Simon P.; Casey, Morag M.; Galloway, Kyle W.; Galloway, Ross K.; Kay, Alison E.; Kirsop, Peter; McQueen, Heather A.

    2014-01-01

    The relationship between students' use of PeerWise, an online tool that facilitates peer learning through student-generated content in the form of multiple-choice questions (MCQs), and achievement, as measured by their performance in the end-of-module examinations, was investigated in 5 large early-years science modules (in physics, chemistry and…

  17. Validity and Reliability of Scores Obtained on Multiple-Choice Questions: Why Functioning Distractors Matter

    Ali, Syed Haris; Carr, Patrick A.; Ruit, Kenneth G.

    2016-01-01

    Plausible distractors are important for accurate measurement of knowledge via multiple-choice questions (MCQs). This study demonstrates the impact of higher distractor functioning on validity and reliability of scores obtained on MCQs. Free-response (FR) and MCQ versions of a neurohistology practice exam were given to four cohorts of Year 1 medical…

  18. Visual Attention for Solving Multiple-Choice Science Problem: An Eye-Tracking Analysis

    Tsai, Meng-Jung; Hou, Huei-Tse; Lai, Meng-Lung; Liu, Wan-Yi; Yang, Fang-Ying

    2012-01-01

    This study employed an eye-tracking technique to examine students' visual attention when solving a multiple-choice science problem. Six university students participated in a problem-solving task to predict occurrences of landslide hazards from four images representing four combinations of four factors. Participants' responses and visual attention…

  19. Does Correct Answer Distribution Influence Student Choices When Writing Multiple Choice Examinations?

    Carnegie, Jacqueline A.

    2017-01-01

    Summative evaluation for large classes of first- and second-year undergraduate courses often involves the use of multiple choice question (MCQ) exams in order to provide timely feedback. Several versions of those exams are often prepared via computer-based question scrambling in an effort to deter cheating. An important parameter to consider when…

  20. The Incidence of Clueing in Multiple Choice Testbank Questions in Accounting: Some Evidence from Australia

    Ibbett, Nicole L.; Wheldon, Brett J.

    2016-01-01

    In 2014 Central Queensland University (CQU) in Australia banned the use of multiple choice questions (MCQs) as an assessment tool. One of the reasons given for this decision was that MCQs provide an opportunity for students to "pass" by merely guessing their answers. The mathematical likelihood of a student passing by guessing alone can…
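    The likelihood of passing by guessing alone is a binomial tail probability. A sketch for a hypothetical exam (20 questions, 5 options each, 50% pass mark; none of these figures come from the CQU policy):

```python
from math import ceil, comb

def p_pass_by_guessing(n_questions=20, n_options=5, pass_frac=0.5):
    """Binomial tail: probability of answering at least pass_frac of
    n_questions correctly when every answer is a uniform random guess."""
    p = 1 / n_options
    need = ceil(pass_frac * n_questions)
    return sum(comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
               for k in range(need, n_questions + 1))

print(f"{p_pass_by_guessing():.4f}")              # 0.0026 with five options
print(f"{p_pass_by_guessing(n_options=2):.4f}")   # 0.5881 with true/false items
```

With five well-functioning options the pure-guessing pass rate is well under one percent; the risk grows sharply as the effective number of plausible options shrinks.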

  1. The Use of Management and Marketing Textbook Multiple-Choice Questions: A Case Study.

    Hampton, David R.; And Others

    1993-01-01

    Four management and four marketing professors classified multiple-choice questions in four widely adopted introductory textbooks according to the two levels of Bloom's taxonomy of educational objectives: knowledge and intellectual ability and skill. Inaccuracies may cause instructors to select questions that require less thinking than they intend.…

  2. Comparison of performance on multiple-choice questions and open-ended questions in an introductory astronomy laboratory

    Michelle M. Wooten; Adrienne M. Cool; Edward E. Prather; Kimberly D. Tanner

    2014-01-01

    When considering the variety of questions that can be used to measure students' learning, instructors may choose to use multiple-choice questions, which are easier to score than responses to open-ended questions. However, by design, analyses of multiple-choice responses cannot describe all of students' understanding. One method that can be used to learn more about students' learning is the analysis of the open-ended responses students provide when explaining their multiple-choice response. I...

  3. Project Physics Tests 2, Motion in the Heavens.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 2 are presented in this booklet. Included are 70 multiple-choice and 22 problem-and-essay questions. Concepts of motion in the heavens are examined for planetary motions, heliocentric theory, forces exerted on the planets, Kepler's laws, gravitational force, Galileo's work, satellite orbits, Jupiter's…

  4. Project Physics Tests 3, The Triumph of Mechanics.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 3 are presented in this booklet. Included are 70 multiple-choice and 20 problem-and-essay questions. Concepts of mechanics are examined on energy, momentum, kinetic theory of gases, pulse analyses, "heat death," water waves, power, conservation laws, normal distribution, thermodynamic laws, and…

  5. Nursing Faculty Decision Making about Best Practices in Test Construction, Item Analysis, and Revision

    Killingsworth, Erin Elizabeth

    2013-01-01

    With the widespread use of classroom exams in nursing education there is a great need for research on current practices in nursing education regarding this form of assessment. The purpose of this study was to explore how nursing faculty members make decisions about using best practices in classroom test construction, item analysis, and revision in…

  6. Sensitivity and specificity of the 3-item memory test in the assessment of post traumatic amnesia.

    Andriessen, T.M.J.C.; Jong, B. de; Jacobs, B.; Werf, S.P. van der; Vos, P.E.

    2009-01-01

    PRIMARY OBJECTIVE: To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). METHODS: Daily

  7. Development of Abbreviated Nine-Item Forms of the Raven's Standard Progressive Matrices Test

    Bilker, Warren B.; Hansen, John A.; Brensinger, Colleen M.; Richard, Jan; Gur, Raquel E.; Gur, Ruben C.

    2012-01-01

    The Raven's Standard Progressive Matrices (RSPM) is a 60-item test for measuring abstract reasoning, considered a nonverbal estimate of fluid intelligence, and often included in clinical assessment batteries and research on patients with cognitive deficits. The goal was to develop and apply a predictive model approach to reduce the number of items…

  8. Reading ability and print exposure: item response theory analysis of the author recognition test.

    Moore, Mariah; Gordon, Peter C

    2015-12-01

    In the author recognition test (ART), participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, and this predictive ability is generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. In this large-scale study (1,012 college student participants), we used item response theory (IRT) to analyze item (author) characteristics in order to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and optimize scoring of the ART. Factor analysis suggested a potential two-factor structure of the ART, differentiating between literary and popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of the time spent encoding words, as measured using eyetracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Furthermore, they show that frequency data can be used to select items of appropriate difficulty, and that frequency data from corpora based on particular time periods and types of texts may allow adaptations of the test for different populations.
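    The two scoring rules discussed above, the standard names-minus-foils score and a variant with a heavier penalty on selected foils, can be sketched as follows; the function name and penalty values are illustrative, not taken from the study.

```python
def art_score(hits, foils, foil_penalty=1.0):
    """Score an Author Recognition Test response sheet.

    Standard scoring uses foil_penalty=1.0 (authors selected minus
    foils selected); a larger penalty discourages indiscriminate
    guessing, as suggested by the IRT analysis.
    """
    return hits - foil_penalty * foils

# Standard scoring: 20 authors and 2 foils selected
assert art_score(20, 2) == 18
# Heavier penalty on foils
assert art_score(20, 2, foil_penalty=2.0) == 16
```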

  9. A Method for the Comparison of Item Selection Rules in Computerized Adaptive Testing

    Barrada, Juan Ramon; Olea, Julio; Ponsoda, Vicente; Abad, Francisco Jose

    2010-01-01

    In a typical study comparing the relative efficiency of two item selection rules in computerized adaptive testing, the common result is that they simultaneously differ in accuracy and security, making it difficult to reach a conclusion on which is the more appropriate rule. This study proposes a strategy to conduct a global comparison of two or…

  10. The Disaggregation of Value-Added Test Scores to Assess Learning Outcomes in Economics Courses

    Walstad, William B.; Wagner, Jamie

    2016-01-01

    This study disaggregates posttest, pretest, and value-added or difference scores in economics into four types of economic learning: positive, retained, negative, and zero. The types are derived from patterns of student responses to individual items on a multiple-choice test. The micro and macro data from the "Test of Understanding in College…

  11. Construction of a valid and reliable test to determine knowledge on ...

    Objective: The construction of a questionnaire, in the format of a test, to determine knowledge on dietary fat of higher-educated young adults. Design: The topics on dietary fat included were in accordance with those tested in other studies. Forty multiple-choice items were drafted as questions and incomplete statements ...

  12. Easy and Informative: Using Confidence-Weighted True-False Items for Knowledge Tests in Psychology Courses

    Dutke, Stephan; Barenberg, Jonathan

    2015-01-01

    We introduce a specific type of item for knowledge tests, confidence-weighted true-false (CTF) items, and review experiences of its application in psychology courses. A CTF item is a statement about the learning content to which students respond whether the statement is true or false, and they rate their confidence level. Previous studies using…

  13. The Dysexecutive Questionnaire advanced: item and test score characteristics, 4-factor solution, and severity classification.

    Bodenburg, Sebastian; Dopslaff, Nina

    2008-01-01

    The Dysexecutive Questionnaire (DEX; Behavioral assessment of the dysexecutive syndrome, 1996) is a standardized instrument to measure possible behavioral changes as a result of the dysexecutive syndrome. Although initially intended only as a qualitative instrument, the DEX has also been used increasingly to address quantitative questions. Until now there have been no more fundamental statistical analyses of the questionnaire's testing quality. The present study is based on an unselected sample of 191 patients with acquired brain injury and reports data on the quality of the items, the reliability, and the factorial structure of the DEX. Item 3 displayed too extreme an item difficulty, whereas item 11 was not sufficiently discriminating. The DEX's reliability in self-rating is r = 0.85. In addition to presenting the statistical values of the tests, a clinical severity classification of the overall scores of the four factors found, and of the questionnaire as a whole, is carried out on the basis of quartile standards.
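    The item-level statistics reported here (item difficulty and discriminating power) are standard classical-test-theory quantities. A minimal sketch, assuming a simplified 0/1 item-response matrix rather than the DEX's actual rating format:

```python
def item_stats(responses):
    """Classical item analysis for a 0/1 response matrix.

    responses[i][j] is 1 if examinee i endorsed/passed item j, else 0.
    Returns per-item difficulty (proportion endorsing) and corrected
    item-total discrimination (correlation with the rest score).
    """
    n, k = len(responses), len(responses[0])
    difficulty = [sum(row[j] for row in responses) / n for j in range(k)]

    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return sxy / (sx * sy) if sx and sy else 0.0

    totals = [sum(row) for row in responses]
    discrimination = [
        corr([row[j] for row in responses],
             [t - row[j] for row, t in zip(responses, totals)])
        for j in range(k)
    ]
    return difficulty, discrimination
```

    An item with difficulty near 0 or 1, or discrimination near 0, is a candidate for revision or removal, as with items 3 and 11 above.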

  14. Prediction of true test scores from observed item scores and ancillary data.

    Haberman, Shelby J; Yao, Lili; Sinharay, Sandip

    2015-05-01

    In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability. © 2015 The British Psychological Society.
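    For a single observed score, the best linear predictor of the true score described above reduces to Kelley's classical formula, which shrinks the observed score toward the group mean in proportion to the test's unreliability. A minimal sketch with invented numbers:

```python
def predicted_true_score(observed, group_mean, reliability):
    """Kelley's formula: the best linear predictor of an examinee's
    true score under classical test theory. With reliability 1 the
    observed score is returned unchanged; with reliability 0 the
    prediction collapses to the group mean.
    """
    return reliability * observed + (1 - reliability) * group_mean

# Invented example: observed 80, group mean 70, reliability 0.75
assert predicted_true_score(80, 70, 0.75) == 77.5
```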

  15. Sensitivity and specificity of the 3-item memory test in the assessment of post traumatic amnesia.

    Andriessen, Teuntje M J C; de Jong, Ben; Jacobs, Bram; van der Werf, Sieberen P; Vos, Pieter E

    2009-04-01

    To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). Daily testing was performed in 64 consecutively admitted traumatic brain injured patients, 22 orthopedically injured patients and 26 healthy controls until criteria for resolution of PTA were reached. Subjects were randomly assigned to a test with visual or verbal stimuli. Short delay reproduction was tested after an interval of 3-5 minutes, long delay reproduction was tested after 24 hours. Sensitivity and specificity were calculated over the first 4 test days. The 3-word test showed higher sensitivity than the 3-picture test, while specificity of the two tests was equally high. Free recall was a more effortful task than recognition for both patients and controls. In patients, a longer delay between registration and recall resulted in a significant decrease in the number of items reproduced. Presence of PTA is best assessed with a memory test that incorporates the free recall of words after a long delay.
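    The sensitivity and specificity figures reported here are ratios over the diagnostic 2x2 table. A minimal sketch; the counts in the example are invented, not the study's data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): proportion of true cases detected.
    Specificity = TN / (TN + FP): proportion of non-cases correctly
    cleared.
    """
    return tp / (tp + fn), tn / (tn + fp)

# Invented example counts, not taken from the study
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=19, fp=1)
```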

  16. Analysis of the benefits of designing and implementing a virtual didactic model of multiple choice exam and problem-solving heuristic report, for first year engineering students

    Bennun, Leonardo; Santibanez, Mauricio

    2015-01-01

    Improvements in performance and approval rates obtained by first-year engineering students from the University of Concepcion, Chile, were studied once a virtual didactic model of multiple-choice exams was implemented. This virtual learning resource was implemented on the Web ARCO platform and allows training by facing test models comparable in both time and difficulty to those that the students will have to solve during the course. It also provides a feedback mechanism for both: 1) The students, since they c...

  17. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures.

    Cappelleri, Joseph C; Jason Lundy, J; Hays, Ron D

    2014-05-01

    The US Food and Drug Administration's guidance for industry document on patient-reported outcomes (PRO) defines content validity as "the extent to which the instrument measures the concept of interest" (FDA, 2009, p. 12). According to Strauss and Smith (2009), construct validity "is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity" (p. 7). Hence, both qualitative and quantitative information are essential in evaluating the validity of measures. We review classical test theory and item response theory (IRT) approaches to evaluating PRO measures, including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized "difficulty" (severity) order of items is represented by observed responses. If a researcher has little qualitative data and wants to get preliminary information about the content validity of the instrument, then descriptive assessments using classical test theory should be the first step. As the sample size grows during subsequent stages of instrument development, confidence in the numerical estimates from Rasch and other IRT models (as well as those of classical test theory) would also grow. Classical test theory and IRT can be useful in providing a quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, classical test theory and/or IRT should be considered to help maximize the content validity of PRO measures. Copyright © 2014 Elsevier HS Journals, Inc. All rights reserved.
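    Among the classical-test-theory descriptives listed above, floor and ceiling effects are simply the proportions of respondents scoring at the scale extremes. A minimal sketch:

```python
def floor_ceiling(scores, minimum, maximum):
    """Proportions of respondents at the scale floor and ceiling.

    Large proportions at either extreme suggest the scale cannot
    discriminate among respondents with very low or very high levels
    of the measured construct.
    """
    n = len(scores)
    floor = sum(s == minimum for s in scores) / n
    ceiling = sum(s == maximum for s in scores) / n
    return floor, ceiling

# Invented scores on a hypothetical 0-5 scale
floor, ceiling = floor_ceiling([0, 1, 2, 5, 5], minimum=0, maximum=5)
```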

  18. Item analysis of the Spanish version of the Boston Naming Test with a Spanish speaking adult population from Colombia.

    Kim, Stella H; Strutt, Adriana M; Olabarrieta-Landa, Laiene; Lequerica, Anthony H; Rivera, Diego; De Los Reyes Aragon, Carlos Jose; Utria, Oscar; Arango-Lasprilla, Juan Carlos

    2018-02-23

    The Boston Naming Test (BNT) is a widely used measure of confrontation naming ability that has been criticized for its questionable construct validity for non-English speakers. This study investigated item difficulty and construct validity of the Spanish version of the BNT to assess cultural and linguistic impact on performance. Subjects were 1298 healthy Spanish speaking adults from Colombia. They were administered the 60- and 15-item Spanish version of the BNT. A Rasch analysis was computed to assess dimensionality, item hierarchy, targeting, reliability, and item fit. Both versions of the BNT satisfied requirements for unidimensionality. Although internal consistency was excellent for the 60-item BNT, order of difficulty did not increase consistently with item number and there were a number of items that did not fit the Rasch model. For the 15-item BNT, a total of 5 items changed position on the item hierarchy with 7 poor fitting items. Internal consistency was acceptable. Construct validity of the BNT remains a concern when it is administered to non-English speaking populations. Similar to previous findings, the order of item presentation did not correspond with increasing item difficulty, and both versions were inadequate at assessing high naming ability.

  19. Comparison of Performance on Multiple-Choice Questions and Open-Ended Questions in an Introductory Astronomy Laboratory

    Wooten, Michelle M.; Cool, Adrienne M.; Prather, Edward E.; Tanner, Kimberly D.

    2014-01-01

    When considering the variety of questions that can be used to measure students' learning, instructors may choose to use multiple-choice questions, which are easier to score than responses to open-ended questions. However, by design, analyses of multiple-choice responses cannot describe all of students' understanding. One method that can…

  20. Dual process theory and intermediate effect: are faculty and residents' performance on multiple-choice, licensing exam questions different?

    Dong, Ting; Durning, Steven J; Artino, Anthony R; van der Vleuten, Cees; Holmboe, Eric; Lipner, Rebecca; Schuwirth, Lambert

    2015-04-01

    Clinical reasoning is essential for the practice of medicine. Dual process theory conceptualizes reasoning as falling into two general categories: nonanalytic reasoning (pattern recognition) and analytic reasoning (active comparing and contrasting of alternatives). The debate continues regarding how expert performance develops and how individuals make the best use of analytic and nonanalytic processes. Several investigators have identified the unexpected finding that intermediates tend to perform better on licensing examination items than experts, which has been termed the "intermediate effect." We explored differences between faculty and residents on multiple-choice questions (MCQs) using dual process measures (both reading and answering times) to inform this ongoing debate. Faculty (board-certified internists; experts) and residents (internal medicine interns; intermediates) answered live licensing examination MCQs (U.S. Medical Licensing Examination Step 2 Clinical Knowledge and American Board of Internal Medicine Certifying Examination) while being timed. We conducted repeated-measures analyses of variance to compare the 2 groups on average reading time, answering time, and accuracy on various types of items. Faculty and residents did not differ significantly in reading time [F (1,35) = 0.01, p = 0.93], answering time [F (1,35) = 0.60, p = 0.44], or accuracy [F (1,35) = 0.24, p = 0.63] regardless of easy or hard items. Dual process theory was not evidenced in this study. However, this lack of difference between faculty and residents may have been affected by the small sample size of participants, and MCQs may not reflect how physicians make decisions in actual practice settings. Reprint & Copyright © 2015 Association of Military Surgeons of the U.S.

  1. A more general model for testing measurement invariance and differential item functioning.

    Bauer, Daniel J

    2017-09-01

    The evaluation of measurement invariance is an important step in establishing the validity and comparability of measurements across individuals. Most commonly, measurement invariance has been examined using one of two primary latent variable modeling approaches: the multiple groups model or the multiple-indicator multiple-cause (MIMIC) model. Both approaches offer opportunities to detect differential item functioning within multi-item scales, and thereby to test measurement invariance, but both approaches also have significant limitations. The multiple groups model allows one to examine the invariance of all model parameters but only across levels of a single categorical individual difference variable (e.g., ethnicity). In contrast, the MIMIC model permits both categorical and continuous individual difference variables (e.g., sex and age) but permits only a subset of the model parameters to vary as a function of these characteristics. The current article argues that moderated nonlinear factor analysis (MNLFA) constitutes an alternative, more flexible model for evaluating measurement invariance and differential item functioning. We show that the MNLFA subsumes and combines the strengths of the multiple group and MIMIC models, allowing for a full and simultaneous assessment of measurement invariance and differential item functioning across multiple categorical and/or continuous individual difference variables. The relationships between the MNLFA model and the multiple groups and MIMIC models are shown mathematically and via an empirical demonstration. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  2. Why Students Answer TIMSS Science Test Items the Way They Do

    Harlow, Ann; Jones, Alister

    2004-04-01

    The purpose of this study was to explore how Year 8 students answered Third International Mathematics and Science Study (TIMSS) questions and whether the test questions represented the scientific understanding of these students. One hundred and seventy-seven students were tested using written test questions taken from the science test used in the Third International Mathematics and Science Study. The degree to which a sample of 38 children represented their understanding of the topics in a written test compared to the level of understanding that could be elicited by an interview is presented in this paper. In exploring student responses in the interview situation this study hoped to gain some insight into the science knowledge that students held and whether or not the test items had been able to elicit this knowledge successfully. We question the usefulness and quality of data from large-scale summative assessments on their own to represent student scientific understanding and conclude that large scale written test items, such as TIMSS, on their own are not a valid way of exploring students' understanding of scientific concepts. Considerable caution is therefore needed in exploiting the outcomes of international achievement testing when considering educational policy changes or using TIMSS data on their own to represent student understanding.

  3. Redefining diagnostic symptoms of depression using Rasch analysis: testing an item bank suitable for DSM-V and computer adaptive testing.

    Mitchell, Alex J; Smith, Adam B; Al-salihy, Zerak; Rahim, Twana A; Mahmud, Mahmud Q; Muhyaldin, Asma S

    2011-10-01

    We aimed to redefine the optimal self-report symptoms of depression suitable for creation of an item bank that could be used in computer adaptive testing or to develop a simplified screening tool for DSM-V. Four hundred subjects (200 patients with primary depression and 200 non-depressed subjects) living in Iraqi Kurdistan were interviewed. The Mini International Neuropsychiatric Interview (MINI) was used to define the presence of major depression (DSM-IV criteria). We examined symptoms of depression using four well-known scales delivered in Kurdish. The Partial Credit Model was applied to each instrument. Common-item equating was subsequently used to create an item bank and differential item functioning (DIF) explored for known subgroups. A symptom-level Rasch analysis reduced the original 45 items to 24 after the exclusion of 21 misfitting items. A further six items (CESD13 and CESD17, HADS-D4, HADS-D5 and HADS-D7, and CDSS3 and CDSS4) were removed due to misfit as the items were added together to form the item bank, and two items were subsequently removed following the DIF analysis by diagnosis (CESD20 and CDSS9, both of which were harder to endorse for women). Therefore the remaining optimal item bank consisted of 17 items and produced an area under the curve (AUC) of 0.987. Using a bank restricted to the optimal nine items revealed only minor loss of accuracy (AUC = 0.989, sensitivity 96%, specificity 95%). Finally, when restricted to only four items, accuracy was still high (AUC = 0.976; sensitivity 93%, specificity 96%). An item bank of 17 items may be useful in computer adaptive testing and nine or even four items may be used to develop a simplified screening tool for DSM-V major depressive disorder (MDD). Further examination of this item bank should be conducted in different cultural settings.
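    In the dichotomous case, the Rasch family of models applied here reduces to a one-parameter logistic in the difference between person ability and item difficulty. A minimal sketch:

```python
import math

def rasch_probability(theta, beta):
    """Dichotomous Rasch model: probability that a person with
    ability theta endorses an item with difficulty beta.
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# A person whose ability equals the item difficulty endorses the
# item with probability 0.5
assert rasch_probability(0.0, 0.0) == 0.5
```

    Item banking and computer adaptive testing exploit this model by selecting, at each step, an item whose difficulty is close to the current ability estimate.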

  4. The Effects of Item Format and Cognitive Domain on Students' Science Performance in TIMSS 2011

    Liou, Pey-Yan; Bulut, Okan

    2017-12-01

    The purpose of this study was to examine eighth-grade students' science performance in terms of two test design components, item format, and cognitive domain. The portion of Taiwanese data came from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. The item difficulty analysis was initially applied to show the proportion of correct items. A regression-based cumulative link mixed modeling (CLMM) approach was further utilized to estimate the impact of item format, cognitive domain, and their interaction on the students' science scores. The results of the proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that the reasoning cognitive domain items were more difficult compared to the items in the applying and knowing domains. In terms of the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and the interaction term were included together, the directions and magnitudes of the predictors on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirical-based evidence for test developers, teachers, and stakeholders to be aware of the differential function of item format, cognitive domain, and their interaction in students' science performance.

  5. On the Relationship between Classical Test Theory and Item Response Theory: From One to the Other and Back

    Raykov, Tenko; Marcoulides, George A.

    2016-01-01

    The frequently neglected and often misunderstood relationship between classical test theory and item response theory is discussed for the unidimensional case with binary measures and no guessing. It is pointed out that popular item response models can be directly obtained from classical test theory-based models by accounting for the discrete…

  6. Using Classical Test Theory and Item Response Theory to Evaluate the LSCI

    Schlingman, Wayne M.; Prather, E. E.; Collaboration of Astronomy Teaching Scholars CATS

    2011-01-01

    Analyzing the data from the recent national study using the Light and Spectroscopy Concept Inventory (LSCI), this project uses both Classical Test Theory (CTT) and Item Response Theory (IRT) to investigate the LSCI itself in order to better understand what it is actually measuring. We use Classical Test Theory to form a framework of results that can be used to evaluate the effectiveness of individual questions at measuring differences in student understanding and provide further insight into the prior results presented from this data set. In the second phase of this research, we use Item Response Theory to form a theoretical model that generates parameters accounting for a student's ability, a question's difficulty, and estimate the level of guessing. The combined results from our investigations using both CTT and IRT are used to better understand the learning that is taking place in classrooms across the country. The analysis will also allow us to evaluate the effectiveness of individual questions and determine whether the item difficulties are appropriately matched to the abilities of the students in our data set. These results may require that some questions be revised, motivating the need for further development of the LSCI. This material is based upon work supported by the National Science Foundation under Grant No. 0715517, a CCLI Phase III Grant for the Collaboration of Astronomy Teaching Scholars (CATS). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
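    The IRT model sketched in this abstract, with parameters for ability, item difficulty, and guessing, corresponds to the three-parameter logistic model. A minimal illustration (parameter values are invented):

```python
import math

def three_pl(theta, a, b, c):
    """Three-parameter logistic IRT model: probability of a correct
    response given ability theta, item discrimination a, item
    difficulty b, and lower asymptote (pseudo-guessing) c.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is halfway between c and 1
assert three_pl(0.0, a=1.0, b=0.0, c=0.25) == 0.625
```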

  7. Science Library of Test Items. Volume Nine. Mastery Testing Programme. [Mastery Tests Series 1.] Tests M1-M13.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of the first 13 tests are provided. Skills to be tested include: (1) reading a table; (2) using a biological key; (3) identifying chemical symbols; (4) identifying parts of a human body; (5) reading a line graph; (6) identifying electronic and…

  8. The Linear Logistic Test Model (LLTM as the methodological foundation of item generating rules for a new verbal reasoning test

    HERBERT POINSTINGL

    2009-06-01

    Based on the demand for new verbal reasoning tests to enrich the psychological test inventory, a pilot version of a new test was analysed: the 'Family Relation Reasoning Test' (FRRT; Poinstingl, Kubinger, Skoda & Schechtner, forthcoming), in which several basic cognitive operations (logical rules) have been embedded. Given family relationships of varying complexity embedded in short stories, testees had to logically conclude the correct relationship between two individuals within a family. Using empirical data, the linear logistic test model (LLTM; Fischer, 1972), a special case of the Rasch model, was used to test the construct validity of the test: the hypothetically assumed basic cognitive operations had to explain the Rasch model's item difficulty parameters. After being shaped in the LLTM's matrix of weights ((q_ij)), none of these operations were corroborated by means of Andersen's Likelihood Ratio Test.
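    The LLTM's core idea is to decompose each Rasch item difficulty into a weighted sum of basic-operation parameters via the matrix of weights ((q_ij)). A minimal sketch; the weights and parameter values below are invented for illustration:

```python
def lltm_item_difficulty(weights, basic_parameters, c=0.0):
    """LLTM decomposition: an item's Rasch difficulty is modeled as a
    weighted sum of basic cognitive operation parameters.

    weights: the q_ij entries for one item (how often each operation
    is required); basic_parameters: the eta_j difficulty of each
    operation; c: an optional normalization constant.
    """
    return sum(q * eta for q, eta in zip(weights, basic_parameters)) + c

# Hypothetical item requiring operation 1 once and operation 3 twice
assert lltm_item_difficulty([1, 0, 2], [0.5, 0.3, 0.125]) == 0.75
```

    Construct validity is then checked by testing whether these reconstructed difficulties reproduce the freely estimated Rasch difficulties, e.g. with Andersen's Likelihood Ratio Test.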

  9. Science Library of Test Items. Volume Ten. Mastery Testing Programme. [Mastery Tests Series 2.] Tests M14-M26.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 14 through 26 include: (14) calculating an average; (15) identifying parts of the scientific method; (16) reading a geological map; (17) identifying elements, mixtures and compounds; (18) using Ohm's law in calculation; (19) interpreting…

  10. Science Library of Test Items. Volume Twelve. Mastery Testing Programme. [Mastery Tests Series 4.] Tests M39-M50.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 39 through 50 include: (39) using a code; (40) naming the parts of a microscope; (41) calculating density and predicting flotation; (42) estimating metric length; (43) using SI symbols; (44) using s=vt; (45) applying a novel theory; (46)…

  11. Science Library of Test Items. Volume Thirteen. Mastery Testing Program. [Mastery Tests Series 5.] Tests M51-M65.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 51 through 65 include: (51) interpreting atomic and mass numbers; (52) extrapolating from a geological map; (53) matching geological sections and maps; (54) identifying parts of the human eye; (55) identifying the functions of parts of a…

  12. Science Library of Test Items. Volume Eleven. Mastery Testing Programme. [Mastery Tests Series 3.] Tests M27-M38.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 27 through 38 include: (27) reading a grid plan; (28) identifying common invertebrates; (29) characteristics of invertebrates; (30) identifying elements; (31) using scientific notation part I; (32) classifying minerals; (33) predicting the…

  13. A Comparison of the Approaches of Generalizability Theory and Item Response Theory in Estimating the Reliability of Test Scores for Testlet-Composed Tests

    Lee, Guemin; Park, In-Yong

    2012-01-01

    Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several…

  14. Using Two-Tier Test to Identify Primary Students' Conceptual Understanding and Alternative Conceptions in Acid Base

    Bayrak, Beyza Karadeniz

    2013-01-01

    The purpose of this study was to identify primary students' conceptual understanding and alternative conceptions in acid-base. For this reason, a 15-item two-tier multiple-choice test was administered to 56 eighth-grade students in the spring semester of 2009-2010. Data for this study were collected using a conceptual understanding scale prepared to include…

  15. Aufwandsanalyse für computerunterstützte Multiple-Choice Papierklausuren [Cost analysis for computer-supported multiple-choice paper examinations]

    Mandel, Alexander

    2011-11-01

    [english] Introduction: Multiple-choice examinations are still fundamental for assessment in medical degree programs. In addition to content-related research, the optimization of the technical procedure is an important question. Medical examiners face three options: paper-based examinations with or without computer support, or completely electronic examinations. Critical aspects are the effort for formatting, the logistic effort during the actual examination, the quality, promptness and effort of the correction, the time for making the documents available for inspection by the students, and the statistical analysis of the examination results. Methods: For three semesters, a computer program for the input and formatting of MC questions in medical and other paper-based examinations has been used and continuously improved at Wuerzburg University. In the winter semester (WS) 2009/10 eleven, in the summer semester (SS) 2010 twelve, and in WS 2010/11 thirteen medical examinations were conducted with the program and automatically evaluated. For the last two semesters the remaining manual workload was recorded. Results: The cost of the formatting and the subsequent analysis, including adjustments of the analysis, of an average examination with about 140 participants and about 35 questions was 5-7 hours for exams without complications in WS 2009/10, about 2 hours in SS 2010, and about 1.5 hours in WS 2010/11. Including exams with complications, the average time was about 3 hours per exam in SS 2010 and 2.67 hours in WS 2010/11. Discussion: For conventional multiple-choice exams, the computer-based formatting and evaluation of paper-based exams offers a significant time reduction for lecturers in comparison with the manual correction of paper-based exams, and compared to purely electronically conducted exams it needs a much simpler technological infrastructure and fewer staff during the exam.

  16. Applications of Multidimensional Item Response Theory Models with Covariates to Longitudinal Test Data. Research Report. ETS RR-16-21

    Fu, Jianbin

    2016-01-01

    The multidimensional item response theory (MIRT) models with covariates proposed by Haberman and implemented in the "mirt" program provide a flexible way to analyze data based on item response theory. In this report, we discuss applications of the MIRT models with covariates to longitudinal test data to measure skill differences at the…

  17. Evaluating the validity of the Work Role Functioning Questionnaire (Canadian French version) using classical test theory and item response theory.

    Hong, Quan Nha; Coutu, Marie-France; Berbiche, Djamal

    2017-01-01

    The Work Role Functioning Questionnaire (WRFQ) was developed to assess workers' perceived ability to perform job demands and is used to monitor presenteeism. Still, few studies on its validity can be found in the literature. The purpose of this study was to assess the items and factorial composition of the Canadian French version of the WRFQ (WRFQ-CF). Two measurement approaches were used to test the WRFQ-CF: Classical Test Theory (CTT) and non-parametric Item Response Theory (IRT). A total of 352 completed questionnaires were analyzed. Four-factor and three-factor models were tested and showed good fit with 14 items (Root Mean Square Error of Approximation (RMSEA) = 0.06, Standardized Root Mean Square Residual (SRMR) = 0.04, Bentler Comparative Fit Index (CFI) = 0.98) and with 17 items (RMSEA = 0.059, SRMR = 0.048, CFI = 0.98), respectively. Using IRT, 13 problematic items were identified, of which 9 were common with CTT. This study tested different models, with fewer problematic items found in a three-factor model. Using non-parametric IRT and CTT for item purification gave complementary results. IRT is still scarcely used and can be an interesting alternative method to enhance the quality of a measurement instrument. More studies are needed on the WRFQ-CF to refine its items and factorial composition.

  18. Using Cochran's Z Statistic to Test the Kernel-Smoothed Item Response Function Differences between Focal and Reference Groups

    Zheng, Yinggan; Gierl, Mark J.; Cui, Ying

    2010-01-01

    This study combined the kernel smoothing procedure and a nonparametric differential item functioning statistic--Cochran's Z--to statistically test the difference between the kernel-smoothed item response functions for reference and focal groups. Simulation studies were conducted to investigate the Type I error and power of the proposed…
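The kernel-smoothing half of the procedure can be illustrated with a minimal Nadaraya-Watson estimate of an item response function; the Gaussian kernel, bandwidth, and toy data below are assumptions for illustration, not the study's settings.

```python
import math

def kernel_smooth_irf(thetas, responses, grid, h=0.3):
    """Nadaraya-Watson estimate of P(correct | theta) on a grid of ability
    points, using a Gaussian kernel with bandwidth h (illustrative values)."""
    def k(u):
        return math.exp(-0.5 * u * u)
    irf = []
    for t in grid:
        w = [k((t - th) / h) for th in thetas]
        irf.append(sum(wi * r for wi, r in zip(w, responses)) / sum(w))
    return irf

# Toy data: higher-ability examinees answer the item correctly more often.
thetas    = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
responses = [0,    0,    0,    1,   1,   1,   1]
grid = [-2.0, 0.0, 2.0]
smoothed = kernel_smooth_irf(thetas, responses, grid)
```

Comparing two such smoothed curves (reference vs. focal group) at each grid point is the ingredient the proposed statistic then tests.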

  19. Adaptation and validation into Portuguese language of the six-item cognitive impairment test (6CIT).

    Apóstolo, João Luís Alves; Paiva, Diana Dos Santos; Silva, Rosa Carla Gomes da; Santos, Eduardo José Ferreira Dos; Schultz, Timothy John

    2017-07-25

    The six-item cognitive impairment test (6CIT) is a brief cognitive screening tool that can be administered to older people in 2-3 min. The aim was to adapt the 6CIT to European Portuguese and determine its psychometric properties based on a sample recruited from several contexts (nursing homes, universities for older people, day centres, primary health care units). The original 6CIT was translated into Portuguese and the draft Portuguese version (6CIT-P) was back-translated and piloted. The accuracy of the 6CIT-P was assessed by comparison with the Portuguese Mini-Mental State Examination (MMSE). A convenience sample of 550 older people from various geographical locations in the north and centre of the country was used. The test-retest reliability coefficient was high (r = 0.95). The 6CIT-P also showed good internal consistency (α = 0.88), and corrected item-total correlations ranged between 0.32 and 0.90. Total 6CIT-P and MMSE scores were strongly correlated. The proposed 6CIT-P threshold for cognitive impairment is ≥10 in the Portuguese population, which gives sensitivity of 82.78% and specificity of 84.84%. The accuracy of the 6CIT-P, as measured by the area under the ROC curve, was 0.91. The 6CIT-P has high reliability and validity and is accurate when used to screen for cognitive impairment.
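The reported sensitivity and specificity at the ≥10 cutoff follow from a simple 2x2 tabulation against the reference standard. A minimal sketch, on invented scores and labels (not the study's data):

```python
# Illustrative check of how a screening cutoff yields sensitivity and
# specificity. Data are made up; the >= 10 threshold mirrors the one
# reported for the 6CIT-P.

def sens_spec(scores, impaired, cutoff=10):
    """scores: screening scores; impaired: reference labels (True = case)."""
    tp = sum(s >= cutoff and y for s, y in zip(scores, impaired))
    fn = sum(s < cutoff and y for s, y in zip(scores, impaired))
    tn = sum(s < cutoff and not y for s, y in zip(scores, impaired))
    fp = sum(s >= cutoff and not y for s, y in zip(scores, impaired))
    return tp / (tp + fn), tn / (tn + fp)

scores   = [2, 4, 9, 10, 12, 15, 8, 11]
impaired = [False, False, False, True, True, True, True, False]
sensitivity, specificity = sens_spec(scores, impaired)
```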

  20. Step by Step: Biology Undergraduates’ Problem-Solving Procedures during Multiple-Choice Assessment

    Prevost, Luanna B.; Lemons, Paula P.

    2016-01-01

    This study uses the theoretical framework of domain-specific problem solving to explore the procedures students use to solve multiple-choice problems about biology concepts. We designed several multiple-choice problems and administered them on four exams. We trained students to produce written descriptions of how they solved the problem, and this allowed us to systematically investigate their problem-solving procedures. We identified a range of procedures and organized them as domain general, domain specific, or hybrid. We also identified domain-general and domain-specific errors made by students during problem solving. We found that students use domain-general and hybrid procedures more frequently when solving lower-order problems than higher-order problems, while they use domain-specific procedures more frequently when solving higher-order problems. Additionally, the more domain-specific procedures students used, the higher the likelihood that they would answer the problem correctly, up to five procedures. However, if students used just one domain-general procedure, they were as likely to answer the problem correctly as if they had used two to five domain-general procedures. Our findings provide a categorization scheme and framework for additional research on biology problem solving and suggest several important implications for researchers and instructors. PMID:27909021

  1. An intuitive graphical webserver for multiple-choice protein sequence search.

    Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince

    2014-04-10

    Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" has become a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, forming the query on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple-choice queries by simply drawing the residues to their positions; if more than one residue is drawn to the same position, they will be neatly stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes it possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org. Copyright © 2014 Elsevier B.V. All rights reserved.

  2. The Instrument Implementation of Two-tier Multiple Choice to Analyze Students’ Science Process Skill Profile

    Sukarmin Sukarmin

    2018-01-01

    Full Text Available This research aims to analyze the profile of students' science process skills (SPS) using a two-tier multiple-choice instrument. This is a descriptive study that describes the profile of students' SPS. Subjects were 10th-grade students from high-, medium- and low-categorized schools. The two-tier multiple-choice instrument consists of 30 questions that contain indicators of SPS, namely formulating a hypothesis, designing an experiment, analyzing data, applying a concept, communicating, and drawing a conclusion. The results and analysis show that: 1) the average indicator achievement at the high-categorized school was 74.55% for formulating a hypothesis, 74.89% for designing an experiment, 67.89% for analyzing data, 52.89% for applying a concept, 80.22% for communicating, and 76% for drawing a conclusion; 2) at the medium-categorized school, it was 53.47% for formulating a hypothesis, 59.86% for designing an experiment, 42.22% for analyzing data, 33.19% for applying a concept, 76.25% for communicating, and 61.53% for drawing a conclusion; 3) at the low-categorized school, it was 51% for formulating a hypothesis, 55.17% for designing an experiment, 39.17% for analyzing data, 35.83% for applying a concept, 58.83% for communicating, and 58% for drawing a conclusion.

  3. Analyzing Item Generation with Natural Language Processing Tools for the "TOEIC"® Listening Test. Research Report. ETS RR-17-52

    Yoon, Su-Youn; Lee, Chong Min; Houghton, Patrick; Lopez, Melissa; Sakano, Jennifer; Loukina, Anastasia; Krovetz, Bob; Lu, Chi; Madani, Nitin

    2017-01-01

    In this study, we developed assistive tools and resources to support TOEIC® Listening test item generation. There has recently been an increased need for a large pool of items for these tests. This need has, in turn, inspired efforts to increase the efficiency of item generation while maintaining the quality of the created items. We aimed to…

  4. Do Self Concept Tests Test Self Concept? An Evaluation of the Validity of Items on the Piers Harris and Coopersmith Measures.

    Lynch, Mervin D.; Chaves, John

    Items from the Piers-Harris and Coopersmith self-concept tests were evaluated against independent measures of three self-constructs: idealized, empathic, and worth. Construct measurements were obtained with the semantic differential and the D statistic. Ratings were obtained from 381 children, grades 4-6. For each test, item ratings and construct measures…

  5. Conditioning factors of test-taking engagement in PIAAC: an exploratory IRT modelling approach considering person and item characteristics

    Frank Goldhammer

    2017-11-01

    Full Text Available Background: A potential problem of low-stakes large-scale assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC) is low test-taking engagement. The present study pursued two goals in order to better understand conditioning factors of test-taking disengagement: First, a model-based approach was used to investigate whether item indicators of disengagement constitute a continuous latent person variable by domain. Second, the effects of person and item characteristics were jointly tested using explanatory item response models. Methods: Analyses were based on the Canadian sample of Round 1 of PIAAC, with N = 26,683 participants completing test items in the domains of literacy, numeracy, and problem solving. Binary item disengagement indicators were created by means of item response time thresholds. Results: The results showed that disengagement indicators define a latent dimension by domain. Disengagement increased with lower educational attainment, lower cognitive skills, and when the test language was not the participant's native language. Gender did not exert any effect on disengagement, while age had a positive effect for problem solving only. An item's location in the second of two assessment modules was positively related to disengagement, as was item difficulty. The latter effect was negatively moderated by cognitive skill, suggesting that poor test-takers are especially likely to disengage with more difficult items. Conclusions: The negative effect of cognitive skill, the positive effect of item difficulty, and their negative interaction effect support the assumption that disengagement is the outcome of individual expectations about success (informed disengagement).
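The response-time thresholding that yields the binary disengagement indicators can be sketched as follows; the item names, thresholds, and times are hypothetical, not PIAAC values.

```python
# Sketch of deriving binary disengagement indicators from item response
# times, as described in the record: a response faster than an item-specific
# threshold is flagged as disengaged. All numbers are invented.

def disengagement_indicators(response_times, thresholds):
    """Return {item: 1 if the response was faster than the threshold, else 0}."""
    return {item: int(rt < thresholds[item])
            for item, rt in response_times.items()}

thresholds = {"lit1": 5.0, "lit2": 8.0, "num1": 6.0}   # seconds, hypothetical
times      = {"lit1": 2.1, "lit2": 30.4, "num1": 4.0}

flags = disengagement_indicators(times, thresholds)
rate = sum(flags.values()) / len(flags)   # person-level disengagement rate
```

These per-item binary flags are then what the latent-variable and explanatory item response models take as input.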

  6. Detection of advance item knowledge using response times in computer adaptive testing

    Meijer, R.R.; Sotaridona, Leonardo

    2006-01-01

    We propose a new method for detecting item preknowledge in a CAT based on an estimate of “effective response time” for each item. Effective response time is defined as the time required for an individual examinee to answer an item correctly. An unusually short response time relative to the expected

  7. The Social Attribution Task-Multiple Choice (SAT-MC): A Psychometric and Equivalence Study of an Alternate Form.

    Johannesen, Jason K; Lurie, Jessica B; Fiszdon, Joanna M; Bell, Morris D

    2013-01-01

    The Social Attribution Task-Multiple Choice (SAT-MC) uses a 64-second video of geometric shapes set in motion to portray themes of social relatedness and intentions. Considered a test of "Theory of Mind," the SAT-MC assesses implicit social attribution formation while reducing verbal and basic cognitive demands required of other common measures. We present a comparability analysis of the SAT-MC and the new SAT-MC-II, an alternate form created for repeat testing, in a university sample (n = 92). Score distributions and patterns of association with external validation measures were nearly identical between the two forms, with convergent and discriminant validity supported by association with affect recognition ability and lack of association with basic visual reasoning. Internal consistency of the SAT-MC-II was superior (alpha = .81) to the SAT-MC (alpha = .56). Results support the use of SAT-MC and new SAT-MC-II as equivalent test forms. Demonstrating relatively higher association to social cognitive than basic cognitive abilities, the SAT-MC may provide enhanced sensitivity as an outcome measure of social cognitive intervention trials.
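The internal-consistency figures quoted (alpha = .81 vs. .56) are Cronbach's alpha, which can be computed directly from an item-by-respondent score matrix; the toy data below are illustrative only.

```python
# Minimal computation of Cronbach's alpha, the internal-consistency index
# reported for the SAT-MC forms, on a small invented item-score matrix.

def cronbach_alpha(items):
    """items: list of per-item score lists (same respondents in each)."""
    k = len(items)                       # number of items
    n = len(items[0])                    # number of respondents
    def var(xs):                         # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(var(it) for it in items)
    return k / (k - 1) * (1 - item_var_sum / var(totals))

items = [
    [1, 0, 1, 1, 0],   # item 1 scores for five respondents
    [1, 0, 1, 0, 0],   # item 2
    [1, 1, 1, 1, 0],   # item 3
]
alpha = cronbach_alpha(items)
```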

  8. Concreteness effects in short-term memory: a test of the item-order hypothesis.

    Roche, Jaclynn; Tolan, G Anne; Tehan, Gerald

    2011-12-01

    The following experiments explore word length and concreteness effects in short-term memory within an item-order processing framework. This framework asserts order memory is better for those items that are relatively easy to process at the item level. However, words that are difficult to process benefit at the item level for increased attention/resources being applied. The prediction of the model is that differential item and order processing can be detected in episodic tasks that differ in the degree to which item or order memory are required by the task. The item-order account has been applied to the word length effect such that there is a short word advantage in serial recall but a long word advantage in item recognition. The current experiment considered the possibility that concreteness effects might be explained within the same framework. In two experiments, word length (Experiment 1) and concreteness (Experiment 2) are examined using forward serial recall, backward serial recall, and item recognition. These results for word length replicate previous studies showing the dissociation in item and order tasks. The same was not true for the concreteness effect. In all three tasks concrete words were better remembered than abstract words. The concreteness effect cannot be explained in terms of an item-order trade off. PsycINFO Database Record (c) 2011 APA, all rights reserved.

  9. Determination of radionuclides in environmental test items at CPHR: traceability and uncertainty calculation.

    Carrazana González, J; Fernández, I M; Capote Ferrera, E; Rodríguez Castro, G

    2008-11-01

    Information about how the laboratory of the Centro de Protección e Higiene de las Radiaciones (CPHR), Cuba, establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed.
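A minimal sketch of the Kragten spreadsheet approach mentioned here: each input is shifted by its standard uncertainty, the measurement function is re-evaluated, and the shifted differences are combined in quadrature. The activity model and all numbers below are invented, not the CPHR procedure.

```python
import math

def kragten_uncertainty(f, values, uncerts):
    """Combined standard uncertainty of f(**values) by Kragten's numerical
    method: one re-evaluation per input, shifted by its uncertainty."""
    y0 = f(**values)
    contributions = {}
    for name, u in uncerts.items():
        shifted = dict(values, **{name: values[name] + u})
        contributions[name] = f(**shifted) - y0
    return math.sqrt(sum(c * c for c in contributions.values())), contributions

def activity(N, eps, t):
    # Hypothetical model: net counts / (detection efficiency * live time)
    return N / (eps * t)

values  = {"N": 10000.0, "eps": 0.25, "t": 3600.0}
uncerts = {"N": 100.0, "eps": 0.01, "t": 1.0}
u_c, contrib = kragten_uncertainty(activity, values, uncerts)
# 'contrib' exposes the major contributor, here the efficiency term
```

The per-input contributions are what make the Kragten layout attractive: the dominant uncertainty source is read off directly, which is exactly the comparison against the partial-derivative method that the record describes.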

  10. Determination of radionuclides in environmental test items at CPHR: Traceability and uncertainty calculation

    Carrazana Gonzalez, J.; Fernandez, I.M.; Capote Ferrera, E.; Rodriguez Castro, G.

    2008-01-01

    Information about how the laboratory of Centro de Proteccion e Higiene de las Radiaciones (CPHR), Cuba establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten-spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed

  11. Assessing the discriminating power of item and test scores in the linear factor-analysis model

    Pere J. Ferrando

    2012-01-01

    Full Text Available Rigorous, model-based proposals for studying the imprecise concept of "discriminating power" are scarce and generally limited to nonlinear models for binary items. This article proposes a general framework for assessing the discriminating power of scores on items and tests calibrated with the common-factor model. The proposal is organized around three criteria: (a) the type of score, (b) the range of discrimination, and (c) the specific aspect being assessed. Within the proposed framework, (a) relations among 16 measures, of which 6 appear to be new, are discussed, and (b) the relations between them are studied. The usefulness of the proposal in psychometric applications that use the factor model is illustrated with an empirical example.

  12. A Case Study on an Item Writing Process: Use of Test Specifications, Nature of Group Dynamics, and Individual Item Writers' Characteristics

    Kim, Jiyoung; Chi, Youngshin; Huensch, Amanda; Jun, Heesung; Li, Hongli; Roullion, Vanessa

    2010-01-01

    This article discusses a case study on an item writing process that reflects on our practical experience in an item development project. The purpose of the article is to share our lessons from the experience, aiming to demystify the item-writing process. The study investigated three issues that naturally emerged during the project: how item writers use…

  13. Solving the Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem by Dynamic Programming

    Christensen, Tue; Andersen, Kim Allan; Klose, Andreas

    2013-01-01

    This paper considers a minimum-cost network flow problem in a bipartite graph with a single sink. The transportation costs exhibit a staircase cost structure because such types of transportation cost functions are often found in practice. We present a dynamic programming algorithm for solving this so-called single-sink, fixed-charge, multiple-choice transportation problem exactly. The method exploits heuristics and lower bounds to peg binary variables, improve bounds on flow variables, and reduce the state-space variable. In this way, the dynamic programming method is able to solve large instances with up to 10,000 nodes and 10 different transportation modes in a few seconds, much less time than required by a widely used mixed-integer programming solver and other methods proposed in the literature for this problem.
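A toy dynamic program in the spirit of the one described, with the state being the demand covered so far and a per-supplier staircase cost (a fixed charge per started batch plus a unit cost); it omits the paper's pegging heuristics and bound tightening, and all data are invented.

```python
import math

def staircase_cost(q, batch, fixed, unit):
    """Fixed charge per started batch plus a linear unit cost."""
    if q == 0:
        return 0.0
    return math.ceil(q / batch) * fixed + unit * q

def min_cost(suppliers, demand):
    """suppliers: list of (capacity, batch, fixed, unit) tuples. Returns the
    minimum cost of shipping exactly `demand` units to the single sink."""
    INF = float("inf")
    dp = [0.0] + [INF] * demand          # dp[d] = min cost to cover d units
    for cap, batch, fixed, unit in suppliers:
        new = [INF] * (demand + 1)
        for d in range(demand + 1):
            if dp[d] == INF:
                continue
            # Try every feasible quantity q from this supplier.
            for q in range(min(cap, demand - d) + 1):
                c = dp[d] + staircase_cost(q, batch, fixed, unit)
                if c < new[d + q]:
                    new[d + q] = c
        dp = new
    return dp[demand]

suppliers = [(6, 3, 10.0, 1.0),   # cheap units, coarse batches
             (5, 1, 2.0, 2.0)]    # fine-grained batches, pricier units
best = min_cost(suppliers, 8)
```

The exhaustive inner loop is what the paper's variable pegging and flow bounds prune away, which is how instances far larger than this toy become solvable in seconds.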

  14. Solving the Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem by Dynamic Programming

    Rauff Lind Christensen, Tue; Klose, Andreas; Andersen, Kim Allan

    The Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem (SSFCMCTP) is a problem with versatile applications. This problem is a generalization of the Single-Sink, Fixed-Charge Transportation Problem (SSFCTP), which has a fixed-charge, linear cost structure. However, in at least two important aspects of supplier selection, an important application of the SSFCTP, this does not reflect the real-life situation. First, transportation costs faced by many companies are in fact piecewise linear. Secondly, when suppliers offer discounts, either incremental or all-unit discounts, such savings are neglected in the SSFCTP. The SSFCMCTP overcomes this problem by incorporating a staircase cost structure in the cost function instead of the usual one used in the SSFCTP. We present a dynamic programming algorithm for the resulting problem. To enhance the performance of the generic algorithm a number...

  15. Measuring primary school teachers' pedagogical content knowledge in technology education with a multiple choice test

    Rohaan, E.J.; Taconis, R.; Jochems, W.M.G.; Fatih Tasar, M.; Cakankci, G.; Akgul, E.

    2009-01-01

    Pedagogical content knowledge (PCK) is a crucial part of a teacher’s knowledge base for teaching. Studies in the field of technology education for primary schools showed that this domain of teacher knowledge is related to pupils’ increased learning, motivation, and interest. The common methods to

  17. Developing a Numerical Ability Test for Students of Education in Jordan: An Application of Item Response Theory

    Abed, Eman Rasmi; Al-Absi, Mohammad Mustafa; Abu shindi, Yousef Abdelqader

    2016-01-01

    The purpose of the present study is to develop a test to measure the numerical ability of students of education. The sample of the study consisted of (504) students from 8 universities in Jordan. The final draft of the test contains 45 items distributed among 5 dimensions. The results revealed acceptable psychometric properties of the test;…

  18. Diagnostic accuracy of a two-item Drug Abuse Screening Test (DAST-2).

    Tiet, Quyen Q; Leyva, Yani E; Moos, Rudolf H; Smith, Brandy

    2017-11-01

    Drug use is prevalent and costly to society, but individuals with drug use disorders (DUDs) are under-diagnosed and under-treated, particularly in primary care (PC) settings. Drug screening instruments have been developed to identify patients with DUDs and facilitate treatment. The Drug Abuse Screening Test (DAST) is one of the most well-known drug screening instruments. However, like many such instruments, it is too long for routine use in busy PC settings. This study developed and validated a briefer and more practical DAST for busy PC settings. We recruited 1300 PC patients in two Department of Veterans Affairs (VA) clinics. Participants responded to a structured diagnostic interview. We randomly selected half of the sample to develop and the other half to validate the new instrument. We employed signal detection techniques to select the best DAST items to identify DUDs (based on the MINI) and negative consequences of drug use (measured by the Inventory of Drug Use Consequences). Performance indicators were calculated. The two-item DAST (DAST-2) was 97% sensitive and 91% specific for DUDs in the development sample and 95% sensitive and 89% specific in the validation sample. It was highly sensitive and specific for DUDs and negative consequences of drug use in subgroups of patients defined by gender, age, race/ethnicity, marital status, educational level, and posttraumatic stress disorder status. The DAST-2 is an appropriate drug screening instrument for routine use in PC settings in the VA and may be applicable in a broader range of PC clinics. Published by Elsevier Ltd.
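The signal-detection step of selecting items and cutoffs can be illustrated by scanning candidate cutoffs for the maximum Youden's J (sensitivity + specificity - 1). This is one common criterion, not necessarily the authors' exact procedure, and the data below are invented.

```python
# Hedged sketch: pick the screening cutoff that maximises Youden's J on a
# development sample. Scores and case labels are made up for illustration.

def best_cutoff(scores, cases):
    """Return (cutoff, J) maximising sensitivity + specificity - 1."""
    best = (None, -1.0)
    for c in sorted(set(scores)):
        tp = sum(s >= c and y for s, y in zip(scores, cases))
        fn = sum(s < c and y for s, y in zip(scores, cases))
        tn = sum(s < c and not y for s, y in zip(scores, cases))
        fp = sum(s >= c and not y for s, y in zip(scores, cases))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best[1]:
            best = (c, j)
    return best

scores = [0, 0, 1, 1, 2, 2, 2, 0]                      # 0-2 item-sum scores
cases  = [False, False, False, True, True, True, True, False]
cutoff, j = best_cutoff(scores, cases)
```

In practice the chosen cutoff is then re-checked on the held-out validation half, as the record describes.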

  19. Branched Adaptive Testing with a Rasch-Model-Calibrated Test: Analysing Item Presentation's Sequence Effects Using the Rasch-Model-Based LLTM

    Kubinger, Klaus D.; Reif, Manuel; Yanagida, Takuya

    2011-01-01

    Item position effects provoke serious problems within adaptive testing. This is because different testees are necessarily presented with the same item at different presentation positions, as a consequence of which comparing their ability parameter estimations in the case of such effects would not at all be fair. In this article, a specific…

  20. An Online National Archive of Multiple-Choice Questions for Astro 101 and the Development of the Question Complexity Rubric

    Cormier, S.; Prather, E.; Brissenden, G.

    2011-09-01

    We are developing a national archive of multiple-choice questions for use in the Astronomy 101 classroom. These questions are intended to supplement an instructor's implementation of Think-Pair-Share or to serve their assessment purposes (i.e., exams and homework). We are also developing the Question Complexity Rubric (QCR) to guide members of the Astro 101 teaching and learning community in assisting us with hierarchically ranking the questions in this archive by their conceptual complexity. Using the QCR, a score is assigned to differentiate each question based on the cognitive steps necessary to comprehensively explain the reasoning pathway to the correct answer. The lowest QCR score is given to questions with a reasoning pathway requiring only declarative knowledge. The highest QCR score is given to questions with a reasoning pathway that requires multiple connected cognitive steps. When completed, the online question archive will provide users with the ability to 1) use the QCR to score questions, 2) search for and download questions by topic and/or QCR score, and 3) add their own questions to the archive. Stop by our poster to test your skills at determining question complexity by trying out the QCR with our sample questions.

  1. A leukocyte activation test identifies food items which induce release of DNA by innate immune peripheral blood leucocytes.

    Garcia-Martinez, Irma; Weiss, Theresa R; Yousaf, Muhammad N; Ali, Ather; Mehal, Wajahat Z

    2018-01-01

    Leukocyte activation (LA) testing identifies food items that induce a patient-specific cellular response in the immune system, and has recently been shown in a randomized, double-blinded prospective study to reduce symptoms in patients with irritable bowel syndrome (IBS). We hypothesized that test reactivity to particular food items, and the systemic immune response initiated by these food items, is due to the release of cellular DNA from blood immune cells. We tested this by quantifying total DNA concentration in the cellular supernatant of immune cells exposed to positive and negative foods from 20 healthy volunteers. To establish whether the DNA release by positive samples is a specific phenomenon, we quantified myeloperoxidase (MPO) in cellular supernatants. We further assessed by flow cytometry whether a particular immune cell population (neutrophils, eosinophils, or basophils) was activated by the positive food items. To identify the signaling pathways required for DNA release, we tested whether specific inhibitors of key signaling pathways could block DNA release. Foods with a positive LA test result gave a higher supernatant DNA content than foods with a negative result. This was specific, as MPO levels were not increased by foods with a positive LA test. Protein kinase C (PKC) inhibitors blocked positive-food-stimulated DNA release. Positive foods produced greater eosinophil CD63 levels than negative foods in 76.5% of tests. The LA test thus identifies food items that result in DNA release and activation of peripheral blood innate immune cells in a PKC-dependent manner, suggesting that it identifies food items that trigger release of inflammatory markers and activation of innate immune cells. This may be the basis for the improvement in symptoms in IBS patients who followed an LA-test-guided diet.

  2. The Predominance Of Integrative Tests Over Discrete Point Tests In Evaluating The Medical Students' General English Knowledge

    maryam Heydarpour Meymeh

    2009-03-01

    Full Text Available Background and purpose: Multiple-choice tests are the most common type of test used to evaluate the general English knowledge of students in most medical universities; however, the efficacy of these tests has not been examined precisely. We compare and examine integrative tests and discrete-point tests as measures of the English language knowledge of medical students. Methods: Three tests were given to 60 undergraduate physiotherapy and audiology students in their second year of study (after passing their general English course). They were divided into 2 groups. The first test for both groups was an integrative test: writing. The second test was a multiple-choice test of prepositions for group one and a multiple-choice test of tenses for group two. The same items that were most frequently used wrongly in the first test were used in the items of the second test. A third test, a TOEFL, was given to the subjects in order to estimate the correlation between this test and tests one and two. Results: The students performed better in the second, discrete-point test than in the first, integrative test. Grammatical constructions that the students had used wrongly in the composition were answered correctly in the multiple-choice tests. Conclusion: Our findings show that students perform better in non-productive than in productive tests. Since being a competent English language user is an expected outcome of university language courses, it seems warranted to switch to integrative tests as a measure of English language competency. Keywords: INTEGRATIVE TESTS, ENGLISH LANGUAGE FOR MEDICINE, ACADEMIC ENGLISH

  3. Measurement Properties of Two Innovative Item Formats in a Computer-Based Test

    Wan, Lei; Henly, George A.

    2012-01-01

    Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats--the figural response (FR) and constructed response (CR) formats used in a K-12 computerized…

  4. Developing and testing items for the South African Personality Inventory (SAPI)

    Carin Hill

    2013-11-01

    Research purpose: This article reports on the process of identifying items for, and provides a quantitative evaluation of, the South African Personality Inventory (SAPI) items. Motivation for the study: The study intended to develop an indigenous and psychometrically sound personality instrument that adheres to the requirements of South African legislation and excludes cultural bias. Research design, approach and method: The authors used a cross-sectional design. They measured the nine SAPI clusters identified in the qualitative stage of the SAPI project in 11 separate quantitative studies. Convenience sampling yielded 6735 participants. Statistical analysis focused on the construct validity and reliability of items. The authors eliminated items that showed poor performance, based on common psychometric criteria, and selected the best-performing items to form part of the final version of the SAPI. Main findings: The authors developed 2573 items from the nine SAPI clusters. Of these, 2268 items were valid and reliable representations of the SAPI facets. Practical/managerial implications: The authors developed a large item pool that measures personality in South Africa and that researchers can refine for the SAPI. Furthermore, the project illustrates an approach that researchers can use in projects that aim to develop culturally informed psychological measures. Contribution/value-add: Personality assessment is important for recruiting, selecting and developing employees. This study contributes to current knowledge about the early processes researchers follow when developing a personality instrument that, like the SAPI, measures personality fairly across different cultural groups.

  5. The item series of the Cognitive Screening Test compared with that of the Mini-Mental State Examination

    Schmand, B.; Deelman, B. G.; Hooijer, C.; Jonker, C.; Lindeboom, J.

    1996-01-01

    The items of the 'mini-mental state examination' (MMSE) and a Dutch dementia screening instrument, the 'cognitive screening test' (CST), as well as the 'geriatric mental status schedule' (GMS) and the 'Dutch adult reading test' (DART), were administered to 4051 elderly people aged 65 to 84 years.

  6. Review of multiple-choice questions and group performance - A comparison of face-to-face and virtual groups with and without facilitation [Gruppenleistungen beim Review von Multiple-Choice-Fragen - Ein Vergleich von face-to-face und virtuellen Gruppen mit und ohne Moderation]

    Schüttpelz-Brauns, Katrin

    2010-11-01

    Background: Multiple-choice questions (MCQs) are often used in medical education exams and need careful quality management, for example through review committees. This study investigates whether groups communicating virtually by email perform similarly to face-to-face groups in the review process, and whether a facilitator has positive effects. Methods: 16 small groups of psychology students were examined, which had to evaluate and correct MCQs under four different conditions. In the second part of the investigation, the revised questions were given to a new random sample for judgement of item quality. Results: There was no significant influence of the variables "form of review committee" and "facilitation". However, face-to-face and virtual groups clearly differed in the required processing times. The condition "face-to-face without facilitation" was generally rated most positively with respect to taking responsibility, approach to work, sense of well-being, motivation and concentration on the task. Discussion: Face-to-face and virtual groups are equally effective in reviewing MCQs but differ in efficiency. Electronic review appears feasible but is hardly recommendable because of the long processing time and technical problems.

  7. A comparative study of students' performance in preclinical physiology assessed by multiple choice and short essay questions.

    Oyebola, D D; Adewoye, O E; Iyaniwura, J O; Alada, A R; Fasanmade, A A; Raji, Y

    2000-01-01

    This study was designed to compare the performance of medical students in physiology when assessed by multiple choice questions (MCQs) and short essay questions (SEQs). The study also examined the influence of factors such as age, sex, O' level grades and JAMB scores on performance in the MCQs and SEQs. A structured questionnaire was administered to 264 medical students four months before the Part I MBBS examination. Apart from the personal data of each student, the questionnaire sought information on the JAMB scores and GCE O' level grades of each student in English Language, Biology, Chemistry, Physics and Mathematics. The physiology syllabus was divided into five parts and the students were administered separate tests on each part, each consisting of MCQs and SEQs. Performance in MCQs and SEQs was compared, and the effects of JAMB scores and GCE O' level grades on performance in both formats were assessed. The results showed that the students performed better in all MCQ tests than in the SEQs. JAMB scores and O' level English Language grade had no significant effect on students' performance in MCQs and SEQs; however, O' level grades in Biology, Chemistry, Physics and Mathematics had significant effects on performance in both. Inadequate knowledge of physiology and inability to present information in a logical sequence are believed to be major factors contributing to the poorer performance in the SEQs compared with the MCQs. In view of the significant association between performance in MCQs and SEQs and GCE O' level grades in science subjects and mathematics, it was recommended that both JAMB results and GCE results in the four O' level subjects above be considered when selecting candidates for admission into medical schools.

  8. Psychometric characteristics of Clinical Reasoning Problems (CRPs) and its correlation with routine multiple choice question (MCQ) in Cardiology department.

    Derakhshandeh, Zahra; Amini, Mitra; Kojuri, Javad; Dehbozorgian, Marziyeh

    2018-01-01

    Clinical reasoning is one of the most important skills in the process of training a medical student to become an efficient physician. Assessment of reasoning skills in a medical school program is important to direct students' learning. One of the tests for measuring clinical reasoning ability is the Clinical Reasoning Problems (CRPs) test. The major aim of this study was to measure the psychometric qualities of CRPs and to determine the correlation between this test and the routine MCQ in the cardiology department of Shiraz medical school. This descriptive study was conducted on all cardiology residents of Shiraz Medical School; the study population consisted of 40 residents in 2014. The routine CRPs and MCQ tests were designed on similar objectives and were carried out simultaneously. Reliability, item difficulty, item discrimination, and the correlation between each item and the total score of the CRPs were measured with Excel and SPSS software to check the psychometric properties of the CRPs test, and the correlation between the CRPs and MCQ tests was calculated. The mean differences in CRPs test score between residents' academic years (second, third and fourth year) were also evaluated by analysis of variance (one-way ANOVA) using SPSS software (version 20) (α = 0.05). The mean and standard deviation of the CRPs score was 10.19 ± 3.39 out of 20; for the MCQ it was 13.15 ± 3.81 out of 20. Item difficulty was in the range of 0.27-0.72; item discrimination was 0.30-0.75, with question No. 3 being the exception (0.24). The correlation between each item and the total score of the CRPs was 0.26-0.87; the correlation between the CRPs and MCQ tests was 0.68. The CRPs appears to be a suitable measure of clinical reasoning in residents and can be included in cardiology residency assessment programs.
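
The classical item statistics reported in this abstract (difficulty as proportion correct, discrimination, item-total relationships) are straightforward to compute. Below is a minimal sketch using an invented 0/1 response matrix, not the study's data; the upper-lower (27%) index used here is one common variant of the discrimination statistic, not necessarily the one the authors computed:

```python
def item_difficulty(responses, item):
    """Classical difficulty: proportion of examinees answering correctly."""
    col = [r[item] for r in responses]
    return sum(col) / len(col)

def item_discrimination(responses, item, top_frac=0.27):
    """Upper-lower index: p(top 27% by total score) - p(bottom 27%)."""
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * top_frac))
    top, bottom = ranked[:n], ranked[-n:]
    prop = lambda grp: sum(r[item] for r in grp) / len(grp)
    return prop(top) - prop(bottom)

# Invented 0/1 response matrix: rows = examinees, columns = items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

diff0 = item_difficulty(responses, 0)      # 4 of 6 correct
disc0 = item_discrimination(responses, 0)  # top group vs bottom group
```

On this toy matrix, item 0 has difficulty 4/6 and upper-lower discrimination 0.5, comfortably inside the ranges the study reports as acceptable.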

  9. Assessment of chromium(VI) release from 848 jewellery items by use of a diphenylcarbazide spot test

    Bregnbak, David; Johansen, Jeanne D.; Hamann, Dathan

    2016-01-01

    We recently evaluated and validated a diphenylcarbazide (DPC)-based screening spot test that can detect the release of chromium(VI) ions (≥0.5 ppm) from various metallic items and leather goods (1). We then screened a selection of metal screws, leather shoes and gloves, as well as 50 earrings, and identified chromium(VI) release from one earring. In the present study, we used the DPC spot test to assess chromium(VI) release in a much larger sample of jewellery items (n = 848), 160 (19%) of which had previously been shown to contain chromium when analysed with X-ray fluorescence spectroscopy (2).

  10. Development and validation of a theoretical test in basic laparoscopy

    Strandbygaard, Jeanett; Maagaard, Mathilde; Larsen, Christian Rifbjerg

    2013-01-01

    …for first-year residents in obstetrics and gynecology. This study therefore aimed to develop and validate a framework for a theoretical knowledge test, a multiple-choice test, in basic theory related to laparoscopy. METHODS: The content of the multiple-choice test was determined by conducting informal conversational interviews with experts in laparoscopy. The subsequent relevance of the test questions was evaluated using the Delphi method involving regional chief physicians. Construct validity was tested by comparing test results from three groups with expected different clinical competence and knowledge (…0.001). Internal consistency (Cronbach's alpha) was 0.82. There was no evidence of differential item functioning between the three groups tested. CONCLUSIONS: A newly developed knowledge test in basic laparoscopy proved to have content and construct validity. The formula for the development and validation…

  11. Development of an item bank and computer adaptive test for role functioning

    Anatchkova, Milena D; Rose, Matthias; Ware, John E

    2012-01-01

    Role functioning (RF) is a key component of health and well-being and an important outcome in health research. The aim of this study was to develop an item bank to measure impact of health on role functioning.

  12. Learning from peer feedback on student-generated multiple choice questions: Views of introductory physics students

    Kay, Alison E.; Hardy, Judy; Galloway, Ross K.

    2018-06-01

    PeerWise is an online application where students are encouraged to generate a bank of multiple choice questions for their classmates to answer. After answering a question, students can provide feedback to the question author about the quality of the question and the question author can respond to this. Student use of, and attitudes to, this online community within PeerWise was investigated in two large first year undergraduate physics courses, across three academic years, to explore how students interact with the system and the extent to which they believe PeerWise to be useful to their learning. Most students recognized that there is value in engaging with PeerWise, and many students engaged deeply with the system, thinking critically about the quality of their submissions and reflecting on feedback provided to them. Students also valued the breadth of topics and level of difficulty offered by the questions, recognized the revision benefits afforded by the resource, and were often willing to contribute to the community by providing additional explanations and engaging in discussion.

  13. Fostering dental student self-assessment of knowledge by confidence scoring of multiple-choice examinations.

    McMahan, C Alex; Pinckard, R Neal; Jones, Anne Cale; Hendricson, William D

    2014-12-01

    Creating a learning environment that fosters student acquisition of self-assessment behaviors and skills is critically important in the education and training of health professionals. Self-assessment is a vital component of competent practice and lifelong learning. This article proposes applying a version of confidence scoring of multiple-choice questions as one avenue to address this crucial educational objective for students to be able to recognize and admit what they do not know. The confidence scoring algorithm assigns one point for a correct answer, deducts fractional points for an incorrect answer, but rewards students fractional points for leaving the question unanswered in admission that they are unsure of the correct answer. The magnitude of the reward relative to the deduction is selected such that the expected gain due to random guessing, even after elimination of all but one distractor, is never greater than the reward. Curricular implementation of this confidence scoring algorithm should motivate health professions students to develop self-assessment behaviors and enable them to acquire the skills necessary to critically evaluate the extent of their current knowledge throughout their professional careers. This is a professional development competency that is emphasized in the educational standards of the Commission on Dental Accreditation (CODA).
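
The scoring rule this abstract describes can be sketched concretely. The numbers below (5-option items, a 1/4-point deduction for a wrong answer) and the function names are illustrative assumptions, not the authors' published parameters; the one property taken from the abstract is that the blank-reward must never be beaten, in expectation, by guessing among the k options that remain after eliminating distractors:

```python
def expected_guess_score(k, penalty):
    """Expected score when guessing uniformly among k remaining options:
    1/k chance of +1, (k-1)/k chance of -penalty."""
    return (1 - (k - 1) * penalty) / k

def min_safe_reward(n_options, penalty):
    """Smallest blank-reward that guessing cannot beat in expectation,
    for any number of eliminated distractors (worst case: a 2-way guess)."""
    return max(expected_guess_score(k, penalty)
               for k in range(2, n_options + 1))

def score(answer, key, penalty, reward):
    """Confidence score for one item; answer=None means left blank."""
    if answer is None:
        return reward
    return 1.0 if answer == key else -penalty

# Illustrative parameters: 5-option items, 1/4-point wrong-answer deduction.
penalty = 0.25
reward = min_safe_reward(5, penalty)   # worst case is the 2-way guess
```

With these numbers the minimum safe reward is (1 - 0.25)/2 = 0.375, so a student who has narrowed an item to two options still does better, in expectation, by admitting uncertainty than by guessing.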

  14. Automatic Generation System of Multiple-Choice Cloze Questions and its Evaluation

    Takuya Goto

    2010-09-01

    Since English expressions vary according to genre, it is important for students to study questions generated from sentences of the target genre. Although various questions are available, they are still not enough to satisfy the various genres students want to learn. On the other hand, producing English questions requires sufficient grammatical knowledge and vocabulary, so it is difficult for non-experts to prepare English questions by themselves. In this paper, we propose a system for the automatic generation of multiple-choice cloze questions from English texts. Because empirical knowledge is necessary to produce appropriate questions, machine learning is introduced to acquire knowledge from existing questions. To generate questions from texts automatically, the system (1) extracts sentences appropriate for questions from texts based on preference learning, (2) estimates a blank part based on a conditional random field, and (3) generates distractors based on statistical patterns of existing questions. Experimental results show our method is workable for selecting appropriate sentences and blank parts. Moreover, our method generates usable distractors, especially for sentences that do not contain proper nouns.
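
The three-stage pipeline this abstract summarizes (pick a sentence, blank out a word, propose distractors) can be caricatured in a few lines. The sketch below substitutes trivial heuristics and a hand-made confusion table for the paper's preference learning, CRF, and corpus statistics; the table and example sentence are invented for illustration:

```python
# Hypothetical distractor table standing in for corpus-derived patterns.
confusions = {
    "on": ["in", "at", "by"],
    "since": ["for", "during", "from"],
}

def make_cloze(sentence):
    """Blank the first word found in the confusion table and return
    (stem, sorted answer options, correct answer), or None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in confusions:
            stem = " ".join(words[:i] + ["____"] + words[i + 1:]) + "."
            options = [w] + confusions[w.lower()]
            return stem, sorted(options), w
    return None

stem, options, answer = make_cloze("The meeting is on Monday.")
```

Here `make_cloze` yields the stem "The meeting is ____ Monday." with options ["at", "by", "in", "on"]; the real system's contribution is learning, rather than hand-coding, each of these three decisions.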

  15. Ant system for reliability optimization of a series system with multiple-choice and budget constraints

    Nahas, Nabil; Nourelfath, Mustapha

    2005-01-01

    Many researchers have shown that the behavior of insect colonies can be seen as a natural model of collective problem solving. The analogy between the way ants look for food and combinatorial optimization problems has given rise to a new computational paradigm called the ant system. This paper presents an application of the ant system to a reliability optimization problem for a series system with multiple-choice constraints incorporated at each subsystem, maximizing the system reliability subject to the system budget. The problem is formulated as a nonlinear binary integer programming problem and characterized as NP-hard. It is solved by developing and demonstrating a problem-specific ant system algorithm, in which solutions of the reliability optimization problem are repeatedly constructed by considering a trace factor and a desirability factor. A local search is used to improve the quality of the solutions obtained by each ant, and a penalty factor is introduced to deal with the budget constraint. Simulations have shown that the proposed ant system is efficient with respect to the quality of solutions and the computing time.
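
The optimization problem described here (pick exactly one component version per subsystem, maximize the product of reliabilities within a budget) can be sketched with a bare-bones ant system. The component data, parameters, and the simple reject-infeasible budget handling below are illustrative assumptions, not the authors' formulation, which also includes a local search and an explicit penalty factor:

```python
import random

# Per subsystem: candidate (reliability, cost) versions; one must be chosen.
versions = [
    [(0.90, 2), (0.95, 4), (0.99, 7)],
    [(0.85, 1), (0.93, 3), (0.98, 6)],
    [(0.88, 2), (0.96, 5)],
]
BUDGET = 12

def reliability(choice):
    r = 1.0
    for s, v in enumerate(choice):
        r *= versions[s][v][0]
    return r

def cost(choice):
    return sum(versions[s][v][1] for s, v in enumerate(choice))

def ant_system(n_ants=20, n_iter=100, rho=0.1, seed=0):
    rng = random.Random(seed)
    tau = [[1.0] * len(opts) for opts in versions]   # pheromone trails
    best, best_r = None, -1.0
    for _ in range(n_iter):
        for _ in range(n_ants):
            # Each ant picks one version per subsystem, in proportion to
            # pheromone (trace) times a desirability factor (r / cost).
            choice = []
            for s, opts in enumerate(versions):
                w = [tau[s][v] * opts[v][0] / opts[v][1]
                     for v in range(len(opts))]
                choice.append(rng.choices(range(len(opts)), weights=w)[0])
            if cost(choice) > BUDGET:   # crude budget handling: reject
                continue
            r = reliability(choice)
            if r > best_r:
                best, best_r = choice, r
        # Evaporate all trails, then reinforce the best-so-far solution.
        for s in range(len(versions)):
            for v in range(len(tau[s])):
                tau[s][v] *= (1 - rho)
        if best is not None:
            for s, v in enumerate(best):
                tau[s][v] += best_r
    return best, best_r

best, best_r = ant_system()
```

On this toy instance the returned design always respects the multiple-choice structure (one version per subsystem) and the budget; the pheromone update is what steers later ants toward high-reliability regions.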

  16. The Use of Case Based Multiple Choice Questions for Assessing Large Group Teaching: Implications on Student's Learning

    Christina Donnelly

    2014-06-01

    The practice of assessment in third-level education is extremely important and a rarely disputed part of the university curriculum as a method of demonstrating a student's learning. However, assessments that test a student's knowledge and level of understanding are challenging to apply given recent trends showing that student numbers are increasing, student demographics are wide-ranging and resources are being stretched. As a result of these emerging challenges, lecturers are required to develop comprehensive assessments that effectively demonstrate student learning whilst efficiently managing large class sizes. One form of assessment which has been used for efficient assessment is the multiple-choice question (MCQ); however, this method has been criticised for encouraging surface learning in comparison to other methods such as essays or case studies. This research explores the impact of blended assessment methods on student learning. The study adopts a rigorous three-stage qualitative methodology to capture third-level lecturers' and students' perceptions of (1) the level of learning when using MCQs and (2) the level of learning when blended assessment in the form of case-based MCQs is used. The findings illuminate the positive impact of case-based MCQs, as students and lecturers suggest that they lead to a higher level of learning and deeper information processing than MCQs without case studies. The implications of this research are that this type of assessment contributes to current thinking in the literature on the use and blending of assessment methods to reach a higher level of learning. It further reinforces the belief that assessments are the greatest influence on students' learning, and the requirement for both universities and lecturers to reflect on the best form of assessment to test students' level of understanding whilst also balancing the real challenges of…

  17. The Question Complexity Rubric: Development and Application for a National Archive of Astro 101 Multiple-Choice Questions

    Cormier, Sebastien; Prather, E. E.; Brissenden, G.; Collaboration of Astronomy Teaching Scholars CATS

    2011-01-01

    For the last two years we have been developing an online national archive of multiple-choice questions for use in the Astro 101 classroom. These questions are intended to either supplement an instructor's implementation of Think-Pair-Share or be used for assessment purposes (i.e. exams and homework). In this talk we will describe the development, testing and implementation of the Question Complexity Rubric (QCR), which is designed to guide the ranking of questions in this archive based on their conceptual complexity. Using the QCR, a score is assigned to differentiate each question based on the cognitive steps necessary to comprehensively explain the reasoning pathway to the correct answer. The lowest QCR score is given to questions with a reasoning pathway requiring only declarative knowledge whereas the highest QCR score is given to questions that require multiple pathways of multi-step reasoning. When completed, the online question archive will provide users with the utility to 1) search for and download questions based on subject and average QCR score, 2) use the QCR to score questions, and 3) add their own questions to the archive. We will also discuss other potential applications of the QCR, such as how it informs our work in developing and testing of survey instruments by allowing us to calibrate the range of question complexity. This material is based upon work supported by the National Science Foundation under Grant No. 0715517, a CCLI Phase III Grant for the Collaboration of Astronomy Teaching Scholars (CATS). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

  18. A Multidimensional Partial Credit Model with Associated Item and Test Statistics: An Application to Mixed-Format Tests

    Yao, Lihua; Schwarz, Richard D.

    2006-01-01

    Multidimensional item response theory (IRT) models have been proposed for better understanding the dimensional structure of data or to define diagnostic profiles of student learning. A compensatory multidimensional two-parameter partial credit model (M-2PPC) for constructed-response items is presented that is a generalization of those proposed to…

  19. An Australian Study Comparing the Use of Multiple-Choice Questionnaires with Assignments as Interim, Summative Law School Assessment

    Huang, Vicki

    2017-01-01

    To the author's knowledge, this is the first Australian study to empirically compare the use of a multiple-choice questionnaire (MCQ) with the use of a written assignment for interim, summative law school assessment. This study also surveyed the same student sample as to what types of assessments are preferred and why. In total, 182 undergraduate…

  20. Using a Fine-Grained Multiple-Choice Response Format in Educational Drill-and-Practice Video Games

    Beserra, Vagner; Nussbaum, Miguel; Grass, Antonio

    2017-01-01

    When using educational video games, particularly drill-and-practice video games, there are several ways of providing an answer to a quiz. The majority of paper-based options can be classified as being either multiple-choice or constructed-response. Therefore, in the process of creating an educational drill-and-practice video game, one fundamental…

  1. The Multiple-Choice Model: Some Solutions for Estimation of Parameters in the Presence of Omitted Responses

    Abad, Francisco J.; Olea, Julio; Ponsoda, Vicente

    2009-01-01

    This article deals with some of the problems that have hindered the application of Samejima's and Thissen and Steinberg's multiple-choice models: (a) parameter estimation difficulties owing to the large number of parameters involved, (b) parameter identifiability problems in the Thissen and Steinberg model, and (c) their treatment of omitted…

  2. Differential Item Functioning in While-Listening Performance Tests: The Case of the International English Language Testing System (IELTS) Listening Module

    Aryadoust, Vahid

    2012-01-01

    This article investigates a version of the International English Language Testing System (IELTS) listening test for evidence of differential item functioning (DIF) based on gender, nationality, age, and degree of previous exposure to the test. Overall, the listening construct was found to be underrepresented, which is probably an important cause…

  3. An evaluation of computerized adaptive testing for general psychological distress: combining GHQ-12 and Affectometer-2 in an item bank for public mental health research.

    Stochl, Jan; Böhnke, Jan R; Pickett, Kate E; Croudace, Tim J

    2016-05-20

    Recent developments in psychometric modeling and technology allow pooling well-validated items from existing instruments into larger item banks and their deployment through methods of computerized adaptive testing (CAT). Use of item response theory-based bifactor methods and integrative data analysis overcomes barriers in cross-instrument comparison. This paper presents the joint calibration of an item bank for researchers keen to investigate population variations in general psychological distress (GPD). Multidimensional item response theory was used on existing health survey data from the Scottish Health Education Population Survey (n = 766) to calibrate an item bank consisting of pooled items from the short common mental disorder screen (GHQ-12) and the Affectometer-2 (a measure of "general happiness"). Computer simulation was used to evaluate usefulness and efficacy of its adaptive administration. A bifactor model capturing variation across a continuum of population distress (while controlling for artefacts due to item wording) was supported. The numbers of items for different required reliabilities in adaptive administration demonstrated promising efficacy of the proposed item bank. Psychometric modeling of the common dimension captured by more than one instrument offers the potential of adaptive testing for GPD using individually sequenced combinations of existing survey items. The potential for linking other item sets with alternative candidate measures of positive mental health is discussed since an optimal item bank may require even more items than these.
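
A toy version of adaptive administration helps make the simulation idea in this abstract concrete. The sketch below uses a unidimensional 2PL item bank with invented parameters, maximum-information item selection, and a grid-based EAP ability update; the study's bifactor calibration and wording-artefact controls are deliberately omitted, and all names are this sketch's own:

```python
import math
import random

# Invented 2PL bank: (discrimination a, difficulty b) per item.
bank = [(round(random.Random(i).uniform(0.8, 2.0), 2),
         round(random.Random(i + 99).uniform(-2, 2), 2))
        for i in range(30)]

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1 - p)          # 2PL Fisher information

def eap(responses):
    """Expected a posteriori theta on a grid, standard normal prior."""
    grid = [g / 10 for g in range(-40, 41)]
    post = []
    for t in grid:
        like = math.exp(-t * t / 2)
        for (a, b), u in responses:
            p = p_correct(t, a, b)
            like *= p if u else (1 - p)
        post.append(like)
    z = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / z

def run_cat(true_theta, n_items=10, seed=1):
    """Simulate one adaptive administration for a simulee at true_theta."""
    rng = random.Random(seed)
    theta, used, responses = 0.0, set(), []
    for _ in range(n_items):
        # Pick the unused item most informative at the current estimate.
        i = max((j for j in range(len(bank)) if j not in used),
                key=lambda j: info(theta, *bank[j]))
        used.add(i)
        u = rng.random() < p_correct(true_theta, *bank[i])
        responses.append((bank[i], u))
        theta = eap(responses)
    return theta
```

Repeating `run_cat` over many simulees at known abilities is the kind of computer simulation the study used to judge how many adaptively chosen items are needed for a target reliability.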

  4. Introduction to Psychology and Leadership. Part Nine; Morale and Esprit De Corps. Progress Check. Test Item Pool. Segments I & II.

    Westinghouse Learning Corp., Annapolis, MD.

    Test items for the introduction to psychology and leadership course (see the final reports which summarize the course development project, EM 010 418, EM 010 419, and EM 010 484) which were compiled as part of the project documentation and which are coordinated with the text-workbook on morale and esprit de corps (EM 010 439, EM 010 440, and EM…

  5. Probabilistic Approaches to Examining Linguistic Features of Test Items and Their Effect on the Performance of English Language Learners

    Solano-Flores, Guillermo

    2014-01-01

    This article addresses validity and fairness in the testing of English language learners (ELLs)--students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that…

  6. An Analysis of Cross Racial Identity Scale Scores Using Classical Test Theory and Rasch Item Response Models

    Sussman, Joshua; Beaujean, A. Alexander; Worrell, Frank C.; Watson, Stevie

    2013-01-01

    Item response models (IRMs) were used to analyze Cross Racial Identity Scale (CRIS) scores. Rasch analysis scores were compared with classical test theory (CTT) scores. The partial credit model demonstrated a high goodness of fit, and correlations between Rasch and CTT scores ranged from 0.91 to 0.99. CRIS scores are supported by both methods.

  7. Development of a Postacute Hospital Item Bank for the New Pediatric Evaluation of Disability Inventory-Computer Adaptive Test

    Dumas, Helene M.

    2010-01-01

    The PEDI-CAT is a new computer adaptive test (CAT) version of the Pediatric Evaluation of Disability Inventory (PEDI). Additional PEDI-CAT items specific to postacute pediatric hospital care were recently developed using expert reviews and cognitive interviewing techniques. Expert reviews established face and construct validity, providing positive…

  8. Effectiveness of Item Response Theory (IRT) Proficiency Estimation Methods under Adaptive Multistage Testing. Research Report. ETS RR-15-11

    Kim, Sooyeon; Moses, Tim; Yoo, Hanwook Henry

    2015-01-01

    The purpose of this inquiry was to investigate the effectiveness of item response theory (IRT) proficiency estimators in terms of estimation bias and error under multistage testing (MST). We chose a 2-stage MST design in which 1 adaptation to the examinees' ability levels takes place. It includes 4 modules (1 at Stage 1, 3 at Stage 2) and 3 paths…

  9. Examining Construct Congruence for Psychometric Tests: A Note on an Extension to Binary Items and Nesting Effects

    Raykov, Tenko; Marcoulides, George A.; Dimitrov, Dimiter M.; Li, Tatyana

    2018-01-01

    This article extends the procedure outlined in the article by Raykov, Marcoulides, and Tong for testing congruence of latent constructs to the setting of binary items and clustering effects. In this widely used setting in contemporary educational and psychological research, the method can be used to examine if two or more homogeneous…

  10. Biological Science: An Ecological Approach. BSCS Green Version. Teacher's Resource Book and Test Item Bank. Sixth Edition.

    Biological Sciences Curriculum Study, Colorado Springs.

    This book consists of four sections: (1) "Supplemental Materials"; (2) "Supplemental Investigations"; (3) "Test Item Bank"; and (4) "Blackline Masters." The first section provides additional background material related to selected chapters and investigations in the student book. Included are a periodic table of the elements, genetics problems and…

  11. Threats to Validity When Using Open-Ended Items in International Achievement Studies: Coding Responses to the PISA 2012 Problem-Solving Test in Finland

    Arffman, Inga

    2016-01-01

    Open-ended (OE) items are widely used to gather data on student performance in international achievement studies. However, several factors may threaten validity when using such items. This study examined Finnish coders' opinions about threats to validity when coding responses to OE items in the PISA 2012 problem-solving test. A total of 6…

  12. Effect of Item Response Theory (IRT) Model Selection on Testlet-Based Test Equating. Research Report. ETS RR-14-19

    Cao, Yi; Lu, Ru; Tao, Wei

    2014-01-01

    The local item independence assumption underlying traditional item response theory (IRT) models is often not met for tests composed of testlets. There are 3 major approaches to addressing this issue: (a) ignore the violation and use a dichotomous IRT model (e.g., the 2-parameter logistic [2PL] model), (b) combine the interdependent items to form a…

  13. Testing the Item-Order Account of Design Effects Using the Production Effect

    Jonker, Tanya R.; Levene, Merrick; MacLeod, Colin M.

    2014-01-01

    A number of memory phenomena evident in recall in within-subject, mixed-lists designs are reduced or eliminated in between-subject, pure-list designs. The item-order account (McDaniel & Bugg, 2008) proposes that differential retention of order information might underlie this pattern. According to this account, order information may be encoded…

  14. Detecting intrajudge inconsistency in standard setting using test items with a selected-response format

    van der Linden, Willem J.; Vos, Hendrik J.; Chang, Lei

    2002-01-01

    In judgmental standard setting experiments, it may be difficult to specify subjective probabilities that adequately take the properties of the items into account. As a result, these probabilities are not consistent with each other in the sense that they do not refer to the same borderline level of

  15. Design of Web Questionnaires : A Test for Number of Items per Screen

    Toepoel, V.; Das, J.W.M.; van Soest, A.H.O.

    2005-01-01

    This paper presents results from an experimental manipulation of a one- versus multiple-items-per-screen format in a Web survey. The purpose of the experiment was to find out if a questionnaire's format influences how respondents provide answers in online questionnaires and if this depends on…

  16. Re-Examining Test Item Issues in the TIMSS Mathematics and Science Assessments

    Wang, Jianjun

    2011-01-01

As the largest international study ever undertaken, the Trends in International Mathematics and Science Study (TIMSS) has been held as a benchmark to measure U.S. student performance in the global context. In-depth analyses of the TIMSS project are conducted in this study to examine key issues of the comparative investigation: (1) item flaws in mathematics…

  17. Higher-Order Asymptotics and Its Application to Testing the Equality of the Examinee Ability Over Two Sets of Items.

    Sinharay, Sandip; Jensen, Jens Ledet

    2018-06-27

    In educational and psychological measurement, researchers and/or practitioners are often interested in examining whether the ability of an examinee is the same over two sets of items. Such problems can arise in measurement of change, detection of cheating on unproctored tests, erasure analysis, detection of item preknowledge, etc. Traditional frequentist approaches that are used in such problems include the Wald test, the likelihood ratio test, and the score test (e.g., Fischer, Appl Psychol Meas 27:3-26, 2003; Finkelman, Weiss, & Kim-Kang, Appl Psychol Meas 34:238-254, 2010; Glas & Dagohoy, Psychometrika 72:159-180, 2007; Guo & Drasgow, Int J Sel Assess 18:351-364, 2010; Klauer & Rettig, Br J Math Stat Psychol 43:193-206, 1990; Sinharay, J Educ Behav Stat 42:46-68, 2017). This paper shows that approaches based on higher-order asymptotics (e.g., Barndorff-Nielsen & Cox, Inference and asymptotics. Springer, London, 1994; Ghosh, Higher order asymptotics. Institute of Mathematical Statistics, Hayward, 1994) can also be used to test for the equality of the examinee ability over two sets of items. The modified signed likelihood ratio test (e.g., Barndorff-Nielsen, Biometrika 73:307-322, 1986) and the Lugannani-Rice approximation (Lugannani & Rice, Adv Appl Prob 12:475-490, 1980), both of which are based on higher-order asymptotics, are shown to provide some improvement over the traditional frequentist approaches in three simulations. Two real data examples are also provided.
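Of the traditional frequentist approaches listed, the Wald test is the simplest to illustrate: given ability estimates and standard errors from the two item sets, the statistic is their standardized difference. This is a generic sketch, not the authors' implementation, and the numbers are invented:

```python
import math

def wald_statistic(theta1, se1, theta2, se2):
    """Wald test of H0: the examinee's ability is equal over the two
    item sets; compare |z| to a standard normal critical value."""
    return (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2)

z = wald_statistic(1.0, 0.3, 0.2, 0.4)  # z = 0.8 / 0.5 = 1.6
print(abs(z) > 1.96)  # not significant at the 5% level → False
```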

  18. Psychometric evaluation of an item bank for computerized adaptive testing of the EORTC QLQ-C30 cognitive functioning dimension in cancer patients

    Dirven, Linda; Groenvold, Mogens; Taphoorn, Martin J. B.

    2017-01-01

… on the field-testing and psychometric evaluation of the item bank for cognitive functioning (CF). METHODS: In previous phases (I-III), 44 candidate items were developed measuring CF in cancer patients. In phase IV, these items were psychometrically evaluated in a large sample of international cancer patients … model, showing an acceptable fit. Although several items showed DIF, these had a negligible impact on CF estimation. Measurement precision of the item bank was much higher than the two original QLQ-C30 CF items alone, across the whole continuum. Moreover, CAT measurement may on average reduce study sample sizes by about 35-40% compared to the original QLQ-C30 CF scale, without loss of power. CONCLUSION: A CF item bank for CAT measurement consisting of 34 items was established, applicable to various cancer patients across countries. This CAT measurement system will facilitate precise and efficient …

  19. Psychometric evaluation of an item bank for computerized adaptive testing of the EORTC QLQ-C30 cognitive functioning dimension in cancer patients.

    Dirven, Linda; Groenvold, Mogens; Taphoorn, Martin J B; Conroy, Thierry; Tomaszewski, Krzysztof A; Young, Teresa; Petersen, Morten Aa

    2017-11-01

The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is developing computerized adaptive testing (CAT) versions of all EORTC Quality of Life Questionnaire (QLQ-C30) scales with the aim to enhance measurement precision. Here we present the results on the field-testing and psychometric evaluation of the item bank for cognitive functioning (CF). In previous phases (I-III), 44 candidate items were developed measuring CF in cancer patients. In phase IV, these items were psychometrically evaluated in a large sample of international cancer patients. This evaluation included an assessment of dimensionality, fit to the item response theory (IRT) model, differential item functioning (DIF), and measurement properties. A total of 1030 cancer patients completed the 44 candidate items on CF. Of these, 34 items could be included in a unidimensional IRT model, showing an acceptable fit. Although several items showed DIF, these had a negligible impact on CF estimation. Measurement precision of the item bank was much higher than the two original QLQ-C30 CF items alone, across the whole continuum. Moreover, CAT measurement may on average reduce study sample sizes by about 35-40% compared to the original QLQ-C30 CF scale, without loss of power. A CF item bank for CAT measurement consisting of 34 items was established, applicable to various cancer patients across countries. This CAT measurement system will facilitate precise and efficient assessment of HRQOL of cancer patients, without loss of comparability of results.

  20. The six-item Clock Drawing Test – reliability and validity in mild Alzheimer’s disease

    Jørgensen, Kasper; Kristensen, Maria K; Waldemar, Gunhild

    2015-01-01

This study presents a reliable, short and practical version of the Clock Drawing Test (CDT) for clinical use and examines its diagnostic accuracy in mild Alzheimer's disease versus elderly nonpatients. Clock drawings from 231 participants were scored independently by four clinical neuropsychologists blind to diagnostic classification. The interrater agreement of individual scoring criteria was analyzed and items with poor or moderate reliability were excluded. The classification accuracy of the resulting scoring system - the six-item CDT - was examined. We explored the effect of further …

  1. The effect of heightened awareness of observation on consumption of a multi-item laboratory test meal in females.

    Robinson, Eric; Proctor, Michael; Oldham, Melissa; Masic, Una

    2016-09-01

    Human eating behaviour is often studied in the laboratory, but whether the extent to which a participant believes that their food intake is being measured influences consumption of different meal items is unclear. Our main objective was to examine whether heightened awareness of observation of food intake affects consumption of different food items during a lunchtime meal. One hundred and fourteen female participants were randomly assigned to an experimental condition designed to heighten participant awareness of observation or a condition in which awareness of observation was lower, before consuming an ad libitum multi-item lunchtime meal in a single session study. Under conditions of heightened awareness, participants tended to eat less of an energy dense snack food (cookies) in comparison to the less aware condition. Consumption of other meal items and total energy intake were similar in the heightened awareness vs. less aware condition. Exploratory secondary analyses suggested that the effect heightened awareness had on reduced cookie consumption was dependent on weight status, as well as trait measures of dietary restraint and disinhibition, whereby only participants with overweight/obesity, high disinhibition or low restraint reduced their cookie consumption. Heightened awareness of observation may cause females to reduce their consumption of an energy dense snack food during a test meal in the laboratory and this effect may be moderated by participant individual differences. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

  2. Generalization of the Lord-Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real-Number Item Scores

    Kim, Seonghoon

    2013-01-01

    With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item…
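The Lord-Wingersky recursion for dichotomous items is compact enough to sketch: given each item's probability of a correct response at a fixed proficiency, it builds the number-correct score distribution one item at a time. This illustrates the classic integer-score algorithm, not the article's real-number generalization:

```python
def lord_wingersky(probs):
    """Conditional distribution of the number-correct score, given
    per-item probabilities of success at a fixed proficiency level."""
    dist = [1.0]  # P(score = 0) before any items are added
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)  # item answered incorrectly
            new[score + 1] += mass * p      # item answered correctly
        dist = new
    return dist

print(lord_wingersky([0.5, 0.5]))  # → [0.25, 0.5, 0.25]
```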

  3. Re-Fitting for a Different Purpose: A Case Study of Item Writer Practices in Adapting Source Texts for a Test of Academic Reading

    Green, Anthony; Hawkey, Roger

    2012-01-01

    The important yet under-researched role of item writers in the selection and adaptation of texts for high-stakes reading tests is investigated through a case study involving a group of trained item writers working on the International English Language Testing System (IELTS). In the first phase of the study, participants were invited to reflect in…

  4. What Do You Think You Are Measuring? A Mixed-Methods Procedure for Assessing the Content Validity of Test Items and Theory-Based Scaling

    Koller, Ingrid; Levenson, Michael R.; Glück, Judith

    2017-01-01

    The valid measurement of latent constructs is crucial for psychological research. Here, we present a mixed-methods procedure for improving the precision of construct definitions, determining the content validity of items, evaluating the representativeness of items for the target construct, generating test items, and analyzing items on a theoretical basis. To illustrate the mixed-methods content-scaling-structure (CSS) procedure, we analyze the Adult Self-Transcendence Inventory, a self-report measure of wisdom (ASTI, Levenson et al., 2005). A content-validity analysis of the ASTI items was used as the basis of psychometric analyses using multidimensional item response models (N = 1215). We found that the new procedure produced important suggestions concerning five subdimensions of the ASTI that were not identifiable using exploratory methods. The study shows that the application of the suggested procedure leads to a deeper understanding of latent constructs. It also demonstrates the advantages of theory-based item analysis. PMID:28270777

  5. Development of an item bank for the EORTC Role Functioning Computer Adaptive Test (EORTC RF-CAT)

    Gamper, Eva-Maria; Petersen, Morten Aa.; Aaronson, Neil

    2016-01-01

… a computer-adaptive test (CAT) for RF. This was part of a larger project whose objective is to develop a CAT version of the EORTC QLQ-C30, which is one of the most widely used HRQOL instruments in oncology. METHODS: In accordance with EORTC guidelines, the development of the RF-CAT comprised four phases … with good psychometric properties. The resulting item bank exhibits excellent reliability (mean reliability = 0.85, median = 0.95). Using the RF-CAT may allow sample size savings from 11% up to 50% compared to using the QLQ-C30 RF scale. CONCLUSIONS: The RF-CAT item bank improves the precision …

  6. Tests of the validity of a model relating frequency of contaminated items and increasing radiation dose

    Tallentire, A.; Khan, A.A.

    1975-01-01

The Co-60 radiation response of Bacillus pumilus E601 spores has been characterized in a laboratory test system. The suitability of test vessels to act as both containers for irradiation and culture vessels in sterility testing has been checked. Tests have been done with these spores to verify assumptions basic to the general model described in a previous paper. First measurements indicate that the model holds for this laboratory test system. (author)

  7. The differential item functioning and structural equivalence of a nonverbal cognitive ability test for five language groups

    Pieter Schaap

    2011-10-01

Research purpose: The aim of the study was to determine the differential item functioning (DIF) and structural equivalence of a nonverbal cognitive ability test (the PiB/SpEEx Observance test [401]) for five South African language groups. Motivation for study: Tests that are sensitive to culture and language group can lead to unfair discrimination, which is a contentious workplace issue in South Africa today. Misconceptions about psychometric testing in industry can cause tests to lose credibility if industries do not use a scientifically sound test-by-test evaluation approach. Research design, approach and method: The researcher used a quasi-experimental design and factor analytic and logistic regression techniques to meet the research aims. The study used a convenience sample drawn from industry and an educational institution. Main findings: The main findings of the study show structural equivalence of the test at a holistic level and nonsignificant DIF effect sizes for most of the comparisons that the researcher made. Practical/managerial implications: This research shows that the PIB/SpEEx Observance Test (401) is not completely language insensitive. One should rather see it as a language-reduced test when people from different language groups need testing. Contribution/value-add: The findings provide supporting evidence that nonverbal cognitive tests are plausible alternatives to verbal tests when one compares people from different language groups.

  8. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination

    Bongyeun Koh

    2016-01-01

Purpose: The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. Methods: The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. Results: In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). Conclusion: In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.

  9. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination.

    Koh, Bongyeun; Hong, Sunggi; Kim, Soon-Sim; Hyun, Jin-Sook; Baek, Milye; Moon, Jundong; Kwon, Hayran; Kim, Gyoungyong; Min, Seonggi; Kang, Gu-Hyun

    2016-01-01

    The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.
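The classical difficulty index analyzed in these studies is simply the proportion of examinees who pass an item (higher values mean easier items). A minimal sketch with invented response data:

```python
def difficulty_index(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

# Invented pass/fail records for one skills-test item.
item_responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(difficulty_index(item_responses))  # → 0.7
```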

  10. Developing Testing Accommodations for English Language Learners: Illustrations as Visual Supports for Item Accessibility

    Solano-Flores, Guillermo; Wang, Chao; Kachchaf, Rachel; Soltero-Gonzalez, Lucinda; Nguyen-Le, Khanh

    2014-01-01

    We address valid testing for English language learners (ELLs)--students in the United States who are schooled in English while they are still acquiring English as a second language. Also, we address the need for procedures for systematically developing ELL testing accommodations--changes in tests intended to support ELLs to gain access to the…

  11. On the issue of item selection in computerized adaptive testing with response times

    Veldkamp, Bernard P.

    2016-01-01

    Many standardized tests are now administered via computer rather than paper-and-pencil format. The computer-based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed
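A common baseline for the item-selection problem discussed here is maximum Fisher information under a 2PL model: at each step, administer the unused item that is most informative at the current ability estimate. This is a generic sketch with invented item parameters, not the chapter's response-time-based method:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, pool, administered):
    """Pick the not-yet-administered item with maximum information."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta_hat, *pool[i]))

pool = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.1)]  # (a, b) per item, invented
print(select_next_item(0.0, pool, administered=set()))  # → 1
```

The item with difficulty nearest the ability estimate and the highest discrimination wins, which is what drives the adaptivity described above.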

  12. Item-saving assessment of self-care performance in children with developmental disabilities: A prospective caregiver-report computerized adaptive test

    Chen, Cheng-Te; Chen, Yu-Lan; Lin, Yu-Ching; Hsieh, Ching-Lin; Tzeng, Jeng-Yi

    2018-01-01

    Objective The purpose of this study was to construct a computerized adaptive test (CAT) for measuring self-care performance (the CAT-SC) in children with developmental disabilities (DD) aged from 6 months to 12 years in a content-inclusive, precise, and efficient fashion. Methods The study was divided into 3 phases: (1) item bank development, (2) item testing, and (3) a simulation study to determine the stopping rules for the administration of the CAT-SC. A total of 215 caregivers of children with DD were interviewed with the 73-item CAT-SC item bank. An item response theory model was adopted for examining the construct validity to estimate item parameters after investigation of the unidimensionality, equality of slope parameters, item fitness, and differential item functioning (DIF). In the last phase, the reliability and concurrent validity of the CAT-SC were evaluated. Results The final CAT-SC item bank contained 56 items. The stopping rules suggested were (a) reliability coefficient greater than 0.9 or (b) 14 items administered. The results of simulation also showed that 85% of the estimated self-care performance scores would reach a reliability higher than 0.9 with a mean test length of 8.5 items, and the mean reliability for the rest was 0.86. Administering the CAT-SC could reduce the number of items administered by 75% to 84%. In addition, self-care performances estimated by the CAT-SC and the full item bank were very similar to each other (Pearson r = 0.98). Conclusion The newly developed CAT-SC can efficiently measure self-care performance in children with DD whose performances are comparable to those of TD children aged from 6 months to 12 years as precisely as the whole item bank. The item bank of the CAT-SC has good reliability and a unidimensional self-care construct, and the CAT can estimate self-care performance with less than 25% of the items in the item bank. 
Therefore, the CAT-SC could be useful for measuring self-care performance in children with
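The suggested stopping rules translate directly into a termination check: stop once estimated reliability exceeds 0.9 or 14 items have been administered. A sketch, assuming reliability is approximated as 1 - SE² on a unit-variance ability scale:

```python
def should_stop(se_theta, n_administered, rel_target=0.90, max_items=14):
    """CAT termination check mirroring the two stopping rules:
    reliability above the target, or maximum test length reached."""
    reliability = 1.0 - se_theta ** 2
    return reliability >= rel_target or n_administered >= max_items

print(should_stop(se_theta=0.30, n_administered=8))  # reliability ≈ 0.91 → True
print(should_stop(se_theta=0.40, n_administered=8))  # reliability ≈ 0.84 → False
```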

  13. Results of wholesomeness test on basic plan of research and development of food irradiation (7 items)

    Furuya, Tsuyoshi

    1989-01-01

    Twenty years have elapsed since the general research on food irradiation was begun in Japan as the new technology for food preservation, and the research on the wholesomeness of irradiated foods has been carried out in wide range together with the research on irradiation effect, irradiation techniques and economical efficiency. The wholesomeness of irradiated foods includes chronic toxicity including carcinogenic property in the continuous intake for long period, the effect to reproduction function over many generations and the possibility of giving hereditary injury to cells, the nutritional adequacy required for the sustenance of life and the increase of health, and microbiological safety. In Japan, the research on food irradiation was designated as an atomic energy specific general research, and as the objects of research, potato and onion for the prevention of germination, rice and wheat for the protection from noxious insects, fish paste products, wienerwurst and mandarin orange for sterilization were selected. For the irradiation, Co-60 gamma ray was used except the case of mandarin orange using electron beam. The research on all 7 items was finished, and the irradiation of potato was permitted. (K.I.)

  14. Does Correct Answer Distribution Influence Student Choices When Writing Multiple Choice Examinations?

    Jacqueline A. Carnegie

    2017-03-01

Summative evaluation for large classes of first- and second-year undergraduate courses often involves the use of multiple choice question (MCQ) exams in order to provide timely feedback. Several versions of those exams are often prepared via computer-based question scrambling in an effort to deter cheating. An important parameter to consider when preparing multiple exam versions is that they must be equivalent in their assessment of student knowledge. This project investigated a possible influence of correct answer organization on student answer selection when writing multiple versions of MCQ exams. The specific question asked was whether the existence of a series of four to five consecutive MCQs in which the same letter represented the correct answer had a detrimental influence on a student’s ability to continue to select the correct answer as he/she moved through that series. Student outcomes from such exams were compared with results from exams with identical questions but which did not contain such series. These findings were supplemented by student survey data in which students self-assessed the extent to which they paid attention to the distribution of correct answer choices when writing summative exams, both during their initial answer selection and when transferring their answer letters to the Scantron sheet for correction. Despite the fact that more than half of survey respondents indicated that they do make note of answer patterning during exams and that a series of four to five questions with the same letter for the correct answer would encourage many of them to take a second look at their answer choice, the results pertaining to student outcomes suggest that MCQ randomization, even when it does result in short serial arrays of letter-specific correct answers, does not constitute a distraction capable of adversely influencing student performance.

  15. Specificity and false positive rates of the Test of Memory Malingering, Rey 15-item Test, and Rey Word Recognition Test among forensic inpatients with intellectual disabilities.

    Love, Christopher M; Glassmire, David M; Zanolini, Shanna Jordan; Wolf, Amanda

    2014-10-01

    This study evaluated the specificity and false positive (FP) rates of the Rey 15-Item Test (FIT), Word Recognition Test (WRT), and Test of Memory Malingering (TOMM) in a sample of 21 forensic inpatients with mild intellectual disability (ID). The FIT demonstrated an FP rate of 23.8% with the standard quantitative cutoff score. Certain qualitative error types on the FIT showed promise and had low FP rates. The WRT obtained an FP rate of 0.0% with previously reported cutoff scores. Finally, the TOMM demonstrated low FP rates of 4.8% and 0.0% on Trial 2 and the Retention Trial, respectively, when applying the standard cutoff score. FP rates are reported for a range of cutoff scores and compared with published research on individuals diagnosed with ID. Results indicated that although the quantitative variables on the FIT had unacceptably high FP rates, the TOMM and WRT had low FP rates, increasing the confidence clinicians can place in scores reflecting poor effort on these measures during ID evaluations. © The Author(s) 2014.
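The reported false positive rates are simply the share of genuine responders flagged by a cutoff. A minimal sketch, with invented scores and an assumed "fail if below cutoff" direction:

```python
def false_positive_rate(scores, cutoff):
    """Proportion of honest examinees incorrectly flagged, i.e. those
    scoring below the cutoff on an effort measure (1 - specificity)."""
    flagged = sum(1 for s in scores if s < cutoff)
    return flagged / len(scores)

print(false_positive_rate([45, 50, 38, 49], cutoff=45))  # → 0.25
```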

  16. Measuring Cognitive Load in Test Items: Static Graphics versus Animated Graphics

    Dindar, M.; Kabakçi Yurdakul, I.; Inan Dönmez, F.

    2015-01-01

    The majority of multimedia learning studies focus on the use of graphics in learning process but very few of them examine the role of graphics in testing students' knowledge. This study investigates the use of static graphics versus animated graphics in a computer-based English achievement test from a cognitive load theory perspective. Three…

  17. Explanatory Item Response Modeling of Children's Change on a Dynamic Test of Analogical Reasoning

    Stevenson, Claire E.; Hickendorff, Marian; Resing, Wilma C. M.; Heiser, Willem J.; de Boeck, Paul A. L.

    2013-01-01

    Dynamic testing is an assessment method in which training is incorporated into the procedure with the aim of gauging cognitive potential. Large individual differences are present in children's ability to profit from training in analogical reasoning. The aim of this experiment was to investigate sources of these differences on a dynamic test of…

  18. How to Reason with Economic Concepts: Cognitive Process of Japanese Undergraduate Students Solving Test Items

    Asano, Tadayoshi; Yamaoka, Michio

    2015-01-01

    The authors administered a Japanese version of the Test of Understanding in College Economics, the fourth edition (TUCE-4) to assess the economic literacy of Japanese undergraduate students in 2006 and 2009. These two test results were combined to investigate students' cognitive process or reasoning with specific economic concepts and principles…

  19. Applying Item Response Theory to the Development of a Screening Adaptation of the Goldman-Fristoe Test of Articulation-Second Edition

    Brackenbury, Tim; Zickar, Michael J.; Munson, Benjamin; Storkel, Holly L.

    2017-01-01

    Purpose: Item response theory (IRT) is a psychometric approach to measurement that uses latent trait abilities (e.g., speech sound production skills) to model performance on individual items that vary by difficulty and discrimination. An IRT analysis was applied to preschoolers' productions of the words on the Goldman-Fristoe Test of…

  20. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement.

    McInnes, Matthew D F; Moher, David; Thombs, Brett D; McGrath, Trevor A; Bossuyt, Patrick M; Clifford, Tammy; Cohen, Jérémie F; Deeks, Jonathan J; Gatsonis, Constantine; Hooft, Lotty; Hunt, Harriet A; Hyde, Christopher J; Korevaar, Daniël A; Leeflang, Mariska M G; Macaskill, Petra; Reitsma, Johannes B; Rodin, Rachel; Rutjes, Anne W S; Salameh, Jean-Paul; Stevens, Adrienne; Takwoingi, Yemisi; Tonelli, Marcello; Weeks, Laura; Whiting, Penny; Willis, Brian H

    2018-01-23

    Systematic reviews of diagnostic test accuracy synthesize data from primary diagnostic studies that have evaluated the accuracy of 1 or more index tests against a reference standard, provide estimates of test performance, allow comparisons of the accuracy of different tests, and facilitate the identification of sources of variability in test accuracy. To develop the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagnostic test accuracy guideline as a stand-alone extension of the PRISMA statement. Modifications to the PRISMA statement reflect the specific requirements for reporting of systematic reviews and meta-analyses of diagnostic test accuracy studies and the abstracts for these reviews. Established standards from the Enhancing the Quality and Transparency of Health Research (EQUATOR) Network were followed for the development of the guideline. The original PRISMA statement was used as a framework on which to modify and add items. A group of 24 multidisciplinary experts used a systematic review of articles on existing reporting guidelines and methods, a 3-round Delphi process, a consensus meeting, pilot testing, and iterative refinement to develop the PRISMA diagnostic test accuracy guideline. The final version of the PRISMA diagnostic test accuracy guideline checklist was approved by the group. The systematic review (produced 64 items) and the Delphi process (provided feedback on 7 proposed items; 1 item was later split into 2 items) identified 71 potentially relevant items for consideration. The Delphi process reduced these to 60 items that were discussed at the consensus meeting. Following the meeting, pilot testing and iterative feedback were used to generate the 27-item PRISMA diagnostic test accuracy checklist. To reflect specific or optimal contemporary systematic review methods for diagnostic test accuracy, 8 of the 27 original PRISMA items were left unchanged, 17 were modified, 2 were added, and 2 were omitted. 
The 27-item

  1. Algorithms for the Construction of Parallel Tests by Zero-One Programming. Project Psychometric Aspects of Item Banking No. 7. Research Report 86-7.

    Boekkooi-Timminga, Ellen

    Nine methods for automated test construction are described. All are based on the concepts of information from item response theory. Two general kinds of methods for the construction of parallel tests are presented: (1) sequential test design; and (2) simultaneous test design. Sequential design implies that the tests are constructed one after the…

  2. The 40-item Monell Extended Sniffin' Sticks Identification Test (MONEX-40)

    Freiherr, J.; Gordon, A.R.; Alden, E.C.; Ponting, A.L.; Hernandez, M.; Boesveldt, S.; Lundstrom, J.N.

    2012-01-01

Background: Most existing olfactory identification (ID) tests have the primary aim of diagnosing clinical olfactory dysfunction, thereby rendering them sub-optimal for experimental settings where the aim is to detect differences in healthy subjects’ odor ID abilities. Materials and methods: We have

  3. Explanatory item response modeling of children's change on a dynamic test of analogical reasoning

    Stevenson, C.E.; Hickendorff, M.; Resing, W.C.M.; Heiser, W.J.; de Boeck, P.A.L.

    Dynamic testing is an assessment method in which training is incorporated into the procedure with the aim of gauging cognitive potential. Large individual differences are present in children's ability to profit from training in analogical reasoning. The aim of this experiment was to investigate

  4. Psychometric evaluation of the EORTC computerized adaptive test (CAT) fatigue item pool

    Petersen, Morten Aa; Giesinger, Johannes M; Holzner, Bernhard

    2013-01-01

    Fatigue is one of the most common symptoms associated with cancer and its treatment. To obtain a more precise and flexible measure of fatigue, the EORTC Quality of Life Group has developed a computerized adaptive test (CAT) measure of fatigue. This is part of an ongoing project developing a CAT...

  5. Evaluation of the box and blocks test, stereognosis and item banks of activity and upper extremity function in youths with brachial plexus birth palsy.

    Mulcahey, Mary Jane; Kozin, Scott; Merenda, Lisa; Gaughan, John; Tian, Feng; Gogola, Gloria; James, Michelle A; Ni, Pengsheng

    2012-09-01

One of the greatest limitations to measuring outcomes in pediatric orthopaedics is the lack of effective instruments. Computer adaptive testing, which uses large item banks, selects only items that are relevant to a child's function based on a previous response and filters out items that are too easy, too hard, or simply not relevant to the child. In this way, computer adaptive testing provides a meaningful, efficient, and precise method to evaluate patient-reported outcomes. Banks of items that assess activity and upper extremity (UE) function have been developed for children with cerebral palsy and have enabled computer adaptive tests that showed strong reliability, strong validity, and broader content range when compared with traditional instruments. Because of the void in instruments for children with brachial plexus birth palsy (BPBP) and the importance of having a UE and activity scale, we were interested in how well these items worked in this population. A cross-sectional, multicenter study involving 200 children with BPBP was conducted. The box and block test (BBT) and stereognosis tests were administered and patient reports of UE function and activity were obtained with the cerebral palsy item banks. Differential item functioning (DIF) was examined. The predictive ability of the BBT and stereognosis was evaluated with a proportional odds logistic regression model. Spearman correlation coefficients (rs) were calculated to examine the correlation between stereognosis and the BBT and between individual stereognosis items and the total stereognosis score. Six of the 86 items showed DIF, indicating that the activity and UE item banks may be useful for computer adaptive tests for children with BPBP. The penny and the button were the strongest predictors of impairment level (odds ratio = 0.34 to 0.40). There was a good positive relationship between total stereognosis and BBT scores (rs = 0.60). The BBT had a good negative (rs = -0.55) and good positive (rs = 0.55) relationship with

  6. Using Automated Processes to Generate Test Items And Their Associated Solutions and Rationales to Support Formative Feedback

    Mark Gierl

    2015-08-01

    Full Text Available Automatic item generation is the process of using item models to produce assessment tasks using computer technology. An item model is similar to a template that highlights the elements in the task that must be manipulated to produce new items. The purpose of our study is to describe an innovative method for generating large numbers of diverse and heterogeneous items along with their solutions and associated rationales to support formative feedback. We demonstrate the method by generating items in two diverse content areas, mathematics and nonverbal reasoning

  7. The Role of Item Models in Automatic Item Generation

    Gierl, Mark J.; Lai, Hollis

    2012-01-01

    Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates…

  8. Testing the robustness of deterministic models of optimal dynamic pricing and lot-sizing for deteriorating items under stochastic conditions

    Ghoreishi, Maryam

    2018-01-01

    Many models within the field of optimal dynamic pricing and lot-sizing for deteriorating items assume everything is deterministic and develop a differential equation as the core of the analysis. Two prominent examples are the papers by Rajan et al. (Manag Sci 38:240–262, 1992) and Abad (1996). In this paper, we expose the models by Abad (1996) and Rajan et al. (1992) to stochastic inputs, designing these inputs so that they align as closely as possible with the assumptions of those papers. We carry out our investigation through a numerical test in which we examine the robustness of the numerical results reported in Rajan et al. (1992) and Abad (1996) in a simulation model. Our numerical results confirm that the results stated in these papers are indeed robust when exposed to stochastic inputs.

  9. The effect of the Trier Social Stress Test (TSST) on item and associative recognition of words and pictures in healthy participants

    Jonathan eGuez

    2016-04-01

    Full Text Available Psychological stress, induced by the Trier Social Stress Test (TSST), has repeatedly been shown to alter memory performance. Although factors influencing memory performance such as stimulus nature (verbal/pictorial) and emotional valence have been extensively studied, results on whether stress impairs or improves memory are still inconsistent. This study aimed at exploring the effect of the TSST on item versus associative memory for neutral verbal and pictorial stimuli. Forty-eight healthy subjects were recruited; 24 participants were randomly assigned to the TSST group and the remaining 24 participants were assigned to the control group. Stress reactivity was measured by psychological (subjective state anxiety ratings) and physiological (galvanic skin response recording) measurements. Subjects performed an item-association memory task for both stimulus types (words, pictures) simultaneously, before and after the stress/non-stress manipulation. The results showed that memory recognition for pictorial stimuli was higher than for verbal stimuli. Memory for both words and pictures was impaired following the TSST; while the source of this impairment was specific to associative recognition for pictures, a more general deficit was observed for verbal material, as expressed in decreased recognition for both items and associations following the TSST. Response latency analysis indicated that the TSST manipulation decreased response time, but at the cost of memory accuracy. We conclude that stress does not uniformly affect memory; rather, it interacts with the task's cognitive load and stimulus type. Applying the current study's results to patients diagnosed with disorders associated with traumatic stress, our findings in healthy subjects under acute stress provide further support for our assertion that patients' impaired memory originates in poor recollection processing following depletion of attentional resources.

  10. Advances in psychometrics: from Classical Test Theory to Item Response Theory

    Laisa Marcorela Andreoli Sartes

    2013-01-01

    Full Text Available In the twentieth century, the development and evaluation of the psychometric properties of tests was based mainly on Classical Test Theory (CTT). Many tests are long and redundant, with measurements influenced by the characteristics of the sample of individuals assessed during test development, some of these limitations being consequences of the use of CTT. Item Response Theory (IRT) emerged as a possible solution to some of the limitations of CTT, improving the quality of the evaluation of test structure. In this text we critically compare the characteristics of CTT and IRT as methods for evaluating the psychometric properties of tests. The advantages and limitations of each method are discussed.

  11. Calibrating the Medical Council of Canada's Qualifying Examination Part I using an integrated item response theory framework: a comparison of models and designs.

    De Champlain, Andre F; Boulais, Andre-Philippe; Dallas, Andrew

    2016-01-01

    The aim of this research was to compare different methods of calibrating the multiple-choice question (MCQ) and clinical decision making (CDM) components of the Medical Council of Canada's Qualifying Examination Part I (MCCQEI) based on item response theory. Our data consisted of test results from 8,213 first-time applicants to the MCCQEI in the spring and fall 2010 and 2011 test administrations. The data set contained several thousand multiple choice items and several hundred CDM cases. Four dichotomous calibrations were run using BILOG-MG 3.0. All 3 mixed item format (dichotomous MCQ responses and polytomous CDM case scores) calibrations were conducted using PARSCALE 4. The 2-PL model had identical numbers of items with chi-square values at or below a Type I error rate of 0.01 (83/3,499 or 0.02). In all 3 polytomous models, whether the MCQs were anchored or concurrently run with the CDM cases, results suggest very poor fit. All IRT abilities estimated from dichotomous calibration designs correlated very highly with each other. IRT-based pass-fail rates were extremely similar, not only across calibration designs and methods, but also with regard to the actual decision reported to candidates. The largest difference noted in pass rates was 4.78%, which occurred between the mixed-format concurrent 2-PL graded response model (pass rate = 80.43%) and the dichotomous anchored 1-PL calibrations (pass rate = 85.21%). Simpler calibration designs with dichotomized items should be implemented. The dichotomous calibrations provided better fit of the item response matrix than the more complex, polytomous calibrations.
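
    The 1-PL and 2-PL models compared in this record differ only in whether a per-item discrimination parameter is estimated. A minimal sketch of the 2-PL item response function (function names are illustrative, not the examination's calibration code):

```python
import math

def p_2pl(theta, a, b):
    """2-PL item response function: probability of a correct response
    at ability theta, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An examinee whose ability equals the item's difficulty has a 50%
# chance of answering correctly, regardless of discrimination.
print(p_2pl(0.0, a=1.2, b=0.0))  # 0.5
# Fixing a = 1 for every item recovers the 1-PL (Rasch) special case.
```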

  13. Applying Item Response Theory methods to design a learning progression-based science assessment

    Chen, Jing

    Learning progressions are used to describe how students' understanding of a topic progresses over time and to classify the progress of students into steps or levels. This study applies Item Response Theory (IRT) based methods to investigate how to design learning progression-based science assessments. The research questions of this study are: (1) how to use items in different formats to classify students into levels on the learning progression, (2) how to design a test to give good information about students' progress through the learning progression of a particular construct and (3) what characteristics of test items support their use for assessing students' levels. Data used for this study were collected from 1500 elementary and secondary school students during 2009-2010. The written assessment was developed in several formats, such as Constructed Response (CR) items, Ordered Multiple Choice (OMC) items and Multiple True or False (MTF) items. The following are the main findings from this study. The OMC, MTF and CR items might measure different components of the construct. A single construct explained most of the variance in students' performances. However, additional dimensions in terms of item format can explain a certain amount of the variance in student performance. So additional dimensions need to be considered when we want to capture the differences in students' performances on different types of items targeting the understanding of the same underlying progression. Items in each item format need to be improved in certain ways to classify students more accurately into the learning progression levels. This study establishes some general steps that can be followed to design other learning progression-based tests as well. For example, first, the boundaries between levels on the IRT scale can be defined by using the means of the item thresholds across a set of good items. Second, items in multiple formats can be selected to achieve the information criterion at all

  14. The development and validation of a two-tiered multiple-choice instrument to identify alternative conceptions in earth science

    Mangione, Katherine Anna

    This study sought to determine reliability and validity for a two-tiered, multiple-choice instrument designed to identify alternative conceptions in earth science. Additionally, this study sought to identify alternative conceptions in earth science held by preservice teachers, to investigate relationships between self-reported confidence scores and understanding of earth science concepts, and to describe relationships between content knowledge, alternative conceptions, and planning instruction in the science classroom. Eighty-seven preservice teachers enrolled in the MAT program participated in this study. Sixty-eight participants were female, twelve were male, and seven chose not to answer. Forty-seven participants were in the elementary certification program, five were in the middle school certification program, and twenty-nine were pursuing secondary certification. Results indicate that the two-tiered, multiple-choice format can be a reliable and valid method for identifying alternative conceptions. Preservice teachers in all certification areas who participated in this study may possess common alternative conceptions previously identified in the literature. Alternative conceptions included: all rivers flow north to south, the shadow of the Earth covers the Moon causing lunar phases, the Sun is always directly overhead at noon, weather can be predicted by animal coverings, and seasons are caused by the Earth's proximity to the Sun. Statistical analyses indicated differences, though not all of them significant, among all subgroups according to gender and certification area. Generally, males outperformed females, and preservice teachers pursuing middle school certification had the highest scores on the questionnaire, followed by those obtaining secondary certification. Elementary preservice teachers scored the lowest. Additionally, self-reported scores of confidence in one's answers and understanding of the earth science concept in question were analyzed. There was a

  15. A Study on Individualized Tests

    Metin YAŞAR

    2017-12-01

    Full Text Available This study aims to compare the KR-20 reliability levels of a "Paper and Pencil Test" developed according to Classical Test Theory and an "Individualized Test" developed according to Item Response Theory (Two-Parameter Logistic Model), and the correlation between the skill measurements obtained via these two methods in a group of students. The individualized test developed in accordance with the Two-Parameter Logistic Model was administered by means of a question pool consisting of 61 multiple-choice items which can be answered in 13 steps. A paper and pencil test of 47 multiple-choice items was administered to the sample student group. After the tests developed according to these two methods were applied to the same group, the KR-20 reliability coefficient was calculated as 0.67 for the individualized test and as 0.75 for the paper and pencil test prepared according to Classical Test Theory. The KR-20 reliability coefficients obtained from the study were converted via Fisher's Z and tested at the 0.05 significance level. No meaningful difference was detected between the two KR-20 reliability coefficients obtained from the two methods. The Pearson product-moment correlation coefficient between the scores on the individualized test and the measurement results of the paper and pencil test was calculated as 0.36. A positive yet low correlation was observed between the measurement results obtained from the tests developed according to both methods. Consequently, at the 0.05 significance level there was no statistically significant difference between the KR-20 reliability coefficients of the tests developed according to the two methods, and the correlation between the students' skill measurements on the two tests, while positive, was low and not significant at the 0.05 level.
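
    The comparison in this record rests on two computations: KR-20 from a matrix of dichotomous item scores, and Fisher's r-to-z transform for testing whether two independent coefficients differ. A minimal sketch (illustrative names; the total-score variance is computed here as a population variance):

```python
import math

def kr20(scores):
    """KR-20 for a matrix of 0/1 item scores (rows = examinees)."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n           # item difficulty p_j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

def fisher_z_diff(r1, n1, r2, n2):
    """z statistic for the difference between two independent
    coefficients after Fisher's r-to-z transform."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Toy 4-examinee, 3-item test
scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
print(kr20(scores))  # 0.75
# |z| below 1.96 means no significant difference at the 0.05 level
print(abs(fisher_z_diff(0.67, 64, 0.75, 64)) < 1.96)  # True
```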

  16. Grade 12 Diploma Examination, English 30. Part B: Reading (Multiple Choice). Readings Booklet. 1986 Edition.

    Alberta Dept. of Education, Edmonton.

    Intended for students taking the Grade 12 Examination in English 30 in Alberta, Canada, this reading test (to be administered along with the questions booklet) contains 10 short reading selections taken from fiction, nonfiction, poetry, and drama, including the following: "My Magical Metronome" (Lewis Thomas); "Queen Street…

  17. New Multiple-Choice Measures of Historical Thinking: An Investigation of Cognitive Validity

    Smith, Mark D.

    2018-01-01

    History education scholars have recognized the need for test validity research in recent years and have called for empirical studies that explore how to best measure historical thinking processes. The present study was designed to help answer this call and to provide a model that others can adapt to carry this line of research forward. It employed…

  18. A Novel Multiple Choice Question Generation Strategy: Alternative Uses for Controlled Vocabulary Thesauri in Biomedical-Sciences Education.

    Lopetegui, Marcelo A; Lara, Barbara A; Yen, Po-Yin; Çatalyürek, Ümit V; Payne, Philip R O

    2015-01-01

    Multiple choice questions play an important role in training and evaluating biomedical science students. However, the resource-intensive nature of question generation limits their open availability, restricting their contribution mainly to evaluation purposes. Although applied-knowledge questions require a complex formulation process, the creation of concrete-knowledge questions (i.e., definitions, associations) could be assisted by the use of informatics methods. We envisioned a novel and simple algorithm that exploits validated knowledge repositories and generates concrete-knowledge questions by leveraging concepts' relationships. In this manuscript we present the development and validation of a prototype which successfully produced meaningful concrete-knowledge questions, opening new applications for existing knowledge repositories and potentially benefiting students of all biomedical science disciplines.

  19. Changes in Word Usage Frequency May Hamper Intergenerational Comparisons of Vocabulary Skills: An Ngram Analysis of Wordsum, WAIS, and WISC Test Items

    Roivainen, Eka

    2014-01-01

    Research on secular trends in mean intelligence test scores shows smaller gains in vocabulary skills than in nonverbal reasoning. One possible explanation is that vocabulary test items become outdated faster compared to nonverbal tasks. The history of the usage frequency of the words on five popular vocabulary tests, the GSS Wordsum, Wechsler…

  20. A Comparison of Item Selection Procedures Using Different Ability Estimation Methods in Computerized Adaptive Testing Based on the Generalized Partial Credit Model

    Ho, Tsung-Han

    2010-01-01

    Computerized adaptive testing (CAT) provides a highly efficient alternative to the paper-and-pencil test. By selecting items that match examinees' ability levels, CAT not only can shorten test length and administration time but it can also increase measurement precision and reduce measurement error. In CAT, maximum information (MI) is the most…
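
    Maximum-information selection can be sketched concretely for the dichotomous 2-PL case, where item information at ability theta is a^2 * P * (1 - P) and the next item administered is the unadministered item that maximizes it. The toy bank and names below are illustrative; the generalized partial credit model studied in this report would use a polytomous information function instead:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2-PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def pick_next_item(theta, bank, administered):
    """Maximum-information (MI) selection: among items not yet given,
    choose the one most informative at the current ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta, *bank[i]))

bank = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]  # (a, b) per item
# The highly discriminating item whose difficulty matches theta wins
print(pick_next_item(0.0, bank, set()))  # 1
```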

  1. easyCBM CCSS Math Item Scaling and Test Form Revision (2012-2013): Grades 6-8. Technical Report #1313

    Anderson, Daniel; Alonzo, Julie; Tindal, Gerald

    2012-01-01

    The purpose of this technical report is to document the piloting and scaling of new easyCBM mathematics test items aligned with the Common Core State Standards (CCSS) and to describe the process used to revise and supplement the 2012 research version easyCBM CCSS math tests in Grades 6-8. For all operational 2012 research version test forms (10…

  2. Development of knowledge tests for multi-disciplinary emergency training

    Sorensen, J. L.; Thellesen, L.; Strandbygaard, J.

    2015-01-01

    and evaluating a multiple-choice question (MCQ) test for use in a multi-disciplinary training program in obstetric-anesthesia emergencies. Methods: A multi-disciplinary working committee with 12 members representing six professional healthcare groups and another 28 participants were involved. Recurrent revisions…; 40 out of the original 50 items were included in the final MCQ test. The MCQ test was able to distinguish between levels of competence, and good construct validity was indicated by a significant difference in the mean score between consultants and first-year trainees, as well as between first…

  3. Development of a psychological test to measure ability-based emotional intelligence in the Indonesian workplace using an item response theory

    Fajrianthi

    2017-11-01

    Full Text Available Fajrianthi,1 Rizqy Amelia Zein2 1Department of Industrial and Organizational Psychology, 2Department of Personality and Social Psychology, Faculty of Psychology, Universitas Airlangga, Surabaya, East Java, Indonesia Abstract: This study aimed to develop an emotional intelligence (EI) test that is suitable to the Indonesian workplace context. The Airlangga Emotional Intelligence Test (Tes Kecerdasan Emosi Airlangga [TKEA]) was designed to measure three EI domains: 1) emotional appraisal, 2) emotional recognition, and 3) emotional regulation. TKEA consisted of 120 items with 40 items for each subset. TKEA was developed based on the Situational Judgment Test (SJT) approach. To ensure its psychometric qualities, categorical confirmatory factor analysis (CCFA) and item response theory (IRT) were applied to test its validity and reliability. The study was conducted on 752 participants, and the results showed that the test information function (TIF) was 3.414 for subset 1 (ability level = 0), 12.183 for subset 2 (ability level = -2), and 2.398 for subset 3 (ability level = -2). It is concluded that TKEA performs very well in measuring individuals with a low level of EI ability. It is worth noting that TKEA is currently at the development stage; therefore, in this study, we investigated TKEA's item analysis and the dimensionality of each TKEA subset. Keywords: categorical confirmatory factor analysis, emotional intelligence, item response theory

  4. Item Banking with Embedded Standards

    MacCann, Robert G.; Stanley, Gordon

    2009-01-01

    An item banking method that does not use Item Response Theory (IRT) is described. This method provides a comparable grading system across schools that would be suitable for low-stakes testing. It uses the Angoff standard-setting method to obtain item ratings that are stored with each item. An example of such a grading system is given, showing how…

  5. Test-retest reliability of selected items of Health Behaviour in School-aged Children (HBSC survey questionnaire in Beijing, China

    Liu Yang

    2010-08-01

    Full Text Available Abstract Background Children's health and health behaviour are essential for their development, and it is important to obtain abundant and accurate information to understand young people's health and health behaviour. The Health Behaviour in School-aged Children (HBSC) study is among the first large-scale international surveys on adolescent health through self-report questionnaires. So far, more than 40 countries in Europe and North America have been involved in the HBSC study. The purpose of this study is to assess the test-retest reliability of selected items in the Chinese version of the HBSC survey questionnaire in a sample of adolescents in Beijing, China. Methods A sample of 95 male and female students aged 11 or 15 years old participated in a test and retest with a three-week interval. Student identity numbers of respondents were utilized to permit matching of test-retest questionnaires. 23 items concerning physical activity, sedentary behaviour, sleep and substance use were evaluated by using the percentage of response shifts and the single-measure intraclass correlation coefficients (ICC) with 95% confidence intervals (CI) for all respondents and stratified by gender and age. Items on substance use were only evaluated for school children aged 15 years old. Results The percentage of no response shift between test and retest varied from 32% for the item on computer use at weekends to 92% for the three items on smoking. Of all the 23 items evaluated, 6 items (26%) showed moderate reliability, 12 items (52%) displayed substantial reliability and 4 items (17%) indicated almost perfect reliability. No gender or age group difference in test-retest reliability was found except for a few items on sedentary behaviour. Conclusions The overall findings of this study suggest that most selected indicators in the HBSC survey questionnaire have satisfactory test-retest reliability for the students in Beijing. Further test-retest studies in a large
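
    The single-measure ICC used in this record can take several forms; a minimal sketch of the one-way variant, ICC(1,1), computed from a subjects-by-occasions score matrix (illustrative, not necessarily the exact estimator used in the HBSC study):

```python
def icc_oneway(ratings):
    """Single-measure one-way ICC, ICC(1,1) = (MSB - MSW) / (MSB + (k-1)*MSW),
    for rows = subjects and columns = repeated measurements (test, retest)."""
    n = len(ratings)          # subjects
    k = len(ratings[0])       # measurements per subject
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subjects mean squares
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - row_means[i]) ** 2
              for i, row in enumerate(ratings) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Perfect test-retest agreement yields an ICC of 1
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))  # 1.0
```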

  6. Development and Application of Methods for Estimating Operating Characteristics of Discrete Test Item Responses without Assuming any Mathematical Form.

    Samejima, Fumiko

    In latent trait theory the latent space, or space of the hypothetical construct, is usually represented by some unidimensional or multi-dimensional continuum of real numbers. Like the latent space, the item response can either be treated as a discrete variable or as a continuous variable. Latent trait theory relates the item response to the latent…

  7. Comparing Two Types of Diagnostic Items to Evaluate Understanding of Heat and Temperature Concepts

    Chu, Hye-Eun; Chandrasegaran, A. L.; Treagust, David F.

    2018-01-01

    The purpose of this research was to investigate an efficient method to assess year 8 (age 13-14) students' conceptual understanding of heat and temperature concepts. Two different types of instruments were used in this study: Type 1, consisting of multiple-choice items with open-ended justifications; and Type 2, consisting of two-tier…

  8. Fitting a Mixture Rasch Model to English as a Foreign Language Listening Tests: The Role of Cognitive and Background Variables in Explaining Latent Differential Item Functioning

    Aryadoust, Vahid

    2015-01-01

    The present study uses a mixture Rasch model to examine latent differential item functioning in English as a foreign language listening tests. Participants (n = 250) took a listening and lexico-grammatical test and completed the metacognitive awareness listening questionnaire comprising problem solving (PS), planning and evaluation (PE), mental…

  9. Automated Scoring of Short-Answer Open-Ended GRE® Subject Test Items. ETS GRE® Board Research Report No. 04-02. ETS RR-08-20

    Attali, Yigal; Powers, Don; Freedman, Marshall; Harrison, Marissa; Obetz, Susan

    2008-01-01

    This report describes the development, administration, and scoring of open-ended variants of GRE® Subject Test items in biology and psychology. These questions were administered in a Web-based experiment to registered examinees of the respective Subject Tests. The questions required a short answer of 1-3 sentences, and responses were automatically…

  10. A validated model for the 22-item Sino-Nasal Outcome Test subdomain structure in chronic rhinosinusitis.

    Feng, Allen L; Wesely, Nicholas C; Hoehle, Lloyd P; Phillips, Katie M; Yamasaki, Alisa; Campbell, Adam P; Gregorio, Luciano L; Killeen, Thomas E; Caradonna, David S; Meier, Josh C; Gray, Stacey T; Sedaghat, Ahmad R

    2017-12-01

    Previous studies have identified subdomains of the 22-item Sino-Nasal Outcome Test (SNOT-22), reflecting distinct and largely independent categories of chronic rhinosinusitis (CRS) symptoms. However, no study has validated the subdomain structure of the SNOT-22. This study aims to validate the existence of underlying symptom subdomains of the SNOT-22 using confirmatory factor analysis (CFA) and to develop a subdomain model that practitioners and researchers can use to describe CRS symptomatology. A total of 800 patients with CRS were included into this cross-sectional study (400 CRS patients from Boston, MA, and 400 CRS patients from Reno, NV). Their SNOT-22 responses were analyzed using exploratory factor analysis (EFA) to determine the number of symptom subdomains. A CFA was performed to develop a validated measurement model for the underlying SNOT-22 subdomains along with various tests of validity and goodness of fit. EFA demonstrated 4 distinct factors reflecting: sleep, nasal, otologic/facial pain, and emotional symptoms (Cronbach's alpha, >0.7; Bartlett's test of sphericity, p Kaiser-Meyer-Olkin >0.90), independent of geographic locale. The corresponding CFA measurement model demonstrated excellent measures of fit (root mean square error of approximation, 0.95; Tucker-Lewis index, >0.95) and measures of construct validity (heterotrait-monotrait [HTMT] ratio, 0.7), again independent of geographic locale. The use of the 4-subdomain structure for SNOT-22 (reflecting sleep, nasal, otologic/facial pain, and emotional symptoms of CRS) was validated as the most appropriate to calculate SNOT-22 subdomain scores for patients from different geographic regions using CFA. © 2017 ARS-AAOA, LLC.

  11. Creation and validation of the barriers to alcohol reduction (BAR) scale using classical test theory and item response theory.

    Kunicki, Zachary J; Schick, Melissa R; Spillane, Nichea S; Harlow, Lisa L

    2018-06-01

    Those who binge drink are at increased risk for alcohol-related consequences when compared to non-binge drinkers. Research shows individuals may face barriers to reducing their drinking behavior, but few measures exist to assess these barriers. This study created and validated the Barriers to Alcohol Reduction (BAR) scale. Participants were college students (n = 230) who endorsed at least one instance of past-month binge drinking (4+ drinks for women or 5+ drinks for men). Using classical test theory, exploratory structural equation modeling found a two-factor structure of personal/psychosocial barriers and perceived program barriers. The sub-factors and the full scale had reasonable internal consistency (coefficient omega = 0.78 for personal/psychosocial barriers, 0.82 for program barriers, and 0.83 for the full measure). The BAR also showed evidence of convergent validity with the Brief Young Adult Alcohol Consequences Questionnaire (r = 0.39). Item Response Theory (IRT) analysis showed that the two factors separately met the unidimensionality assumption and provided further evidence for the severity of the items on the two factors. Results suggest that the BAR measure appears reliable and valid for use in an undergraduate student population of binge drinkers. Future studies may want to re-examine this measure in a more diverse sample.

  12. Lord-Wingersky Algorithm Version 2.0 for Hierarchical Item Factor Models with Applications in Test Scoring, Scale Alignment, and Model Fit Testing.

    Cai, Li

    2015-06-01

    Lord and Wingersky's (Appl Psychol Meas 8:453-461, 1984) recursive algorithm for creating summed score based likelihoods and posteriors has a proven track record in unidimensional item response theory (IRT) applications. Extending the recursive algorithm to handle multidimensionality is relatively simple, especially with fixed quadrature because the recursions can be defined on a grid formed by direct products of quadrature points. However, the increase in computational burden remains exponential in the number of dimensions, making the implementation of the recursive algorithm cumbersome for truly high-dimensional models. In this paper, a dimension reduction method that is specific to the Lord-Wingersky recursions is developed. This method can take advantage of the restrictions implied by hierarchical item factor models, e.g., the bifactor model, the testlet model, or the two-tier model, such that a version of the Lord-Wingersky recursive algorithm can operate on a dramatically reduced set of quadrature points. For instance, in a bifactor model, the dimension of integration is always equal to 2, regardless of the number of factors. The new algorithm not only provides an effective mechanism to produce summed score to IRT scaled score translation tables properly adjusted for residual dependence, but leads to new applications in test scoring, linking, and model fit checking as well. Simulated and empirical examples are used to illustrate the new applications.
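
    For dichotomous items, the Lord-Wingersky recursion builds the summed-score distribution one item at a time from each item's response probability at a fixed ability. A minimal sketch of the classic unidimensional recursion (illustrative function name; the paper's contribution extends this idea to hierarchical item factor models):

```python
def lord_wingersky(probs):
    """Lord-Wingersky recursion: given each item's probability of a
    correct response at a fixed ability, return the distribution of
    the summed score over all items (dichotomous case)."""
    dist = [1.0]                       # score distribution after 0 items
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - p)   # item answered incorrectly
            new[s + 1] += mass * p     # item answered correctly
        dist = new
    return dist

# Three items each answered with probability 0.5 -> Binomial(3, 0.5)
print(lord_wingersky([0.5, 0.5, 0.5]))  # [0.125, 0.375, 0.375, 0.125]
```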

  13. Item analysis of a multiple-choice reading test of the CILS certification of Italian for foreigners (level B1; summer 2012 session)

    Paolo Torresan

    2014-10-01

    Full Text Available In this article we present an analysis of the items in a multiple-choice reading test, level B1, of the CILS certification (Università per Stranieri di Siena). The research starts with a preliminary examination of the text on which the test is based, including a study of the modifications it underwent at the item writer's hand, and proceeds to an analysis of every single item, using data from the administration of the test to 161 students of Italian at the corresponding level from all over the world. Our research shows that the test presents an ambiguous item (#1), which has two keys, and a difficult item (#4), which lacks the information needed to infer the meaning of the word it refers to.

  14. A comparative analysis of multiple-choice and student performance-task assessment in the high school biology classroom

    Cushing, Patrick Ryan

    This study compared the performance of high school students on laboratory assessments. Thirty-four high school students who were enrolled in the second semester of a regular biology class or had completed the biology course the previous semester participated in this study. They were randomly assigned to examinations of two formats, performance-task and traditional multiple-choice, from two content areas, using a compound light microscope and diffusion. Students were directed to think-aloud as they performed the assessments. Additional verbal data were obtained during interviews following the assessment. The tape-recorded narrative data were analyzed for type and diversity of knowledge and skill categories, and percentage of in-depth processing demonstrated. While overall mean scores on the assessments were low, elicited statements provided additional insight into student cognition. Results indicated that a greater diversity of knowledge and skill categories was elicited by the two microscope assessments and by the two performance-task assessments. In addition, statements demonstrating in-depth processing were coded most frequently in narratives elicited during clinical interviews following the diffusion performance-task assessment. This study calls for individual teachers to design authentic assessment practices and apply them to daily classroom routines. Authentic assessment should be an integral part of the learning process and not merely an end result. In addition, teachers are encouraged to explicitly identify and model, through think-aloud methods, desired cognitive behaviors in the classroom.

  15. A unified factor-analytic approach to the detection of item and test bias: Illustration with the effect of providing calculators to students with dyscalculia

    Lee, M. K.

    2016-01-01

    An absence of measurement bias against distinct groups is a prerequisite for the use of a given psychological instrument in scientific research or high-stakes assessment. Factor analysis is the framework explicitly adopted for the identification of such bias when the instrument consists of a multi-test battery, whereas item response theory is employed when the focus narrows to a single test composed of discrete items. Item response theory can be treated as a mild nonlinearization of the standard factor model, and thus the essential unity of bias detection at the two levels merits greater recognition. Here we illustrate the benefits of a unified approach with a real-data example, which comes from a statewide test of mathematics achievement where examinees diagnosed with dyscalculia were accommodated with calculators. We found that items that can be solved by explicit arithmetical computation became easier for the accommodated examinees, but the quantitative magnitude of this differential item functioning (measurement bias) was small.

  16. Exploring Secondary Students' Knowledge and Misconceptions about Influenza: Development, validation, and implementation of a multiple-choice influenza knowledge scale

    Romine, William L.; Barrow, Lloyd H.; Folk, William R.

    2013-07-01

    Understanding infectious diseases such as influenza is an important element of health literacy. We present a fully validated knowledge instrument called the Assessment of Knowledge of Influenza (AKI) and use it to evaluate knowledge of influenza, with a focus on misconceptions, in Midwestern United States high-school students. A two-phase validation process was used. In phase 1, an initial factor structure was calculated based on 205 students of grades 9-12 at a rural school. In phase 2, one- and two-dimensional factor structures were analyzed from the perspectives of classical test theory and the Rasch model using structural equation modeling and principal components analysis (PCA) on Rasch residuals, respectively. Rasch knowledge measures were calculated for 410 students from 6 school districts in the Midwest, and misconceptions were verified through the χ² test. Eight items measured knowledge of flu transmission, and seven measured knowledge of flu management. While alpha reliability measures for the subscales were acceptable, Rasch person reliability measures and PCA on residuals advocated for a single-factor scale. Four misconceptions were found, which have not been previously documented in high-school students. The AKI is the first validated influenza knowledge assessment, and can be used by schools and health agencies to provide a quantitative measure of impact of interventions aimed at increasing understanding of influenza. This study also adds significantly to the literature on misconceptions about influenza in high-school students, a necessary step toward strategic development of educational interventions for these students.

  17. Development of a psychological test to measure ability-based emotional intelligence in the Indonesian workplace using an item response theory.

    Fajrianthi; Zein, Rizqy Amelia

    2017-01-01

    This study aimed to develop an emotional intelligence (EI) test that is suitable for the Indonesian workplace context. The Airlangga Emotional Intelligence Test (Tes Kecerdasan Emosi Airlangga [TKEA]) was designed to measure three EI domains: 1) emotional appraisal, 2) emotional recognition, and 3) emotional regulation. TKEA consisted of 120 items, with 40 items for each subset. TKEA was developed based on the Situational Judgment Test (SJT) approach. To ensure its psychometric qualities, categorical confirmatory factor analysis (CCFA) and item response theory (IRT) were applied to test its validity and reliability. The study was conducted on 752 participants, and the results showed that the test information function (TIF) was 3.414 for subset 1 (ability level = 0), 12.183 for subset 2 (ability level = -2), and 2.398 for subset 3 (ability level = -2). It is concluded that TKEA performs very well in measuring individuals with a low level of EI ability. It is worth noting that TKEA is currently at the development stage; therefore, in this study, we investigated TKEA's item analysis and the dimensionality of each TKEA subset.
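Test information values like those reported above can be computed with standard IRT machinery. A hedged sketch assuming a 2PL model (the abstract does not specify TKEA's model or parameters; the function and arrays below are illustrative):

```python
import numpy as np

def tif_2pl(theta, a, b):
    """Test information function under a 2PL model:
    I(theta) = sum_i a_i^2 * P_i(theta) * (1 - P_i(theta)),
    where P_i is the 2PL probability of a correct/keyed response.

    theta : scalar or array of ability values
    a, b  : arrays of item discriminations and difficulties
    """
    theta = np.atleast_1d(theta)[:, None]        # ability grid as a column
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL response probabilities
    return (a**2 * P * (1.0 - P)).sum(axis=1)    # sum item information over items
```

A test that peaks near ability level -2, as subset 2 does here, simply has its items' difficulty parameters clustered around that region.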

  18. The Development Of A Diagnostic Reading Test Of English For The Students Of Medical Faculty, Brawijaya University

    Indah Winarni

    2003-01-01

    This paper describes the development of a diagnostic multiple-choice reading comprehension test as an initial stage in developing teaching materials for medical students learning English. Sample texts were collected from all the departments in the faculty. Selection of relevant texts involved the participation of some subject lecturers. Sixty-one items were developed from fifteen texts, reduced to forty items after pilot testing. Face validity was improved. The main trial was administered to twenty-nine students and item analysis was carried out. The test showed a low level of concurrent validity, and internal consistency showed a moderate level of reliability. The low level of concurrent validity was suspected to result from the test being too difficult for the testees, as the item analysis had revealed.

  19. More than the Verbal Stimulus Matters: Visual Attention in Language Assessment for People with Aphasia Using Multiple-Choice Image Displays

    Heuer, Sabine; Ivanova, Maria V.; Hallowell, Brooke

    2017-01-01

    Purpose: Language comprehension in people with aphasia (PWA) is frequently evaluated using multiple-choice displays: PWA are asked to choose the image that best corresponds to the verbal stimulus in a display. When a nontarget image is selected, comprehension failure is assumed. However, stimulus-driven factors unrelated to linguistic…

  20. Incorporating Multiple-Choice Questions into an AACSB Assurance of Learning Process: A Course-Embedded Assessment Application to an Introductory Finance Course

    Santos, Michael R.; Hu, Aidong; Jordan, Douglas

    2014-01-01

    The authors offer a classification technique to make a quantitative skills rubric more operational, with the groupings of multiple-choice questions to match the student learning levels in knowledge, calculation, quantitative reasoning, and analysis. The authors applied this classification technique to the mid-term exams of an introductory finance…

  1. Predicting Social and Communicative Ability in School-Age Children with Autism Spectrum Disorder: A Pilot Study of the Social Attribution Task, Multiple Choice

    Burger-Caplan, Rebecca; Saulnier, Celine; Jones, Warren; Klin, Ami

    2016-01-01

    The Social Attribution Task, Multiple Choice is introduced as a measure of implicit social cognitive ability in children, addressing a key challenge in quantification of social cognitive function in autism spectrum disorder, whereby individuals can often be successful in explicit social scenarios, despite marked social adaptive deficits. The…

  2. Does the think-aloud protocol reflect thinking? Exploring functional neuroimaging differences with thinking (answering multiple choice questions) versus thinking aloud

    Durning, S.J.; Artino, A.R.; Beckman, T.J.; Graner, J.; Vleuten, C.P.M. van der; Holmboe, E.; Schuwirth, L.

    2013-01-01

    Background: Whether the think-aloud protocol is a valid measure of thinking remains uncertain. Therefore, we used functional magnetic resonance imaging (fMRI) to investigate potential functional neuroanatomic differences between thinking (answering multiple-choice questions in real time) versus

  3. The Italian version of the 16-item prodromal questionnaire (iPQ-16): Field-test and psychometric features.

    Lorenzo, Pelizza; Silvia, Azzali; Federica, Paterlini; Sara, Garlassi; Ilaria, Scazza; Pupo, Simona; Andrea, Raballo

    2018-03-20

    Among current early screeners for psychosis-risk states, the 16-item Prodromal Questionnaire (PQ-16) is often used. We aimed to assess the validity and reliability of the Italian version of the PQ-16 in a young adult help-seeking population. We included 154 individuals aged 18-35 years seeking help at the Reggio Emilia outpatient mental health services in a large semirural catchment area (550,000 inhabitants). Participants completed the Italian version of the PQ-16 (iPQ-16) and were subsequently evaluated with the Comprehensive Assessment of At-Risk Mental States (CAARMS). We examined diagnostic accuracy (i.e. specificity, sensitivity, negative and positive likelihood ratios, and negative and positive predictive values) and content, convergent, and concurrent validity between the PQ-16 and the CAARMS using Cronbach's alpha, Spearman's rho, and Cohen's kappa, respectively. We also tested the validity of the adopted PQ-16 cut-offs through Receiver Operating Characteristic (ROC) curves plotted against CAARMS diagnoses, as well as the 1-year predictive validity of the PQ-16. The iPQ-16 showed high internal consistency and acceptable diagnostic accuracy and concurrent validity. ROC analyses pointed to a score of ≥5 as the best cut-off. After 12 months of follow-up, 8.7% of participants with a PQ-16 symptom total score of ≥5 who were below the CAARMS psychosis threshold at baseline developed a psychotic disorder. The psychometric properties of the iPQ-16 were satisfactory. Copyright © 2018. Published by Elsevier B.V.
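The diagnostic-accuracy indices the abstract lists are straightforward to compute once a cut-off is fixed. A minimal sketch (function name and toy data are mine; the study's actual scores are not reproduced here) of evaluating a PQ-16-style screener at the ≥5 cut-off against a reference diagnosis:

```python
import numpy as np

def screen_accuracy(scores, reference_positive, cutoff=5):
    """Sensitivity, specificity, PPV, and NPV of a screener at a given
    cut-off, judged against a boolean reference diagnosis (e.g. CAARMS)."""
    flagged = np.asarray(scores) >= cutoff
    ref = np.asarray(reference_positive, dtype=bool)
    tp = np.sum(flagged & ref)      # screen positive, reference positive
    fp = np.sum(flagged & ~ref)     # screen positive, reference negative
    fn = np.sum(~flagged & ref)     # screen negative, reference positive
    tn = np.sum(~flagged & ~ref)    # screen negative, reference negative
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

Sweeping `cutoff` over the observed score range and plotting sensitivity against 1 - specificity yields the ROC curve the authors used to select ≥5.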

  4. The Differences among Three-, Four-, and Five-Option-Item Formats in the Context of a High-Stakes English-Language Listening Test

    Lee, HyeSun; Winke, Paula

    2013-01-01

    We adapted three practice College Scholastic Ability Tests (CSAT) of English listening, each with five-option items, to create four- and three-option versions by asking 73 Korean speakers or learners of English to eliminate the least plausible options in two rounds. Two hundred and sixty-four Korean high school English-language learners formed…

  5. An Item Response Theory-Based, Computerized Adaptive Testing Version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS)

    Makransky, Guido; Dale, Philip S.; Havmose, Philip; Bleses, Dorthe

    2016-01-01

    Purpose: This study investigated the feasibility and potential validity of an item response theory (IRT)-based computerized adaptive testing (CAT) version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS; Fenson et al., 2007) vocabulary checklist, with the objective of reducing length while maintaining…

  6. Testing ESL pragmatics development and validation of a web-based assessment battery

    Roever, Carsten

    2014-01-01

    Although second language learners' pragmatic competence (their ability to use language in context) is an essential part of their general communicative competence, it has not been a part of second language tests. This book helps fill this gap by describing the development and validation of a web-based test of ESL pragmalinguistics. The instrument assesses learners' knowledge of routine formulae, speech acts, and implicature in 36 multiple-choice and brief-response items. The test's quantitative and qualitative validation with 300 learners showed high reliability and provided strong evidence of

  7. Using classical test theory, item response theory, and Rasch measurement theory to evaluate patient-reported outcome measures: a comparison of worked examples.

    Petrillo, Jennifer; Cano, Stefan J; McLeod, Lori D; Coon, Cheryl D

    2015-01-01

    To provide comparisons and a worked example of item- and scale-level evaluations based on three psychometric methods used in patient-reported outcome development, namely classical test theory (CTT), item response theory (IRT), and Rasch measurement theory (RMT), in an analysis of the National Eye Institute Visual Functioning Questionnaire (VFQ-25). Baseline VFQ-25 data from 240 participants with diabetic macular edema from a randomized, double-masked, multicenter clinical trial were used to evaluate the VFQ at the total score level. CTT, RMT, and IRT evaluations were conducted, and results were assessed in a head-to-head comparison. Results were similar across the three methods, with IRT and RMT providing more detailed diagnostic information on how to improve the scale. CTT led to the identification of two problematic items that threaten the validity of the overall scale score, sets of redundant items, and skewed response categories. IRT and RMT additionally identified poor fit for one item, many locally dependent items, poor targeting, and disordering of over half the response categories. Selection of a psychometric approach depends on many factors. Researchers should justify their evaluation method and consider the intended audience. If the instrument is being developed for descriptive purposes and on a restricted budget, a cursory examination of the CTT-based psychometric properties may be all that is possible. In a high-stakes situation, such as the development of a patient-reported outcome instrument for consideration in pharmaceutical labeling, however, a thorough psychometric evaluation including IRT or RMT should be considered, with final item-level decisions made on the basis of both quantitative and qualitative results. Copyright © 2015. Published by Elsevier Inc.
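The "cursory CTT examination" the authors mention usually begins with internal consistency. A minimal sketch of Cronbach's alpha for a persons-by-items score matrix (the VFQ-25 data are not reproduced here; the matrix in the test is illustrative):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from a score matrix with rows = examinees and
    columns = items: alpha = k/(k-1) * (1 - sum(item variances) / total variance)."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()   # variance of each item column
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of the summed scores
    return k / (k - 1) * (1.0 - item_var_sum / total_var)
```

Alpha alone cannot flag the local dependence or disordered categories that the IRT and RMT analyses uncovered, which is the article's point about choosing the evaluation method to match the stakes.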

  8. Gender-Based Differential Item Performance in Mathematics Achievement Items.

    Doolittle, Allen E.; Cleary, T. Anne

    1987-01-01

    Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Signed measures of differential item performance (DIP) were obtained for each item in the eight ACTM forms. DIP estimates were analyzed and a significant item category effect was found. (Author/LMO)

  9. Differential Item Functioning Analysis Using a Mixture 3-Parameter Logistic Model with a Covariate on the TIMSS 2007 Mathematics Test

    Choi, Youn-Jeng; Alexeev, Natalia; Cohen, Allan S.

    2015-01-01

    The purpose of this study was to explore what may be contributing to differences in performance in mathematics on the Trends in International Mathematics and Science Study 2007. This was done by using a mixture item response theory modeling approach to first detect latent classes in the data and then to examine differences in performance on items…

  10. Item validity vs. item discrimination index: a redundancy?

    Panjaitan, R. L.; Irawati, R.; Sujana, A.; Hanifah, N.; Djuanda, D.

    2018-03-01

    In much of the literature on evaluation and test analysis, it is common to find calculations of both item validity and the item discrimination index (D), each with its own formula. Meanwhile, other resources state that the item discrimination index can be obtained by calculating the correlation between a testee's score on a particular item and the testee's score on the overall test, which is the same concept as item validity. Some research reports, especially undergraduate theses, tend to include both item validity and the item discrimination index in the instrument analysis. These concepts may overlap, for both reflect the quality of a test in measuring the examinees' ability. In this paper, examples of data-processing results for item validity and the item discrimination index were compared. We discuss whether item validity and the item discrimination index can be represented by only one of the two, or whether it is better to present both calculations in simple test analysis, especially in undergraduate theses where test analyses are included.
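The two quantities being compared can be computed side by side. A sketch of the classical upper-lower discrimination index D and the item-total (point-biserial) correlation that the article treats as item validity; the function names and the 27% grouping convention are common practice rather than this paper's specification:

```python
import numpy as np

def discrimination_d(item, total, frac=0.27):
    """Classical discrimination index D: proportion correct in the top
    `frac` of examinees (by total score) minus the proportion correct
    in the bottom `frac`."""
    order = np.argsort(total)
    k = max(1, int(round(frac * len(total))))
    low, high = order[:k], order[-k:]
    return item[high].mean() - item[low].mean()

def point_biserial(item, total):
    """Item-total correlation between a 0/1 item score and the total
    test score (the 'item validity' coefficient)."""
    return np.corrcoef(item, total)[0, 1]
```

On typical data the two indices rank items similarly, which is exactly the redundancy the authors examine.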

  11. Small group learning: effect on item analysis and accuracy of self-assessment of medical students.

    Biswas, Shubho Subrata; Jain, Vaishali; Agrawal, Vandana; Bindra, Maninder

    2015-01-01

    Small group sessions are regarded as a more active and student-centered approach to learning. Item analysis provides objective evidence of whether such sessions improve comprehension and make the topic easier for students, in addition to assessing the relative benefit of the sessions to good versus poor performers. Self-assessment makes students aware of their deficiencies. Small group sessions can also help students develop the ability to self-assess. This study was carried out to assess the effect of small group sessions on item analysis and students' self-assessment. A total of 21 female and 29 male first-year medical students participated in a small group session on topics covered by didactic lectures two weeks earlier. It was preceded and followed by two multiple choice question (MCQ) tests, in which students were asked to self-assess their likely score. The MCQs used had been item analyzed in a previous group and were chosen to have matching difficulty and discriminatory indices for the pre- and post-tests. The small group session improved the marks of both genders equally, but female performance was better. The session made the items easier, increasing the difficulty index significantly, but there was no significant alteration in the discriminatory index. There was overestimation in the self-assessment of both genders, but male overestimation was greater. The session improved the self-assessment of students in terms of expected marks and expectation of passing. The small group session improved students' ability to self-assess their knowledge and increased the difficulty index of the items, reflecting students' better performance.

  12. Item level diagnostics and model - data fit in item response theory ...

    Item response theory (IRT) is a framework for modeling and analyzing item response data. Item-level modeling gives IRT advantages over classical test theory. The fit of an item score pattern to item response theory (IRT) models is a necessary condition that must be assessed for further use of the items and the models that best fit ...

  13. Item-focussed Trees for the Identification of Items in Differential Item Functioning.

    Tutz, Gerhard; Berger, Moritz

    2016-09-01

    A novel method for the identification of differential item functioning (DIF) by means of recursive partitioning techniques is proposed. We assume an extension of the Rasch model that allows for DIF being induced by an arbitrary number of covariates for each item. Recursive partitioning on the item level results in one tree for each item and leads to simultaneous selection of items and variables that induce DIF. For each item, it is possible to detect groups of subjects with different item difficulties, defined by combinations of characteristics that are not pre-specified. The way a DIF item is determined by covariates is visualized in a small tree and therefore easily accessible. An algorithm is proposed that is based on permutation tests. Various simulation studies, including the comparison with traditional approaches to identify items with DIF, show the applicability and the competitive performance of the method. Two applications illustrate the usefulness and the advantages of the new method.

  14. Test-retest reliability at the item level and total score level of the Norwegian version of the Spinal Cord Injury Falls Concern Scale (SCI-FCS).

    Roaldsen, Kirsti Skavberg; Måøy, Åsa Blad; Jørgensen, Vivien; Stanghelle, Johan Kvalvik

    2016-05-01

    Translation of the Spinal Cord Injury Falls Concern Scale (SCI-FCS), and investigation of test-retest reliability at the item level and total-score level. Translation, adaptation and test-retest study. A specialized rehabilitation setting in Norway. Fifty-four wheelchair users with a spinal cord injury. The median age of the cohort was 49 years, and the median number of years after injury was 13. Interventions/measurements: The SCI-FCS was translated and back-translated according to guidelines. Individuals answered the SCI-FCS twice over the course of one week. We investigated item-level test-retest reliability using Svensson's rank-based statistical method for disagreement analysis of paired ordinal data. For relative reliability, we analyzed the total-score-level test-retest reliability with intraclass correlation coefficients (ICC2.1), the standard error of measurement (SEM), and the smallest detectable change (SDC) for absolute reliability/measurement-error assessment, and Cronbach's alpha for internal consistency. All items showed satisfactory percentage agreement (≥69%) between test and retest. There were small but non-negligible systematic disagreements among three items, with an 11-13% higher chance of a lower second score. There was no disagreement due to random variance. The test-retest agreement (ICC2.1) was excellent (0.83). The SEM was 2.6 (12%), and the SDC was 7.1 (32%). Cronbach's alpha was high (0.88). The Norwegian SCI-FCS is highly reliable for wheelchair users with chronic spinal cord injuries.
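The ICC(2,1), SEM, and SDC reported here can be sketched from a subjects-by-occasions score matrix. This follows the Shrout-Fleiss two-way random-effects form and one common SEM/SDC convention; the authors' exact formulas may differ, and the function names are mine:

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measures, from an n-subjects x k-occasions matrix."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between occasions
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def sem_sdc(score_sd, icc):
    """SEM = SD * sqrt(1 - ICC); SDC (95%) = 1.96 * sqrt(2) * SEM."""
    sem = score_sd * np.sqrt(1.0 - icc)
    return sem, 1.96 * np.sqrt(2.0) * sem
```

The SDC is the smallest total-score change that exceeds measurement error at the individual level, which is why the abstract reports it alongside the ICC.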

  15. Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating. CSE Report 688

    Michaelides, Michalis P.

    2006-01-01

    Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations.…

  16. Software Note: Using BILOG for Fixed-Anchor Item Calibration

    DeMars, Christine E.; Jurich, Daniel P.

    2012-01-01

    The nonequivalent groups anchor test (NEAT) design is often used to scale item parameters from two different test forms. A subset of items, called the anchor items or common items, are administered as part of both test forms. These items are used to adjust the item calibrations for any differences in the ability distributions of the groups taking…

  17. The optimal sequence and selection of screening test items to predict fall risk in older disabled women: the Women's Health and Aging Study.

    Lamb, Sarah E; McCabe, Chris; Becker, Clemens; Fried, Linda P; Guralnik, Jack M

    2008-10-01

    Falls are a major cause of disability, dependence, and death in older people. Brief screening algorithms may be helpful in identifying risk and leading to more detailed assessment. Our aim was to determine the most effective sequence of falls screening test items from a wide selection of recommended items including self-report and performance tests, and to compare performance with other published guidelines. Data were from a prospective, age-stratified, cohort study. Participants were 1002 community-dwelling women aged 65 years old or older, experiencing at least some mild disability. Assessments of fall risk factors were conducted in participants' homes. Fall outcomes were collected at 6 monthly intervals. Algorithms were built for prediction of any fall over a 12-month period using tree classification with cross-set validation. Algorithms using performance tests provided the best prediction of fall events, and achieved moderate to strong performance when compared to commonly accepted benchmarks. The items selected by the best performing algorithm were the number of falls in the last year and, in selected subpopulations, frequency of difficulty balancing while walking, a 4 m walking speed test, body mass index, and a test of knee extensor strength. The algorithm performed better than that from the American Geriatric Society/British Geriatric Society/American Academy of Orthopaedic Surgeons and other guidance, although these findings should be treated with caution. Suggestions are made on the type, number, and sequence of tests that could be used to maximize estimation of the probability of falling in older disabled women.
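The sequential-screening idea described above (falls history first, then balance and gait items in selected subgroups) can be sketched as a small decision tree. Everything below is a hypothetical illustration: the thresholds, risk labels, and function name are not the study's fitted tree, only the shape of such an algorithm:

```python
def fall_risk_screen(falls_last_year, balance_difficulty_freq, walk_speed_ms):
    """Illustrative sequential fall-risk screen (hypothetical thresholds).

    falls_last_year        : count of falls in the previous 12 months
    balance_difficulty_freq: 0-4 rating of difficulty balancing while walking
    walk_speed_ms          : usual walking speed in meters per second
    """
    if falls_last_year >= 2:                 # strong history: screen positive outright
        return "high"
    if falls_last_year == 1:                 # one fall: check balance in this subgroup
        return "high" if balance_difficulty_freq >= 3 else "moderate"
    # no falls: fall back on a performance item
    return "moderate" if walk_speed_ms < 0.8 else "low"
```

Classification-tree fitting chooses which item to branch on at each node and where to cut it, which is how the study arrived at its particular sequence of self-report and performance tests.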

  18. Is It Working? Distractor Analysis Results from the Test Of Astronomy STandards (TOAST) Assessment Instrument

    Slater, Stephanie

    2009-05-01

    The Test Of Astronomy STandards (TOAST) assessment instrument is a multiple-choice survey tightly aligned to the consensus learning goals stated by the American Astronomical Society - Chair's Conference on ASTRO 101, the American Association for the Advancement of Science's Project 2061 Benchmarks, and the National Research Council's National Science Education Standards. Researchers from the Cognition in Astronomy, Physics and Earth sciences Research (CAPER) Team at the University of Wyoming's Science and Math Teaching Center (UWYO SMTC) have been conducting a question-by-question distractor analysis procedure to determine the sensitivity and effectiveness of each item. In brief, the frequency of each possible answer choice, known as a foil or distractor on a multiple-choice test, is determined and compared to the existing literature on the teaching and learning of astronomy. In addition to having acceptable statistical difficulty and discrimination values, a well-functioning assessment item will show students selecting distractors in the relative proportions in which we expect them to respond based on known misconceptions and reasoning difficulties. Our distractor analysis suggests that all items are functioning as expected. These results add weight to the validity of the TOAST assessment instrument, which is designed to help instructors and researchers measure the impact of course-length duration instructional strategies for undergraduate science survey courses with learning goals tightly aligned to the consensus goals of the astronomy education community.
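The tallying step of a distractor analysis is simple to implement. A minimal sketch (function name and response string are illustrative, not the TOAST data) that converts a set of responses to one item into the per-option selection proportions that get compared against known misconceptions:

```python
from collections import Counter

def distractor_frequencies(responses, options="ABCD"):
    """Proportion of examinees selecting each answer choice for one item.

    responses : iterable of selected option letters, one per examinee
    options   : the full set of answer choices on the item
    """
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts.get(opt, 0) / n for opt in options}
```

An option that nobody selects is a non-functioning distractor; one chosen far more often than the misconception literature predicts flags the item for review.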

  19. The construct equivalence and item bias of the pib/SpEEx conceptualisation-ability test for members of five language groups in South Africa

    Pieter Schaap

    2008-11-01

    This study's objective was to determine whether the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test displays construct equivalence and item bias for members of five selected language groups in South Africa. The sample consisted of a non-probability convenience sample (N = 6 261) of members of five language groups (speakers of Afrikaans, English, North Sotho, Setswana and isiZulu) working in the medical and beverage industries or studying at higher-educational institutions. Exploratory factor analysis with target rotations confirmed the PIB/SpEEx 100's construct equivalence for the respondents from these five language groups. No evidence of either uniform or non-uniform item bias of practical significance was found for the sample.

  20. Item analysis and evaluation in the examinations in the faculty of ...

    2014-11-05

    Key words: classical test theory, item analysis, item difficulty, item discrimination, item response theory, reliability.

  1. Developing Pairwise Preference-Based Personality Test and Experimental Investigation of Its Resistance to Faking Effect by Item Response Model

    Usami, Satoshi; Sakamoto, Asami; Naito, Jun; Abe, Yu

    2016-01-01

    Recent years have shown increased awareness of the importance of personality tests in educational, clinical, and occupational settings, and developing faking-resistant personality tests is a very pragmatic issue for achieving more precise measurement. Inspired by Stark (2002) and Stark, Chernyshenko, and Drasgow (2005), we develop a pairwise…

  2. The Impact Analysis of Psychological Reliability of Population Pilot Study For Selection of Particular Reliable Multi-Choice Item Test in Foreign Language Research Work

    Seyed Hossein Fazeli

    2010-10-01

    This study examines psychological reliability, its importance and application, and investigates the impact of the psychological reliability of a pilot-study population on the selection of a reliable multiple-choice item test in foreign-language research. The population for subject recruitment comprised all undergraduate students (both male and female) in their second semester at a large university in Iran who study English as a compulsory paper. In Iran, English is taught as a foreign language.

  3. Equal Opportunity in the Classroom: Test Construction in a Diversity-Sensitive Environment.

    Ghorpade, Jai; Lackritz, James R.

    1998-01-01

    Two multiple-choice tests and one essay test were taken by 231 students (50/50 male/female, 192 White, 39 East Asian, Black, Mexican American, or Middle Eastern). Multiple-choice tests showed no significant differences in equal employment opportunity terms; women and men scored about the same on essays, but minority students had significantly…

  4. Defining surgical criteria for empty nose syndrome: Validation of the office-based cotton test and clinical interpretability of the validated Empty Nose Syndrome 6-Item Questionnaire.

    Thamboo, Andrew; Velasquez, Nathalia; Habib, Al-Rahim R; Zarabanda, David; Paknezhad, Hassan; Nayak, Jayakar V

    2017-08-01

    The validated Empty Nose Syndrome 6-Item Questionnaire (ENS6Q) identifies empty nose syndrome (ENS) patients. The unvalidated cotton test assesses improvement in ENS-related symptoms. By first validating the cotton test using the ENS6Q, we define the minimal clinically important difference (MCID) score for the ENS6Q. Individual case-control study. Fifteen patients diagnosed with ENS and 18 controls with non-ENS sinonasal conditions underwent office cotton placement. Both groups completed ENS6Q testing in three conditions (precotton, cotton in situ, and postcotton) to measure the reproducibility of ENS6Q scoring. Participants also completed a five-item transition scale ranging from "much better" to "much worse" to rate subjective changes in nasal breathing with and without cotton placement. Mean changes for each transition point, and the ENS6Q MCID, were then calculated. In the precotton condition, significant differences (P < .001) in all ENS6Q questions between ENS patients and controls were noted. With cotton in situ, nearly all prior ENS6Q differences normalized between ENS and control patients. For ENS patients, the changes in the mean differences between the precotton and cotton in situ conditions compared to the postcotton versus cotton in situ conditions were insignificant among individuals. Including all 33 participants, the mean change in the ENS6Q between the parameters "a little better" and "about the same" was 4.25 (standard deviation [SD] = 5.79) and -2.00 (SD = 3.70), giving an MCID of 6.25. Cotton testing is a validated office test to assess for ENS patients. Cotton testing also helped to determine the MCID of the ENS6Q, which is a 7-point change from the baseline ENS6Q score. 3b. Laryngoscope, 127:1746-1752, 2017. © 2017 The American Laryngological, Rhinological and Otological Society, Inc.
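The MCID arithmetic implied by the abstract (4.25 minus -2.00 giving 6.25, rounded to a 7-point threshold) is an anchor-based calculation. A minimal sketch, with a function name of my own, of that approach:

```python
import numpy as np

def anchor_based_mcid(changes_little_better, changes_about_same):
    """Anchor-based MCID: mean score change in the group reporting
    'a little better' on the transition anchor, minus the mean change
    in the group reporting 'about the same'."""
    return np.mean(changes_little_better) - np.mean(changes_about_same)
```

Subtracting the "about the same" group's mean change removes background drift in scores that occurs even without a perceived change, leaving the score shift attributable to a minimally noticeable improvement.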

  5. Making System Dynamics Cool IV : Teaching & Testing with Cases & Quizzes

    Pruyt, E.

    2012-01-01

    This follow-up paper presents cases and multiple-choice questions for teaching and testing System Dynamics modeling. These cases and multiple-choice questions were developed and used between January 2012 and April 2012 in a large System Dynamics course (250+ 2nd year BSc and 40+ MSc students per year).

  6. North Star Ambulatory Assessment, 6-minute walk test and timed items in ambulant boys with Duchenne muscular dystrophy.

    Mazzone, Elena; Martinelli, Diego; Berardinelli, Angela; Messina, Sonia; D'Amico, Adele; Vasco, Gessica; Main, Marion; Doglio, Luca; Politano, Luisa; Cavallaro, Filippo; Frosini, Silvia; Bello, Luca; Carlesi, Adelina; Bonetti, Anna Maria; Zucchini, Elisabetta; De Sanctis, Roberto; Scutifero, Marianna; Bianco, Flaviana; Rossi, Francesca; Motta, Maria Chiara; Sacco, Annalisa; Donati, Maria Alice; Mongini, Tiziana; Pini, Antonella; Battini, Roberta; Pegoraro, Elena; Pane, Marika; Pasquini, Elisabetta; Bruno, Claudio; Vita, Giuseppe; de Waure, Chiara; Bertini, Enrico; Mercuri, Eugenio

    2010-11-01

    The North Star Ambulatory Assessment is a functional scale specifically designed for ambulant boys affected by Duchenne muscular dystrophy (DMD). Recently the 6-minute walk test has also been used as an outcome measure in trials in DMD. The aim of our study was to assess a large cohort of ambulant boys affected by DMD using both the North Star Assessment and the 6-minute walk test. More specifically, we wished to establish the spectrum of findings for each measure and their correlation. This is a prospective multicenter study involving 10 centers. The cohort included 112 ambulant DMD boys of age ranging between 4.10 and 17 years (mean 8.18 ± 2.3 SD). Ninety-one of the 112 were on steroids: 37/91 on an intermittent and 54/91 on a daily regimen. The scores on the North Star assessment ranged from 6/34 to 34/34. The distance on the 6-minute walk test ranged from 127 to 560.6 m. The time to walk 10 m was between 3 and 15 s. The time to rise from the floor ranged from 1 to 27.5 s. Some patients were unable to rise from the floor. As expected, the results changed with age and were overall better in children treated with daily steroids. The North Star assessment had a moderate to good correlation with the 6-minute walk test and with timed rising from the floor, but less with the 10 m timed walk/run test. The 6-minute walk test, in contrast, had a better correlation with the 10 m timed walk/run test than with timed rising from the floor. These findings suggest that a combination of these outcome measures can be effectively used in ambulant DMD boys and will provide information on different aspects of motor function that may not be captured using a single measure. Copyright © 2010. Published by Elsevier B.V.

  7. Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education

    Lawton Gemma

    2005-03-01

    Background: As assessment has been shown to direct learning, it is critical that the examinations developed to test clinical competence in medical undergraduates are valid and reliable. The use of extended matching questions (EMQs) has been advocated to overcome some of the criticisms of using multiple-choice questions to test factual and applied knowledge. Methods: We analysed the results from the extended matching questions examination taken by 4th year undergraduate medical students in the academic year 2001 to 2002. Rasch analysis was used to examine whether the set of questions used in the examination mapped on to a unidimensional scale, the degree of difficulty of questions within and between the various medical and surgical specialties, and the pattern of responses within individual questions to assess the impact of the distractor options. Results: Analysis of a subset of items and of the full examination demonstrated internal construct validity and the absence of bias on the majority of questions. Three main patterns of response selection were identified. Conclusion: Modern psychometric methods based upon the work of Rasch provide a useful approach to the calibration and analysis of EMQ undergraduate medical assessments. The approach allows for a formal test of the unidimensionality of the questions and thus the validity of the summed score. Given the metric calibration which follows fit to the model, it also allows for the establishment of item banks to facilitate continuity and equity in exam standards.
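    As a sketch (ours, not from the record above), the Rasch model that underlies this kind of analysis gives the probability of a correct response as a logistic function of the gap between person ability and item difficulty:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response for a person of
    ability theta on an item of difficulty b (both on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the success probability is exactly 0.5;
# unidimensionality means a single theta per person drives every item.
print(rasch_prob(0.0, 0.0))                          # 0.5
print(rasch_prob(1.0, 0.0) > rasch_prob(0.0, 0.0))   # True
```

    Fit statistics and the summed-score validity claim both follow from this single-parameter-per-person form.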

  8. Empirical versus Random Item Selection in the Design of Intelligence Test Short Forms--The WISC-R Example.

    Goh, David S.

    1979-01-01

    The advantages of using psychometric theory to design short forms of intelligence tests are demonstrated by comparing such usage to a systematic random procedure that has previously been used. The Wechsler Intelligence Scale for Children Revised (WISC-R) Short Form is presented as an example. (JKS)

  9. Test-retest reliability of Antonovsky's 13-item sense of coherence scale in patients with hand-related disorders

    Hansen, Alice Ørts; Kristensen, Hanne Kaae; Cederlund, Ragnhild

    2017-01-01

    ...to be a powerful tool to measure the ICF component personal factors, which could have an impact on patients' rehabilitation outcomes. Implications for rehabilitation: Antonovsky's SOC-13 scale showed test-retest reliability for patients with hand-related disorders. The SOC-13 scale could be a suitable tool to help measure personal factors.

  10. Cross-cultural development of an item list for computer-adaptive testing of fatigue in oncological patients

    Giesinger, Johannes M.; Petersen, Morten Aa.; Grønvold, Mogens

    2011-01-01

    Within an ongoing project of the EORTC Quality of Life Group, we are developing computerized adaptive test (CAT) measures for the QLQ-C30 scales. These new CAT measures are conceptualised to reflect the same constructs as the QLQ-C30 scales. Accordingly, the Fatigue-CAT is intended to capture physical and general fatigue.

  11. On-Demand Testing and Maintaining Standards for General Qualifications in the UK Using Item Response Theory: Possibilities and Challenges

    He, Qingping

    2012-01-01

    Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high stakes examinations like the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues…

  12. Using Standards and Empirical Evidence to Develop Academic English Proficiency Test Items in Reading. CSE Technical Report 664

    Bailey, Alison L.; Stevens, Robin; Butler, Frances A.; Huang, Becky; Miyoshi, Judy N.

    2005-01-01

    The work we report focuses on utilizing linguistic profiles of mathematics, science and social studies textbook selections for the creation of reading test specifications. Once we determined that a text and associated tasks fit within the parameters established in Butler et al. (2004), they underwent both internal and external review by language…

  13. Evaluation of the Relative Validity and Test-Retest Reliability of a 15-Item Beverage Intake Questionnaire in Children and Adolescents.

    Hill, Catelyn E; MacDougall, Carly R; Riebl, Shaun K; Savla, Jyoti; Hedrick, Valisa E; Davy, Brenda M

    2017-11-01

    Added sugar intake, in the form of sugar-sweetened beverages (SSBs), may contribute to weight gain and obesity development in children and adolescents. A valid and reliable brief beverage intake assessment tool for children and adolescents could facilitate research in this area. The purpose of this investigation was to evaluate the relative validity and test-retest reliability of a 15-item beverage intake questionnaire (BEVQ) for assessing usual beverage intake in children and adolescents. This cross-sectional investigation included four study visits within a 2- to 3-week time period. Participants (333 enrolled; 98% completion rate) were children aged 6 to 11 years and adolescents aged 12 to 18 years recruited from the New River Valley, VA, region from January 2014 to September 2015. Study visits included assessment of height/weight, health history, and four 24-hour dietary recalls (24HRs). The BEVQ was completed at two visits (BEVQ 1, BEVQ 2). To evaluate relative validity, BEVQ 1 was compared with habitual beverage intake determined by the averaged 24HRs. To evaluate test-retest reliability, BEVQ 1 was compared with BEVQ 2. Analyses included descriptive statistics, independent sample t tests, chi-square tests, one-way analysis of variance, paired sample t tests, and correlational analyses. In the full sample, self-reported water and total SSB intake were not different between BEVQ 1 and 24HR (mean differences 0±1 fl oz and 0±1 fl oz, respectively; both P values >0.05). Reported intake across all beverage categories was significantly correlated between BEVQ 1 and BEVQ 2. Intake of milk beverages was not different (all P values >0.05) between BEVQ 1 and 24HR (mean differences: whole milk=3±4 kcal, reduced-fat milk=9±5 kcal, and fat-free milk=7±6 kcal, which is 7±15 total beverage kilocalories). In adolescents (n=200), water and SSB kilocalories were not different (both P values >0.05) between BEVQ 1 and 24HR (mean differences: -1±1 fl oz and 12±9 kcal, respectively). A 15
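    The paired-sample t tests used above to compare BEVQ 1 against the averaged recalls can be sketched as follows (a minimal illustration with made-up numbers, not the study's data):

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic: mean of the pairwise differences
    divided by the standard error of those differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical intakes (fl oz) reported on the questionnaire vs. the recalls:
t = paired_t([1, 3, 5], [0, 3, 4])
print(round(t, 3))  # 2.0
```

    A mean difference near zero (as for water and total SSB intake above) yields a small t and a non-significant P value, which is the desired outcome for a validity comparison.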

  14. Analyzing Test-Taking Behavior: Decision Theory Meets Psychometric Theory.

    Budescu, David V; Bo, Yuanchao

    2015-12-01

    We investigate the implications of penalizing incorrect answers to multiple-choice tests, from the perspective of both test-takers and test-makers. To do so, we use a model that combines a well-known item response theory model with prospect theory (Kahneman and Tversky, Prospect theory: An analysis of decision under risk, Econometrica 47:263-91, 1979). Our results reveal that when test-takers are fully informed of the scoring rule, the use of any penalty has detrimental effects for both test-takers (they are always penalized in excess, particularly those who are risk averse and loss averse) and test-makers (the bias of the estimated scores, as well as the variance and skewness of their distribution, increase as a function of the severity of the penalty).
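    A worked example of the scoring rules at issue (our illustration, not the authors' prospect-theory model): under classical formula scoring with k options, a penalty of 1/(k-1) per wrong answer makes blind guessing worth zero in expectation, and any harsher penalty makes guessing a losing bet even for a risk-neutral examinee.

```python
def expected_guess_score(k: int, penalty: float) -> float:
    """Expected score from blind guessing on a k-option item:
    +1 for a correct answer, -penalty for an incorrect one."""
    p_correct = 1.0 / k
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# Classical correction for guessing on a 4-option item: penalty 1/3.
print(round(expected_guess_score(4, 1 / 3), 10))  # 0.0
print(expected_guess_score(4, 1.0))               # -0.5
```

    Risk-averse and loss-averse test-takers skip items even when this expectation is non-negative, which is why the paper finds they are "penalized in excess".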

  15. Using Differential Item Functioning Procedures to Explore Sources of Item Difficulty and Group Performance Characteristics.

    Scheuneman, Janice Dowd; Gerritz, Kalle

    1990-01-01

    Differential item functioning (DIF) methodology for revealing sources of item difficulty and performance characteristics of different groups was explored. A total of 150 Scholastic Aptitude Test items and 132 Graduate Record Examination general test items were analyzed. DIF was evaluated for males and females and Blacks and Whites. (SLD)

  16. Medical Students' vs. Family Physicians' Assessment of Practical and Logical Values of Pathophysiology Multiple-Choice Questions

    Secic, Damir; Husremovic, Dzenana; Kapur, Eldan; Jatic, Zaim; Hadziahmetovic, Nina; Vojnikovic, Benjamin; Fajkic, Almir; Meholjic, Amir; Bradic, Lejla; Hadzic, Amila

    2017-01-01

    Testing strategies can have either a very positive or a very negative effect on the learning process. The aim of this study was to examine the degree of consistency between students and family medicine doctors in evaluating the practicality and logic of questions from a medical school pathophysiology test. The study engaged 77 family medicine doctors…

  17. Validation and structural analysis of the kinematics concept test

    A. Lichtenberger

    2017-04-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students’ conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part of this article we describe the development and the validation process of the KCT. We applied the KCT to 338 Swiss high school students who attended traditional teaching in kinematics. We analyzed the response data to provide the psychometric properties of the test. In the second part we present the results of a structural analysis of the test. An exploratory factor analysis of 664 student answers finally uncovered the seven kinematics concepts as factors. However, the analysis revealed a hierarchical structure of concepts. At the higher level, mathematical concepts group together, and then split up into physics concepts at the lower level. Furthermore, students who seem to understand a concept in one representation have difficulties transferring the concept to similar problems in another representation. Both results have implications for teaching kinematics. First, teaching mathematical concepts beforehand might be beneficial for learning kinematics. Second, instructions have to be designed to teach students the change between different representations.

  18. Validation and structural analysis of the kinematics concept test

    Lichtenberger, A.; Wagner, C.; Hofer, S. I.; Stern, E.; Vaterlaus, A.

    2017-06-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students' conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part of this article we describe the development and the validation process of the KCT. We applied the KCT to 338 Swiss high school students who attended traditional teaching in kinematics. We analyzed the response data to provide the psychometric properties of the test. In the second part we present the results of a structural analysis of the test. An exploratory factor analysis of 664 student answers finally uncovered the seven kinematics concepts as factors. However, the analysis revealed a hierarchical structure of concepts. At the higher level, mathematical concepts group together, and then split up into physics concepts at the lower level. Furthermore, students who seem to understand a concept in one representation have difficulties transferring the concept to similar problems in another representation. Both results have implications for teaching kinematics. First, teaching mathematical concepts beforehand might be beneficial for learning kinematics. Second, instructions have to be designed to teach students the change between different representations.

  19. Assessment of free and cued recall in Alzheimer's disease and vascular and frontotemporal dementia with 24-item Grober and Buschke test.

    Cerciello, Milena; Isella, Valeria; Proserpi, Alice; Papagno, Costanza

    2017-01-01

    Alzheimer's disease (AD), vascular dementia (VaD) and frontotemporal dementia (FTD) are the most common forms of dementia. It is well known that memory deficits in AD are different from those in VaD and FTD, especially with respect to cued recall. The aim of this clinical study was to compare the memory performance of 15 AD, 10 VaD and 9 FTD patients and 20 normal controls by means of a 24-item Grober-Buschke test [8]. The patient groups were comparable in terms of severity of dementia. We considered free and total recall (free plus cued) in both immediate and delayed recall and computed an Index of Sensitivity to Cueing (ISC) [8] for immediate and delayed trials. We assessed whether cued recall predicted the subsequent free recall across our patient groups. We found that AD patients recalled fewer items from the beginning and were less sensitive to cueing, supporting the hypothesis that memory disorders in AD depend on encoding and storage deficits. In immediate recall, VaD and FTD patients showed a similar memory performance and a stronger sensitivity to cueing than AD patients, suggesting that memory disorders in these patients are due to a difficulty in spontaneously implementing efficient retrieval strategies. However, we found a lower ISC in the delayed recall compared to the immediate trials in VaD than in FTD, due to greater forgetting in VaD.

  20. Compreensão da leitura: análise do funcionamento diferencial dos itens de um Teste de Cloze / Reading comprehension: differential item functioning analysis of a Cloze Test

    Katya Luciane Oliveira

    2012-01-01

    This study aimed to investigate the fit of a Cloze test to the Rasch model and to evaluate differential item functioning (DIF) by gender. Participants were 573 students from the 5th to 8th grades of public state schools in the states of São Paulo and Minas Gerais. The Cloze test was administered collectively. The analysis of the instrument revealed good fit to the Rasch model, and the items were answered according to the expected pattern, likewise showing good fit. Regarding DIF, only three items differentiated by gender. Based on the data, the responses given by boys and girls were balanced.

  1. An Improved Measure of Reading Skill: The Cognitive Structure Test

    Sorrells, Robert

    1997-01-01

    This study compared the construct validity and the predictive validity of a new test, called the Cognitive Structure Test, to multiple-choice tests of reading skill, namely the Armed Forces Vocational...

  2. Validation of science virtual test to assess 8th grade students' critical thinking on living things and environmental sustainability theme

    Rusyati, Lilit; Firman, Harry

    2017-05-01

    This research was motivated by the importance of multiple-choice questions that indicate the elements and sub-elements of critical thinking, and by the implementation of computer-based testing. The method used in this research was descriptive research for profiling the validation of a science virtual test to measure students' critical thinking in junior high school. Participants were 8th grade junior high school students (14 years old), with science teachers and experts as validators. The instruments used to capture the necessary data were an expert judgment sheet, a legibility test sheet, and a science virtual test package in multiple-choice form with four possible answers. There are four steps to validate the science virtual test to measure students' critical thinking on the theme of "Living Things and Environmental Sustainability" in 7th grade junior high school. These steps are analysis of core competence and basic competence based on the 2013 curriculum, expert judgment, legibility test, and trial test (limited and large-scale). Based on the trial test, items were classified as accepted, accepted but needing revision, or rejected. The reliability of the test is α = 0.747, categorized as 'high', meaning the test instrument is reliable and highly consistent. The validity of Rxy = 0.63 means that the validity of the instrument was categorized as 'high' according to the interpretation of the Rxy (correlation) value.
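    The reported reliability (α = 0.747) is Cronbach's alpha; a minimal from-scratch sketch of the statistic (our illustration, not the study's code):

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of item score vectors (one list per
    item, each the same length: one score per examinee)."""
    k = len(item_scores)
    n = len(item_scores[0])

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(sample_var(it) for it in item_scores)
                            / sample_var(totals))

# Two perfectly consistent dichotomous items give alpha = 1.0.
print(cronbach_alpha([[0, 1, 0, 1], [0, 1, 0, 1]]))  # 1.0
```

    Alpha rises as items covary; values around 0.7-0.8, like the 0.747 reported here, are conventionally read as acceptable-to-high internal consistency.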

  3. Status of the Review of Electric Items in Spain Related to the Post-Fukushima Stress Test Programme

    Martinez Moreno, Manuel R.; Perez Rodriguez, Alfonso

    2015-01-01

    The Spanish authorities have established a comprehensive compilation of the actions currently related to the post-Fukushima programme. It has been initiated at both national and international level and is developed in an Action Plan. This Plan is aligned to the 6 topics identified in the August 2012 CNS-EOM report, and organized in four parts. One of these parts relates to the loss of electrical power, with the clear objective of implementing new features to increase robustness. This programme has been reinforced, and the tasks on electric issues have increased as a consequence of this Plan. The normal tasks of the Electric Systems and I and C Branch will be presented along with the Fukushima-related issues. The Consejo de Seguridad Nuclear -CSN- (Nuclear Safety Council) maintains a permanent programme of control and surveillance of nuclear safety issues in Spanish nuclear power plants. The Electric Systems and I and C Branch of the CSN has different tasks related to electric issues: - Inspection, control and evaluation of different topics in normal and accident operation. - Surveillance testing inspections. - Design modification inspections and evaluation. - Reactive inspections. - Other activities: participation in the Escered project (predating the Fukushima accident), with the objective of analyzing exterior grid stability and checking that electric faults in the vicinity of the NPPs did not cause the simultaneous loss of the offsite supplies, or fault effects interacting with related inner systems; other tasks related to the management of ageing and long-term operation. Now, as a consequence, its tasks have been extended with some new Fukushima-related topics: - Analysis of beyond-design-basis accidents related to the U.S. SBO Rule (Reg. Guide 1.155), which is part of the design bases for the Spanish plants designed by Westinghouse/General Electric; switchyard/grid events and extreme weather events are considered, with 10 minutes to connect an alternate source (if provided; if not, use of d

  4. 不同认知成分在图形推理测验项目难度预测中的作用 / The Role of Different Cognitive Components in the Prediction of the Figural Reasoning Test's Item Difficulty

    李中权; 王力; 张厚粲; 周仁来

    2011-01-01

    Figural reasoning tests (as represented by Raven's tests) are widely applied as effective measures of fluid intelligence in recruitment and personnel selection. However, several studies have revealed that those tests are no longer appropriate due to high item exposure rates. Computerized automatic item generation (AIG) has gradually been recognized as a promising technique for handling item exposure. Understanding sources of item variation constitutes the initial stage of computerized AIG, that is, searching for the underlying processing components and the stimuli that significantly influence those components. Some studies have explored sources of item variation, but so far there are no consistent results. This study investigated the relation between item difficulties and stimulus factors (e.g., familiarity of figures, abstraction of attributes, perceptual organization, and memory load) and determined the relative importance of those factors in predicting item difficulties. Eight sets of figural reasoning tests (each set containing 14 items imitating items from Raven's Advanced Progressive Matrices, APM) were constructed, manipulating the familiarity of figures, the degree of abstraction of attributes, the perceptual organization, and the types and number of rules. Using an anchor-test design, these tests were administered via the internet to 6323 participants with 10 items drawn from the APM as anchor items; thus, each participant completed 14 items from one set and 10 anchor items within half an hour. In order to prevent participants from using a response-elimination strategy, we presented the item stem first, then the alternatives in turn, and asked participants to determine which alternative was the best. DIMTEST analyses were conducted on the participants' responses on each of the eight tests. Results showed that items measure a single dimension on each test. A likelihood ratio test indicated that the data fit the two-parameter logistic model (2PL) best. Items were
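    The two-parameter logistic model mentioned above adds an item discrimination parameter to the one-parameter (Rasch) form; a minimal sketch (ours, not the study's implementation):

```python
import math

def two_pl(theta: float, a: float, b: float) -> float:
    """2PL IRT model: a = item discrimination, b = item difficulty,
    theta = person ability (all on the logit scale)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is 0.5 regardless of discrimination;
# a larger `a` steepens the curve around that point.
print(two_pl(0.0, 1.5, 0.0))                            # 0.5
print(two_pl(1.0, 2.0, 0.0) > two_pl(1.0, 0.5, 0.0))    # True
```

    Predicting the fitted b parameters from stimulus factors (familiarity, abstraction, rule count, and so on) is what "predicting item difficulty" means in this study.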

  5. Towards an authoring system for item construction

    Rikers, Jos H.A.N.

    1988-01-01

    The process of writing test items is analyzed, and a blueprint is presented for an authoring system for test item writing to reduce invalidity and to structure the process of item writing. The developmental methodology is introduced, and the first steps in the process are reported. A historical

  6. Using Cognitive Testing to Develop Items for Surveying Asian American Cancer Patients and Their Caregivers as a Pathway to Culturally Competent Care.

    Bolcic-Jankovic, Dragana; Lu, Fengxin; Colten, Mary Ellen; McCarthy, Ellen P

    2016-02-01

    We report the results from cognitive interviews with Asian American patients and their caregivers. We interviewed seven caregivers and six patients who were all bilingual Asian Americans. The main goal of the cognitive interviews was to test a survey instrument developed for a study about perspectives of Asian American patients with advanced cancer who are facing decisions around end-of-life care. We were particularly interested to see whether items commonly used in White and Black populations are culturally meaningful and equivalent in Asian populations, primarily those of Chinese and Vietnamese ethnicity. Our exploration shows that understanding respondents' language proficiency, degree of acculturation, and cultural context of receiving, processing, and communicating information about medical care can help design questions that are appropriate for Asian American patients and caregivers, and therefore can help researchers obtain quality data about the care Asian American cancer patients receive. © The Author(s) 2016.

  7. Furnace System Testing to Support Lower-Temperature Stabilization of High Chloride Plutonium Oxide Items at the Hanford Plutonium Finishing Plant

    Schmidt, Andrew J.; Gerber, Mark A.; Fischer, Christopher M.; Elmore, Monte R.

    2003-01-01

    High chloride content plutonium (HCP) oxides are impure plutonium oxide scrap which contain NaCl, KCl, MgCl2 and/or CaCl2 salts at potentially high concentrations and must be stabilized at 950 °C per the DOE Standard, DOE-STD-3013-2000. The chlorides pose challenges to stabilization because volatile chloride salts and decomposition products can corrode furnace heating elements and downstream ventilation components. Thermal stabilization of HCP items at 750 °C (without water washing) is being investigated as an alternative method for meeting the intent of DOE-STD-3013-2000. This report presents the results from a series of furnace tests conducted to develop material balance and system operability data to support the evaluation of lower-temperature thermal stabilization.

  8. Applying Item Response Theory Methods to Examine the Impact of Different Response Formats

    Hohensinn, Christine; Kubinger, Klaus D.

    2011-01-01

    In aptitude and achievement tests, different response formats are usually used. A fundamental distinction must be made between the class of multiple-choice formats and the constructed response formats. Previous studies have examined the impact of different response formats applying traditional statistical approaches, but these influences can also…

  9. Investigating Robustness of Item Response Theory Proficiency Estimators to Atypical Response Behaviors under Two-Stage Multistage Testing. ETS GRE® Board Research Report. ETS GRE®-16-03. ETS Research Report No. RR-16-22

    Kim, Sooyeon; Moses, Tim

    2016-01-01

    The purpose of this study is to evaluate the extent to which item response theory (IRT) proficiency estimation methods are robust to the presence of aberrant responses under the "GRE"® General Test multistage adaptive testing (MST) design. To that end, a wide range of atypical response behaviors affecting as much as 10% of the test items…

  10. Measuring outcomes in allergic rhinitis: psychometric characteristics of a Spanish version of the congestion quantifier seven-item test (CQ7

    Mullol Joaquim

    2011-03-01

    Background: No control tools for nasal congestion (NC) are currently available in Spanish. This study aimed to adapt and validate the Congestion Quantifier Seven Item Test (CQ7) for Spain. Methods: The CQ7 was adapted from English following international guidelines. The instrument was validated in an observational, prospective study in allergic rhinitis patients with NC (N = 166) and a control group without NC (N = 35). Participants completed the CQ7, the MOS sleep questionnaire, and a measure of psychological well-being (PGWBI). Clinical data included NC severity rating, acoustic rhinometry, and total symptom score (TSS). Internal consistency was assessed using Cronbach's alpha and test-retest reliability using the intraclass correlation coefficient (ICC). Construct validity was tested by examining correlations with other outcome measures and the ability to discriminate between groups classified by NC severity. Sensitivity and specificity were assessed using the area under the receiver operating curve (AUC), and responsiveness over time using effect sizes (ES). Results: Cronbach's alpha for the CQ7 was 0.92, and the ICC was 0.81, indicating good reliability. The CQ7 correlated most strongly with the TSS (r = 0.60). Conclusions: The Spanish version of the CQ7 is appropriate for detecting, measuring, and monitoring NC in allergic rhinitis patients.

  11. Teste de Raciocínio Auditivo Musical (RAu): estudo inicial por meio da Teoria de Resposta ao Item / Test de Raciocinio Auditivo Musical (RAu): estudio inicial a través de la Teoría de Respuesta al Ítem / Auditory Musical Reasoning Test (RAu): an initial study with Item Response Theory

    Fernando Pessotto

    2012-12-01

    This study sought evidence of validity based on internal structure and on criterion for an instrument assessing auditory processing of musical abilities (Auditory Processing Test with Musical Stimuli, RAu). A total of 162 people of both sexes were assessed (56.8% men), aged between 15 and 59 years (M=27.5; SD=9.01). Participants were divided into musicians (N=24), amateurs (N=62) and laypersons (N=76) according to their level of musical knowledge. Full Information Factor Analysis was used to verify the dimensionality of the instrument, and the properties of the items were examined by means of Item Response Theory (IRT). In addition, we sought to identify the capacity to discriminate between the groups of musicians and non-musicians. The data provide evidence that the items measure one main dimension (alpha=0.92), with high capacity to differentiate the groups of professional musicians, amateurs and laypersons, yielding a criterion validity coefficient of r=0.68. The results indicate positive evidence of reliability and validity for the RAu.

  12. A Review of Classical Methods of Item Analysis.

    French, Christine L.

    Item analysis is a very important consideration in the test development process. It is a statistical procedure to analyze test items that combines methods used to evaluate the important characteristics of test items, such as difficulty, discrimination, and distractibility of the items in a test. This paper reviews some of the classical methods for…
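    The classical indices this record names (difficulty, discrimination) can be sketched in a few lines. The following is an illustrative computation over a scored 0/1 response matrix, not code from the reviewed paper: difficulty is the proportion correct, and discrimination is approximated here by the point-biserial correlation between the item and the rest-score.

```python
def point_biserial(x, y):
    """Pearson correlation between a dichotomous x and a continuous y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_analysis(responses):
    """Classical item analysis for a person x item matrix of 0/1 scores.

    Returns (difficulties, discriminations): difficulty is the proportion
    of examinees answering correctly (the p-value); discrimination is the
    point-biserial correlation between the item score and the rest-score
    (total minus the item, to avoid part-whole inflation).
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    difficulties, discriminations = [], []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]
        difficulties.append(sum(item) / n_persons)
        discriminations.append(point_biserial(item, rest))
    return difficulties, discriminations
```

    For example, `item_analysis([[1,1,0],[1,0,0],[1,1,1],[0,0,0],[1,1,1]])` gives difficulties of 0.8, 0.6, and 0.4 for the three items.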

  13. Validation of the Ten-Item Internet Gaming Disorder Test (IGDT-10) and evaluation of the nine DSM-5 Internet Gaming Disorder criteria.

    Király, Orsolya; Sleczka, Pawel; Pontes, Halley M; Urbán, Róbert; Griffiths, Mark D; Demetrovics, Zsolt

    2017-01-01

    The inclusion of Internet Gaming Disorder (IGD) in the DSM-5 (Section 3) has given rise to much scholarly debate regarding the proposed criteria and their operationalization. The present study's aim was threefold: to (i) develop and validate a brief psychometric instrument (Ten-Item Internet Gaming Disorder Test; IGDT-10) to assess IGD using the definitions suggested in the DSM-5, (ii) contribute to the ongoing debate regarding the usefulness and validity of each of the nine IGD criteria (using Item Response Theory [IRT]), and (iii) investigate the cut-off threshold suggested in the DSM-5. An online sample of 4887 gamers (age range 14-64 years, mean age 22.2 years [SD=6.4], 92.5% male) was collected through Facebook and a gaming-related website with the cooperation of a popular Hungarian gaming magazine. A shopping voucher worth approximately 300 euros was raffled among participants to boost participation (i.e., a lottery incentive). Confirmatory factor analysis and a structural regression model were used to test the psychometric properties of the IGDT-10, and IRT analysis was conducted to test the measurement performance of the nine IGD criteria. Finally, latent class analysis along with sensitivity and specificity analysis was used to investigate the cut-off threshold proposed in the DSM-5. The analysis supported the IGDT-10's validity, reliability, and suitability for use in future research. Findings of the IRT analysis suggest that IGD is manifested through a different set of symptoms depending on the severity of the disorder. More specifically, "continuation", "preoccupation", "negative consequences" and "escape" were associated with lower severity of IGD, while "tolerance", "loss of control", "giving up other activities" and "deception" were associated with more severe levels. "Preoccupation" and "escape" provided very little information for the estimation of IGD severity. Finally, the threshold suggested in the DSM-5 was supported by our statistical analyses. The IGDT-10 is a valid and reliable instrument for assessing IGD.
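    The cut-off evaluation described in this record reduces to comparing a criterion-count threshold against a reference classification (here, latent-class membership). A minimal sketch, with invented data for illustration; the DSM-5 proposes endorsing five or more of the nine criteria:

```python
DSM5_THRESHOLD = 5  # endorse >= 5 of the 9 IGD criteria

def sensitivity_specificity(reference, criterion_counts, threshold=DSM5_THRESHOLD):
    """Evaluate a criterion-count cut-off against reference labels
    (e.g. latent-class membership). Returns (sensitivity, specificity)."""
    predicted = [c >= threshold for c in criterion_counts]
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    tn = sum(1 for r, p in zip(reference, predicted) if not r and not p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    return tp / (tp + fn), tn / (tn + fp)
```

    Sweeping the threshold from 1 to 9 and inspecting the resulting sensitivity/specificity pairs is how a proposed cut-off such as the DSM-5's is typically assessed.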

  14. Formulation of multiple choice questions as a revision exercise at the end of a teaching module in biochemistry.

    Bobby, Zachariah; Radhika, M R; Nandeesha, H; Balasubramanian, A; Prerna, Singh; Archana, Nimesh; Thippeswamy, D N

    2012-01-01

    Graduate medical students often get little opportunity to clarify their doubts and reinforce their concepts after lecture classes. The aim was to assess the effect of MCQ preparation by graduate medical students as a revision exercise on the topic "Mineral metabolism." At the end of the regular teaching module on the topic, graduate medical students were asked to prepare the stems of 15 MCQs based on the four options given for each; they were told that one of the options would be the answer to the MCQ and the remaining three would be the distracters. They were further guided in their task by a few key words provided for the stem of the expected MCQ. In the first phase of the exercise, the students attempted the MCQ preparation individually, without peer consultation. In the second phase, the students participated in small-group discussion to formulate the best MCQs of the group. The effects on low, medium, and high achievers were evaluated by pre- and post-tests with the same set of MCQs. Both the individual endeavor in Phase 1 and the small-group discussion in Phase 2 contributed significantly to the gain from the exercise. The gains from the individual task and from the small-group discussion were equal among the different categories of students; both phases of the exercise were equally beneficial for the low, medium, and high achievers. The high and medium achievers retained the gain even after 1 week, whereas the low achievers could not retain it completely. Formulation of MCQs is an effective and useful unconventional revision exercise in biochemistry for graduate medical students. Copyright © 2012 Wiley Periodicals, Inc.

  15. Semantic Similarity Measures for the Generation of Science Tests in Basque

    Aldabe, Itziar; Maritxalar, Montse

    2014-01-01

    The work we present in this paper aims to help teachers create multiple-choice science tests. We focus on a scientific vocabulary-learning scenario taking place in a Basque-language educational environment. In this particular scenario, we explore the option of automatically generating Multiple-Choice Questions (MCQ) by means of Natural Language…

  16. Assessing the test-retest repeatability of the Vietnamese version of the National Eye Institute 25-item Visual Function Questionnaire among bilateral cataract patients for a Vietnamese population.

    To, Kien Gia; Meuleners, Lynn; Chen, Huei-Yang; Lee, Andy; Do, Dung Van; Duong, Dat Van; Phi, Tien Duy; Tran, Hoang Huy; Nguyen, Nguyen Do

    2014-06-01

    To determine the test-retest repeatability of the National Eye Institute 25-item Visual Function Questionnaire (NEI VFQ-25) for use with older Vietnamese adults with bilateral cataract. The questionnaire was translated into Vietnamese and back-translated into English by two independent translators. Patients with bilateral cataract aged 50 and older completed the questionnaire on two separate occasions, one to two weeks apart. Internal consistency and test-retest repeatability were assessed using Cronbach's α and intraclass correlation coefficients, respectively. The average age of participants was 67 ± 8 years and most participants were female (73%). Internal consistency was acceptable, with α coefficients above 0.7 for all subscales, and intraclass correlation coefficients were 0.6 or greater for all subscales. The Vietnamese NEI VFQ-25 is reliable for use in studies assessing vision-related quality of life in older adults with bilateral cataract in Vietnam. We propose some modifications to the NEI-VFQ questions to reflect the activities of older people in Vietnam. © 2013 ACOTA.
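    As a reference for the internal-consistency statistic reported here, Cronbach's α can be computed directly from a person-by-item score matrix. A minimal sketch with invented data, not the study's analysis code:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a person x item score matrix (list of lists).

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score),
    where k is the number of items. Sample (n-1) variances are used.
    """
    k = len(scores[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

    Perfectly parallel items (every person scoring identically on both) give α = 1.0; the 0.7 threshold mentioned in the abstract is the conventional floor for acceptable internal consistency.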

  17. Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions.

    Haberman, Shelby J; Sinharay, Sandip; Chon, Kyong Hee

    2013-07-01

    Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
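    The residual described in this record compares observed proportions correct with model-implied probabilities. A toy sketch of that idea under a 2PL model with known item parameters; the grouping by ability is simplified and the parameters are invented, so this illustrates the flavor of the residual rather than the authors' exact estimator:

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def standardized_residuals(groups, a, b):
    """For each ability group given as (theta, n_correct, n_total),
    compare the observed proportion correct with the model-implied
    probability. Under a fitting model, each residual is approximately
    standard normal for large group sizes."""
    out = []
    for theta, n_correct, n_total in groups:
        p_model = icc_2pl(theta, a, b)
        p_obs = n_correct / n_total
        se = math.sqrt(p_model * (1.0 - p_model) / n_total)
        out.append((p_obs - p_model) / se)
    return out
```

    A group whose observed proportion matches the curve exactly yields a residual of zero; residuals far outside ±2 across many groups flag item misfit.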

  18. The establishment of an achievement test for determination of primary teachers’ knowledge level of earthquake

    Aydin, Süleyman, E-mail: yupul@hotmail.com; Haşiloğlu, M. Akif, E-mail: mehmet.hasiloglu@hotmail.com; Kunduraci, Ayşe, E-mail: ayse-kndrc@hotmail.com [Ağrı İbrahim Çeçen University, Faculty of Education, Science Education, Ağrı (Turkey)

    2016-04-18

    In this study the aim was to develop an academic achievement test to establish students’ knowledge about earthquakes and ways of protecting oneself from them. The method followed the steps that Webb (1994) set out for developing an academic achievement test for a unit. A multiple-choice test of 25 questions was prepared to measure pre-service teachers’ knowledge about earthquakes and earthquake protection. The multiple-choice test was submitted for review to six academics (one from the field of geography and five science educators) and two expert science teachers. The prepared test was administered to 93 pre-service teachers studying in the elementary education department in the 2014-2015 academic year. Following the validity and reliability analyses, the test was reduced to 20 items. From these applications, the Pearson product-moment split-half reliability coefficient was found to be 0.94; adjusted with the Spearman-Brown formula, the reliability coefficient was 0.97.
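    The step from the split-half coefficient of 0.94 to the reported 0.97 is the Spearman-Brown prophecy formula, which projects the correlation between two half-tests up to the reliability of the full-length test:

```python
def spearman_brown(r_half):
    """Step a split-half correlation up to full-test reliability:
    r_full = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)
```

    `spearman_brown(0.94)` gives about 0.969, which rounds to the 0.97 reported in the abstract.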

  20. Dutch translation and cross-cultural adaptation of the PROMIS® physical function item bank and cognitive pre-test in Dutch arthritis patients.

    Oude Voshaar, Martijn Ah; Ten Klooster, Peter M; Taal, Erik; Krishnan, Eswar; van de Laar, Mart Afj

    2012-03-05

    Patient-reported physical function is an established outcome domain in clinical studies in rheumatology. To overcome the limitations of the current generation of questionnaires, the Patient-Reported Outcomes Measurement Information System (PROMIS®) project in the USA has developed calibrated item banks for measuring several domains of health status in people with a wide range of chronic diseases. The aim of this study was to translate and cross-culturally adapt the PROMIS physical function item bank to the Dutch language and to pretest it in a sample of patients with arthritis. The items of the PROMIS physical function item bank were translated using rigorous forward-backward protocols and the translated version was subsequently cognitively pretested in a sample of Dutch patients with rheumatoid arthritis. Few issues were encountered in the forward-backward translation. Only 5 of the 124 items to be translated had to be rewritten because of culturally inappropriate content. Subsequent pretesting showed that overall, questions of the Dutch version were understood as they were intended, while only one item required rewriting. Results suggest that the translated version of the PROMIS physical function item bank is semantically and conceptually equivalent to the original. Future work will be directed at creating a Dutch-Flemish final version of the item bank to be used in research with Dutch speaking populations.