WorldWideScience

Sample records for multiple-choice test items

  1. Using automatic item generation to create multiple-choice test items.

    Gierl, Mark J; Lai, Hollis; Turner, Simon R

    2012-08-01

    Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these are efficient in measuring examinees' knowledge and skills across diverse content areas, multiple-choice items are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created the demand for large numbers of multiple-choice items. Our current approaches to item development cannot meet this demand. We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and we illustrate this approach by generating multiple-choice items for a medical licensure test in the content area of surgery. To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1248 multiple-choice items from one item model. Automatic item generation is a process that involves using models to generate items using computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines the content to generate new test items. By combining these outcomes, items can be generated automatically. © Blackwell Publishing Ltd 2012.
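
    The third stage, generating items from an item model, can be sketched as a systematic combination of model elements. A minimal sketch (all names and content below are hypothetical, not the authors' actual cognitive or item models):

```python
from itertools import product

# Hypothetical item model: a stem with placeholders whose values come
# from a cognitive model (names and values are illustrative only).
stem = ("A {age}-year-old patient presents with {finding} after {context}. "
        "What is the most appropriate next step?")

elements = {
    "age": ["25", "45", "70"],
    "finding": ["acute abdominal pain", "a palpable mass"],
    "context": ["a fall", "elective surgery"],
}

def generate_items(stem, elements):
    """Systematically combine element values to instantiate the item model."""
    keys = list(elements)
    for values in product(*(elements[k] for k in keys)):
        yield stem.format(**dict(zip(keys, values)))

items = list(generate_items(stem, elements))
print(len(items))  # 3 * 2 * 2 = 12 generated stems
```

    With richer models (more elements, constraints between them, and generated distractors), the same combinatorial idea scales to the hundreds or thousands of items the abstract describes.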

  2. Item difficulty of multiple choice tests dependent on different item response formats – An experiment in fundamental research on psychological assessment

    Kubinger, Klaus D.

    2007-12-01

    Multiple choice response formats are problematic, as an item is often scored as solved simply because the test-taker is a lucky guesser. Instead of applying pertinent IRT models which take guessing effects into account, a pragmatic approach of re-conceptualizing multiple choice response formats to reduce the chance of lucky guessing is considered. This paper compares the free response format with two different multiple choice formats: a common multiple choice format with a single correct response option and five distractors (“1 of 6”), and a multiple choice format with five response options, of which any number may be correct, where the item is only scored as mastered if all the correct response options and none of the wrong ones are marked (“x of 5”). An experiment was designed using pairs of items with exactly the same content but different response formats. 173 test-takers were randomly assigned to two test booklets of 150 items altogether. Rasch model analyses adduced a fitting item pool after the deletion of 39 items. The resulting item difficulty parameters were used for the comparison of the different formats. The multiple choice format “1 of 6” differs significantly from “x of 5”, with a relative effect of 1.63, while the multiple choice format “x of 5” does not significantly differ from the free response format. Therefore, the lower degree of difficulty of items with the “1 of 6” multiple choice format is an indicator of relevant guessing effects. In contrast, the “x of 5” multiple choice format can be seen as an appropriate substitute for the free response format.
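
    The guessing rates that motivate the “x of 5” format can be made concrete with a small calculation (a sketch; the 1/32 figure assumes a blind guesser marks each of the five options independently at random):

```python
from fractions import Fraction

# "1 of 6": one correct option among six; a blind guess picks it with p = 1/6.
p_1_of_6 = Fraction(1, 6)

# "x of 5": any subset of the five options may be correct, and the item is
# scored as mastered only if the exact pattern is reproduced. A blind guesser
# marking each option at random matches all five decisions with p = (1/2)^5.
p_x_of_5 = Fraction(1, 2) ** 5

print(p_1_of_6)  # 1/6
print(p_x_of_5)  # 1/32
```

    The roughly fivefold drop in the probability of a lucky full credit is consistent with the paper's finding that “x of 5” items behave like free-response items.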

  3. Evaluating the Psychometric Characteristics of Generated Multiple-Choice Test Items

    Gierl, Mark J.; Lai, Hollis; Pugh, Debra; Touchie, Claire; Boulais, André-Philippe; De Champlain, André

    2016-01-01

    Item development is a time- and resource-intensive process. Automatic item generation integrates cognitive modeling with computer technology to systematically generate test items. To date, however, items generated using cognitive modeling procedures have received limited use in operational testing situations. As a result, the psychometric…

  4. Science Library of Test Items. Volume Twenty-Two. A Collection of Multiple Choice Test Items Relating Mainly to Skills.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  5. Science Library of Test Items. Volume Eighteen. A Collection of Multiple Choice Test Items Relating Mainly to Chemistry.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  6. Science Library of Test Items. Volume Twenty. A Collection of Multiple Choice Test Items Relating Mainly to Physics, 1.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  7. Science Library of Test Items. Volume Seventeen. A Collection of Multiple Choice Test Items Relating Mainly to Biology.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  8. Science Library of Test Items. Volume Nineteen. A Collection of Multiple Choice Test Items Relating Mainly to Geology.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  9. Item and test analysis to identify quality multiple choice questions (MCQs) from an assessment of medical students of Ahmedabad, Gujarat

    Sanju Gajjar

    2014-01-01

    Background: Multiple choice questions (MCQs) are frequently used to assess students in different educational streams for their objectivity and wide reach of coverage in less time. However, the MCQs used must be of quality, which depends upon their difficulty index (DIF I), discrimination index (DI) and distractor efficiency (DE). Objective: To evaluate MCQs or items and develop a pool of valid items by assessing them with DIF I, DI and DE, and to revise, store or discard items based on the obtained results. Settings: The study was conducted in a medical school of Ahmedabad. Materials and Methods: An internal examination in Community Medicine was conducted after 40 hours of teaching during the 1st MBBS, which was attended by 148 out of 150 students. A total of 50 MCQs or items and 150 distractors were analyzed. Statistical Analysis: Data were entered and analyzed in MS Excel 2007; simple proportions, means, standard deviations and coefficients of variation were calculated, and the unpaired t test was applied. Results: Out of 50 items, 24 had "good to excellent" DIF I (31–60%) and 15 had "good to excellent" DI (> 0.25). Mean DE was 88.6%, considered ideal/acceptable, and non-functional distractors (NFDs) made up only 11.4%. Mean DI was 0.14. Poor DI (< 0.15), with negative DI in 10 items, indicates poor preparedness of students and some issues with the framing of at least some of the MCQs. An increased proportion of NFDs (incorrect alternatives selected by < 5% of students) in an item decreases its DE and makes the item easier. There were 15 items with 17 NFDs, while the remaining items did not have any NFD and had a mean DE of 100%. Conclusion: The study emphasizes the selection of quality MCQs which truly assess knowledge and are able to differentiate students of different abilities in a correct manner.
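
    The three indices named in this abstract can be computed directly from raw responses. A sketch with hypothetical data (function names and cut-offs are illustrative: DIF I as percentage correct, DI from upper/lower 27% score groups, and a distractor treated as functional when chosen by at least 5% of examinees):

```python
def difficulty_index(correct_flags):
    """DIF I: percentage of examinees answering the item correctly."""
    return 100.0 * sum(correct_flags) / len(correct_flags)

def discrimination_index(correct_flags, total_scores, frac=0.27):
    """DI: proportion correct in the top score group minus the bottom group,
    with groups formed from the upper and lower `frac` of total scores."""
    n = max(1, int(len(total_scores) * frac))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    return (sum(correct_flags[i] for i in high) / n
            - sum(correct_flags[i] for i in low) / n)

def distractor_efficiency(option_counts, correct_option, n_examinees):
    """DE: share of distractors that are functional (chosen by >= 5%)."""
    distractors = [c for opt, c in option_counts.items() if opt != correct_option]
    functional = sum(1 for c in distractors if c / n_examinees >= 0.05)
    return 100.0 * functional / len(distractors)

# Hypothetical item attempted by 20 examinees:
flags = [0] * 8 + [1] * 12            # 12 correct -> DIF I = 60%
scores = list(range(20))              # total test scores, aligned with flags
counts = {"A": 12, "B": 5, "C": 3, "D": 0}  # "A" is the key; "D" is an NFD
print(difficulty_index(flags))                  # 60.0
print(discrimination_index(flags, scores))      # 1.0
print(round(distractor_efficiency(counts, "A", 20), 1))  # 66.7
```

    With 2 of 3 distractors functional, DE is 66.7%; by the abstract's cut-offs this item would be flagged for revision of the dead distractor "D".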

  10. Dynamic Testing of Analogical Reasoning in 5- to 6-Year-Olds: Multiple-Choice versus Constructed-Response Training Items

    Stevenson, Claire E.; Heiser, Willem J.; Resing, Wilma C. M.

    2016-01-01

    Multiple-choice (MC) analogy items are often used in cognitive assessment. However, in dynamic testing, where the aim is to provide insight into potential for learning and the learning process, constructed-response (CR) items may be of benefit. This study investigated whether training with CR or MC items leads to differences in the strategy…

  11. Dynamic Testing of Analogical Reasoning in 5- to 6-Year-Olds : Multiple-Choice Versus Constructed-Response Training Items

    Stevenson, C.E.; Heiser, W.J.; Resing, W.C.M.

    2016-01-01

    Multiple-choice (MC) analogy items are often used in cognitive assessment. However, in dynamic testing, where the aim is to provide insight into potential for learning and the learning process, constructed-response (CR) items may be of benefit. This study investigated whether training with CR or MC

  12. Science Library of Test Items. Volume Twenty-One. A Collection of Multiple Choice Test Items Relating Mainly to Physics, 2.

    New South Wales Dept. of Education, Sydney (Australia).

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items are made available to teachers for the construction of unit tests or term examinations or as a basis for class discussion. Each collection was reviewed for content validity and reliability. The test items meet syllabus…

  13. A Diagnostic Study of Pre-Service Teachers' Competency in Multiple-Choice Item Development

    Asim, Alice E.; Ekuri, Emmanuel E.; Eni, Eni I.

    2013-01-01

    Large class size is an issue in testing at all levels of education. As a panacea to this, multiple choice test formats have become very popular. This case study was designed to diagnose pre-service teachers' competency in constructing questions (IQT); direct questions (DQT); and best answer (BAT) varieties of multiple choice items. Subjects were 88…

  14. To Show or Not to Show: The Effects of Item Stems and Answer Options on Performance on a Multiple-Choice Listening Comprehension Test

    Yanagawa, Kozo; Green, Anthony

    2008-01-01

    The purpose of this study is to examine whether the choice between three multiple-choice listening comprehension test formats results in any difference in listening comprehension test performance. The three formats entail (a) allowing test takers to preview both the question stem and answer options prior to listening; (b) allowing test takers to…

  15. Multiple-choice test of energy and momentum concepts

    Singh, Chandralekha; Rosengrant, David

    2016-01-01

    We investigate student understanding of energy and momentum concepts at the level of introductory physics by designing and administering a 25-item multiple choice test and conducting individual interviews. We find that most students have difficulty in qualitatively interpreting basic principles related to energy and momentum and in applying them in physical situations.

  16. The positive and negative consequences of multiple-choice testing.

    Roediger, Henry L; Marsh, Elizabeth J

    2005-09-01

    Multiple-choice tests are commonly used in educational settings but with unknown effects on students' knowledge. The authors examined the consequences of taking a multiple-choice test on a later general knowledge test in which students were warned not to guess. A large positive testing effect was obtained: Prior testing of facts aided final cued-recall performance. However, prior testing also had negative consequences. Prior reading of a greater number of multiple-choice lures decreased the positive testing effect and increased production of multiple-choice lures as incorrect answers on the final test. Multiple-choice testing may inadvertently lead to the creation of false knowledge.

  17. Evaluating the quality of medical multiple-choice items created with automated processes.

    Gierl, Mark J; Lai, Hollis

    2013-07-01

    Computerised assessment raises formidable challenges because it requires large numbers of test items. Automatic item generation (AIG) can help address this test development problem because it yields large numbers of new items both quickly and efficiently. To date, however, the quality of the items produced using a generative approach has not been evaluated. The purpose of this study was to determine whether automatic processes yield items that meet standards of quality that are appropriate for medical testing. Quality was evaluated firstly by subjecting items created using both AIG and traditional processes to rating by a four-member expert medical panel using indicators of multiple-choice item quality, and secondly by asking the panellists to identify which items were developed using AIG in a blind review. Fifteen items from the domain of therapeutics were created in three different experimental test development conditions. The first 15 items were created by content specialists using traditional test development methods (Group 1 Traditional). The second 15 items were created by the same content specialists using AIG methods (Group 1 AIG). The third 15 items were created by a new group of content specialists using traditional methods (Group 2 Traditional). These 45 items were then evaluated for quality by a four-member panel of medical experts and were subsequently categorised as either Traditional or AIG items. Three outcomes were reported: (i) the items produced using traditional and AIG processes were comparable on seven of eight indicators of multiple-choice item quality; (ii) AIG items can be differentiated from Traditional items by the quality of their distractors, and (iii) the overall predictive accuracy of the four expert medical panellists was 42%. Items generated by AIG methods are, for the most part, equivalent to traditionally developed items from the perspective of expert medical reviewers. While the AIG method produced comparatively fewer plausible…

  18. On the Equivalence of Constructed-Response and Multiple-Choice Tests.

    Traub, Ross E.; Fisher, Charles W.

    Two sets of mathematical reasoning and two sets of verbal comprehension items were cast into each of three formats--constructed response, standard multiple-choice, and Coombs multiple-choice--in order to assess whether tests with identical content but different formats measure the same attribute, except for possible differences in error variance…

  19. Evaluation of five guidelines for option development in multiple-choice item-writing.

    Martínez, Rafael J; Moreno, Rafael; Martín, Irene; Trigo, M Eva

    2009-05-01

    This paper evaluates certain guidelines for writing multiple-choice test items. The analysis of the responses of 5013 subjects to 630 items from 21 university classroom achievement tests suggests that an option should not differ in terms of heterogeneous content because such error has a slight but harmful effect on item discrimination. This also occurs with the "None of the above" option when it is the correct one. In contrast, results do not show the supposedly negative effects of a different-length option, the use of specific determiners, or the use of the "All of the above" option, which not only decreases difficulty but also improves discrimination when it is the correct option.

  20. Optimizing Multiple-Choice Tests as Learning Events

    Little, Jeri Lynn

    2011-01-01

    Although generally used for assessment, tests can also serve as tools for learning--but different test formats may not be equally beneficial. Specifically, research has shown multiple-choice tests to be less effective than cued-recall tests in improving the later retention of the tested information (e.g., see meta-analysis by Hamaker, 1986),…

  21. Optimizing multiple-choice tests as tools for learning.

    Little, Jeri L; Bjork, Elizabeth Ligon

    2015-01-01

    Answering multiple-choice questions with competitive alternatives can enhance performance on a later test, not only on questions about the information previously tested, but also on questions about related information not previously tested; in particular, on questions about information pertaining to the previously incorrect alternatives. In the present research, we assessed a possible explanation for this pattern: When multiple-choice questions contain competitive incorrect alternatives, test-takers are led to retrieve previously studied information pertaining to all of the alternatives in order to discriminate among them and select an answer, with such processing strengthening later access to information associated with both the correct and incorrect alternatives. Supporting this hypothesis, we found enhanced performance on a later cued-recall test for previously nontested questions when their answers had previously appeared as competitive incorrect alternatives in the initial multiple-choice test, but not when they had previously appeared as noncompetitive alternatives. Importantly, however, competitive alternatives were not more likely than noncompetitive alternatives to be intruded as incorrect responses, indicating that a general increased accessibility for previously presented incorrect alternatives could not be the explanation for these results. The present findings, replicated across two experiments (one in which corrective feedback was provided during the initial multiple-choice testing, and one in which it was not), thus strongly suggest that competitive multiple-choice questions can trigger beneficial retrieval processes for both tested and related information, and the results have implications for the effective use of multiple-choice tests as tools for learning.

  22. Are Faculty Predictions or Item Taxonomies Useful for Estimating the Outcome of Multiple-Choice Examinations?

    Kibble, Jonathan D.; Johnson, Teresa

    2011-01-01

    The purpose of this study was to evaluate whether multiple-choice item difficulty could be predicted either by a subjective judgment by the question author or by applying a learning taxonomy to the items. Eight physiology faculty members teaching an upper-level undergraduate human physiology course consented to participate in the study. The…

  23. Complement or Contamination: A Study of the Validity of Multiple-Choice Items when Assessing Reasoning Skills in Physics

    Anders Jönsson; David Rosenlund; Fredrik Alvén

    2017-01-01

    The purpose of this study is to investigate the validity of using multiple-choice (MC) items as a complement to constructed-response (CR) items when making decisions about student performance on reasoning tasks. CR items from a national test in physics have been reformulated into MC items and students’ reasoning skills have been analyzed in two substudies. In the first study, 12 students answered the MC items and were asked to explain their answers orally. In the second study, 102 students fr...

  24. Impact of Answer-Switching Behavior on Multiple-Choice Test Scores in Higher Education

    Ramazan BAŞTÜRK

    2011-06-01

    The multiple-choice format is one of the most popular selected-response item formats used in educational testing. Researchers have shown that the multiple-choice test is a useful vehicle for student assessment in core university subjects that usually have large student numbers. Even though educators, test experts and different test resources maintain the idea that the first answer should be retained, many researchers have argued that this position is not supported by empirical findings. The main question of this study is how answer-switching behavior affects multiple-choice test scores. Additionally, gender differences and the relationship between the number of answer switches and item parameters (item difficulty and item discrimination) were investigated. The participants in this study consisted of 207 upper-level College of Education students from mid-sized universities. A midterm exam consisting of 20 multiple-choice questions was used. According to the results of this study, answer-switching behavior statistically significantly increases test scores. On the other hand, there is no significant gender difference in answer-switching behavior. Additionally, there is a significant negative relationship between answer-switching behavior and item difficulty.

  25. Multiple-Choice versus Constructed-Response Tests in the Assessment of Mathematics Computation Skills.

    Gadalla, Tahany M.

    The equivalence of multiple-choice (MC) and constructed response (discrete) (CR-D) response formats as applied to mathematics computation at grade levels two to six was tested. The difference between total scores from the two response formats was tested for statistical significance, and the factor structure of items in both response formats was…

  26. The "None of the Above" Option in Multiple-Choice Testing: An Experimental Study

    DiBattista, David; Sinnige-Egger, Jo-Anne; Fortuna, Glenda

    2014-01-01

    The authors assessed the effects of using "none of the above" as an option in a 40-item, general-knowledge multiple-choice test administered to undergraduate students. Examinees who selected "none of the above" were given an incentive to write the correct answer to the question posed. Using "none of the above" as the…

  27. Multiple Choice Testing and the Retrieval Hypothesis of the Testing Effect

    Sensenig, Amanda E.

    2010-01-01

    Taking a test often leads to enhanced later memory for the tested information, a phenomenon known as the "testing effect". This memory advantage has been reliably demonstrated with recall tests but not multiple choice tests. One potential explanation for this finding is that multiple choice tests do not rely on retrieval processes to the same…

  28. Use of flawed multiple-choice items by the New England Journal of Medicine for continuing medical education.

    Stagnaro-Green, Alex S; Downing, Steven M

    2006-09-01

    Physicians in the United States are required to complete a minimum number of continuing medical education (CME) credits annually. The goal of CME is to ensure that physicians maintain their knowledge and skills throughout their medical career. The New England Journal of Medicine (NEJM) provides its readers with the opportunity to obtain weekly CME credits. Deviation from established item-writing principles may result in a decrease in validity evidence for tests. This study evaluated the quality of 40 NEJM MCQs using the standard evidence-based principles of effective item writing. Each multiple-choice item reviewed had at least three item flaws, with a mean of 5.1 and a range of 3 to 7. The results of this study demonstrate that the NEJM uses flawed MCQs in its weekly CME program.

  29. Item Analysis of Multiple Choice Questions at the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain

    Deena Kheyami

    2018-04-01

    Objectives: The current study aimed to carry out a post-validation item analysis of multiple choice questions (MCQs) in medical examinations in order to evaluate correlations between item difficulty, item discrimination and distractor effectiveness, so as to determine whether questions should be included, modified or discarded. In addition, the optimal number of options per MCQ was analysed. Methods: This cross-sectional study was performed in the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain. A total of 800 MCQs and 4,000 distractors were analysed between November 2013 and June 2016. Results: The mean difficulty index ranged from 36.70–73.14%. The mean discrimination index ranged from 0.20–0.34. The mean distractor efficiency ranged from 66.50–90.00%. Of the items, 48.4%, 35.3%, 11.4%, 3.9% and 1.1% had zero, one, two, three and four nonfunctional distractors (NFDs), respectively. Using three or four rather than five options in each MCQ resulted in 95% or 83.6% of items having zero NFDs, respectively. The distractor efficiency was 91.87%, 85.83% and 64.13% for difficult, acceptable and easy items, respectively (P < 0.005). Distractor efficiency was 83.33%, 83.24% and 77.56% for items with excellent, acceptable and poor discrimination, respectively (P < 0.005). The average Kuder-Richardson formula 20 reliability coefficient was 0.76. Conclusion: A considerable number of the MCQ items were within acceptable ranges. However, some items needed to be discarded or revised. Using three or four rather than five options in MCQs is recommended to reduce the number of NFDs and improve the overall quality of the examination.
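
    The Kuder-Richardson formula 20 coefficient reported above can be reproduced from a 0/1 score matrix. A minimal sketch with made-up data, using KR-20 = k/(k-1) * (1 - sum(p_j * q_j) / var(total)) with the population variance of total scores:

```python
def kr20(score_matrix):
    """KR-20 reliability for a matrix of 0/1 scores
    (rows = examinees, columns = items)."""
    n = len(score_matrix)
    k = len(score_matrix[0])
    # Sum of item variances p * q, where p is the proportion correct.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n
        pq_sum += p * (1 - p)
    # Population variance of examinees' total scores.
    totals = [sum(row) for row in score_matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq_sum / var)

# Hypothetical 4-examinee, 3-item matrix:
X = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(X), 3))  # 0.75
```

    A coefficient of 0.76, as in the study, indicates acceptable internal consistency for a classroom-style examination.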

  30. Examining the Psychometric Quality of Multiple-Choice Assessment Items using Mokken Scale Analysis.

    Wind, Stefanie A

    The concept of invariant measurement is typically associated with Rasch measurement theory (Engelhard, 2013). Concerned with the appropriateness of the parametric transformation upon which the Rasch model is based, Mokken (1971) proposed a nonparametric procedure for evaluating the quality of social science measurement that is theoretically and empirically related to the Rasch model. Mokken's nonparametric procedure can be used to evaluate the quality of dichotomous and polytomous items in terms of the requirements for invariant measurement. Despite these potential benefits, the use of Mokken scaling to examine the properties of multiple-choice (MC) items in education has not yet been fully explored. A nonparametric approach to evaluating MC items is promising in that this approach facilitates the evaluation of assessments in terms of invariant measurement without imposing potentially inappropriate transformations. Using Rasch-based indices of measurement quality as a frame of reference, data from an eighth-grade physical science assessment are used to illustrate and explore Mokken-based techniques for evaluating the quality of MC items. Implications for research and practice are discussed.
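
    Loevinger's scalability coefficient, a building block of Mokken's nonparametric procedure, can be sketched for a single item pair (a minimal illustration with made-up 0/1 data, not the full Mokken analysis; H is 1 minus the ratio of observed Guttman errors to the errors expected under marginal independence):

```python
def pairwise_h(x, y):
    """Loevinger's H for two dichotomous items (lists of 0/1 scores).
    A Guttman error is passing the harder item while failing the easier one."""
    n = len(x)
    # Order so that `easy` is the item with the higher proportion correct.
    if sum(x) >= sum(y):
        easy, hard = x, y
    else:
        easy, hard = y, x
    observed = sum(1 for e, h in zip(easy, hard) if h == 1 and e == 0)
    p_easy_wrong = 1 - sum(easy) / n
    p_hard_right = sum(hard) / n
    expected = n * p_easy_wrong * p_hard_right
    return 1 - observed / expected

# Hypothetical responses from 20 examinees; exactly one Guttman error.
a = [1] * 15 + [0] * 5                       # easier item, 15/20 correct
b = [1] * 7 + [0] * 8 + [1] + [0] * 4        # harder item, 8/20 correct
print(round(pairwise_h(a, b), 3))  # 0.5
```

    In Mokken scaling, item-pair and item-level H values above conventional cut-offs (commonly 0.3) are taken as evidence that the items form a scale consistent with invariant ordering.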

  31. Analyzing Multiple-Choice Questions by Model Analysis and Item Response Curves

    Wattanakasiwich, P.; Ananta, S.

    2010-07-01

    In physics education research, the main goal is to improve physics teaching so that most students understand physics conceptually and are able to apply concepts in solving problems. Many multiple-choice instruments have therefore been developed to probe students' conceptual understanding of various topics. Two techniques, model analysis and item response curves, were used to analyze students' responses to the Force and Motion Conceptual Evaluation (FMCE). For this study, FMCE data from more than 1000 students at Chiang Mai University were collected over the past three years. With model analysis, we can obtain students' alternative knowledge and the probabilities that students use such knowledge in a range of equivalent contexts. Model analysis consists of two algorithms: the concentration factor and model estimation. This paper only presents results from using the model estimation algorithm to obtain a model plot. The plot helps to identify whether a class model state is in the misconception region. An item response curve (IRC), derived from item response theory, plots the percentage of students selecting a particular choice against their total score. The pros and cons of both techniques are compared and discussed.
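
    The item response curve described here is simply the proportion of test-takers choosing each option, binned by total score. A sketch with hypothetical response data (the data layout, one chosen option letter per student, is an assumption for illustration):

```python
from collections import defaultdict

def item_response_curve(choices, totals, options="ABCDE"):
    """For one item, return {option: {total_score: fraction choosing it}}.
    `choices` holds each student's selected option; `totals` their test scores."""
    counts = defaultdict(lambda: defaultdict(int))
    n_at_score = defaultdict(int)
    for choice, total in zip(choices, totals):
        counts[choice][total] += 1
        n_at_score[total] += 1
    return {
        opt: {s: counts[opt][s] / n for s, n in sorted(n_at_score.items())}
        for opt in options
    }

# Hypothetical item: high scorers pick the key "B", low scorers the lure "A".
choices = ["A", "A", "B", "A", "B", "B"]
totals  = [ 10,  10,  10,  20,  20,  20]
curve = item_response_curve(choices, totals, options="AB")
print(curve["B"])  # fraction choosing the key rises with total score
```

    Plotting each option's curve against total score reveals which distractors attract weaker students, which is the diagnostic use the abstract describes.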

  32. Quantitative Analysis of Complex Multiple-Choice Items in Science Technology and Society: Item Scaling

    Ángel Vázquez Alonso

    2005-05-01

    The scarce attention to assessment and evaluation in science education research has been especially harmful for Science-Technology-Society (STS) education, due to the dialectic, tentative, value-laden, and controversial nature of most STS topics. To overcome the methodological pitfalls of the STS assessment instruments used in the past, an empirically developed instrument (VOSTS, Views on Science-Technology-Society) has been suggested. Some methodological proposals, namely multiple response models and the computing of a global attitudinal index, were suggested to improve item implementation. The final step of these methodological proposals requires the categorization of STS statements. This paper describes the process of categorization through a scaling procedure ruled by a panel of experts, acting as judges, according to the body of knowledge from the history, epistemology, and sociology of science. The statement categorization allows for a sound foundation of STS items, which is useful in educational assessment and science education research, and may also increase teachers' self-confidence in the development of the STS curriculum for science classrooms.

  33. Feedback-related brain activity predicts learning from feedback in multiple-choice testing.

    Ernst, Benjamin; Steinhauser, Marco

    2012-06-01

    Different event-related potentials (ERPs) have been shown to correlate with learning from feedback in decision-making tasks and with learning in explicit memory tasks. In the present study, we investigated which ERPs predict learning from corrective feedback in a multiple-choice test, which combines elements from both paradigms. Participants worked through sets of multiple-choice items of a Swahili-German vocabulary task. Whereas the initial presentation of an item required the participants to guess the answer, corrective feedback could be used to learn the correct response. Initial analyses revealed that corrective feedback elicited components related to reinforcement learning (FRN), as well as to explicit memory processing (P300) and attention (early frontal positivity). However, only the P300 and early frontal positivity were positively correlated with successful learning from corrective feedback, whereas the FRN was even larger when learning failed. These results suggest that learning from corrective feedback crucially relies on explicit memory processing and attentional orienting to corrective feedback, rather than on reinforcement learning.

  34. Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing.

    Butler, Andrew C; Roediger, Henry L

    2008-04-01

    Multiple-choice tests are used frequently in higher education without much consideration of the impact this form of assessment has on learning. Multiple-choice testing enhances retention of the material tested (the testing effect); however, unlike other tests, multiple-choice can also be detrimental because it exposes students to misinformation in the form of lures. The selection of lures can lead students to acquire false knowledge (Roediger & Marsh, 2005). The present research investigated whether feedback could be used to boost the positive effects and reduce the negative effects of multiple-choice testing. Subjects studied passages and then received a multiple-choice test with immediate feedback, delayed feedback, or no feedback. In comparison with the no-feedback condition, both immediate and delayed feedback increased the proportion of correct responses and reduced the proportion of intrusions (i.e., lure responses from the initial multiple-choice test) on a delayed cued recall test. Educators should provide feedback when using multiple-choice tests.

  15. Test of understanding of vectors: A reliable multiple-choice vector concept test

    Barniol, Pablo; Zavala, Genaro

    2014-06-01

    In this article we discuss the findings of our research on students' understanding of vector concepts in problems without physical context. First, we develop a complete taxonomy of the most frequent errors made by university students when learning vector concepts. This study is based on the results of several test administrations of open-ended problems in which a total of 2067 students participated. Using this taxonomy, we then designed a 20-item multiple-choice test [Test of understanding of vectors (TUV)] and administered it in English to 423 students who were completing the required sequence of introductory physics courses at a large private Mexican university. We evaluated the test's content validity, reliability, and discriminatory power. The results indicate that the TUV is a reliable assessment tool. We also conducted a detailed analysis of the students' understanding of the vector concepts evaluated in the test. The TUV is included in the Supplemental Material as a resource for other researchers studying vector learning, as well as instructors teaching the material.

  16. Will a Short Training Session Improve Multiple-Choice Item-Writing Quality by Dental School Faculty? A Pilot Study.

    Dellinges, Mark A; Curtis, Donald A

    2017-08-01

    Faculty members are expected to write high-quality multiple-choice questions (MCQs) in order to accurately assess dental students' achievement. However, most dental school faculty members are not trained to write MCQs. Extensive faculty development programs have been used to help educators write better test items. The aim of this pilot study was to determine if a short workshop would result in improved MCQ item-writing by dental school faculty at one U.S. dental school. A total of 24 dental school faculty members who had previously written MCQs were randomized into a no-intervention group and an intervention group in 2015. Six previously written MCQs were randomly selected from each of the faculty members and given an item quality score. The intervention group participated in a training session of one-hour duration that focused on reviewing standard item-writing guidelines to improve in-house MCQs. The no-intervention group did not receive any training but did receive encouragement and an explanation of why good MCQ writing was important. The faculty members were then asked to revise their previously written questions, and these were given an item quality score. The item quality scores for each faculty member were averaged, and the difference from pre-training to post-training scores was evaluated. The results showed a significant difference between pre-training and post-training MCQ difference scores for the intervention group (p=0.04). This pilot study provides evidence that the training session of short duration was effective in improving the quality of in-house MCQs.

  17. Force Concept Inventory-based multiple-choice test for investigating students’ representational consistency

    Pasi Nieminen

    2010-08-01

    This study investigates students’ ability to interpret multiple representations consistently (i.e., representational consistency) in the context of the force concept. For this purpose we developed the Representational Variant of the Force Concept Inventory (R-FCI), which makes use of nine items from the 1995 version of the Force Concept Inventory (FCI). These original FCI items were redesigned using various representations (such as motion map, vectorial, and graphical), yielding 27 multiple-choice items concerning four central concepts underpinning the force concept: Newton’s first, second, and third laws, and gravitation. We provide some evidence for the validity and reliability of the R-FCI; this analysis is limited to the student population of one Finnish high school. The students took the R-FCI at the beginning and at the end of their first high school physics course. We found that students’ (n=168) representational consistency (whether scientifically correct or not) varied considerably depending on the concept. On average, representational consistency and scientifically correct understanding increased during the instruction, although in the post-test only a few students performed consistently both in terms of representations and scientifically correct understanding. We also compared students’ (n=87) results on the R-FCI and the FCI, and found that they correlated quite well.

  18. Decision making under internal uncertainty: the case of multiple-choice tests with different scoring rules.

    Bereby-Meyer, Yoella; Meyer, Joachim; Budescu, David V

    2003-02-01

    This paper assesses framing effects on decision making with internal uncertainty, i.e., partial knowledge, by focusing on examinees' behavior in multiple-choice (MC) tests with different scoring rules. In two experiments participants answered a general-knowledge MC test that consisted of 34 solvable and 6 unsolvable items. Experiment 1 studied two scoring rules involving Positive (only gains) and Negative (only losses) scores. Although answering all items was the dominating strategy for both rules, the results revealed a greater tendency to answer under the Negative scoring rule. These results are in line with the predictions derived from Prospect Theory (PT) [Econometrica 47 (1979) 263]. The second experiment studied two scoring rules, which allowed respondents to exhibit partial knowledge. Under the Inclusion-scoring rule the respondents mark all answers that could be correct, and under the Exclusion-scoring rule they exclude all answers that might be incorrect. As predicted by PT, respondents took more risks under the Inclusion rule than under the Exclusion rule. The results illustrate that the basic process that underlies choice behavior under internal uncertainty and especially the effect of framing is similar to the process of choice under external uncertainty and can be described quite accurately by PT. Copyright 2002 Elsevier Science B.V.

  19. Exploring problem solving strategies on multiple-choice science items: Comparing native Spanish-speaking English Language Learners and mainstream monolinguals

    Kachchaf, Rachel Rae

    The purpose of this study was to compare how English language learners (ELLs) and monolingual English speakers solved multiple-choice items administered with and without a new form of testing accommodation---vignette illustration (VI). By incorporating theories from second language acquisition, bilingualism, and sociolinguistics, this study was able to gain more accurate and comprehensive input into the ways students interacted with items. This mixed methods study used verbal protocols to elicit the thinking processes of 36 native Spanish-speaking ELLs and 36 native English-speaking non-ELLs when solving multiple-choice science items. Results from both qualitative and quantitative analyses show that ELLs used a wider variety of actions oriented to making sense of the items than non-ELLs. In contrast, non-ELLs used more problem solving strategies than ELLs. There were no statistically significant differences in student performance based on the interaction of presence of illustration and linguistic status or the main effect of presence of illustration. However, there were significant differences based on the main effect of linguistic status. An interaction between the characteristics of the students, the items, and the illustrations indicates considerable heterogeneity in the ways in which students from both linguistic groups think about and respond to science test items. The results of this study speak to the need for more research involving ELLs in the process of test development to create test items that do not require ELLs to carry out significantly more actions to make sense of the item than monolingual students.

  20. Constructing a multiple choice test to measure elementary school teachers' Pedagogical Content Knowledge of technology education.

    Rohaan, E.J.; Taconis, R.; Jochems, W.M.G.

    2009-01-01

    This paper describes the construction and validation of a multiple choice test to measure elementary school teachers' Pedagogical Content Knowledge of technology education. Pedagogical Content Knowledge is generally accepted to be a crucial domain of teacher knowledge and is, therefore, an important

  1. Item Analysis in Introductory Economics Testing.

    Tinari, Frank D.

    1979-01-01

    Computerized analysis of multiple choice test items is explained. Examples of item analysis applications in the introductory economics course are discussed with respect to three objectives: to evaluate learning; to improve test items; and to help improve classroom instruction. Problems, costs and benefits of the procedures are identified. (JMD)

  2. Comparing Item Performance on Three- Versus Four-Option Multiple Choice Questions in a Veterinary Toxicology Course.

    Royal, Kenneth; Dorman, David

    2018-06-09

    The number of answer options is an important element of multiple-choice questions (MCQs). Many MCQs contain four or more options despite the limited literature suggesting that there is little to no benefit beyond three options. The purpose of this study was to evaluate item performance on 3-option versus 4-option MCQs used in a core curriculum course in veterinary toxicology at a large veterinary medical school in the United States. A quasi-experimental, crossover design was used in which students in each class were randomly assigned to take one of two versions (A or B) of two major exams. Both the 3-option and 4-option MCQs resulted in similar psychometric properties. The findings of our study support earlier research in other medical disciplines and settings that likewise concluded there was no significant change in the psychometric properties of 3-option MCQs when compared with traditional MCQs with four or more options.

  3. Comparison between three option, four option and five option multiple choice question tests for quality parameters: A randomized study.

    Vegada, Bhavisha; Shukla, Apexa; Khilnani, Ajeetkumar; Charan, Jaykaran; Desai, Chetna

    2016-01-01

    Most academic teachers use four or five options per item of multiple choice question (MCQ) tests for formative and summative assessment. The optimal number of options per MCQ item is a matter of considerable debate among academic teachers in various educational fields, and there is a scarcity of published literature on the optimum number of options in the field of medical education. The aim was to compare three-option, four-option, and five-option MCQ tests on the quality parameters of reliability, validity, item analysis, distracter analysis, and time analysis. Participants were 3rd-semester M.B.B.S. students, divided randomly into three groups. Each group was randomly given one test set with three, four, or five options per item. Following the marking of the multiple choice tests, the participants' option selections were analyzed, and the three option formats were compared on mean marks, mean time, validity, reliability, facility value, discrimination index, point-biserial value, and distracter analysis. Students scored more (P = 0.000) and took less time (P = 0.009) to complete the three-option test than the four-option and five-option tests. Facility value was higher (P = 0.004) in the three-option group than in the four- and five-option groups. There was no significant difference between the three groups in validity, reliability, or item discrimination. Nonfunctioning distracters were more common in the four- and five-option groups than in the three-option group. Assessment based on three-option MCQs can thus be preferred over four-option and five-option MCQs.

  4. Do Students Behave Rationally in Multiple Choice Tests? Evidence from a Field Experiment

    María Paz Espinosa; Javier Gardeazabal

    2013-01-01

    A disadvantage of multiple choice tests is that students have incentives to guess. To discourage guessing, it is common to use scoring rules that either penalize wrong answers or reward omissions. In psychometrics, penalty and reward scoring rules are considered equivalent. However, experimental evidence indicates that students behave differently under penalty or reward scoring rules. These differences have been attributed to the different framing (penalty versus reward). In this paper, we mo...

  5. The Relationship of Deep and Surface Study Approaches on Factual and Applied Test-Bank Multiple-Choice Question Performance

    Yonker, Julie E.

    2011-01-01

    With the advent of online test banks and large introductory classes, instructors have often turned to textbook publisher-generated multiple-choice question (MCQ) exams in their courses. Multiple-choice questions are often divided into categories of factual or applied, thereby implicating levels of cognitive processing. This investigation examined…

  6. Analysis of Multiple Choice Tests Designed by Faculty Members of Kermanshah University of Medical Sciences

    Reza Pourmirza Kalhori

    2013-12-01

    Dear Editor, Multiple choice tests are the most common objective tests in medical education, used to assess individual knowledge, recall, recognition, and problem solving abilities. One component of testing is the post-test analysis. This component includes, first, qualitative analysis of the taxonomy of questions based on Bloom’s educational objectives and the percentage of questions with no structural problems; and second, quantitative analysis of reliability (KR-20) and of the difficulty and differentiation indices (1). This descriptive-analytical study aimed to qualitatively and quantitatively investigate the multiple-choice tests of the faculty members at Kermanshah University of Medical Sciences in 2009-2010. The sample size comprised 156 tests. Data were analyzed by SPSS-16 software using the t-test, chi-squared test, ANOVA, and Tukey multiple comparison tests. The means of reliability (KR-20), difficulty index, and discrimination index were 0.68 (±0.31), 0.56 (±0.15), and 0.21 (±0.15), respectively, which were acceptable. An analysis of the tests at Mashad University of Medical Sciences indicated that the mean reliability of the tests was 0.72, that 52.2% of the tests had an inappropriate difficulty index, and that 49.2% of the tests did not have an acceptable differentiation index (2). Comparison of the tests at Kermanshah University of Medical Sciences with tests for the anatomy, physiology, biochemistry, genetics, statistics, and behavioral sciences courses at a Malaysian Faculty of Medicine (3) and with tests at an Argentinian Faculty of Medicine (4) showed that while the difficulty index was acceptable in all three universities, the differentiation indices in the Malaysian and Argentinian medical faculties were higher than that at Kermanshah University of Medical Sciences. The means for questions with no structural flaws in all tests, taxonomy I, taxonomy II, and taxonomy III were 73.88% (±14.88), 34.65% (±15.78), 41.34% (± 13
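    The post-test statistics this letter reports can all be computed from a matrix of dichotomous (0/1) item scores. The following is a minimal illustrative sketch, not taken from the letter and using hypothetical toy data, of KR-20 reliability, the difficulty (facility) index, and an upper-lower differentiation (discrimination) index:

    ```python
    # Illustrative sketch: post-test item analysis for dichotomously
    # scored (0/1) multiple-choice data. All data below are hypothetical.

    def item_difficulty(scores, item):
        """Proportion of examinees answering the item correctly (facility)."""
        return sum(row[item] for row in scores) / len(scores)

    def kr20(scores):
        """Kuder-Richardson Formula 20 reliability for 0/1 items."""
        k = len(scores[0])                     # number of items
        n = len(scores)                        # number of examinees
        p = [item_difficulty(scores, i) for i in range(k)]
        pq = sum(pi * (1 - pi) for pi in p)    # sum of item variances
        totals = [sum(row) for row in scores]
        mean = sum(totals) / n
        var = sum((t - mean) ** 2 for t in totals) / n  # total-score variance
        return (k / (k - 1)) * (1 - pq / var)

    def differentiation(scores, item, frac=0.27):
        """Upper-lower index: facility in the top group minus the bottom group."""
        ranked = sorted(scores, key=sum, reverse=True)
        g = max(1, int(len(scores) * frac))
        upper = sum(row[item] for row in ranked[:g]) / g
        lower = sum(row[item] for row in ranked[-g:]) / g
        return upper - lower

    scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
    print(round(kr20(scores), 3))  # ≈ 0.8 for this toy matrix
    ```

    A KR-20 near 0.7 or above, a difficulty index near the middle of its range, and a differentiation index above roughly 0.2 are the conventional benchmarks against which the values in the letter are judged acceptable.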

  7. ACER Chemistry Test Item Collection. ACER Chemtic Year 12.

    Australian Council for Educational Research, Hawthorn.

    The chemistry test item bank contains 225 multiple-choice questions suitable for diagnostic and achievement testing; a three-page teacher's guide; an answer key with item facilities; an answer sheet; and a 45-item sample achievement test. Although written for the new grade 12 chemistry course in Victoria, Australia, the items are widely applicable.…

  8. Australian Chemistry Test Item Bank: Years 11 & 12. Volume 1.

    Commons, C., Ed.; Martin, P., Ed.

    Volume 1 of the Australian Chemistry Test Item Bank, consisting of two volumes, contains nearly 2000 multiple-choice items related to the chemistry taught in Year 11 and Year 12 courses in Australia. Items which were written during 1979 and 1980 were initially published in the "ACER Chemistry Test Item Collection" and in the "ACER…

  9. Identification of Misconceptions through Multiple Choice Tasks at Municipal Chemistry Competition Test

    Dušica D Milenković

    2016-01-01

    In this paper, the level of conceptual understanding of chemical contents among seventh-grade students who participated in the municipal Chemistry competition in Novi Sad, Serbia, in 2013 has been examined. Tests from the municipal chemistry competition were used as the measuring instrument, with only the multiple choice tasks considered and analyzed. The level of conceptual understanding of the tested chemical contents was determined from the frequency with which correct answers were chosen, identifying areas of satisfactory conceptual understanding, areas of roughly adequate performance, areas of inadequate performance, and areas of quite inadequate performance. The analysis of misconceptions, on the other hand, was based on the analysis of distractors. The results showed that a satisfactory level of conceptual understanding or roughly adequate performance characterized the majority of contents, which was expected since only the best students, who took part in the contest, were surveyed. However, the analysis also identified a large number of misconceptions. In most cases, these misconceptions were related to an inability to distinguish elements, compounds, and homogeneous and heterogeneous mixtures. In addition, students were shown to be unfamiliar with the crystal structure of diamond and with metric prefixes. The obtained results indicate insufficient visualization of the submicroscopic level in school textbooks, imprecise use of chemical language by teachers, and imprecise use of language in chemistry textbooks.

  10. Effect of response format on cognitive reflection: Validating a two- and four-option multiple choice question version of the Cognitive Reflection Test.

    Sirota, Miroslav; Juanchich, Marie

    2018-03-27

    The Cognitive Reflection Test, measuring intuition inhibition and cognitive reflection, has become extremely popular because it reliably predicts reasoning performance, decision-making, and beliefs. Across studies, the response format of CRT items sometimes differs, based on the assumed construct equivalence of tests with open-ended versus multiple-choice items (the equivalence hypothesis). Evidence and theoretical reasons, however, suggest that the cognitive processes measured by these response formats and their associated performances might differ (the nonequivalence hypothesis). We tested the two hypotheses experimentally by assessing the performance in tests with different response formats and by comparing their predictive and construct validity. In a between-subjects experiment (n = 452), participants answered stem-equivalent CRT items in an open-ended, a two-option, or a four-option response format and then completed tasks on belief bias, denominator neglect, and paranormal beliefs (benchmark indicators of predictive validity), as well as on actively open-minded thinking and numeracy (benchmark indicators of construct validity). We found no significant differences between the three response formats in the numbers of correct responses, the numbers of intuitive responses (with the exception of the two-option version, which had a higher number than the other tests), and the correlational patterns of the indicators of predictive and construct validity. All three test versions were similarly reliable, but the multiple-choice formats were completed more quickly. We speculate that the specific nature of the CRT items helps build construct equivalence among the different response formats. We recommend using the validated multiple-choice version of the CRT presented here, particularly the four-option CRT, for practical and methodological reasons. Supplementary materials and data are available at https://osf.io/mzhyc/.

  11. Effectiveness of Guided Multiple Choice Objective Questions Test on Students' Academic Achievement in Senior School Mathematics by School Location

    Igbojinwaekwu, Patrick Chukwuemeka

    2015-01-01

    This study investigated, using pretest-posttest quasi-experimental research design, the effectiveness of guided multiple choice objective questions test on students' academic achievement in Senior School Mathematics, by school location, in Delta State Capital Territory, Nigeria. The sample comprised 640 Students from four coeducation secondary…

  12. Electronics. Criterion-Referenced Test (CRT) Item Bank.

    Davis, Diane, Ed.

    This document contains 519 criterion-referenced multiple choice and true or false test items for a course in electronics. The test item bank is designed to work with both the Vocational Instructional Management System (VIMS) and the Vocational Administrative Management System (VAMS) in Missouri. The items are grouped into 15 units covering the…

  13. Assessing difference between classical test theory and item ...

    Assessing difference between classical test theory and item response theory methods in scoring primary four multiple choice objective test items. ... All research participants were ranked on the CTT number correct scores and the corresponding IRT item pattern scores from their performance on the PRISMADAT. Wilcoxon ...
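    The contrast this record draws can be illustrated with a small sketch (hypothetical items and responses, not the PRISMADAT data): under classical test theory the score is simply the number correct, whereas an IRT pattern score, here a 2PL maximum-likelihood ability estimate found by grid search, also weighs which items were answered correctly.

    ```python
    # Hypothetical sketch of CTT number-correct vs. IRT pattern scoring.
    # Under a 2PL model, two examinees with the same number correct can
    # receive different ability estimates because item discriminations
    # weight the response pattern.
    import math

    def p_correct(theta, a, b):
        """2PL probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def irt_pattern_score(responses, a, b):
        """Maximum-likelihood theta estimate via a simple grid search."""
        def loglik(theta):
            return sum(
                math.log(p_correct(theta, ai, bi)) if r == 1
                else math.log(1 - p_correct(theta, ai, bi))
                for r, ai, bi in zip(responses, a, b)
            )
        grid = [g / 100.0 for g in range(-400, 401)]  # theta in [-4, 4]
        return max(grid, key=loglik)

    a = [0.5, 1.0, 2.0]      # item discriminations (hypothetical)
    b = [-1.0, 0.0, 1.0]     # item difficulties (hypothetical)
    x = [1, 1, 0]            # missed the most discriminating item
    y = [0, 1, 1]            # missed the least discriminating item
    assert sum(x) == sum(y)  # identical CTT number-correct scores
    print(irt_pattern_score(x, a, b), irt_pattern_score(y, a, b))  # differ
    ```

    Under the Rasch model (all discriminations equal), the number correct is a sufficient statistic for ability, so CTT and IRT rankings can diverge only when, as in the 2PL sketch above, items differ in discrimination.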

  14. ACER Chemistry Test Item Collection (ACER CHEMTIC Year 12 Supplement).

    Australian Council for Educational Research, Hawthorn.

    This publication contains 317 multiple-choice chemistry test items related to topics covered in the Victorian (Australia) Year 12 chemistry course. It allows teachers access to a range of items suitable for diagnostic and achievement purposes, supplementing the ACER Chemistry Test Item Collection--Year 12 (CHEMTIC). The topics covered are: organic…

  15. Examining the Prediction of Reading Comprehension on Different Multiple-Choice Tests

    Andreassen, Rune; Braten, Ivar

    2010-01-01

    In this study, 180 Norwegian fifth-grade students with a mean age of 10.5 years were administered measures of word recognition skills, strategic text processing, reading motivation and working memory. Six months later, the same students were given three different multiple-choice reading comprehension measures. Based on three forced-order…

  16. Retrieval practice with short-answer, multiple-choice, and hybrid tests.

    Smith, Megan A; Karpicke, Jeffrey D

    2014-01-01

    Retrieval practice improves meaningful learning, and the most frequent way of implementing retrieval practice in classrooms is to have students answer questions. In four experiments (N=372) we investigated the effects of different question formats on learning. Students read educational texts and practised retrieval by answering short-answer, multiple-choice, or hybrid questions. In hybrid conditions students first attempted to recall answers in short-answer format, then identified answers in multiple-choice format. We measured learning 1 week later using a final assessment with two types of questions: those that could be answered by recalling information verbatim from the texts and those that required inferences. Practising retrieval in all format conditions enhanced retention, relative to a study-only control condition, on both verbatim and inference questions. However, there were little or no advantages of answering short-answer or hybrid format questions over multiple-choice questions in three experiments. In Experiment 4, when retrieval success was improved under initial short-answer conditions, there was an advantage of answering short-answer or hybrid questions over multiple-choice questions. The results challenge the simple conclusion that short-answer questions always produce the best learning, due to increased retrieval effort or difficulty, and demonstrate the importance of retrieval success for retrieval-based learning activities.

  17. Validation of a Standardized Multiple-Choice Multicultural Competence Test: Implications for Training, Assessment, and Practice

    Gillem, Angela R.; Bartoli, Eleonora; Bertsch, Kristin N.; McCarthy, Maureen A.; Constant, Kerra; Marrero-Meisky, Sheila; Robbins, Steven J.; Bellamy, Scarlett

    2016-01-01

    The Multicultural Counseling and Psychotherapy Test (MCPT), a measure of multicultural counseling competence (MCC), was validated in 2 phases. In Phase 1, the authors administered 451 test items derived from multicultural guidelines in counseling and psychology to 32 multicultural experts and 30 nonexperts. In Phase 2, the authors administered the…

  18. Student certainty answering misconception question: study of Three-Tier Multiple-Choice Diagnostic Test in Acid-Base and Solubility Equilibrium

    Ardiansah; Masykuri, M.; Rahardjo, S. B.

    2018-04-01

    Students’ concept comprehension in a three-tier multiple-choice diagnostic test is related to their confidence level, which in turn reflects certainty and self-efficacy. The purpose of this research was to find out how certain students were when answering a misconception test. This quantitative-qualitative study counted students’ confidence levels. The participants were 484 students studying the acid-base and solubility equilibrium topics. Data were collected using a thirty-question three-tier multiple-choice (3TMC) instrument and a student questionnaire. The findings showed that item #6, on calculating the pH of an ultra-dilute solution, gave the highest misconception percentage together with high student confidence. Further findings were that (1) students’ tendency to choose the misconception answer increased with item number, (2) students’ certainty decreased over the course of the 3TMC, and (3) students’ self-efficacy and achievement were related. The findings suggest some implications and limitations for further research.

  19. Test of Understanding of Vectors: A Reliable Multiple-Choice Vector Concept Test

    Barniol, Pablo; Zavala, Genaro

    2014-01-01

    In this article we discuss the findings of our research on students' understanding of vector concepts in problems without physical context. First, we develop a complete taxonomy of the most frequent errors made by university students when learning vector concepts. This study is based on the results of several test administrations of open-ended…

  20. Examining Gender DIF on a Multiple-Choice Test of Mathematics: A Confirmatory Approach.

    Ryan, Katherine E.; Fan, Meichu

    1996-01-01

    Results for 3,244 female and 3,033 male junior high school students from the Second International Mathematics Study show that applied items in algebra, geometry, and computation were easier for males but arithmetic items were differentially easier for females. Implications of these findings for assessment and instruction are discussed. (SLD)

  1. Measuring the Consistency in Change in Hepatitis B Knowledge among Three Different Types of Tests: True/False, Multiple Choice, and Fill in the Blanks Tests.

    Sahai, Vic; Demeyere, Petra; Poirier, Sheila; Piro, Felice

    1998-01-01

    The recall of information about Hepatitis B demonstrated by 180 seventh graders was tested with three test types: (1) short-answer; (2) true/false; and (3) multiple-choice. Short answer testing was the most reliable. Suggestions are made for the use of short-answer tests in evaluating student knowledge. (SLD)

  2. Post-Graduate Student Performance in "Supervised In-Class" vs. "Unsupervised Online" Multiple Choice Tests: Implications for Cheating and Test Security

    Ladyshewsky, Richard K.

    2015-01-01

    This research explores differences in multiple choice test (MCT) scores in a cohort of post-graduate students enrolled in a management and leadership course. A total of 250 students completed the MCT in either a supervised in-class paper and pencil test or an unsupervised online test. The only statistically significant difference between the nine…

  3. Force Concept Inventory-Based Multiple-Choice Test for Investigating Students' Representational Consistency

    Nieminen, Pasi; Savinainen, Antti; Viiri, Jouni

    2010-01-01

    This study investigates students' ability to interpret multiple representations consistently (i.e., representational consistency) in the context of the force concept. For this purpose we developed the Representational Variant of the Force Concept Inventory (R-FCI), which makes use of nine items from the 1995 version of the Force Concept Inventory…

  4. The Prevalence of Multiple-Choice Testing in Registered Nurse Licensure-Qualifying Nursing Education Programs in New York State.

    Birkhead, Susan; Kelman, Glenda; Zittel, Barbara; Jatulis, Linnea

    The aim of this study was to describe nurse educators' use of multiple-choice questions (MCQs) in testing in registered nurse licensure-qualifying nursing education programs in New York State. This study was a descriptive correlational analysis of data obtained from surveying 1,559 nurse educators; 297 educators from 61 institutions responded (response rate [RR] = 19 percent), yielding a final cohort of 200. MCQs were reported to comprise a mean of 81 percent of questions on a typical test. Baccalaureate program respondents were equally likely to use MCQs as associate degree program respondents (p > .05) but were more likely to report using other methods of assessing student achievement to construct course grades (p < .01). Both groups reported little use of alternate format-type questions. Respondent educators reported substantial reliance upon the use of MCQs, corroborating the limited data quantifying the prevalence of use of MCQ tests in licensure-qualifying nursing education programs.

  5. "None of the above" as a correct and incorrect alternative on a multiple-choice test: implications for the testing effect.

    Odegard, Timothy N; Koen, Joshua D

    2007-11-01

    Both positive and negative testing effects have been demonstrated with a variety of materials and paradigms (Roediger & Karpicke, 2006b). The present series of experiments replicate and extend the research of Roediger and Marsh (2005) with the addition of a "none-of-the-above" response option. Participants (n=32 in both experiments) read a set of passages, took an initial multiple-choice test, completed a filler task, and then completed a final cued-recall test (Experiment 1) or multiple-choice test (Experiment 2). Questions were manipulated on the initial multiple-choice test by adding a "none-of-the-above" response alternative (choice "E") that was incorrect ("E" Incorrect) or correct ("E" Correct). The results from both experiments demonstrated that the positive testing effect was negated when the "none-of-the-above" alternative was the correct response on the initial multiple-choice test, but was still present when the "none-of-the-above" alternative was an incorrect response.

  6. Mixed-Format Test Score Equating: Effect of Item-Type Multidimensionality, Length and Composition of Common-Item Set, and Group Ability Difference

    Wang, Wei

    2013-01-01

    Mixed-format tests containing both multiple-choice (MC) items and constructed-response (CR) items are now widely used in many testing programs. Mixed-format tests often are considered to be superior to tests containing only MC items although the use of multiple item formats leads to measurement challenges in the context of equating conducted under…

  7. Australian Chemistry Test Item Bank: Years 11 and 12. Volume 2.

    Commons, C., Ed.; Martin, P., Ed.

    The second volume of the Australian Chemistry Test Item Bank, consisting of two volumes, contains nearly 2000 multiple-choice items related to the chemistry taught in Year 11 and Year 12 courses in Australia. Items which were written during 1979 and 1980 were initially published in the "ACER Chemistry Test Item Collection" and in the…

  8. Memory-Context Effects of Screen Color in Multiple-Choice and Fill-In Tests

    Prestera, Gustavo E.; Clariana, Roy; Peck, Andrew

    2005-01-01

In this experimental study, 44 undergraduates completed five computer-based instructional lessons and either two multiple-choice tests or two fill-in-the-blank tests. Color-coded borders were displayed during the lesson, adjacent to the screen text and illustrations. In the experimental condition, corresponding border colors were shown at posttest.…

  9. Approaches to data analysis of multiple-choice questions

    Lin Ding; Robert Beichner

    2009-01-01

    This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics education research. We minimize mathematics, instead placing emphasis on data interpretation using these approaches.
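Of the five approaches listed, classical test theory is the simplest to sketch in code. The following is a minimal, illustrative computation of two classical item statistics, the difficulty index (proportion correct) and the point-biserial discrimination; the toy response matrix and function names are invented for illustration:

```python
# Illustrative classical test theory item statistics on a toy data set.
# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

def item_stats(matrix):
    """Return a (difficulty, point-biserial) pair for each item."""
    n = len(matrix)
    totals = [sum(row) for row in matrix]          # total score per examinee
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5
    stats = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        p = sum(col) / n                           # difficulty index
        hi = [t for c, t in zip(col, totals) if c == 1]
        lo = [t for c, t in zip(col, totals) if c == 0]
        if not hi or not lo or sd_t == 0:
            r_pb = 0.0                             # undefined; report 0
        else:
            r_pb = ((sum(hi) / len(hi) - sum(lo) / len(lo)) / sd_t) \
                   * (p * (1 - p)) ** 0.5          # point-biserial correlation
        stats.append((p, r_pb))
    return stats

print([round(p, 2) for p, _ in item_stats(responses)])  # item difficulties
```

With five examinees these numbers are only illustrative; real item analyses require far larger samples.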

  10. Approaches to Data Analysis of Multiple-Choice Questions

    Ding, Lin; Beichner, Robert

    2009-01-01

    This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics…

  11. Advanced Marketing Core Curriculum. Test Items and Assessment Techniques.

    Smith, Clifton L.; And Others

    This document contains duties and tasks, multiple-choice test items, and other assessment techniques for Missouri's advanced marketing core curriculum. The core curriculum begins with a list of 13 suggested textbook resources. Next, nine duties with their associated tasks are given. Under each task appears one or more citations to appropriate…

  12. Mechanical Waves Conceptual Survey: Its Modification and Conversion to a Standard Multiple-Choice Test

    Barniol, Pablo; Zavala, Genaro

    2016-01-01

    In this article we present several modifications of the mechanical waves conceptual survey, the most important test to date that has been designed to evaluate university students' understanding of four main topics in mechanical waves: propagation, superposition, reflection, and standing waves. The most significant changes are (i) modification of…

  13. Performance of Men and Women on Multiple-Choice and Constructed-Response Tests for Beginning Teachers. Research Report. ETS RR-04-48

    Livingston, Samuel A.; Rupp, Stacie L.

    2004-01-01

    Some previous research results imply that women tend to perform better, relative to men, on constructed-response (CR) tests than on multiple-choice (MC) tests in the same subjects. An analysis of data from several tests used in the licensing of beginning teachers supported this hypothesis, to varying degrees, in most of the tests investigated. The…

  14. Multiple-Choice Testing Using Immediate Feedback--Assessment Technique (IF AT®) Forms: Second-Chance Guessing vs. Second-Chance Learning?

    Merrel, Jeremy D.; Cirillo, Pier F.; Schwartz, Pauline M.; Webb, Jeffrey A.

    2015-01-01

    Multiple choice testing is a common but often ineffective method for evaluating learning. A newer approach, however, using Immediate Feedback Assessment Technique (IF AT®, Epstein Educational Enterprise, Inc.) forms, offers several advantages. In particular, a student learns immediately if his or her answer is correct and, in the case of an…

  15. Evolution of a Test Item

    Spaan, Mary

    2007-01-01

    This article follows the development of test items (see "Language Assessment Quarterly", Volume 3 Issue 1, pp. 71-79 for the article "Test and Item Specifications Development"), beginning with a review of test and item specifications, then proceeding to writing and editing of items, pretesting and analysis, and finally selection of an item for a…

  16. Making the Most of Multiple Choice

    Brookhart, Susan M.

    2015-01-01

    Multiple-choice questions draw criticism because many people perceive they test only recall or atomistic, surface-level objectives and do not require students to think. Although this can be the case, it does not have to be that way. Susan M. Brookhart suggests that multiple-choice questions are a useful part of any teacher's questioning repertoire…

  17. An Explanatory Item Response Theory Approach for a Computer-Based Case Simulation Test

    Kahraman, Nilüfer

    2014-01-01

    Problem: Practitioners working with multiple-choice tests have long utilized Item Response Theory (IRT) models to evaluate the performance of test items for quality assurance. The use of similar applications for performance tests, however, is often encumbered due to the challenges encountered in working with complicated data sets in which local…

  18. catcher: A Software Program to Detect Answer Copying in Multiple-Choice Tests Based on Nominal Response Model

    Kalender, Ilker

    2012-01-01

catcher is a software program designed to compute the ω (omega) index, a common statistical index for the identification of collusion (cheating) among examinees taking an educational or psychological test. It requires (a) responses and (b) ability estimations of individuals, and (c) item parameters to make computations and outputs the results of…

  19. Guide to good practices for the development of test items

    NONE

    1997-01-01

While the methodology used in developing test items can vary significantly, test items should be developed systematically to ensure quality examinations. Test design and development is discussed in the DOE Guide to Good Practices for Design, Development, and Implementation of Examinations. The present guide supplements that document by providing more detailed guidance on the development of specific test items. It primarily addresses written examination test items, although many of the concepts also apply to oral examinations, both in the classroom and on the job. It is intended as guidance for the classroom and laboratory instructor or curriculum developer responsible for the construction of individual test items. The document focuses on written test items, but also includes information on open-reference (open-book) examination test items. These test items have been categorized as short-answer, multiple-choice, or essay. Each test item format is described, examples are provided, and a procedure for development is included. The appendices provide examples for writing test items, a test item development form, and examples of various test item formats.

  20. Approaches to data analysis of multiple-choice questions

    Lin Ding

    2009-09-01

This paper introduces five commonly used approaches to analyzing multiple-choice test data. They are classical test theory, factor analysis, cluster analysis, item response theory, and model analysis. Brief descriptions of the goals and algorithms of these approaches are provided, together with examples illustrating their applications in physics education research. We minimize mathematics, instead placing emphasis on data interpretation using these approaches.

  1. Analysis Test of Understanding of Vectors with the Three-Parameter Logistic Model of Item Response Theory and Item Response Curves Technique

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-01-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming…

  2. Test of Achievement in Quantitative Economics for Secondary Schools: Construction and Validation Using Item Response Theory

    Eleje, Lydia I.; Esomonu, Nkechi P. M.

    2018-01-01

    A Test to measure achievement in quantitative economics among secondary school students was developed and validated in this study. The test is made up 20 multiple choice test items constructed based on quantitative economics sub-skills. Six research questions guided the study. Preliminary validation was done by two experienced teachers in…

  3. Differential Weighting of Items to Improve University Admission Test Validity

    Eduardo Backhoff Escudero

    2001-05-01

This paper evaluates different ways to increase university admission test criterion-related validity by differentially weighting test items. We compared four methods of weighting multiple-choice items of the Basic Skills and Knowledge Examination (EXHCOBA): (1) penalizing incorrect responses by a constant factor; (2) weighting incorrect responses, considering the levels of error; (3) weighting correct responses, considering the item's difficulty, based on classical measurement theory; and (4) weighting correct responses, considering the item's difficulty, based on item response theory. Results show that none of these methods increased the instrument's predictive validity, although they did improve its concurrent validity. It was concluded that it is appropriate to score the test by simply adding up correct responses.
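Two of the four weighting schemes compared above are easy to illustrate. The sketch below shows (1) a classical correction-for-guessing formula that penalizes incorrect responses by a constant factor, and (3) a difficulty weighting in which a correct answer on item j earns 1 − p_j, where p_j is the proportion of examinees answering item j correctly. The penalty constant and function names are assumptions for illustration, not the EXHCOBA scoring rules:

```python
# Sketch of two differential-weighting schemes (illustrative, not EXHCOBA's).

def formula_score(correct, wrong, k=4):
    """Classical correction-for-guessing score for k-option items:
    right minus wrong/(k-1), so random guessing expects zero gain."""
    return correct - wrong / (k - 1)

def difficulty_weighted_score(item_scores, difficulties):
    """Sum of (1 - p_j) over items answered correctly, where p_j is the
    proportion of examinees who answered item j correctly; harder items
    therefore contribute more to the total."""
    return sum(1 - p for s, p in zip(item_scores, difficulties) if s == 1)

print(formula_score(30, 12, k=4))  # 30 right, 12 wrong -> 26.0
```

Under scheme (1), an examinee with 30 right and 12 wrong on 4-option items scores 30 − 12/3 = 26; under scheme (3), correct answers on easy items (high p_j) add little, while correct answers on hard items add nearly a full point.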

  4. A multiple choice testing program coupled with a year-long elective experience is associated with improved performance on the internal medicine in-training examination.

    Mathis, Bradley R; Warm, Eric J; Schauer, Daniel P; Holmboe, Eric; Rouan, Gregory W

    2011-11-01

The Internal Medicine In-Training Exam (IM-ITE) assesses the content knowledge of internal medicine trainees. Many programs use the IM-ITE to counsel residents, to create individual remediation plans, and to make fundamental programmatic and curricular modifications. To assess the association between a multiple-choice testing program administered during 12 consecutive months of ambulatory and inpatient elective experience and IM-ITE percentile scores in third post-graduate year (PGY-3) categorical residents. Retrospective cohort study. One hundred and four categorical internal medicine residents. Forty-five residents in the 2008 and 2009 classes participated in the study group, and the 59 residents in the three classes that preceded the use of the testing program, 2005-2007, served as controls. A comprehensive, elective-rotation-specific, multiple-choice testing program and a separate board review program, both administered during a continuous long-block elective experience during the twelve months between the second post-graduate year (PGY-2) and PGY-3 in-training examinations. We analyzed the change in median individual percent-correct and percentile scores between the PGY-1 and PGY-2 IM-ITE and between the PGY-2 and PGY-3 IM-ITE in both control and study cohorts. For our main outcome measure, we compared the change in median individual percentile rank between the control and study cohorts between the PGY-2 and the PGY-3 IM-ITE testing opportunities. After experiencing the educational intervention, the study group demonstrated a statistically significant increase of 8.5 percentile points in median individual IM-ITE percentile score between the PGY-2 and PGY-3 examinations, consistent with the testing program being associated with improved IM-ITE performance.

  5. Multiple-Choice Exams and Guessing: Results from a One-Year Study of General Chemistry Tests Designed to Discourage Guessing

    Campbell, Mark L.

    2015-01-01

Multiple-choice exams, while widely used, are necessarily imprecise due to the contribution to the final student score from guessing. This past year at the United States Naval Academy, the construction and grading scheme for the department-wide general chemistry multiple-choice exams were revised with the goal of decreasing the contribution of…

  6. Psychometrics of Multiple Choice Questions with Non-Functioning Distracters: Implications to Medical Education.

    Deepak, Kishore K; Al-Umran, Khalid Umran; AI-Sheikh, Mona H; Dkoli, B V; Al-Rubaish, Abdullah

    2015-01-01

The functionality of distracters in a multiple-choice question plays a very important role. We examined the frequency and impact of functioning and non-functioning distracters on the psychometric properties of 5-option items in clinical disciplines. We analyzed item statistics of 1115 multiple-choice questions from 15 summative assessments of undergraduate medical students and classified the items into five groups by their number of non-functioning distracters. We analyzed the effect of varying degrees of non-functionality, ranging from 0 to 4, on test reliability, difficulty index, discrimination index, and point-biserial correlation. The non-functionality of distracters inversely affected the test reliability and quality of items in a predictable manner. The non-functioning distracters made the items easier and lowered the discrimination index significantly. Three non-functional distracters in a 5-option MCQ significantly affected all psychometric properties, whereas items with fewer non-functioning distracters remained psychometrically as effective as 5-option items. Our study reveals that a multiple-choice question with 3 functional options provides the lowermost limit of item format that has adequate psychometric properties. Tests containing items with fewer functioning options have significantly lower reliability. Distracter function analysis and revision of non-functioning distracters can serve as important methods to improve the psychometrics and reliability of assessment.
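Distracter-function analysis of the kind described above is often operationalized with a selection-frequency cutoff. A common rule of thumb, used here as an assumption rather than necessarily the authors' criterion, flags a distracter as non-functioning if fewer than 5% of examinees choose it:

```python
from collections import Counter

def nonfunctioning_distractors(choices, key, options="ABCDE", threshold=0.05):
    """Return the distracters chosen by fewer than `threshold` of examinees.

    choices:  list of selected options, one per examinee (e.g. 'A'..'E').
    key:      the correct option, which is excluded from the check.
    """
    n = len(choices)
    counts = Counter(choices)
    distractors = set(options) - {key}
    return sorted(d for d in distractors if counts.get(d, 0) / n < threshold)

# Hypothetical item: 'B' is the key; 'D' and 'E' attract almost nobody.
picks = ["B"] * 60 + ["A"] * 20 + ["C"] * 18 + ["D"] * 2
print(nonfunctioning_distractors(picks, key="B"))  # -> ['D', 'E']
```

Flagged distracters would then be rewritten or dropped, which is the revision step the abstract recommends.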

  7. Gender and Ethnicity Differences in Multiple-Choice Testing. Effects of Self-Assessment and Risk-Taking Propensity

    1993-05-01

correctness of the response provides some advantages. They are: 1. Increased reliability of the test; 2. Examinees pay more attention to the multiple…their choice of test date. Each sign-up sheet was divided into four cells: non-Hispanic males and females, and Hispanic males and females.…certain prestige and financial rewards; or entering a conservatory of music for advanced training with a well-known pianist. Mr. H realizes that even

  8. A singular choice for multiple choice

    Frandsen, Gudmund Skovbjerg; Schwartzbach, Michael Ignatieff

    2006-01-01

    How should multiple choice tests be scored and graded, in particular when students are allowed to check several boxes to convey partial knowledge? Many strategies may seem reasonable, but we demonstrate that five self-evident axioms are sufficient to determine completely the correct strategy. We ...

  9. Selecting Items for Criterion-Referenced Tests.

    Mellenbergh, Gideon J.; van der Linden, Wim J.

    1982-01-01

    Three item selection methods for criterion-referenced tests are examined: the classical theory of item difficulty and item-test correlation; the latent trait theory of item characteristic curves; and a decision-theoretic approach for optimal item selection. Item contribution to the standardized expected utility of mastery testing is discussed. (CM)

  10. Multiple-Choice and Short-Answer Exam Performance in a College Classroom

    Funk, Steven C.; Dickson, K. Laurie

    2011-01-01

    The authors experimentally investigated the effects of multiple-choice and short-answer format exam items on exam performance in a college classroom. They randomly assigned 50 students to take a 10-item short-answer pretest or posttest on two 50-item multiple-choice exams in an introduction to personality course. Students performed significantly…

  11. Effects of Reducing the Cognitive Load of Mathematics Test Items on Student Performance

    Susan C. Gillmor

    2015-01-01

This study explores a new item-writing framework for improving the validity of math assessment items. The authors transfer insights from Cognitive Load Theory (CLT), traditionally used in instructional design, to educational measurement. Fifteen multiple-choice math assessment items were modified using research-based strategies for reducing extraneous cognitive load. An experimental design with 222 middle-school students tested the effects of the reduced-cognitive-load items on student performance and anxiety. Significant findings confirm the main research hypothesis that reducing the cognitive load of math assessment items improves student performance. Three load-reducing item modifications are identified as particularly effective for reducing item difficulty: signalling important information, aesthetic item organization, and removing extraneous content. Load reduction was not shown to impact student anxiety. Implications for classroom assessment and future research are discussed.

  12. Social attribution test--multiple choice (SAT-MC) in schizophrenia: comparison with community sample and relationship to neurocognitive, social cognitive and symptom measures.

    Bell, Morris D; Fiszdon, Joanna M; Greig, Tamasine C; Wexler, Bruce E

    2010-09-01

    This is the first report on the use of the Social Attribution Task - Multiple Choice (SAT-MC) to assess social cognitive impairments in schizophrenia. The SAT-MC was originally developed for autism research, and consists of a 64-second animation showing geometric figures enacting a social drama, with 19 multiple choice questions about the interactions. Responses from 85 community-dwelling participants and 66 participants with SCID confirmed schizophrenia or schizoaffective disorders (Scz) revealed highly significant group differences. When the two samples were combined, SAT-MC scores were significantly correlated with other social cognitive measures, including measures of affect recognition, theory of mind, self-report of egocentricity and the Social Cognition Index from the MATRICS battery. Using a cut-off score, 53% of Scz were significantly impaired on SAT-MC compared with 9% of the community sample. Most Scz participants with impairment on SAT-MC also had impairment on affect recognition. Significant correlations were also found with neurocognitive measures but with less dependence on verbal processes than other social cognitive measures. Logistic regression using SAT-MC scores correctly classified 75% of both samples. Results suggest that this measure may have promise, but alternative versions will be needed before it can be used in pre-post or longitudinal designs. (c) 2009 Elsevier B.V. All rights reserved.

  13. [Continuing medical education: how to write multiple choice questions].

    Soler Fernández, R; Méndez Díaz, C; Rodríguez García, E

    2013-06-01

Evaluating professional competence in medicine is a difficult but indispensable task because it makes it possible to evaluate, at different times and from different perspectives, the extent to which the knowledge, skills, and values required for exercising the profession have been acquired. Tests based on multiple choice questions have been and continue to be among the most useful tools for objectively evaluating learning in medicine. When these tests are well designed and correctly used, they can stimulate learning and even measure higher cognitive skills. Designing a multiple choice test is a difficult task that requires knowledge of the material to be tested and of the methodology of test preparation, as well as time to prepare the test. The aim of this article is to review what can be evaluated through multiple choice tests, the rules and guidelines that should be taken into account when writing multiple choice questions, the different formats that can be used, the most common errors in elaborating multiple choice tests, and how to analyze the results of the test to verify its quality. Copyright © 2012 SERAM. Published by Elsevier España. All rights reserved.

  14. Multiple-choice pretesting potentiates learning of related information.

    Little, Jeri L; Bjork, Elizabeth Ligon

    2016-10-01

    Although the testing effect has received a substantial amount of empirical attention, such research has largely focused on the effects of tests given after study. The present research examines the effect of using tests prior to study (i.e., as pretests), focusing particularly on how pretesting influences the subsequent learning of information that is not itself pretested but that is related to the pretested information. In Experiment 1, we found that multiple-choice pretesting was better for the learning of such related information than was cued-recall pretesting or a pre-fact-study control condition. In Experiment 2, we found that the increased learning of non-pretested related information following multiple-choice testing could not be attributed to increased time allocated to that information during subsequent study. Last, in Experiment 3, we showed that the benefits of multiple-choice pretesting over cued-recall pretesting for the learning of related information persist over 48 hours, thus demonstrating the promise of multiple-choice pretesting to potentiate learning in educational contexts. A possible explanation for the observed benefits of multiple-choice pretesting for enhancing the effectiveness with which related nontested information is learned during subsequent study is discussed.

  15. Evaluating multiple-choice exams in large introductory physics courses

    Gary Gladding

    2006-07-01

The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, Graduate Record Examination, and the Force Concept Inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study, the reliability and validity of scores from multiple-choice exams written for and administered in the large introductory physics courses at the University of Illinois, Urbana-Champaign were investigated. The reliability of exam scores over the course of a semester results in approximately a 3% uncertainty in students' total semester exam score. This semester test score uncertainty yields an uncertainty in the students' assigned letter grade that is less than 1/3 of a letter grade. To study the validity of exam scores, a subset of students were ranked independently based on their multiple-choice score, graded explanations, and student interviews. The ranking of these students based on their multiple-choice score was found to be consistent with the ranking assigned by physics instructors based on the students' written explanations (r > 0.94 at the 95% confidence level) and oral interviews (r = 0.94, +0.06/−0.09).
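The score reliability this abstract quantifies is commonly estimated with coefficient alpha (equivalent to KR-20 for dichotomously scored items). A minimal sketch on a small persons-by-items matrix, with invented toy data:

```python
def cronbach_alpha(matrix):
    """Coefficient alpha for a persons x items score matrix; reduces to
    KR-20 when items are scored 0/1. Population variances (divide by n)
    are used throughout."""
    n = len(matrix)
    k = len(matrix[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var([row[j] for row in matrix]) for j in range(k))
    total_var = var([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy 4-examinee, 3-item matrix (illustration only; real reliability
# estimates need many examinees and items).
matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
print(cronbach_alpha(matrix))  # -> 0.75
```

Higher alpha means a smaller relative uncertainty in total scores, which is the quantity the abstract reports as roughly 3% over a semester.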

  16. Binomial test models and item difficulty

    van der Linden, Willem J.

    1979-01-01

    In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty. In this paper these conditions are examined for both a deterministic and a stochastic conception of item responses. It appears that they are more restrictive than is generally

  17. Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-12-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
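The three-parameter logistic model fitted above has a simple closed form: the probability of a correct response is P(θ) = c + (1 − c)/(1 + e^(−Da(θ − b))), with discrimination a, difficulty b, and pseudo-guessing c. A minimal sketch follows; the scaling constant D = 1.7 and the example parameter values are conventional choices, not values from the TUV analysis:

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability theta, for an item
    with discrimination a, difficulty b, and pseudo-guessing c. D = 1.7 is
    the conventional normal-ogive scaling constant."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Item response curve behavior: low-ability examinees approach the guessing
# floor c, high-ability examinees approach 1, and the curve passes through
# (1 + c) / 2 exactly at theta = b.
for theta in (-3.0, 0.0, 3.0):
    print(round(p_correct(theta, a=1.2, b=0.0, c=0.2), 3))
```

Plotting such curves per response option, rather than per item, gives the simplified item response curves (IRC) used in the abstract to judge distractor effectiveness.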

  18. Development and Application of a Two-Tier Multiple-Choice Diagnostic Test for High School Students' Understanding of Cell Division and Reproduction

    Sesli, Ertugrul; Kara, Yilmaz

    2012-01-01

    This study involved the development and application of a two-tier diagnostic test for measuring students' understanding of cell division and reproduction. The instrument development procedure had three general steps: defining the content boundaries of the test, collecting information on students' misconceptions, and instrument development.…

  19. Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

    Suttida Rakkapao

    2016-10-01

This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.

  20. Senior high school students’ need analysis of Three-Tier Multiple Choice (3TMC) diagnostic test about acid-base and solubility equilibrium

    Ardiansah; Masykuri, M.; Rahardjo, S. B.

    2018-05-01

Students’ conceptual understanding is the foundation on which related understanding is built; students, however, often hold their own conceptions. Through this needs analysis, we elicit students’ need for a 3TMC diagnostic test to measure their conceptions of acid-base chemistry and solubility equilibrium. The research was done with a mixed method, analyzing questionnaires both quantitatively and qualitatively by descriptive means. The subjects were 96 students from 4 senior high schools and 4 chemistry teachers, chosen by a random sampling technique. Data were gathered with a questionnaire of 10 questions for students and 28 questions for teachers. The results showed that 97% of students stated that the development of this instrument is needed. In addition, the questionnaire revealed several problems, including learning activity, teachers’ tests, and guessing. In conclusion, it is necessary to develop a 3TMC instrument that can diagnose and measure students’ conceptions of acid-base chemistry and solubility equilibrium.

  1. Algorithmic test design using classical item parameters

    van der Linden, Willem J.; Adema, Jos J.

Two optimization models for the construction of tests with a maximal value of coefficient alpha are given. Both models have a linear form and can be solved by using a branch-and-bound algorithm. The first model assumes an item bank calibrated under the Rasch model and can be used, for instance,
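The paper formulates maximal-alpha test construction as a linear model solved by branch-and-bound. As a much simpler stand-in (not the authors' algorithm), a greedy heuristic can grow a subtest by repeatedly adding the bank item that most increases coefficient alpha; all data and names below are invented for illustration:

```python
import itertools

def alpha(matrix, items):
    """Coefficient alpha of the subtest formed by column indices `items`
    of a persons x items score matrix (population variances)."""
    k = len(items)

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    total_var = var([sum(row[j] for j in items) for row in matrix])
    if k < 2 or total_var == 0:
        return float("-inf")  # alpha undefined for these subtests
    item_vars = sum(var([row[j] for row in matrix]) for j in items)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def greedy_select(matrix, length):
    """Greedy maximal-alpha subtest: seed with the best pair of items,
    then add one item at a time."""
    n_items = len(matrix[0])
    chosen = list(max(itertools.combinations(range(n_items), 2),
                      key=lambda pair: alpha(matrix, list(pair))))
    remaining = [j for j in range(n_items) if j not in chosen]
    while len(chosen) < length:
        best = max(remaining, key=lambda j: alpha(matrix, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)

# Items 0 and 1 are perfectly correlated, item 2 is near-noise, so the
# greedy heuristic should pair items 0 and 1.
bank = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
print(greedy_select(bank, 2))  # -> [0, 1]
```

Unlike branch-and-bound, this heuristic carries no optimality guarantee; it only illustrates the objective the models in the paper optimize exactly.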

  2. Bayesian item selection criteria for adaptive testing

    van der Linden, Willem J.

    1996-01-01

    R.J. Owen (1975) proposed an approximate empirical Bayes procedure for item selection in adaptive testing. The procedure replaces the true posterior by a normal approximation with closed-form expressions for its first two moments. This approximation was necessary to minimize the computational

  3. Using Multiple-Choice Questions to Evaluate In-Depth Learning of Economics

    Buckles, Stephen; Siegfried, John J.

    2006-01-01

    Multiple-choice questions are the basis of a significant portion of assessment in introductory economics courses. However, these questions, as found in course assessments, test banks, and textbooks, often fail to evaluate students' abilities to use and apply economic analysis. The authors conclude that multiple-choice questions can be used to…

  4. Item calibration in incomplete testing designs

    Norman D. Verhelst

    2011-01-01

This study discusses the justifiability of item parameter estimation in incomplete testing designs in item response theory. Marginal maximum likelihood (MML) as well as conditional maximum likelihood (CML) procedures are considered in three commonly used incomplete designs: random incomplete, multistage testing, and targeted testing designs. Mislevy and Sheehan (1989) have shown that in incomplete designs the justifiability of MML can be deduced from Rubin's (1976) general theory on inference in the presence of missing data. Their results are recapitulated and extended to more situations. In this study it is shown that for CML estimation the justification must be established in an alternative way, by considering the neglected part of the complete likelihood. The problems with incomplete designs are not generally recognized in practical situations. This is due to the stochastic nature of the incomplete designs, which is not taken into account in standard computer algorithms. For that reason, incorrect uses of standard MML and CML algorithms are discussed.

  5. Science Library of Test Items. Volume Three. Mastery Testing Programme. Introduction and Manual.

    New South Wales Dept. of Education, Sydney (Australia).

    A set of short tests aimed at measuring student mastery of specific skills in the natural sciences are presented with a description of the mastery program's purposes, development, and methods. Mastery learning, criterion-referenced testing, and the scope of skills to be tested are defined. Each of the multiple choice tests for grades 7 through 10…

  6. Comedy workshop: an enjoyable way to develop multiple-choice questions.

    Droegemueller, William; Gant, Norman; Brekken, Alvin; Webb, Lynn

    2005-01-01

To describe an innovative method of developing multiple-choice items for a board certification examination. The development of appropriate multiple-choice items is more of an art than a science. The comedy-workshop format for developing questions for a certification examination is similar to the process used by comedy writers composing scripts for television shows. This group format dramatically diminishes the frustrations faced by an individual question writer attempting to create items. The vast majority of our comedy-workshop participants enjoy and prefer this format. It provides an ideal environment in which to teach and blend the talents of inexperienced and experienced question writers. This is a descriptive article in which we suggest an innovative process in the art of creating multiple-choice items for a high-stakes examination.

  7. Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items

    Gierl, Mark J.; Lai, Hollis

    2013-01-01

    Changes to the design and development of our educational assessments are resulting in an unprecedented demand for a large and continuous supply of content-specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer…

  8. The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory

    Sahin, Alper; Anil, Duygu

    2017-01-01

    This study investigates the effects of sample size and test length on item-parameter estimation in test development utilizing three unidimensional dichotomous models of item response theory (IRT). For this purpose, a real language test comprised of 50 items was administered to 6,288 students. Data from this test was used to obtain data sets of…

  9. Measuring University students' understanding of the greenhouse effect - a comparison of multiple-choice, short answer and concept sketch assessment tools with respect to students' mental models

    Gold, A. U.; Harris, S. E.

    2013-12-01

    The greenhouse effect comes up in most discussions about climate and is a key concept related to climate change. Existing studies have shown that students and adults alike lack a detailed understanding of this important concept or hold misconceptions about it. We studied the effectiveness of different interventions on university-level students' understanding of the greenhouse effect. Introductory-level science students were tested for their prior knowledge of the greenhouse effect using validated multiple-choice questions, short answers, and concept sketches. All students participated in a common lesson about the greenhouse effect and were then randomly assigned to one of two lab groups. One group explored an existing simulation about the greenhouse effect (PhET lesson) and the other group worked with absorption spectra of different greenhouse gases (Data lesson) to deepen their understanding of the greenhouse effect. After the lab lesson, all students completed the same assessment, which included multiple-choice, short-answer, and concept-sketch items. In all, 164 students completed the assessments: 76 completed the PhET lesson, 77 completed the Data lesson, and 11 missed the contrasting lesson. In this presentation we compare students' multiple-choice answers, short answers, and concept sketches, and explore how well each assessment type represents students' knowledge. We also identify items that indicate the level of understanding of the greenhouse effect, as measured by the correspondence of student answers to an expert mental model and expert responses. Preliminary analysis shows that students who produce concept sketches close to expert drawings also choose correct multiple-choice answers; however, correct multiple-choice answers do not necessarily indicate that a student will produce expert-like concept sketches. Multiple-choice questions that require detailed

  10. Evaluation of Northwest University, Kano Post-UTME Test Items Using Item Response Theory

    Bichi, Ado Abdu; Hafiz, Hadiza; Bello, Samira Abdullahi

    2016-01-01

    High-stakes testing is used for the purpose of providing results that have important consequences. Validity is the cornerstone upon which all measurement systems are built. This study applied Item Response Theory principles to analyse Northwest University Kano Post-UTME Economics test items. The fifty (50) developed economics test items were…

  11. Chemistry and biology by new multiple choice

    Seo, Hyeong Seok; Kim, Seong Hwan

    2003-02-01

    This book is divided into two parts. The first part is about chemistry, covering the science of matter, atomic structure and the periodic law, chemical bonding and intermolecular forces, states of matter and solutions, chemical reactions, and organic compounds. The second part describes biology, covering molecules and cells, energy in cells and chemical synthesis, molecular biology and heredity, animal function, plant function, and evolution and ecology. The book explains chemistry and biology through new multiple-choice questions.

  12. Computerized Adaptive Test (CAT) Applications and Item Response Theory Models for Polytomous Items

    Aybek, Eren Can; Demirtasli, R. Nukhet

    2017-01-01

    This article aims to provide a theoretical framework for computerized adaptive tests (CAT) and item response theory models for polytomous items. It also aims to introduce simulation and live CAT software to interested researchers. Computerized adaptive test algorithms, assumptions of item response theory models, nominal response…

  13. Can multiple-choice questions simulate free-response questions?

    Lin, Shih-Yin; Singh, Chandralekha

    2016-01-01

    We discuss a study to evaluate the extent to which free-response questions can be approximated by multiple-choice equivalents. Two carefully designed research-based multiple-choice questions were transformed into a free-response format and administered on the final exam in a calculus-based introductory physics course. The original multiple-choice questions were administered on the final exam in another, similar introductory physics course. Findings suggest that carefully designed multiple-choice...

  14. Modeling Local Item Dependence in Cloze and Reading Comprehension Test Items Using Testlet Response Theory

    Baghaei, Purya; Ravand, Hamdollah

    2016-01-01

    In this study the magnitudes of local dependence generated by cloze test items and reading comprehension items were compared and their impact on parameter estimates and test precision was investigated. An advanced English as a foreign language reading comprehension test containing three reading passages and a cloze test was analyzed with a…

  15. Item Response Theory Models for Performance Decline during Testing

    Jin, Kuan-Yu; Wang, Wen-Chung

    2014-01-01

    Sometimes, test-takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to…

  16. Genetic Algorithms for Multiple-Choice Problems

    Aickelin, Uwe

    2010-04-01

    This thesis investigates the use of problem-specific knowledge to enhance a genetic algorithm approach to multiple-choice optimisation problems. It shows that such information can significantly enhance performance, but that the choice of information and the way it is included are important factors for success. Two multiple-choice problems are considered. The first is constructing a feasible nurse roster that considers as many requests as possible. In the second problem, shops are allocated to locations in a mall subject to constraints while maximising the overall income. Genetic algorithms are chosen for their well-known robustness and ability to solve large and complex discrete optimisation problems. However, a survey of the literature reveals room for further research into generic ways to include constraints in a genetic algorithm framework. Hence, the main theme of this work is to balance feasibility and cost of solutions. In particular, co-operative co-evolution with hierarchical sub-populations, repair schemes that exploit problem structure, and indirect genetic algorithms with self-adjusting decoder functions are identified as promising approaches. The research starts by applying standard genetic algorithms to the problems and explaining the failure of such approaches due to epistasis. To overcome this, problem-specific information is added in a variety of ways, some of which are designed to increase the number of feasible solutions found whilst others are intended to improve the quality of such solutions. As well as a theoretical discussion of the underlying reasons for using each operator, extensive computational experiments are carried out on a variety of data. These show that the indirect approach relies less on problem structure and hence is easier to implement and superior in solution quality.
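The repair-scheme idea can be sketched with a toy example (our own construction with made-up weights and values, not the thesis code): a steady-state genetic algorithm for a tiny multiple-choice knapsack, where a greedy repair step restores feasibility after crossover and mutation.

```python
import random

# Hypothetical data: one option must be chosen per group, subject to a weight cap.
WEIGHTS = [[3, 5, 7], [2, 4, 6], [1, 8, 9]]
VALUES = [[6, 9, 12], [4, 7, 11], [2, 13, 15]]
CAPACITY = 14

def weight(sol):
    return sum(WEIGHTS[g][c] for g, c in enumerate(sol))

def fitness(sol):
    """Total value of the chosen options; solutions are repaired before scoring."""
    return sum(VALUES[g][c] for g, c in enumerate(sol))

def repair(sol):
    """Greedily downgrade the heaviest choice until the capacity constraint holds."""
    sol = list(sol)
    while weight(sol) > CAPACITY:
        g = max(range(len(sol)), key=lambda i: WEIGHTS[i][sol[i]])
        sol[g] = max(0, sol[g] - 1)
    return sol

def evolve(generations=100, pop_size=20, seed=0):
    """Steady-state GA: crossover + mutation, repair, replace the worst member."""
    rng = random.Random(seed)
    pop = [repair([rng.randrange(3) for _ in WEIGHTS]) for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)
        cut = rng.randrange(1, len(WEIGHTS))
        child = repair(a[:cut] + b[cut:])
        if rng.random() < 0.2:  # occasional mutation of one group's choice
            g = rng.randrange(len(WEIGHTS))
            child[g] = rng.randrange(3)
            child = repair(child)
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        if fitness(child) > fitness(pop[worst]):
            pop[worst] = child
    return max(pop, key=fitness)
```

Because every candidate passes through `repair`, the population contains only feasible rosters, which is the essence of balancing feasibility against solution cost.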

  17. An Investigation of Item Type in a Standards-Based Assessment.

    Liz Hollingworth

    2007-12-01

    Large-scale state assessment programs use both multiple-choice and open-ended items on tests for accountability purposes. Certainly, there is an intuitive belief among some educators and policy makers that open-ended items measure something different than multiple-choice items. This study examined two item formats in custom-built, standards-based tests of achievement in Reading and Mathematics at grades 3-8. In this paper, we raise questions about the value of including open-ended items, given scoring costs, time constraints, and the higher probability of missing data from test-takers.

  18. Evaluating multiple-choice exams in large introductory physics courses

    Gary Gladding; Tim Stelzer; Michael Scott

    2006-01-01

    The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, graduate record examination, and the force concept inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study, the reliability and validity of scores from multiple-choice exams written for and administered in the large introductory physics courses at the Unive...

  19. The Role of Item Feedback in Self-Adapted Testing.

    Roos, Linda L.; And Others

    1997-01-01

    The importance of item feedback in self-adapted testing was studied by comparing feedback and no feedback conditions for computerized adaptive tests and self-adapted tests taken by 363 college students. Results indicate that item feedback is not necessary to realize score differences between self-adapted and computerized adaptive testing. (SLD)

  20. Effect of Differential Item Functioning on Test Equating

    Kabasakal, Kübra Atalay; Kelecioglu, Hülya

    2015-01-01

    This study examines the effect of differential item functioning (DIF) items on test equating through multilevel item response models (MIRMs) and traditional IRMs. The performances of three different equating models were investigated under 24 different simulation conditions, and the variables whose effects were examined included sample size, test…

  1. Computerized adaptive testing item selection in computerized adaptive learning systems

    Eggen, Theodorus Johannes Hendrikus Maria; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item selection methods traditionally developed for computerized adaptive testing (CAT) are explored for their usefulness in item-based computerized adaptive learning (CAL) systems. While in CAT Fisher information-based selection is optimal, for recovering learning populations in CAL systems item

  2. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis

    Mohammed Ahmed M

    2009-07-01

    Background: Four- or five-option multiple choice questions (MCQs) are the standard in health-science disciplines, both on certification-level examinations and on in-house developed tests. Previous research has shown, however, that few MCQs have three or four functioning distractors. The purpose of this study was to investigate non-functioning distractors in teacher-developed tests in one nursing program in an English-language university in Hong Kong. Methods: Using item-analysis data, we assessed the proportion of non-functioning distractors on a sample of seven test papers administered to undergraduate nursing students. A total of 514 items were reviewed, including 2056 options (1542 distractors and 514 correct responses). Non-functioning options were defined as ones that were chosen by fewer than 5% of examinees and those with a positive option discrimination statistic. Results: The proportion of items containing 0, 1, 2, and 3 functioning distractors was 12.3%, 34.8%, 39.1%, and 13.8% respectively. Overall, items contained an average of 1.54 (SD = 0.88) functioning distractors. Only 52.2% (n = 805) of all distractors were functioning effectively and 10.2% (n = 158) had a choice frequency of 0. Items with more functioning distractors were more difficult and more discriminating. Conclusion: The low frequency of items with three functioning distractors in the four-option items in this study suggests that teachers have difficulty developing plausible distractors for most MCQs. Test items should consist of as many options as is feasible given the item content and the number of plausible distractors; in most cases this would be three. Item analysis results can be used to identify and remove non-functioning distractors from MCQs that have been used in previous tests.

  3. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis.

    Tarrant, Marie; Ware, James; Mohammed, Ahmed M

    2009-07-07

    Four- or five-option multiple choice questions (MCQs) are the standard in health-science disciplines, both on certification-level examinations and on in-house developed tests. Previous research has shown, however, that few MCQs have three or four functioning distractors. The purpose of this study was to investigate non-functioning distractors in teacher-developed tests in one nursing program in an English-language university in Hong Kong. Using item-analysis data, we assessed the proportion of non-functioning distractors on a sample of seven test papers administered to undergraduate nursing students. A total of 514 items were reviewed, including 2056 options (1542 distractors and 514 correct responses). Non-functioning options were defined as ones that were chosen by fewer than 5% of examinees and those with a positive option discrimination statistic. The proportion of items containing 0, 1, 2, and 3 functioning distractors was 12.3%, 34.8%, 39.1%, and 13.8% respectively. Overall, items contained an average of 1.54 (SD = 0.88) functioning distractors. Only 52.2% (n = 805) of all distractors were functioning effectively and 10.2% (n = 158) had a choice frequency of 0. Items with more functioning distractors were more difficult and more discriminating. The low frequency of items with three functioning distractors in the four-option items in this study suggests that teachers have difficulty developing plausible distractors for most MCQs. Test items should consist of as many options as is feasible given the item content and the number of plausible distractors; in most cases this would be three. Item analysis results can be used to identify and remove non-functioning distractors from MCQs that have been used in previous tests.
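The screening rule described above — a distractor is non-functioning if fewer than 5% of examinees chose it, or if choosing it correlates positively with total score — can be sketched as follows (our own illustration; function and variable names are ours):

```python
from statistics import mean

def option_discrimination(chose_option, total_scores):
    """Point-biserial correlation between choosing the option (0/1) and total score.
    A positive value means higher scorers tend to pick this option."""
    n = len(chose_option)
    mx, my = mean(chose_option), mean(total_scores)
    cov = sum((x - mx) * (y - my) for x, y in zip(chose_option, total_scores)) / n
    sx = (sum((x - mx) ** 2 for x in chose_option) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in total_scores) / n) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def functioning_distractors(responses, key, total_scores, options="ABCD"):
    """Distractors chosen by >= 5% of examinees with negative discrimination."""
    n = len(responses)
    out = set()
    for opt in options:
        if opt == key:
            continue
        chose = [1 if r == opt else 0 for r in responses]
        if sum(chose) / n >= 0.05 and option_discrimination(chose, total_scores) < 0:
            out.add(opt)
    return out
```

Run over the item-analysis data, this flags the options worth keeping; everything else is a candidate for removal or revision.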

  4. Criteria for eliminating items of a Test of Figural Analogies

    Diego Blum

    2013-12-01

    This paper describes the steps taken to eliminate two of the items in a Test of Figural Analogies (TFA). The main guidelines of psychometric analysis under Classical Test Theory (CTT) and Item Response Theory (IRT) are explained. The item elimination process was based on both the study of the CTT difficulty and discrimination indices and on a unidimensionality analysis. The a, b, and c parameters of the Three Parameter Logistic Model of IRT were also considered for this purpose, as was the assessment of each item's fit to this model. The unfavourable characteristics of a group of TFA items are detailed, and decisions leading to their possible elimination are discussed.
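The a, b, and c parameters mentioned above enter the Three Parameter Logistic Model through its item response function; a minimal sketch of the standard model (ours, not the paper's code):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL item response function: P(correct | theta) = c + (1 - c) / (1 + exp(-a(theta - b))).
    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b the probability sits halfway between the guessing floor c and 1; as ability decreases, it approaches c rather than 0, which is why the c parameter matters for multiple-choice items.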

  5. Detection of differential item functioning using Lagrange multiplier tests

    Glas, Cornelis A.W.

    1996-01-01

    In this paper it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or C. R. Rao's efficient score test. The test is presented in the framework of a number of item response theory (IRT) models such as the Rasch model, the one-parameter logistic model, the

  6. A person fit test for IRT models for polytomous items

    Glas, Cornelis A.W.; Dagohoy, A.V.

    2007-01-01

    A person fit test based on the Lagrange multiplier test is presented for three item response theory models for polytomous items: the generalized partial credit model, the sequential model, and the graded response model. The test can also be used in the framework of multidimensional ability

  7. Algorithms for computerized test construction using classical item parameters

    Adema, Jos J.; van der Linden, Willem J.

    1989-01-01

    Recently, linear programming models for test construction were developed. These models were based on the information function from item response theory. In this paper another approach is followed. Two 0-1 linear programming models for the construction of tests using classical item and test

  8. Validation and Structural Analysis of the Kinematics Concept Test

    Lichtenberger, A.; Wagner, C.; Hofer, S. I.; Stem, E.; Vaterlaus, A.

    2017-01-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students' conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part…

  9. Evaluating Multiple-Choice Exams in Large Introductory Physics Courses

    Scott, Michael; Stelzer, Tim; Gladding, Gary

    2006-01-01

    The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, graduate record examination, and the force concept inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study,…

  10. Benford's Law: textbook exercises and multiple-choice testbanks.

    Slepkov, Aaron D; Ironside, Kevin B; DiBattista, David

    2015-01-01

    Benford's Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford's Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford's Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford's Law, the testbank is nonetheless secure against such a Benford's attack for banal reasons.
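The expected frequencies quoted above follow from the Benford distribution P(d) = log10(1 + 1/d); a quick sketch for checking a dataset's leading digits (our own illustration, not the authors' code):

```python
import math

def benford_expected(d):
    """Expected relative frequency of leading digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):e}"[0])

def digit_frequencies(data):
    """Observed relative frequencies of leading digits 1-9 in a dataset of nonzero numbers."""
    counts = {d: 0 for d in range(1, 10)}
    for x in data:
        counts[leading_digit(x)] += 1
    return {d: c / len(data) for d, c in counts.items()}
```

Comparing `digit_frequencies` of a testbank's answers against `benford_expected` is exactly the kind of conformity check the study performs.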

  11. Benford's Law: textbook exercises and multiple-choice testbanks.

    Aaron D Slepkov

    Benford's Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford's Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford's Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford's Law, the testbank is nonetheless secure against such a Benford's attack for banal reasons.

  12. Procedures for Selecting Items for Computerized Adaptive Tests.

    Kingsbury, G. Gage; Zara, Anthony R.

    1989-01-01

    Several classical approaches and alternative approaches to item selection for computerized adaptive testing (CAT) are reviewed and compared. The study also describes procedures for constrained CAT that may be added to classical item selection approaches to allow them to be used for applied testing. (TJH)

  13. Detecting Test Tampering Using Item Response Theory

    Wollack, James A.; Cohen, Allan S.; Eckerly, Carol A.

    2015-01-01

    Test tampering, especially on tests for educational accountability, is an unfortunate reality, necessitating that the state (or its testing vendor) perform data forensic analyses, such as erasure analyses, to look for signs of possible malfeasance. Few statistical approaches exist for detecting fraudulent erasures, and those that do largely do not…

  14. Item selection and ability estimation adaptive testing

    Pashley, Peter J.; van der Linden, Wim J.; van der Linden, Willem J.; Glas, Cornelis A.W.; Glas, Cees A.W.

    2010-01-01

    The last century saw a tremendous progression in the refinement and use of standardized linear tests. The first administered College Board exam occurred in 1901 and the first Scholastic Assessment Test (SAT) was given in 1926. Since then, progressively more sophisticated standardized linear tests

  15. Quantitative penetration testing with item response theory

    Pieters, W.; Arnold, F.; Stoelinga, M.I.A.

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management,

  16. Quantitative Penetration Testing with Item Response Theory

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle Ida Antoinette

    2014-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  17. Quantitative penetration testing with item response theory

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  18. Group differences in the heritability of items and test scores

    Wicherts, J.M.; Johnson, W.

    2009-01-01

    It is important to understand potential sources of group differences in the heritability of intelligence test scores. On the basis of a basic item response model we argue that heritabilities which are based on dichotomous item scores normally do not generalize from one sample to the next. If groups

  19. Mathematical-programming approaches to test item pool design

    Veldkamp, Bernard P.; van der Linden, Willem J.; Ariel, A.

    2002-01-01

    This paper presents an approach to item pool design that has the potential to improve on the quality of current item pools in educational and psychological testing and hence to increase both measurement precision and validity. The approach consists of the application of mathematical programming

  20. Item Response Theory Modeling of the Philadelphia Naming Test

    Fergadiotis, Gerasimos; Kellough, Stacey; Hula, William D.

    2015-01-01

    Purpose: In this study, we investigated the fit of the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) to an item-response-theory measurement model, estimated the precision of the resulting scores and item parameters, and provided a theoretical rationale for the interpretation of PNT overall scores by relating…

  1. Performance on large-scale science tests: Item attributes that may impact achievement scores

    Gordon, Janet Victoria

    Significant differences in achievement among ethnic groups persist on the eighth-grade science Washington Assessment of Student Learning (WASL). The WASL measures academic performance in science using both scenario and stand-alone question types. Previous research suggests that presenting target items connected to an authentic context, like scenario question types, can increase science achievement scores, especially in underrepresented groups, and thus help to close the achievement gap. The purpose of this study was to identify significant differences in performance between gender and ethnic subgroups by question type on the 2005 eighth-grade science WASL. MANOVA and ANOVA were used to examine relationships between gender and ethnic subgroups as independent variables and achievement scores on scenario and stand-alone question types as dependent variables. MANOVA revealed no significant effects for gender, suggesting that the 2005 eighth-grade science WASL was gender neutral. However, there were significant effects for ethnicity. ANOVA revealed significant effects for ethnicity and for the ethnicity by gender interaction in both question types. Effect sizes were negligible for the ethnicity by gender interaction. Large effect sizes between ethnicities on scenario question types became moderate to small effect sizes on stand-alone question types. This indicates that the score advantage the higher performing subgroups had over the lower performing subgroups was not as large on stand-alone question types as on scenario question types. A further comparison examined performance on multiple-choice items only within both question types. Similar achievement patterns between ethnicities emerged; however, achievement patterns between genders changed in boys' favor. Scenario question types appeared to register differences between ethnic groups to a greater degree than stand-alone question types. These differences may be attributable to individual differences in cognition

  2. Multiple choice exams of medical knowledge with open books and web access? A validity study

    O'Neill, Lotte; Simonsen, Eivind Ortind; Knudsen, Ulla Breth

    2015-01-01

    Background: Open book tests have been suggested to lower test anxiety and promote deeper learning strategies. In the Aarhus University medical program, one-quarter of the curriculum assesses students' medical knowledge with 'open book, open web' (OBOW) multiple choice examinations. We found little existing...

  3. A Two-Tier Multiple Choice Questions to Diagnose Thermodynamic Misconception of Thai and Laos Students

    Kamcharean, Chanwit; Wattanakasiwich, Pornrat

    The objective of this study was to diagnose misconceptions of Thai and Lao students in thermodynamics using a two-tier multiple-choice test. Two-tier multiple-choice questions consist of a first tier, a content-based question, and a second tier, a reasoning-based question. Data on student understanding were collected using 10 two-tier multiple-choice questions. The Thai participants were first-year students (N = 57) taking a fundamental physics course at Chiang Mai University in 2012. The Lao participants were high school students in Grade 11 (N = 57) and Grade 12 (N = 83) at Muengnern high school in Xayaboury province, Lao PDR. Most students answered the content-tier questions correctly but chose incorrect answers for the reason-tier questions. When further investigating their incorrect reasons, we found misconceptions similar to those reported in previous studies, such as incorrectly relating pressure to temperature when presented with multiple variables.

  4. Industrial Arts Test Development, Book III. Resource Items for Graphics Technology, Power Technology, Production Technology.

    New York State Education Dept., Albany.

    This booklet is designed to assist teachers in developing examinations for classroom use. It is a collection of 955 objective test questions, mostly multiple choice, for industrial arts students in the three areas of graphics technology, power technology, and production technology. Scoring keys are provided. There are no copyright restrictions,…

  5. Item response times in computerized adaptive testing

    Lutz F. Hornke

    2000-01-01

    Computerized adaptive tests (CATs) provide scores and, at the same time, item response times. Research into the additional meaning that can be extracted from response-time information is of particular interest. Data were available from 5912 young people who took a computerized adaptive test. Earlier studies report longer response times when answers are incorrect; this result was replicated in the present, larger study. Nevertheless, mean item response times for wrong and correct answers do not support an interpretation different from that obtained with trait levels, nor do they correlate differently with a number of ability tests. Whether response times should be interpreted on the same dimension the CAT measures or on other dimensions is discussed. Since the early 1930s, response times have been regarded as indicators of personality traits to be distinguished from the traits measured by test scores. This idea is discussed, and arguments for and against it are offered. More recent model-based approaches are also presented. Whether additional diagnostic information can be obtained from a CAT with detailed, scheduled data collection remains an open question.

  6. Item response theory analysis of the mechanics baseline test

    Cardamone, Caroline N.; Abbott, Jonathan E.; Rayyan, Saif; Seaton, Daniel T.; Pawl, Andrew; Pritchard, David E.

    2012-02-01

    Item response theory is useful both in the development and evaluation of assessments and in computing standardized measures of student performance. In item response theory, individual parameters (difficulty, discrimination) for each item or question are fit by item response models. These parameters provide a means for evaluating a test and offer a better measure of student skill than a raw test score, because each skill calculation considers not only the number of questions answered correctly, but also the individual properties of all questions answered. Here, we present the results from an analysis of the Mechanics Baseline Test given at MIT during 2005-2010. Using the item parameters, we identify questions on the Mechanics Baseline Test that are not effective in discriminating between MIT students of different abilities. We show that a limited subset of the highest quality questions on the Mechanics Baseline Test returns accurate measures of student skill. We compare student skills as determined by item response theory to the more traditional measurement of the raw score and show that a comparable measure of learning gain can be computed.
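The item parameters described above can be illustrated with the two-parameter logistic model commonly used in such analyses (a sketch of the standard model, not the authors' code): the probability of a correct response and the Fisher information an item contributes at each ability level.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: P(correct) for ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p).
    Items with low discrimination contribute little information,
    which is how weakly discriminating questions are identified."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)
```

Information peaks at theta = b and grows with a squared, so dropping low-a items costs little measurement precision — consistent with the finding that a subset of high-quality questions suffices.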

  7. Detection of differential item functioning using Lagrange multiplier tests

    Glas, Cornelis A.W.

    1998-01-01

    In the present paper it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or Rao’s efficient score test. The test is presented in the framework of a number of IRT models such as the Rasch model, the OPLM, the 2-parameter logistic model, the

  8. [A factor analysis method for contingency table data with unlimited multiple choice questions].

    Toyoda, Hideki; Haiden, Reina; Kubo, Saori; Ikehara, Kazuya; Isobe, Yurie

    2016-02-01

    The purpose of this study is to propose a method of factor analysis for analyzing contingency tables developed from the data of unlimited multiple-choice questions. This method assumes that the element of each cell of the contingency table has a binomial distribution, and a factor analysis model is applied to the logit of the selection probability. A scree plot and WAIC are used to decide the number of factors, and the standardized residual, the standardized difference between the sample proportion and the predicted proportion, is used to select items. The proposed method was applied to real product impression research data on advertised chips and energy drinks. The results of the analysis showed that this method can be used in conjunction with the conventional factor analysis model and that the extracted factors were fully interpretable, suggesting the usefulness of the proposed method in studies of psychology using unlimited multiple-choice questions.
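
The logit transform at the core of the proposed model can be sketched as follows; the contingency-table counts are made up for illustration:

```python
import math

def logit(p):
    """Log-odds transform applied to each cell's selection probability."""
    return math.log(p / (1.0 - p))

# Hypothetical contingency table: rows = products, columns = impression
# words; each cell counts respondents (out of n) who ticked that word
# for that product in an unlimited multiple-choice question.
n = 200
counts = [[150, 30, 90],
          [40, 120, 60]]

# Selection probabilities and their logits -- the quantities a factor
# analysis model would be fitted to in the approach described above.
logits = [[logit(c / n) for c in row] for row in counts]
```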

  9. Dual processing theory and experts' reasoning: exploring thinking on national multiple-choice questions.

    Durning, Steven J; Dong, Ting; Artino, Anthony R; van der Vleuten, Cees; Holmboe, Eric; Schuwirth, Lambert

    2015-08-01

    An ongoing debate exists in the medical education literature regarding the potential benefits of pattern recognition (non-analytic reasoning), actively comparing and contrasting diagnostic options (analytic reasoning), or a combination of the two. Studies have not, however, explicitly explored faculty's thought processes while tackling clinical problems through the lens of dual process theory to inform this debate. Further, these thought processes have not been studied in relation to the difficulty of the task or other potential mediating influences such as personal factors and fatigue, which could be influenced by factors such as sleep deprivation. We therefore sought to determine which reasoning process(es) faculty used when answering clinically oriented multiple-choice questions (MCQs) and whether these processes differed on the dual process theory characteristics of accuracy, reading time, and answering time, as well as psychometrically determined item difficulty and sleep deprivation. We used a think-aloud procedure to explore faculty's thought processes while they took these MCQs, coding the think-aloud data by reasoning process (analytic, non-analytic, guessing, or a combination of processes) as well as word count, number of stated concepts, reading time, answering time, and accuracy. We also included questions regarding the amount of work in the recent past. We then conducted statistical analyses to examine the associations between these measures, such as correlations between the frequencies of reasoning processes and item accuracy and difficulty. We also observed the total frequencies of the different reasoning processes for correctly and incorrectly answered questions. Regardless of whether the questions were classified as 'hard' or 'easy', non-analytical reasoning led to the correct answer more often than to an incorrect answer. Significant correlations were found between self-reported recent hours worked and think-aloud word count

  10. THE MULTIPLE CHOICE PROBLEM WITH INTERACTIONS BETWEEN CRITERIA

    Luiz Flavio Autran Monteiro Gomes

    2015-12-01

    Full Text Available ABSTRACT An important problem in Multi-Criteria Decision Analysis arises when one must select at least two alternatives at the same time. This can be denoted as a multiple choice problem. In other words, instead of evaluating each of the alternatives separately, they must be combined into groups of n alternatives, where n ≥ 2. When the multiple choice problem must be solved under multiple criteria, the result is a multi-criteria, multiple choice problem. In this paper, it is shown through examples how this problem can be tackled on a bipolar scale. The Choquet integral is used in this paper to take care of interactions between criteria. A numerical application example is conducted using data from SEBRAE-RJ, a non-profit private organization that has the mission of promoting competitiveness, sustainable development and entrepreneurship in the state of Rio de Janeiro, Brazil. The paper closes with suggestions for future research.

  11. A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating

    Michalis P Michaelides

    2010-10-01

    Full Text Available Many studies have investigated the topic of change or drift in item parameter estimates in the context of Item Response Theory. Content effects, such as instructional variation and curricular emphasis, as well as context effects, such as the wording, position, or exposure of an item have been found to impact item parameter estimates. The issue becomes more critical when items with estimates exhibiting differential behavior across test administrations are used as common for deriving equating transformations. This paper reviews the types of effects on IRT item parameter estimates and focuses on the impact of misbehaving or aberrant common items on equating transformations. Implications relating to test validity and the judgmental nature of the decision to keep or discard aberrant common items are discussed, with recommendations for future research into more informed and formal ways of dealing with misbehaving common items.
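
One standard way to derive such an equating transformation from common-item parameter estimates is mean-sigma linking; the sketch below uses invented difficulty estimates, not data from this paper:

```python
# Mean-sigma linking: derive a linear transformation from common-item
# difficulty (b) estimates obtained on two administrations (forms X, Y).
from statistics import mean, pstdev

b_form_x = [-1.2, -0.4, 0.1, 0.8, 1.5]   # common-item difficulties, old form
b_form_y = [-0.9, -0.1, 0.5, 1.3, 2.1]   # same items re-estimated on new form

A = pstdev(b_form_y) / pstdev(b_form_x)   # slope of the linear link
B = mean(b_form_y) - A * mean(b_form_x)   # intercept

def to_form_y_scale(b_x):
    """Place a form-X difficulty estimate on the form-Y scale."""
    return A * b_x + B

# Large residuals after linking are one way to flag the misbehaving or
# aberrant common items this review is concerned with.
residuals = [by - to_form_y_scale(bx) for bx, by in zip(b_form_x, b_form_y)]
```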

  12. A Review of the Effects on IRT Item Parameter Estimates with a Focus on Misbehaving Common Items in Test Equating.

    Michaelides, Michalis P

    2010-01-01

    Many studies have investigated the topic of change or drift in item parameter estimates in the context of item response theory (IRT). Content effects, such as instructional variation and curricular emphasis, as well as context effects, such as the wording, position, or exposure of an item have been found to impact item parameter estimates. The issue becomes more critical when items with estimates exhibiting differential behavior across test administrations are used as common for deriving equating transformations. This paper reviews the types of effects on IRT item parameter estimates and focuses on the impact of misbehaving or aberrant common items on equating transformations. Implications relating to test validity and the judgmental nature of the decision to keep or discard aberrant common items are discussed, with recommendations for future research into more informed and formal ways of dealing with misbehaving common items.

  13. Relative Merits of Four Methods for Scoring Cloze Tests.

    Brown, James Dean

    1980-01-01

    Describes study comparing merits of exact answer, acceptable answer, clozentropy and multiple choice methods for scoring tests. Results show differences among reliability, mean item facility, discrimination and usability, but not validity. (BK)

  14. Assessing Differential Item Functioning on the Test of Relational Reasoning

    Denis Dumas

    2018-03-01

    Full Text Available The test of relational reasoning (TORR is designed to assess the ability to identify complex patterns within visuospatial stimuli. The TORR is designed for use in school and university settings, and therefore, its measurement invariance across diverse groups is critical. In this investigation, a large sample, representative of a major university on key demographic variables, was collected, and the resulting data were analyzed using a multi-group, multidimensional item-response theory model-comparison procedure. No significant differential item functioning was found on any of the TORR items across any of the demographic groups of interest. This finding is interpreted as evidence of the cultural fairness of the TORR, and potential test-development choices that may have contributed to that cultural fairness are discussed.

  15. Feasibility of a multiple-choice mini mental state examination for chronically critically ill patients.

    Miguélez, Marta; Merlani, Paolo; Gigon, Fabienne; Verdon, Mélanie; Annoni, Jean-Marie; Ricou, Bara

    2014-08-01

    Following treatment in an ICU, up to 70% of chronically critically ill patients present neurocognitive impairment that can have negative effects on their quality of life, daily activities, and return to work. The Mini Mental State Examination is a simple, widely used tool for neurocognitive assessment. Although of interest when evaluating ICU patients, the current version is restricted to patients who are able to speak. This study aimed to evaluate the feasibility of a visual, multiple-choice Mini Mental State Examination for ICU patients who are unable to speak. The multiple-choice Mini Mental State Examination and the standard Mini Mental State Examination were compared across three different speaking populations. The interrater and intrarater reliabilities of the multiple-choice Mini Mental State Examination were tested on both intubated and tracheostomized ICU patients. Mixed 36-bed ICU and neuropsychology department in a university hospital. Twenty-six healthy volunteers, 20 neurological patients, 46 ICU patients able to speak, and 30 intubated or tracheostomized ICU patients. None. Multiple-choice Mini Mental State Examination results correlated satisfactorily with standard Mini Mental State Examination results in all three speaking groups: healthy volunteers: intraclass correlation coefficient = 0.43 (95% CI, -0.18 to 0.62); neurology patients: 0.90 (95% CI, 0.82-0.95); and ICU patients able to speak: 0.86 (95% CI, 0.70-0.92). The interrater and intrarater reliabilities were good (0.95 [0.87-0.98] and 0.94 [0.31-0.99], respectively). In all populations, a Bland-Altman analysis showed systematically higher scores using the multiple-choice Mini Mental State Examination. Administration of the multiple-choice Mini Mental State Examination to ICU patients was straightforward and produced exploitable results comparable to those of the standard Mini Mental State Examination. It should be of interest for the assessment and monitoring of the neurocognitive

  16. Comparison between Two Assessment Methods; Modified Essay Questions and Multiple Choice Questions

    Assadi S.N.* MD

    2015-09-01

    Full Text Available Aims Using the best assessment methods is an important factor in the educational development of health students. Modified essay questions and multiple choice questions are two prevalent methods of assessing students. The aim of this study was to compare the modified essay questions and multiple choice questions methods in occupational health engineering and work laws courses. Materials & Methods This semi-experimental study was performed during 2013 to 2014 on occupational health students of Mashhad University of Medical Sciences. The class of the occupational health and work laws course in 2013 was considered as group A and the class of 2014 as group B. Each group had 50 students. The group A students were assessed by the modified essay questions method and the group B students by the multiple choice questions method. Data were analyzed in SPSS 16 software by paired T test and odds ratio. Findings The mean grade of the occupational health and work laws course was 18.68±0.91 in group A (modified essay questions) and 18.78±0.86 in group B (multiple choice questions), which was not significantly different (t=-0.41; p=0.684). The mean grades of the chemical chapter (p<0.001) in occupational health engineering and of the harmful work law (p<0.001) and other (p=0.015) chapters in work laws were significantly different between the two groups. Conclusion The modified essay questions and multiple choice questions methods have nearly the same value for assessing students in the occupational health engineering and work laws courses.

  17. IRT-Estimated Reliability for Tests Containing Mixed Item Formats

    Shu, Lianghua; Schwarz, Richard D.

    2014-01-01

    As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted…

  18. Correcting Grade Deflation Caused by Multiple-Choice Scoring.

    Baranchik, Alvin; Cherkas, Barry

    2000-01-01

    Presents a study involving three sections of pre-calculus (n=181) at four-year college where partial credit scoring on multiple-choice questions was examined over an entire semester. Indicates that grades determined by partial credit scoring seemed more reflective of both the quantity and quality of student knowledge than grades determined by…

  19. Using the Multiple Choice Procedure to Measure College Student Gambling

    Butler, Leon Harvey

    2010-01-01

    Research suggests that gambling is similar to addictive behaviors such as substance use. In the current study, gambling was investigated from a behavioral economics perspective. The Multiple Choice Procedure (MCP) with gambling as the target behavior was used to assess for relative reinforcing value, the effect of alternative reinforcers, and…

  20. Multiple choice questions as a tool for assessment in medical ...

    Methods For this review, a PubMed online search was carried out for English language ... Advantages and disadvantages of MCQs in medical education are ... multiple-choice question meets many of the educational requirements for an assessment method. The use of automation for grading and low costs make it a viable ...

  1. Multiple choice questions in electronics and electrical engineering

    DAVIES, T J

    2013-01-01

    A unique compendium of over 2000 multiple choice questions for students of electronics and electrical engineering. This book is designed for the following City and Guilds courses: 2010, 2240, 2320, 2360. It can also be used as a resource for practice questions for any vocational course.

  2. Bayes Factor Covariance Testing in Item Response Models.

    Fox, Jean-Paul; Mulder, Joris; Sinharay, Sandip

    2017-12-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning the underlying covariance structure are evaluated using (fractional) Bayes factor tests. The support for a unidimensional factor (i.e., assumption of local independence) and differential item functioning are evaluated by testing the covariance components. The posterior distribution of common covariance components is obtained in closed form by transforming latent responses with an orthogonal (Helmert) matrix. This posterior distribution is defined as a shifted-inverse-gamma, thereby introducing a default prior and a balanced prior distribution. Based on that, an MCMC algorithm is described to estimate all model parameters and to compute (fractional) Bayes factor tests. Simulation studies are used to show that the (fractional) Bayes factor tests have good properties for testing the underlying covariance structure of binary response data. The method is illustrated with two real data studies.
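
The Helmert transformation mentioned in the abstract is easy to illustrate: an orthonormal Helmert matrix diagonalizes a compound symmetry covariance matrix, separating one "common" component from n-1 equal residual components. A sketch with arbitrary variance values:

```python
import numpy as np

def helmert(n):
    """Orthonormal Helmert matrix: the first row is the constant contrast;
    the remaining rows are mutually orthogonal contrasts."""
    H = np.zeros((n, n))
    H[0] = 1.0 / np.sqrt(n)
    for i in range(1, n):
        H[i, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i, i] = -i / np.sqrt(i * (i + 1))
    return H

# Compound symmetry: common variance on the diagonal, common covariance off it.
n, var, cov = 4, 1.0, 0.3
cs = cov * np.ones((n, n)) + (var - cov) * np.eye(n)

H = helmert(n)
diag = H @ cs @ H.T   # diagonal: var + (n-1)*cov once, then var - cov repeated
```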

  3. Reducing the number of options on multiple-choice questions: response time, psychometrics and standard setting.

    Schneid, Stephen D; Armour, Chris; Park, Yoon Soo; Yudkowsky, Rachel; Bordage, Georges

    2014-10-01

    Despite significant evidence supporting the use of three-option multiple-choice questions (MCQs), these are rarely used in written examinations for health professions students. The purpose of this study was to examine the effects of reducing four- and five-option MCQs to three-option MCQs on response times, psychometric characteristics, and absolute standard setting judgements in a pharmacology examination administered to health professions students. We administered two versions of a computerised examination containing 98 MCQs to 38 Year 2 medical students and 39 Year 3 pharmacy students. Four- and five-option MCQs were converted into three-option MCQs to create two versions of the examination. Differences in response time, item difficulty and discrimination, and reliability were evaluated. Medical and pharmacy faculty judges provided three-level Angoff (TLA) ratings for all MCQs for both versions of the examination to allow the assessment of differences in cut scores. Students answered three-option MCQs an average of 5 seconds faster than they answered four- and five-option MCQs (36 seconds versus 41 seconds; p = 0.008). There were no significant differences in item difficulty and discrimination, or test reliability. Overall, the cut scores generated for three-option MCQs using the TLA ratings were 8 percentage points higher (p = 0.04). The use of three-option MCQs in a health professions examination resulted in a time saving equivalent to the completion of 16% more MCQs per 1-hour testing period, which may increase content validity and test score reliability, and minimise construct under-representation. The higher cut scores may result in higher failure rates if an absolute standard setting method, such as the TLA method, is used. The results from this study provide a cautious indication to health professions educators that using three-option MCQs does not threaten validity and may strengthen it by allowing additional MCQs to be tested in a fixed amount

  4. Applying modern psychometric techniques to melodic discrimination testing: Item response theory, computerised adaptive testing, and automatic item generation.

    Harrison, Peter M C; Collins, Tom; Müllensiefen, Daniel

    2017-06-15

    Modern psychometric theory provides many useful tools for ability testing, such as item response theory, computerised adaptive testing, and automatic item generation. However, these techniques have yet to be integrated into mainstream psychological practice. This is unfortunate, because modern psychometric techniques can bring many benefits, including sophisticated reliability measures, improved construct validity, avoidance of exposure effects, and improved efficiency. In the present research we therefore use these techniques to develop a new test of a well-studied psychological capacity: melodic discrimination, the ability to detect differences between melodies. We calibrate and validate this test in a series of studies. Studies 1 and 2 respectively calibrate and validate an initial test version, while Studies 3 and 4 calibrate and validate an updated test version incorporating additional easy items. The results support the new test's viability, with evidence for strong reliability and construct validity. We discuss how these modern psychometric techniques may also be profitably applied to other areas of music psychology and psychological science in general.

  5. Practical Usage of Multiple-Choice Questions as Part of Learning and Self-Evaluation

    Paula Kangasniemi

    2016-12-01

    Full Text Available The poster describes how multiple-choice questions can be a part of learning, not only of assessment. We often think of the role of questions only as testing the student's skills. We have tested how questions can be part of learning in our web-based course on information retrieval at Lapland University. In web-based learning there is a need for high-quality mediators. Mediators are learning promoters which trigger, support, and amplify learning. Mediators can be human mediators or tool mediators. Tool mediators are, for example, tests, tutorials, guides, and diaries. Multiple-choice questions can also be learning promoters which select, interpret, and amplify objects for learning. What do you have to take into account when preparing multiple-choice questions as mediators? First you have to prioritize teaching objectives: what must be known and what should be known. Based on our experience with contact learning, you can assess which things students have problems with and need more guidance on. The most important addition to the questions is feedback during practice. Whether an answer is wrong or right is not the point; what matters is that the feedback on the answers guides students on how to search. The questions promote students' self-regulation and self-evaluation. Feedback can be verbal, a screenshot, or a video. We have added verbal feedback for every question, as well as some screenshots and eight videos, in our web-based course.

  6. Graded Multiple Choice Questions: Rewarding Understanding and Preventing Plagiarism

    Denyer, G. S.; Hancock, D.

    2002-08-01

    This paper describes an easily implemented method that allows the generation and analysis of graded multiple-choice examinations. The technique, which uses standard functions in user-end software (Microsoft Excel 5+), can also produce several different versions of an examination that can be employed to prevent the reward of plagiarism. The manuscript also discusses the advantages of having a graded marking system for the elimination of ambiguities, use in multi-step calculation questions, and questions that require extrapolation or reasoning. The advantages of the scrambling strategy, which maintains the same question order, are discussed with reference to student equity. The system provides a non-confrontational mechanism for dealing with cheating in large-class multiple-choice examinations, as well as providing a reward for problem solving over surface learning.

  7. The Technical Quality of Test Items Generated Using a Systematic Approach to Item Writing.

    Siskind, Theresa G.; Anderson, Lorin W.

    The study was designed to examine the similarity of response options generated by different item writers using a systematic approach to item writing. The similarity of response options to student responses for the same item stems presented in an open-ended format was also examined. A non-systematic (subject matter expertise) approach and a…

  8. Using Tests as Learning Opportunities.

    Foos, Paul W.; Fisher, Ronald P.

    1988-01-01

    A study involving 105 undergraduates assessed the value of testing as a means of increasing, rather than simply monitoring, learning. Results indicate that fill-in-the-blank and items requiring student inferences were more effective, respectively, than multiple-choice tests and verbatim items in furthering student learning. (TJH)

  9. MCQ testing in higher education: Yes, there are bad items and invalid scores—A case study identifying solutions

    Brown, Gavin

    2017-01-01

    This is a lecture given at Umea University, Sweden in September 2017. It is based on the published study: Brown, G. T. L., & Abdulnabi, H. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education: Assessment, Testing, & Applied Measurement, 2(24). doi:10.3389/feduc.2017.00024

  10. Development of a lack of appetite item bank for computer-adaptive testing (CAT)

    Thamsborg, Lise Laurberg Holst; Petersen, Morten Aa; Aaronson, Neil K

    2015-01-01

    to 12 lack of appetite items. CONCLUSIONS: Phases 1-3 resulted in 12 lack of appetite candidate items. Based on field testing (phase 4), the psychometric characteristics of the items will be assessed and the final item bank will be generated. This CAT item bank is expected to provide precise

  11. Delayed, but not immediate, feedback after multiple-choice questions increases performance on a subsequent short-answer, but not multiple-choice, exam: evidence for the dual-process theory of memory.

    Sinha, Neha; Glass, Arnold Lewis

    2015-01-01

    Three experiments, two performed in the laboratory and one embedded in a college psychology lecture course, investigated the effects of immediate versus delayed feedback following a multiple-choice exam on subsequent short answer and multiple-choice exams. Performance on the subsequent multiple-choice exam was not affected by the timing of the feedback on the prior exam; however, performance on the subsequent short answer exam was better following delayed than following immediate feedback. This was true regardless of the order in which immediate versus delayed feedback was given. Furthermore, delayed feedback only had a greater effect than immediate feedback on subsequent short answer performance following correct, confident responses on the prior exam. These results indicate that delayed feedback cues a student's prior response and increases subsequent recollection of that response. The practical implication is that delayed feedback is better than immediate feedback during academic testing.

  12. An emotional functioning item bank of 24 items for computerized adaptive testing (CAT) was established

    Petersen, Morten Aa.; Gamper, Eva-Maria; Costantini, Anna

    2016-01-01

    of the widely used EORTC Quality of Life questionnaire (QLQ-C30). STUDY DESIGN AND SETTING: On the basis of literature search and evaluations by international samples of experts and cancer patients, 38 candidate items were developed. The psychometric properties of the items were evaluated in a large international sample of cancer patients. This included evaluations of dimensionality, item response theory (IRT) model fit, differential item functioning (DIF), and of measurement precision/statistical power. RESULTS: Responses were obtained from 1,023 cancer patients from four countries. The evaluations showed that 24 items could be included in a unidimensional IRT model. DIF did not seem to have any significant impact on the estimation of EF. Evaluations indicated that the CAT measure may reduce sample size requirements by up to 50% compared to the QLQ-C30 EF scale without reducing power. CONCLUSION...

  13. Effects of Differential Item Functioning on Examinees' Test Performance and Reliability of Test

    Lee, Yi-Hsuan; Zhang, Jinming

    2017-01-01

    Simulations were conducted to examine the effect of differential item functioning (DIF) on measurement consequences such as total scores, item response theory (IRT) ability estimates, and test reliability in terms of the ratio of true-score variance to observed-score variance and the standard error of estimation for the IRT ability parameter. The…
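
The reliability definition used here, the ratio of true-score variance to observed-score variance, can be checked with a quick simulation; the variance choices below are arbitrary and unrelated to the paper's DIF conditions:

```python
import random

random.seed(7)

# Reliability as var(true) / var(observed), estimated by simulation.
n_examinees = 20000
true_sd, error_sd = 1.0, 0.5

true_scores = [random.gauss(0.0, true_sd) for _ in range(n_examinees)]
observed = [t + random.gauss(0.0, error_sd) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Expected value: 1.0 / (1.0 + 0.5**2) = 0.8
reliability = variance(true_scores) / variance(observed)
```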

  14. Stochastic order in dichotomous item response models for fixed tests, adaptive tests, or multiple abilities

    van der Linden, Willem J.

    1995-01-01

    Dichotomous item response theory (IRT) models can be viewed as families of stochastically ordered distributions of responses to test items. This paper explores several properties of such distributions. The focus is on the conditions under which stochastic order in families of conditional

  15. An Effect Size Measure for Raju's Differential Functioning for Items and Tests

    Wright, Keith D.; Oshima, T. C.

    2015-01-01

    This study established an effect size measure for differential functioning for items and tests' noncompensatory differential item functioning (NCDIF). The Mantel-Haenszel parameter served as the benchmark for developing NCDIF's effect size measure for reporting moderate and large differential item functioning in test items. The effect size of…

  16. Effects of Repeated Testing on Short- and Long-Term Memory Performance across Different Test Formats

    Stenlund, Tova; Sundström, Anna; Jonsson, Bert

    2016-01-01

    This study examined whether practice testing with short-answer (SA) items benefits learning over time compared to practice testing with multiple-choice (MC) items, and rereading the material. More specifically, the aim was to test the hypotheses of "retrieval effort" and "transfer appropriate processing" by comparing retention…

  17. Distance teaching using self-marking multiple choice questions.

    Poore, P

    1987-01-01

    In Papua New Guinea health extension officers receive a 3-year course of training in college, followed by a period of in-service training in hospital. They are then posted to a health center, where they are in charge of all health services within their district. While the health extension officers received an excellent basic training, and were provided with books and appropriate, locally produced texts, they often spent months or even years after graduation in remote rural health centers with little communication from colleagues. This paper describes an attempt to improve communication, and to provide support inexpensively by post. Multiple choice questions, with a system for self-marking, were sent by post to rural health workers. Multiple choice questions are used in the education system in Papua New Guinea, and all health extension officers are familiar with the technique. The most obvious and immediate result was the great enthusiasm shown by the majority of health staff involved. In this way a useful exchange of correspondence was established. With this exchange of information and recognition of each other's problems, the quality of patient care must improve.

  18. A simple and fast item selection procedure for adaptive testing

    Veerkamp, W.J.J.; Veerkamp, Wim J.J.; Berger, Martijn; Berger, Martijn P.F.

    1994-01-01

    Items with the highest discrimination parameter values in a logistic item response theory (IRT) model do not necessarily give maximum information. This paper shows which discrimination parameter values (as a function of the guessing parameter and the distance between person ability and item
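
The point of this record, that the highest-discrimination item need not yield maximum information at a given ability, can be sketched with 2PL item information and a hypothetical three-item bank:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical item bank of (a, b) pairs and a provisional ability estimate.
bank = [(0.8, 0.1), (1.5, 1.4), (1.2, -0.2)]
theta_hat = 0.0

# Maximum-information selection: the highest-a item (a = 1.5) loses here
# because its difficulty (b = 1.4) lies far from theta_hat.
best = max(bank, key=lambda ab: info_2pl(theta_hat, *ab))
```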

  19. Development of a Mechanical Engineering Test Item Bank to promote learning outcomes-based education in Japanese and Indonesian higher education institutions

    Jeffrey S. Cross

    2017-11-01

    Full Text Available Following on the 2008-2012 OECD Assessment of Higher Education Learning Outcomes (AHELO) feasibility study of civil engineering, a mechanical engineering learning outcomes assessment working group was established in Japan within the National Institute of Education Research (NIER), which became the Tuning National Center for Japan. The purpose of the project is to develop, among engineering faculty members, common understandings of engineering learning outcomes through the collaborative process of test item development, scoring, and sharing of results. By substantiating abstract-level learning outcomes into concrete-level learning outcomes that are attainable and assessable, and through measuring and comparing students' achievement of learning outcomes, it is anticipated that faculty members will be able to draw practical implications for educational improvement at the program and course levels. The development of a mechanical engineering test item bank began with test item development workshops, which led to a series of trial tests and then to a large-scale test implementation in 2016 involving 348 first-semester master's students in 9 institutions in Japan, using both multiple choice questions designed to measure mastery of basic and engineering sciences and a constructed-response task designed to measure "how well students can think like an engineer." The same set of test items was translated from Japanese into English and Indonesian and used to measure achievement of learning outcomes at Indonesia's Institut Teknologi Bandung (ITB) on 37 rising fourth-year undergraduate students. This paper highlights how learning outcomes assessment can effectively facilitate learning outcomes-based education, by documenting the experience of Japanese and Indonesian mechanical engineering faculty members engaged in the NIER Test Item Bank project. First published online: 30 November 2017

  20. An Alternative to the 3PL: Using Asymmetric Item Characteristic Curves to Address Guessing Effects

    Lee, Sora; Bolt, Daniel M.

    2018-01-01

    Both the statistical and interpretational shortcomings of the three-parameter logistic (3PL) model in accommodating guessing effects on multiple-choice items are well documented. We consider the use of a residual heteroscedasticity (RH) model as an alternative, and compare its performance to the 3PL with real test data sets and through simulation…
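    The 3PL model this record critiques has a well-known item response function in which the pseudo-guessing parameter acts as a lower asymptote; a minimal sketch with illustrative parameter values:

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) probability of a correct response.

    a: discrimination, b: difficulty, c: pseudo-guessing lower asymptote.
    Parameter values used below are illustrative, not from the study.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For a very low-ability examinee the success probability approaches c,
# which is the guessing interpretation the abstract refers to.
print(round(p_3pl(-10.0, 1.2, 0.0, 0.25), 3))  # close to 0.25
print(round(p_3pl(0.0, 1.2, 0.0, 0.25), 3))    # at theta = b: c + (1-c)/2 = 0.625
```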

  1. Detection of person misfit in computerized adaptive tests with polytomous items

    van Krimpen-Stoop, Edith; Meijer, R.R.

    2000-01-01

    Item scores that do not fit an assumed item response theory model may cause the latent trait value to be estimated inaccurately. For computerized adaptive tests (CAT) with dichotomous items, several person-fit statistics for detecting nonfitting item score patterns have been proposed. Both for

  2. Uncertainties in the Item Parameter Estimates and Robust Automated Test Assembly

    Veldkamp, Bernard P.; Matteucci, Mariagiulia; de Jong, Martijn G.

    2013-01-01

    Item response theory parameters have to be estimated, and because of the estimation process, they do have uncertainty in them. In most large-scale testing programs, the parameters are stored in item banks, and automated test assembly algorithms are applied to assemble operational test forms. These algorithms treat item parameters as fixed values,…

  3. Bayes factor covariance testing in item response models

    Fox, J.P.; Mulder, J.; Sinharay, Sandip

    2017-01-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning

  4. Bayes Factor Covariance Testing in Item Response Models

    Fox, Jean-Paul; Mulder, Joris; Sinharay, Sandip

    2017-01-01

    Two marginal one-parameter item response theory models are introduced, by integrating out the latent variable or random item parameter. It is shown that both marginal response models are multivariate (probit) models with a compound symmetry covariance structure. Several common hypotheses concerning

  5. Project Physics Tests 1, Concepts of Motion.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 1 are presented in this booklet, consisting of 70 multiple-choice and 20 problem-and-essay questions. Concepts of motion are examined with respect to velocities, acceleration, forces, vectors, Newton's laws, and circular motion. Suggestions are made regarding the time to allot for answering some items. Besides…

  6. Creating a Database for Test Items in National Examinations (pp ...

    Nekky Umera

    different programmers create files and application programs over a long period. .... In theory or essay questions, alternative methods of solving problems are explored and ... Unworthy items are those that do not focus on the central concept or.

  7. Item response theory analysis of the life orientation test-revised: age and gender differential item functioning analyses.

    Steca, Patrizia; Monzani, Dario; Greco, Andrea; Chiesi, Francesca; Primi, Caterina

    2015-06-01

    This study is aimed at testing the measurement properties of the Life Orientation Test-Revised (LOT-R) for the assessment of dispositional optimism by employing item response theory (IRT) analyses. The LOT-R was administered to a large sample of 2,862 Italian adults. First, confirmatory factor analyses demonstrated the theoretical conceptualization of the construct measured by the LOT-R as a single bipolar dimension. Subsequently, IRT analyses for polytomous, ordered response category data were applied to investigate the items' properties. The equivalence of the items across gender and age was assessed by analyzing differential item functioning. Discrimination and severity parameters indicated that all items were able to distinguish people with different levels of optimism and adequately covered the spectrum of the latent trait. Additionally, the LOT-R appears to be gender invariant and, with minor exceptions, age invariant. Results provided evidence that the LOT-R is a reliable and valid measure of dispositional optimism. © The Author(s) 2014.

  8. The Effect of English Language on Multiple Choice Question Scores of Thai Medical Students.

    Phisalprapa, Pochamana; Muangkaew, Wayuda; Assanasen, Jintana; Kunavisarut, Tada; Thongngarm, Torpong; Ruchutrakool, Theera; Kobwanthanakun, Surapon; Dejsomritrutai, Wanchai

    2016-04-01

    Universities in Thailand are preparing for Thailand's integration into the ASEAN Economic Community (AEC) by increasing the number of tests given in English. English is not the native language of Thailand; differences in English proficiency may affect scores among test-takers even when their subject knowledge is comparable, and may thus falsely represent the knowledge level of the test-taker. Objective: to study the impact of English-language multiple choice test questions on the test scores of medical students. The final examination of fourth-year medical students completing the internal medicine rotation contains 120 multiple choice questions (MCQ), written in Thai and English at a ratio of 3:1. Individual scores on tests taken in both languages were collected, the effect of English language on MCQ performance was analyzed, and individual MCQ scores were compared with each student's English language proficiency and grade point average (GPA). Two hundred ninety-five fourth-year medical students were enrolled. The mean percentage MCQ scores in Thai and English were significantly different (65.0 ± 8.4 and 56.5 ± 12.4, respectively; p < …). The correlation of English MCQ scores with English proficiency was fair (Spearman's correlation coefficient = 0.41, p < …), and scores were lower in English than in Thai. Students were classified into six grade categories (A, B+, B, C+, C, and D+), which cumulatively measured total internal medicine rotation performance plus the final examination score. MCQ scores from the Thai-language examination were more closely correlated with total course grades than were the scores from the English-language examination (Spearman's correlation coefficient = 0.73, p < …). The mean English proficiency score was very high, at 3.71 ± 0.35 of a possible 4.00. Mean student GPA was 3.40 ± 0.33 of a possible 4.00. English-language MCQ examination scores were more highly associated with GPA than with English language proficiency. The use of English-language multiple choice question tests may decrease scores

  9. International Semiotics: Item Difficulty and the Complexity of Science Item Illustrations in the PISA-2009 International Test Comparison

    Solano-Flores, Guillermo; Wang, Chao; Shade, Chelsey

    2016-01-01

    We examined multimodality (the representation of information in multiple semiotic modes) in the context of international test comparisons. Using Programme for International Student Assessment (PISA) 2009 data, we examined the correlation of the difficulty of science items and the complexity of their illustrations. We observed statistically…

  10. The detection of cheating in multiple choice examinations

    Richmond, Peter; Roehner, Bertrand M.

    2015-10-01

    Cheating in examinations is acknowledged by an increasing number of organizations to be widespread. We examine two different approaches to assess their effectiveness at detecting anomalous results, suggestive of collusion, using data taken from a number of multiple-choice examinations organized by the UK Radio Communication Foundation. Analysis of student-pair overlaps of correct answers is shown to give results consistent with more orthodox statistical correlations, for which confidence limits (as opposed to the less familiar "Bonferroni method") can be used. A simulation approach is also developed which confirms the interpretation of the empirical approach. The simulation builds on a construction of exchangeable correlated binaries: the variables X_i = (1 − U_i) Y_i + U_i Z form a system of symmetric dependent binary (0, 1; p) variables whose correlation matrix is constant, ρ_ij = r; the proof is given in the paper. Two remarks: the expression "symmetric variables" reflects the fact that all X_i play the same role ("exchangeable variables" is often used with the same meaning), and the correlation matrix has only positive elements, as imposed by the symmetry condition, since a negative ρ_12 would violate the symmetry requirement. The paper then takes up whether this construction yields a unique set of X_i or only one among many.
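    The exchangeable-binary construction quoted in this abstract can be simulated directly; a sketch assuming the standard Bernoulli(√r) mixing probability for U_i (the snippet does not state the mixing probability explicitly):

```python
import random

def sample_exchangeable_binaries(n_vars, p, r, rng):
    """One draw of X_1..X_n with X_i = (1 - U_i) * Y_i + U_i * Z.

    Y_i, Z ~ Bernoulli(p) independent; U_i ~ Bernoulli(sqrt(r)), so any
    pair (X_i, X_j) shares the common Z with probability r and the
    pairwise correlation is r.  (The sqrt(r) mixing probability is an
    assumption here, not taken from the abstract.)
    """
    z = 1 if rng.random() < p else 0
    xs = []
    for _ in range(n_vars):
        u = 1 if rng.random() < r ** 0.5 else 0
        y = 1 if rng.random() < p else 0
        xs.append(z if u else y)
    return xs

rng = random.Random(7)
p, r, reps = 0.3, 0.25, 200_000
draws = [sample_exchangeable_binaries(2, p, r, rng) for _ in range(reps)]
mean = sum(x1 for x1, _ in draws) / reps
cov = sum((x1 - mean) * (x2 - mean) for x1, x2 in draws) / reps
var = p * (1 - p)  # theoretical Bernoulli variance
print(cov / var)   # empirical rho, close to r = 0.25
```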

  11. Multiple Choice Knapsack Problem: example of planning choice in transportation.

    Zhong, Tao; Young, Rhonda

    2010-05-01

    Transportation programming, a process of selecting projects for funding given budget and other constraints, is becoming more complex as a result of new federal laws, local planning regulations, and increased public involvement. This article describes the use of an integer programming tool, Multiple Choice Knapsack Problem (MCKP), to provide optimal solutions to transportation programming problems in cases where alternative versions of projects are under consideration. In this paper, optimization methods for use in the transportation programming process are compared and then the process of building and solving the optimization problems is discussed. The concepts about the use of MCKP are presented and a real-world transportation programming example at various budget levels is provided. This article illustrates how the use of MCKP addresses the modern complexities and provides timely solutions in transportation programming practice. While the article uses transportation programming as a case study, MCKP can be useful in other fields where a similar decision among a subset of the alternatives is required. Copyright 2009 Elsevier Ltd. All rights reserved.
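    The MCKP itself (choose exactly one alternative version per project, maximizing benefit within a budget) admits a compact dynamic-programming sketch; the project costs and benefits below are made up for illustration:

```python
def solve_mckp(groups, budget):
    """Multiple-Choice Knapsack via dynamic programming over spent budget.

    groups: list of lists of (cost, value) alternatives; exactly one
    alternative must be chosen from each group.  Returns the best
    (total_value, chosen_indices) pair, or None if infeasible.
    """
    # best[spent] = (value, choices) achievable at that exact total cost
    best = {0: (0, [])}
    for group in groups:
        nxt = {}
        for spent, (val, picks) in best.items():
            for idx, (cost, gain) in enumerate(group):
                b = spent + cost
                if b > budget:
                    continue
                cand = (val + gain, picks + [idx])
                if b not in nxt or cand[0] > nxt[b][0]:
                    nxt[b] = cand
        best = nxt
        if not best:
            return None  # no alternative of this group fits the budget
    return max(best.values(), key=lambda t: t[0])

# Three hypothetical projects, each with cheap/expensive versions (cost, benefit):
projects = [[(2, 4), (5, 9)], [(3, 5), (6, 8)], [(1, 2), (4, 7)]]
value, choice = solve_mckp(projects, budget=12)
print(value, choice)  # 21 [1, 0, 1]
```

    Under a budget of 12 the optimum mixes versions: the expensive first and third projects with the cheap second one, which is exactly the "alternative versions" trade-off the article describes.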

  12. The "Sniffin' Kids" test--a 14-item odor identification test for children.

    Valentin A Schriever

    Tools for measuring olfactory function in adults have been well established. Although studies have shown that olfactory impairment in children may occur as a consequence of a number of diseases or head trauma, to date no consensus on how to evaluate the sense of smell in children exists in Europe. The aim of the study was to develop a modified "Sniffin' Sticks" odor identification test, the "Sniffin' Kids" test, for use in children. In this study, 537 children between 6-17 years of age were included. Fourteen odors, which were identified at a high rate by children, were selected from the "Sniffin' Sticks" 16-item odor identification test. Normative data for the 14-item "Sniffin' Kids" odor identification test were obtained. The test was validated by including a group of congenitally anosmic children. Results show that the "Sniffin' Kids" test is able to discriminate between normosmia and anosmia with a cutoff value of >7 points on the odor identification test. In addition, the test-retest reliability was investigated in a group of 31 healthy children and shown to be ρ = 0.44. With the 14-item "Sniffin' Kids" odor identification test we present a valid and reliable test for measuring olfactory function in children between the ages of 6 and 17 years.

  13. Can Free-Response Questions Be Approximated by Multiple-Choice Equivalents?

    Lin, Shih-Yin; Singh, Chandralekha

    2016-01-01

    We discuss a study to evaluate the extent to which free-response questions can be approximated by multiple-choice equivalents. Two carefully designed research-based multiple-choice questions were transformed into a free-response format and administered on the final exam in a calculus-based introductory physics course. The original multiple-choice questions were administered in another, similar introductory physics course on the final exam. Our findings suggest that carefully designed multiple...

  14. Qualitätsverbesserung von MC Fragen [Quality assurance of Multiple Choice Questions

    Rotthoff, Thomas

    2006-08-01

    Because graded examinations previously played no role at the German medical faculties, there was no need to consistently reflect on question quality in written examinations. The new national legislation for medical education (Approbationsordnung) now prescribes certification-relevant, faculty-internal examinations. Structures and processes that could lead to an improvement in question quality are still lacking. To reflect the differing performance of students, test questions of differing difficulty and good discrimination are necessary. For an interdisciplinary examination of fourth-year undergraduate students at the University Hospital Duesseldorf, new multiple-choice (MC) questions were to be developed that are application-oriented, clearly formulated, and largely free of formal errors. Implementation took the form of workshops on the construction of MC questions and the appointment of an interdisciplinary review committee. It was shown that author training facilitates and accelerates the review process for the committee, and that a review process is reflected in a high practical orientation of the items. Prospectively, high-quality questions created through a review process and psychometrically analyzed could be loaded into inter-university databases, reducing the initial expenditure of time. The interdisciplinary composition of the review committee offers the possibility of intensified discussion of the content and relevance of the questions.

  15. Modeling Incorrect Responses to Multiple-Choice Items with Multilinear Formula Score Theory.

    1987-08-01


  16. The Effect of Error in Item Parameter Estimates on the Test Response Function Method of Linking.

    Kaskowitz, Gary S.; De Ayala, R. J.

    2001-01-01

    Studied the effect of item parameter estimation for computation of linking coefficients for the test response function (TRF) linking/equating method. Simulation results showed that linking was more accurate when there was less error in the parameter estimates, and that 15 or 25 common items provided better results than 5 common items under both…

  17. Statistical Indexes for Monitoring Item Behavior under Computer Adaptive Testing Environment.

    Zhu, Renbang; Yu, Feng; Liu, Su

    A computerized adaptive test (CAT) administration usually requires a large supply of items with accurately estimated psychometric properties, such as item response theory (IRT) parameter estimates, to ensure the precision of examinee ability estimation. However, an estimated IRT model of a given item in any given pool does not always correctly…

  18. Development of Test Items Related to Selected Concepts Within the Scheme the Particle Nature of Matter.

    Doran, Rodney L.; Pella, Milton O.

    The purpose of this study was to develop test items with a minimum reading demand for use with pupils at grade levels two through six. An item was judged acceptable if it satisfied at least four of six criteria. Approximately 250 students in grades 2-6 participated in the study. Half of the students were given instruction to develop…

  19. Projective Item Response Model for Test-Independent Measurement

    Ip, Edward Hak-Sing; Chen, Shyh-Huei

    2012-01-01

    The problem of fitting unidimensional item-response models to potentially multidimensional data has been extensively studied. The focus of this article is on response data that contains a major dimension of interest but that may also contain minor nuisance dimensions. Because fitting a unidimensional model to multidimensional data results in…

  20. Augmenting Fellow Education Through Spaced Multiple-Choice Questions.

    Barsoumian, Alice E; Yun, Heather C

    2018-01-01

    The San Antonio Uniformed Services Health Education Consortium Infectious Disease Fellowship program historically included a monthly short-answer and multiple-choice quiz. The intent was to ensure medical knowledge in relevant content areas that may not be addressed through clinical rotations, such as operationally relevant infectious disease. After completion, the quiz was discussed in a small group with faculty. Over time, faculty noted increasing dissatisfaction with the activity. Spaced-interval education is useful for the retention of medical knowledge and skills by medical students and residents; its use in infectious disease fellow education has not been described. To improve the quiz experience, we assessed the introduction of a spaced education curriculum in our program. A pre-intervention survey was distributed to assess the monthly quiz with Likert-scale and open-ended questions. A multiple-choice question spaced education curriculum was created using the Qstream® platform in 2011. Faculty development on question writing was conducted. Two questions were delivered every 2 d. Incorrectly and correctly answered questions were repeated after 7 and 13 d, respectively. Questions needed to be answered correctly twice to be retired. Fellow satisfaction was assessed at semi-annual fellowship reviews over 5 yr and by a one-time repeat survey. The pre-intervention survey of six fellows indicated dissatisfaction with the time commitment of the monthly quiz (median Likert score of 2; mean 6.5 h to complete), neutral perceived utility, but satisfaction with knowledge retention (Likert score 4). Eighteen fellows over 5 yr participated in the spaced education curriculum. Three quizzes with 20, 39, and 48 questions were designed. Seventeen percent of questions addressed operationally relevant topics. Fifty-nine percent of questions were answered correctly on first attempt, improving to a 93% correct-answer rate at the end of the analysis. Questions were attempted 2,999 times
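    The repetition schedule described above (missed questions repeated after 7 days, correct ones after 13, retirement after two correct answers) can be sketched as a small scheduling rule:

```python
import datetime as dt

def next_review(history, answered_correctly, today):
    """Spaced-repetition rule from the curriculum described above:
    a missed question returns after 7 days, a correct one after 13 days,
    and a question answered correctly twice is retired (returns None).
    `history` is the list of prior outcomes (True = correct).
    """
    outcomes = history + [answered_correctly]
    if outcomes.count(True) >= 2:
        return None  # retired
    delay = 13 if answered_correctly else 7
    return today + dt.timedelta(days=delay)

today = dt.date(2024, 1, 1)
print(next_review([], False, today))            # 2024-01-08
print(next_review([False], True, today))        # 2024-01-14
print(next_review([False, True], True, today))  # None -> retired
```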

  1. An empirical comparison of Item Response Theory and Classical Test Theory

    Špela Progar

    2008-11-01

    Based on nonlinear models between the measured latent variable and the item response, item response theory (IRT) enables independent estimation of item and person parameters and local estimation of measurement error. These properties of IRT are also its main theoretical advantages over classical test theory (CTT). Empirical evidence, however, has often failed to discover consistent differences between IRT and CTT parameters and between invariance measures of CTT and IRT parameter estimates. In this empirical study, a real data set from the Third International Mathematics and Science Study (TIMSS 1995) was used to address the following questions: (1) How comparable are CTT- and IRT-based item and person parameters? (2) How invariant are CTT- and IRT-based item parameters across different participant groups? (3) How invariant are CTT- and IRT-based item and person parameters across different item sets? The findings indicate that the CTT and IRT item/person parameters are very comparable, that the CTT and IRT item parameters show similar invariance properties when estimated across different groups of participants, that the IRT person parameters are more invariant across different item sets, and that the CTT item parameters are at least as invariant across different item sets as the IRT item parameters. The results furthermore demonstrate that, with regard to the invariance property, IRT item/person parameters are in general empirically superior to CTT parameters, but only if the appropriate IRT model is used for modelling the data.
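    For the flavor of such a CTT/IRT comparison, here is a toy sketch contrasting the CTT item difficulty (proportion correct) with a crude logit-based Rasch-style difficulty; this shortcut is for illustration only and is not the joint estimation procedure used in the study:

```python
import math

def ctt_difficulty(responses):
    """CTT item difficulty: proportion of examinees answering correctly."""
    return sum(responses) / len(responses)

def rasch_difficulty_proxy(p_correct):
    """Crude Rasch-style difficulty: negative logit of the CTT p-value.
    Real IRT estimation fits person and item parameters jointly; this
    proxy just shows the monotone correspondence between the scales."""
    return -math.log(p_correct / (1.0 - p_correct))

# Hypothetical scored responses (1 = correct) for three items:
items = {"easy": [1, 1, 1, 0, 1], "medium": [1, 0, 1, 0, 1], "hard": [0, 0, 1, 0, 0]}
for name, resp in items.items():
    p = ctt_difficulty(resp)
    print(name, p, round(rasch_difficulty_proxy(p), 2))  # harder item -> higher b
```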

  2. Pursuing the Qualities of a "Good" Test

    Coniam, David

    2014-01-01

    This article examines the issue of the quality of teacher-produced tests, limiting itself in the current context to objective, multiple-choice tests. The article investigates a short, two-part 20-item English language test. After a brief overview of the key test qualities of reliability and validity, the article examines the two subtests in terms…

  3. Modeling differential item functioning with group-specific item parameters: A computerized adaptive testing application

    Makransky, Guido; Glas, Cornelis A.W.

    2013-01-01

    Many important decisions are made based on the results of tests administered under different conditions in the fields of educational and psychological testing. Inaccurate inferences are often made if the property of measurement invariance (MI) is not assessed across these conditions. The importance

  4. Generalizability theory and item response theory

    Glas, Cornelis A.W.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a

  5. Using Module Analysis for Multiple Choice Responses: A New Method Applied to Force Concept Inventory Data

    Brewe, Eric; Bruun, Jesper; Bearden, Ian G.

    2016-01-01

    We describe "Module Analysis for Multiple Choice Responses" (MAMCR), a new methodology for carrying out network analysis on responses to multiple choice assessments. This method is used to identify modules of non-normative responses which can then be interpreted as an alternative to factor analysis. MAMCR allows us to identify conceptual…

  6. Learning Physics Teaching through Collaborative Design of Conceptual Multiple-Choice Questions

    Milner-Bolotin, Marina

    2015-01-01

    Increasing student engagement through Electronic Response Systems (clickers) has been widely researched. Its success largely depends on the quality of multiple-choice questions used by instructors. This paper describes a pilot project that focused on the implementation of online collaborative multiple-choice question repository, PeerWise, in a…

  7. Using a Classroom Response System to Improve Multiple-Choice Performance in AP[R] Physics

    Bertrand, Peggy

    2009-01-01

    Participation in rigorous high school courses such as Advanced Placement (AP[R]) Physics increases the likelihood of college success, especially for students who are traditionally underserved. Tackling difficult multiple-choice exams should be part of any AP program because well-constructed multiple-choice questions, such as those on AP exams and…

  8. The Answering Process for Multiple-Choice Questions in Collaborative Learning: A Mathematical Learning Model Analysis

    Nakamura, Yasuyuki; Nishi, Shinnosuke; Muramatsu, Yuta; Yasutake, Koichi; Yamakawa, Osamu; Tagawa, Takahiro

    2014-01-01

    In this paper, we introduce a mathematical model for collaborative learning and the answering process for multiple-choice questions. The collaborative learning model is inspired by the Ising spin model and the model for answering multiple-choice questions is based on their difficulty level. An intensive simulation study predicts the possibility of…

  9. Teaching Critical Thinking without (Much) Writing: Multiple-Choice and Metacognition

    Bassett, Molly H.

    2016-01-01

    In this essay, I explore an exam format that pairs multiple-choice questions with required rationales. In a space adjacent to each multiple-choice question, students explain why or how they arrived at the answer they selected. This exercise builds the critical thinking skill known as metacognition, thinking about thinking, into an exam that also…

  10. Application of Item Response Theory to Tests of Substance-related Associative Memory

    Shono, Yusuke; Grenard, Jerry L.; Ames, Susan L.; Stacy, Alan W.

    2015-01-01

    A substance-related word association test (WAT) is one of the commonly used indirect tests of substance-related implicit associative memory and has been shown to predict substance use. This study applied an item response theory (IRT) modeling approach to evaluate the psychometric properties of the alcohol- and marijuana-related WATs and their items among 775 ethnically diverse at-risk adolescents. After examining the IRT assumptions, item fit, and differential item functioning (DIF) across gender and age groups, the original 18 WAT items were reduced to 14 and 15 items in the alcohol- and marijuana-related WATs, respectively. Thereafter, unidimensional one- and two-parameter logistic models (1PL and 2PL models) were fitted to the revised WAT items. The results demonstrated that both alcohol- and marijuana-related WATs have good psychometric properties. These results were discussed in light of the framework of a unified concept of construct validity (Messick, 1975, 1989, 1995). PMID:25134051

  11. Above-Level Test Item Functioning across Examinee Age Groups

    Warne, Russell T.; Doty, Kristine J.; Malbica, Anne Marie; Angeles, Victor R.; Innes, Scott; Hall, Jared; Masterson-Nixon, Kelli

    2016-01-01

    "Above-level testing" (also called "above-grade testing," "out-of-level testing," and "off-level testing") is the practice of administering to a child a test that is designed for an examinee population that is older or in a more advanced grade. Above-level testing is frequently used to help educators design…

  12. Psychometric aspects of item mapping for criterion-referenced interpretation and bookmark standard setting.

    Huynh, Huynh

    2010-01-01

    Locating an item on an achievement continuum (item mapping) is well-established in technical work for educational/psychological assessment. Applications of item mapping may be found in criterion-referenced (CR) testing (or scale anchoring, Beaton and Allen, 1992; Huynh, 1994, 1998a, 2000a, 2000b, 2006), computer-assisted testing, test form assembly, and in standard setting methods based on ordered test booklets. These methods include the bookmark standard setting originally used for the CTB/TerraNova tests (Lewis, Mitzel, Green, and Patz, 1999), the item descriptor process (Ferrara, Perie, and Johnson, 2002) and a similar process described by Wang (2003) for multiple-choice licensure and certification examinations. While item response theory (IRT) models such as the Rasch and two-parameter logistic (2PL) models traditionally place a binary item at its location, Huynh has argued in the cited papers that such mapping may not be appropriate in selecting items for CR interpretation and scale anchoring.
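    Huynh's point about item location can be made concrete: under a 2PL model, mapping an item at a response probability (RP) criterion higher than 0.5 shifts its mapped location above its difficulty b. A sketch (RP = 0.67 is an illustrative choice, common in bookmark-style procedures):

```python
import math

def mapped_location(a, b, rp=0.67):
    """Theta at which a 2PL item reaches response probability `rp`.

    Solves rp = 1 / (1 + exp(-a * (theta - b))) for theta.  With rp = 0.5
    this is just the item's b; criterion-referenced mapping often uses a
    higher rp, which places the item above b by log(rp/(1-rp))/a.
    """
    return b + math.log(rp / (1.0 - rp)) / a

print(mapped_location(1.0, 0.0, 0.5))           # 0.0: the classic IRT location
print(round(mapped_location(1.0, 0.0, 0.67), 3))  # shifted above b
```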

  13. Evaluating an Automated Number Series Item Generator Using Linear Logistic Test Models

    Bao Sheng Loe

    2018-04-01

    This study investigates the item properties of a newly developed Automatic Number Series Item Generator (ANSIG). The foundation of the ANSIG is based on five hypothesised cognitive operators. Thirteen item models were developed using the numGen R package, and eleven were evaluated in this study. The 16-item ICAR (International Cognitive Ability Resource) short-form ability test was used to evaluate construct validity. The Rasch model and two Linear Logistic Test Models (LLTM) were employed to estimate and predict the item parameters. Results indicate that a single factor determines performance on tests composed of items generated by the ANSIG. Under the LLTM approach, all the cognitive operators were significant predictors of item difficulty. Moderate to high correlations were evident between the number series items and the ICAR test scores, with high correlation found for the ICAR letter-numeric-series type items, suggesting adequate nomothetic span. Extended cognitive research is, nevertheless, essential for the automatic generation of an item pool with predictable psychometric properties.
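    The LLTM decomposition behind this kind of analysis expresses each item's Rasch difficulty as a weighted sum of the cognitive operators it involves; in standard notation (the symbols here are generic, not taken from the paper):

```latex
% LLTM: item difficulty as a linear function of cognitive-operator weights
\beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c, \qquad
P(X_{pi}=1 \mid \theta_p) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}
```

    where q_ik indicates whether operator k is required by item i, η_k is that operator's difficulty weight, and c is a normalization constant; significant η_k estimates are what the abstract means by operators "predicting item difficulty."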

  14. Secondary Psychometric Examination of the Dimensional Obsessive-Compulsive Scale: Classical Testing, Item Response Theory, and Differential Item Functioning.

    Thibodeau, Michel A; Leonard, Rachel C; Abramowitz, Jonathan S; Riemann, Bradley C

    2015-12-01

    The Dimensional Obsessive-Compulsive Scale (DOCS) is a promising measure of obsessive-compulsive disorder (OCD) symptoms but has received minimal psychometric attention. We evaluated the utility and reliability of DOCS scores. The study included 832 students and 300 patients with OCD. Confirmatory factor analysis supported the originally proposed four-factor structure. DOCS total and subscale scores exhibited good to excellent internal consistency in both samples (α = .82 to α = .96). Patient DOCS total scores reduced substantially during treatment (t = 16.01, d = 1.02). DOCS total scores discriminated between students and patients (sensitivity = 0.76, 1 - specificity = 0.23). The measure did not exhibit gender-based differential item functioning as tested by Mantel-Haenszel chi-square tests. Expected response options for each item were plotted as a function of item response theory and demonstrated that DOCS scores incrementally discriminate OCD symptoms ranging from low to extremely high severity. Incremental differences in DOCS scores appear to represent unbiased and reliable differences in true OCD symptom severity. © The Author(s) 2014.
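    The Mantel-Haenszel DIF screen used in this study rests on a common odds ratio pooled across matched score strata (the chi-square test then checks its significance); a sketch with made-up counts:

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is (a, b, c, d): reference group correct/incorrect,
    focal group correct/incorrect.  A value near 1 suggests no uniform
    DIF.  The counts below are illustrative, not from the DOCS study.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Three hypothetical total-score strata for one item, by gender group:
strata = [(30, 10, 28, 12), (20, 20, 18, 22), (10, 30, 9, 31)]
print(round(mantel_haenszel_or(strata), 2))  # 1.22, close to 1 -> little DIF
```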

  15. A Comparison of Multidimensional Item Selection Methods in Simple and Complex Test Designs

    Eren Halil ÖZBERK

    2017-03-01

    In contrast with previous studies, this study employed various test designs (simple and complex) that allow evaluation of the overall ability score estimations across multiple real test conditions. In this study, four factors were manipulated, namely the test design, the number of items per dimension, the correlation between dimensions, and the item selection method. Using the generated item and ability parameters, dichotomous item responses were generated by using the M3PL compensatory multidimensional IRT model with specified correlations. MCAT composite ability score accuracy was evaluated using the absolute bias (ABSBIAS), the correlation, and the root mean square error (RMSE) between true and estimated ability scores. The results suggest that the multidimensional test structure, the number of items per dimension, and the correlation between dimensions had significant effects on the item selection methods for the overall score estimations. For the simple-structure test design, the V1 item selection method yielded the lowest absolute bias estimations for both long and short tests when estimating overall scores. As the model becomes more complex, the KL item selection method performed better than the other two item selection methods.
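    The recovery criteria named in this abstract (absolute bias, RMSE, and the correlation between true and estimated ability) are straightforward to compute; a sketch with illustrative numbers:

```python
import math

def recovery_stats(true_thetas, est_thetas):
    """Score-recovery summaries of the kind used above: mean absolute
    bias, RMSE, and Pearson correlation between true and estimated
    abilities.  The toy ability values below are made up."""
    n = len(true_thetas)
    diffs = [e - t for t, e in zip(true_thetas, est_thetas)]
    absbias = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mt = sum(true_thetas) / n
    me = sum(est_thetas) / n
    cov = sum((t - mt) * (e - me) for t, e in zip(true_thetas, est_thetas))
    st = math.sqrt(sum((t - mt) ** 2 for t in true_thetas))
    se = math.sqrt(sum((e - me) ** 2 for e in est_thetas))
    return absbias, rmse, cov / (st * se)

true_t = [-1.0, -0.5, 0.0, 0.5, 1.0]
est_t = [-0.9, -0.6, 0.1, 0.4, 1.2]
absbias, rmse, r = recovery_stats(true_t, est_t)
print(round(absbias, 2), round(rmse, 3), round(r, 3))
```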

  16. Optimizing the Use of Response Times for Item Selection in Computerized Adaptive Testing

    Choe, Edison M.; Kern, Justin L.; Chang, Hua-Hua

    2018-01-01

    Despite common operationalization, measurement efficiency of computerized adaptive testing should not only be assessed in terms of the number of items administered but also the time it takes to complete the test. To this end, a recent study introduced a novel item selection criterion that maximizes Fisher information per unit of expected response…

  17. Applications of NLP Techniques to Computer-Assisted Authoring of Test Items for Elementary Chinese

    Liu, Chao-Lin; Lin, Jen-Hsiang; Wang, Yu-Chun

    2010-01-01

    The authors report an implemented environment for computer-assisted authoring of test items and provide a brief discussion about the applications of NLP techniques for computer assisted language learning. Test items can serve as a tool for language learners to examine their competence in the target language. The authors apply techniques for…

  18. A Method for Generating Educational Test Items That Are Aligned to the Common Core State Standards

    Gierl, Mark J.; Lai, Hollis; Hogan, James B.; Matovinovic, Donna

    2015-01-01

    The demand for test items far outstrips the current supply. This increased demand can be attributed, in part, to the transition to computerized testing, but, it is also linked to dramatic changes in how 21st century educational assessments are designed and administered. One way to address this growing demand is with automatic item generation.…

  19. Relationships among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models

    Kohli, Nidhi; Koran, Jennifer; Henn, Lisa

    2015-01-01

    There are well-defined theoretical differences between the classical test theory (CTT) and item response theory (IRT) frameworks. It is understood that in the CTT framework, person and item statistics are test- and sample-dependent. This is not the perception with IRT. For this reason, the IRT framework is considered to be theoretically superior…

  20. Strategies for Controlling Item Exposure in Computerized Adaptive Testing with the Generalized Partial Credit Model

    Davis, Laurie Laughlin

    2004-01-01

    Choosing a strategy for controlling item exposure has become an integral part of test development for computerized adaptive testing (CAT). This study investigated the performance of six procedures for controlling item exposure in a series of simulated CATs under the generalized partial credit model. In addition to a no-exposure control baseline…

  1. Effects of Using Modified Items to Test Students with Persistent Academic Difficulties

    Elliott, Stephen N.; Kettler, Ryan J.; Beddow, Peter A.; Kurz, Alexander; Compton, Elizabeth; McGrath, Dawn; Bruen, Charles; Hinton, Kent; Palmer, Porter; Rodriguez, Michael C.; Bolt, Daniel; Roach, Andrew T.

    2010-01-01

    This study investigated the effects of using modified items in achievement tests to enhance accessibility. An experiment determined whether tests composed of modified items would reduce the performance gap between students eligible for an alternate assessment based on modified achievement standards (AA-MAS) and students not eligible, and the…

  2. Latent Trait Theory Applications to Test Item Bias Methodology. Research Memorandum No. 1.

    Osterlind, Steven J.; Martois, John S.

    This study discusses latent trait theory applications to test item bias methodology. A real data set is used in describing the rationale and application of the Rasch probabilistic model item calibrations across various ethnic group populations. A high school graduation proficiency test covering reading comprehension, writing mechanics, and…

  3. Test Score Equating Using Discrete Anchor Items versus Passage-Based Anchor Items: A Case Study Using "SAT"® Data. Research Report. ETS RR-14-14

    Liu, Jinghua; Zu, Jiyun; Curley, Edward; Carey, Jill

    2014-01-01

    The purpose of this study is to investigate the impact of discrete anchor items versus passage-based anchor items on observed score equating using empirical data.This study compares an "SAT"® critical reading anchor that contains more discrete items proportionally, compared to the total tests to be equated, to another anchor that…

  4. Piecewise Polynomial Fitting with Trend Item Removal and Its Application in a Cab Vibration Test

    Wu Ren

    2018-01-01

    The trend item of a long-term vibration signal is difficult to remove. This paper proposes a piecewise integration method to remove trend items. Examples of direct integration without trend item removal, global integration after piecewise polynomial fitting with trend item removal, and direct integration after piecewise polynomial fitting with trend item removal were simulated. The results showed that direct integration of the fitted piecewise polynomial provided greater acceleration and displacement precision than the other two integration methods. A vibration test was then performed on a special equipment cab. The results indicated that direct integration by piecewise polynomial fitting with trend item removal was highly consistent with the measured signal data. However, the direct integration method without trend item removal resulted in signal distortion. The proposed method can help with frequency domain analysis of vibration signals and modal parameter identification for such equipment.
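    A minimal sketch of the idea described above: fit a low-order (here linear) polynomial to each piece, subtract it to remove the trend item, then integrate directly, assuming a uniformly sampled signal. The segment count and fit order are illustrative choices, not the paper's settings:

```python
def detrend_piecewise(signal, dt, n_segments=4):
    """Remove the trend item from a sampled signal by fitting a
    least-squares straight line to each of `n_segments` pieces and
    subtracting it, then integrate the residual with the trapezoid
    rule."""
    n = len(signal)
    detrended = []
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    for s, e in zip(bounds, bounds[1:]):
        seg = signal[s:e]
        m = len(seg)
        xs = range(m)
        x_mean = (m - 1) / 2.0
        y_mean = sum(seg) / m
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, seg))
        sxx = sum((x - x_mean) ** 2 for x in xs)
        slope = sxy / sxx if sxx else 0.0
        detrended.extend(y - (y_mean + slope * (x - x_mean))
                         for x, y in zip(xs, seg))
    # trapezoid-rule integration of the detrended signal
    integral = [0.0]
    for y0, y1 in zip(detrended, detrended[1:]):
        integral.append(integral[-1] + 0.5 * (y0 + y1) * dt)
    return detrended, integral
```

    On a signal that is pure trend, the detrended residual and its integral are both near zero, which is the distortion-avoiding behavior the paper reports.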

  5. Science Library of Test Items. Volume Eight. Mastery Testing Program. Series 3 & 4 Supplements to Introduction and Manual.

    New South Wales Dept. of Education, Sydney (Australia).

    Continuing a series of short tests aimed at measuring student mastery of specific skills in the natural sciences, this supplementary volume includes teachers' notes, a users' guide and inspection copies of test items 27 to 50. Answer keys and test scoring statistics are provided. The items are designed for grades 7 through 10, and a list of the…

  6. Grade 9 Pilot Test. Mathematics. June 1988 = 9e Annee Test Pilote. Mathematiques. Juin 1988.

    Alberta Dept. of Education, Edmonton.

    This pilot test for ninth grade mathematics is written in both French and English. The test consists of 75 multiple-choice items. Students are given 90 minutes to complete the examination and the use of a calculator is highly recommended. The test content covers a wide range of mathematical topics including: decimals; exponents; arithmetic word…

  7. A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing.

    van Rijn, Peter W; Ali, Usama S

    2017-05-01

    We compare three modelling frameworks for accuracy and speed of item responses in the context of adaptive testing. The first framework is based on modelling scores that result from a scoring rule that incorporates both accuracy and speed. The second framework is the hierarchical modelling approach developed by van der Linden (2007, Psychometrika, 72, 287) in which a regular item response model is specified for accuracy and a log-normal model for speed. The third framework is the diffusion framework in which the response is assumed to be the result of a Wiener process. Although the three frameworks differ in the relation between accuracy and speed, one commonality is that the marginal model for accuracy can be simplified to the two-parameter logistic model. We discuss both conditional and marginal estimation of model parameters. Models from all three frameworks were fitted to data from a mathematics and spelling test. Furthermore, we applied a linear and adaptive testing mode to the data off-line in order to determine differences between modelling frameworks. It was found that a model from the scoring rule framework outperformed a hierarchical model in terms of model-based reliability, but the results were mixed with respect to correlations with external measures. © 2017 The British Psychological Society.

  8. Differential Item Functioning (DIF) among Spanish-Speaking English Language Learners (ELLs) in State Science Tests

    Ilich, Maria O.

    Psychometricians and test developers evaluate standardized tests for potential bias against groups of test-takers by using differential item functioning (DIF). English language learners (ELLs) are a diverse group of students whose native language is not English. While they are still learning the English language, they must take their standardized tests for their school subjects, including science, in English. In this study, linguistic complexity was examined as a possible source of DIF that may result in test scores that confound science knowledge with a lack of English proficiency among ELLs. Two years of fifth-grade state science tests were analyzed for evidence of DIF using two DIF methods, the Simultaneous Item Bias Test (SIBTest) and logistic regression. The tests presented a unique challenge in that the test items were grouped into testlets: groups of items referring to a scientific scenario that measure knowledge of different science content or skills. Very large samples of 10,256 students in 2006 and 13,571 students in 2007 were examined. Half of each sample was composed of Spanish-speaking ELLs; the balance comprised native English speakers. The two DIF methods agreed about the items that favored non-ELLs and the items that favored ELLs. Logistic regression effect sizes were all negligible, while SIBTest flagged items with low to high DIF. A decrease in socioeconomic status and Spanish-speaking ELL diversity may have led to inconsistent SIBTest effect sizes for items used in both testing years. The DIF results for the testlets suggested that ELLs lacked sufficient opportunity to learn science content. The DIF results further suggest that constructed-response test items requiring the student to draw a conclusion about a scientific investigation or to plan a new investigation tended to favor ELLs.
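    The intuition behind the SIBTest statistic used above is a weighted difference in proportion correct between matched reference and focal groups; a crude sketch follows. The published SIBTest also applies a regression correction to the matching scores, which is omitted here:

```python
def sibtest_beta(ref, foc):
    """Crude SIBTest-style effect size: the weighted difference in
    proportion correct between reference and focal groups, matched
    on total score over the remaining items.  `ref` and `foc` map
    each matching score to (n_correct, n_total).  Positive values
    favour the reference group."""
    beta, weight = 0.0, 0
    for k in set(ref) & set(foc):
        rc, rn = ref[k]
        fc, fn = foc[k]
        w = rn + fn
        beta += w * (rc / rn - fc / fn)
        weight += w
    return beta / weight if weight else 0.0
```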

  9. Quantitative penetration testing with item response theory (extended version)

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle Ida Antoinette

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management,

  10. Differential item functioning analysis of the Vanderbilt Expertise Test for cars.

    Lee, Woo-Yeol; Cho, Sun-Joo; McGugin, Rankin W; Van Gulick, Ana Beth; Gauthier, Isabel

    2015-01-01

    The Vanderbilt Expertise Test for cars (VETcar) is a test of visual learning for contemporary car models. We used item response theory to assess the VETcar and in particular used differential item functioning (DIF) analysis to ask if the test functions the same way in laboratory versus online settings and for different groups based on age and gender. An exploratory factor analysis found evidence of multidimensionality in the VETcar, although a single dimension was deemed sufficient to capture the recognition ability measured by the test. We selected a unidimensional three-parameter logistic item response model to examine item characteristics and subject abilities. The VETcar had satisfactory internal consistency. A substantial number of items showed DIF at a medium effect size for test setting and for age group, whereas gender DIF was negligible. Because online subjects were on average older than those tested in the lab, we focused on the age groups to conduct a multigroup item response theory analysis. This revealed that most items on the test favored the younger group. DIF could be more the rule than the exception when measuring performance with familiar object categories, therefore posing a challenge for the measurement of either domain-general visual abilities or category-specific knowledge.
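    The unidimensional three-parameter logistic model selected in this study defines the item characteristic curve sketched below; the DIF-gap helper is an illustrative addition for reading multigroup results, not the authors' procedure:

```python
import math

def p3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item characteristic curve:
    a guessing floor c plus a scaled logistic in ability theta,
    P(theta) = c + (1 - c) / (1 + exp(-1.7*a*(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def dif_gap(theta, params_group1, params_group2):
    """Gap between two groups' response curves at matched ability;
    a nonzero gap at the same theta is the signature of DIF, e.g.
    items whose curves sit higher for the younger group."""
    return p3pl(theta, *params_group1) - p3pl(theta, *params_group2)
```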

  11. The effects of linguistic modification on ESL students' comprehension of nursing course test items.

    Bosher, Susan; Bowles, Melissa

    2008-01-01

    Recent research has indicated that language may be a source of construct-irrelevant variance for non-native speakers of English, or English as a second language (ESL) students, when they take exams. As a result, exams may not accurately measure knowledge of nursing content. One accommodation often used to level the playing field for ESL students is linguistic modification, a process by which the reading load of test items is reduced while the content and integrity of the item are maintained. Research on the effects of linguistic modification has been conducted on examinees in the K-12 population, but is just beginning in other areas. This study describes the collaborative process by which items from a pathophysiology exam were linguistically modified and subsequently evaluated for comprehensibility by ESL students. Findings indicate that in a majority of cases, modification improved examinees' comprehension of test items. Implications for test item writing and future research are discussed.

  12. Factor Structure and Reliability of Test Items for Saudi Teacher Licence Assessment

    Alsadaawi, Abdullah Saleh

    2017-01-01

    The Saudi National Assessment Centre administers the Computer Science Teacher Test for teacher certification. The aim of this study is to explore gender differences in candidates' scores, and investigate dimensionality, reliability, and differential item functioning using confirmatory factor analysis and item response theory. The confirmatory…

  13. Testing for Nonuniform Differential Item Functioning with Multiple Indicator Multiple Cause Models

    Woods, Carol M.; Grimm, Kevin J.

    2011-01-01

    In extant literature, multiple indicator multiple cause (MIMIC) models have been presented for identifying items that display uniform differential item functioning (DIF) only, not nonuniform DIF. This article addresses, for apparently the first time, the use of MIMIC models for testing both uniform and nonuniform DIF with categorical indicators. A…

  14. A Feedback Control Strategy for Enhancing Item Selection Efficiency in Computerized Adaptive Testing

    Weissman, Alexander

    2006-01-01

    A computerized adaptive test (CAT) may be modeled as a closed-loop system, where item selection is influenced by trait level ([theta]) estimation and vice versa. When discrepancies exist between an examinee's estimated and true [theta] levels, nonoptimal item selection is a likely result. Nevertheless, examinee response behavior consistent with…

  15. Australian Biology Test Item Bank, Years 11 and 12. Volume II: Year 12.

    Brown, David W., Ed.; Sewell, Jeffrey J., Ed.

    This document consists of test items which are applicable to biology courses throughout Australia (irrespective of course materials used); assess key concepts within course statement (for both core and optional studies); assess a wide range of cognitive processes; and are relevant to current biological concepts. These items are arranged under…

  16. Australian Biology Test Item Bank, Years 11 and 12. Volume I: Year 11.

    Brown, David W., Ed.; Sewell, Jeffrey J., Ed.

    This document consists of test items which are applicable to biology courses throughout Australia (irrespective of course materials used); assess key concepts within course statement (for both core and optional studies); assess a wide range of cognitive processes; and are relevant to current biological concepts. These items are arranged under…

  17. What Does a Verbal Test Measure? A New Approach to Understanding Sources of Item Difficulty.

    Berk, Eric J. Vanden; Lohman, David F.; Cassata, Jennifer Coyne

    Assessing the construct relevance of mental test results continues to present many challenges, and it has proven to be particularly difficult to assess the construct relevance of verbal items. This study was conducted to gain a better understanding of the conceptual sources of verbal item difficulty using a unique approach that integrates…

  18. The Prediction of Item Parameters Based on Classical Test Theory and Latent Trait Theory

    Anil, Duygu

    2008-01-01

    In this study, the prediction power of the item characteristics based on the experts' predictions on conditions try-out practices cannot be applied was examined for item characteristics computed depending on classical test theory and two-parameters logistic model of latent trait theory. The study was carried out on 9914 randomly selected students…

  19. Development of an item bank for computerized adaptive test (CAT) measurement of pain

    Petersen, Morten Aa.; Aaronson, Neil K; Chie, Wei-Chu

    2016-01-01

    PURPOSE: Patient-reported outcomes should ideally be adapted to the individual patient while maintaining comparability of scores across patients. This is achievable using computerized adaptive testing (CAT). The aim here was to develop an item bank for CAT measurement of the pain domain as measured… were obtained from 1103 cancer patients from five countries. Psychometric evaluations showed that 16 items could be retained in a unidimensional item bank. Evaluations indicated that use of the CAT measure may reduce sample size requirements by 15-25% compared to using the QLQ-C30 pain scale. CONCLUSIONS: We have established an item bank of 16 items suitable for CAT measurement of pain. While being backward compatible with the QLQ-C30, the new item bank will significantly improve measurement precision of pain. We recommend initiating CAT measurement by screening for pain using the two original QLQ…

  20. Análise de itens de uma prova de raciocínio estatístico Analysis of items of a statistical reasoning test

    Claudette Maria Medeiros Vendramini

    2004-12-01

    This study analyzed the 18 multiple-choice questions of a test on basic concepts of statistics under both classical and modern test theory. The test was taken by 325 undergraduate students, randomly selected from the areas of the human, exact, and health sciences. The analysis indicated that the test is predominantly unidimensional and that the items are better fitted by the three-parameter model. The indices of difficulty, discrimination, and biserial correlation show acceptable values. It is suggested that new items be added to the test in order to establish reliability and validity for the educational context and to reveal the statistical reasoning of undergraduate students when reading representations of statistical data.
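    The classical indices reported in this record (difficulty, discrimination, and an item-total correlation standing in for the biserial) can be computed from a scored 0/1 response matrix. A sketch with illustrative conventions (upper/lower 27% groups, uncorrected point-biserial):

```python
import math

def item_statistics(responses):
    """Classical item analysis for a 0/1 scored matrix
    (rows = examinees, columns = items): difficulty (proportion
    correct), discrimination (upper minus lower 27% group
    difficulty), and the point-biserial item-total correlation,
    uncorrected for the item's own contribution to the total."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(0.27 * len(responses)))
    lower, upper = order[:k], order[-k:]
    stats = []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = sum(col) / len(col)
        disc = (sum(responses[i][j] for i in upper)
                - sum(responses[i][j] for i in lower)) / k
        mt = sum(totals) / len(totals)
        st = math.sqrt(sum((t - mt) ** 2 for t in totals) / len(totals))
        mp = sum(t for t, x in zip(totals, col) if x) / max(1, sum(col))
        rpb = ((mp - mt) / st * math.sqrt(p / (1 - p))
               if 0 < p < 1 and st else 0.0)
        stats.append((p, disc, rpb))
    return stats
```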

  1. Explanatory item response modelling of an abstract reasoning assessment: A case for modern test design

    Helland, Fredrik

    2016-01-01

    Assessment is an integral part of society and education, and for this reason it is important to know what you measure. This thesis is about explanatory item response modelling of an abstract reasoning assessment, with the objective to create a modern test design framework for automatic generation of valid and precalibrated items of abstract reasoning. Modern test design aims to strengthen the connections between the different components of a test, with a stress on strong theory, systematic it...

  2. A comparison of discriminant logistic regression and Item Response Theory Likelihood-Ratio Tests for Differential Item Functioning (IRTLRDIF) in polytomous short tests.

    Hidalgo, María D; López-Martínez, María D; Gómez-Benito, Juana; Guilera, Georgina

    2016-01-01

    Short scales are typically used in the social, behavioural and health sciences. This is relevant since test length can influence whether items showing DIF are correctly flagged. This paper compares the relative effectiveness of discriminant logistic regression (DLR) and IRTLRDIF for detecting DIF in polytomous short tests. A simulation study was designed in which test length, sample size, DIF amount and number of item response categories were manipulated, and Type I error and power were evaluated. IRTLRDIF and DLR yielded Type I error rates close to the nominal level in no-DIF conditions. Under DIF conditions, Type I error rates were affected by test length, DIF amount, degree of test contamination, sample size and number of item response categories. DLR showed a higher Type I error rate than IRTLRDIF. Power rates were affected by DIF amount and sample size, but not by test length. DLR achieved higher power rates than IRTLRDIF in very short tests, although its inflated Type I error rate means this result cannot be relied upon. Test length had an important impact on the Type I error rate. IRTLRDIF and DLR showed low power rates in short tests and with small sample sizes.

  3. Item response theory, computerized adaptive testing, and PROMIS: assessment of physical function.

    Fries, James F; Witter, James; Rose, Matthias; Cella, David; Khanna, Dinesh; Morgan-DeWitt, Esi

    2014-01-01

    Patient-reported outcome (PRO) questionnaires record health information directly from research participants because observers may not accurately represent the patient perspective. Patient-reported Outcomes Measurement Information System (PROMIS) is a US National Institutes of Health cooperative group charged with bringing PRO to a new level of precision and standardization across diseases by item development and use of item response theory (IRT). With IRT methods, improved items are calibrated on an underlying concept to form an item bank for a "domain" such as physical function (PF). The most informative items can be combined to construct efficient "instruments" such as 10-item or 20-item PF static forms. Each item is calibrated on the basis of the probability that a given person will respond at a given level, and the ability of the item to discriminate people from one another. Tailored forms may cover any desired level of the domain being measured. Computerized adaptive testing (CAT) selects the best items to sharpen the estimate of a person's functional ability, based on prior responses to earlier questions. PROMIS item banks have been improved with experience from several thousand items, and are calibrated on over 21,000 respondents. In areas tested to date, PROMIS PF instruments are superior or equal to Health Assessment Questionnaire and Medical Outcome Study Short Form-36 Survey legacy instruments in clarity, translatability, patient importance, reliability, and sensitivity to change. Precise measures, such as PROMIS, efficiently incorporate patient self-report of health into research, potentially reducing research cost by lowering sample size requirements. The advent of routine IRT applications has the potential to transform PRO measurement.

  4. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory

    Tomasz Smoleń

    2015-08-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf test, the Raven test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.

  5. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory.

    Smolen, Tomasz; Chuderski, Adam

    2015-01-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf-test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf-test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf-test, the Raven-test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.

  6. Construction of Valid and Reliable Test for Assessment of Students

    Osadebe, P. U.

    2015-01-01

    The study was carried out to construct a valid and reliable test in Economics for secondary school students. Two research questions guided the establishment of validity and reliability for the Economics Achievement Test (EAT), a multiple-choice objective test of 100 items with five options each. A sample of 1000 students was randomly…

  7. Understanding Test-Takers' Perceptions of Difficulty in EAP Vocabulary Tests: The Role of Experiential Factors

    Oruç Ertürk, Nesrin; Mumford, Simon E.

    2017-01-01

    This study, conducted by two researchers who were also multiple-choice question (MCQ) test item writers at a private English-medium university in an English as a foreign language (EFL) context, was designed to shed light on the factors that influence test-takers' perceptions of difficulty in English for academic purposes (EAP) vocabulary, with the…

  8. Comparison of Classical Test Theory and Item Response Theory in Individual Change Assessment

    Jabrayilov, Ruslan; Emons, Wilco H. M.; Sijtsma, Klaas

    2016-01-01

    Clinical psychologists are advised to assess clinical and statistical significance when assessing change in individual patients. Individual change assessment can be conducted using either the methodologies of classical test theory (CTT) or item response theory (IRT). Researchers have been optimistic

  9. Item Response Theory analysis of Fagerström Test for Cigarette Dependence.

    Svicher, Andrea; Cosci, Fiammetta; Giannini, Marco; Pistelli, Francesco; Fagerström, Karl

    2018-02-01

    The Fagerström Test for Cigarette Dependence (FTCD) and the Heaviness of Smoking Index (HSI) are the gold standard measures for assessing cigarette dependence. However, the FTCD's reliability and factor structure have been questioned, and the HSI's psychometric properties are in need of further investigation. The present study examined the psychometric properties of the FTCD and the HSI via Item Response Theory. The study was a secondary analysis of data collected from 862 Italian daily smokers. Confirmatory factor analysis was run to evaluate the dimensionality of the FTCD. A Graded Response Model was applied to the FTCD and the HSI to verify the fit to the data. Both item and test functioning were analyzed, and item statistics, the Test Information Function, and scale reliabilities were calculated. Mokken Scale Analysis was applied to estimate homogeneity, and Loevinger's coefficients were calculated. The FTCD showed unidimensionality and homogeneity for most of the items and for the total score. It also showed high sensitivity and good reliability from medium to high levels of cigarette dependence, although problems related to some items (i.e., items 3 and 5) were evident. The HSI had good homogeneity, adequate item functioning, and high reliability from medium to high levels of cigarette dependence. Significant Differential Item Functioning was found for items 1, 4, and 5 of the FTCD and for both items of the HSI. The HSI seems highly recommended in clinical settings addressing heavy smokers, while the FTCD would be better used in smokers whose level of cigarette dependence ranges between low and high. Copyright © 2017 Elsevier Ltd. All rights reserved.
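    The Graded Response Model applied in this study gives each response category a probability via differences of adjacent boundary curves; a minimal sketch, assuming increasing thresholds:

```python
import math

def grm_probs(theta, a, thresholds):
    """Graded Response Model category probabilities for one
    polytomous item.  P*(k) = 1 / (1 + exp(-a*(theta - b_k))) is
    the probability of responding in category k or above; adjacent
    differences of these boundary curves give the probability of
    each category.  `thresholds` must be strictly increasing."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b)))
                      for b in thresholds] + [0.0]
    return [hi - lo for hi, lo in zip(p_star, p_star[1:])]
```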

  10. Multiple-Choice Exams: An Obstacle for Higher-Level Thinking in Introductory Science Classes

    Stanger-Hall, Kathrin F.

    2012-01-01

    Learning science requires higher-level (critical) thinking skills that need to be practiced in science classes. This study tested the effect of exam format on critical-thinking skills. Multiple-choice (MC) testing is common in introductory science courses, and students in these classes tend to associate memorization with MC questions and may not see the need to modify their study strategies for critical thinking, because the MC exam format has not changed. To test the effect of exam format, I used two sections of an introductory biology class. One section was assessed with exams in the traditional MC format, the other section was assessed with both MC and constructed-response (CR) questions. The mixed exam format was correlated with significantly more cognitively active study behaviors and a significantly better performance on the cumulative final exam (after accounting for grade point average and gender). There was also less gender-bias in the CR answers. This suggests that the MC-only exam format indeed hinders critical thinking in introductory science classes. Introducing CR questions encouraged students to learn more and to be better critical thinkers and reduced gender bias. However, student resistance increased as students adjusted their perceptions of their own critical-thinking abilities. PMID:22949426

  11. Generalizability theory and item response theory

    Glas, Cornelis A.W.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Item response theory is usually applied to items with a selected-response format, such as multiple choice items, whereas generalizability theory is usually applied to constructed-response tasks assessed by raters. However, in many situations, raters may use rating scales consisting of items with a selected-response format. This chapter presents a short overview of how item response theory and generalizability theory were integrated to model such assessments. Further, the precision of the esti...

  12. Three Modeling Applications to Promote Automatic Item Generation for Examinations in Dentistry.

    Lai, Hollis; Gierl, Mark J; Byrne, B Ellen; Spielman, Andrew I; Waldschmidt, David M

    2016-03-01

    Test items created for dentistry examinations are often individually written by content experts. This approach to item development is expensive because it requires the time and effort of many content experts but yields relatively few items. The aim of this study was to describe and illustrate how items can be generated using a systematic approach. Automatic item generation (AIG) is an alternative method that allows a small number of content experts to produce large numbers of items by integrating their domain expertise with computer technology. This article describes and illustrates how three modeling approaches to item content (item cloning, cognitive modeling, and image-anchored modeling) can be used to generate large numbers of multiple-choice test items for examinations in dentistry. Test items can be generated by combining the expertise of two content specialists with technology supported by AIG. A total of 5,467 new items were created during this study. From substitution of item content, to modeling appropriate responses based upon a cognitive model of correct responses, to generating items linked to specific graphical findings, AIG has the potential for meeting increasing demands for test items. Further, the methods described in this study can be generalized and applied to many other item types. Future research applications for AIG in dental education are discussed.

  13. Overcoming the effects of differential skewness of test items in scale construction

    Johann M. Schepers

    2004-10-01

    The principal objective of the study was to develop a procedure for overcoming the effects of differential skewness of test items in scale construction. It was shown that the degree of skewness of test items places an upper limit on the correlations between the items, regardless of the content of the items. If the items are ordered in terms of skewness, the resulting intercorrelation matrix forms a simplex or a pseudo-simplex. Factoring such a matrix results in a multiplicity of factors, most of which are artifacts. A procedure for overcoming this problem was demonstrated with items from the Locus of Control Inventory (Schepers, 1995). The analysis was based on a sample of 1662 first-year university students.
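    The upper limit that skewness places on inter-item correlations can be made concrete for dichotomous items: the maximum attainable phi coefficient depends only on the two items' endorsement rates, not on their content. A small sketch (the rates below are illustrative, not taken from the study):

```python
import math

def phi_max(p1, p2):
    """Upper bound on the Pearson (phi) correlation between two 0/1 items
    with endorsement rates p1 <= p2. Attained when everyone who endorses
    the rarer item also endorses the commoner one; depends only on the
    marginals, i.e. on item skewness."""
    assert 0 < p1 <= p2 < 1
    return math.sqrt(p1 * (1 - p2) / ((1 - p1) * p2))

# Oppositely skewed items (hypothetical rates): their correlation can
# never exceed 0.25, however similar the item content is.
print(phi_max(0.2, 0.8))  # 0.25

# Equally skewed items face no such cap:
print(phi_max(0.2, 0.2))  # 1.0
```

This is why ordering items by skewness produces the simplex-like correlation structure the abstract describes: adjacent items (similar skewness) can correlate highly, while distant items cannot.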

  14. An Effective Multimedia Item Shell Design for Individualized Education: The Crome Project

    Irene Cheng

    2008-01-01

    There are several advantages to creating multimedia item types and applying computer-based adaptive testing in education. First is the capability to motivate learning by making learners feel more engaged in an interactive environment. Second is a better concept representation, which is not possible in conventional multiple-choice tests. Third is the advantage of individualized curriculum design, rather than a curriculum designed for an average student. Fourth is a good choice of the next question, with an appropriate difficulty level based on a student's response to the current question. However, many issues need to be addressed to achieve these goals, including: (a) the large number of item types required to represent the current multiple-choice questions in multimedia formats, (b) the criterion used to determine the difficulty level of a multimedia question item, and (c) the methodology applied to the question selection process for individual students. In this paper, we propose a multimedia item shell design that not only reduces the number of item types required, but also computes the difficulty level of an item automatically. The concept of a question seed is introduced to make content creation more cost-effective. The proposed item shell framework facilitates efficient communication between user responses at the client and the scoring agents integrated with a student ability assessor at the server. We also describe approaches for automatically estimating the difficulty level of questions, and discuss a preliminary evaluation of multimedia item types by students.

  15. Does Educator Training or Experience Affect the Quality of Multiple-Choice Questions?

    Webb, Emily M; Phuong, Jonathan S; Naeger, David M

    2015-10-01

    Physicians receive little training in proper multiple-choice question (MCQ) writing methods. Well-constructed MCQs follow rules which ensure that a question tests what it is intended to test. Questions that break these rules are described as "flawed." We examined whether the prevalence of flawed questions differed significantly between those with and without prior training in question writing, and between those with different levels of educator experience. We assessed 200 unedited MCQs from a question bank for our senior medical student radiology elective: an equal number of questions (50) were written by faculty with previous training in MCQ writing, other faculty, residents, and medical students. Questions were scored independently by two readers for the presence of 11 distinct flaws described in the literature. Questions written by faculty with MCQ writing training had significantly fewer errors: a mean of 0.4 errors per question compared to a mean of 1.5-1.7 errors per question for the other groups. Educator experience alone had no effect on the frequency of flaws; faculty without dedicated training, residents, and students performed similarly. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.

  16. Validity of the ISUOG basic training test

    Hillerup, Niels Emil; Tabor, Ann; Konge, Lars

    2018-01-01

    A certain level of theoretical knowledge is required when performing basic obstetrical and gynecological ultrasound. To assess the adequacy of trainees' basic theoretical knowledge, the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) has developed a theoretical test of 49 Multiple Choice Questionnaire (MCQ) items for their basic training courses.

  17. Project Physics Tests 4, Light and Electromagnetism.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 4 are presented in this booklet. Included are 70 multiple-choice and 22 problem-and-essay questions. Concepts of light and electromagnetism are examined on charges, reflection, electrostatic forces, electric potential, speed of light, electromagnetic waves and radiations, Oersted's and Faraday's work,…

  18. A Comparison of Procedures for Content-Sensitive Item Selection in Computerized Adaptive Tests.

    Kingsbury, G. Gage; Zara, Anthony R.

    1991-01-01

    This simulation investigated two procedures that reduce differences between paper-and-pencil testing and computerized adaptive testing (CAT) by making CAT content sensitive. Results indicate that the price in terms of additional test items of using constrained CAT for content balancing is much smaller than that of using testlets. (SLD)

  19. Using Set Covering with Item Sampling to Analyze the Infeasibility of Linear Programming Test Assembly Models

    Huitzing, Hiddo A.

    2004-01-01

    This article shows how set covering with item sampling (SCIS) methods can be used in the analysis and preanalysis of linear programming models for test assembly (LPTA). LPTA models can construct tests, fulfilling a set of constraints set by the test assembler. Sometimes, no solution to the LPTA model exists. The model is then said to be…

  20. Development of abbreviated eight-item form of the Penn Verbal Reasoning Test.

    Bilker, Warren B; Wierzbicki, Michael R; Brensinger, Colleen M; Gur, Raquel E; Gur, Ruben C

    2014-12-01

    The ability to reason with language is a highly valued cognitive capacity that correlates with IQ measures and is sensitive to damage in language areas. The Penn Verbal Reasoning Test (PVRT) is a 29-item computerized test for measuring abstract analogical reasoning abilities using language. The full test can take over half an hour to administer, which limits its applicability in large-scale studies. We previously described a procedure for abbreviating a clinical rating scale and a modified procedure for reducing tests with a large number of items. Here we describe the application of the modified method to reducing the number of items in the PVRT to a parsimonious subset of items that accurately predicts the total score. As in our previous reduction studies, a split sample is used for model fitting and validation, with cross-validation to verify results. We find that an 8-item scale predicts the total 29-item score well, achieving a correlation of .9145 for the reduced form for the model fitting sample and .8952 for the validation sample. The results indicate that a drastically abbreviated version, which cuts administration time by more than 70%, can be safely administered as a predictor of PVRT performance. © The Author(s) 2014.
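    The validation step described above, checking how well a short-form score tracks the full-scale total on held-out data, reduces to a simple correlation. A minimal illustration on a hypothetical response matrix (the 6-item test and the choice of short-form items are made up for the sketch; they are not the PVRT's):

```python
def pearson(x, y):
    """Plain Pearson correlation, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

# Each row: one examinee's 0/1 responses to a hypothetical 6-item test.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
short_form = [r[0] + r[3] for r in responses]   # items 1 and 4 as a 2-item short form
full_scale = [sum(r) for r in responses]
print(round(pearson(short_form, full_scale), 3))  # 0.937
```

In the actual study, the item subset is chosen by model fitting on one half-sample and the correlation is then confirmed on the other half, guarding against selecting items that only fit the first sample by chance.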

  2. Item Response Theory Analyses of the Cambridge Face Memory Test (CFMT)

    Cho, Sun-Joo; Wilmer, Jeremy; Herzmann, Grit; McGugin, Rankin; Fiset, Daniel; Van Gulick, Ana E.; Ryan, Katie; Gauthier, Isabel

    2014-01-01

    We evaluated the psychometric properties of the Cambridge face memory test (CFMT; Duchaine & Nakayama, 2006). First, we assessed the dimensionality of the test with a bi-factor exploratory factor analysis (EFA). This EFA revealed a general factor and three specific factors clustered by targets of the CFMT. However, the three specific factors appeared to be minor factors that can be ignored. Second, we fit a unidimensional item response model. This item response model showed that the CFMT items could discriminate individuals at different ability levels and covered a wide range of the ability continuum. We found the CFMT to be particularly precise for a wide range of ability levels. Third, we implemented item response theory (IRT) differential item functioning (DIF) analyses for each gender group and two age groups (Age ≤ 20 versus Age > 21). This DIF analysis suggested little evidence of consequential differential functioning on the CFMT for these groups, supporting the use of the test to compare older to younger, or male to female, individuals. Fourth, we tested for a gender difference on the latent facial recognition ability with an explanatory item response model. We found a significant but small gender difference on the latent ability for face recognition, which was higher for women than men by 0.184, at age mean 23.2, controlling for linear and quadratic age effects. Finally, we discuss the practical considerations of the use of total scores versus IRT scale scores in applications of the CFMT. PMID:25642930

  3. The impact of two multiple-choice question formats on the problem-solving strategies used by novices and experts.

    Coderre, Sylvain P; Harasym, Peter; Mandin, Henry; Fick, Gordon

    2004-11-05

    Pencil-and-paper examination formats, and specifically the standard five-option multiple-choice question, have often been questioned as a means of assessing higher-order clinical reasoning or problem solving. This study first investigated whether two paper formats with differing numbers of alternatives (standard five-option and extended-matching questions) can test problem-solving abilities. Second, the impact of the number of alternatives on psychometric properties and problem-solving strategies was examined. Think-aloud protocols were collected to determine the problem-solving strategies used by experts and non-experts in answering Gastroenterology questions across the two pencil-and-paper formats. The two formats demonstrated equal ability to test problem-solving abilities, while the number of alternatives did not significantly affect the psychometric properties or the problem-solving strategies utilized. These results support the notion that well-constructed multiple-choice questions can in fact test higher-order clinical reasoning. Furthermore, it can be concluded that in testing clinical reasoning, the question stem, or content, remains more important than the number of alternatives.

  4. Benford’s Law: Textbook Exercises and Multiple-Choice Testbanks

    Slepkov, Aaron D.; Ironside, Kevin B.; DiBattista, David

    2015-01-01

    Benford’s Law describes the finding that the distribution of leading (or leftmost) digits of innumerable datasets follows a well-defined logarithmic trend, rather than an intuitive uniformity. In practice this means that the most common leading digit is 1, with an expected frequency of 30.1%, and the least common is 9, with an expected frequency of 4.6%. Currently, the most common application of Benford’s Law is in detecting number invention and tampering such as found in accounting-, tax-, and voter-fraud. We demonstrate that answers to end-of-chapter exercises in physics and chemistry textbooks conform to Benford’s Law. Subsequently, we investigate whether this fact can be used to gain advantage over random guessing in multiple-choice tests, and find that while testbank answers in introductory physics closely conform to Benford’s Law, the testbank is nonetheless secure against such a Benford’s attack for banal reasons. PMID:25689468
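    The quoted leading-digit frequencies, 30.1% for digit 1 and 4.6% for digit 9, follow directly from Benford's logarithmic distribution, P(d) = log10(1 + 1/d), which a few lines of code can verify:

```python
import math

# Expected leading-digit frequencies under Benford's Law.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(d, f"{p:.1%}")
# digit 1 -> 30.1%, digit 9 -> 4.6%; the nine probabilities sum to 1
# because the terms telescope to log10(10).
```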

  5. Pushing Critical Thinking Skills With Multiple-Choice Questions: Does Bloom's Taxonomy Work?

    Zaidi, Nikki L Bibler; Grob, Karri L; Monrad, Seetha M; Kurtz, Joshua B; Tai, Andrew; Ahmed, Asra Z; Gruppen, Larry D; Santen, Sally A

    2018-06-01

    Medical school assessments should foster the development of higher-order thinking skills to support clinical reasoning and a solid foundation of knowledge. Multiple-choice questions (MCQs) are commonly used to assess student learning, and well-written MCQs can support learner engagement in higher levels of cognitive reasoning such as application or synthesis of knowledge. Bloom's taxonomy has been used to identify MCQs that assess students' critical thinking skills, with evidence suggesting that higher-order MCQs support a deeper conceptual understanding of scientific process skills. Similarly, clinical practice also requires learners to develop higher-order thinking skills that include all of Bloom's levels. Faculty question writers and examinees may approach the same material differently based on varying levels of knowledge and expertise, and these differences can influence the cognitive levels being measured by MCQs. Consequently, faculty question writers may perceive that certain MCQs require higher-order thinking skills to process the question, whereas examinees may only need to employ lower-order thinking skills to render a correct response. Likewise, seemingly lower-order questions may actually require higher-order thinking skills to respond correctly. In this Perspective, the authors describe some of the cognitive processes examinees use to respond to MCQs. The authors propose that various factors affect both the question writer and examinee's interaction with test material and subsequent cognitive processes necessary to answer a question.

  6. Test-retest reliability of Eurofit Physical Fitness items for children with visual impairments

    Houwen, Suzanne; Visscher, Chris; Hartman, Esther; Lemmink, Koen A. P. M.

    The purpose of this study was to examine the test-retest reliability of physical fitness items from the European Test of Physical Fitness (Eurofit) for children with visual impairments. A sample of 21 children, ages 6-12 years, that were recruited from a special school for children with visual

  7. The Relative Importance of Persons, Items, Subtests, and Languages to TOEFL Test Variance.

    Brown, James Dean

    1999-01-01

    Explored the relative contributions to Test of English as a Foreign Language (TOEFL) score dependability of various numbers of persons, items, subtests, languages, and their various interactions. Sampled 15,000 test takers, 1000 each from 15 different language backgrounds. (Author/VWL)

  8. Power and Sample Size Calculations for Logistic Regression Tests for Differential Item Functioning

    Li, Zhushan

    2014-01-01

    Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for the power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model.…
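    For intuition, the simplest case, a two-sided 1-df Wald test of a single DIF coefficient, reduces to the familiar normal-approximation power formula. The sketch below is that textbook special case, not the paper's exact derivations, and the effect size and variance figures are hypothetical:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def wald_power(beta, se, alpha=0.05):
    """Two-sided power of a 1-df Wald test of a DIF coefficient beta whose
    estimator has standard error se, by the usual normal approximation."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    delta = beta / se
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)

def required_n(beta, sigma, power=0.80, alpha=0.05):
    """Sample size such that SE = sigma / sqrt(n) yields the target power
    (sigma is the assumed per-observation scaling of the standard error)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(power)
    return ((z_crit + z_pow) * sigma / beta) ** 2

# Hypothetical numbers: a DIF log-odds of 0.4 with sigma = 5.
n = required_n(beta=0.4, sigma=5, power=0.80)
print(round(n))                                  # 1226 examinees for 80% power
print(round(wald_power(0.4, 5 / n ** 0.5), 2))   # 0.8 at that sample size
```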

  9. Use of differential item functioning (DIF) analysis for bias analysis in test construction

    Marié De Beer

    2004-10-01

    Where differential item functioning (DIF) procedures for item analysis based on item response theory (IRT) are used during test construction, item characteristic curves for the same item can be plotted for different subgroups. These curves indicate how each item functions for the different subgroups at different ability levels. DIF is indicated by the area between the curves. DIF was used in the construction of the Learning Potential Computerised Adaptive Test (LPCAT) to identify items that showed bias with respect to gender, culture, language, or level of training. Items that exceeded a predetermined level of DIF were omitted from the final item bank, regardless of which subgroup was advantaged or disadvantaged. The process and results of the DIF analysis are discussed.

  10. Fostering a student's skill for analyzing test items through an authentic task

    Setiawan, Beni; Sabtiawan, Wahyu Budi

    2017-08-01

    Analyzing test items is a skill that must be mastered by prospective teachers in order to determine the quality of the test questions they have written. The main aim of this research was to describe the effectiveness of an authentic task in fostering students' skill at analyzing test items, covering validity, reliability, item discrimination index, level of difficulty, and distractor functioning. The participants were students of the science education study program, science and mathematics faculty, Universitas Negeri Surabaya, enrolled in the assessment course. The research design was a one-group posttest design. The treatment was an authentic task in which the students developed test items and then analyzed the items like professional assessors using Microsoft Excel and Anates software. The data obtained were analyzed descriptively: the students' skill levels were presented and then related to theories and previous empirical studies. The research showed that the task helped the students acquire these skills. Thirty-one students got a perfect score for the analysis, five students achieved 97% mastery, two students had 92% mastery, and another two students got 89% and 79% mastery. The implication of this finding is that when students are given authentic tasks that require them to perform like professionals, they are more likely to achieve professional skills by the end of the course.
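    Two of the indices the students compute, item difficulty and the upper-lower discrimination index, can be sketched for dichotomous items. The response matrix and the 27% grouping fraction below are illustrative (27% is the conventional Kelley cut, not a value from the study):

```python
def difficulty(item_col):
    """Proportion correct: p near 1 = easy item, near 0 = hard item."""
    return sum(item_col) / len(item_col)

def discrimination(item_col, totals, frac=0.27):
    """Upper-lower discrimination index D = p_upper - p_lower, comparing
    the top and bottom fraction of examinees ranked by total score."""
    k = max(1, round(frac * len(totals)))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:k], order[-k:]
    p_low = sum(item_col[i] for i in lower) / k
    p_high = sum(item_col[i] for i in upper) / k
    return p_high - p_low

responses = [  # 6 examinees x 4 items, hypothetical; 1 = correct
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
totals = [sum(r) for r in responses]
for j in range(4):
    col = [r[j] for r in responses]
    print(f"item {j}: p = {difficulty(col):.2f}, D = {discrimination(col, totals):.2f}")
```

Items with D near 0 (or negative) fail to separate strong from weak examinees and are candidates for revision, which is exactly the judgment the authentic task asks students to make.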

  11. Using response-time constraints in item selection to control for differential speededness in computerized adaptive testing

    van der Linden, Willem J.; Scrams, David J.; Schnipke, Deborah L.

    2003-01-01

    This paper proposes an item selection algorithm that can be used to neutralize the effect of time limits in computer adaptive testing. The method is based on a statistical model for the response-time distributions of the test takers on the items in the pool that is updated each time a new item has

  12. Overview of Classical Test Theory and Item Response Theory for Quantitative Assessment of Items in Developing Patient-Reported Outcome Measures

    Cappelleri, Joseph C.; Lundy, J. Jason; Hays, Ron D.

    2014-01-01

    Introduction The U.S. Food and Drug Administration’s patient-reported outcome (PRO) guidance document defines content validity as “the extent to which the instrument measures the concept of interest” (FDA, 2009, p. 12). “Construct validity is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity” (Strauss & Smith, 2009, p. 7). Hence both qualitative and quantitative information are essential in evaluating the validity of measures. Methods We review classical test theory and item response theory approaches to evaluating PRO measures including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized “difficulty” (severity) order of items is represented by observed responses. Conclusion Classical test theory and item response theory can be useful in providing a quantitative assessment of items and scales during the content validity phase of patient-reported outcome measures. Depending on the particular type of measure and the specific circumstances, either one or both approaches should be considered to help maximize the content validity of PRO measures. PMID:24811753
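    Two of the classical-test-theory checks listed above, response-category frequencies and floor/ceiling effects, are straightforward to compute. A sketch on a hypothetical 5-point, 3-item PRO scale (the >15% flagging threshold is a common rule of thumb, not from this paper):

```python
from collections import Counter

responses = [  # rows = respondents, columns = three items scored 1..5 (hypothetical)
    [1, 1, 1], [1, 2, 1], [3, 3, 2], [5, 5, 5],
    [4, 5, 4], [2, 2, 3], [1, 1, 1], [5, 5, 5],
]
n_items, lo, hi = 3, 1, 5

# Frequency of responses to each category of each item.
for j in range(n_items):
    counts = Counter(r[j] for r in responses)
    print(f"item {j} category frequencies:", dict(sorted(counts.items())))

# Floor/ceiling effects: share of respondents at the scale's extremes.
scores = [sum(r) for r in responses]
floor = sum(s == lo * n_items for s in scores) / len(scores)
ceiling = sum(s == hi * n_items for s in scores) / len(scores)
print(f"floor {floor:.0%}, ceiling {ceiling:.0%}")  # >15% at either end is often flagged
```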

  13. Science Literacy: How do High School Students Solve PISA Test Items?

    Wati, F.; Sinaga, P.; Priyandoko, D.

    2017-09-01

    The Programme for International Student Assessment (PISA) assesses students' science literacy in real-life contexts and a wide variety of situations. The results therefore do not provide adequate information for teachers to probe students' science literacy, because the range of material taught at school depends on the curriculum used. This study aims to investigate how junior high school students in Indonesia solve PISA test items. Data were collected by administering the PISA test items of the greenhouse unit to 36 students of 9th grade. Students' answers were analyzed qualitatively for each item based on the competence tested in the problem. The way students answer a problem exhibits their ability in the particular competence, which is influenced by a number of factors: students' unfamiliarity with the test construction, weak reading performance, difficulty connecting the available information to the question, and limited ability to express their ideas effectively and in an easy-to-read way. Accordingly, selected PISA test items can be used alongside the teaching topics taught to familiarize students with science literacy.

  14. FormScanner: Open-Source Solution for Grading Multiple-Choice Exams

    Young, Chadwick; Lo, Glenn; Young, Kaisa; Borsetta, Alberto

    2016-01-01

    The multiple-choice exam remains a staple for many introductory physics courses. In the past, people have graded these by hand or even with flaming needles. Today, one usually grades the exams with a form scanner that utilizes optical mark recognition (OMR). Several companies provide these scanners and particular forms, such as the eponymous…

  15. Consequences the extensive use of multiple-choice questions might have on student's reasoning structure

    Raduta, C. M.

    2013-01-01

    Learning physics is a context dependent process. I consider a broader interdisciplinary problem of where differences in understanding and reasoning arise. I suggest the long run effects a multiple choice based learning system as well as society cultural habits and rules might have on student reasoning structure.

  16. Student-Generated Content: Enhancing Learning through Sharing Multiple-Choice Questions

    Hardy, Judy; Bates, Simon P.; Casey, Morag M.; Galloway, Kyle W.; Galloway, Ross K.; Kay, Alison E.; Kirsop, Peter; McQueen, Heather A.

    2014-01-01

    The relationship between students' use of PeerWise, an online tool that facilitates peer learning through student-generated content in the form of multiple-choice questions (MCQs), and achievement, as measured by their performance in the end-of-module examinations, was investigated in 5 large early-years science modules (in physics, chemistry and…

  17. Validity and Reliability of Scores Obtained on Multiple-Choice Questions: Why Functioning Distractors Matter

    Ali, Syed Haris; Carr, Patrick A.; Ruit, Kenneth G.

    2016-01-01

    Plausible distractors are important for accurate measurement of knowledge via multiple-choice questions (MCQs). This study demonstrates the impact of higher distractor functioning on validity and reliability of scores obtained on MCQs. Free-response (FR) and MCQ versions of a neurohistology practice exam were given to four cohorts of Year 1 medical…

  18. Visual Attention for Solving Multiple-Choice Science Problem: An Eye-Tracking Analysis

    Tsai, Meng-Jung; Hou, Huei-Tse; Lai, Meng-Lung; Liu, Wan-Yi; Yang, Fang-Ying

    2012-01-01

    This study employed an eye-tracking technique to examine students' visual attention when solving a multiple-choice science problem. Six university students participated in a problem-solving task to predict occurrences of landslide hazards from four images representing four combinations of four factors. Participants' responses and visual attention…

  19. Does Correct Answer Distribution Influence Student Choices When Writing Multiple Choice Examinations?

    Carnegie, Jacqueline A.

    2017-01-01

    Summative evaluation for large classes of first- and second-year undergraduate courses often involves the use of multiple choice question (MCQ) exams in order to provide timely feedback. Several versions of those exams are often prepared via computer-based question scrambling in an effort to deter cheating. An important parameter to consider when…

  20. The Incidence of Clueing in Multiple Choice Testbank Questions in Accounting: Some Evidence from Australia

    Ibbett, Nicole L.; Wheldon, Brett J.

    2016-01-01

    In 2014 Central Queensland University (CQU) in Australia banned the use of multiple choice questions (MCQs) as an assessment tool. One of the reasons given for this decision was that MCQs provide an opportunity for students to "pass" by merely guessing their answers. The mathematical likelihood of a student passing by guessing alone can…
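    The likelihood of passing by guessing alone is a binomial tail probability. A sketch for a hypothetical exam (20 questions, 5 options each, 50% pass mark; none of these figures come from the CQU policy):

```python
from math import ceil, comb

def p_pass_by_guessing(n_questions=20, n_options=5, pass_frac=0.5):
    """Binomial tail: probability of answering at least pass_frac of
    n_questions correctly when every answer is a uniform random guess."""
    p = 1 / n_options
    need = ceil(pass_frac * n_questions)
    return sum(comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
               for k in range(need, n_questions + 1))

print(f"{p_pass_by_guessing():.4f}")              # 0.0026 with five options
print(f"{p_pass_by_guessing(n_options=2):.4f}")   # 0.5881 with true/false items
```

With five well-functioning options the pure-guessing pass rate is well under one percent; the risk grows sharply as the effective number of plausible options shrinks.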

  1. The Use of Management and Marketing Textbook Multiple-Choice Questions: A Case Study.

    Hampton, David R.; And Others

    1993-01-01

    Four management and four marketing professors classified multiple-choice questions in four widely adopted introductory textbooks according to the two levels of Bloom's taxonomy of educational objectives: knowledge and intellectual ability and skill. Inaccuracies may cause instructors to select questions that require less thinking than they intend.…

  2. Comparison of performance on multiple-choice questions and open-ended questions in an introductory astronomy laboratory

    Michelle M. Wooten; Adrienne M. Cool; Edward E. Prather; Kimberly D. Tanner

    2014-01-01

    When considering the variety of questions that can be used to measure students' learning, instructors may choose to use multiple-choice questions, which are easier to score than responses to open-ended questions. However, by design, analyses of multiple-choice responses cannot describe all of students' understanding. One method that can be used to learn more about students' learning is the analysis of the open-ended responses students provide when explaining their multiple-choice response. I...

  3. Project Physics Tests 2, Motion in the Heavens.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 2 are presented in this booklet. Included are 70 multiple-choice and 22 problem-and-essay questions. Concepts of motion in the heavens are examined for planetary motions, heliocentric theory, forces exerted on the planets, Kepler's laws, gravitational force, Galileo's work, satellite orbits, Jupiter's…

  4. Project Physics Tests 3, The Triumph of Mechanics.

    Harvard Univ., Cambridge, MA. Harvard Project Physics.

    Test items relating to Project Physics Unit 3 are presented in this booklet. Included are 70 multiple-choice and 20 problem-and-essay questions. Concepts of mechanics are examined on energy, momentum, kinetic theory of gases, pulse analyses, "heat death," water waves, power, conservation laws, normal distribution, thermodynamic laws, and…

  5. Nursing Faculty Decision Making about Best Practices in Test Construction, Item Analysis, and Revision

    Killingsworth, Erin Elizabeth

    2013-01-01

    With the widespread use of classroom exams in nursing education there is a great need for research on current practices in nursing education regarding this form of assessment. The purpose of this study was to explore how nursing faculty members make decisions about using best practices in classroom test construction, item analysis, and revision in…

  6. Sensitivity and specificity of the 3-item memory test in the assessment of post traumatic amnesia.

    Andriessen, T.M.J.C.; Jong, B. de; Jacobs, B.; Werf, S.P. van der; Vos, P.E.

    2009-01-01

    PRIMARY OBJECTIVE: To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). METHODS: Daily

  7. Development of Abbreviated Nine-Item Forms of the Raven's Standard Progressive Matrices Test

    Bilker, Warren B.; Hansen, John A.; Brensinger, Colleen M.; Richard, Jan; Gur, Raquel E.; Gur, Ruben C.

    2012-01-01

    The Raven's Standard Progressive Matrices (RSPM) is a 60-item test for measuring abstract reasoning, considered a nonverbal estimate of fluid intelligence, and often included in clinical assessment batteries and research on patients with cognitive deficits. The goal was to develop and apply a predictive model approach to reduce the number of items…

  8. Reading ability and print exposure: item response theory analysis of the author recognition test.

    Moore, Mariah; Gordon, Peter C

    2015-12-01

    In the author recognition test (ART), participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, and this predictive ability is generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. In this large-scale study (1,012 college student participants), we used item response theory (IRT) to analyze item (author) characteristics in order to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and optimize scoring of the ART. Factor analysis suggested a potential two-factor structure of the ART, differentiating between literary and popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of the time spent encoding words, as measured using eyetracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Furthermore, they show that frequency data can be used to select items of appropriate difficulty, and that frequency data from corpora based on particular time periods and types of texts may allow adaptations of the test for different populations.
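    The two scoring rules discussed above, the standard names-minus-foils score and a variant with a heavier penalty on selected foils, can be sketched as follows; the function name and penalty values are illustrative, not taken from the study.

```python
def art_score(hits, foils, foil_penalty=1.0):
    """Score an Author Recognition Test response sheet.

    Standard scoring uses foil_penalty=1.0 (authors selected minus
    foils selected); a larger penalty discourages indiscriminate
    guessing, as suggested by the IRT analysis.
    """
    return hits - foil_penalty * foils

# Standard scoring: 20 authors and 2 foils selected
assert art_score(20, 2) == 18
# Heavier penalty on foils
assert art_score(20, 2, foil_penalty=2.0) == 16
```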

  9. A Method for the Comparison of Item Selection Rules in Computerized Adaptive Testing

    Barrada, Juan Ramon; Olea, Julio; Ponsoda, Vicente; Abad, Francisco Jose

    2010-01-01

    In a typical study comparing the relative efficiency of two item selection rules in computerized adaptive testing, the common result is that they simultaneously differ in accuracy and security, making it difficult to reach a conclusion on which is the more appropriate rule. This study proposes a strategy to conduct a global comparison of two or…

  10. The Disaggregation of Value-Added Test Scores to Assess Learning Outcomes in Economics Courses

    Walstad, William B.; Wagner, Jamie

    2016-01-01

    This study disaggregates posttest, pretest, and value-added or difference scores in economics into four types of economic learning: positive, retained, negative, and zero. The types are derived from patterns of student responses to individual items on a multiple-choice test. The micro and macro data from the "Test of Understanding in College…

  11. Construction of a valid and reliable test to determine knowledge on ...

    Objective: The construction of a questionnaire, in the format of a test, to determine knowledge on dietary fat of higher-educated young adults. Design: The topics on dietary fat included were in accordance with those tested in other studies. Forty multiple-choice items were drafted as questions and incomplete statements ...

  12. Easy and Informative: Using Confidence-Weighted True-False Items for Knowledge Tests in Psychology Courses

    Dutke, Stephan; Barenberg, Jonathan

    2015-01-01

    We introduce a specific type of item for knowledge tests, confidence-weighted true-false (CTF) items, and review experiences of its application in psychology courses. A CTF item is a statement about the learning content to which students respond whether the statement is true or false, and they rate their confidence level. Previous studies using…

  13. The Dysexecutive Questionnaire advanced: item and test score characteristics, 4-factor solution, and severity classification.

    Bodenburg, Sebastian; Dopslaff, Nina

    2008-01-01

    The Dysexecutive Questionnaire (DEX; Behavioral assessment of the dysexecutive syndrome, 1996) is a standardized instrument to measure possible behavioral changes as a result of the dysexecutive syndrome. Although initially intended only as a qualitative instrument, the DEX has also been used increasingly to address quantitative questions. Until now there have been no more fundamental statistical analyses of the questionnaire's testing quality. The present study is based on an unselected sample of 191 patients with acquired brain injury and reports data on the quality of the items, the reliability, and the factorial structure of the DEX. Item 3 displayed too extreme an item difficulty, whereas item 11 was not sufficiently discriminating. The DEX's reliability in self-rating is r = 0.85. In addition to presenting the statistical values of the tests, a clinical severity classification of the overall scores of the four factors found, and of the questionnaire as a whole, is carried out on the basis of quartile standards.
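    The item-level statistics reported here (item difficulty and discriminating power) are standard classical-test-theory quantities. A minimal sketch, assuming a simplified 0/1 item-response matrix rather than the DEX's actual rating format:

```python
def item_stats(responses):
    """Classical item analysis for a 0/1 response matrix.

    responses[i][j] is 1 if examinee i endorsed/passed item j, else 0.
    Returns per-item difficulty (proportion endorsing) and corrected
    item-total discrimination (correlation with the rest score).
    """
    n, k = len(responses), len(responses[0])
    difficulty = [sum(row[j] for row in responses) / n for j in range(k)]

    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return sxy / (sx * sy) if sx and sy else 0.0

    totals = [sum(row) for row in responses]
    discrimination = [
        corr([row[j] for row in responses],
             [t - row[j] for row, t in zip(responses, totals)])
        for j in range(k)
    ]
    return difficulty, discrimination
```

    An item with difficulty near 0 or 1, or discrimination near 0, is a candidate for revision or removal, as with items 3 and 11 above.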

  14. Prediction of true test scores from observed item scores and ancillary data.

    Haberman, Shelby J; Yao, Lili; Sinharay, Sandip

    2015-05-01

    In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability. © 2015 The British Psychological Society.
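    For a single observed score, the best linear predictor of the true score described above reduces to Kelley's classical formula, which shrinks the observed score toward the group mean in proportion to the test's unreliability. A minimal sketch with invented numbers:

```python
def predicted_true_score(observed, group_mean, reliability):
    """Kelley's formula: the best linear predictor of an examinee's
    true score under classical test theory. With reliability 1 the
    observed score is returned unchanged; with reliability 0 the
    prediction collapses to the group mean.
    """
    return reliability * observed + (1 - reliability) * group_mean

# Invented example: observed 80, group mean 70, reliability 0.75
assert predicted_true_score(80, 70, 0.75) == 77.5
```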

  15. Sensitivity and specificity of the 3-item memory test in the assessment of post traumatic amnesia.

    Andriessen, Teuntje M J C; de Jong, Ben; Jacobs, Bram; van der Werf, Sieberen P; Vos, Pieter E

    2009-04-01

    To investigate how the type of stimulus (pictures or words) and the method of reproduction (free recall or recognition after a short or a long delay) affect the sensitivity and specificity of a 3-item memory test in the assessment of post traumatic amnesia (PTA). Daily testing was performed in 64 consecutively admitted traumatic brain injured patients, 22 orthopedically injured patients and 26 healthy controls until criteria for resolution of PTA were reached. Subjects were randomly assigned to a test with visual or verbal stimuli. Short delay reproduction was tested after an interval of 3-5 minutes, long delay reproduction was tested after 24 hours. Sensitivity and specificity were calculated over the first 4 test days. The 3-word test showed higher sensitivity than the 3-picture test, while specificity of the two tests was equally high. Free recall was a more effortful task than recognition for both patients and controls. In patients, a longer delay between registration and recall resulted in a significant decrease in the number of items reproduced. Presence of PTA is best assessed with a memory test that incorporates the free recall of words after a long delay.
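    The sensitivity and specificity figures reported here are ratios over the diagnostic 2x2 table. A minimal sketch; the counts in the example are invented, not the study's data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): proportion of true cases detected.
    Specificity = TN / (TN + FP): proportion of non-cases correctly
    cleared.
    """
    return tp / (tp + fn), tn / (tn + fp)

# Invented example counts, not taken from the study
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=19, fp=1)
```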

  16. Analysis of the benefits of designing and implementing a virtual didactic model of multiple choice exam and problem-solving heuristic report, for first year engineering students

    Bennun, Leonardo; Santibanez, Mauricio

    2015-01-01

    Improvements in performance and approval rates obtained by first-year engineering students from the University of Concepcion, Chile, were studied once a virtual didactic model of multiple-choice exams was implemented. This virtual learning resource was implemented on the Web ARCO platform and allows training by facing test models comparable in both time and difficulty to those that the students will have to solve during the course. It also provides a feedback mechanism for both: 1) The students, since they c...

  17. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures.

    Cappelleri, Joseph C; Jason Lundy, J; Hays, Ron D

    2014-05-01

    The US Food and Drug Administration's guidance for industry document on patient-reported outcomes (PRO) defines content validity as "the extent to which the instrument measures the concept of interest" (FDA, 2009, p. 12). According to Strauss and Smith (2009), construct validity "is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity" (p. 7). Hence, both qualitative and quantitative information are essential in evaluating the validity of measures. We review classical test theory and item response theory (IRT) approaches to evaluating PRO measures, including frequency of responses to each category of the items in a multi-item scale, the distribution of scale scores, floor and ceiling effects, the relationship between item response options and the total score, and the extent to which hypothesized "difficulty" (severity) order of items is represented by observed responses. If a researcher has little qualitative data and wants to get preliminary information about the content validity of the instrument, then descriptive assessments using classical test theory should be the first step. As the sample size grows during subsequent stages of instrument development, confidence in the numerical estimates from Rasch and other IRT models (as well as those of classical test theory) would also grow. Classical test theory and IRT can be useful in providing a quantitative assessment of items and scales during the content-validity phase of PRO-measure development. Depending on the particular type of measure and the specific circumstances, classical test theory and/or IRT should be considered to help maximize the content validity of PRO measures. Copyright © 2014 Elsevier HS Journals, Inc. All rights reserved.
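    Among the classical-test-theory descriptives listed above, floor and ceiling effects are simply the proportions of respondents scoring at the scale extremes. A minimal sketch:

```python
def floor_ceiling(scores, minimum, maximum):
    """Proportions of respondents at the scale floor and ceiling.

    Large proportions at either extreme suggest the scale cannot
    discriminate among respondents with very low or very high levels
    of the measured construct.
    """
    n = len(scores)
    floor = sum(s == minimum for s in scores) / n
    ceiling = sum(s == maximum for s in scores) / n
    return floor, ceiling

# Invented scores on a hypothetical 0-5 scale
floor, ceiling = floor_ceiling([0, 1, 2, 5, 5], minimum=0, maximum=5)
```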

  18. Item analysis of the Spanish version of the Boston Naming Test with a Spanish speaking adult population from Colombia.

    Kim, Stella H; Strutt, Adriana M; Olabarrieta-Landa, Laiene; Lequerica, Anthony H; Rivera, Diego; De Los Reyes Aragon, Carlos Jose; Utria, Oscar; Arango-Lasprilla, Juan Carlos

    2018-02-23

    The Boston Naming Test (BNT) is a widely used measure of confrontation naming ability that has been criticized for its questionable construct validity for non-English speakers. This study investigated item difficulty and construct validity of the Spanish version of the BNT to assess cultural and linguistic impact on performance. Subjects were 1298 healthy Spanish speaking adults from Colombia. They were administered the 60- and 15-item Spanish version of the BNT. A Rasch analysis was computed to assess dimensionality, item hierarchy, targeting, reliability, and item fit. Both versions of the BNT satisfied requirements for unidimensionality. Although internal consistency was excellent for the 60-item BNT, order of difficulty did not increase consistently with item number and there were a number of items that did not fit the Rasch model. For the 15-item BNT, a total of 5 items changed position on the item hierarchy with 7 poor fitting items. Internal consistency was acceptable. Construct validity of the BNT remains a concern when it is administered to non-English speaking populations. Similar to previous findings, the order of item presentation did not correspond with increasing item difficulty, and both versions were inadequate at assessing high naming ability.

  19. Comparison of Performance on Multiple-Choice Questions and Open-Ended Questions in an Introductory Astronomy Laboratory

    Wooten, Michelle M.; Cool, Adrienne M.; Prather, Edward E.; Tanner, Kimberly D.

    2014-01-01

    When considering the variety of questions that can be used to measure students' learning, instructors may choose to use multiple-choice questions, which are easier to score than responses to open-ended questions. However, by design, analyses of multiple-choice responses cannot describe all of students' understanding. One method that can…

  20. Dual process theory and intermediate effect: are faculty and residents' performance on multiple-choice, licensing exam questions different?

    Dong, Ting; Durning, Steven J; Artino, Anthony R; van der Vleuten, Cees; Holmboe, Eric; Lipner, Rebecca; Schuwirth, Lambert

    2015-04-01

    Clinical reasoning is essential for the practice of medicine. Dual process theory conceptualizes reasoning as falling into two general categories: nonanalytic reasoning (pattern recognition) and analytic reasoning (active comparing and contrasting of alternatives). The debate continues regarding how expert performance develops and how individuals make the best use of analytic and nonanalytic processes. Several investigators have identified the unexpected finding that intermediates tend to perform better on licensing examination items than experts, which has been termed the "intermediate effect." We explored differences between faculty and residents on multiple-choice questions (MCQs) using dual process measures (both reading and answering times) to inform this ongoing debate. Faculty (board-certified internists; experts) and residents (internal medicine interns; intermediates) answered live licensing examination MCQs (U.S. Medical Licensing Examination Step 2 Clinical Knowledge and American Board of Internal Medicine Certifying Examination) while being timed. We conducted repeated-measures analyses of variance to compare the 2 groups on average reading time, answering time, and accuracy on various types of items. Faculty and residents did not differ significantly in reading time [F (1,35) = 0.01, p = 0.93], answering time [F (1,35) = 0.60, p = 0.44], or accuracy [F (1,35) = 0.24, p = 0.63] regardless of easy or hard items. Dual process theory was not evidenced in this study. However, this lack of difference between faculty and residents may have been affected by the small sample size of participants, and MCQs may not reflect how physicians make decisions in actual practice settings. Reprint & Copyright © 2015 Association of Military Surgeons of the U.S.

  1. A more general model for testing measurement invariance and differential item functioning.

    Bauer, Daniel J

    2017-09-01

    The evaluation of measurement invariance is an important step in establishing the validity and comparability of measurements across individuals. Most commonly, measurement invariance has been examined using one of two primary latent variable modeling approaches: the multiple groups model or the multiple-indicator multiple-cause (MIMIC) model. Both approaches offer opportunities to detect differential item functioning within multi-item scales, and thereby to test measurement invariance, but both approaches also have significant limitations. The multiple groups model allows one to examine the invariance of all model parameters but only across levels of a single categorical individual difference variable (e.g., ethnicity). In contrast, the MIMIC model permits both categorical and continuous individual difference variables (e.g., sex and age) but permits only a subset of the model parameters to vary as a function of these characteristics. The current article argues that moderated nonlinear factor analysis (MNLFA) constitutes an alternative, more flexible model for evaluating measurement invariance and differential item functioning. We show that the MNLFA subsumes and combines the strengths of the multiple group and MIMIC models, allowing for a full and simultaneous assessment of measurement invariance and differential item functioning across multiple categorical and/or continuous individual difference variables. The relationships between the MNLFA model and the multiple groups and MIMIC models are shown mathematically and via an empirical demonstration. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  2. Why Students Answer TIMSS Science Test Items the Way They Do

    Harlow, Ann; Jones, Alister

    2004-04-01

    The purpose of this study was to explore how Year 8 students answered Third International Mathematics and Science Study (TIMSS) questions and whether the test questions represented the scientific understanding of these students. One hundred and seventy-seven students were tested using written test questions taken from the science test used in the Third International Mathematics and Science Study. The degree to which a sample of 38 children represented their understanding of the topics in a written test compared to the level of understanding that could be elicited by an interview is presented in this paper. In exploring student responses in the interview situation this study hoped to gain some insight into the science knowledge that students held and whether or not the test items had been able to elicit this knowledge successfully. We question the usefulness and quality of data from large-scale summative assessments on their own to represent student scientific understanding and conclude that large scale written test items, such as TIMSS, on their own are not a valid way of exploring students' understanding of scientific concepts. Considerable caution is therefore needed in exploiting the outcomes of international achievement testing when considering educational policy changes or using TIMSS data on their own to represent student understanding.

  3. Redefining diagnostic symptoms of depression using Rasch analysis: testing an item bank suitable for DSM-V and computer adaptive testing.

    Mitchell, Alex J; Smith, Adam B; Al-salihy, Zerak; Rahim, Twana A; Mahmud, Mahmud Q; Muhyaldin, Asma S

    2011-10-01

    We aimed to redefine the optimal self-report symptoms of depression suitable for creation of an item bank that could be used in computer adaptive testing or to develop a simplified screening tool for DSM-V. Four hundred subjects (200 patients with primary depression and 200 non-depressed subjects) living in Iraqi Kurdistan were interviewed. The Mini International Neuropsychiatric Interview (MINI) was used to define the presence of major depression (DSM-IV criteria). We examined symptoms of depression using four well-known scales delivered in Kurdish. The Partial Credit Model was applied to each instrument. Common-item equating was subsequently used to create an item bank and differential item functioning (DIF) explored for known subgroups. A symptom-level Rasch analysis reduced the original 45 items to 24 after the exclusion of 21 misfitting items. A further six items (CESD13 and CESD17, HADS-D4, HADS-D5 and HADS-D7, and CDSS3 and CDSS4) were removed due to misfit as the items were added together to form the item bank, and two items were subsequently removed following the DIF analysis by diagnosis (CESD20 and CDSS9, both of which were harder to endorse for women). Therefore the remaining optimal item bank consisted of 17 items and produced an area under the curve (AUC) of 0.987. Using a bank restricted to the optimal nine items revealed only minor loss of accuracy (AUC = 0.989, sensitivity 96%, specificity 95%). Finally, when restricted to only four items, accuracy was still high (AUC = 0.976; sensitivity 93%, specificity 96%). An item bank of 17 items may be useful in computer adaptive testing and nine or even four items may be used to develop a simplified screening tool for DSM-V major depressive disorder (MDD). Further examination of this item bank should be conducted in different cultural settings.
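    In the dichotomous case, the Rasch family of models applied here reduces to a one-parameter logistic in the difference between person ability and item difficulty. A minimal sketch:

```python
import math

def rasch_probability(theta, beta):
    """Dichotomous Rasch model: probability that a person with
    ability theta endorses an item with difficulty beta.
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# A person whose ability equals the item difficulty endorses the
# item with probability 0.5
assert rasch_probability(0.0, 0.0) == 0.5
```

    Item banking and computer adaptive testing exploit this model by selecting, at each step, an item whose difficulty is close to the current ability estimate.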

  4. The Effects of Item Format and Cognitive Domain on Students' Science Performance in TIMSS 2011

    Liou, Pey-Yan; Bulut, Okan

    2017-12-01

    The purpose of this study was to examine eighth-grade students' science performance in terms of two test design components, item format, and cognitive domain. The portion of Taiwanese data came from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. The item difficulty analysis was initially applied to show the proportion of correct items. A regression-based cumulative link mixed modeling (CLMM) approach was further utilized to estimate the impact of item format, cognitive domain, and their interaction on the students' science scores. The results of the proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that the reasoning cognitive domain items were more difficult compared to the items in the applying and knowing domains. In terms of the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and the interaction term were included together, the directions and magnitudes of the predictors on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirical-based evidence for test developers, teachers, and stakeholders to be aware of the differential function of item format, cognitive domain, and their interaction in students' science performance.

  5. On the Relationship between Classical Test Theory and Item Response Theory: From One to the Other and Back

    Raykov, Tenko; Marcoulides, George A.

    2016-01-01

    The frequently neglected and often misunderstood relationship between classical test theory and item response theory is discussed for the unidimensional case with binary measures and no guessing. It is pointed out that popular item response models can be directly obtained from classical test theory-based models by accounting for the discrete…

  6. Using Classical Test Theory and Item Response Theory to Evaluate the LSCI

    Schlingman, Wayne M.; Prather, E. E.; Collaboration of Astronomy Teaching Scholars CATS

    2011-01-01

    Analyzing the data from the recent national study using the Light and Spectroscopy Concept Inventory (LSCI), this project uses both Classical Test Theory (CTT) and Item Response Theory (IRT) to investigate the LSCI itself in order to better understand what it is actually measuring. We use Classical Test Theory to form a framework of results that can be used to evaluate the effectiveness of individual questions at measuring differences in student understanding and provide further insight into the prior results presented from this data set. In the second phase of this research, we use Item Response Theory to form a theoretical model that generates parameters accounting for a student's ability, a question's difficulty, and estimate the level of guessing. The combined results from our investigations using both CTT and IRT are used to better understand the learning that is taking place in classrooms across the country. The analysis will also allow us to evaluate the effectiveness of individual questions and determine whether the item difficulties are appropriately matched to the abilities of the students in our data set. These results may require that some questions be revised, motivating the need for further development of the LSCI. This material is based upon work supported by the National Science Foundation under Grant No. 0715517, a CCLI Phase III Grant for the Collaboration of Astronomy Teaching Scholars (CATS). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
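    The IRT model sketched in this abstract, with parameters for ability, item difficulty, and guessing, corresponds to the three-parameter logistic model. A minimal illustration (parameter values are invented):

```python
import math

def three_pl(theta, a, b, c):
    """Three-parameter logistic IRT model: probability of a correct
    response given ability theta, item discrimination a, item
    difficulty b, and lower asymptote (pseudo-guessing) c.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is halfway between c and 1
assert three_pl(0.0, a=1.0, b=0.0, c=0.25) == 0.625
```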

  7. Science Library of Test Items. Volume Nine. Mastery Testing Programme. [Mastery Tests Series 1.] Tests M1-M13.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of the first 13 tests are provided. Skills to be tested include: (1) reading a table; (2) using a biological key; (3) identifying chemical symbols; (4) identifying parts of a human body; (5) reading a line graph; (6) identifying electronic and…

  8. The Linear Logistic Test Model (LLTM as the methodological foundation of item generating rules for a new verbal reasoning test

    HERBERT POINSTINGL

    2009-06-01

    Based on the demand for new verbal reasoning tests to enrich the psychological test inventory, a pilot version of a new test was analysed: the 'Family Relation Reasoning Test' (FRRT; Poinstingl, Kubinger, Skoda & Schechtner, forthcoming), in which several basic cognitive operations (logical rules) have been embedded. Given family relationships of varying complexity embedded in short stories, testees had to logically conclude the correct relationship between two individuals within a family. Using empirical data, the linear logistic test model (LLTM; Fischer, 1972), a special case of the Rasch model, was used to test the construct validity of the test: the hypothetically assumed basic cognitive operations had to explain the Rasch model's item difficulty parameters. After being shaped in the LLTM's matrix of weights ((q_ij)), none of these operations were corroborated by means of Andersen's Likelihood Ratio Test.
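    The LLTM's core idea is to decompose each Rasch item difficulty into a weighted sum of basic-operation parameters via the matrix of weights ((q_ij)). A minimal sketch; the weights and parameter values below are invented for illustration:

```python
def lltm_item_difficulty(weights, basic_parameters, c=0.0):
    """LLTM decomposition: an item's Rasch difficulty is modeled as a
    weighted sum of basic cognitive operation parameters.

    weights: the q_ij entries for one item (how often each operation
    is required); basic_parameters: the eta_j difficulty of each
    operation; c: an optional normalization constant.
    """
    return sum(q * eta for q, eta in zip(weights, basic_parameters)) + c

# Hypothetical item requiring operation 1 once and operation 3 twice
assert lltm_item_difficulty([1, 0, 2], [0.5, 0.3, 0.125]) == 0.75
```

    Construct validity is then checked by testing whether these reconstructed difficulties reproduce the freely estimated Rasch difficulties, e.g. with Andersen's Likelihood Ratio Test.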

  9. Science Library of Test Items. Volume Ten. Mastery Testing Programme. [Mastery Tests Series 2.] Tests M14-M26.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 14 through 26 include: (14) calculating an average; (15) identifying parts of the scientific method; (16) reading a geological map; (17) identifying elements, mixtures and compounds; (18) using Ohm's law in calculation; (19) interpreting…

  10. Science Library of Test Items. Volume Twelve. Mastery Testing Programme. [Mastery Tests Series 4.] Tests M39-M50.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 39 through 50 include: (39) using a code; (40) naming the parts of a microscope; (41) calculating density and predicting flotation; (42) estimating metric length; (43) using SI symbols; (44) using s=vt; (45) applying a novel theory; (46)…

  11. Science Library of Test Items. Volume Thirteen. Mastery Testing Program. [Mastery Tests Series 5.] Tests M51-M65.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 51 through 65 include: (51) interpreting atomic and mass numbers; (52) extrapolating from a geological map; (53) matching geological sections and maps; (54) identifying parts of the human eye; (55) identifying the functions of parts of a…

  12. Science Library of Test Items. Volume Eleven. Mastery Testing Programme. [Mastery Tests Series 3.] Tests M27-M38.

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 27 through 38 include: (27) reading a grid plan; (28) identifying common invertebrates; (29) characteristics of invertebrates; (30) identifying elements; (31) using scientific notation part I; (32) classifying minerals; (33) predicting the…

  13. A Comparison of the Approaches of Generalizability Theory and Item Response Theory in Estimating the Reliability of Test Scores for Testlet-Composed Tests

    Lee, Guemin; Park, In-Yong

    2012-01-01

    Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several…

  14. Using Two-Tier Test to Identify Primary Students' Conceptual Understanding and Alternative Conceptions in Acid Base

    Bayrak, Beyza Karadeniz

    2013-01-01

    The purpose of this study was to identify primary students' conceptual understanding and alternative conceptions in acid-base. For this reason, a 15-item two-tier multiple-choice test was administered to 56 eighth-grade students in the spring semester of 2009-2010. Data for this study were collected using a conceptual understanding scale prepared to include…

  15. Aufwandsanalyse für computerunterstützte Multiple-Choice Papierklausuren [Cost analysis for computer-supported multiple-choice paper examinations]

    Mandel, Alexander

    2011-11-01

    [english] Introduction: Multiple-choice examinations are still fundamental for assessment in medical degree programs. In addition to content-related research, the optimization of the technical procedure is an important question. Medical examiners face three options: paper-based examinations with or without computer support, or completely electronic examinations. Critical aspects are the effort for formatting, the logistic effort during the actual examination, the quality, promptness and effort of the correction, the time for making the documents available for inspection by the students, and the statistical analysis of the examination results. Methods: For three semesters, a computer program for the input and formatting of MC questions in medical and other paper-based examinations has been used and continuously improved at Wuerzburg University. In the winter semester (WS) 2009/10 eleven, in the summer semester (SS) 2010 twelve, and in WS 2010/11 thirteen medical examinations were conducted with the program and automatically evaluated. For the last two semesters the remaining manual workload was recorded. Results: The cost of the formatting and the subsequent analysis, including adjustments of the analysis, of an average examination with about 140 participants and about 35 questions was 5-7 hours for exams without complications in WS 2009/10, about 2 hours in SS 2010, and about 1.5 hours in WS 2010/11. Including exams with complications, the average time was about 3 hours per exam in SS 2010 and 2.67 hours in WS 2010/11. Discussion: For conventional multiple-choice exams, the computer-based formatting and evaluation of paper-based exams offers a significant time reduction for lecturers in comparison with the manual correction of paper-based exams, and compared to purely electronically conducted exams it needs a much simpler technological infrastructure and fewer staff during the exam.

  16. Applications of Multidimensional Item Response Theory Models with Covariates to Longitudinal Test Data. Research Report. ETS RR-16-21

    Fu, Jianbin

    2016-01-01

    The multidimensional item response theory (MIRT) models with covariates proposed by Haberman and implemented in the "mirt" program provide a flexible way to analyze data based on item response theory. In this report, we discuss applications of the MIRT models with covariates to longitudinal test data to measure skill differences at the…

  17. Evaluating the validity of the Work Role Functioning Questionnaire (Canadian French version) using classical test theory and item response theory.

    Hong, Quan Nha; Coutu, Marie-France; Berbiche, Djamal

    2017-01-01

    The Work Role Functioning Questionnaire (WRFQ) was developed to assess workers' perceived ability to perform job demands and is used to monitor presenteeism. Still, few studies on its validity can be found in the literature. The purpose of this study was to assess the items and factorial composition of the Canadian French version of the WRFQ (WRFQ-CF). Two measurement approaches were used to test the WRFQ-CF: Classical Test Theory (CTT) and non-parametric Item Response Theory (IRT). A total of 352 completed questionnaires were analyzed. Four-factor and three-factor models were tested and showed good fit with 14 items (Root Mean Square Error of Approximation (RMSEA) = 0.06, Standardized Root Mean Square Residual (SRMR) = 0.04, Bentler Comparative Fit Index (CFI) = 0.98) and with 17 items (RMSEA = 0.059, SRMR = 0.048, CFI = 0.98), respectively. Using IRT, 13 problematic items were identified, of which 9 were common with CTT. This study tested different models, with fewer problematic items found in a three-factor model. Using non-parametric IRT and CTT for item purification gave complementary results. IRT is still scarcely used and can be an interesting alternative method to enhance the quality of a measurement instrument. More studies are needed on the WRFQ-CF to refine its items and factorial composition.

  18. Using Cochran's Z Statistic to Test the Kernel-Smoothed Item Response Function Differences between Focal and Reference Groups

    Zheng, Yinggan; Gierl, Mark J.; Cui, Ying

    2010-01-01

    This study combined the kernel smoothing procedure and a nonparametric differential item functioning statistic--Cochran's Z--to statistically test the difference between the kernel-smoothed item response functions for reference and focal groups. Simulation studies were conducted to investigate the Type I error and power of the proposed…
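The kernel-smoothing half of the procedure can be illustrated with a minimal Nadaraya-Watson estimate of an item response function; the Gaussian kernel, bandwidth, and toy data below are assumptions for illustration, not the study's settings.

```python
import math

def kernel_smooth_irf(thetas, responses, grid, h=0.3):
    """Nadaraya-Watson estimate of P(correct | theta) on a grid of ability
    points, using a Gaussian kernel with bandwidth h (illustrative values)."""
    def k(u):
        return math.exp(-0.5 * u * u)
    irf = []
    for t in grid:
        w = [k((t - th) / h) for th in thetas]
        irf.append(sum(wi * r for wi, r in zip(w, responses)) / sum(w))
    return irf

# Toy data: higher-ability examinees answer the item correctly more often.
thetas    = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
responses = [0,    0,    0,    1,   1,   1,   1]
grid = [-2.0, 0.0, 2.0]
smoothed = kernel_smooth_irf(thetas, responses, grid)
```

Comparing two such smoothed curves (reference vs. focal group) at each grid point is the ingredient the proposed statistic then tests.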

  19. Adaptation and validation into Portuguese language of the six-item cognitive impairment test (6CIT).

    Apóstolo, João Luís Alves; Paiva, Diana Dos Santos; Silva, Rosa Carla Gomes da; Santos, Eduardo José Ferreira Dos; Schultz, Timothy John

    2017-07-25

    The six-item cognitive impairment test (6CIT) is a brief cognitive screening tool that can be administered to older people in 2-3 min. The aim was to adapt the 6CIT to European Portuguese and determine its psychometric properties based on a sample recruited from several contexts (nursing homes, universities for older people, day centres, primary health care units). The original 6CIT was translated into Portuguese and the draft Portuguese version (6CIT-P) was back-translated and piloted. The accuracy of the 6CIT-P was assessed by comparison with the Portuguese Mini-Mental State Examination (MMSE). A convenience sample of 550 older people from various geographical locations in the north and centre of the country was used. The test-retest reliability coefficient was high (r = 0.95). The 6CIT-P also showed good internal consistency (α = 0.88), and corrected item-total correlations ranged between 0.32 and 0.90. Total 6CIT-P and MMSE scores were strongly correlated. The proposed 6CIT-P threshold for cognitive impairment is ≥10 in the Portuguese population, which gives sensitivity of 82.78% and specificity of 84.84%. The accuracy of the 6CIT-P, as measured by the area under the ROC curve, was 0.91. The 6CIT-P has high reliability and validity and is accurate when used to screen for cognitive impairment.
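The reported sensitivity and specificity at the ≥10 cutoff follow from a simple 2x2 tabulation against the reference standard. A minimal sketch, on invented scores and labels (not the study's data):

```python
# Illustrative check of how a screening cutoff yields sensitivity and
# specificity. Data are made up; the >= 10 threshold mirrors the one
# reported for the 6CIT-P.

def sens_spec(scores, impaired, cutoff=10):
    """scores: screening scores; impaired: reference labels (True = case)."""
    tp = sum(s >= cutoff and y for s, y in zip(scores, impaired))
    fn = sum(s < cutoff and y for s, y in zip(scores, impaired))
    tn = sum(s < cutoff and not y for s, y in zip(scores, impaired))
    fp = sum(s >= cutoff and not y for s, y in zip(scores, impaired))
    return tp / (tp + fn), tn / (tn + fp)

scores   = [2, 4, 9, 10, 12, 15, 8, 11]
impaired = [False, False, False, True, True, True, True, False]
sensitivity, specificity = sens_spec(scores, impaired)
```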

  20. Step by Step: Biology Undergraduates’ Problem-Solving Procedures during Multiple-Choice Assessment

    Prevost, Luanna B.; Lemons, Paula P.

    2016-01-01

    This study uses the theoretical framework of domain-specific problem solving to explore the procedures students use to solve multiple-choice problems about biology concepts. We designed several multiple-choice problems and administered them on four exams. We trained students to produce written descriptions of how they solved the problem, and this allowed us to systematically investigate their problem-solving procedures. We identified a range of procedures and organized them as domain general, domain specific, or hybrid. We also identified domain-general and domain-specific errors made by students during problem solving. We found that students use domain-general and hybrid procedures more frequently when solving lower-order problems than higher-order problems, while they use domain-specific procedures more frequently when solving higher-order problems. Additionally, the more domain-specific procedures students used, the higher the likelihood that they would answer the problem correctly, up to five procedures. However, if students used just one domain-general procedure, they were as likely to answer the problem correctly as if they had used two to five domain-general procedures. Our findings provide a categorization scheme and framework for additional research on biology problem solving and suggest several important implications for researchers and instructors. PMID:27909021

  1. An intuitive graphical webserver for multiple-choice protein sequence search.

    Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince

    2014-04-10

    Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" has become a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, forming the query on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple-choice queries by simply drawing the residues to their positions; if more than one residue is drawn to the same position, they will be neatly stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes it possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org. Copyright © 2014 Elsevier B.V. All rights reserved.

  2. The Instrument Implementation of Two-tier Multiple Choice to Analyze Students’ Science Process Skill Profile

    Sukarmin Sukarmin

    2018-01-01

    Full Text Available This research aims to analyze the profile of students' science process skills (SPS) using a two-tier multiple-choice instrument. This is a descriptive study that describes the profile of students' SPS. Subjects were 10th-grade students from high-, medium- and low-categorized schools. The two-tier multiple-choice instrument consists of 30 questions that contain indicators of SPS, namely formulating a hypothesis, designing an experiment, analyzing data, applying a concept, communicating, and drawing a conclusion. The results and analysis show that: 1) the average indicator achievement at the high-categorized school was 74.55% for formulating a hypothesis, 74.89% for designing an experiment, 67.89% for analyzing data, 52.89% for applying a concept, 80.22% for communicating, and 76% for drawing a conclusion; 2) at the medium-categorized school, it was 53.47% for formulating a hypothesis, 59.86% for designing an experiment, 42.22% for analyzing data, 33.19% for applying a concept, 76.25% for communicating, and 61.53% for drawing a conclusion; 3) at the low-categorized school, it was 51% for formulating a hypothesis, 55.17% for designing an experiment, 39.17% for analyzing data, 35.83% for applying a concept, 58.83% for communicating, and 58% for drawing a conclusion.

  3. Analyzing Item Generation with Natural Language Processing Tools for the "TOEIC"® Listening Test. Research Report. ETS RR-17-52

    Yoon, Su-Youn; Lee, Chong Min; Houghton, Patrick; Lopez, Melissa; Sakano, Jennifer; Loukina, Anastasia; Krovetz, Bob; Lu, Chi; Madani, Nitin

    2017-01-01

    In this study, we developed assistive tools and resources to support TOEIC® Listening test item generation. There has recently been an increased need for a large pool of items for these tests. This need has, in turn, inspired efforts to increase the efficiency of item generation while maintaining the quality of the created items. We aimed to…

  4. Do Self Concept Tests Test Self Concept? An Evaluation of the Validity of Items on the Piers Harris and Coopersmith Measures.

    Lynch, Mervin D.; Chaves, John

    Items from the Piers-Harris and Coopersmith self-concept tests were evaluated against independent measures of three self-constructs: idealized, empathic, and worth. Construct measurements were obtained with the semantic differential and the D statistic. Ratings were obtained from 381 children, grades 4-6. For each test, item ratings and construct measures…

  5. Conditioning factors of test-taking engagement in PIAAC: an exploratory IRT modelling approach considering person and item characteristics

    Frank Goldhammer

    2017-11-01

    Full Text Available Background: A potential problem of low-stakes large-scale assessments such as the Programme for the International Assessment of Adult Competencies (PIAAC) is low test-taking engagement. The present study pursued two goals in order to better understand conditioning factors of test-taking disengagement: First, a model-based approach was used to investigate whether item indicators of disengagement constitute a continuous latent person variable by domain. Second, the effects of person and item characteristics were jointly tested using explanatory item response models. Methods: Analyses were based on the Canadian sample of Round 1 of PIAAC, with N = 26,683 participants completing test items in the domains of literacy, numeracy, and problem solving. Binary item disengagement indicators were created by means of item response time thresholds. Results: The results showed that disengagement indicators define a latent dimension by domain. Disengagement increased with lower educational attainment, lower cognitive skills, and when the test language was not the participant's native language. Gender did not exert any effect on disengagement, while age had a positive effect for problem solving only. An item's location in the second of two assessment modules was positively related to disengagement, as was item difficulty. The latter effect was negatively moderated by cognitive skill, suggesting that poor test-takers are especially likely to disengage with more difficult items. Conclusions: The negative effect of cognitive skill, the positive effect of item difficulty, and their negative interaction effect support the assumption that disengagement is the outcome of individual expectations about success (informed disengagement).
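The response-time thresholding that yields the binary disengagement indicators can be sketched as follows; the item names, thresholds, and times are hypothetical, not PIAAC values.

```python
# Sketch of deriving binary disengagement indicators from item response
# times, as described in the record: a response faster than an item-specific
# threshold is flagged as disengaged. All numbers are invented.

def disengagement_indicators(response_times, thresholds):
    """Return {item: 1 if the response was faster than the threshold, else 0}."""
    return {item: int(rt < thresholds[item])
            for item, rt in response_times.items()}

thresholds = {"lit1": 5.0, "lit2": 8.0, "num1": 6.0}   # seconds, hypothetical
times      = {"lit1": 2.1, "lit2": 30.4, "num1": 4.0}

flags = disengagement_indicators(times, thresholds)
rate = sum(flags.values()) / len(flags)   # person-level disengagement rate
```

These per-item binary flags are then what the latent-variable and explanatory item response models take as input.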

  6. Detection of advance item knowledge using response times in computer adaptive testing

    Meijer, R.R.; Sotaridona, Leonardo

    2006-01-01

    We propose a new method for detecting item preknowledge in a CAT based on an estimate of “effective response time” for each item. Effective response time is defined as the time required for an individual examinee to answer an item correctly. An unusually short response time relative to the expected

  7. The Social Attribution Task-Multiple Choice (SAT-MC): A Psychometric and Equivalence Study of an Alternate Form.

    Johannesen, Jason K; Lurie, Jessica B; Fiszdon, Joanna M; Bell, Morris D

    2013-01-01

    The Social Attribution Task-Multiple Choice (SAT-MC) uses a 64-second video of geometric shapes set in motion to portray themes of social relatedness and intentions. Considered a test of "Theory of Mind," the SAT-MC assesses implicit social attribution formation while reducing verbal and basic cognitive demands required of other common measures. We present a comparability analysis of the SAT-MC and the new SAT-MC-II, an alternate form created for repeat testing, in a university sample (n = 92). Score distributions and patterns of association with external validation measures were nearly identical between the two forms, with convergent and discriminant validity supported by association with affect recognition ability and lack of association with basic visual reasoning. Internal consistency of the SAT-MC-II was superior (alpha = .81) to the SAT-MC (alpha = .56). Results support the use of SAT-MC and new SAT-MC-II as equivalent test forms. Demonstrating relatively higher association to social cognitive than basic cognitive abilities, the SAT-MC may provide enhanced sensitivity as an outcome measure of social cognitive intervention trials.
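The internal-consistency figures quoted (alpha = .81 vs. .56) are Cronbach's alpha, which can be computed directly from an item-by-respondent score matrix; the toy data below are illustrative only.

```python
# Minimal computation of Cronbach's alpha, the internal-consistency index
# reported for the SAT-MC forms, on a small invented item-score matrix.

def cronbach_alpha(items):
    """items: list of per-item score lists (same respondents in each)."""
    k = len(items)                       # number of items
    n = len(items[0])                    # number of respondents
    def var(xs):                         # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(var(it) for it in items)
    return k / (k - 1) * (1 - item_var_sum / var(totals))

items = [
    [1, 0, 1, 1, 0],   # item 1 scores for five respondents
    [1, 0, 1, 0, 0],   # item 2
    [1, 1, 1, 1, 0],   # item 3
]
alpha = cronbach_alpha(items)
```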

  8. Concreteness effects in short-term memory: a test of the item-order hypothesis.

    Roche, Jaclynn; Tolan, G Anne; Tehan, Gerald

    2011-12-01

    The following experiments explore word length and concreteness effects in short-term memory within an item-order processing framework. This framework asserts order memory is better for those items that are relatively easy to process at the item level. However, words that are difficult to process benefit at the item level for increased attention/resources being applied. The prediction of the model is that differential item and order processing can be detected in episodic tasks that differ in the degree to which item or order memory are required by the task. The item-order account has been applied to the word length effect such that there is a short word advantage in serial recall but a long word advantage in item recognition. The current experiment considered the possibility that concreteness effects might be explained within the same framework. In two experiments, word length (Experiment 1) and concreteness (Experiment 2) are examined using forward serial recall, backward serial recall, and item recognition. These results for word length replicate previous studies showing the dissociation in item and order tasks. The same was not true for the concreteness effect. In all three tasks concrete words were better remembered than abstract words. The concreteness effect cannot be explained in terms of an item-order trade off. PsycINFO Database Record (c) 2011 APA, all rights reserved.

  9. Determination of radionuclides in environmental test items at CPHR: traceability and uncertainty calculation.

    Carrazana González, J; Fernández, I M; Capote Ferrera, E; Rodríguez Castro, G

    2008-11-01

    Information about how the laboratory of the Centro de Protección e Higiene de las Radiaciones (CPHR), Cuba, establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed.
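A minimal sketch of the Kragten spreadsheet approach mentioned here: each input is shifted by its standard uncertainty, the measurement function is re-evaluated, and the shifted differences are combined in quadrature. The activity model and all numbers below are invented, not the CPHR procedure.

```python
import math

def kragten_uncertainty(f, values, uncerts):
    """Combined standard uncertainty of f(**values) by Kragten's numerical
    method: one re-evaluation per input, shifted by its uncertainty."""
    y0 = f(**values)
    contributions = {}
    for name, u in uncerts.items():
        shifted = dict(values, **{name: values[name] + u})
        contributions[name] = f(**shifted) - y0
    return math.sqrt(sum(c * c for c in contributions.values())), contributions

def activity(N, eps, t):
    # Hypothetical model: net counts / (detection efficiency * live time)
    return N / (eps * t)

values  = {"N": 10000.0, "eps": 0.25, "t": 3600.0}
uncerts = {"N": 100.0, "eps": 0.01, "t": 1.0}
u_c, contrib = kragten_uncertainty(activity, values, uncerts)
# 'contrib' exposes the major contributor, here the efficiency term
```

The per-input contributions are what make the Kragten layout attractive: the dominant uncertainty source is read off directly, which is exactly the comparison against the partial-derivative method that the record describes.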

  10. Determination of radionuclides in environmental test items at CPHR: Traceability and uncertainty calculation

    Carrazana Gonzalez, J.; Fernandez, I.M.; Capote Ferrera, E.; Rodriguez Castro, G.

    2008-01-01

    Information about how the laboratory of Centro de Proteccion e Higiene de las Radiaciones (CPHR), Cuba establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten-spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed

  11. Assessing the discriminating power of item and test scores in the linear factor-analysis model

    Pere J. Ferrando

    2012-01-01

    Full Text Available Rigorous, model-based proposals for studying the imprecise concept of "discriminating power" are scarce and generally limited to nonlinear models for binary items. This article proposes a general framework for assessing the discriminating power of scores on items and tests calibrated with the common-factor model. The proposal is organized around three criteria: (a) the type of score, (b) the range of discrimination, and (c) the specific aspect being assessed. Within the proposed framework, (a) relations among 16 measures, of which 6 appear to be new, are discussed, and (b) the relations between them are studied. The usefulness of the proposal in psychometric applications that use the factor model is illustrated with an empirical example.

  12. A Case Study on an Item Writing Process: Use of Test Specifications, Nature of Group Dynamics, and Individual Item Writers' Characteristics

    Kim, Jiyoung; Chi, Youngshin; Huensch, Amanda; Jun, Heesung; Li, Hongli; Roullion, Vanessa

    2010-01-01

    This article discusses a case study on an item writing process that reflects on our practical experience in an item development project. The purpose of the article is to share our lessons from the experience, aiming to demystify the item-writing process. The study investigated three issues that naturally emerged during the project: how item writers use…

  13. Solving the Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem by Dynamic Programming

    Christensen, Tue; Andersen, Kim Allan; Klose, Andreas

    2013-01-01

    This paper considers a minimum-cost network flow problem in a bipartite graph with a single sink. The transportation costs exhibit a staircase cost structure because such types of transportation cost functions are often found in practice. We present a dynamic programming algorithm for solving this so-called single-sink, fixed-charge, multiple-choice transportation problem exactly. The method exploits heuristics and lower bounds to peg binary variables, improve bounds on flow variables, and reduce the state-space variable. In this way, the dynamic programming method is able to solve large instances with up to 10,000 nodes and 10 different transportation modes in a few seconds, much less time than required by a widely used mixed-integer programming solver and other methods proposed in the literature for this problem.
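A toy dynamic program in the spirit of the one described, with the state being the demand covered so far and a per-supplier staircase cost (a fixed charge per started batch plus a unit cost); it omits the paper's pegging heuristics and bound tightening, and all data are invented.

```python
import math

def staircase_cost(q, batch, fixed, unit):
    """Fixed charge per started batch plus a linear unit cost."""
    if q == 0:
        return 0.0
    return math.ceil(q / batch) * fixed + unit * q

def min_cost(suppliers, demand):
    """suppliers: list of (capacity, batch, fixed, unit) tuples. Returns the
    minimum cost of shipping exactly `demand` units to the single sink."""
    INF = float("inf")
    dp = [0.0] + [INF] * demand          # dp[d] = min cost to cover d units
    for cap, batch, fixed, unit in suppliers:
        new = [INF] * (demand + 1)
        for d in range(demand + 1):
            if dp[d] == INF:
                continue
            # Try every feasible quantity q from this supplier.
            for q in range(min(cap, demand - d) + 1):
                c = dp[d] + staircase_cost(q, batch, fixed, unit)
                if c < new[d + q]:
                    new[d + q] = c
        dp = new
    return dp[demand]

suppliers = [(6, 3, 10.0, 1.0),   # cheap units, coarse batches
             (5, 1, 2.0, 2.0)]    # fine-grained batches, pricier units
best = min_cost(suppliers, 8)
```

The exhaustive inner loop is what the paper's variable pegging and flow bounds prune away, which is how instances far larger than this toy become solvable in seconds.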

  14. Solving the Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem by Dynamic Programming

    Rauff Lind Christensen, Tue; Klose, Andreas; Andersen, Kim Allan

    The Single-Sink, Fixed-Charge, Multiple-Choice Transportation Problem (SSFCMCTP) is a problem with versatile applications. This problem is a generalization of the Single-Sink, Fixed-Charge Transportation Problem (SSFCTP), which has a fixed-charge, linear cost structure. However, in at least two important aspects of supplier selection, an important application of the SSFCTP, this does not reflect the real-life situation. First, transportation costs faced by many companies are in fact piecewise linear. Secondly, when suppliers offer discounts, either incremental or all-unit discounts, such savings are neglected in the SSFCTP. The SSFCMCTP overcomes this problem by incorporating a staircase cost structure in the cost function instead of the usual one used in the SSFCTP. We present a dynamic programming algorithm for the resulting problem. To enhance the performance of the generic algorithm a number...

  15. Measuring primary school teachers' pedagogical content knowledge in technology education with a multiple choice test

    Rohaan, E.J.; Taconis, R.; Jochems, W.M.G.; Fatih Tasar, M.; Cakankci, G.; Akgul, E.

    2009-01-01

    Pedagogical content knowledge (PCK) is a crucial part of a teacher’s knowledge base for teaching. Studies in the field of technology education for primary schools showed that this domain of teacher knowledge is related to pupils’ increased learning, motivation, and interest. The common methods to

  17. Developing a Numerical Ability Test for Students of Education in Jordan: An Application of Item Response Theory

    Abed, Eman Rasmi; Al-Absi, Mohammad Mustafa; Abu shindi, Yousef Abdelqader

    2016-01-01

    The purpose of the present study is to develop a test to measure the numerical ability of students of education. The sample of the study consisted of (504) students from 8 universities in Jordan. The final draft of the test contains 45 items distributed among 5 dimensions. The results revealed acceptable psychometric properties of the test;…

  18. Diagnostic accuracy of a two-item Drug Abuse Screening Test (DAST-2).

    Tiet, Quyen Q; Leyva, Yani E; Moos, Rudolf H; Smith, Brandy

    2017-11-01

    Drug use is prevalent and costly to society, but individuals with drug use disorders (DUDs) are under-diagnosed and under-treated, particularly in primary care (PC) settings. Drug screening instruments have been developed to identify patients with DUDs and facilitate treatment. The Drug Abuse Screening Test (DAST) is one of the most well-known drug screening instruments. However, like many such instruments, it is too long for routine use in busy PC settings. This study developed and validated a briefer and more practical DAST for busy PC settings. We recruited 1300 PC patients in two Department of Veterans Affairs (VA) clinics. Participants responded to a structured diagnostic interview. We randomly selected half of the sample to develop and the other half to validate the new instrument. We employed signal detection techniques to select the best DAST items to identify DUDs (based on the MINI) and negative consequences of drug use (measured by the Inventory of Drug Use Consequences). Performance indicators were calculated. The two-item DAST (DAST-2) was 97% sensitive and 91% specific for DUDs in the development sample and 95% sensitive and 89% specific in the validation sample. It was highly sensitive and specific for DUDs and negative consequences of drug use in subgroups of patients defined by gender, age, race/ethnicity, marital status, educational level, and posttraumatic stress disorder status. The DAST-2 is an appropriate drug screening instrument for routine use in PC settings in the VA and may be applicable in a broader range of PC clinics. Published by Elsevier Ltd.
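The signal-detection step of selecting items and cutoffs can be illustrated by scanning candidate cutoffs for the maximum Youden's J (sensitivity + specificity - 1). This is one common criterion, not necessarily the authors' exact procedure, and the data below are invented.

```python
# Hedged sketch: pick the screening cutoff that maximises Youden's J on a
# development sample. Scores and case labels are made up for illustration.

def best_cutoff(scores, cases):
    """Return (cutoff, J) maximising sensitivity + specificity - 1."""
    best = (None, -1.0)
    for c in sorted(set(scores)):
        tp = sum(s >= c and y for s, y in zip(scores, cases))
        fn = sum(s < c and y for s, y in zip(scores, cases))
        tn = sum(s < c and not y for s, y in zip(scores, cases))
        fp = sum(s >= c and not y for s, y in zip(scores, cases))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best[1]:
            best = (c, j)
    return best

scores = [0, 0, 1, 1, 2, 2, 2, 0]                      # 0-2 item-sum scores
cases  = [False, False, False, True, True, True, True, False]
cutoff, j = best_cutoff(scores, cases)
```

In practice the chosen cutoff is then re-checked on the held-out validation half, as the record describes.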

  19. Branched Adaptive Testing with a Rasch-Model-Calibrated Test: Analysing Item Presentation's Sequence Effects Using the Rasch-Model-Based LLTM

    Kubinger, Klaus D.; Reif, Manuel; Yanagida, Takuya

    2011-01-01

    Item position effects provoke serious problems within adaptive testing. This is because different testees are necessarily presented with the same item at different presentation positions, as a consequence of which comparing their ability parameter estimations in the case of such effects would not at all be fair. In this article, a specific…

  20. An Online National Archive of Multiple-Choice Questions for Astro 101 and the Development of the Question Complexity Rubric

    Cormier, S.; Prather, E.; Brissenden, G.

    2011-09-01

    We are developing a national archive of multiple-choice questions for use in the Astronomy 101 classroom. These questions are intended to supplement an instructor's implementation of Think-Pair-Share or to serve their assessment purposes (i.e., exams and homework). We are also developing the Question Complexity Rubric (QCR) to guide members of the Astro 101 teaching and learning community in assisting us with hierarchically ranking the questions in this archive by their conceptual complexity. Using the QCR, a score is assigned to differentiate each question based on the cognitive steps necessary to comprehensively explain the reasoning pathway to the correct answer. The lowest QCR score is given to questions with a reasoning pathway requiring only declarative knowledge. The highest QCR score is given to questions with a reasoning pathway that requires multiple connected cognitive steps. When completed, the online question archive will provide users with the ability to 1) use the QCR to score questions, 2) search for and download questions by topic and/or QCR score, and 3) add their own questions to the archive. Stop by our poster to test your skills at determining question complexity by trying out the QCR with our sample questions.

  1. A leukocyte activation test identifies food items which induce release of DNA by innate immune peripheral blood leucocytes.

    Garcia-Martinez, Irma; Weiss, Theresa R; Yousaf, Muhammad N; Ali, Ather; Mehal, Wajahat Z

    2018-01-01

    Leukocyte activation (LA) testing identifies food items that induce a patient-specific cellular response in the immune system, and has recently been shown in a randomized, double-blinded prospective study to reduce symptoms in patients with irritable bowel syndrome (IBS). We hypothesized that test reactivity to particular food items, and the systemic immune response initiated by these food items, is due to the release of cellular DNA from blood immune cells. We tested this by quantifying total DNA concentration in the cellular supernatant of immune cells exposed to positive and negative foods from 20 healthy volunteers. To establish whether the DNA release by positive samples is a specific phenomenon, we quantified myeloperoxidase (MPO) in cellular supernatants. We further assessed by flow cytometry whether a particular immune cell population (neutrophils, eosinophils, or basophils) was activated by the positive food items. To identify the signaling pathways required for DNA release, we tested whether specific inhibitors of key signaling pathways could block DNA release. Foods with a positive LA test result gave a higher supernatant DNA content than foods with a negative result. This was specific, as MPO levels were not increased by foods with a positive LA test. Protein kinase C (PKC) inhibitors blocked positive-food-stimulated DNA release. Positive foods produced greater eosinophil CD63 levels than negative foods in 76.5% of tests. The LA test thus identifies food items that result in DNA release and activation of peripheral blood innate immune cells in a PKC-dependent manner, suggesting that it identifies food items that trigger release of inflammatory markers and activation of innate immune cells. This may be the basis for the improvement in symptoms in IBS patients who followed an LA-test-guided diet.

  2. The Predominance Of Integrative Tests Over Discrete Point Tests In Evaluating The Medical Students' General English Knowledge

    maryam Heydarpour Meymeh

    2009-03-01

    Full Text Available Background and purpose: Multiple-choice tests are the most common type of test used to evaluate the general English knowledge of students in most medical universities; however, the efficacy of these tests has not been examined precisely. We compare and examine integrative tests and discrete-point tests as measures of the English language knowledge of medical students. Methods: Three tests were given to 60 undergraduate physiotherapy and audiology students in their second year of study (after passing their general English course). They were divided into 2 groups. The first test for both groups was an integrative test: writing. The second test was a multiple-choice test of prepositions for group one and a multiple-choice test of tenses for group two. The same items that were most frequently used wrongly in the first test were used in the items of the second test. A third test, a TOEFL, was given to the subjects in order to estimate the correlation between this test and tests one and two. Results: The students performed better in the second, discrete-point test than in the first, integrative test. Grammatical constructions that the students had used wrongly in the composition were answered correctly in the multiple-choice tests. Conclusion: Our findings show that students perform better in non-productive than in productive tests. Since being a competent English language user is an expected outcome of university language courses, it seems warranted to switch to integrative tests as a measure of English language competency. Keywords: INTEGRATIVE TESTS, ENGLISH LANGUAGE FOR MEDICINE, ACADEMIC ENGLISH

  3. Measurement Properties of Two Innovative Item Formats in a Computer-Based Test

    Wan, Lei; Henly, George A.

    2012-01-01

    Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats--the figural response (FR) and constructed response (CR) formats used in a K-12 computerized…

  4. Developing and testing items for the South African Personality Inventory (SAPI)

    Carin Hill

    2013-11-01

    Research purpose: This article reports on the process of identifying items for, and provides a quantitative evaluation of, the South African Personality Inventory (SAPI) items. Motivation for the study: The study intended to develop an indigenous and psychometrically sound personality instrument that adheres to the requirements of South African legislation and excludes cultural bias. Research design, approach and method: The authors used a cross-sectional design. They measured the nine SAPI clusters identified in the qualitative stage of the SAPI project in 11 separate quantitative studies. Convenience sampling yielded 6735 participants. Statistical analysis focused on the construct validity and reliability of items. The authors eliminated items that showed poor performance, based on common psychometric criteria, and selected the best-performing items to form part of the final version of the SAPI. Main findings: The authors developed 2573 items from the nine SAPI clusters. Of these, 2268 items were valid and reliable representations of the SAPI facets. Practical/managerial implications: The authors developed a large item pool that measures personality in South Africa and that researchers can refine for the SAPI. Furthermore, the project illustrates an approach that researchers can use in projects that aim to develop culturally informed psychological measures. Contribution/value-add: Personality assessment is important for recruiting, selecting and developing employees. This study contributes to current knowledge about the early processes researchers follow when developing a personality instrument that, like the SAPI, measures personality fairly across different cultural groups.

  5. The item series of the Cognitive Screening Test compared with that of the Mini-Mental State Examination

    Schmand, B.; Deelman, B. G.; Hooijer, C.; Jonker, C.; Lindeboom, J.

    1996-01-01

    The items of the 'mini-mental state examination' (MMSE) and a Dutch dementia screening instrument, the 'cognitive screening test' (CST), as well as the 'geriatric mental status schedule' (GMS) and the 'Dutch adult reading test' (DART), were administered to 4051 elderly people aged 65 to 84 years.

  6. Review of multiple-choice questions and group performance - A comparison of face-to-face and virtual groups with and without facilitation [Gruppenleistungen beim Review von Multiple-Choice-Fragen - Ein Vergleich von face-to-face und virtuellen Gruppen mit und ohne Moderation]

    Schüttpelz-Brauns, Katrin

    2010-11-01

    Background: Multiple-choice questions (MCQs) are often used in medical education exams and need careful quality management, for example through review committees. This study investigates whether groups communicating virtually by email perform similarly to face-to-face groups in the review process, and whether a facilitator has positive effects. Methods: 16 small groups of psychology students were examined, which had to evaluate and correct MCQs under four different conditions. In the second part of the investigation, the revised questions were given to a new random sample for judgement of item quality. Results: There was no significant influence of the variables "form of review committee" and "facilitation". However, face-to-face and virtual groups clearly differed in the required processing times. The condition "face-to-face without facilitation" was generally rated most positively with respect to taking responsibility, approach to work, sense of well-being, motivation and concentration on the task. Discussion: Face-to-face and virtual groups are equally effective in reviewing MCQs but differ in efficiency. Electronic review appears feasible but is hardly recommendable because of the long processing time and technical problems.

  7. A comparative study of students' performance in preclinical physiology assessed by multiple choice and short essay questions.

    Oyebola, D D; Adewoye, O E; Iyaniwura, J O; Alada, A R; Fasanmade, A A; Raji, Y

    2000-01-01

    This study was designed to compare the performance of medical students in physiology when assessed by multiple choice questions (MCQs) and short essay questions (SEQs). The study also examined the influence of factors such as age, sex, O' level grades and JAMB scores on performance in the MCQs and SEQs. A structured questionnaire was administered to 264 medical students four months before the Part I MBBS examination. Apart from the personal data of each student, the questionnaire sought information on the JAMB scores and GCE O' level grades of each student in English Language, Biology, Chemistry, Physics and Mathematics. The physiology syllabus was divided into five parts and the students were administered separate tests on each part, each consisting of MCQs and SEQs. Performance in MCQs and SEQs was compared, and the effects of JAMB scores and GCE O' level grades on performance in both formats were assessed. The results showed that the students performed better in all MCQ tests than in the SEQs. JAMB scores and O' level English Language grade had no significant effect on students' performance in MCQs and SEQs; however, O' level grades in Biology, Chemistry, Physics and Mathematics had significant effects on performance in both. Inadequate knowledge of physiology and inability to present information in a logical sequence are believed to be major factors contributing to the poorer performance in the SEQs compared with the MCQs. In view of the significant association between performance in MCQs and SEQs and GCE O' level grades in science subjects and mathematics, it was recommended that both JAMB results and GCE results in the four O' level subjects above be considered when selecting candidates for admission into medical schools.

  8. Psychometric characteristics of Clinical Reasoning Problems (CRPs) and its correlation with routine multiple choice question (MCQ) in Cardiology department.

    Derakhshandeh, Zahra; Amini, Mitra; Kojuri, Javad; Dehbozorgian, Marziyeh

    2018-01-01

    Clinical reasoning is one of the most important skills in the process of training a medical student to become an efficient physician. Assessment of reasoning skills in a medical school program is important to direct students' learning. One of the tests for measuring clinical reasoning ability is the Clinical Reasoning Problems (CRPs) test. The major aim of this study was to measure the psychometric qualities of CRPs and to determine the correlation between this test and the routine MCQ in the cardiology department of Shiraz medical school. This descriptive study was conducted on all cardiology residents of Shiraz Medical School; the study population consisted of 40 residents in 2014. The routine CRPs and MCQ tests were designed on similar objectives and were carried out simultaneously. Reliability, item difficulty, item discrimination, and the correlation between each item and the total score of the CRPs were measured with Excel and SPSS software to check the psychometric properties of the CRPs test, and the correlation between the CRPs and MCQ tests was calculated. The mean differences in CRPs test score between residents' academic years (second, third and fourth year) were also evaluated by analysis of variance (one-way ANOVA) using SPSS software (version 20) (α = 0.05). The mean and standard deviation of the CRPs score was 10.19 ± 3.39 out of 20; for the MCQ it was 13.15 ± 3.81 out of 20. Item difficulty was in the range of 0.27-0.72; item discrimination was 0.30-0.75, with question No. 3 being the exception (0.24). The correlation between each item and the total score of the CRPs was 0.26-0.87; the correlation between the CRPs and MCQ tests was 0.68. The CRPs appears to be a suitable measure of clinical reasoning in residents and can be included in cardiology residency assessment programs.
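
The classical item statistics reported in this abstract (difficulty as proportion correct, discrimination, item-total relationships) are straightforward to compute. Below is a minimal sketch using an invented 0/1 response matrix, not the study's data; the upper-lower (27%) index used here is one common variant of the discrimination statistic, not necessarily the one the authors computed:

```python
def item_difficulty(responses, item):
    """Classical difficulty: proportion of examinees answering correctly."""
    col = [r[item] for r in responses]
    return sum(col) / len(col)

def item_discrimination(responses, item, top_frac=0.27):
    """Upper-lower index: p(top 27% by total score) - p(bottom 27%)."""
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * top_frac))
    top, bottom = ranked[:n], ranked[-n:]
    prop = lambda grp: sum(r[item] for r in grp) / len(grp)
    return prop(top) - prop(bottom)

# Invented 0/1 response matrix: rows = examinees, columns = items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

diff0 = item_difficulty(responses, 0)      # 4 of 6 correct
disc0 = item_discrimination(responses, 0)  # top group vs bottom group
```

On this toy matrix, item 0 has difficulty 4/6 and upper-lower discrimination 0.5, comfortably inside the ranges the study reports as acceptable.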

  9. Assessment of chromium(VI) release from 848 jewellery items by use of a diphenylcarbazide spot test

    Bregnbak, David; Johansen, Jeanne D.; Hamann, Dathan

    2016-01-01

    We recently evaluated and validated a diphenylcarbazide (DPC)-based screening spot test that can detect the release of chromium(VI) ions (≥0.5 ppm) from various metallic items and leather goods (1). We then screened a selection of metal screws, leather shoes and gloves, as well as 50 earrings, and identified chromium(VI) release from one earring. In the present study, we used the DPC spot test to assess chromium(VI) release in a much larger sample of jewellery items (n = 848), 160 (19%) of which had previously been shown to contain chromium when analysed with X-ray fluorescence spectroscopy (2).

  10. Development and validation of a theoretical test in basic laparoscopy

    Strandbygaard, Jeanett; Maagaard, Mathilde; Larsen, Christian Rifbjerg

    2013-01-01

    …for first-year residents in obstetrics and gynecology. This study therefore aimed to develop and validate a framework for a theoretical knowledge test, a multiple-choice test, in basic theory related to laparoscopy. METHODS: The content of the multiple-choice test was determined by conducting informal conversational interviews with experts in laparoscopy. The subsequent relevance of the test questions was evaluated using the Delphi method involving regional chief physicians. Construct validity was tested by comparing test results from three groups with expected different clinical competence and knowledge (…0.001). Internal consistency (Cronbach's alpha) was 0.82. There was no evidence of differential item functioning between the three groups tested. CONCLUSIONS: A newly developed knowledge test in basic laparoscopy proved to have content and construct validity. The formula for the development and validation…

  11. Development of an item bank and computer adaptive test for role functioning

    Anatchkova, Milena D; Rose, Matthias; Ware, John E

    2012-01-01

    Role functioning (RF) is a key component of health and well-being and an important outcome in health research. The aim of this study was to develop an item bank to measure impact of health on role functioning.

  12. Learning from peer feedback on student-generated multiple choice questions: Views of introductory physics students

    Kay, Alison E.; Hardy, Judy; Galloway, Ross K.

    2018-06-01

    PeerWise is an online application where students are encouraged to generate a bank of multiple choice questions for their classmates to answer. After answering a question, students can provide feedback to the question author about the quality of the question and the question author can respond to this. Student use of, and attitudes to, this online community within PeerWise was investigated in two large first year undergraduate physics courses, across three academic years, to explore how students interact with the system and the extent to which they believe PeerWise to be useful to their learning. Most students recognized that there is value in engaging with PeerWise, and many students engaged deeply with the system, thinking critically about the quality of their submissions and reflecting on feedback provided to them. Students also valued the breadth of topics and level of difficulty offered by the questions, recognized the revision benefits afforded by the resource, and were often willing to contribute to the community by providing additional explanations and engaging in discussion.

  13. Fostering dental student self-assessment of knowledge by confidence scoring of multiple-choice examinations.

    McMahan, C Alex; Pinckard, R Neal; Jones, Anne Cale; Hendricson, William D

    2014-12-01

    Creating a learning environment that fosters student acquisition of self-assessment behaviors and skills is critically important in the education and training of health professionals. Self-assessment is a vital component of competent practice and lifelong learning. This article proposes applying a version of confidence scoring of multiple-choice questions as one avenue to address this crucial educational objective for students to be able to recognize and admit what they do not know. The confidence scoring algorithm assigns one point for a correct answer, deducts fractional points for an incorrect answer, but rewards students fractional points for leaving the question unanswered in admission that they are unsure of the correct answer. The magnitude of the reward relative to the deduction is selected such that the expected gain due to random guessing, even after elimination of all but one distractor, is never greater than the reward. Curricular implementation of this confidence scoring algorithm should motivate health professions students to develop self-assessment behaviors and enable them to acquire the skills necessary to critically evaluate the extent of their current knowledge throughout their professional careers. This is a professional development competency that is emphasized in the educational standards of the Commission on Dental Accreditation (CODA).
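
The scoring rule this abstract describes can be sketched concretely. The numbers below (5-option items, a 1/4-point deduction for a wrong answer) and the function names are illustrative assumptions, not the authors' published parameters; the one property taken from the abstract is that the blank-reward must never be beaten, in expectation, by guessing among the k options that remain after eliminating distractors:

```python
def expected_guess_score(k, penalty):
    """Expected score when guessing uniformly among k remaining options:
    1/k chance of +1, (k-1)/k chance of -penalty."""
    return (1 - (k - 1) * penalty) / k

def min_safe_reward(n_options, penalty):
    """Smallest blank-reward that guessing cannot beat in expectation,
    for any number of eliminated distractors (worst case: a 2-way guess)."""
    return max(expected_guess_score(k, penalty)
               for k in range(2, n_options + 1))

def score(answer, key, penalty, reward):
    """Confidence score for one item; answer=None means left blank."""
    if answer is None:
        return reward
    return 1.0 if answer == key else -penalty

# Illustrative parameters: 5-option items, 1/4-point wrong-answer deduction.
penalty = 0.25
reward = min_safe_reward(5, penalty)   # worst case is the 2-way guess
```

With these numbers the minimum safe reward is (1 - 0.25)/2 = 0.375, so a student who has narrowed an item to two options still does better, in expectation, by admitting uncertainty than by guessing.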

  14. Automatic Generation System of Multiple-Choice Cloze Questions and its Evaluation

    Takuya Goto

    2010-09-01

    Since English expressions vary according to genre, it is important for students to study questions generated from sentences of the target genre. Although various questions are available, they are still not enough to satisfy the various genres students want to learn. On the other hand, producing English questions requires sufficient grammatical knowledge and vocabulary, so it is difficult for non-experts to prepare English questions by themselves. In this paper, we propose a system for the automatic generation of multiple-choice cloze questions from English texts. Because empirical knowledge is necessary to produce appropriate questions, machine learning is introduced to acquire knowledge from existing questions. To generate questions from texts automatically, the system (1) extracts sentences appropriate for questions from texts based on preference learning, (2) estimates a blank part based on a conditional random field, and (3) generates distractors based on statistical patterns of existing questions. Experimental results show our method is workable for selecting appropriate sentences and blank parts. Moreover, our method generates usable distractors, especially for sentences that do not contain proper nouns.
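
The three-stage pipeline this abstract summarizes (pick a sentence, blank out a word, propose distractors) can be caricatured in a few lines. The sketch below substitutes trivial heuristics and a hand-made confusion table for the paper's preference learning, CRF, and corpus statistics; the table and example sentence are invented for illustration:

```python
# Hypothetical distractor table standing in for corpus-derived patterns.
confusions = {
    "on": ["in", "at", "by"],
    "since": ["for", "during", "from"],
}

def make_cloze(sentence):
    """Blank the first word found in the confusion table and return
    (stem, sorted answer options, correct answer), or None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in confusions:
            stem = " ".join(words[:i] + ["____"] + words[i + 1:]) + "."
            options = [w] + confusions[w.lower()]
            return stem, sorted(options), w
    return None

stem, options, answer = make_cloze("The meeting is on Monday.")
```

Here `make_cloze` yields the stem "The meeting is ____ Monday." with options ["at", "by", "in", "on"]; the real system's contribution is learning, rather than hand-coding, each of these three decisions.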

  15. Ant system for reliability optimization of a series system with multiple-choice and budget constraints

    Nahas, Nabil; Nourelfath, Mustapha

    2005-01-01

    Many researchers have shown that the behavior of insect colonies can be seen as a natural model of collective problem solving. The analogy between the way ants look for food and combinatorial optimization problems has given rise to a new computational paradigm called the ant system. This paper presents an application of the ant system to a reliability optimization problem for a series system with multiple-choice constraints incorporated at each subsystem, maximizing the system reliability subject to the system budget. The problem is formulated as a nonlinear binary integer programming problem and characterized as NP-hard. It is solved by developing and demonstrating a problem-specific ant system algorithm, in which solutions of the reliability optimization problem are repeatedly constructed by considering a trace factor and a desirability factor. A local search is used to improve the quality of the solutions obtained by each ant, and a penalty factor is introduced to deal with the budget constraint. Simulations have shown that the proposed ant system is efficient with respect to the quality of solutions and the computing time.
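
The optimization problem described here (pick exactly one component version per subsystem, maximize the product of reliabilities within a budget) can be sketched with a bare-bones ant system. The component data, parameters, and the simple reject-infeasible budget handling below are illustrative assumptions, not the authors' formulation, which also includes a local search and an explicit penalty factor:

```python
import random

# Per subsystem: candidate (reliability, cost) versions; one must be chosen.
versions = [
    [(0.90, 2), (0.95, 4), (0.99, 7)],
    [(0.85, 1), (0.93, 3), (0.98, 6)],
    [(0.88, 2), (0.96, 5)],
]
BUDGET = 12

def reliability(choice):
    r = 1.0
    for s, v in enumerate(choice):
        r *= versions[s][v][0]
    return r

def cost(choice):
    return sum(versions[s][v][1] for s, v in enumerate(choice))

def ant_system(n_ants=20, n_iter=100, rho=0.1, seed=0):
    rng = random.Random(seed)
    tau = [[1.0] * len(opts) for opts in versions]   # pheromone trails
    best, best_r = None, -1.0
    for _ in range(n_iter):
        for _ in range(n_ants):
            # Each ant picks one version per subsystem, in proportion to
            # pheromone (trace) times a desirability factor (r / cost).
            choice = []
            for s, opts in enumerate(versions):
                w = [tau[s][v] * opts[v][0] / opts[v][1]
                     for v in range(len(opts))]
                choice.append(rng.choices(range(len(opts)), weights=w)[0])
            if cost(choice) > BUDGET:   # crude budget handling: reject
                continue
            r = reliability(choice)
            if r > best_r:
                best, best_r = choice, r
        # Evaporate all trails, then reinforce the best-so-far solution.
        for s in range(len(versions)):
            for v in range(len(tau[s])):
                tau[s][v] *= (1 - rho)
        if best is not None:
            for s, v in enumerate(best):
                tau[s][v] += best_r
    return best, best_r

best, best_r = ant_system()
```

On this toy instance the returned design always respects the multiple-choice structure (one version per subsystem) and the budget; the pheromone update is what steers later ants toward high-reliability regions.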

  16. The Use of Case Based Multiple Choice Questions for Assessing Large Group Teaching: Implications on Student's Learning

    Christina Donnelly

    2014-06-01

    The practice of assessment in third-level education is extremely important and a rarely disputed part of the university curriculum as a method of demonstrating a student's learning. However, assessments that test a student's knowledge and level of understanding are challenging to apply given recent trends showing that student numbers are increasing, student demographics are wide-ranging and resources are being stretched. As a result of these emerging challenges, lecturers are required to develop comprehensive assessments that effectively demonstrate student learning whilst efficiently managing large class sizes. One form of assessment which has been used for efficient assessment is the multiple-choice question (MCQ); however, this method has been criticised for encouraging surface learning in comparison to other methods such as essays or case studies. This research explores the impact of blended assessment methods on student learning. The study adopts a rigorous three-stage qualitative methodology to capture third-level lecturers' and students' perceptions of (1) the level of learning when using MCQs and (2) the level of learning when blended assessment in the form of case-based MCQs is used. The findings illuminate the positive impact of case-based MCQs, as students and lecturers suggest that they lead to a higher level of learning and deeper information processing than MCQs without case studies. The implications of this research are that this type of assessment contributes to current thinking in the literature on the use and blending of assessment methods to reach a higher level of learning. It further reinforces the belief that assessments are the greatest influence on students' learning, and the requirement for both universities and lecturers to reflect on the best form of assessment to test students' level of understanding whilst also balancing the real challenges of…

  17. The Question Complexity Rubric: Development and Application for a National Archive of Astro 101 Multiple-Choice Questions

    Cormier, Sebastien; Prather, E. E.; Brissenden, G.; Collaboration of Astronomy Teaching Scholars CATS

    2011-01-01

    For the last two years we have been developing an online national archive of multiple-choice questions for use in the Astro 101 classroom. These questions are intended to either supplement an instructor's implementation of Think-Pair-Share or be used for assessment purposes (i.e. exams and homework). In this talk we will describe the development, testing and implementation of the Question Complexity Rubric (QCR), which is designed to guide the ranking of questions in this archive based on their conceptual complexity. Using the QCR, a score is assigned to differentiate each question based on the cognitive steps necessary to comprehensively explain the reasoning pathway to the correct answer. The lowest QCR score is given to questions with a reasoning pathway requiring only declarative knowledge whereas the highest QCR score is given to questions that require multiple pathways of multi-step reasoning. When completed, the online question archive will provide users with the utility to 1) search for and download questions based on subject and average QCR score, 2) use the QCR to score questions, and 3) add their own questions to the archive. We will also discuss other potential applications of the QCR, such as how it informs our work in developing and testing of survey instruments by allowing us to calibrate the range of question complexity. This material is based upon work supported by the National Science Foundation under Grant No. 0715517, a CCLI Phase III Grant for the Collaboration of Astronomy Teaching Scholars (CATS). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

  18. A Multidimensional Partial Credit Model with Associated Item and Test Statistics: An Application to Mixed-Format Tests

    Yao, Lihua; Schwarz, Richard D.

    2006-01-01

    Multidimensional item response theory (IRT) models have been proposed for better understanding the dimensional structure of data or to define diagnostic profiles of student learning. A compensatory multidimensional two-parameter partial credit model (M-2PPC) for constructed-response items is presented that is a generalization of those proposed to…

  19. An Australian Study Comparing the Use of Multiple-Choice Questionnaires with Assignments as Interim, Summative Law School Assessment

    Huang, Vicki

    2017-01-01

    To the author's knowledge, this is the first Australian study to empirically compare the use of a multiple-choice questionnaire (MCQ) with the use of a written assignment for interim, summative law school assessment. This study also surveyed the same student sample as to what types of assessments are preferred and why. In total, 182 undergraduate…

  20. Using a Fine-Grained Multiple-Choice Response Format in Educational Drill-and-Practice Video Games

    Beserra, Vagner; Nussbaum, Miguel; Grass, Antonio

    2017-01-01

    When using educational video games, particularly drill-and-practice video games, there are several ways of providing an answer to a quiz. The majority of paper-based options can be classified as being either multiple-choice or constructed-response. Therefore, in the process of creating an educational drill-and-practice video game, one fundamental…

  1. The Multiple-Choice Model: Some Solutions for Estimation of Parameters in the Presence of Omitted Responses

    Abad, Francisco J.; Olea, Julio; Ponsoda, Vicente

    2009-01-01

    This article deals with some of the problems that have hindered the application of Samejima's and Thissen and Steinberg's multiple-choice models: (a) parameter estimation difficulties owing to the large number of parameters involved, (b) parameter identifiability problems in the Thissen and Steinberg model, and (c) their treatment of omitted…

  2. Differential Item Functioning in While-Listening Performance Tests: The Case of the International English Language Testing System (IELTS) Listening Module

    Aryadoust, Vahid

    2012-01-01

    This article investigates a version of the International English Language Testing System (IELTS) listening test for evidence of differential item functioning (DIF) based on gender, nationality, age, and degree of previous exposure to the test. Overall, the listening construct was found to be underrepresented, which is probably an important cause…

  3. An evaluation of computerized adaptive testing for general psychological distress: combining GHQ-12 and Affectometer-2 in an item bank for public mental health research.

    Stochl, Jan; Böhnke, Jan R; Pickett, Kate E; Croudace, Tim J

    2016-05-20

    Recent developments in psychometric modeling and technology allow pooling well-validated items from existing instruments into larger item banks and their deployment through methods of computerized adaptive testing (CAT). Use of item response theory-based bifactor methods and integrative data analysis overcomes barriers in cross-instrument comparison. This paper presents the joint calibration of an item bank for researchers keen to investigate population variations in general psychological distress (GPD). Multidimensional item response theory was used on existing health survey data from the Scottish Health Education Population Survey (n = 766) to calibrate an item bank consisting of pooled items from the short common mental disorder screen (GHQ-12) and the Affectometer-2 (a measure of "general happiness"). Computer simulation was used to evaluate usefulness and efficacy of its adaptive administration. A bifactor model capturing variation across a continuum of population distress (while controlling for artefacts due to item wording) was supported. The numbers of items for different required reliabilities in adaptive administration demonstrated promising efficacy of the proposed item bank. Psychometric modeling of the common dimension captured by more than one instrument offers the potential of adaptive testing for GPD using individually sequenced combinations of existing survey items. The potential for linking other item sets with alternative candidate measures of positive mental health is discussed since an optimal item bank may require even more items than these.
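
A toy version of adaptive administration helps make the simulation idea in this abstract concrete. The sketch below uses a unidimensional 2PL item bank with invented parameters, maximum-information item selection, and a grid-based EAP ability update; the study's bifactor calibration and wording-artefact controls are deliberately omitted, and all names are this sketch's own:

```python
import math
import random

# Invented 2PL bank: (discrimination a, difficulty b) per item.
bank = [(round(random.Random(i).uniform(0.8, 2.0), 2),
         round(random.Random(i + 99).uniform(-2, 2), 2))
        for i in range(30)]

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1 - p)          # 2PL Fisher information

def eap(responses):
    """Expected a posteriori theta on a grid, standard normal prior."""
    grid = [g / 10 for g in range(-40, 41)]
    post = []
    for t in grid:
        like = math.exp(-t * t / 2)
        for (a, b), u in responses:
            p = p_correct(t, a, b)
            like *= p if u else (1 - p)
        post.append(like)
    z = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / z

def run_cat(true_theta, n_items=10, seed=1):
    """Simulate one adaptive administration for a simulee at true_theta."""
    rng = random.Random(seed)
    theta, used, responses = 0.0, set(), []
    for _ in range(n_items):
        # Pick the unused item most informative at the current estimate.
        i = max((j for j in range(len(bank)) if j not in used),
                key=lambda j: info(theta, *bank[j]))
        used.add(i)
        u = rng.random() < p_correct(true_theta, *bank[i])
        responses.append((bank[i], u))
        theta = eap(responses)
    return theta
```

Repeating `run_cat` over many simulees at known abilities is the kind of computer simulation the study used to judge how many adaptively chosen items are needed for a target reliability.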

  4. Introduction to Psychology and Leadership. Part Nine; Morale and Esprit De Corps. Progress Check. Test Item Pool. Segments I & II.

    Westinghouse Learning Corp., Annapolis, MD.

    Test items for the introduction to psychology and leadership course (see the final reports which summarize the course development project, EM 010 418, EM 010 419, and EM 010 484) which were compiled as part of the project documentation and which are coordinated with the text-workbook on morale and esprit de corps (EM 010 439, EM 010 440, and EM…

  5. Probabilistic Approaches to Examining Linguistic Features of Test Items and Their Effect on the Performance of English Language Learners

    Solano-Flores, Guillermo

    2014-01-01

    This article addresses validity and fairness in the testing of English language learners (ELLs)--students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that…

  6. An Analysis of Cross Racial Identity Scale Scores Using Classical Test Theory and Rasch Item Response Models

    Sussman, Joshua; Beaujean, A. Alexander; Worrell, Frank C.; Watson, Stevie

    2013-01-01

    Item response models (IRMs) were used to analyze Cross Racial Identity Scale (CRIS) scores. Rasch analysis scores were compared with classical test theory (CTT) scores. The partial credit model demonstrated a high goodness of fit, and correlations between Rasch and CTT scores ranged from 0.91 to 0.99. CRIS scores are supported by both methods.

  7. Development of a Postacute Hospital Item Bank for the New Pediatric Evaluation of Disability Inventory-Computer Adaptive Test

    Dumas, Helene M.

    2010-01-01

    The PEDI-CAT is a new computer adaptive test (CAT) version of the Pediatric Evaluation of Disability Inventory (PEDI). Additional PEDI-CAT items specific to postacute pediatric hospital care were recently developed using expert reviews and cognitive interviewing techniques. Expert reviews established face and construct validity, providing positive…

  8. Effectiveness of Item Response Theory (IRT) Proficiency Estimation Methods under Adaptive Multistage Testing. Research Report. ETS RR-15-11

    Kim, Sooyeon; Moses, Tim; Yoo, Hanwook Henry

    2015-01-01

    The purpose of this inquiry was to investigate the effectiveness of item response theory (IRT) proficiency estimators in terms of estimation bias and error under multistage testing (MST). We chose a 2-stage MST design in which 1 adaptation to the examinees' ability levels takes place. It includes 4 modules (1 at Stage 1, 3 at Stage 2) and 3 paths…

  9. Examining Construct Congruence for Psychometric Tests: A Note on an Extension to Binary Items and Nesting Effects

    Raykov, Tenko; Marcoulides, George A.; Dimitrov, Dimiter M.; Li, Tatyana

    2018-01-01

    This article extends the procedure outlined in the article by Raykov, Marcoulides, and Tong for testing congruence of latent constructs to the setting of binary items and clustering effects. In this widely used setting in contemporary educational and psychological research, the method can be used to examine if two or more homogeneous…

  10. Biological Science: An Ecological Approach. BSCS Green Version. Teacher's Resource Book and Test Item Bank. Sixth Edition.

    Biological Sciences Curriculum Study, Colorado Springs.

    This book consists of four sections: (1) "Supplemental Materials"; (2) "Supplemental Investigations"; (3) "Test Item Bank"; and (4) "Blackline Masters." The first section provides additional background material related to selected chapters and investigations in the student book. Included are a periodic table of the elements, genetics problems and…

  11. Threats to Validity When Using Open-Ended Items in International Achievement Studies: Coding Responses to the PISA 2012 Problem-Solving Test in Finland

    Arffman, Inga

    2016-01-01

    Open-ended (OE) items are widely used to gather data on student performance in international achievement studies. However, several factors may threaten validity when using such items. This study examined Finnish coders' opinions about threats to validity when coding responses to OE items in the PISA 2012 problem-solving test. A total of 6…

  12. Effect of Item Response Theory (IRT) Model Selection on Testlet-Based Test Equating. Research Report. ETS RR-14-19

    Cao, Yi; Lu, Ru; Tao, Wei

    2014-01-01

    The local item independence assumption underlying traditional item response theory (IRT) models is often not met for tests composed of testlets. There are 3 major approaches to addressing this issue: (a) ignore the violation and use a dichotomous IRT model (e.g., the 2-parameter logistic [2PL] model), (b) combine the interdependent items to form a…

  13. Testing the Item-Order Account of Design Effects Using the Production Effect

    Jonker, Tanya R.; Levene, Merrick; MacLeod, Colin M.

    2014-01-01

    A number of memory phenomena evident in recall in within-subject, mixed-lists designs are reduced or eliminated in between-subject, pure-list designs. The item-order account (McDaniel & Bugg, 2008) proposes that differential retention of order information might underlie this pattern. According to this account, order information may be encoded…

  14. Detecting intrajudge inconsistency in standard setting using test items with a selected-response format

    van der Linden, Willem J.; Vos, Hendrik J.; Chang, Lei

    2002-01-01

    In judgmental standard setting experiments, it may be difficult to specify subjective probabilities that adequately take the properties of the items into account. As a result, these probabilities are not consistent with each other in the sense that they do not refer to the same borderline level of

  15. Design of Web Questionnaires : A Test for Number of Items per Screen

    Toepoel, V.; Das, J.W.M.; van Soest, A.H.O.

    2005-01-01

    This paper presents results from an experimental manipulation of a one- versus multiple-items-per-screen format in a Web survey. The purpose of the experiment was to find out if a questionnaire's format influences how respondents provide answers in online questionnaires and if this depends on…

  16. Re-Examining Test Item Issues in the TIMSS Mathematics and Science Assessments

    Wang, Jianjun

    2011-01-01

As the largest international study ever undertaken, the Trends in International Mathematics and Science Study (TIMSS) has been held as a benchmark to measure U.S. student performance in the global context. In-depth analyses of the TIMSS project are conducted in this study to examine key issues of the comparative investigation: (1) item flaws in mathematics…

  17. Higher-Order Asymptotics and Its Application to Testing the Equality of the Examinee Ability Over Two Sets of Items.

    Sinharay, Sandip; Jensen, Jens Ledet

    2018-06-27

    In educational and psychological measurement, researchers and/or practitioners are often interested in examining whether the ability of an examinee is the same over two sets of items. Such problems can arise in measurement of change, detection of cheating on unproctored tests, erasure analysis, detection of item preknowledge, etc. Traditional frequentist approaches that are used in such problems include the Wald test, the likelihood ratio test, and the score test (e.g., Fischer, Appl Psychol Meas 27:3-26, 2003; Finkelman, Weiss, & Kim-Kang, Appl Psychol Meas 34:238-254, 2010; Glas & Dagohoy, Psychometrika 72:159-180, 2007; Guo & Drasgow, Int J Sel Assess 18:351-364, 2010; Klauer & Rettig, Br J Math Stat Psychol 43:193-206, 1990; Sinharay, J Educ Behav Stat 42:46-68, 2017). This paper shows that approaches based on higher-order asymptotics (e.g., Barndorff-Nielsen & Cox, Inference and asymptotics. Springer, London, 1994; Ghosh, Higher order asymptotics. Institute of Mathematical Statistics, Hayward, 1994) can also be used to test for the equality of the examinee ability over two sets of items. The modified signed likelihood ratio test (e.g., Barndorff-Nielsen, Biometrika 73:307-322, 1986) and the Lugannani-Rice approximation (Lugannani & Rice, Adv Appl Prob 12:475-490, 1980), both of which are based on higher-order asymptotics, are shown to provide some improvement over the traditional frequentist approaches in three simulations. Two real data examples are also provided.
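Of the traditional frequentist approaches listed, the Wald test is the simplest to illustrate: given ability estimates and standard errors from the two item sets, the statistic is their standardized difference. This is a generic sketch, not the authors' implementation, and the numbers are invented:

```python
import math

def wald_statistic(theta1, se1, theta2, se2):
    """Wald test of H0: the examinee's ability is equal over the two
    item sets; compare |z| to a standard normal critical value."""
    return (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2)

z = wald_statistic(1.0, 0.3, 0.2, 0.4)  # z = 0.8 / 0.5 = 1.6
print(abs(z) > 1.96)  # not significant at the 5% level → False
```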

  18. Psychometric evaluation of an item bank for computerized adaptive testing of the EORTC QLQ-C30 cognitive functioning dimension in cancer patients

    Dirven, Linda; Groenvold, Mogens; Taphoorn, Martin J. B.

    2017-01-01

… on the field-testing and psychometric evaluation of the item bank for cognitive functioning (CF). METHODS: In previous phases (I-III), 44 candidate items were developed measuring CF in cancer patients. In phase IV, these items were psychometrically evaluated in a large sample of international cancer patients … model, showing an acceptable fit. Although several items showed DIF, these had a negligible impact on CF estimation. Measurement precision of the item bank was much higher than the two original QLQ-C30 CF items alone, across the whole continuum. Moreover, CAT measurement may on average reduce study sample sizes by about 35-40% compared to the original QLQ-C30 CF scale, without loss of power. CONCLUSION: A CF item bank for CAT measurement consisting of 34 items was established, applicable to various cancer patients across countries. This CAT measurement system will facilitate precise and efficient …

  19. Psychometric evaluation of an item bank for computerized adaptive testing of the EORTC QLQ-C30 cognitive functioning dimension in cancer patients.

    Dirven, Linda; Groenvold, Mogens; Taphoorn, Martin J B; Conroy, Thierry; Tomaszewski, Krzysztof A; Young, Teresa; Petersen, Morten Aa

    2017-11-01

The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is developing computerized adaptive testing (CAT) versions of all EORTC Quality of Life Questionnaire (QLQ-C30) scales with the aim to enhance measurement precision. Here we present the results on the field-testing and psychometric evaluation of the item bank for cognitive functioning (CF). In previous phases (I-III), 44 candidate items were developed measuring CF in cancer patients. In phase IV, these items were psychometrically evaluated in a large sample of international cancer patients. This evaluation included an assessment of dimensionality, fit to the item response theory (IRT) model, differential item functioning (DIF), and measurement properties. A total of 1030 cancer patients completed the 44 candidate items on CF. Of these, 34 items could be included in a unidimensional IRT model, showing an acceptable fit. Although several items showed DIF, these had a negligible impact on CF estimation. Measurement precision of the item bank was much higher than the two original QLQ-C30 CF items alone, across the whole continuum. Moreover, CAT measurement may on average reduce study sample sizes by about 35-40% compared to the original QLQ-C30 CF scale, without loss of power. A CF item bank for CAT measurement consisting of 34 items was established, applicable to various cancer patients across countries. This CAT measurement system will facilitate precise and efficient assessment of HRQOL of cancer patients, without loss of comparability of results.

  20. The six-item Clock Drawing Test – reliability and validity in mild Alzheimer’s disease

    Jørgensen, Kasper; Kristensen, Maria K; Waldemar, Gunhild

    2015-01-01

This study presents a reliable, short and practical version of the Clock Drawing Test (CDT) for clinical use and examines its diagnostic accuracy in mild Alzheimer's disease versus elderly nonpatients. Clock drawings from 231 participants were scored independently by four clinical neuropsychologists blind to diagnostic classification. The interrater agreement of individual scoring criteria was analyzed and items with poor or moderate reliability were excluded. The classification accuracy of the resulting scoring system - the six-item CDT - was examined. We explored the effect of further …

  1. The effect of heightened awareness of observation on consumption of a multi-item laboratory test meal in females.

    Robinson, Eric; Proctor, Michael; Oldham, Melissa; Masic, Una

    2016-09-01

    Human eating behaviour is often studied in the laboratory, but whether the extent to which a participant believes that their food intake is being measured influences consumption of different meal items is unclear. Our main objective was to examine whether heightened awareness of observation of food intake affects consumption of different food items during a lunchtime meal. One hundred and fourteen female participants were randomly assigned to an experimental condition designed to heighten participant awareness of observation or a condition in which awareness of observation was lower, before consuming an ad libitum multi-item lunchtime meal in a single session study. Under conditions of heightened awareness, participants tended to eat less of an energy dense snack food (cookies) in comparison to the less aware condition. Consumption of other meal items and total energy intake were similar in the heightened awareness vs. less aware condition. Exploratory secondary analyses suggested that the effect heightened awareness had on reduced cookie consumption was dependent on weight status, as well as trait measures of dietary restraint and disinhibition, whereby only participants with overweight/obesity, high disinhibition or low restraint reduced their cookie consumption. Heightened awareness of observation may cause females to reduce their consumption of an energy dense snack food during a test meal in the laboratory and this effect may be moderated by participant individual differences. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

  2. Generalization of the Lord-Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real-Number Item Scores

    Kim, Seonghoon

    2013-01-01

    With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item…
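The Lord-Wingersky recursion for dichotomous items is compact enough to sketch: given each item's probability of a correct response at a fixed proficiency, it builds the number-correct score distribution one item at a time. This illustrates the classic integer-score algorithm, not the article's real-number generalization:

```python
def lord_wingersky(probs):
    """Conditional distribution of the number-correct score, given
    per-item probabilities of success at a fixed proficiency level."""
    dist = [1.0]  # P(score = 0) before any items are added
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)  # item answered incorrectly
            new[score + 1] += mass * p      # item answered correctly
        dist = new
    return dist

print(lord_wingersky([0.5, 0.5]))  # → [0.25, 0.5, 0.25]
```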

  3. Re-Fitting for a Different Purpose: A Case Study of Item Writer Practices in Adapting Source Texts for a Test of Academic Reading

    Green, Anthony; Hawkey, Roger

    2012-01-01

    The important yet under-researched role of item writers in the selection and adaptation of texts for high-stakes reading tests is investigated through a case study involving a group of trained item writers working on the International English Language Testing System (IELTS). In the first phase of the study, participants were invited to reflect in…

  4. What Do You Think You Are Measuring? A Mixed-Methods Procedure for Assessing the Content Validity of Test Items and Theory-Based Scaling

    Koller, Ingrid; Levenson, Michael R.; Glück, Judith

    2017-01-01

    The valid measurement of latent constructs is crucial for psychological research. Here, we present a mixed-methods procedure for improving the precision of construct definitions, determining the content validity of items, evaluating the representativeness of items for the target construct, generating test items, and analyzing items on a theoretical basis. To illustrate the mixed-methods content-scaling-structure (CSS) procedure, we analyze the Adult Self-Transcendence Inventory, a self-report measure of wisdom (ASTI, Levenson et al., 2005). A content-validity analysis of the ASTI items was used as the basis of psychometric analyses using multidimensional item response models (N = 1215). We found that the new procedure produced important suggestions concerning five subdimensions of the ASTI that were not identifiable using exploratory methods. The study shows that the application of the suggested procedure leads to a deeper understanding of latent constructs. It also demonstrates the advantages of theory-based item analysis. PMID:28270777

  5. Development of an item bank for the EORTC Role Functioning Computer Adaptive Test (EORTC RF-CAT)

    Gamper, Eva-Maria; Petersen, Morten Aa.; Aaronson, Neil

    2016-01-01

… a computer-adaptive test (CAT) for RF. This was part of a larger project whose objective is to develop a CAT version of the EORTC QLQ-C30, which is one of the most widely used HRQOL instruments in oncology. METHODS: In accordance with EORTC guidelines, the development of the RF-CAT comprised four phases … with good psychometric properties. The resulting item bank exhibits excellent reliability (mean reliability = 0.85, median = 0.95). Using the RF-CAT may allow sample size savings from 11% up to 50% compared to using the QLQ-C30 RF scale. CONCLUSIONS: The RF-CAT item bank improves the precision …

  6. Tests of the validity of a model relating frequency of contaminated items and increasing radiation dose

    Tallentire, A.; Khan, A.A.

    1975-01-01

The Co-60 radiation response of Bacillus pumilus E601 spores has been characterized in a laboratory test system. The suitability of test vessels to act as both containers for irradiation and culture vessels in sterility testing has been checked. Tests have been done with these spores to verify assumptions basic to the general model described in a previous paper. First measurements indicate that the model holds for this laboratory test system. (author)

  7. The differential item functioning and structural equivalence of a nonverbal cognitive ability test for five language groups

    Pieter Schaap

    2011-10-01

Research purpose: The aim of the study was to determine the differential item functioning (DIF) and structural equivalence of a nonverbal cognitive ability test (the PiB/SpEEx Observance test [401]) for five South African language groups. Motivation for study: Tests that are sensitive to culture and language group can lead to unfair discrimination, which is a contentious workplace issue in South Africa today. Misconceptions about psychometric testing in industry can cause tests to lose credibility if industries do not use a scientifically sound test-by-test evaluation approach. Research design, approach and method: The researcher used a quasi-experimental design and factor analytic and logistic regression techniques to meet the research aims. The study used a convenience sample drawn from industry and an educational institution. Main findings: The main findings of the study show structural equivalence of the test at a holistic level and nonsignificant DIF effect sizes for most of the comparisons that the researcher made. Practical/managerial implications: This research shows that the PIB/SpEEx Observance Test (401) is not completely language insensitive. One should rather see it as a language-reduced test when people from different language groups need testing. Contribution/value-add: The findings provide supporting evidence that nonverbal cognitive tests are plausible alternatives to verbal tests when one compares people from different language groups.

  8. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination

    Bongyeun Koh

    2016-01-01

Purpose: The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. Methods: The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. Results: In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). Conclusion: In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.

  9. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination.

    Koh, Bongyeun; Hong, Sunggi; Kim, Soon-Sim; Hyun, Jin-Sook; Baek, Milye; Moon, Jundong; Kwon, Hayran; Kim, Gyoungyong; Min, Seonggi; Kang, Gu-Hyun

    2016-01-01

    The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as well as 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01), as well as all 3 of the advanced skills test items (P<0.01). In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.
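The classical difficulty index analyzed in these studies is simply the proportion of examinees who pass an item (higher values mean easier items). A minimal sketch with invented response data:

```python
def difficulty_index(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

# Invented pass/fail records for one skills-test item.
item_responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(difficulty_index(item_responses))  # → 0.7
```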

  10. Developing Testing Accommodations for English Language Learners: Illustrations as Visual Supports for Item Accessibility

    Solano-Flores, Guillermo; Wang, Chao; Kachchaf, Rachel; Soltero-Gonzalez, Lucinda; Nguyen-Le, Khanh

    2014-01-01

    We address valid testing for English language learners (ELLs)--students in the United States who are schooled in English while they are still acquiring English as a second language. Also, we address the need for procedures for systematically developing ELL testing accommodations--changes in tests intended to support ELLs to gain access to the…

  11. On the issue of item selection in computerized adaptive testing with response times

    Veldkamp, Bernard P.

    2016-01-01

    Many standardized tests are now administered via computer rather than paper-and-pencil format. The computer-based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed
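A common baseline for the item-selection problem discussed here is maximum Fisher information under a 2PL model: at each step, administer the unused item that is most informative at the current ability estimate. This is a generic sketch with invented item parameters, not the chapter's response-time-based method:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, pool, administered):
    """Pick the not-yet-administered item with maximum information."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta_hat, *pool[i]))

pool = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.1)]  # (a, b) per item, invented
print(select_next_item(0.0, pool, administered=set()))  # → 1
```

The item with difficulty nearest the ability estimate and the highest discrimination wins, which is what drives the adaptivity described above.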

  12. Item-saving assessment of self-care performance in children with developmental disabilities: A prospective caregiver-report computerized adaptive test

    Chen, Cheng-Te; Chen, Yu-Lan; Lin, Yu-Ching; Hsieh, Ching-Lin; Tzeng, Jeng-Yi

    2018-01-01

    Objective The purpose of this study was to construct a computerized adaptive test (CAT) for measuring self-care performance (the CAT-SC) in children with developmental disabilities (DD) aged from 6 months to 12 years in a content-inclusive, precise, and efficient fashion. Methods The study was divided into 3 phases: (1) item bank development, (2) item testing, and (3) a simulation study to determine the stopping rules for the administration of the CAT-SC. A total of 215 caregivers of children with DD were interviewed with the 73-item CAT-SC item bank. An item response theory model was adopted for examining the construct validity to estimate item parameters after investigation of the unidimensionality, equality of slope parameters, item fitness, and differential item functioning (DIF). In the last phase, the reliability and concurrent validity of the CAT-SC were evaluated. Results The final CAT-SC item bank contained 56 items. The stopping rules suggested were (a) reliability coefficient greater than 0.9 or (b) 14 items administered. The results of simulation also showed that 85% of the estimated self-care performance scores would reach a reliability higher than 0.9 with a mean test length of 8.5 items, and the mean reliability for the rest was 0.86. Administering the CAT-SC could reduce the number of items administered by 75% to 84%. In addition, self-care performances estimated by the CAT-SC and the full item bank were very similar to each other (Pearson r = 0.98). Conclusion The newly developed CAT-SC can efficiently measure self-care performance in children with DD whose performances are comparable to those of TD children aged from 6 months to 12 years as precisely as the whole item bank. The item bank of the CAT-SC has good reliability and a unidimensional self-care construct, and the CAT can estimate self-care performance with less than 25% of the items in the item bank. 
Therefore, the CAT-SC could be useful for measuring self-care performance in children with
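The suggested stopping rules translate directly into a termination check: stop once estimated reliability exceeds 0.9 or 14 items have been administered. A sketch, assuming reliability is approximated as 1 - SE² on a unit-variance ability scale:

```python
def should_stop(se_theta, n_administered, rel_target=0.90, max_items=14):
    """CAT termination check mirroring the two stopping rules:
    reliability above the target, or maximum test length reached."""
    reliability = 1.0 - se_theta ** 2
    return reliability >= rel_target or n_administered >= max_items

print(should_stop(se_theta=0.30, n_administered=8))  # reliability ≈ 0.91 → True
print(should_stop(se_theta=0.40, n_administered=8))  # reliability ≈ 0.84 → False
```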

  13. Results of wholesomeness test on basic plan of research and development of food irradiation (7 items)

    Furuya, Tsuyoshi

    1989-01-01

    Twenty years have elapsed since the general research on food irradiation was begun in Japan as the new technology for food preservation, and the research on the wholesomeness of irradiated foods has been carried out in wide range together with the research on irradiation effect, irradiation techniques and economical efficiency. The wholesomeness of irradiated foods includes chronic toxicity including carcinogenic property in the continuous intake for long period, the effect to reproduction function over many generations and the possibility of giving hereditary injury to cells, the nutritional adequacy required for the sustenance of life and the increase of health, and microbiological safety. In Japan, the research on food irradiation was designated as an atomic energy specific general research, and as the objects of research, potato and onion for the prevention of germination, rice and wheat for the protection from noxious insects, fish paste products, wienerwurst and mandarin orange for sterilization were selected. For the irradiation, Co-60 gamma ray was used except the case of mandarin orange using electron beam. The research on all 7 items was finished, and the irradiation of potato was permitted. (K.I.)

  14. Does Correct Answer Distribution Influence Student Choices When Writing Multiple Choice Examinations?

    Jacqueline A. Carnegie

    2017-03-01

Summative evaluation for large classes of first- and second-year undergraduate courses often involves the use of multiple choice question (MCQ) exams in order to provide timely feedback. Several versions of those exams are often prepared via computer-based question scrambling in an effort to deter cheating. An important parameter to consider when preparing multiple exam versions is that they must be equivalent in their assessment of student knowledge. This project investigated a possible influence of correct answer organization on student answer selection when writing multiple versions of MCQ exams. The specific question asked was whether the existence of a series of four to five consecutive MCQs in which the same letter represented the correct answer had a detrimental influence on a student’s ability to continue to select the correct answer as he/she moved through that series. Student outcomes from such exams were compared with results from exams with identical questions but which did not contain such series. These findings were supplemented by student survey data in which students self-assessed the extent to which they paid attention to the distribution of correct answer choices when writing summative exams, both during their initial answer selection and when transferring their answer letters to the Scantron sheet for correction. Despite the fact that more than half of survey respondents indicated that they do make note of answer patterning during exams and that a series of four to five questions with the same letter for the correct answer would encourage many of them to take a second look at their answer choice, the results pertaining to student outcomes suggest that MCQ randomization, even when it does result in short serial arrays of letter-specific correct answers, does not constitute a distraction capable of adversely influencing student performance.

  15. Specificity and false positive rates of the Test of Memory Malingering, Rey 15-item Test, and Rey Word Recognition Test among forensic inpatients with intellectual disabilities.

    Love, Christopher M; Glassmire, David M; Zanolini, Shanna Jordan; Wolf, Amanda

    2014-10-01

    This study evaluated the specificity and false positive (FP) rates of the Rey 15-Item Test (FIT), Word Recognition Test (WRT), and Test of Memory Malingering (TOMM) in a sample of 21 forensic inpatients with mild intellectual disability (ID). The FIT demonstrated an FP rate of 23.8% with the standard quantitative cutoff score. Certain qualitative error types on the FIT showed promise and had low FP rates. The WRT obtained an FP rate of 0.0% with previously reported cutoff scores. Finally, the TOMM demonstrated low FP rates of 4.8% and 0.0% on Trial 2 and the Retention Trial, respectively, when applying the standard cutoff score. FP rates are reported for a range of cutoff scores and compared with published research on individuals diagnosed with ID. Results indicated that although the quantitative variables on the FIT had unacceptably high FP rates, the TOMM and WRT had low FP rates, increasing the confidence clinicians can place in scores reflecting poor effort on these measures during ID evaluations. © The Author(s) 2014.
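The reported false positive rates are simply the share of genuine responders flagged by a cutoff. A minimal sketch, with invented scores and an assumed "fail if below cutoff" direction:

```python
def false_positive_rate(scores, cutoff):
    """Proportion of honest examinees incorrectly flagged, i.e. those
    scoring below the cutoff on an effort measure (1 - specificity)."""
    flagged = sum(1 for s in scores if s < cutoff)
    return flagged / len(scores)

print(false_positive_rate([45, 50, 38, 49], cutoff=45))  # → 0.25
```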

  16. Measuring Cognitive Load in Test Items: Static Graphics versus Animated Graphics

    Dindar, M.; Kabakçi Yurdakul, I.; Inan Dönmez, F.

    2015-01-01

    The majority of multimedia learning studies focus on the use of graphics in learning process but very few of them examine the role of graphics in testing students' knowledge. This study investigates the use of static graphics versus animated graphics in a computer-based English achievement test from a cognitive load theory perspective. Three…

  17. Explanatory Item Response Modeling of Children's Change on a Dynamic Test of Analogical Reasoning

    Stevenson, Claire E.; Hickendorff, Marian; Resing, Wilma C. M.; Heiser, Willem J.; de Boeck, Paul A. L.

    2013-01-01

    Dynamic testing is an assessment method in which training is incorporated into the procedure with the aim of gauging cognitive potential. Large individual differences are present in children's ability to profit from training in analogical reasoning. The aim of this experiment was to investigate sources of these differences on a dynamic test of…

  18. How to Reason with Economic Concepts: Cognitive Process of Japanese Undergraduate Students Solving Test Items

    Asano, Tadayoshi; Yamaoka, Michio

    2015-01-01

    The authors administered a Japanese version of the Test of Understanding in College Economics, the fourth edition (TUCE-4) to assess the economic literacy of Japanese undergraduate students in 2006 and 2009. These two test results were combined to investigate students' cognitive process or reasoning with specific economic concepts and principles…

  19. Applying Item Response Theory to the Development of a Screening Adaptation of the Goldman-Fristoe Test of Articulation-Second Edition

    Brackenbury, Tim; Zickar, Michael J.; Munson, Benjamin; Storkel, Holly L.

    2017-01-01

    Purpose: Item response theory (IRT) is a psychometric approach to measurement that uses latent trait abilities (e.g., speech sound production skills) to model performance on individual items that vary by difficulty and discrimination. An IRT analysis was applied to preschoolers' productions of the words on the Goldman-Fristoe Test of…

  20. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement.

    McInnes, Matthew D F; Moher, David; Thombs, Brett D; McGrath, Trevor A; Bossuyt, Patrick M; Clifford, Tammy; Cohen, Jérémie F; Deeks, Jonathan J; Gatsonis, Constantine; Hooft, Lotty; Hunt, Harriet A; Hyde, Christopher J; Korevaar, Daniël A; Leeflang, Mariska M G; Macaskill, Petra; Reitsma, Johannes B; Rodin, Rachel; Rutjes, Anne W S; Salameh, Jean-Paul; Stevens, Adrienne; Takwoingi, Yemisi; Tonelli, Marcello; Weeks, Laura; Whiting, Penny; Willis, Brian H

    2018-01-23

    Systematic reviews of diagnostic test accuracy synthesize data from primary diagnostic studies that have evaluated the accuracy of 1 or more index tests against a reference standard, provide estimates of test performance, allow comparisons of the accuracy of different tests, and facilitate the identification of sources of variability in test accuracy. To develop the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagnostic test accuracy guideline as a stand-alone extension of the PRISMA statement. Modifications to the PRISMA statement reflect the specific requirements for reporting of systematic reviews and meta-analyses of diagnostic test accuracy studies and the abstracts for these reviews. Established standards from the Enhancing the Quality and Transparency of Health Research (EQUATOR) Network were followed for the development of the guideline. The original PRISMA statement was used as a framework on which to modify and add items. A group of 24 multidisciplinary experts used a systematic review of articles on existing reporting guidelines and methods, a 3-round Delphi process, a consensus meeting, pilot testing, and iterative refinement to develop the PRISMA diagnostic test accuracy guideline. The final version of the PRISMA diagnostic test accuracy guideline checklist was approved by the group. The systematic review (produced 64 items) and the Delphi process (provided feedback on 7 proposed items; 1 item was later split into 2 items) identified 71 potentially relevant items for consideration. The Delphi process reduced these to 60 items that were discussed at the consensus meeting. Following the meeting, pilot testing and iterative feedback were used to generate the 27-item PRISMA diagnostic test accuracy checklist. To reflect specific or optimal contemporary systematic review methods for diagnostic test accuracy, 8 of the 27 original PRISMA items were left unchanged, 17 were modified, 2 were added, and 2 were omitted. 
The 27-item

  1. Algorithms for the Construction of Parallel Tests by Zero-One Programming. Project Psychometric Aspects of Item Banking No. 7. Research Report 86-7.

    Boekkooi-Timminga, Ellen

    Nine methods for automated test construction are described. All are based on the concepts of information from item response theory. Two general kinds of methods for the construction of parallel tests are presented: (1) sequential test design; and (2) simultaneous test design. Sequential design implies that the tests are constructed one after the…

  2. The 40-item Monell Extended Sniffin' Sticks Identification Test (MONEX-40)

    Freiherr, J.; Gordon, A.R.; Alden, E.C.; Ponting, A.L.; Hernandez, M.; Boesveldt, S.; Lundstrom, J.N.

    2012-01-01

Background: Most existing olfactory identification (ID) tests have the primary aim of diagnosing clinical olfactory dysfunction, thereby rendering them sub-optimal for experimental settings where the aim is to detect differences in healthy subjects’ odor ID abilities. Materials and methods: We have

  3. Explanatory item response modeling of children's change on a dynamic test of analogical reasoning

    Stevenson, C.E.; Hickendorff, M.; Resing, W.C.M.; Heiser, W.J.; de Boeck, P.A.L.

    Dynamic testing is an assessment method in which training is incorporated into the procedure with the aim of gauging cognitive potential. Large individual differences are present in children's ability to profit from training in analogical reasoning. The aim of this experiment was to investigate

  4. Psychometric evaluation of the EORTC computerized adaptive test (CAT) fatigue item pool

    Petersen, Morten Aa; Giesinger, Johannes M; Holzner, Bernhard

    2013-01-01

    Fatigue is one of the most common symptoms associated with cancer and its treatment. To obtain a more precise and flexible measure of fatigue, the EORTC Quality of Life Group has developed a computerized adaptive test (CAT) measure of fatigue. This is part of an ongoing project developing a CAT...

  5. Evaluation of the box and blocks test, stereognosis and item banks of activity and upper extremity function in youths with brachial plexus birth palsy.

    Mulcahey, Mary Jane; Kozin, Scott; Merenda, Lisa; Gaughan, John; Tian, Feng; Gogola, Gloria; James, Michelle A; Ni, Pengsheng

    2012-09-01

One of the greatest limitations to measuring outcomes in pediatric orthopaedics is the lack of effective instruments. Computer adaptive testing, which uses large item banks, selects only items that are relevant to a child's function based on a previous response and filters out items that are too easy, too hard, or simply not relevant to the child. In this way, computer adaptive testing provides a meaningful, efficient, and precise method to evaluate patient-reported outcomes. Banks of items that assess activity and upper extremity (UE) function have been developed for children with cerebral palsy and have enabled computer adaptive tests that showed strong reliability, strong validity, and broader content range when compared with traditional instruments. Because of the void in instruments for children with brachial plexus birth palsy (BPBP) and the importance of having a UE and activity scale, we were interested in how well these items worked in this population. A cross-sectional, multicenter study involving 200 children with BPBP was conducted. The box and block test (BBT) and stereognosis tests were administered and patient reports of UE function and activity were obtained with the cerebral palsy item banks. Differential item functioning (DIF) was examined. The predictive ability of the BBT and stereognosis was evaluated with a proportional odds logistic regression model. Spearman correlation coefficients (rs) were calculated to examine the correlation between stereognosis and the BBT and between individual stereognosis items and the total stereognosis score. Six of the 86 items showed DIF, indicating that the activity and UE item banks may be useful for computer adaptive tests for children with BPBP. The penny and the button were the strongest predictors of impairment level (odds ratio = 0.34 to 0.40). There was a good positive relationship between total stereognosis and BBT scores (rs = 0.60). The BBT had a good negative (rs = -0.55) and good positive (rs = 0.55) relationship with

  6. Using Automated Processes to Generate Test Items And Their Associated Solutions and Rationales to Support Formative Feedback

    Mark Gierl

    2015-08-01

    Full Text Available Automatic item generation is the process of using item models to produce assessment tasks using computer technology. An item model is similar to a template that highlights the elements in the task that must be manipulated to produce new items. The purpose of our study is to describe an innovative method for generating large numbers of diverse and heterogeneous items along with their solutions and associated rationales to support formative feedback. We demonstrate the method by generating items in two diverse content areas, mathematics and nonverbal reasoning

  7. The Role of Item Models in Automatic Item Generation

    Gierl, Mark J.; Lai, Hollis

    2012-01-01

    Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates…

  8. Testing the robustness of deterministic models of optimal dynamic pricing and lot-sizing for deteriorating items under stochastic conditions

    Ghoreishi, Maryam

    2018-01-01

    Many models within the field of optimal dynamic pricing and lot-sizing for deteriorating items assume everything is deterministic and develop a differential equation as the core of the analysis. Two prominent examples are the papers by Rajan et al. (Manag Sci 38:240–262, 1992) and Abad (1996). In this paper, we expose the models by Abad (1996) and Rajan et al. (1992) to stochastic inputs, designing these inputs so that they align as closely as possible with the assumptions of those papers. We carry out our investigation through a numerical test in which we examine the robustness of the numerical results reported in Rajan et al. (1992) and Abad (1996) in a simulation model. Our numerical results confirm that the results stated in these papers are indeed robust when exposed to stochastic inputs.

  9. The effect of the Trier Social Stress Test (TSST) on item and associative recognition of words and pictures in healthy participants

    Jonathan eGuez

    2016-04-01

    Full Text Available Psychological stress, induced by the Trier Social Stress Test (TSST), has repeatedly been shown to alter memory performance. Although factors influencing memory performance such as stimulus nature (verbal/pictorial) and emotional valence have been extensively studied, results on whether stress impairs or improves memory are still inconsistent. This study aimed at exploring the effect of the TSST on item versus associative memory for neutral verbal and pictorial stimuli. Forty-eight healthy subjects were recruited; 24 participants were randomly assigned to the TSST group and the remaining 24 participants were assigned to the control group. Stress reactivity was measured by psychological (subjective state anxiety ratings) and physiological (galvanic skin response recording) measurements. Subjects performed an item-association memory task for both stimulus types (words, pictures) simultaneously, before and after the stress/non-stress manipulation. The results showed that memory recognition for pictorial stimuli was higher than for verbal stimuli. Memory for both words and pictures was impaired following the TSST; while the source of this impairment was specific to associative recognition for pictures, a more general deficit was observed for verbal material, as expressed in decreased recognition for both items and associations following the TSST. Response latency analysis indicated that the TSST manipulation decreased response time, but at the cost of memory accuracy. We conclude that stress does not uniformly affect memory; rather, it interacts with the task's cognitive load and stimulus type. Applying the current study's results to patients diagnosed with disorders associated with traumatic stress, our findings in healthy subjects under acute stress provide further support for our assertion that patients' impaired memory originates in poor recollection processing following depletion of attentional resources.

  10. Advances in psychometrics: from Classical Test Theory to Item Response Theory

    Laisa Marcorela Andreoli Sartes

    2013-01-01

    Full Text Available In the twentieth century, the development and evaluation of the psychometric properties of tests was based mainly on Classical Test Theory (CTT). Many tests are long and redundant, with measurements influenced by the characteristics of the sample of individuals assessed during test development, some of these limitations being consequences of the use of CTT. Item Response Theory (IRT) emerged as a possible solution to some of the limitations of CTT, improving the quality of the evaluation of test structure. In this text we critically compare the characteristics of CTT and IRT as methods for evaluating the psychometric properties of tests. The advantages and limitations of each method are discussed.

  11. Calibrating the Medical Council of Canada's Qualifying Examination Part I using an integrated item response theory framework: a comparison of models and designs.

    De Champlain, Andre F; Boulais, Andre-Philippe; Dallas, Andrew

    2016-01-01

    The aim of this research was to compare different methods of calibrating the multiple-choice question (MCQ) and clinical decision making (CDM) components of the Medical Council of Canada's Qualifying Examination Part I (MCCQEI) based on item response theory. Our data consisted of test results from 8,213 first-time applicants to the MCCQEI in the spring and fall 2010 and 2011 test administrations. The data set contained several thousand multiple choice items and several hundred CDM cases. Four dichotomous calibrations were run using BILOG-MG 3.0. All 3 mixed item format (dichotomous MCQ responses and polytomous CDM case scores) calibrations were conducted using PARSCALE 4. The 2-PL model had identical numbers of items with chi-square values at or below a Type I error rate of 0.01 (83/3,499 or 0.02). In all 3 polytomous models, whether the MCQs were anchored or concurrently run with the CDM cases, results suggest very poor fit. All IRT abilities estimated from dichotomous calibration designs correlated very highly with each other. IRT-based pass-fail rates were extremely similar, not only across calibration designs and methods, but also with regard to the actual decision reported to candidates. The largest difference noted in pass rates was 4.78%, which occurred between the mixed-format concurrent 2-PL graded response model (pass rate = 80.43%) and the dichotomous anchored 1-PL calibrations (pass rate = 85.21%). Simpler calibration designs with dichotomized items should be implemented. The dichotomous calibrations provided better fit of the item response matrix than the more complex, polytomous calibrations.
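
    The 1-PL and 2-PL models compared in this record differ only in whether a per-item discrimination parameter is estimated. A minimal sketch of the 2-PL item response function (function names are illustrative, not the examination's calibration code):

```python
import math

def p_2pl(theta, a, b):
    """2-PL item response function: probability of a correct response
    at ability theta, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An examinee whose ability equals the item's difficulty has a 50%
# chance of answering correctly, regardless of discrimination.
print(p_2pl(0.0, a=1.2, b=0.0))  # 0.5
# Fixing a = 1 for every item recovers the 1-PL (Rasch) special case.
```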

  13. Applying Item Response Theory methods to design a learning progression-based science assessment

    Chen, Jing

    Learning progressions are used to describe how students' understanding of a topic progresses over time and to classify the progress of students into steps or levels. This study applies Item Response Theory (IRT) based methods to investigate how to design learning progression-based science assessments. The research questions of this study are: (1) how to use items in different formats to classify students into levels on the learning progression, (2) how to design a test to give good information about students' progress through the learning progression of a particular construct and (3) what characteristics of test items support their use for assessing students' levels. Data used for this study were collected from 1500 elementary and secondary school students during 2009-2010. The written assessment was developed in several formats, such as Constructed Response (CR) items, Ordered Multiple Choice (OMC) items and Multiple True or False (MTF) items. The following are the main findings from this study. The OMC, MTF and CR items might measure different components of the construct. A single construct explained most of the variance in students' performances. However, additional dimensions in terms of item format can explain a certain amount of the variance in student performance. So additional dimensions need to be considered when we want to capture the differences in students' performances on different types of items targeting the understanding of the same underlying progression. Items in each item format need to be improved in certain ways to classify students more accurately into the learning progression levels. This study establishes some general steps that can be followed to design other learning progression-based tests as well. For example, first, the boundaries between levels on the IRT scale can be defined by using the means of the item thresholds across a set of good items. Second, items in multiple formats can be selected to achieve the information criterion at all

  14. The development and validation of a two-tiered multiple-choice instrument to identify alternative conceptions in earth science

    Mangione, Katherine Anna

    This study sought to determine reliability and validity for a two-tiered, multiple-choice instrument designed to identify alternative conceptions in earth science. Additionally, this study sought to identify alternative conceptions in earth science held by preservice teachers, to investigate relationships between self-reported confidence scores and understanding of earth science concepts, and to describe relationships between content knowledge, alternative conceptions, and planning instruction in the science classroom. Eighty-seven preservice teachers enrolled in the MAT program participated in this study. Sixty-eight participants were female, twelve were male, and seven chose not to answer. Forty-seven participants were in the elementary certification program, five were in the middle school certification program, and twenty-nine were pursuing secondary certification. Results indicate that the two-tiered, multiple-choice format can be a reliable and valid method for identifying alternative conceptions. Preservice teachers in all certification areas who participated in this study may possess common alternative conceptions previously identified in the literature. Alternative conceptions included: all rivers flow north to south, the shadow of the Earth covers the Moon causing lunar phases, the Sun is always directly overhead at noon, weather can be predicted by animal coverings, and seasons are caused by the Earth's proximity to the Sun. Statistical analyses indicated differences, though not all of them significant, among all subgroups according to gender and certification area. Generally, males outperformed females, and preservice teachers pursuing middle school certification had the highest scores on the questionnaire, followed by those obtaining secondary certification. Elementary preservice teachers scored the lowest. Additionally, self-reported scores of confidence in one's answers and understanding of the earth science concept in question were analyzed. There was a

  15. A Study on Individualized Tests

    Metin YAŞAR

    2017-12-01

    Full Text Available This study aims to compare the KR-20 reliability levels of a "Paper and Pencil Test" developed according to Classical Test Theory and an "Individualized Test" developed according to Item Response Theory (Two-Parameter Logistic Model), and the correlation between the skill measurements obtained via these two methods in a group of students. The individualized test developed in accordance with the Two-Parameter Logistic Model was administered by means of a question pool consisting of 61 multiple-choice items which can be answered in 13 steps. A paper and pencil test of 47 multiple-choice items was administered to the sample student group. After the tests developed according to these two methods were applied to the same group, the KR-20 reliability coefficient was calculated as 0.67 for the individualized test and as 0.75 for the paper and pencil test prepared according to Classical Test Theory. The KR-20 reliability coefficients obtained from the study were converted via Fisher's Z and tested at the 0.05 significance level. No meaningful difference was detected between the two KR-20 reliability coefficients obtained from the two methods. The Pearson product-moment correlation coefficient between the scores on the individualized test and the measurement results of the paper and pencil test was calculated as 0.36. A positive yet low correlation was observed between the measurement results obtained from the tests developed according to both methods. Consequently, at the 0.05 significance level there was no statistically significant difference between the KR-20 reliability coefficients of the tests developed according to the two methods, and the correlation between the students' skill measurements on the two tests, while positive, was low and not significant at the 0.05 level.
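
    The comparison in this record rests on two computations: KR-20 from a matrix of dichotomous item scores, and Fisher's r-to-z transform for testing whether two independent coefficients differ. A minimal sketch (illustrative names; the total-score variance is computed here as a population variance):

```python
import math

def kr20(scores):
    """KR-20 for a matrix of 0/1 item scores (rows = examinees)."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / n           # item difficulty p_j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

def fisher_z_diff(r1, n1, r2, n2):
    """z statistic for the difference between two independent
    coefficients after Fisher's r-to-z transform."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Toy 4-examinee, 3-item test
scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
print(kr20(scores))  # 0.75
# |z| below 1.96 means no significant difference at the 0.05 level
print(abs(fisher_z_diff(0.67, 64, 0.75, 64)) < 1.96)  # True
```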

  16. Grade 12 Diploma Examination, English 30. Part B: Reading (Multiple Choice). Readings Booklet. 1986 Edition.

    Alberta Dept. of Education, Edmonton.

    Intended for students taking the Grade 12 Examination in English 30 in Alberta, Canada, this reading test (to be administered along with the questions booklet) contains 10 short reading selections taken from fiction, nonfiction, poetry, and drama, including the following: "My Magical Metronome" (Lewis Thomas); "Queen Street…

  17. New Multiple-Choice Measures of Historical Thinking: An Investigation of Cognitive Validity

    Smith, Mark D.

    2018-01-01

    History education scholars have recognized the need for test validity research in recent years and have called for empirical studies that explore how to best measure historical thinking processes. The present study was designed to help answer this call and to provide a model that others can adapt to carry this line of research forward. It employed…

  18. A Novel Multiple Choice Question Generation Strategy: Alternative Uses for Controlled Vocabulary Thesauri in Biomedical-Sciences Education.

    Lopetegui, Marcelo A; Lara, Barbara A; Yen, Po-Yin; Çatalyürek, Ümit V; Payne, Philip R O

    2015-01-01

    Multiple choice questions play an important role in training and evaluating biomedical science students. However, the resource-intensive nature of question generation limits their open availability, restricting their contribution mainly to evaluation purposes. Although applied-knowledge questions require a complex formulation process, the creation of concrete-knowledge questions (i.e., definitions, associations) could be assisted by the use of informatics methods. We envisioned a novel and simple algorithm that exploits validated knowledge repositories and generates concrete-knowledge questions by leveraging concepts' relationships. In this manuscript we present the development and validation of a prototype which successfully produced meaningful concrete-knowledge questions, opening new applications for existing knowledge repositories and potentially benefiting students of all biomedical science disciplines.

  19. Changes in Word Usage Frequency May Hamper Intergenerational Comparisons of Vocabulary Skills: An Ngram Analysis of Wordsum, WAIS, and WISC Test Items

    Roivainen, Eka

    2014-01-01

    Research on secular trends in mean intelligence test scores shows smaller gains in vocabulary skills than in nonverbal reasoning. One possible explanation is that vocabulary test items become outdated faster compared to nonverbal tasks. The history of the usage frequency of the words on five popular vocabulary tests, the GSS Wordsum, Wechsler…

  20. A Comparison of Item Selection Procedures Using Different Ability Estimation Methods in Computerized Adaptive Testing Based on the Generalized Partial Credit Model

    Ho, Tsung-Han

    2010-01-01

    Computerized adaptive testing (CAT) provides a highly efficient alternative to the paper-and-pencil test. By selecting items that match examinees' ability levels, CAT not only can shorten test length and administration time but it can also increase measurement precision and reduce measurement error. In CAT, maximum information (MI) is the most…
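
    Maximum-information selection can be sketched concretely for the dichotomous 2-PL case, where item information at ability theta is a^2 * P * (1 - P) and the next item administered is the unadministered item that maximizes it. The toy bank and names below are illustrative; the generalized partial credit model studied in this report would use a polytomous information function instead:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2-PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def pick_next_item(theta, bank, administered):
    """Maximum-information (MI) selection: among items not yet given,
    choose the one most informative at the current ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta, *bank[i]))

bank = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]  # (a, b) per item
# The highly discriminating item whose difficulty matches theta wins
print(pick_next_item(0.0, bank, set()))  # 1
```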

  1. easyCBM CCSS Math Item Scaling and Test Form Revision (2012-2013): Grades 6-8. Technical Report #1313

    Anderson, Daniel; Alonzo, Julie; Tindal, Gerald

    2012-01-01

    The purpose of this technical report is to document the piloting and scaling of new easyCBM mathematics test items aligned with the Common Core State Standards (CCSS) and to describe the process used to revise and supplement the 2012 research version easyCBM CCSS math tests in Grades 6-8. For all operational 2012 research version test forms (10…

  2. Development of knowledge tests for multi-disciplinary emergency training

    Sorensen, J. L.; Thellesen, L.; Strandbygaard, J.

    2015-01-01

    and evaluating a multiple-choice question (MCQ) test for use in a multi-disciplinary training program in obstetric-anesthesia emergencies. Methods: A multi-disciplinary working committee with 12 members representing six professional healthcare groups and another 28 participants were involved. Recurrent revisions…; 40 out of the original 50 items were included in the final MCQ test. The MCQ test was able to distinguish between levels of competence, and good construct validity was indicated by a significant difference in the mean score between consultants and first-year trainees, as well as between first…

  3. Development of a psychological test to measure ability-based emotional intelligence in the Indonesian workplace using an item response theory

    Fajrianthi

    2017-11-01

    Full Text Available Fajrianthi,1 Rizqy Amelia Zein2 1Department of Industrial and Organizational Psychology, 2Department of Personality and Social Psychology, Faculty of Psychology, Universitas Airlangga, Surabaya, East Java, Indonesia Abstract: This study aimed to develop an emotional intelligence (EI) test that is suitable to the Indonesian workplace context. The Airlangga Emotional Intelligence Test (Tes Kecerdasan Emosi Airlangga [TKEA]) was designed to measure three EI domains: 1) emotional appraisal, 2) emotional recognition, and 3) emotional regulation. TKEA consisted of 120 items with 40 items for each subset. TKEA was developed based on the Situational Judgment Test (SJT) approach. To ensure its psychometric qualities, categorical confirmatory factor analysis (CCFA) and item response theory (IRT) were applied to test its validity and reliability. The study was conducted on 752 participants, and the results showed that the test information function (TIF) was 3.414 for subset 1 (ability level = 0), 12.183 for subset 2 (ability level = -2), and 2.398 for subset 3 (ability level = -2). It is concluded that TKEA performs very well in measuring individuals with a low level of EI ability. It is worth noting that TKEA is currently at the development stage; therefore, in this study, we investigated TKEA's item analysis and the dimensionality of each TKEA subset. Keywords: categorical confirmatory factor analysis, emotional intelligence, item response theory

  4. Item Banking with Embedded Standards

    MacCann, Robert G.; Stanley, Gordon

    2009-01-01

    An item banking method that does not use Item Response Theory (IRT) is described. This method provides a comparable grading system across schools that would be suitable for low-stakes testing. It uses the Angoff standard-setting method to obtain item ratings that are stored with each item. An example of such a grading system is given, showing how…

  5. Test-retest reliability of selected items of Health Behaviour in School-aged Children (HBSC survey questionnaire in Beijing, China

    Liu Yang

    2010-08-01

    Full Text Available Abstract Background Children's health and health behaviour are essential for their development, and it is important to obtain abundant and accurate information to understand young people's health and health behaviour. The Health Behaviour in School-aged Children (HBSC) study is among the first large-scale international surveys on adolescent health through self-report questionnaires. So far, more than 40 countries in Europe and North America have been involved in the HBSC study. The purpose of this study is to assess the test-retest reliability of selected items in the Chinese version of the HBSC survey questionnaire in a sample of adolescents in Beijing, China. Methods A sample of 95 male and female students aged 11 or 15 years old participated in a test and retest with a three-week interval. Student identity numbers of respondents were utilized to permit matching of test-retest questionnaires. 23 items concerning physical activity, sedentary behaviour, sleep and substance use were evaluated by using the percentage of response shifts and the single-measure intraclass correlation coefficients (ICC) with 95% confidence intervals (CI) for all respondents and stratified by gender and age. Items on substance use were only evaluated for school children aged 15 years old. Results The percentage of no response shift between test and retest varied from 32% for the item on computer use at weekends to 92% for the three items on smoking. Of all the 23 items evaluated, 6 items (26%) showed moderate reliability, 12 items (52%) displayed substantial reliability and 4 items (17%) indicated almost perfect reliability. No gender or age group difference in test-retest reliability was found except for a few items on sedentary behaviour. Conclusions The overall findings of this study suggest that most selected indicators in the HBSC survey questionnaire have satisfactory test-retest reliability for the students in Beijing. Further test-retest studies in a large
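
    The single-measure ICC used in this record can take several forms; a minimal sketch of the one-way variant, ICC(1,1), computed from a subjects-by-occasions score matrix (illustrative, not necessarily the exact estimator used in the HBSC study):

```python
def icc_oneway(ratings):
    """Single-measure one-way ICC, ICC(1,1) = (MSB - MSW) / (MSB + (k-1)*MSW),
    for rows = subjects and columns = repeated measurements (test, retest)."""
    n = len(ratings)          # subjects
    k = len(ratings[0])       # measurements per subject
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subjects mean squares
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - row_means[i]) ** 2
              for i, row in enumerate(ratings) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Perfect test-retest agreement yields an ICC of 1
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))  # 1.0
```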

  6. Development and Application of Methods for Estimating Operating Characteristics of Discrete Test Item Responses without Assuming any Mathematical Form.

    Samejima, Fumiko

    In latent trait theory the latent space, or space of the hypothetical construct, is usually represented by some unidimensional or multi-dimensional continuum of real numbers. Like the latent space, the item response can either be treated as a discrete variable or as a continuous variable. Latent trait theory relates the item response to the latent…

  7. Comparing Two Types of Diagnostic Items to Evaluate Understanding of Heat and Temperature Concepts

    Chu, Hye-Eun; Chandrasegaran, A. L.; Treagust, David F.

    2018-01-01

    The purpose of this research was to investigate an efficient method to assess year 8 (age 13-14) students' conceptual understanding of heat and temperature concepts. Two different types of instruments were used in this study: Type 1, consisting of multiple-choice items with open-ended justifications; and Type 2, consisting of two-tier…

  8. Fitting a Mixture Rasch Model to English as a Foreign Language Listening Tests: The Role of Cognitive and Background Variables in Explaining Latent Differential Item Functioning

    Aryadoust, Vahid

    2015-01-01

    The present study uses a mixture Rasch model to examine latent differential item functioning in English as a foreign language listening tests. Participants (n = 250) took a listening and lexico-grammatical test and completed the metacognitive awareness listening questionnaire comprising problem solving (PS), planning and evaluation (PE), mental…

  9. Automated Scoring of Short-Answer Open-Ended GRE® Subject Test Items. ETS GRE® Board Research Report No. 04-02. ETS RR-08-20

    Attali, Yigal; Powers, Don; Freedman, Marshall; Harrison, Marissa; Obetz, Susan

    2008-01-01

    This report describes the development, administration, and scoring of open-ended variants of GRE® Subject Test items in biology and psychology. These questions were administered in a Web-based experiment to registered examinees of the respective Subject Tests. The questions required a short answer of 1-3 sentences, and responses were automatically…

  10. A validated model for the 22-item Sino-Nasal Outcome Test subdomain structure in chronic rhinosinusitis.

    Feng, Allen L; Wesely, Nicholas C; Hoehle, Lloyd P; Phillips, Katie M; Yamasaki, Alisa; Campbell, Adam P; Gregorio, Luciano L; Killeen, Thomas E; Caradonna, David S; Meier, Josh C; Gray, Stacey T; Sedaghat, Ahmad R

    2017-12-01

    Previous studies have identified subdomains of the 22-item Sino-Nasal Outcome Test (SNOT-22), reflecting distinct and largely independent categories of chronic rhinosinusitis (CRS) symptoms. However, no study has validated the subdomain structure of the SNOT-22. This study aims to validate the existence of underlying symptom subdomains of the SNOT-22 using confirmatory factor analysis (CFA) and to develop a subdomain model that practitioners and researchers can use to describe CRS symptomatology. A total of 800 patients with CRS were included into this cross-sectional study (400 CRS patients from Boston, MA, and 400 CRS patients from Reno, NV). Their SNOT-22 responses were analyzed using exploratory factor analysis (EFA) to determine the number of symptom subdomains. A CFA was performed to develop a validated measurement model for the underlying SNOT-22 subdomains along with various tests of validity and goodness of fit. EFA demonstrated 4 distinct factors reflecting: sleep, nasal, otologic/facial pain, and emotional symptoms (Cronbach's alpha, >0.7; Bartlett's test of sphericity, p Kaiser-Meyer-Olkin >0.90), independent of geographic locale. The corresponding CFA measurement model demonstrated excellent measures of fit (root mean square error of approximation, 0.95; Tucker-Lewis index, >0.95) and measures of construct validity (heterotrait-monotrait [HTMT] ratio, 0.7), again independent of geographic locale. The use of the 4-subdomain structure for SNOT-22 (reflecting sleep, nasal, otologic/facial pain, and emotional symptoms of CRS) was validated as the most appropriate to calculate SNOT-22 subdomain scores for patients from different geographic regions using CFA. © 2017 ARS-AAOA, LLC.

  11. Creation and validation of the barriers to alcohol reduction (BAR) scale using classical test theory and item response theory.

    Kunicki, Zachary J; Schick, Melissa R; Spillane, Nichea S; Harlow, Lisa L

    2018-06-01

    Those who binge drink are at increased risk for alcohol-related consequences when compared to non-binge drinkers. Research shows individuals may face barriers to reducing their drinking behavior, but few measures exist to assess these barriers. This study created and validated the Barriers to Alcohol Reduction (BAR) scale. Participants were college students (n = 230) who endorsed at least one instance of past-month binge drinking (4+ drinks for women or 5+ drinks for men). Using classical test theory, exploratory structural equation modeling found a two-factor structure of personal/psychosocial barriers and perceived program barriers. The sub-factors and the full scale had reasonable internal consistency (coefficient omega = 0.78 for personal/psychosocial barriers, 0.82 for program barriers, and 0.83 for the full measure). The BAR also showed evidence of convergent validity with the Brief Young Adult Alcohol Consequences Questionnaire (r = 0.39). Item Response Theory (IRT) analysis showed that the two factors separately met the unidimensionality assumption and provided further evidence for the severity of the items on the two factors. Results suggest that the BAR measure appears reliable and valid for use in an undergraduate student population of binge drinkers. Future studies may want to re-examine this measure in a more diverse sample.

  12. Lord-Wingersky Algorithm Version 2.0 for Hierarchical Item Factor Models with Applications in Test Scoring, Scale Alignment, and Model Fit Testing.

    Cai, Li

    2015-06-01

    Lord and Wingersky's (Appl Psychol Meas 8:453-461, 1984) recursive algorithm for creating summed score based likelihoods and posteriors has a proven track record in unidimensional item response theory (IRT) applications. Extending the recursive algorithm to handle multidimensionality is relatively simple, especially with fixed quadrature because the recursions can be defined on a grid formed by direct products of quadrature points. However, the increase in computational burden remains exponential in the number of dimensions, making the implementation of the recursive algorithm cumbersome for truly high-dimensional models. In this paper, a dimension reduction method that is specific to the Lord-Wingersky recursions is developed. This method can take advantage of the restrictions implied by hierarchical item factor models, e.g., the bifactor model, the testlet model, or the two-tier model, such that a version of the Lord-Wingersky recursive algorithm can operate on a dramatically reduced set of quadrature points. For instance, in a bifactor model, the dimension of integration is always equal to 2, regardless of the number of factors. The new algorithm not only provides an effective mechanism to produce summed score to IRT scaled score translation tables properly adjusted for residual dependence, but leads to new applications in test scoring, linking, and model fit checking as well. Simulated and empirical examples are used to illustrate the new applications.
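
    For dichotomous items, the Lord-Wingersky recursion builds the summed-score distribution one item at a time from each item's response probability at a fixed ability. A minimal sketch of the classic unidimensional recursion (illustrative function name; the paper's contribution extends this idea to hierarchical item factor models):

```python
def lord_wingersky(probs):
    """Lord-Wingersky recursion: given each item's probability of a
    correct response at a fixed ability, return the distribution of
    the summed score over all items (dichotomous case)."""
    dist = [1.0]                       # score distribution after 0 items
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - p)   # item answered incorrectly
            new[s + 1] += mass * p     # item answered correctly
        dist = new
    return dist

# Three items each answered with probability 0.5 -> Binomial(3, 0.5)
print(lord_wingersky([0.5, 0.5, 0.5]))  # [0.125, 0.375, 0.375, 0.125]
```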

  13. Item analysis of a multiple-choice reading test of the CILS certification of Italian for foreigners (level B1; summer 2012 session)

    Paolo Torresan

    2014-10-01

    Full Text Available In this article we present an analysis of the items in a multiple-choice reading test, level B1, of the CILS certification (Università per Stranieri di Siena). The research starts with a preliminary examination of the text on which the test is based, including a study of the modifications it underwent at the item writer's hand, and proceeds to an analysis of every single item, using data from the administration of the test to 161 students of Italian at the corresponding level from all over the world. Our research shows that the test presents an ambiguous item (#1), which has two keys, and a difficult item (#4), which lacks the information needed to infer the meaning of the word it refers to.

  14. A comparative analysis of multiple-choice and student performance-task assessment in the high school biology classroom

    Cushing, Patrick Ryan

    This study compared the performance of high school students on laboratory assessments. Thirty-four high school students who were enrolled in the second semester of a regular biology class or had completed the biology course the previous semester participated in this study. They were randomly assigned to examinations of two formats, performance-task and traditional multiple-choice, from two content areas, using a compound light microscope and diffusion. Students were directed to think-aloud as they performed the assessments. Additional verbal data were obtained during interviews following the assessment. The tape-recorded narrative data were analyzed for type and diversity of knowledge and skill categories, and percentage of in-depth processing demonstrated. While overall mean scores on the assessments were low, elicited statements provided additional insight into student cognition. Results indicated that a greater diversity of knowledge and skill categories was elicited by the two microscope assessments and by the two performance-task assessments. In addition, statements demonstrating in-depth processing were coded most frequently in narratives elicited during clinical interviews following the diffusion performance-task assessment. This study calls for individual teachers to design authentic assessment practices and apply them to daily classroom routines. Authentic assessment should be an integral part of the learning process and not merely an end result. In addition, teachers are encouraged to explicitly identify and model, through think-aloud methods, desired cognitive behaviors in the classroom.

  15. A unified factor-analytic approach to the detection of item and test bias: Illustration with the effect of providing calculators to students with dyscalculia

    Lee, M. K.

    2016-01-01

    An absence of measurement bias against distinct groups is a prerequisite for the use of a given psychological instrument in scientific research or high-stakes assessment. Factor analysis is the framework explicitly adopted for the identification of such bias when the instrument consists of a multi-test battery, whereas item response theory is employed when the focus narrows to a single test composed of discrete items. Item response theory can be treated as a mild nonlinearization of the standard factor model, and thus the essential unity of bias detection at the two levels merits greater recognition. Here we illustrate the benefits of a unified approach with a real-data example, which comes from a statewide test of mathematics achievement where examinees diagnosed with dyscalculia were accommodated with calculators. We found that items that can be solved by explicit arithmetical computation became easier for the accommodated examinees, but the quantitative magnitude of this differential item functioning (measurement bias) was small.

  16. Exploring Secondary Students' Knowledge and Misconceptions about Influenza: Development, validation, and implementation of a multiple-choice influenza knowledge scale

    Romine, William L.; Barrow, Lloyd H.; Folk, William R.

    2013-07-01

    Understanding infectious diseases such as influenza is an important element of health literacy. We present a fully validated knowledge instrument called the Assessment of Knowledge of Influenza (AKI) and use it to evaluate knowledge of influenza, with a focus on misconceptions, in Midwestern United States high-school students. A two-phase validation process was used. In phase 1, an initial factor structure was calculated based on 205 students of grades 9-12 at a rural school. In phase 2, one- and two-dimensional factor structures were analyzed from the perspectives of classical test theory and the Rasch model using structural equation modeling and principal components analysis (PCA) on Rasch residuals, respectively. Rasch knowledge measures were calculated for 410 students from 6 school districts in the Midwest, and misconceptions were verified through the χ² test. Eight items measured knowledge of flu transmission, and seven measured knowledge of flu management. While alpha reliability measures for the subscales were acceptable, Rasch person reliability measures and PCA on residuals advocated for a single-factor scale. Four misconceptions were found, which have not been previously documented in high-school students. The AKI is the first validated influenza knowledge assessment, and can be used by schools and health agencies to provide a quantitative measure of impact of interventions aimed at increasing understanding of influenza. This study also adds significantly to the literature on misconceptions about influenza in high-school students, a necessary step toward strategic development of educational interventions for these students.

  17. Development of a psychological test to measure ability-based emotional intelligence in the Indonesian workplace using an item response theory.

    Fajrianthi; Zein, Rizqy Amelia

    2017-01-01

    This study aimed to develop an emotional intelligence (EI) test that is suitable for the Indonesian workplace context. The Airlangga Emotional Intelligence Test (Tes Kecerdasan Emosi Airlangga [TKEA]) was designed to measure three EI domains: 1) emotional appraisal, 2) emotional recognition, and 3) emotional regulation. TKEA consisted of 120 items, with 40 items for each subset. TKEA was developed based on the Situational Judgment Test (SJT) approach. To ensure its psychometric qualities, categorical confirmatory factor analysis (CCFA) and item response theory (IRT) were applied to test its validity and reliability. The study was conducted on 752 participants, and the results showed that the test information function (TIF) was 3.414 for subset 1 (ability level = 0), 12.183 for subset 2 (ability level = -2), and 2.398 for subset 3 (ability level = -2). It is concluded that TKEA performs very well in measuring individuals with a low level of EI ability. It is worth noting that TKEA is currently at the development stage; therefore, in this study, we investigated TKEA's item analysis and the dimensionality of each TKEA subset.
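Test information values like those reported above can be computed with standard IRT machinery. A hedged sketch assuming a 2PL model (the abstract does not specify TKEA's model or parameters; the function and arrays below are illustrative):

```python
import numpy as np

def tif_2pl(theta, a, b):
    """Test information function under a 2PL model:
    I(theta) = sum_i a_i^2 * P_i(theta) * (1 - P_i(theta)),
    where P_i is the 2PL probability of a correct/keyed response.

    theta : scalar or array of ability values
    a, b  : arrays of item discriminations and difficulties
    """
    theta = np.atleast_1d(theta)[:, None]        # ability grid as a column
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL response probabilities
    return (a**2 * P * (1.0 - P)).sum(axis=1)    # sum item information over items
```

A test that peaks near ability level -2, as subset 2 does here, simply has its items' difficulty parameters clustered around that region.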

  18. The Development Of A Diagnostic Reading Test Of English For The Students Of Medical Faculty, Brawijaya University

    Indah Winarni

    2003-01-01

    This paper describes the development of a diagnostic multiple-choice reading comprehension test as an initial stage in developing teaching materials for medical students learning English. Sample texts were collected from all the departments in the faculty. Selection of relevant texts involved the participation of some subject lecturers. Sixty-one items were developed from fifteen texts, reduced to forty items after pilot testing. Face validity was improved. The main trial was administered to twenty-nine students and item analysis was carried out. The test showed a low level of concurrent validity, and internal consistency showed a moderate level of reliability. The low level of concurrent validity was suspected to result from the test being too difficult for the testees, as the item analysis had revealed.

  19. More than the Verbal Stimulus Matters: Visual Attention in Language Assessment for People with Aphasia Using Multiple-Choice Image Displays

    Heuer, Sabine; Ivanova, Maria V.; Hallowell, Brooke

    2017-01-01

    Purpose: Language comprehension in people with aphasia (PWA) is frequently evaluated using multiple-choice displays: PWA are asked to choose the image that best corresponds to the verbal stimulus in a display. When a nontarget image is selected, comprehension failure is assumed. However, stimulus-driven factors unrelated to linguistic…

  20. Incorporating Multiple-Choice Questions into an AACSB Assurance of Learning Process: A Course-Embedded Assessment Application to an Introductory Finance Course

    Santos, Michael R.; Hu, Aidong; Jordan, Douglas

    2014-01-01

    The authors offer a classification technique to make a quantitative skills rubric more operational, with the groupings of multiple-choice questions to match the student learning levels in knowledge, calculation, quantitative reasoning, and analysis. The authors applied this classification technique to the mid-term exams of an introductory finance…

  1. Predicting Social and Communicative Ability in School-Age Children with Autism Spectrum Disorder: A Pilot Study of the Social Attribution Task, Multiple Choice

    Burger-Caplan, Rebecca; Saulnier, Celine; Jones, Warren; Klin, Ami

    2016-01-01

    The Social Attribution Task, Multiple Choice is introduced as a measure of implicit social cognitive ability in children, addressing a key challenge in quantification of social cognitive function in autism spectrum disorder, whereby individuals can often be successful in explicit social scenarios, despite marked social adaptive deficits. The…

  2. Does the think-aloud protocol reflect thinking? Exploring functional neuroimaging differences with thinking (answering multiple choice questions) versus thinking aloud

    Durning, S.J.; Artino, A.R.; Beckman, T.J.; Graner, J.; Vleuten, C.P.M. van der; Holmboe, E.; Schuwirth, L.

    2013-01-01

    Background: Whether the think-aloud protocol is a valid measure of thinking remains uncertain. Therefore, we used functional magnetic resonance imaging (fMRI) to investigate potential functional neuroanatomic differences between thinking (answering multiple-choice questions in real time) versus

  3. The Italian version of the 16-item prodromal questionnaire (iPQ-16): Field-test and psychometric features.

    Lorenzo, Pelizza; Silvia, Azzali; Federica, Paterlini; Sara, Garlassi; Ilaria, Scazza; Pupo, Simona; Andrea, Raballo

    2018-03-20

    Among current early screeners for psychosis-risk states, the 16-item Prodromal Questionnaire (PQ-16) is often used. We aimed to assess the validity and reliability of the Italian version of the PQ-16 in a young adult help-seeking population. We included 154 individuals aged 18-35 years seeking help at the Reggio Emilia outpatient mental health services in a large semirural catchment area (550,000 inhabitants). Participants completed the Italian version of the PQ-16 (iPQ-16) and were subsequently evaluated with the Comprehensive Assessment of At-Risk Mental States (CAARMS). We examined diagnostic accuracy (i.e. specificity, sensitivity, negative and positive likelihood ratios, and negative and positive predictive values) and content, convergent, and concurrent validity between the PQ-16 and the CAARMS using Cronbach's alpha, Spearman's rho, and Cohen's kappa, respectively. We also tested the validity of the adopted PQ-16 cut-offs through Receiver Operating Characteristic (ROC) curves plotted against CAARMS diagnoses, as well as the 1-year predictive validity of the PQ-16. The iPQ-16 showed high internal consistency and acceptable diagnostic accuracy and concurrent validity. ROC analyses pointed to a score of ≥5 as the best cut-off. After 12 months of follow-up, 8.7% of participants with a PQ-16 symptom total score of ≥5 who were below the CAARMS psychosis threshold at baseline developed a psychotic disorder. The psychometric properties of the iPQ-16 were satisfactory. Copyright © 2018. Published by Elsevier B.V.
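The diagnostic-accuracy indices the abstract lists are straightforward to compute once a cut-off is fixed. A minimal sketch (function name and toy data are mine; the study's actual scores are not reproduced here) of evaluating a PQ-16-style screener at the ≥5 cut-off against a reference diagnosis:

```python
import numpy as np

def screen_accuracy(scores, reference_positive, cutoff=5):
    """Sensitivity, specificity, PPV, and NPV of a screener at a given
    cut-off, judged against a boolean reference diagnosis (e.g. CAARMS)."""
    flagged = np.asarray(scores) >= cutoff
    ref = np.asarray(reference_positive, dtype=bool)
    tp = np.sum(flagged & ref)      # screen positive, reference positive
    fp = np.sum(flagged & ~ref)     # screen positive, reference negative
    fn = np.sum(~flagged & ref)     # screen negative, reference positive
    tn = np.sum(~flagged & ~ref)    # screen negative, reference negative
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

Sweeping `cutoff` over the observed score range and plotting sensitivity against 1 - specificity yields the ROC curve the authors used to select ≥5.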

  4. The Differences among Three-, Four-, and Five-Option-Item Formats in the Context of a High-Stakes English-Language Listening Test

    Lee, HyeSun; Winke, Paula

    2013-01-01

    We adapted three practice College Scholastic Ability Tests (CSAT) of English listening, each with five-option items, to create four- and three-option versions by asking 73 Korean speakers or learners of English to eliminate the least plausible options in two rounds. Two hundred and sixty-four Korean high school English-language learners formed…

  5. An Item Response Theory-Based, Computerized Adaptive Testing Version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS)

    Makransky, Guido; Dale, Philip S.; Havmose, Philip; Bleses, Dorthe

    2016-01-01

    Purpose: This study investigated the feasibility and potential validity of an item response theory (IRT)-based computerized adaptive testing (CAT) version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS; Fenson et al., 2007) vocabulary checklist, with the objective of reducing length while maintaining…

  6. Testing ESL pragmatics development and validation of a web-based assessment battery

    Roever, Carsten

    2014-01-01

    Although second language learners' pragmatic competence (their ability to use language in context) is an essential part of their general communicative competence, it has not been a part of second language tests. This book helps fill this gap by describing the development and validation of a web-based test of ESL pragmalinguistics. The instrument assesses learners' knowledge of routine formulae, speech acts, and implicature in 36 multiple-choice and brief-response items. The test's quantitative and qualitative validation with 300 learners showed high reliability and provided strong evidence of

  7. Using classical test theory, item response theory, and Rasch measurement theory to evaluate patient-reported outcome measures: a comparison of worked examples.

    Petrillo, Jennifer; Cano, Stefan J; McLeod, Lori D; Coon, Cheryl D

    2015-01-01

    To provide comparisons and a worked example of item- and scale-level evaluations based on three psychometric methods used in patient-reported outcome development, namely classical test theory (CTT), item response theory (IRT), and Rasch measurement theory (RMT), in an analysis of the National Eye Institute Visual Functioning Questionnaire (VFQ-25). Baseline VFQ-25 data from 240 participants with diabetic macular edema from a randomized, double-masked, multicenter clinical trial were used to evaluate the VFQ at the total score level. CTT, RMT, and IRT evaluations were conducted, and results were assessed in a head-to-head comparison. Results were similar across the three methods, with IRT and RMT providing more detailed diagnostic information on how to improve the scale. CTT led to the identification of two problematic items that threaten the validity of the overall scale score, sets of redundant items, and skewed response categories. IRT and RMT additionally identified poor fit for one item, many locally dependent items, poor targeting, and disordering of over half the response categories. Selection of a psychometric approach depends on many factors. Researchers should justify their evaluation method and consider the intended audience. If the instrument is being developed for descriptive purposes and on a restricted budget, a cursory examination of the CTT-based psychometric properties may be all that is possible. In a high-stakes situation, such as the development of a patient-reported outcome instrument for consideration in pharmaceutical labeling, however, a thorough psychometric evaluation including IRT or RMT should be considered, with final item-level decisions made on the basis of both quantitative and qualitative results. Copyright © 2015. Published by Elsevier Inc.
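The "cursory CTT examination" the authors mention usually begins with internal consistency. A minimal sketch of Cronbach's alpha for a persons-by-items score matrix (the VFQ-25 data are not reproduced here; the matrix in the test is illustrative):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from a score matrix with rows = examinees and
    columns = items: alpha = k/(k-1) * (1 - sum(item variances) / total variance)."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()   # variance of each item column
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of the summed scores
    return k / (k - 1) * (1.0 - item_var_sum / total_var)
```

Alpha alone cannot flag the local dependence or disordered categories that the IRT and RMT analyses uncovered, which is the article's point about choosing the evaluation method to match the stakes.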

  8. Gender-Based Differential Item Performance in Mathematics Achievement Items.

    Doolittle, Allen E.; Cleary, T. Anne

    1987-01-01

    Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Signed measures of differential item performance (DIP) were obtained for each item in the eight ACTM forms. DIP estimates were analyzed and a significant item category effect was found. (Author/LMO)

  9. Differential Item Functioning Analysis Using a Mixture 3-Parameter Logistic Model with a Covariate on the TIMSS 2007 Mathematics Test

    Choi, Youn-Jeng; Alexeev, Natalia; Cohen, Allan S.

    2015-01-01

    The purpose of this study was to explore what may be contributing to differences in performance in mathematics on the Trends in International Mathematics and Science Study 2007. This was done by using a mixture item response theory modeling approach to first detect latent classes in the data and then to examine differences in performance on items…

  10. Item validity vs. item discrimination index: a redundancy?

    Panjaitan, R. L.; Irawati, R.; Sujana, A.; Hanifah, N.; Djuanda, D.

    2018-03-01

    In much of the literature on evaluation and test analysis, it is common to find calculations of both item validity and the item discrimination index (D), each with its own formula. Meanwhile, other resources state that the item discrimination index can be obtained by calculating the correlation between a testee's score on a particular item and the testee's score on the overall test, which is the same concept as item validity. Some research reports, especially undergraduate theses, tend to include both item validity and the item discrimination index in the instrument analysis. These concepts may overlap, for both reflect the quality of a test in measuring the examinees' ability. In this paper, examples of data-processing results for item validity and the item discrimination index were compared. We discuss whether item validity and the item discrimination index can be represented by only one of the two, or whether it is better to present both calculations in simple test analysis, especially in undergraduate theses where test analyses are included.
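The two quantities being compared can be computed side by side. A sketch of the classical upper-lower discrimination index D and the item-total (point-biserial) correlation that the article treats as item validity; the function names and the 27% grouping convention are common practice rather than this paper's specification:

```python
import numpy as np

def discrimination_d(item, total, frac=0.27):
    """Classical discrimination index D: proportion correct in the top
    `frac` of examinees (by total score) minus the proportion correct
    in the bottom `frac`."""
    order = np.argsort(total)
    k = max(1, int(round(frac * len(total))))
    low, high = order[:k], order[-k:]
    return item[high].mean() - item[low].mean()

def point_biserial(item, total):
    """Item-total correlation between a 0/1 item score and the total
    test score (the 'item validity' coefficient)."""
    return np.corrcoef(item, total)[0, 1]
```

On typical data the two indices rank items similarly, which is exactly the redundancy the authors examine.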

  11. Small group learning: effect on item analysis and accuracy of self-assessment of medical students.

    Biswas, Shubho Subrata; Jain, Vaishali; Agrawal, Vandana; Bindra, Maninder

    2015-01-01

    Small group sessions are regarded as a more active and student-centered approach to learning. Item analysis provides objective evidence of whether such sessions improve comprehension and make the topic easier for students, in addition to assessing the relative benefit of the sessions to good versus poor performers. Self-assessment makes students aware of their deficiencies. Small group sessions can also help students develop the ability to self-assess. This study was carried out to assess the effect of small group sessions on item analysis and students' self-assessment. A total of 21 female and 29 male first-year medical students participated in a small group session on topics covered by didactic lectures two weeks earlier. It was preceded and followed by two multiple choice question (MCQ) tests, in which students were asked to self-assess their likely score. The MCQs used had been item analyzed in a previous group and were chosen to have matching difficulty and discriminatory indices for the pre- and post-tests. The small group session improved the marks of both genders equally, but female performance was better. The session made the items easier, increasing the difficulty index significantly, but there was no significant alteration in the discriminatory index. There was overestimation in the self-assessment of both genders, but male overestimation was greater. The session improved the self-assessment of students in terms of expected marks and expectation of passing. The small group session improved students' ability to self-assess their knowledge and increased the difficulty index of the items, reflecting students' better performance.

  12. Item level diagnostics and model - data fit in item response theory ...

    Item response theory (IRT) is a framework for modeling and analyzing item response data. Item-level modeling gives IRT advantages over classical test theory. The fit of an item score pattern to item response theory (IRT) models is a necessary condition that must be assessed for further use of the items and the models that best fit ...

  13. Item-focussed Trees for the Identification of Items in Differential Item Functioning.

    Tutz, Gerhard; Berger, Moritz

    2016-09-01

    A novel method for the identification of differential item functioning (DIF) by means of recursive partitioning techniques is proposed. We assume an extension of the Rasch model that allows for DIF being induced by an arbitrary number of covariates for each item. Recursive partitioning on the item level results in one tree for each item and leads to simultaneous selection of items and variables that induce DIF. For each item, it is possible to detect groups of subjects with different item difficulties, defined by combinations of characteristics that are not pre-specified. The way a DIF item is determined by covariates is visualized in a small tree and therefore easily accessible. An algorithm is proposed that is based on permutation tests. Various simulation studies, including the comparison with traditional approaches to identify items with DIF, show the applicability and the competitive performance of the method. Two applications illustrate the usefulness and the advantages of the new method.

  14. Test-retest reliability at the item level and total score level of the Norwegian version of the Spinal Cord Injury Falls Concern Scale (SCI-FCS).

    Roaldsen, Kirsti Skavberg; Måøy, Åsa Blad; Jørgensen, Vivien; Stanghelle, Johan Kvalvik

    2016-05-01

    Translation of the Spinal Cord Injury Falls Concern Scale (SCI-FCS), and investigation of test-retest reliability at the item level and total-score level. Translation, adaptation and test-retest study. A specialized rehabilitation setting in Norway. Fifty-four wheelchair users with a spinal cord injury. The median age of the cohort was 49 years, and the median number of years after injury was 13. Interventions/measurements: The SCI-FCS was translated and back-translated according to guidelines. Individuals answered the SCI-FCS twice over the course of one week. We investigated item-level test-retest reliability using Svensson's rank-based statistical method for disagreement analysis of paired ordinal data. For relative reliability, we analyzed the total-score-level test-retest reliability with intraclass correlation coefficients (ICC2.1), the standard error of measurement (SEM), and the smallest detectable change (SDC) for absolute reliability/measurement-error assessment, and Cronbach's alpha for internal consistency. All items showed satisfactory percentage agreement (≥69%) between test and retest. There were small but non-negligible systematic disagreements among three items, with an 11-13% higher chance of a lower second score. There was no disagreement due to random variance. The test-retest agreement (ICC2.1) was excellent (0.83). The SEM was 2.6 (12%), and the SDC was 7.1 (32%). Cronbach's alpha was high (0.88). The Norwegian SCI-FCS is highly reliable for wheelchair users with chronic spinal cord injuries.
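The ICC(2,1), SEM, and SDC reported here can be sketched from a subjects-by-occasions score matrix. This follows the Shrout-Fleiss two-way random-effects form and one common SEM/SDC convention; the authors' exact formulas may differ, and the function names are mine:

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measures, from an n-subjects x k-occasions matrix."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between occasions
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def sem_sdc(score_sd, icc):
    """SEM = SD * sqrt(1 - ICC); SDC (95%) = 1.96 * sqrt(2) * SEM."""
    sem = score_sd * np.sqrt(1.0 - icc)
    return sem, 1.96 * np.sqrt(2.0) * sem
```

The SDC is the smallest total-score change that exceeds measurement error at the individual level, which is why the abstract reports it alongside the ICC.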

  15. Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating. CSE Report 688

    Michaelides, Michalis P.

    2006-01-01

    Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations.…

  16. Software Note: Using BILOG for Fixed-Anchor Item Calibration

    DeMars, Christine E.; Jurich, Daniel P.

    2012-01-01

    The nonequivalent groups anchor test (NEAT) design is often used to scale item parameters from two different test forms. A subset of items, called the anchor items or common items, are administered as part of both test forms. These items are used to adjust the item calibrations for any differences in the ability distributions of the groups taking…

  17. The optimal sequence and selection of screening test items to predict fall risk in older disabled women: the Women's Health and Aging Study.

    Lamb, Sarah E; McCabe, Chris; Becker, Clemens; Fried, Linda P; Guralnik, Jack M

    2008-10-01

    Falls are a major cause of disability, dependence, and death in older people. Brief screening algorithms may be helpful in identifying risk and leading to more detailed assessment. Our aim was to determine the most effective sequence of falls screening test items from a wide selection of recommended items including self-report and performance tests, and to compare performance with other published guidelines. Data were from a prospective, age-stratified, cohort study. Participants were 1002 community-dwelling women aged 65 years old or older, experiencing at least some mild disability. Assessments of fall risk factors were conducted in participants' homes. Fall outcomes were collected at 6 monthly intervals. Algorithms were built for prediction of any fall over a 12-month period using tree classification with cross-set validation. Algorithms using performance tests provided the best prediction of fall events, and achieved moderate to strong performance when compared to commonly accepted benchmarks. The items selected by the best performing algorithm were the number of falls in the last year and, in selected subpopulations, frequency of difficulty balancing while walking, a 4 m walking speed test, body mass index, and a test of knee extensor strength. The algorithm performed better than that from the American Geriatric Society/British Geriatric Society/American Academy of Orthopaedic Surgeons and other guidance, although these findings should be treated with caution. Suggestions are made on the type, number, and sequence of tests that could be used to maximize estimation of the probability of falling in older disabled women.
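The sequential-screening idea described above (falls history first, then balance and gait items in selected subgroups) can be sketched as a small decision tree. Everything below is a hypothetical illustration: the thresholds, risk labels, and function name are not the study's fitted tree, only the shape of such an algorithm:

```python
def fall_risk_screen(falls_last_year, balance_difficulty_freq, walk_speed_ms):
    """Illustrative sequential fall-risk screen (hypothetical thresholds).

    falls_last_year        : count of falls in the previous 12 months
    balance_difficulty_freq: 0-4 rating of difficulty balancing while walking
    walk_speed_ms          : usual walking speed in meters per second
    """
    if falls_last_year >= 2:                 # strong history: screen positive outright
        return "high"
    if falls_last_year == 1:                 # one fall: check balance in this subgroup
        return "high" if balance_difficulty_freq >= 3 else "moderate"
    # no falls: fall back on a performance item
    return "moderate" if walk_speed_ms < 0.8 else "low"
```

Classification-tree fitting chooses which item to branch on at each node and where to cut it, which is how the study arrived at its particular sequence of self-report and performance tests.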

  18. Is It Working? Distractor Analysis Results from the Test Of Astronomy STandards (TOAST) Assessment Instrument

    Slater, Stephanie

    2009-05-01

    The Test Of Astronomy STandards (TOAST) assessment instrument is a multiple-choice survey tightly aligned to the consensus learning goals stated by the American Astronomical Society - Chair's Conference on ASTRO 101, the American Association for the Advancement of Science's Project 2061 Benchmarks, and the National Research Council's National Science Education Standards. Researchers from the Cognition in Astronomy, Physics and Earth sciences Research (CAPER) Team at the University of Wyoming's Science and Math Teaching Center (UWYO SMTC) have been conducting a question-by-question distractor analysis procedure to determine the sensitivity and effectiveness of each item. In brief, the frequency of each possible answer choice, known as a foil or distractor on a multiple-choice test, is determined and compared to the existing literature on the teaching and learning of astronomy. In addition to having acceptable statistical difficulty and discrimination values, a well-functioning assessment item will show students selecting distractors in the relative proportions in which we expect them to respond based on known misconceptions and reasoning difficulties. Our distractor analysis suggests that all items are functioning as expected. These results add weight to the validity of the TOAST assessment instrument, which is designed to help instructors and researchers measure the impact of course-length duration instructional strategies for undergraduate science survey courses with learning goals tightly aligned to the consensus goals of the astronomy education community.
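The tallying step of a distractor analysis is simple to implement. A minimal sketch (function name and response string are illustrative, not the TOAST data) that converts a set of responses to one item into the per-option selection proportions that get compared against known misconceptions:

```python
from collections import Counter

def distractor_frequencies(responses, options="ABCD"):
    """Proportion of examinees selecting each answer choice for one item.

    responses : iterable of selected option letters, one per examinee
    options   : the full set of answer choices on the item
    """
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts.get(opt, 0) / n for opt in options}
```

An option that nobody selects is a non-functioning distractor; one chosen far more often than the misconception literature predicts flags the item for review.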

  19. The construct equivalence and item bias of the pib/SpEEx conceptualisation-ability test for members of five language groups in South Africa

    Pieter Schaap

    2008-11-01

    This study's objective was to determine whether the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test displays construct equivalence and item bias for members of five selected language groups in South Africa. The sample consisted of a non-probability convenience sample (N = 6 261) of members of five language groups (speakers of Afrikaans, English, North Sotho, Setswana and isiZulu) working in the medical and beverage industries or studying at higher-educational institutions. Exploratory factor analysis with target rotations confirmed the PIB/SpEEx 100's construct equivalence for the respondents from these five language groups. No evidence of either uniform or non-uniform item bias of practical significance was found for the sample.

  20. Item analysis and evaluation in the examinations in the faculty of ...

    2014-11-05

    Key words: classical test theory, item analysis, item difficulty, item discrimination, item response theory, reliability.

  1. Developing Pairwise Preference-Based Personality Test and Experimental Investigation of Its Resistance to Faking Effect by Item Response Model

    Usami, Satoshi; Sakamoto, Asami; Naito, Jun; Abe, Yu

    2016-01-01

    Recent years have shown increased awareness of the importance of personality tests in educational, clinical, and occupational settings, and developing faking-resistant personality tests is a very pragmatic issue for achieving more precise measurement. Inspired by Stark (2002) and Stark, Chernyshenko, and Drasgow (2005), we develop a pairwise…

  2. The Impact Analysis of Psychological Reliability of Population Pilot Study For Selection of Particular Reliable Multi-Choice Item Test in Foreign Language Research Work

    Seyed Hossein Fazeli

    2010-10-01

    This study examines psychological reliability, its importance and application, and investigates the impact of the psychological reliability of a pilot-study population on the selection of a reliable multiple-choice item test in foreign-language research. The population for subject recruitment comprised all undergraduate students (both male and female) in their second semester at a large university in Iran who study English as a compulsory paper. In Iran, English is taught as a foreign language.

  3. Equal Opportunity in the Classroom: Test Construction in a Diversity-Sensitive Environment.

    Ghorpade, Jai; Lackritz, James R.

    1998-01-01

    Two multiple-choice tests and one essay test were taken by 231 students (50/50 male/female, 192 White, 39 East Asian, Black, Mexican American, or Middle Eastern). Multiple-choice tests showed no significant differences in equal employment opportunity terms; women and men scored about the same on essays, but minority students had significantly…

  4. Defining surgical criteria for empty nose syndrome: Validation of the office-based cotton test and clinical interpretability of the validated Empty Nose Syndrome 6-Item Questionnaire.

    Thamboo, Andrew; Velasquez, Nathalia; Habib, Al-Rahim R; Zarabanda, David; Paknezhad, Hassan; Nayak, Jayakar V

    2017-08-01

    The validated Empty Nose Syndrome 6-Item Questionnaire (ENS6Q) identifies empty nose syndrome (ENS) patients. The unvalidated cotton test assesses improvement in ENS-related symptoms. By first validating the cotton test using the ENS6Q, we define the minimal clinically important difference (MCID) score for the ENS6Q. Individual case-control study. Fifteen patients diagnosed with ENS and 18 controls with non-ENS sinonasal conditions underwent office cotton placement. Both groups completed ENS6Q testing in three conditions (precotton, cotton in situ, and postcotton) to measure the reproducibility of ENS6Q scoring. Participants also completed a five-item transition scale ranging from "much better" to "much worse" to rate subjective changes in nasal breathing with and without cotton placement. Mean changes for each transition point, and the ENS6Q MCID, were then calculated. In the precotton condition, significant differences (P < .001) in all ENS6Q questions between ENS patients and controls were noted. With cotton in situ, nearly all prior ENS6Q differences normalized between ENS and control patients. For ENS patients, the changes in the mean differences between the precotton and cotton in situ conditions compared to the postcotton versus cotton in situ conditions were insignificant among individuals. Including all 33 participants, the mean change in the ENS6Q between the parameters "a little better" and "about the same" was 4.25 (standard deviation [SD] = 5.79) and -2.00 (SD = 3.70), giving an MCID of 6.25. Cotton testing is a validated office test to assess for ENS patients. Cotton testing also helped to determine the MCID of the ENS6Q, which is a 7-point change from the baseline ENS6Q score. 3b. Laryngoscope, 127:1746-1752, 2017. © 2017 The American Laryngological, Rhinological and Otological Society, Inc.
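The MCID arithmetic implied by the abstract (4.25 minus -2.00 giving 6.25, rounded to a 7-point threshold) is an anchor-based calculation. A minimal sketch, with a function name of my own, of that approach:

```python
import numpy as np

def anchor_based_mcid(changes_little_better, changes_about_same):
    """Anchor-based MCID: mean score change in the group reporting
    'a little better' on the transition anchor, minus the mean change
    in the group reporting 'about the same'."""
    return np.mean(changes_little_better) - np.mean(changes_about_same)
```

Subtracting the "about the same" group's mean change removes background drift in scores that occurs even without a perceived change, leaving the score shift attributable to a minimally noticeable improvement.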

  5. Making System Dynamics Cool IV : Teaching & Testing with Cases & Quizzes

    Pruyt, E.

    2012-01-01

    This follow-up paper presents cases and multiple-choice questions for teaching and testing System Dynamics modeling. These cases and multiple-choice questions were developed and used between January 2012 and April 2012 in a large System Dynamics course (250+ 2nd year BSc and 40+ MSc students per year).

  6. North Star Ambulatory Assessment, 6-minute walk test and timed items in ambulant boys with Duchenne muscular dystrophy.

    Mazzone, Elena; Martinelli, Diego; Berardinelli, Angela; Messina, Sonia; D'Amico, Adele; Vasco, Gessica; Main, Marion; Doglio, Luca; Politano, Luisa; Cavallaro, Filippo; Frosini, Silvia; Bello, Luca; Carlesi, Adelina; Bonetti, Anna Maria; Zucchini, Elisabetta; De Sanctis, Roberto; Scutifero, Marianna; Bianco, Flaviana; Rossi, Francesca; Motta, Maria Chiara; Sacco, Annalisa; Donati, Maria Alice; Mongini, Tiziana; Pini, Antonella; Battini, Roberta; Pegoraro, Elena; Pane, Marika; Pasquini, Elisabetta; Bruno, Claudio; Vita, Giuseppe; de Waure, Chiara; Bertini, Enrico; Mercuri, Eugenio

    2010-11-01

    The North Star Ambulatory Assessment is a functional scale specifically designed for ambulant boys affected by Duchenne muscular dystrophy (DMD). Recently the 6-minute walk test has also been used as an outcome measure in trials in DMD. The aim of our study was to assess a large cohort of ambulant boys affected by DMD using both the North Star Assessment and the 6-minute walk test. More specifically, we wished to establish the spectrum of findings for each measure and their correlation. This is a prospective multicenter study involving 10 centers. The cohort included 112 ambulant DMD boys of age ranging between 4.10 and 17 years (mean 8.18 ± 2.3 SD). Ninety-one of the 112 were on steroids: 37/91 on an intermittent and 54/91 on a daily regimen. The scores on the North Star assessment ranged from 6/34 to 34/34. The distance on the 6-minute walk test ranged from 127 to 560.6 m. The time to walk 10 m was between 3 and 15 s. The time to rise from the floor ranged from 1 to 27.5 s. Some patients were unable to rise from the floor. As expected, the results changed with age and were overall better in children treated with daily steroids. The North Star assessment had a moderate to good correlation with the 6-minute walk test and with timed rising from the floor, but less with the 10 m timed walk/run test. The 6-minute walk test, in contrast, had a better correlation with the 10 m timed walk/run test than with timed rising from the floor. These findings suggest that a combination of these outcome measures can be effectively used in ambulant DMD boys and will provide information on different aspects of motor function that may not be captured using a single measure. Copyright © 2010. Published by Elsevier B.V.

  7. Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education

    Lawton Gemma

    2005-03-01

    Background: As assessment has been shown to direct learning, it is critical that the examinations developed to test clinical competence in medical undergraduates are valid and reliable. The use of extended matching questions (EMQs) has been advocated to overcome some of the criticisms of using multiple-choice questions to test factual and applied knowledge. Methods: We analysed the results from the extended matching questions examination taken by 4th year undergraduate medical students in the academic year 2001 to 2002. Rasch analysis was used to examine whether the set of questions used in the examination mapped on to a unidimensional scale, the degree of difficulty of questions within and between the various medical and surgical specialties, and the pattern of responses within individual questions to assess the impact of the distractor options. Results: Analysis of a subset of items and of the full examination demonstrated internal construct validity and the absence of bias on the majority of questions. Three main patterns of response selection were identified. Conclusion: Modern psychometric methods based upon the work of Rasch provide a useful approach to the calibration and analysis of EMQ undergraduate medical assessments. The approach allows for a formal test of the unidimensionality of the questions and thus the validity of the summed score. Given the metric calibration which follows fit to the model, it also allows for the establishment of item banks to facilitate continuity and equity in exam standards.
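    As a sketch (ours, not from the record above), the Rasch model that underlies this kind of analysis gives the probability of a correct response as a logistic function of the gap between person ability and item difficulty:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response for a person of
    ability theta on an item of difficulty b (both on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the success probability is exactly 0.5;
# unidimensionality means a single theta per person drives every item.
print(rasch_prob(0.0, 0.0))                          # 0.5
print(rasch_prob(1.0, 0.0) > rasch_prob(0.0, 0.0))   # True
```

    Fit statistics and the summed-score validity claim both follow from this single-parameter-per-person form.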

  8. Empirical versus Random Item Selection in the Design of Intelligence Test Short Forms--The WISC-R Example.

    Goh, David S.

    1979-01-01

    The advantages of using psychometric theory to design short forms of intelligence tests are demonstrated by comparing such usage to a systematic random procedure that has previously been used. The Wechsler Intelligence Scale for Children Revised (WISC-R) Short Form is presented as an example. (JKS)

  9. Test-retest reliability of Antonovsky's 13-item sense of coherence scale in patients with hand-related disorders

    Hansen, Alice Ørts; Kristensen, Hanne Kaae; Cederlund, Ragnhild

    2017-01-01

    ...to be a powerful tool to measure the ICF component personal factors, which could have an impact on patients' rehabilitation outcomes. Implications for rehabilitation: Antonovsky's SOC-13 scale showed test-retest reliability for patients with hand-related disorders. The SOC-13 scale could be a suitable tool to help measure personal factors.

  10. Cross-cultural development of an item list for computer-adaptive testing of fatigue in oncological patients

    Giesinger, Johannes M.; Petersen, Morten Aa.; Grønvold, Mogens

    2011-01-01

    Within an ongoing project of the EORTC Quality of Life Group, we are developing computerized adaptive test (CAT) measures for the QLQ-C30 scales. These new CAT measures are conceptualised to reflect the same constructs as the QLQ-C30 scales. Accordingly, the Fatigue-CAT is intended to capture physical and general fatigue.

  11. On-Demand Testing and Maintaining Standards for General Qualifications in the UK Using Item Response Theory: Possibilities and Challenges

    He, Qingping

    2012-01-01

    Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high stakes examinations like the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues…

  12. Using Standards and Empirical Evidence to Develop Academic English Proficiency Test Items in Reading. CSE Technical Report 664

    Bailey, Alison L.; Stevens, Robin; Butler, Frances A.; Huang, Becky; Miyoshi, Judy N.

    2005-01-01

    The work we report focuses on utilizing linguistic profiles of mathematics, science and social studies textbook selections for the creation of reading test specifications. Once we determined that a text and associated tasks fit within the parameters established in Butler et al. (2004), they underwent both internal and external review by language…

  13. Evaluation of the Relative Validity and Test-Retest Reliability of a 15-Item Beverage Intake Questionnaire in Children and Adolescents.

    Hill, Catelyn E; MacDougall, Carly R; Riebl, Shaun K; Savla, Jyoti; Hedrick, Valisa E; Davy, Brenda M

    2017-11-01

    Added sugar intake, in the form of sugar-sweetened beverages (SSBs), may contribute to weight gain and obesity development in children and adolescents. A valid and reliable brief beverage intake assessment tool for children and adolescents could facilitate research in this area. The purpose of this investigation was to evaluate the relative validity and test-retest reliability of a 15-item beverage intake questionnaire (BEVQ) for assessing usual beverage intake in children and adolescents. This cross-sectional investigation included four study visits within a 2- to 3-week time period. Participants (333 enrolled; 98% completion rate) were children aged 6 to 11 years and adolescents aged 12 to 18 years recruited from the New River Valley, VA, region from January 2014 to September 2015. Study visits included assessment of height/weight, health history, and four 24-hour dietary recalls (24HRs). The BEVQ was completed at two visits (BEVQ 1, BEVQ 2). To evaluate relative validity, BEVQ 1 was compared with habitual beverage intake determined by the averaged 24HRs. To evaluate test-retest reliability, BEVQ 1 was compared with BEVQ 2. Analyses included descriptive statistics, independent sample t tests, chi-square tests, one-way analysis of variance, paired sample t tests, and correlational analyses. In the full sample, self-reported water and total SSB intake were not different between BEVQ 1 and 24HR (mean differences 0±1 fl oz and 0±1 fl oz, respectively; both P values >0.05). Reported intake across all beverage categories was significantly correlated between BEVQ 1 and BEVQ 2. Intake of milk beverages was not different (all P values >0.05) between BEVQ 1 and 24HR (mean differences: whole milk=3±4 kcal, reduced-fat milk=9±5 kcal, and fat-free milk=7±6 kcal, which is 7±15 total beverage kilocalories). In adolescents (n=200), water and SSB kilocalories were not different (both P values >0.05) between BEVQ 1 and 24HR (mean differences: -1±1 fl oz and 12±9 kcal, respectively). A 15
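    The paired-sample t tests used above to compare BEVQ 1 against the averaged recalls can be sketched as follows (a minimal illustration with made-up numbers, not the study's data):

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic: mean of the pairwise differences
    divided by the standard error of those differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical intakes (fl oz) reported on the questionnaire vs. the recalls:
t = paired_t([1, 3, 5], [0, 3, 4])
print(round(t, 3))  # 2.0
```

    A mean difference near zero (as for water and total SSB intake above) yields a small t and a non-significant P value, which is the desired outcome for a validity comparison.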

  14. Analyzing Test-Taking Behavior: Decision Theory Meets Psychometric Theory.

    Budescu, David V; Bo, Yuanchao

    2015-12-01

    We investigate the implications of penalizing incorrect answers to multiple-choice tests, from the perspective of both test-takers and test-makers. To do so, we use a model that combines a well-known item response theory model with prospect theory (Kahneman and Tversky, Prospect theory: An analysis of decision under risk, Econometrica 47:263-91, 1979). Our results reveal that when test-takers are fully informed of the scoring rule, the use of any penalty has detrimental effects for both test-takers (they are always penalized in excess, particularly those who are risk averse and loss averse) and test-makers (the bias of the estimated scores, as well as the variance and skewness of their distribution, increase as a function of the severity of the penalty).
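    A worked example of the scoring rules at issue (our illustration, not the authors' prospect-theory model): under classical formula scoring with k options, a penalty of 1/(k-1) per wrong answer makes blind guessing worth zero in expectation, and any harsher penalty makes guessing a losing bet even for a risk-neutral examinee.

```python
def expected_guess_score(k: int, penalty: float) -> float:
    """Expected score from blind guessing on a k-option item:
    +1 for a correct answer, -penalty for an incorrect one."""
    p_correct = 1.0 / k
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# Classical correction for guessing on a 4-option item: penalty 1/3.
print(round(expected_guess_score(4, 1 / 3), 10))  # 0.0
print(expected_guess_score(4, 1.0))               # -0.5
```

    Risk-averse and loss-averse test-takers skip items even when this expectation is non-negative, which is why the paper finds they are "penalized in excess".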

  15. Using Differential Item Functioning Procedures to Explore Sources of Item Difficulty and Group Performance Characteristics.

    Scheuneman, Janice Dowd; Gerritz, Kalle

    1990-01-01

    Differential item functioning (DIF) methodology for revealing sources of item difficulty and performance characteristics of different groups was explored. A total of 150 Scholastic Aptitude Test items and 132 Graduate Record Examination general test items were analyzed. DIF was evaluated for males and females and Blacks and Whites. (SLD)

  16. Medical Students' vs. Family Physicians' Assessment of Practical and Logical Values of Pathophysiology Multiple-Choice Questions

    Secic, Damir; Husremovic, Dzenana; Kapur, Eldan; Jatic, Zaim; Hadziahmetovic, Nina; Vojnikovic, Benjamin; Fajkic, Almir; Meholjic, Amir; Bradic, Lejla; Hadzic, Amila

    2017-01-01

    Testing strategies can have either a very positive or a very negative effect on the learning process. The aim of this study was to examine the degree of consistency between students and family medicine doctors in evaluating the practicality and logic of questions from a medical school pathophysiology test. The study engaged 77 family medicine doctors…

  17. Validation and structural analysis of the kinematics concept test

    A. Lichtenberger

    2017-04-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students’ conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part of this article we describe the development and the validation process of the KCT. We applied the KCT to 338 Swiss high school students who attended traditional teaching in kinematics. We analyzed the response data to provide the psychometric properties of the test. In the second part we present the results of a structural analysis of the test. An exploratory factor analysis of 664 student answers finally uncovered the seven kinematics concepts as factors. However, the analysis revealed a hierarchical structure of concepts. At the higher level, mathematical concepts group together, and then split up into physics concepts at the lower level. Furthermore, students who seem to understand a concept in one representation have difficulties transferring the concept to similar problems in another representation. Both results have implications for teaching kinematics. First, teaching mathematical concepts beforehand might be beneficial for learning kinematics. Second, instructions have to be designed to teach students the change between different representations.

  18. Validation and structural analysis of the kinematics concept test

    Lichtenberger, A.; Wagner, C.; Hofer, S. I.; Stern, E.; Vaterlaus, A.

    2017-06-01

    The kinematics concept test (KCT) is a multiple-choice test designed to evaluate students' conceptual understanding of kinematics at the high school level. The test comprises 49 multiple-choice items about velocity and acceleration, which are based on seven kinematic concepts and which make use of three different representations. In the first part of this article we describe the development and the validation process of the KCT. We applied the KCT to 338 Swiss high school students who attended traditional teaching in kinematics. We analyzed the response data to provide the psychometric properties of the test. In the second part we present the results of a structural analysis of the test. An exploratory factor analysis of 664 student answers finally uncovered the seven kinematics concepts as factors. However, the analysis revealed a hierarchical structure of concepts. At the higher level, mathematical concepts group together, and then split up into physics concepts at the lower level. Furthermore, students who seem to understand a concept in one representation have difficulties transferring the concept to similar problems in another representation. Both results have implications for teaching kinematics. First, teaching mathematical concepts beforehand might be beneficial for learning kinematics. Second, instructions have to be designed to teach students the change between different representations.

  19. Assessment of free and cued recall in Alzheimer's disease and vascular and frontotemporal dementia with 24-item Grober and Buschke test.

    Cerciello, Milena; Isella, Valeria; Proserpi, Alice; Papagno, Costanza

    2017-01-01

    Alzheimer's disease (AD), vascular dementia (VaD) and frontotemporal dementia (FTD) are the most common forms of dementia. It is well known that memory deficits in AD are different from those in VaD and FTD, especially with respect to cued recall. The aim of this clinical study was to compare the memory performance of 15 AD, 10 VaD and 9 FTD patients and 20 normal controls by means of a 24-item Grober-Buschke test [8]. The patient groups were comparable in terms of severity of dementia. We considered free and total recall (free plus cued) in both immediate and delayed recall and computed an Index of Sensitivity to Cueing (ISC) [8] for immediate and delayed trials. We assessed whether cued recall predicted the subsequent free recall across our patient groups. We found that AD patients recalled fewer items from the beginning and were less sensitive to cueing, supporting the hypothesis that memory disorders in AD depend on encoding and storage deficits. In immediate recall, VaD and FTD patients showed a similar memory performance and a stronger sensitivity to cueing than AD patients, suggesting that memory disorders in these patients are due to a difficulty in spontaneously implementing efficient retrieval strategies. However, we found a lower ISC in the delayed recall compared to the immediate trials in VaD than in FTD, due to greater forgetting in VaD.

  20. Compreensão da leitura: análise do funcionamento diferencial dos itens de um Teste de Cloze / Reading comprehension: differential item functioning analysis of a Cloze Test

    Katya Luciane Oliveira

    2012-01-01

    This study aimed to investigate the fit of a Cloze test to the Rasch model and to evaluate differential item functioning (DIF) by gender. Participants were 573 students from the 5th to 8th grades of public state schools in the states of São Paulo and Minas Gerais. The Cloze test was administered collectively. The analysis of the instrument revealed good fit to the Rasch model, and the items were answered according to the expected pattern, likewise showing good fit. Regarding DIF, only three items differentiated by gender. Based on the data, the responses given by boys and girls were balanced.

  1. An Improved Measure of Reading Skill: The Cognitive Structure Test

    Sorrells, Robert

    1997-01-01

    This study compared the construct validity and the predictive validity of a new test, called the Cognitive Structure Test, to multiple-choice tests of reading skill, namely the Armed Forces Vocational...

  2. Validation of science virtual test to assess 8th grade students' critical thinking on living things and environmental sustainability theme

    Rusyati, Lilit; Firman, Harry

    2017-05-01

    This research was motivated by the importance of multiple-choice questions that indicate the elements and sub-elements of critical thinking, and by the implementation of computer-based testing. The method used in this research was descriptive research for profiling the validation of a science virtual test to measure students' critical thinking in junior high school. Participants were 8th grade junior high school students (14 years old), with science teachers and experts as validators. The instruments used to capture the necessary data were an expert judgment sheet, a legibility test sheet, and a science virtual test package in multiple-choice form with four possible answers. There are four steps to validate the science virtual test to measure students' critical thinking on the theme of "Living Things and Environmental Sustainability" in 7th grade junior high school. These steps are analysis of core competence and basic competence based on the 2013 curriculum, expert judgment, legibility test, and trial test (limited and large-scale). Based on the trial test, items were classified as accepted, accepted but needing revision, or rejected. The reliability of the test is α = 0.747, categorized as 'high', meaning the test instrument is reliable and highly consistent. The validity of Rxy = 0.63 means that the validity of the instrument was categorized as 'high' according to the interpretation of the Rxy (correlation) value.
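    The reported reliability (α = 0.747) is Cronbach's alpha; a minimal from-scratch sketch of the statistic (our illustration, not the study's code):

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of item score vectors (one list per
    item, each the same length: one score per examinee)."""
    k = len(item_scores)
    n = len(item_scores[0])

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(sample_var(it) for it in item_scores)
                            / sample_var(totals))

# Two perfectly consistent dichotomous items give alpha = 1.0.
print(cronbach_alpha([[0, 1, 0, 1], [0, 1, 0, 1]]))  # 1.0
```

    Alpha rises as items covary; values around 0.7-0.8, like the 0.747 reported here, are conventionally read as acceptable-to-high internal consistency.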

  3. Status of the Review of Electric Items in Spain Related to the Post-Fukushima Stress Test Programme

    Martinez Moreno, Manuel R.; Perez Rodriguez, Alfonso

    2015-01-01

    The Spanish authorities have established a comprehensive compilation of the actions currently related to the post-Fukushima programme. It has been initiated at both national and international level and is developed in an Action Plan. This Plan is aligned to the 6 topics identified in the August 2012 CNS-EOM report, and organized in four parts. One of these parts relates to the loss of electrical power, with the clear objective of implementing new features to increase robustness. This programme has been reinforced, and the tasks on electric issues have increased as a consequence of this Plan. The normal tasks of the Electric Systems and I and C Branch will be presented along with the Fukushima-related issues. The Consejo de Seguridad Nuclear -CSN- (Nuclear Safety Council) maintains a permanent programme of control and surveillance of nuclear safety issues in Spanish nuclear power plants. The Electric Systems and I and C Branch of the CSN has different tasks related to electric issues: - Inspection, control and evaluation of different topics in normal and accident operation. - Surveillance testing inspections. - Design modification inspections and evaluation. - Reactive inspections. - Other activities: participation in the Escered project (predating the Fukushima accident), with the objective of analyzing exterior grid stability and checking that electric faults in the vicinity of the NPPs did not cause the simultaneous loss of the offsite supplies, or fault effects interacting with related inner systems; other tasks related to the management of ageing and long-term operation. Now, as a consequence, its tasks have been extended with some new Fukushima-related topics: - Analysis of beyond-design-basis accidents related to the U.S. SBO Rule (Reg. Guide 1.155), which is part of the design bases for the Spanish plants designed by Westinghouse/General Electric; switchyard/grid events and extreme weather events are considered, with 10 minutes to connect an alternate source (if provided; if not, use of d

  4. 不同认知成分在图形推理测验项目难度预测中的作用 / The Role of Different Cognitive Components in the Prediction of the Figural Reasoning Test's Item Difficulty

    李中权; 王力; 张厚粲; 周仁来

    2011-01-01

    Figural reasoning tests (as represented by Raven's tests) are widely applied as effective measures of fluid intelligence in recruitment and personnel selection. However, several studies have revealed that those tests are no longer appropriate due to high item exposure rates. Computerized automatic item generation (AIG) has gradually been recognized as a promising technique for handling item exposure. Understanding sources of item variation constitutes the initial stage of computerized AIG, that is, searching for the underlying processing components and the stimuli that significantly influence those components. Some studies have explored sources of item variation, but so far there are no consistent results. This study investigated the relation between item difficulties and stimulus factors (e.g., familiarity of figures, abstraction of attributes, perceptual organization, and memory load) and determined the relative importance of those factors in predicting item difficulties. Eight sets of figural reasoning tests (each set containing 14 items imitating items from Raven's Advanced Progressive Matrices, APM) were constructed, manipulating the familiarity of figures, the degree of abstraction of attributes, the perceptual organization, and the types and number of rules. Using an anchor-test design, these tests were administered via the internet to 6323 participants with 10 items drawn from the APM as anchor items; thus, each participant completed 14 items from one set and 10 anchor items within half an hour. In order to prevent participants from using a response-elimination strategy, we presented the item stem first, then the alternatives in turn, and asked participants to determine which alternative was the best. DIMTEST analyses were conducted on the participants' responses on each of the eight tests. Results showed that items measure a single dimension on each test. A likelihood ratio test indicated that the data fit the two-parameter logistic model (2PL) best. Items were
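    The two-parameter logistic model mentioned above adds an item discrimination parameter to the one-parameter (Rasch) form; a minimal sketch (ours, not the study's implementation):

```python
import math

def two_pl(theta: float, a: float, b: float) -> float:
    """2PL IRT model: a = item discrimination, b = item difficulty,
    theta = person ability (all on the logit scale)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is 0.5 regardless of discrimination;
# a larger `a` steepens the curve around that point.
print(two_pl(0.0, 1.5, 0.0))                            # 0.5
print(two_pl(1.0, 2.0, 0.0) > two_pl(1.0, 0.5, 0.0))    # True
```

    Predicting the fitted b parameters from stimulus factors (familiarity, abstraction, rule count, and so on) is what "predicting item difficulty" means in this study.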

  5. Towards an authoring system for item construction

    Rikers, Jos H.A.N.

    1988-01-01

    The process of writing test items is analyzed, and a blueprint is presented for an authoring system for test item writing to reduce invalidity and to structure the process of item writing. The developmental methodology is introduced, and the first steps in the process are reported. A historical

  6. Using Cognitive Testing to Develop Items for Surveying Asian American Cancer Patients and Their Caregivers as a Pathway to Culturally Competent Care.

    Bolcic-Jankovic, Dragana; Lu, Fengxin; Colten, Mary Ellen; McCarthy, Ellen P

    2016-02-01

    We report the results from cognitive interviews with Asian American patients and their caregivers. We interviewed seven caregivers and six patients who were all bilingual Asian Americans. The main goal of the cognitive interviews was to test a survey instrument developed for a study about perspectives of Asian American patients with advanced cancer who are facing decisions around end-of-life care. We were particularly interested to see whether items commonly used in White and Black populations are culturally meaningful and equivalent in Asian populations, primarily those of Chinese and Vietnamese ethnicity. Our exploration shows that understanding respondents' language proficiency, degree of acculturation, and cultural context of receiving, processing, and communicating information about medical care can help design questions that are appropriate for Asian American patients and caregivers, and therefore can help researchers obtain quality data about the care Asian American cancer patients receive. © The Author(s) 2016.

  7. Furnace System Testing to Support Lower-Temperature Stabilization of High Chloride Plutonium Oxide Items at the Hanford Plutonium Finishing Plant

    Schmidt, Andrew J.; Gerber, Mark A.; Fischer, Christopher M.; Elmore, Monte R.

    2003-01-01

    High chloride content plutonium (HCP) oxides are impure plutonium oxide scrap which contain NaCl, KCl, MgCl2 and/or CaCl2 salts at potentially high concentrations and must be stabilized at 950 °C per the DOE Standard, DOE-STD-3013-2000. The chlorides pose challenges to stabilization because volatile chloride salts and decomposition products can corrode furnace heating elements and downstream ventilation components. Thermal stabilization of HCP items at 750 °C (without water washing) is being investigated as an alternative method for meeting the intent of DOE-STD-3013-2000. This report presents the results from a series of furnace tests conducted to develop material balance and system operability data to support the evaluation of lower-temperature thermal stabilization.

  8. Applying Item Response Theory Methods to Examine the Impact of Different Response Formats

    Hohensinn, Christine; Kubinger, Klaus D.

    2011-01-01

    In aptitude and achievement tests, different response formats are usually used. A fundamental distinction must be made between the class of multiple-choice formats and the constructed response formats. Previous studies have examined the impact of different response formats applying traditional statistical approaches, but these influences can also…

  9. Investigating Robustness of Item Response Theory Proficiency Estimators to Atypical Response Behaviors under Two-Stage Multistage Testing. ETS GRE® Board Research Report. ETS GRE®-16-03. ETS Research Report No. RR-16-22

    Kim, Sooyeon; Moses, Tim

    2016-01-01

    The purpose of this study is to evaluate the extent to which item response theory (IRT) proficiency estimation methods are robust to the presence of aberrant responses under the "GRE"® General Test multistage adaptive testing (MST) design. To that end, a wide range of atypical response behaviors affecting as much as 10% of the test items…

  10. Measuring outcomes in allergic rhinitis: psychometric characteristics of a Spanish version of the congestion quantifier seven-item test (CQ7

    Mullol Joaquim

    2011-03-01

    Background: No control tools for nasal congestion (NC) are currently available in Spanish. This study aimed to adapt and validate the Congestion Quantifier Seven Item Test (CQ7) for Spain. Methods: The CQ7 was adapted from English following international guidelines. The instrument was validated in an observational, prospective study in allergic rhinitis patients with NC (N = 166) and a control group without NC (N = 35). Participants completed the CQ7, the MOS sleep questionnaire, and a measure of psychological well-being (PGWBI). Clinical data included NC severity rating, acoustic rhinometry, and total symptom score (TSS). Internal consistency was assessed using Cronbach's alpha and test-retest reliability using the intraclass correlation coefficient (ICC). Construct validity was tested by examining correlations with other outcome measures and the ability to discriminate between groups classified by NC severity. Sensitivity and specificity were assessed using the area under the receiver operating curve (AUC), and responsiveness over time using effect sizes (ES). Results: Cronbach's alpha for the CQ7 was 0.92, and the ICC was 0.81, indicating good reliability. The CQ7 correlated most strongly with the TSS (r = 0.60). Conclusions: The Spanish version of the CQ7 is appropriate for detecting, measuring, and monitoring NC in allergic rhinitis patients.

  11. Teste de Raciocínio Auditivo Musical (RAu): estudo inicial por meio da Teoria de Resposta ao Item / Test de Raciocinio Auditivo Musical (RAu): estudio inicial a través de la Teoría de Respuesta al Ítem / Auditory Musical Reasoning Test (RAu): an initial study with Item Response Theory

    Fernando Pessotto

    2012-12-01

    This study sought evidence of validity based on internal structure and on criterion for an instrument assessing auditory processing of musical abilities (Auditory Processing Test with Musical Stimuli, RAu). A total of 162 people of both sexes were assessed (56.8% men), aged between 15 and 59 years (M=27.5; SD=9.01). Participants were divided into musicians (N=24), amateurs (N=62) and laypersons (N=76) according to their level of musical knowledge. Full Information Factor Analysis was used to verify the dimensionality of the instrument, and the properties of the items were examined by means of Item Response Theory (IRT). In addition, we sought to identify the capacity to discriminate between the groups of musicians and non-musicians. The data provide evidence that the items measure one main dimension (alpha=0.92), with high capacity to differentiate the groups of professional musicians, amateurs and laypersons, yielding a criterion validity coefficient of r=0.68. The results indicate positive evidence of reliability and validity for the RAu.

  12. A Review of Classical Methods of Item Analysis.

    French, Christine L.

    Item analysis is a very important consideration in the test development process. It is a statistical procedure to analyze test items that combines methods used to evaluate the important characteristics of test items, such as difficulty, discrimination, and distractibility of the items in a test. This paper reviews some of the classical methods for…
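    The classical indices this record names (difficulty, discrimination) can be sketched in a few lines. The following is an illustrative computation over a scored 0/1 response matrix, not code from the reviewed paper: difficulty is the proportion correct, and discrimination is approximated here by the point-biserial correlation between the item and the rest-score.

```python
def point_biserial(x, y):
    """Pearson correlation between a dichotomous x and a continuous y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_analysis(responses):
    """Classical item analysis for a person x item matrix of 0/1 scores.

    Returns (difficulties, discriminations): difficulty is the proportion
    of examinees answering correctly (the p-value); discrimination is the
    point-biserial correlation between the item score and the rest-score
    (total minus the item, to avoid part-whole inflation).
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    difficulties, discriminations = [], []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]
        difficulties.append(sum(item) / n_persons)
        discriminations.append(point_biserial(item, rest))
    return difficulties, discriminations
```

    For example, `item_analysis([[1,1,0],[1,0,0],[1,1,1],[0,0,0],[1,1,1]])` gives difficulties of 0.8, 0.6, and 0.4 for the three items.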

  13. Validation of the Ten-Item Internet Gaming Disorder Test (IGDT-10) and evaluation of the nine DSM-5 Internet Gaming Disorder criteria.

    Király, Orsolya; Sleczka, Pawel; Pontes, Halley M; Urbán, Róbert; Griffiths, Mark D; Demetrovics, Zsolt

    2017-01-01

    The inclusion of Internet Gaming Disorder (IGD) in the DSM-5 (Section 3) has given rise to much scholarly debate regarding the proposed criteria and their operationalization. The present study's aim was threefold: to (i) develop and validate a brief psychometric instrument (Ten-Item Internet Gaming Disorder Test; IGDT-10) to assess IGD using the definitions suggested in the DSM-5, (ii) contribute to the ongoing debate regarding the usefulness and validity of each of the nine IGD criteria (using Item Response Theory [IRT]), and (iii) investigate the cut-off threshold suggested in the DSM-5. An online sample of 4887 gamers (age range 14-64 years, mean age 22.2 years [SD=6.4], 92.5% male) was collected through Facebook and a gaming-related website with the cooperation of a popular Hungarian gaming magazine. A shopping voucher worth approximately 300 euros was raffled among participants to boost participation (i.e., a lottery incentive). Confirmatory factor analysis and a structural regression model were used to test the psychometric properties of the IGDT-10, and IRT analysis was conducted to test the measurement performance of the nine IGD criteria. Finally, latent class analysis along with sensitivity and specificity analysis was used to investigate the cut-off threshold proposed in the DSM-5. The analysis supported the IGDT-10's validity, reliability, and suitability for use in future research. Findings of the IRT analysis suggest that IGD is manifested through a different set of symptoms depending on the severity of the disorder. More specifically, "continuation", "preoccupation", "negative consequences" and "escape" were associated with lower severity of IGD, while "tolerance", "loss of control", "giving up other activities" and "deception" were associated with more severe levels. "Preoccupation" and "escape" provided very little information for the estimation of IGD severity. Finally, the threshold suggested in the DSM-5 was supported by our statistical analyses. The IGDT-10 is a valid and reliable instrument for assessing IGD.
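    The cut-off evaluation described in this record reduces to comparing a criterion-count threshold against a reference classification (here, latent-class membership). A minimal sketch, with invented data for illustration; the DSM-5 proposes endorsing five or more of the nine criteria:

```python
DSM5_THRESHOLD = 5  # endorse >= 5 of the 9 IGD criteria

def sensitivity_specificity(reference, criterion_counts, threshold=DSM5_THRESHOLD):
    """Evaluate a criterion-count cut-off against reference labels
    (e.g. latent-class membership). Returns (sensitivity, specificity)."""
    predicted = [c >= threshold for c in criterion_counts]
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    tn = sum(1 for r, p in zip(reference, predicted) if not r and not p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    return tp / (tp + fn), tn / (tn + fp)
```

    Sweeping the threshold from 1 to 9 and inspecting the resulting sensitivity/specificity pairs is how a proposed cut-off such as the DSM-5's is typically assessed.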

  14. Formulation of multiple choice questions as a revision exercise at the end of a teaching module in biochemistry.

    Bobby, Zachariah; Radhika, M R; Nandeesha, H; Balasubramanian, A; Prerna, Singh; Archana, Nimesh; Thippeswamy, D N

    2012-01-01

    Graduate medical students often get little opportunity to clarify their doubts and reinforce their concepts after lecture classes. The aim was to assess the effect of MCQ preparation by graduate medical students as a revision exercise on the topic "Mineral metabolism." At the end of the regular teaching module on the topic, graduate medical students were asked to prepare the stems of 15 MCQs based on the four options given for each; they were told that one of the options would be the answer to the MCQ and the remaining three would be the distracters. They were further guided in their task by a few key words provided for the stem of the expected MCQ. In the first phase of the exercise, the students attempted the MCQ preparation individually, without peer consultation. In the second phase, the students participated in small-group discussion to formulate the best MCQs of the group. The effects on low, medium, and high achievers were evaluated by pre- and post-tests with the same set of MCQs. Both the individual endeavor in Phase 1 and the small-group discussion in Phase 2 contributed significantly to the gain from the exercise. The gains from the individual task and from the small-group discussion were equal among the different categories of students; both phases of the exercise were equally beneficial for the low, medium, and high achievers. The high and medium achievers retained the gain even after 1 week, whereas the low achievers could not retain it completely. Formulation of MCQs is an effective and useful unconventional revision exercise in biochemistry for graduate medical students. Copyright © 2012 Wiley Periodicals, Inc.

  15. Semantic Similarity Measures for the Generation of Science Tests in Basque

    Aldabe, Itziar; Maritxalar, Montse

    2014-01-01

    The work we present in this paper aims to help teachers create multiple-choice science tests. We focus on a scientific vocabulary-learning scenario taking place in a Basque-language educational environment. In this particular scenario, we explore the option of automatically generating Multiple-Choice Questions (MCQ) by means of Natural Language…

  16. Assessing the test-retest repeatability of the Vietnamese version of the National Eye Institute 25-item Visual Function Questionnaire among bilateral cataract patients for a Vietnamese population.

    To, Kien Gia; Meuleners, Lynn; Chen, Huei-Yang; Lee, Andy; Do, Dung Van; Duong, Dat Van; Phi, Tien Duy; Tran, Hoang Huy; Nguyen, Nguyen Do

    2014-06-01

    To determine the test-retest repeatability of the National Eye Institute 25-item Visual Function Questionnaire (NEI VFQ-25) for use with older Vietnamese adults with bilateral cataract. The questionnaire was translated into Vietnamese and back-translated into English by two independent translators. Patients with bilateral cataract aged 50 and older completed the questionnaire on two separate occasions, one to two weeks apart. Internal consistency and test-retest repeatability were assessed using Cronbach's α and intraclass correlation coefficients, respectively. The average age of participants was 67 ± 8 years and most participants were female (73%). Internal consistency was acceptable, with α coefficients above 0.7 for all subscales, and intraclass correlation coefficients were 0.6 or greater for all subscales. The Vietnamese NEI VFQ-25 is reliable for use in studies assessing vision-related quality of life in older adults with bilateral cataract in Vietnam. We propose some modifications to the NEI-VFQ questions to reflect the activities of older people in Vietnam. © 2013 ACOTA.
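    As a reference for the internal-consistency statistic reported here, Cronbach's α can be computed directly from a person-by-item score matrix. A minimal sketch with invented data, not the study's analysis code:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a person x item score matrix (list of lists).

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score),
    where k is the number of items. Sample (n-1) variances are used.
    """
    k = len(scores[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

    Perfectly parallel items (every person scoring identically on both) give α = 1.0; the 0.7 threshold mentioned in the abstract is the conventional floor for acceptable internal consistency.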

  17. Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions.

    Haberman, Shelby J; Sinharay, Sandip; Chon, Kyong Hee

    2013-07-01

    Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
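    The residual described in this record compares observed proportions correct with model-implied probabilities. A toy sketch of that idea under a 2PL model with known item parameters; the grouping by ability is simplified and the parameters are invented, so this illustrates the flavor of the residual rather than the authors' exact estimator:

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def standardized_residuals(groups, a, b):
    """For each ability group given as (theta, n_correct, n_total),
    compare the observed proportion correct with the model-implied
    probability. Under a fitting model, each residual is approximately
    standard normal for large group sizes."""
    out = []
    for theta, n_correct, n_total in groups:
        p_model = icc_2pl(theta, a, b)
        p_obs = n_correct / n_total
        se = math.sqrt(p_model * (1.0 - p_model) / n_total)
        out.append((p_obs - p_model) / se)
    return out
```

    A group whose observed proportion matches the curve exactly yields a residual of zero; residuals far outside ±2 across many groups flag item misfit.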

  18. The establishment of an achievement test for determination of primary teachers’ knowledge level of earthquake

    Aydin, Süleyman, E-mail: yupul@hotmail.com; Haşiloğlu, M. Akif, E-mail: mehmet.hasiloglu@hotmail.com; Kunduraci, Ayşe, E-mail: ayse-kndrc@hotmail.com [Ağrı İbrahim Çeçen University, Faculty of Education, Science Education, Ağrı (Turkey)

    2016-04-18

    In this study the aim was to develop an academic achievement test to establish students’ knowledge about earthquakes and ways of protecting oneself from them. The method followed the steps that Webb (1994) set out for developing an academic achievement test for a unit. A multiple-choice test of 25 questions was prepared to measure pre-service teachers’ knowledge about earthquakes and earthquake protection. The multiple-choice test was submitted for review to six academics (one from the field of geography and five science educators) and two expert science teachers. The prepared test was administered to 93 pre-service teachers studying in the elementary education department in the 2014-2015 academic year. Following the validity and reliability analyses, the test was reduced to 20 items. From these applications, the Pearson product-moment split-half reliability coefficient was found to be 0.94; adjusted with the Spearman-Brown formula, the reliability coefficient was 0.97.
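    The step from the split-half coefficient of 0.94 to the reported 0.97 is the Spearman-Brown prophecy formula, which projects the correlation between two half-tests up to the reliability of the full-length test:

```python
def spearman_brown(r_half):
    """Step a split-half correlation up to full-test reliability:
    r_full = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)
```

    `spearman_brown(0.94)` gives about 0.969, which rounds to the 0.97 reported in the abstract.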

  20. Dutch translation and cross-cultural adaptation of the PROMIS® physical function item bank and cognitive pre-test in Dutch arthritis patients.

    Oude Voshaar, Martijn Ah; Ten Klooster, Peter M; Taal, Erik; Krishnan, Eswar; van de Laar, Mart Afj

    2012-03-05

    Patient-reported physical function is an established outcome domain in clinical studies in rheumatology. To overcome the limitations of the current generation of questionnaires, the Patient-Reported Outcomes Measurement Information System (PROMIS®) project in the USA has developed calibrated item banks for measuring several domains of health status in people with a wide range of chronic diseases. The aim of this study was to translate and cross-culturally adapt the PROMIS physical function item bank to the Dutch language and to pretest it in a sample of patients with arthritis. The items of the PROMIS physical function item bank were translated using rigorous forward-backward protocols and the translated version was subsequently cognitively pretested in a sample of Dutch patients with rheumatoid arthritis. Few issues were encountered in the forward-backward translation. Only 5 of the 124 items to be translated had to be rewritten because of culturally inappropriate content. Subsequent pretesting showed that overall, questions of the Dutch version were understood as they were intended, while only one item required rewriting. Results suggest that the translated version of the PROMIS physical function item bank is semantically and conceptually equivalent to the original. Future work will be directed at creating a Dutch-Flemish final version of the item bank to be used in research with Dutch speaking populations.