WorldWideScience

Sample records for plato-based test item

  1. Writing better test items.

    Science.gov (United States)

    Aucoin, Julia W

    2005-01-01

    Professional development specialists have had little opportunity to learn how to write test items to meet the expectations of today's graduate nurse. Schools of nursing have moved away from knowledge-level test items and have had to develop more application and analysis items to prepare graduates for the National Council Licensure Examination (NCLEX). This same type of question can be used effectively to support a competence assessment system and document critical thinking skills.

  2. Screening Test Items for Differential Item Functioning

    Science.gov (United States)

    Longford, Nicholas T.

    2014-01-01

    A method for medical screening is adapted to differential item functioning (DIF). Its essential elements are explicit declarations of the level of DIF that is acceptable and of the loss function that quantifies the consequences of the two kinds of inappropriate classification of an item. Instead of a single level and a single function, sets of…
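
    Longford's approach builds on explicit loss functions; as a simpler point of reference, the classical Mantel-Haenszel statistic is the most widely used DIF screening tool. A minimal sketch with invented toy counts, not data from the study:

```python
# Minimal sketch of Mantel-Haenszel DIF screening (a standard baseline,
# not the loss-function method described in the abstract above).
# Examinees are stratified by total score; for each stratum we tabulate
# correct/incorrect counts for the reference and focal groups.

def mantel_haenszel_odds_ratio(strata):
    """strata: list of (A, B, C, D) tuples per score stratum, where
    A = reference correct, B = reference incorrect,
    C = focal correct,     D = focal incorrect."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den  # common odds ratio; 1.0 means no DIF

# Toy data: three score strata, item noticeably harder for the focal group.
strata = [(40, 10, 30, 20), (30, 20, 20, 30), (20, 30, 10, 40)]
print(round(mantel_haenszel_odds_ratio(strata), 2))   # → 2.5
```

An odds ratio well above 1 across strata would flag the item for review.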

  3. Selected response test items.

    Science.gov (United States)

    Tomey, A M

    1999-01-01

    Classroom assessment is complex and challenging. Teachers need to consider the cognitive, affective, and psychomotor levels for achievement of their educational objectives. This series of six articles discusses how to develop testing blueprints; selected-response tests, including multiple-choice, true-false, matching, or other objective tests; completion or essay testing; problem solving/critical thinking activities; performance assessment; and computer-based testing.

  4. Computerized adaptive testing with item cloning

    NARCIS (Netherlands)

    Glas, Cornelis A.W.; van der Linden, Willem J.

    2003-01-01

    To increase the number of items available for adaptive testing and reduce the cost of item writing, the use of techniques of item cloning has been proposed. An important consequence of item cloning is possible variability between the item parameters. To deal with this variability, a multilevel item

  5. Item Overexposure in Computerized Classification Tests Using Sequential Item Selection

    Directory of Open Access Journals (Sweden)

    Alan Huebner

    2012-06-01

    Computerized classification tests (CCTs) often use sequential item selection, which administers items by maximizing psychometric information at a cut point demarcating passing and failing scores. This paper illustrates why this method of item selection leads to the overexposure of a significant number of items, and the performances of three different methods for controlling maximum item exposure rates in CCTs are compared. Specifically, the Sympson-Hetter, restricted, and item eligibility methods are examined in two studies realistically simulating different types of CCTs and are evaluated based upon criteria including classification accuracy, the number of items exceeding the desired maximum exposure rate, and test overlap. The pros and cons of each method are discussed from a practical perspective.
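
    The Sympson-Hetter method named above throttles each item with an administration probability. A minimal sketch, with invented item ids and control parameters:

```python
import random

# Hedged sketch of Sympson-Hetter exposure control (one of the three
# methods compared in the abstract). Each item i carries a control
# parameter k[i] = P(administer | selected); after the most informative
# item is selected, a Bernoulli experiment decides whether it is actually
# administered or skipped in favor of the next-best item.

def select_item(ranked_items, k, rng):
    """ranked_items: item ids ordered by information at the cut score."""
    for item in ranked_items:
        if rng.random() < k[item]:
            return item
    return ranked_items[-1]  # fall back to the last candidate

rng = random.Random(42)
k = {"i1": 0.25, "i2": 0.9, "i3": 1.0}   # i1 is throttled hard
counts = {"i1": 0, "i2": 0, "i3": 0}
for _ in range(1000):
    counts[select_item(["i1", "i2", "i3"], k, rng)] += 1

# i1's empirical exposure rate is pulled toward k["i1"] ~= 0.25,
# so the second-ranked item absorbs most administrations.
print(counts["i1"] < counts["i2"])
```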

  6. Unidimensional Interpretations for Multidimensional Test Items

    Science.gov (United States)

    Kahraman, Nilufer

    2013-01-01

    This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item-level…

  8. Matrix Sampling of Test Items. ERIC Digest.

    Science.gov (United States)

    Childs, Ruth A.; Jaciw, Andrew P.

    This Digest describes matrix sampling of test items as an approach to achieving broad coverage while minimizing testing time per student. Matrix sampling involves developing a complete set of items judged to cover the curriculum, then dividing the items into subsets and administering one subset to each student. Matrix sampling, by limiting the…
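
    The matrix-sampling idea described above can be sketched in a few lines: deal the complete item set round-robin into disjoint forms, one form per student. Item names here are placeholders:

```python
# Minimal sketch of matrix sampling: a full item set judged to cover the
# curriculum is divided into subsets (forms), and each student is
# administered only one form.

def build_forms(items, n_forms):
    """Deal items round-robin into n_forms disjoint subsets."""
    forms = [[] for _ in range(n_forms)]
    for idx, item in enumerate(items):
        forms[idx % n_forms].append(item)
    return forms

items = [f"item{i:02d}" for i in range(1, 13)]   # 12-item pool
forms = build_forms(items, 3)
print([len(f) for f in forms])                   # → [4, 4, 4]
assert set(items) == {i for f in forms for i in f}  # full coverage, no overlap
```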

  9. Item calibration in incomplete testing designs

    NARCIS (Netherlands)

    Eggen, Theo J.H.M.; Verhelst, Norman D.

    2011-01-01

    This study discusses the justifiability of item parameter estimation in incomplete testing designs in item response theory. Marginal maximum likelihood (MML) as well as conditional maximum likelihood (CML) procedures are considered in three commonly used incomplete designs: random incomplete, multistage testing and targeted testing designs.

  10. Item Calibration in Incomplete Testing Designs

    Science.gov (United States)

    Eggen, Theo J. H. M.; Verhelst, Norman D.

    2011-01-01

    This study discusses the justifiability of item parameter estimation in incomplete testing designs in item response theory. Marginal maximum likelihood (MML) as well as conditional maximum likelihood (CML) procedures are considered in three commonly used incomplete designs: random incomplete, multistage testing and targeted testing designs.…

  11. Classroom Test Writing: Effects of Item Format on Test Quality.

    Science.gov (United States)

    Torabi-Parizi, Rosa; Campbell, Noma Jo

    1982-01-01

    Investigates the effects of varying the placement of blanks and the number of options available in multiple-choice items on the reliability of fifth-grade students' scores. Results indicate that scores on three-choice item tests were not less reliable than scores on four-choice item tests. A similar finding was found regarding the placement of…

  12. Validation of Physics Standardized Test Items

    Science.gov (United States)

    Marshall, Jill

    2008-10-01

    The Texas Physics Assessment Team (TPAT) examined the Texas Assessment of Knowledge and Skills (TAKS) to determine whether it is a valid indicator of physics preparation for future course work and employment, and of the knowledge and skills needed to act as an informed citizen in a technological society. We categorized science items from the 2003 and 2004 10th and 11th grade TAKS by content area(s) covered, knowledge and skills required to select the correct answer, and overall quality. We also analyzed a 5,000-student sample of item-level results from the 2004 11th grade exam using standard statistical methods employed by test developers (factor analysis and Item Response Theory). Triangulation of our results revealed strengths and weaknesses of the different methods of analysis. The TAKS was found to be only weakly indicative of physics preparation, and we make recommendations for increasing the validity of standardized physics testing.

  13. Algorithmic test design using classical item parameters

    NARCIS (Netherlands)

    van der Linden, Willem J.; Adema, Jos J.

    1988-01-01

    Two optimization models for the construction of tests with a maximal value of coefficient alpha are given. Both models have a linear form and can be solved by using a branch-and-bound algorithm. The first model assumes an item bank calibrated under the Rasch model and can be used, for instance, when…
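
    Coefficient alpha, the quantity these optimization models maximize, is straightforward to compute for a fixed test. A minimal sketch on a toy score matrix (invented data, not from the study):

```python
# Coefficient (Cronbach's) alpha:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
# computed here for a toy 4-person x 3-item score matrix.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """scores: list of per-person lists of item scores."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

scores = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
print(round(cronbach_alpha(scores), 3))   # → 0.75
```

A branch-and-bound test assembler would search over item subsets to maximize this value subject to the test's constraints.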

  14. Bayesian item selection criteria for adaptive testing

    NARCIS (Netherlands)

    van der Linden, Wim J.

    1996-01-01

    R.J. Owen (1975) proposed an approximate empirical Bayes procedure for item selection in adaptive testing. The procedure replaces the true posterior by a normal approximation with closed-form expressions for its first two moments. This approximation was necessary to minimize the computational complexity…

  16. Bayesian Item Selection in Constrained Adaptive Testing Using Shadow Tests

    Science.gov (United States)

    Veldkamp, Bernard P.

    2010-01-01

    Application of Bayesian item selection criteria in computerized adaptive testing might result in improvement of bias and MSE of the ability estimates. The question remains how to apply Bayesian item selection criteria in the context of constrained adaptive testing, where large numbers of specifications have to be taken into account in the item…

  17. Using automatic item generation to create multiple-choice test items.

    Science.gov (United States)

    Gierl, Mark J; Lai, Hollis; Turner, Simon R

    2012-08-01

    Many tests of medical knowledge, from the undergraduate level to the level of certification and licensure, contain multiple-choice items. Although these are efficient in measuring examinees' knowledge and skills across diverse content areas, multiple-choice items are time-consuming and expensive to create. Changes in student assessment brought about by new forms of computer-based testing have created the demand for large numbers of multiple-choice items. Our current approaches to item development cannot meet this demand. We present a methodology for developing multiple-choice items based on automatic item generation (AIG) concepts and procedures. We describe a three-stage approach to AIG and we illustrate this approach by generating multiple-choice items for a medical licensure test in the content area of surgery. To generate multiple-choice items, our method requires a three-stage process. Firstly, a cognitive model is created by content specialists. Secondly, item models are developed using the content from the cognitive model. Thirdly, items are generated from the item models using computer software. Using this methodology, we generated 1248 multiple-choice items from one item model. Automatic item generation is a process that involves using models to generate items using computer technology. With our method, content specialists identify and structure the content for the test items, and computer technology systematically combines the content to generate new test items. By combining these outcomes, items can be generated automatically. © Blackwell Publishing Ltd 2012.
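
    Stage three of the authors' workflow, generating items from an item model by systematically combining content, can be sketched with a template and a Cartesian product. The stem and content lists below are invented placeholders, not the paper's surgery content:

```python
from itertools import product

# Illustrative sketch of the final AIG stage described above: software
# fills an item model's slots with content elements to generate many
# item stems. All names and strings here are hypothetical.

STEM = ("A patient presents with {symptom} after {event}. "
        "What is the most appropriate first step?")
slots = {
    "symptom": ["acute abdominal pain", "shortness of breath", "fever"],
    "event": ["minor surgery", "a fall", "starting a new medication"],
}

items = [STEM.format(symptom=s, event=e)
         for s, e in product(slots["symptom"], slots["event"])]
print(len(items))   # → 9 (3 x 3 slot combinations)
```

In practice the cognitive model also constrains which combinations are clinically sensible, so not every Cartesian product survives review.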

  18. Optimal item pool design for computerized adaptive tests with polytomous items using GPCM

    Directory of Open Access Journals (Sweden)

    Xuechun Zhou

    2014-09-01

    Computerized adaptive testing (CAT) is a testing procedure with advantages in improving measurement precision and increasing test efficiency. An item pool with optimal characteristics is the foundation for a CAT program to achieve those desirable psychometric features. This study proposed a method to design an optimal item pool for tests with polytomous items using the generalized partial credit model (G-PCM). It extended a method for approximating optimality with polytomous items being described succinctly for the purpose of pool design. Optimal item pools were generated using CAT simulations with and without practical constraints of content balancing and item exposure control. The performances of the item pools were evaluated against an operational item pool. The results indicated that the item pools designed with stratification based on discrimination parameters performed well with an efficient use of the less discriminative items within the target accuracy levels. The implications for developing item pools are also discussed.

  19. Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items

    Science.gov (United States)

    Gierl, Mark J.; Lai, Hollis

    2013-01-01

    Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content-specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer…

  20. A Process for Reviewing and Evaluating Generated Test Items

    Science.gov (United States)

    Gierl, Mark J.; Lai, Hollis

    2016-01-01

    Testing organizations need large numbers of high-quality items due to the proliferation of alternative test administration methods and modern test designs. But the current demand for items far exceeds the supply. Test items, as they are currently written, rely on a process that is both time-consuming and expensive because each item is written,…

  1. An Item Analysis and Validity Investigation of Bender Visual Motor Gestalt Test Score Items

    Science.gov (United States)

    Lambert, Nadine M.

    1971-01-01

    This investigation attempted to demonstrate the utility of standard item analysis procedures for selecting the most reliable and valid items for scoring Bender Visual Motor Gestalt Test records. (Author)

  2. Investigating Item Exposure Control Methods in Computerized Adaptive Testing

    Science.gov (United States)

    Ozturk, Nagihan Boztunc; Dogan, Nuri

    2015-01-01

    This study aims to investigate the effects of item exposure control methods on measurement precision and on test security under various item selection methods and item pool characteristics. In this study, the Randomesque (with item group sizes of 5 and 10), Sympson-Hetter, and Fade-Away methods were used as item exposure control methods. Moreover,…

  3. Evaluation of Northwest University, Kano Post-UTME Test Items Using Item Response Theory

    Science.gov (United States)

    Bichi, Ado Abdu; Hafiz, Hadiza; Bello, Samira Abdullahi

    2016-01-01

    High-stakes testing is used for the purposes of providing results that have important consequences. Validity is the cornerstone upon which all measurement systems are built. This study applied Item Response Theory principles to analyse Northwest University Kano Post-UTME Economics test items. The fifty (50) developed economics test items were…

  4. Examining item difficulty and response time on perceptual ability test items.

    Science.gov (United States)

    Yang, Chien-Lin; O'Neill, Thomas R; Kramer, Gene A

    2002-01-01

    This study examined item calibration stability in relation to response time and the levels of item difficulty between different response time groups on a sample of 389 examinees responding to six different subtest items of the Perceptual Ability Test (PAT). The results indicated that no Differential Item Functioning (DIF) was found and that a significant correlation coefficient of item difficulty was observed between slow and fast responders. Three distinct levels of difficulty emerged among the six subtests across groups. Slow responders spent significantly more time than fast responders on the four most difficult subtests. A positive significant relationship was found between item difficulty and response time across groups on the overall perceptual ability test items. Overall, this study found that: 1) the same underlying construct is being measured across groups, 2) the PAT scores were equally useful across groups, 3) different sources of item difficulty may exist among the six subtests, and 4) more difficult test items may require more time to answer.

  5. Alternate item types: continuing the quest for authentic testing.

    Science.gov (United States)

    Wendt, Anne; Kenny, Lorraine E

    2009-03-01

    Many test developers suggest that multiple-choice items can be used to evaluate critical thinking if the items are focused on measuring higher order thinking ability. The literature supports the use of alternate item types to assess additional competencies, such as higher level cognitive processing and critical thinking, as well as ways to allow examinees to demonstrate their competencies differently. This research study surveyed nurses after taking a test composed of alternate item types paired with multiple-choice items. The participants were asked to provide opinions regarding the items and the item formats. Demographic information was also collected, along with information on how the participants responded to the items. The results of this study reveal that the participants thought that, in general, the items were more authentic and allowed them to demonstrate their competence better than multiple-choice items did. Further investigation into the optimal blend of alternate items and multiple-choice items is needed.

  6. A Procedure for Linear Polychotomous Scoring of Test Items

    Science.gov (United States)

    1993-10-01

    …associated with the response categories of test items. When tests are scored using these scoring weights, test reliability increases. The new procedure is… program POLY. The example demonstrates how polyweighting can be used to calibrate and score test items drawn from an item bank that is too large to…

  7. An Analytical Method of Identifying Biased Test Items.

    Science.gov (United States)

    Plake, Barbara S.; Hoover, H. D.

    1979-01-01

    A follow-up technique is needed to identify items contributing to items-by-groups interaction when using an ANOVA procedure to examine a test for biased items. The method described includes distribution theory for assessing level of significance and is sensitive to items at all difficulty levels. (Author/GSK)

  8. Guide to good practices for the development of test items

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-01-01

    While the methodology used in developing test items can vary significantly, to ensure quality examinations, test items should be developed systematically. Test design and development is discussed in the DOE Guide to Good Practices for Design, Development, and Implementation of Examinations. This guide is intended to supplement that document by providing more detailed guidance on the development of specific test items. It primarily addresses the development of written examination test items; however, many of the concepts also apply to oral examinations, both in the classroom and on the job. This guide is intended for the classroom and laboratory instructor or curriculum developer responsible for the construction of individual test items. The document focuses on written test items but also includes information on open-reference (open book) examination test items. These test items have been categorized as short-answer, multiple-choice, or essay. Each test item format is described, examples are provided, and a procedure for development is included. The appendices provide examples for writing test items, a test item development form, and examples of various test item formats.

  9. Modeling Local Item Dependence in Cloze and Reading Comprehension Test Items Using Testlet Response Theory

    Science.gov (United States)

    Baghaei, Purya; Ravand, Hamdollah

    2016-01-01

    In this study the magnitudes of local dependence generated by cloze test items and reading comprehension items were compared and their impact on parameter estimates and test precision was investigated. An advanced English as a foreign language reading comprehension test containing three reading passages and a cloze test was analyzed with a…

  10. Effect of Multiple Testing Adjustment in Differential Item Functioning Detection

    Science.gov (United States)

    Kim, Jihye; Oshima, T. C.

    2013-01-01

    In a typical differential item functioning (DIF) analysis, a significance test is conducted for each item. As a test consists of multiple items, such multiple testing may increase the possibility of making a Type I error at least once. The goal of this study was to investigate how to control a Type I error rate and power using adjustment…
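
    The two adjustments most often studied in this setting are the Bonferroni correction and the Benjamini-Hochberg step-up procedure. A minimal sketch on toy per-item p-values (the abstract does not say which adjustments the study compared):

```python
# Controlling Type I error across many per-item DIF significance tests.

def bonferroni(pvals, alpha=0.05):
    """Reject H0 for items whose p-value clears alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR procedure: find the largest rank i with
    p_(i) <= i/m * alpha and reject all hypotheses up to that rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.001, 0.009, 0.02, 0.04, 0.30]
print(sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))   # → 2 4
```

The example shows the usual trade-off: Bonferroni controls the familywise error rate but flags fewer items than the less conservative FDR procedure.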

  11. A Review of Test Item Types

    Science.gov (United States)

    2008-03-06

    …for integration with the CAT-ASVAB as it exists now. The CAT-ASVAB currently employs only dichotomously scored items using the 3-parameter logistic model (3PL; Lord & Novick, 1968). IRT models appropriate for polytomously scored items (e.g., Muraki, 1997) are available, and mixing of models is not… problematic within the IRT framework per se. Nevertheless, the current CAT-ASVAB infrastructure is configured to work with the 3PL model only, and
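
    The 3PL model referenced in this record gives the probability of a correct response as a function of ability theta and item parameters a (discrimination), b (difficulty), and c (guessing). A minimal sketch with invented parameter values:

```python
import math

# Three-parameter logistic (3PL) IRT model:
#   P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b the examinee sits halfway between the guessing floor c
# and certainty: P = c + (1 - c)/2.
print(round(p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2), 2))   # → 0.6
```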

  12. The Identification of Radex Properties in Objective Test Items.

    Science.gov (United States)

    Seddon, G. M.; And Others

    1981-01-01

    In a Monte Carlo simulation, a methodology was developed to investigate the existence of radex properties among objective test items. In an experiment with items covering four categories of Bloom's cognitive domain taxonomy, the items did not have the factorial properties of a radex with four levels of complexity. (Author/BW)

  13. Multistage Computerized Adaptive Testing with Uniform Item Exposure

    Science.gov (United States)

    Edwards, Michael C.; Flora, David B.; Thissen, David

    2012-01-01

    This article describes a computerized adaptive test (CAT) based on the uniform item exposure multi-form structure (uMFS). The uMFS is a specialization of the multi-form structure (MFS) idea described by Armstrong, Jones, Berliner, and Pashley (1998). In an MFS CAT, the examinee first responds to a small fixed block of items. The items comprising…

  14. An emotional functioning item bank of 24 items for computerized adaptive testing (CAT) was established

    DEFF Research Database (Denmark)

    Petersen, Morten Aa.; Gamper, Eva-Maria; Costantini, Anna

    2016-01-01

    OBJECTIVE: To improve measurement precision, the European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is developing an item bank for computerized adaptive testing (CAT) of emotional functioning (EF). The item bank will be within the conceptual framework… of the widely used EORTC Quality of Life questionnaire (QLQ-C30). STUDY DESIGN AND SETTING: On the basis of literature search and evaluations by international samples of experts and cancer patients, 38 candidate items were developed. The psychometric properties of the items were evaluated in a large… international sample of cancer patients. This included evaluations of dimensionality, item response theory (IRT) model fit, differential item functioning (DIF), and of measurement precision/statistical power. RESULTS: Responses were obtained from 1,023 cancer patients from four countries. The evaluations showed…

  15. Application of Unidimensional Item Response Models to Tests with Items Sensitive to Secondary Dimensions

    Science.gov (United States)

    Zhang, Bo

    2008-01-01

    In this research, the author addresses whether the application of unidimensional item response models provides valid interpretation of test results when administering items sensitive to multiple latent dimensions. Overall, the present study found that unidimensional models are quite robust to the violation of the unidimensionality assumption due…

  16. Estimating the Reliability of a Test Containing Multiple Item Formats.

    Science.gov (United States)

    Qualls, Audrey L.

    1995-01-01

    Classically parallel, tau-equivalently parallel, and congenerically parallel models representing various degrees of part-test parallelism and their appropriateness for tests composed of multiple item formats are discussed. An appropriate reliability estimate for a test with multiple item formats is presented and illustrated. (SLD)

  17. Costs of Matrix Sampling of Test Items. ERIC Digest.

    Science.gov (United States)

    Childs, Ruth A.; Jaciw, Andrew P.

    Matrix sampling of test items, the division of a set of items into different versions of a test form, is used by several large-scale testing programs. This Digest discusses nine categories of costs associated with matrix sampling. These categories are: (1) development costs; (2) materials costs; (3) administration costs; (4) educational costs; (5)…

  18. Optimal Bayesian Adaptive Design for Test-Item Calibration

    NARCIS (Netherlands)

    van der Linden, Wim J.; Ren, Hao

    2015-01-01

    An optimal adaptive design for test-item calibration based on Bayesian optimality criteria is presented. The design adapts the choice of field-test items to the examinees taking an operational adaptive test using both the information in the posterior distributions of their ability parameters and the current posterior distributions of the field-test parameters.

  19. Textiles and Design Library of Test Items. Volume I.

    Science.gov (United States)

    Smith, Jan, Ed.

    As one in a series of test item collections developed by the Assessment and Evaluation Unit of the Directorate of Studies, items of value from past tests are made available to teachers for the construction of unit tests, term examinations or as a basis for class discussion. Each collection is reviewed for content validity and reliability. The test…

  20. Refine test items for accurate measurement: six valuable tips.

    Science.gov (United States)

    Siroky, Karen; Di Leonardi, Bette Case

    2015-01-01

    Nursing Professional Development (NPD) specialists frequently design test items to assess competence, to measure learning outcomes, and to create active learning experiences. This article presents six valuable tips for improving test items and using test results to strengthen validity of measurement. NPD specialists can readily apply these tips and examples to measure knowledge with greater accuracy.

  1. Assessing the quality of multiple-choice test items.

    Science.gov (United States)

    Clifton, Sandra L; Schriner, Cheryl L

    2010-01-01

    With the focus of nursing education geared toward teaching students to think critically, faculty need to ensure that test items require students to use a high level of cognitive processing. To evaluate their examinations, the authors assessed multiple-choice test items on final nursing examinations. The assessment included determining cognitive learning levels and frequency of items among 3 adult health courses, comparing difficulty values with cognitive learning levels, and examining discrimination values and the relationship to distracter performance.

  2. Interim Report of Field Test of Expedient Pavement Repairs (Test Items 1-15).

    Science.gov (United States)

    1980-03-01

    [Garbled OCR of the DTIC report form for: Interim Report of Field Test of Expedient Pavement Repairs (Test Items 1-15), by Raymond S. Rollings. Legible contents entries include: Item 14 Test Results; Density Results, Item 15; Summary of Test Items.]

  3. Testing Linear Models for Ability Parameters in Item Response Models

    NARCIS (Netherlands)

    Glas, Cees A.W.; Hendrawan, Irene

    2005-01-01

    Methods for testing hypotheses concerning the regression parameters in linear models for the latent person parameters in item response models are presented. Three tests are outlined: a likelihood ratio test, a Lagrange multiplier test and a Wald test. The tests are derived in a marginal maximum likelihood…

  4. Differential functioning of Bender Visual-Motor Gestalt Test items.

    Science.gov (United States)

    Sisto, Fermino Fernandes; Dos Santos, Acácia Aparecida Angeli; Noronha, Ana Paula Porto

    2010-02-01

    Differential Item Functioning (DIF) refers to items that do not function the same way for comparable members of different groups. The present study focuses on analyzing and classifying sex-related differential item functioning in the Bender Visual-Motor Gestalt Test. Subjects were 1,052 children attending public schools (513 boys, 539 girls, ages 6-10 years). The protocols were scored using the Bender Graduated Scoring System, which evaluates only the distortion criterion using the Rasch logistic response model. The scoring system fit the Rasch model, although two items were found to be biased by sex. When analyzing differential functioning of items for boys and girls separately, the number of differentially functioning items was equal.

  5. Mathematical-programming approaches to test item pool design

    NARCIS (Netherlands)

    Veldkamp, Bernard P.; van der Linden, Willem J.; Ariel, A.

    2002-01-01

    This paper presents an approach to item pool design that has the potential to improve on the quality of current item pools in educational and psychological testing and hence to increase both measurement precision and validity. The approach consists of the application of mathematical programming

  6. Using response times for item selection in adaptive testing

    NARCIS (Netherlands)

    Linden, van der Wim J.

    2008-01-01

    Response times on items can be used to improve item selection in adaptive testing provided that a probabilistic model for their distribution is available. In this research, the author used a hierarchical modeling framework with separate first-level models for the responses and response times and a s

  7. Group differences in the heritability of items and test scores

    NARCIS (Netherlands)

    Wicherts, J.M.; Johnson, W.

    2009-01-01

    It is important to understand potential sources of group differences in the heritability of intelligence test scores. On the basis of a basic item response model we argue that heritabilities which are based on dichotomous item scores normally do not generalize from one sample to the next. If groups

  8. A lognormal model for response times on test items

    NARCIS (Netherlands)

    van der Linden, Willem J.

    2006-01-01

    A lognormal model for the response times of a person on a set of test items is investigated. The model has a parameter structure analogous to the two-parameter logistic response models in item response theory, with a parameter for the speed of each person as well as parameters for the time intensity
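
    Under the common parameterization of this model, the log of response time is normal with mean beta_i - tau_j (item time intensity minus person speed) and standard deviation 1/alpha_i. A minimal sketch of the implied expected response time, with invented parameter values:

```python
import math

# Lognormal response-time model (common parameterization):
#   ln T_ij ~ Normal(beta_i - tau_j, 1 / alpha_i**2)
# tau_j  : person speed, beta_i : item time intensity,
# alpha_i: discrimination-like parameter for the times.

def expected_time(alpha, beta, tau):
    """Mean of the lognormal: exp(beta - tau + 1 / (2 * alpha**2))."""
    return math.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))

# A faster person (larger tau) is expected to finish the same item sooner.
slow = expected_time(alpha=2.0, beta=4.0, tau=0.0)
fast = expected_time(alpha=2.0, beta=4.0, tau=0.5)
print(fast < slow)   # → True
```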

  9. Item selection and ability estimation adaptive testing

    NARCIS (Netherlands)

    Pashley, Peter J.; van der Linden, Wim J.; van der Linden, Willem J.; Glas, Cornelis A.W.; Glas, Cees A.W.

    2010-01-01

    The last century saw tremendous progress in the refinement and use of standardized linear tests. The first College Board exam was administered in 1901, and the first Scholastic Assessment Test (SAT) was given in 1926. Since then, progressively more sophisticated standardized linear tests

  10. Quantitative penetration testing with item response theory

    NARCIS (Netherlands)

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle

    2014-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  11. Quantitative penetration testing with item response theory

    NARCIS (Netherlands)

    Pieters, W.; Arnold, F.; Stoelinga, M.I.A.

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management, in

  13. Optimal Bayesian Adaptive Design for Test-Item Calibration.

    Science.gov (United States)

    van der Linden, Wim J; Ren, Hao

    2015-06-01

    An optimal adaptive design for test-item calibration based on Bayesian optimality criteria is presented. The design adapts the choice of field-test items to the examinees taking an operational adaptive test using both the information in the posterior distributions of their ability parameters and the current posterior distributions of the field-test parameters. Different criteria of optimality based on the two types of posterior distributions are possible. The design can be implemented using an MCMC scheme with alternating stages of sampling from the posterior distributions of the test takers' ability parameters and the parameters of the field-test items while reusing samples from earlier posterior distributions of the other parameters. Results from a simulation study demonstrated the feasibility of the proposed MCMC implementation for operational item calibration. A comparison of performances for different optimality criteria showed faster calibration of substantial numbers of items for the criterion of D-optimality relative to A-optimality, a special case of c-optimality, and random assignment of items to the test takers.
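
    The D- versus A-optimality comparison in the abstract can be illustrated on toy posterior information matrices; the item names and matrix values below are hypothetical, chosen deliberately so the two criteria disagree.

```python
import numpy as np

def d_optimal(infos):
    """D-optimality: choose the candidate whose information matrix has the
    largest determinant (smallest posterior 'volume' for its parameters)."""
    return max(infos, key=lambda k: float(np.linalg.det(infos[k])))

def a_optimal(infos):
    """A-optimality: choose the candidate minimizing the trace of the inverse
    information matrix (sum of posterior parameter variances)."""
    return min(infos, key=lambda k: float(np.trace(np.linalg.inv(infos[k]))))

# Hypothetical 2x2 information matrices, one row/column per parameter of a
# two-parameter field-test item.
candidates = {
    "item_a": np.diag([20.0, 0.25]),  # one parameter pinned down, one poorly
    "item_b": np.diag([2.0, 2.0]),    # both parameters moderately well
    "item_c": np.diag([1.0, 1.0]),
}
```

Here D-optimality prefers item_a (determinant 5 versus 4) while A-optimality prefers item_b (variance sum 1.0 versus 4.05), which illustrates why the two criteria can calibrate items at different speeds.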

  14. Differential Weighting of Items to Improve University Admission Test Validity

    OpenAIRE

    Eduardo Backhoff Escudero; Felipe Tirado Segura; Norma Larrazolo Reyna

    2001-01-01

    This paper gives an evaluation of different ways to increase university admission test criterion-related validity, by differentially weighting test items. We compared four methods of weighting multiple-choice items of the Basic Skills and Knowledge Examination (EXHCOBA): (1) punishing incorrect responses by a constant factor, (2) weighting incorrect responses, considering the levels of error, (3) weighting correct responses, considering the item's difficulty, based on the Classic Measurement Theory…

  15. Test Item Linguistic Complexity and Assessments for Deaf Students

    Science.gov (United States)

    Cawthon, Stephanie

    2011-01-01

    Linguistic complexity of test items is one test format element that has been studied in the context of struggling readers and their participation in paper-and-pencil tests. The present article presents findings from an exploratory study on the potential relationship between linguistic complexity and test performance for deaf readers. A total of 64…

  17. Modeling Answer Changes on Test Items

    Science.gov (United States)

    van der Linden, Wim J.; Jeon, Minjeong

    2012-01-01

    The probability of test takers changing answers upon review of their initial choices is modeled. The primary purpose of the model is to check erasures on answer sheets recorded by an optical scanner for numbers and patterns that may be indicative of irregular behavior, such as teachers or school administrators changing answer sheets after their…

  18. Assessing Differential Item Functioning in Performance Tests.

    Science.gov (United States)

    Zwick, Rebecca; And Others

    Although the belief has been expressed that performance assessments are intrinsically more fair than multiple-choice measures, some forms of performance assessment may in fact be more likely than conventional tests to tap construct-irrelevant factors. As performance assessment grows in popularity, it will be increasingly important to monitor the…

  19. Item response times in computerized adaptive testing

    Directory of Open Access Journals (Sweden)

    Lutz F. Hornke

    2000-01-01

    Full Text Available Computerized adaptive tests (CATs) yield item response times in addition to scores. Research into the additional meaning that can be extracted from response-time information is of particular interest. Data from 5912 young adults who took a computerized adaptive test were available. Earlier studies report longer response times when answers are incorrect, a result replicated in this larger study. However, mean item response times for wrong and for correct answers do not support an interpretation different from that given by the trait levels, nor do they correlate differently with a set of ability tests. Whether response times should be interpreted on the same dimension the CAT measures or on other dimensions is discussed. Since the early 1930s, response times have been considered indicators of personality traits to be distinguished from the traits measured by test scores. This idea is discussed, and arguments for and against it are presented. More recent model-based approaches are also shown. Whether additional diagnostic information can be obtained from a CAT with detailed, programmed data collection remains an open question.

  20. Selecting Test Item Types To Evaluate Library Skills.

    Science.gov (United States)

    Fagan, Jody Condit

    2002-01-01

    This article outlines the advantages and disadvantages of various question types in tests for library classes, including selected-response, constructed-response and alternative-response test items. It examines a test case in which students in a for-credit library course were given a take home quiz with search story problems. Sample "search story"…

  1. Adaptive testing for psychological assessment: how many items are enough to run an adaptive testing algorithm?

    Science.gov (United States)

    Wagner-Menghin, Michaela M; Masters, Geoff N

    2013-01-01

    Although the principles of adaptive testing were established in the psychometric literature many years ago (e.g., Weiss, 1977), and the practice of adaptive testing is established in educational assessment, it is not yet widespread in psychological assessment. One obstacle to adaptive psychological testing is a lack of clarity about the necessary number of items to run an adaptive algorithm. The study explores the relationship between item bank size, test length, and measurement precision. Simulated adaptive test runs (allowing a maximum of 30 items per person) out of an item bank with 10 items per ability level (covering .5 logits, 150 items total) yield a standard error of measurement (SEM) of .47 (.39) after an average of 20 (29) items for 85-93% (64-82%) of the simulated rectangular sample. Expanding the bank to 20 items per level (300 items total) did not improve the algorithm's performance significantly. With a small item bank (5 items per ability level, 75 items total) it is possible to reach the same SEM as with a conventional test but with fewer items, or a better SEM with the same number of items.
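
    The kind of simulation reported here can be sketched in a few lines for the Rasch case. Everything below (the bank size and distribution, the closest-difficulty selection rule, the clamped Newton update, the stopping SEM of .47) is an illustrative reconstruction, not the authors' code.

```python
import math
import random

random.seed(7)
BANK = [random.gauss(0.0, 1.5) for _ in range(150)]   # Rasch item difficulties

def p_correct(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_cat(true_theta, max_items=30, target_sem=0.47):
    available = list(BANK)
    administered, responses = [], []
    theta = 0.0
    sem = float("inf")
    for _ in range(max_items):
        # for the Rasch model, the most informative item is the one whose
        # difficulty is closest to the current ability estimate
        b = min(available, key=lambda d: abs(d - theta))
        available.remove(b)
        administered.append(b)
        responses.append(random.random() < p_correct(true_theta, b))
        for _ in range(5):   # a few Newton-Raphson steps toward the MLE
            ps = [p_correct(theta, d) for d in administered]
            grad = sum(u - p for u, p in zip(responses, ps))
            info = sum(p * (1.0 - p) for p in ps)
            theta = max(-4.0, min(4.0, theta + grad / info))  # clamp early runs
        sem = 1.0 / math.sqrt(info)
        if sem <= target_sem:
            break
    return theta, sem, len(administered)
```

One simulated examinee, e.g. `simulate_cat(1.0)`, returns the final estimate, its SEM, and how many items the stopping rule needed.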

  2. A Method for Severely Constrained Item Selection in Adaptive Testing.

    Science.gov (United States)

    Stocking, Martha L.; Swanson, Len

    1993-01-01

    A method is presented for incorporating a large number of constraints on adaptive item selection in the construction of computerized adaptive tests. The method, which emulates practices of expert test specialists, is illustrated for verbal and quantitative measures. Its foundation is application of a weighted deviations model and algorithm. (SLD)

  3. Differential Item Functioning on Two Tests of EFL Proficiency.

    Science.gov (United States)

    Ryan, Katherine E.; Bachman, Lyle F.

    1992-01-01

    The extent to which items from the Test of English as a Foreign Language and the First Certificate in English function differently for test-takers of equal ability from different native language and curricular backgrounds was investigated. Results suggest a need for methods like logistic regression to examine nonuniform differential item…

  4. Differential Weighting of Items to Improve University Admission Test Validity

    Directory of Open Access Journals (Sweden)

    Eduardo Backhoff Escudero

    2001-05-01

    Full Text Available This paper gives an evaluation of different ways to increase university admission test criterion-related validity, by differentially weighting test items. We compared four methods of weighting multiple-choice items of the Basic Skills and Knowledge Examination (EXHCOBA): (1) punishing incorrect responses by a constant factor, (2) weighting incorrect responses, considering the levels of error, (3) weighting correct responses, considering the item's difficulty, based on the Classic Measurement Theory, and (4) weighting correct responses, considering the item's difficulty, based on the Item Response Theory. Results show that none of these methods increased the instrument's predictive validity, although they did improve its concurrent validity. It was concluded that it is appropriate to score the test by simply adding up correct responses.
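
    Two of the four weighting schemes are simple enough to sketch; the p-values, response vector, and option count below are hypothetical, and the method numbering follows the abstract.

```python
# Hypothetical five-item illustration: p_values are classical item difficulties
# (proportion correct in a norm group), responses are one examinee's 0/1 scores.
p_values  = [0.9, 0.7, 0.5, 0.3, 0.1]   # easy -> hard
responses = [1, 1, 1, 0, 1]
n_options = 5                           # alternatives per multiple-choice item

number_right = sum(responses)

# Method (1): formula scoring, punishing incorrect responses by a constant.
wrong = len(responses) - number_right
formula_score = number_right - wrong / (n_options - 1)

# Method (3): CTT-style weighting, harder items earn more credit when correct.
ctt_weighted = sum(u * (1.0 - p) for u, p in zip(responses, p_values))
```

Under these toy numbers the examinee scores 4 by number-right, 3.75 by formula scoring, and 1.8 by difficulty weighting; the ranking of examinees, not the raw scale, is what matters for validity.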

  5. Unidimensional IRT Item Parameter Estimates across Equivalent Test Forms with Confounding Specifications within Dimensions

    Science.gov (United States)

    Matlock, Ki Lynn; Turner, Ronna

    2016-01-01

    When constructing multiple test forms, the number of items and the total test difficulty are often equivalent. Not all test developers match the number of items and/or average item difficulty within subcontent areas. In this simulation study, six test forms were constructed having an equal number of items and average item difficulty overall.…

  6. IRT-Estimated Reliability for Tests Containing Mixed Item Formats

    Science.gov (United States)

    Shu, Lianghua; Schwarz, Richard D.

    2014-01-01

    As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's a, Feldt-Raju, stratified a, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted…

  7. Some new item selection criteria for adaptive testing

    NARCIS (Netherlands)

    Veerkamp, Wim J.J.; Berger, Martijn P.F.

    1994-01-01

    In this study some alternative item selection criteria for adaptive testing are proposed. These criteria take into account the uncertainty of the ability estimates. A general weighted information criterion is suggested, of which the usual maximum information criterion and the suggested alternative criteria…
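
    A point-estimate maximum-information rule, plus one uncertainty-aware variant in the spirit of the proposal (averaging information over an interval around the ability estimate), can be sketched as follows. The pool and the grid are illustrative; this is not the authors' exact criterion.

```python
import math

def info_2pl(theta, a, b):
    # Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p).
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical pool of (discrimination a, difficulty b) pairs.
POOL = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.5), (2.0, 1.2)]

def max_info_item(theta, items=POOL):
    # Classical criterion: maximize information at the point estimate.
    return max(items, key=lambda ab: info_2pl(theta, *ab))

def interval_info_item(theta, se, items=POOL, grid=11):
    # Uncertainty-aware sketch: sum information over a grid spanning
    # theta +/- se rather than evaluating at a single point.
    thetas = [theta - se + 2.0 * se * k / (grid - 1) for k in range(grid)]
    return max(items, key=lambda ab: sum(info_2pl(t, *ab) for t in thetas))
```

When the ability estimate is still uncertain early in the test, the interval criterion is less prone to "wasting" highly discriminating items on a poorly located estimate.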

  8. DIF Analysis for Pretest Items in Computer-Adaptive Testing.

    Science.gov (United States)

    Zwick, Rebecca; And Others

    A simulation study of methods of assessing differential item functioning (DIF) in computer-adaptive tests (CATs) was conducted by Zwick, Thayer, and Wingersky (in press, 1993). Results showed that modified versions of the Mantel-Haenszel and standardization methods work well with CAT data. DIF methods were also investigated for nonadaptive…

  9. Differential Item Functioning on the Graduate Management Admission Test.

    Science.gov (United States)

    O'Neill, Kathleen A.; And Others

    The purpose of this study was to identify differentially functioning items on operational administrations of the Graduate Management Admission Test (GMAT) through the use of the Mantel-Haenszel statistic. Retrospective analyses of data collected over 3 years are reported for black/white and female/male comparisons for the Verbal and Quantitative…

  10. Analysis of Individual "Test Of Astronomy STandards" (TOAST) Item Responses

    Science.gov (United States)

    Slater, Stephanie J.; Schleigh, Sharon Price; Stork, Debra J.

    2015-01-01

    The development of valid and reliable strategies to efficiently determine the knowledge landscape of introductory astronomy college students is an effort of great interest to the astronomy education community. This study examines individual item response rates from a widely used conceptual understanding survey, the Test Of Astronomy Standards…

  11. Expected linking error resulting from item parameter drift among the common Items on Rasch calibrated tests.

    Science.gov (United States)

    Miller, G Edward; Gesn, Paul Randall; Rotou, Jourania

    2005-01-01

    In state assessment programs that employ Rasch-based common item linking procedures, the linking constant is usually estimated with only those common items not identified as exhibiting item difficulty parameter drift. Since state assessments typically contain a fixed number of items, an item classified as exhibiting parameter drift during the linking process remains on the exam as a scorable item even if it is removed from the common item set. Under the assumption that item parameter drift has occurred for one or more of the common items, the expected effect of including or excluding the "affected" item(s) in the estimation of the linking constant is derived in this article. If the item parameter drift is due solely to factors not associated with a change in examinee achievement, no linking error will (be expected to) occur given that the linking constant is estimated only with the items not identified as "affected"; linking error will (be expected to) occur if the linking constant is estimated with all common items. However, if the item parameter drift is due solely to change in examinee achievement, the opposite is true: no linking error will (be expected to) occur if the linking constant is estimated with all common items; linking error will (be expected to) occur if the linking constant is estimated only with the items not identified as "affected".
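
    The mechanics are easy to see in a toy mean-difference linking computation; the difficulty values and the size of the drift below are invented for illustration.

```python
# Hypothetical Rasch difficulties for five common items on the base form and
# on a new form; the last item has drifted upward beyond the common shift.
base = [-1.0, -0.5, 0.0, 0.5, 1.0]
new  = [-0.8, -0.3, 0.2, 0.7, 1.9]

def linking_constant(b_new, b_base):
    # Mean-difference linking constant for Rasch common-item equating.
    return sum(n - b for n, b in zip(b_new, b_base)) / len(b_base)

with_drifted   = linking_constant(new, base)          # all five common items: 0.34
drift_screened = linking_constant(new[:4], base[:4])  # drifted item removed: 0.20
```

Which of the two constants is the "error-free" one depends, as the abstract notes, on whether the drift reflects a real change in examinee achievement or a construct-irrelevant change in the item.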

  12. Designing multiple-choice test items at higher cognitive levels.

    Science.gov (United States)

    Su, Whei Ming; Osisek, Paul J; Montgomery, Cynthia; Pellar, Suzanne

    2009-01-01

    In the midst of a nursing faculty shortage, many academic institutions hire clinicians who are not formally prepared for an academic role. These novice faculty face an immediate need to develop teaching skills. One area in particular is test construction. To address this need, the authors describe how faculty from one course designed multiple-choice test items at higher cognitive levels and simultaneously achieved congruence with critical-thinking learning objectives defined by the course.

  13. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education.

    Science.gov (United States)

    Downing, Steven M

    2005-01-01

    The purpose of this research was to study the effects of violations of standard multiple-choice item writing principles on test characteristics, student scores, and pass-fail outcomes. Four basic science examinations, administered to year-one and year-two medical students, were randomly selected for study. Test items were classified as either standard or flawed by three independent raters, blinded to all item performance data. Flawed test questions violated one or more standard principles of effective item writing. Thirty-six to sixty-five percent of the items on the four tests were flawed. Flawed items were 0-15 percentage points more difficult than standard items measuring the same construct. Over all four examinations, 646 (53%) students passed the standard items while 575 (47%) passed the flawed items. The median passing rate difference between flawed and standard items was 3.5 percentage points, but ranged from -1 to 35 percentage points. Item flaws had little effect on test score reliability or other psychometric quality indices. Results showed that flawed multiple-choice test items, which violate well established and evidence-based principles of effective item writing, disadvantage some medical students. Item flaws introduce the systematic error of construct-irrelevant variance to assessments, thereby reducing the validity evidence for examinations and penalizing some examinees.

  14. Comparing Methods for Item Analysis: The Impact of Different Item-Selection Statistics on Test Difficulty

    Science.gov (United States)

    Jones, Andrew T.

    2011-01-01

    Practitioners often depend on item analysis to select items for exam forms and have a variety of options available to them. These include the point-biserial correlation, the agreement statistic, the B index, and the phi coefficient. Although research has demonstrated that these statistics can be useful for item selection, no research as of yet has…

  15. Item Pool Design for an Operational Variable-Length Computerized Adaptive Test

    Science.gov (United States)

    He, Wei; Reckase, Mark D.

    2014-01-01

    For computerized adaptive tests (CATs) to work well, they must have an item pool with sufficient numbers of good quality items. Many researchers have pointed out that, in developing item pools for CATs, not only is the item pool size important but also the distribution of item parameters and practical considerations such as content distribution…

  16. A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift

    Science.gov (United States)

    Guo, Rui; Zheng, Yi; Chang, Hua-Hua

    2015-01-01

    An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the…

  17. Relation of field independence and test-item format to student performance on written piagetian tests

    Science.gov (United States)

    López-Rupérez, F.; Palacios, C.; Sánchez, J.

    In this study we have investigated the relationship between the field-dependence-independence (FDI) dimension as measured by the Group Embedded Figures Test (GEFT) and subject performance on the Longeot test, a pencil-and-paper Piagetian test, through the open or closed format of its items. The sample consisted of 141 high school students. Correlation and variance analysis show that the FDI dimension and GEFT correlate significantly on only those items on the Longeot test that require formal reasoning. The effect of open- or closed-item format is found exclusively for formal items; only the open format discriminates significantly (at the 0.01 level) between the field-dependent and -independent subjects performing on this type of item. Some implications of these results for science education are discussed.

  18. The use of predicted values for item parameters in item response theory models: An application in intelligence tests

    NARCIS (Netherlands)

    Matteucci, M.; S. Mignani, Prof.; Veldkamp, Bernard P.

    2012-01-01

    In testing, item response theory models are widely used in order to estimate item parameters and individual abilities. However, even unidimensional models require a considerable sample size so that all parameters can be estimated precisely. The introduction of empirical prior information about candidates…

  19. Field Test of Expedient Pavement Repairs (Test Items 16-35).

    Science.gov (United States)

    1980-11-01

    Air Force Engineering and Services Center, Tyndall AFB: Field Test of Expedient Pavement Repairs (Test Items 16-35), Michael T. McNerney, Engineering Research Division. Final report, July 1978 - September 1979, published November 1980.

  20. Differential item functioning in the figure classification test

    Directory of Open Access Journals (Sweden)

    E. van Zyl

    1998-06-01

    Full Text Available The elimination of unfair discrimination and cultural bias of any kind is a contentious workplace issue in contemporary South Africa. To ensure fairness in testing, psychometric instruments are subjected to empirical investigations for the detection of possible bias that could lead to selection decisions constituting unfair discrimination. This study was conducted to explore the possible existence of differential item functioning (DIF), or potential bias, in the Figure Classification Test (A121) by means of the Mantel-Haenszel chi-square technique. The sample consisted of 498 men at a production company in the Western Cape. Although statistical analysis revealed significant differences between the mean test scores of three racial groups on the test, very few items were identified as having statistically significant DIF. The possibility is discussed that, despite the presence of some DIF, the differences between the means may not be due to the measuring instrument itself being biased, but rather to extraneous sources of variation, such as the unequal education and socio-economic backgrounds of the racial groups. It was concluded that there is very little evidence of item bias in the test.

  1. Evaluating the Psychometric Characteristics of Generated Multiple-Choice Test Items

    Science.gov (United States)

    Gierl, Mark J.; Lai, Hollis; Pugh, Debra; Touchie, Claire; Boulais, André-Philippe; De Champlain, André

    2016-01-01

    Item development is a time- and resource-intensive process. Automatic item generation integrates cognitive modeling with computer technology to systematically generate test items. To date, however, items generated using cognitive modeling procedures have received limited use in operational testing situations. As a result, the psychometric…

  3. An Effect Size Measure for Raju's Differential Functioning for Items and Tests

    Science.gov (United States)

    Wright, Keith D.; Oshima, T. C.

    2015-01-01

    This study established an effect size measure for differential functioning for items and tests' noncompensatory differential item functioning (NCDIF). The Mantel-Haenszel parameter served as the benchmark for developing NCDIF's effect size measure for reporting moderate and large differential item functioning in test items. The effect size of…

  4. A Model-Based Method for Content Validation of Automatically Generated Test Items

    Science.gov (United States)

    Zhang, Xinxin; Gierl, Mark

    2016-01-01

    The purpose of this study is to describe a methodology to recover the item model used to generate multiple-choice test items with a novel graph theory approach. Beginning with the generated test items and working backward to recover the original item model provides a model-based method for validating the content used to automatically generate test…

  5. Item response theory analysis of the life orientation test-revised: age and gender differential item functioning analyses.

    Science.gov (United States)

    Steca, Patrizia; Monzani, Dario; Greco, Andrea; Chiesi, Francesca; Primi, Caterina

    2015-06-01

    This study is aimed at testing the measurement properties of the Life Orientation Test-Revised (LOT-R) for the assessment of dispositional optimism by employing item response theory (IRT) analyses. The LOT-R was administered to a large sample of 2,862 Italian adults. First, confirmatory factor analyses demonstrated the theoretical conceptualization of the construct measured by the LOT-R as a single bipolar dimension. Subsequently, IRT analyses for polytomous, ordered response category data were applied to investigate the items' properties. The equivalence of the items across gender and age was assessed by analyzing differential item functioning. Discrimination and severity parameters indicated that all items were able to distinguish people with different levels of optimism and adequately covered the spectrum of the latent trait. Additionally, the LOT-R appears to be gender invariant and, with minor exceptions, age invariant. Results provided evidence that the LOT-R is a reliable and valid measure of dispositional optimism.

  6. Stochastic order in dichotomous item response models for fixed tests, adaptive tests, or multiple abilities

    NARCIS (Netherlands)

    van der Linden, Willem J.

    1995-01-01

    Dichotomous item response theory (IRT) models can be viewed as families of stochastically ordered distributions of responses to test items. This paper explores several properties of such distributions. The focus is on the conditions under which stochastic order in families of conditional distributions…

  7. A Practical Procedure for the Construction and Reliability Analysis of Fixed Length Tests with Random Drawn Test Items

    NARCIS (Netherlands)

    Draaijer, S.; Klinkenberg, S.

    2015-01-01

    A procedure to construct valid and fair fixed-length tests with randomly drawn items from an item bank is described. The procedure provides guidelines for the set-up of a typical achievement test with regard to the number of items in the bank and the number of items for each position in a test. Further…

  8. A Procedure for Linear Polychotomous Scoring of Test Items (Computer Diskette).

    Science.gov (United States)

    …weights that are then associated with the response categories of test items. When tests are scored using these scoring weights, test reliability … program poly. The example demonstrates how polyweighting can be used to calibrate and score test items drawn from an item bank that is too large to …

  9. Small-Item Contact Test Method, FY11 Release

    Science.gov (United States)

    2012-07-01

    Glossary excerpts: … sections on an item, often designated in a sampling plan and marked on an item to enable quick reference. sessile drop: a liquid droplet that is … contamination set: a specific contamination density, drop volume, and deposition pattern combination used for dose confirmation. (Figure titles: full item contamination illustration; localized contamination illustration; gross sample collection technique for small items.)

  10. Weighted Maximum-a-Posteriori Estimation in Tests Composed of Dichotomous and Polytomous Items

    Science.gov (United States)

    Sun, Shan-Shan; Tao, Jian; Chang, Hua-Hua; Shi, Ning-Zhong

    2012-01-01

    For mixed-type tests composed of dichotomous and polytomous items, polytomous items often yield more information than dichotomous items. To reflect the difference between the two types of items and to improve the precision of ability estimation, an adaptive weighted maximum-a-posteriori (WMAP) estimation is proposed. To evaluate the performance of…
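
    The MAP machinery behind such estimators can be sketched for the dichotomous part with a grid search; the prior, grid, and 2PL likelihood below are illustrative, and the paper's adaptive weighting of polytomous items is not reproduced here.

```python
import math

def map_estimate(responses, items, prior_sd=1.0, lo=-4.0, hi=4.0, grid=401):
    """Grid-search MAP: maximize a standard-normal-style log prior plus the
    2PL log-likelihood over an equally spaced ability grid. Dichotomous
    items only; responses are 0/1, items are (a, b) pairs."""
    best_theta, best_val = lo, -math.inf
    for k in range(grid):
        theta = lo + (hi - lo) * k / (grid - 1)
        val = -0.5 * (theta / prior_sd) ** 2          # log prior, up to a constant
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            val += math.log(p) if u else math.log(1.0 - p)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta
```

The prior pulls extreme estimates toward zero, which is what keeps MAP finite even for all-correct or all-incorrect response patterns.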

  11. The Rasch Model and Missing Data, with an Emphasis on Tailoring Test Items.

    Science.gov (United States)

    de Gruijter, Dato N. M.

    Many applications of educational testing have a missing data aspect (MDA). This MDA is perhaps most pronounced in item banking, where each examinee responds to a different subtest of items from a large item pool and where both person and item parameter estimates are needed. The Rasch model is emphasized, and its non-parametric counterpart (the…

  12. Detection of person misfit in computerized adaptive tests with polytomous items

    NARCIS (Netherlands)

    van Krimpen-Stoop, Edith M.L.A.; Meijer, Rob R.

    2002-01-01

    Item scores that do not fit an assumed item response theory model may cause the latent trait value to be inaccurately estimated. For a computerized adaptive test (CAT) using dichotomous items, several person-fit statistics for detecting misfitting item score patterns have been proposed. Both for paper-and-pencil…

  14. A Simulation Study of Methods for Assessing Differential Item Functioning in Computer-Adaptive Tests.

    Science.gov (United States)

    Zwick, Rebecca; And Others

    Simulated data were used to investigate the performance of modified versions of the Mantel-Haenszel and standardization methods of differential item functioning (DIF) analysis in computer-adaptive tests (CATs). Each "examinee" received 25 items out of a 75-item pool. A three-parameter logistic item response model was assumed, and…
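
    The (non-adaptive) Mantel-Haenszel statistic that these CAT studies modify can be sketched directly from score-matched 2x2 tables; the tables below are invented, and the ETS delta transform is shown for context.

```python
import math

# Hypothetical score-matched 2x2 tables, one per matching score level:
# (reference correct, reference wrong, focal correct, focal wrong).
TABLES = [
    (40, 10, 30, 20),
    (30, 20, 25, 25),
    (20, 30, 10, 40),
]

def mh_common_odds_ratio(tables):
    """Mantel-Haenszel estimator of the common odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

alpha_mh = mh_common_odds_ratio(TABLES)
# ETS delta metric: large absolute values (beyond roughly 1.5, with
# significance) conventionally flag substantial DIF.
mh_d_dif = -2.35 * math.log(alpha_mh)
```

An odds ratio above 1 (negative MH D-DIF) means the item favors the reference group after matching on total score; the CAT modifications in the abstract replace the observed-score matching with ability-estimate matching.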

  15. Determining differential item functioning and its effect on the test scores of selected PIB indexes, using item response theory techniques

    Directory of Open Access Journals (Sweden)

    Pieter Schaap

    2001-02-01

    Full Text Available The objective of this article is to present the results of an investigation into the item and test characteristics of two tests of the Potential Index Batteries (PIB) in terms of differential item functioning (DIF) and the effect thereof on the test scores of different race groups. The English Vocabulary (Index 12) and Spelling (Index 22) tests of the PIB were analysed for white, black and coloured South Africans. Item response theory (IRT) methods were used to identify items which function differentially for the white, black and coloured race groups.

  16. The Psychological Effect of Errors in Standardized Language Test Items on EFL Students' Responses to the Following Item

    Science.gov (United States)

    Khaksefidi, Saman

    2017-01-01

    This study investigates the psychological effect of a flawed question with flawed answer options on responses to the next question in a test of structure. Forty students selected through stratified random sampling were given 15 questions from a standardized test, namely a TOEFL structure test, in which questions 7 and 11 were flawed, and their answers…

  17. Testing psychometric properties of the 30-item general health questionnaire.

    Science.gov (United States)

    Klainin-Yobas, Piyanee; He, Hong-Gu

    2014-01-01

    This study aimed to evaluate the psychometric properties of the General Health Questionnaire (GHQ-30), given conflicting findings in the literature. A cross-sectional, nonexperimental design was used with a convenience sample of 271 American female health care professionals. Data were collected using self-report questionnaires. A series of exploratory factor analyses (EFAs), confirmatory factor analyses (CFAs), and structural equation models (SEM) were performed to examine the underlying dimensions of the GHQ-30. Results from the EFAs and CFAs revealed a three-factor structure (positive affect, anxiety, and depressed mood). All factor loadings were statistically significant, and one pair of error variances was allowed to correlate. All factors contained questionnaire items with acceptable face validity and demonstrated good internal consistency reliability. Results from the SEM further confirmed the underlying constructs of the scale. To our knowledge, this is the first study to test the psychometric properties of the GHQ-30 extensively, taking both statistical and substantive issues into consideration.

  18. Relevance of Item Analysis in Standardizing an Achievement Test in Teaching of Physical Science in B.Ed Syllabus

    Science.gov (United States)

    Marie, S. Maria Josephine Arokia; Edannur, Sreekala

    2015-01-01

    This paper focused on the analysis of test items constructed in the paper of teaching Physical Science for B.Ed. class. It involved the analysis of difficulty level and discrimination power of each test item. Item analysis allows selecting or omitting items from the test, but more importantly item analysis is a tool to help the item writer improve…

  19. What factors make science test items especially difficult for students from minority groups?

    Directory of Open Access Journals (Sweden)

    Are Turmo

    2012-06-01

    Substantial gaps in science performance between majority and minority students are often found in standardized tests used in primary school, but at the item level the gaps may vary significantly. The aims of this study are: (1) to identify features of science test items (grades 5 and 8) that can potentially explain group differences; and (2) to analyze what factors make test items especially difficult for minority students. Explanatory variables such as reading load, item difficulty, item writing load, and use of the multiple-choice format are found to be major factors. The analysis reveals no empirical relationship between the performance gap and the item's subject domain, its location in the test, or the number of illustrations used in the item. Subtle issues in the design of items may influence the size of the performance gap at item level over and above the main explanatory variables. The gap can be reduced significantly by choosing "minority friendly" items.

  20. Test Accessibility: Item Reviews and Lessons Learned from Four State Assessments

    Directory of Open Access Journals (Sweden)

    Peter A. Beddow

    2013-01-01

    The push toward universally designed assessments has influenced several states to modify items from their general achievement tests to improve their accessibility for all test takers. The current study involved the review of 159 items used by one state across four content areas including science, coupled with the review of 261 science items in three other states. The item reviews were conducted using the Accessibility Rating Matrix (Beddow et al., 2009), a tool for systematically identifying access barriers in test items and for facilitating the subsequent modification process. The design allowed for within-state comparisons across several variables for one state and for within-content-area (i.e., science) comparisons across states. Findings indicated that few items were optimally accessible and that ratings were consistent across content areas, states, grade bands, and item types. Suggestions for modifying items are discussed and recommendations are offered to guide the development of optimally accessible test items.

  1. Development of a lack of appetite item bank for computer-adaptive testing (CAT)

    DEFF Research Database (Denmark)

    Thamsborg, Lise Laurberg Holst; Petersen, Morten Aa; Aaronson, Neil K

    2015-01-01

    measurement precision. The EORTC Quality of Life Group is developing a CAT version of the widely used EORTC QLQ-C30 questionnaire. Here, we report on the development of the lack of appetite CAT. METHODS: The EORTC approach to CAT development comprises four phases: literature search, operationalization, pre-testing, and field testing. Phases 1-3 are described in this paper. First, a list of items was retrieved from the literature. This was refined, deleting redundant and irrelevant items. Next, new items fitting the "QLQ-C30 item style" were created. These were evaluated by international samples of experts and cancer patients … to 12 lack of appetite items. CONCLUSIONS: Phases 1-3 resulted in 12 lack of appetite candidate items. Based on the field testing (phase 4), the psychometric characteristics of the items will be assessed and the final item bank will be generated. This CAT item bank is expected to provide precise…

  2. Origin bias of test items compromises the validity and fairness of curriculum comparisons

    NARCIS (Netherlands)

    Muijtjens, Arno M. M.; Schuwirth, Lambert W. T.; Cohen-Schotanus, Janke; van der Vleuten, Cees P. M.

    2007-01-01

    OBJECTIVE To determine whether items of progress tests used for inter-curriculum comparison favour students from the medical school where the items were produced (i.e. whether the origin bias of test items is a potential confounder in comparisons between curricula). METHODS We investigated scores of

  3. Do Linguistic Features of Science Test Items Prevent English Language Learners from Demonstrating Their Knowledge?

    Science.gov (United States)

    Noble, Tracy; Kachchaf, Rachel; Rosebery, Ann; Warren, Beth; O'Connor, Mary Catherine; Wang, Yang

    2014-01-01

    Little research has examined individual linguistic features that influence English language learners' (ELLs') test performance. Furthermore, research has yet to explore the relationship between the science strand of test items and the types of linguistic features the items include. Utilizing differential item functioning, this study examines ELL…

  4. Multidimensional Linking for Tests with Mixed Item Types

    Science.gov (United States)

    Yao, Lihua; Boughton, Keith

    2009-01-01

    Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example,…

  5. Studies on statistical models for polytomously scored test items

    NARCIS (Netherlands)

    Akkermans, Wies

    1998-01-01

    This dissertation, which is structured as a collection of self-contained papers, is concerned mainly with differences between item response models. The purpose of item response theory (IRT) is the estimation of a hypothesized latent variable, such as, for example, intelligence or ability in geography…

  6. Visual Items in Tests of Intelligence for Children.

    Science.gov (United States)

    Wyver, Shirley R.; Markham, Roslyn; Hlavacek, Sonia

    1999-01-01

    A study compared the performance of 15 children (ages 5-12) with visual impairments and 15 controls on the Comprehension and Similarities items of the Wechsler Intelligence Scale for Children-Revised. Results indicated the children with visual impairments were disadvantaged by comprehension-type items with high visual content. (CR)

  7. Statistical tests of conditional independence between responses and/or response times on test items

    NARCIS (Netherlands)

    van der Linden, Willem J.; Glas, Cornelis A.W.

    2010-01-01

    Three plausible assumptions of conditional independence in a hierarchical model for responses and response times on test items are identified. For each of the assumptions, a Lagrange multiplier test of the null hypothesis of conditional independence against a parametric alternative is derived. The t

  8. Employment of Item Response Theory to measure change in Children's Analogical Thinking Modifiability Test

    OpenAIRE

    Queiroz,Odoisa Antunes de; Primi,Ricardo; Carvalho,Lucas de Francisco; Enumo,Sônia Regina Fiorim

    2013-01-01

    Dynamic testing, with an intermediate phase of assistance, measures changes between pretest and post-test, assuming a common metric between them. To test this assumption, we applied item response theory to the responses of 69 children to an adapted version of the Children's Analogical Thinking Modifiability Test, a dynamic cognitive test with 12 items (828 responses in total), with the purpose of verifying whether the original scale yields the same results as the equalized scale obtained by item response theory i...

  9. Are vocabulary tests measurement invariant between age groups? An item response analysis of three popular tests.

    Science.gov (United States)

    Fox, Mark C; Berry, Jane M; Freeman, Sara P

    2014-12-01

    Relatively high vocabulary scores of older adults are generally interpreted as evidence that older adults possess more of a common ability than younger adults. Yet, this interpretation rests on empirical assumptions about the uniformity of item-response functions between groups. In this article, we test item response models of differential responding against datasets containing younger-, middle-aged-, and older-adult responses to three popular vocabulary tests (the Shipley, Ekstrom, and WAIS-R) to determine whether members of different age groups who achieve the same scores have the same probability of responding in the same categories (e.g., correct vs. incorrect) under the same conditions. Contrary to the null hypothesis of measurement invariance, datasets for all three tests exhibit substantial differential responding. Members of different age groups who achieve the same overall scores exhibit differing response probabilities in relation to the same items (differential item functioning) and appear to approach the tests in qualitatively different ways that generalize across items. Specifically, younger adults are more likely than older adults to leave items unanswered for partial credit on the Ekstrom, and to produce 2-point definitions on the WAIS-R. Yet, older adults score higher than younger adults, consistent with most reports of vocabulary outcomes in the cognitive aging literature. In light of these findings, the most generalizable conclusion to be drawn from the cognitive aging literature on vocabulary tests is simply that older adults tend to score higher than younger adults, and not that older adults possess more of a common ability.

  10. Analysis of Software Test Item Generation- Comparison Between High Skilled and Low Skilled Engineers

    Institute of Scientific and Technical Information of China (English)

    Masayuki Hirayama; Osamu Mizuno; Tohru Kikuno

    2005-01-01

    Recent software systems contain many functions to provide various services. Because of this tendency, it is difficult to ensure software quality and to eliminate crucial faults with conventional software testing methods. Taking the effect of a test engineer's skill on test item generation into consideration, we propose a new test item generation method that supports the generation of test items for illegal behavior of the system. The proposed method can generate test items based on use-case analysis, deviation analysis of legal behavior, and fault tree analysis of system fault situations. From the results of experimental applications of our method, we confirmed that test items for illegal behavior of a system were generated effectively, and also that the proposed method can effectively assist test item generation by an engineer with a low skill level.

  11. Solving Verbal Analogies: Some Cognitive Components of Intelligence Test Items

    Science.gov (United States)

    Whitely, Susan E.

    1976-01-01

    The results indicate that although relational concepts influence the cognitive aptitudes which are reflected in analogy item performance, success in solving analogies does not depend on individual differences in some major aspects of processing relationships. (Author/DEP)

  12. International Semiotics: Item Difficulty and the Complexity of Science Item Illustrations in the PISA-2009 International Test Comparison

    Science.gov (United States)

    Solano-Flores, Guillermo; Wang, Chao; Shade, Chelsey

    2016-01-01

    We examined multimodality (the representation of information in multiple semiotic modes) in the context of international test comparisons. Using Programme for International Student Assessment (PISA) 2009 data, we examined the correlation between the difficulty of science items and the complexity of their illustrations. We observed statistically…

  13. The "Sniffin' Kids" test--a 14-item odor identification test for children.

    Directory of Open Access Journals (Sweden)

    Valentin A Schriever

    Tools for measuring olfactory function in adults have been well established. Although studies have shown that olfactory impairment in children may occur as a consequence of a number of diseases or head trauma, to date no consensus exists in Europe on how to evaluate the sense of smell in children. The aim of the study was to develop a modified "Sniffin' Sticks" odor identification test, the "Sniffin' Kids" test, for use in children. The study included 537 children between 6 and 17 years of age. Fourteen odors that were identified at a high rate by children were selected from the "Sniffin' Sticks" 16-item odor identification test. Normative data for the 14-item "Sniffin' Kids" odor identification test were obtained. The test was validated by including a group of children with congenital anosmia. Results show that the "Sniffin' Kids" test is able to discriminate between normosmia and anosmia with a cutoff value of >7 points on the odor identification test. In addition, the test-retest reliability was investigated in a group of 31 healthy children and shown to be ρ = 0.44. With the 14-item "Sniffin' Kids" odor identification test, we present a valid and reliable test for measuring olfactory function in children aged 6-17 years.

  14. Test-taker perception of what test items measure: a potential impact of face validity on student learning

    National Research Council Canada - National Science Library

    Sato, Takanori; Ikeda, Naoki

    2015-01-01

    ... agree. University students in Japan and Korea (N = 179) were given past entrance examinations administered in the respective countries and asked to read test items and record what ability they thought each item...

  15. Detecting Differential Item Functioning and Differential Test Functioning on Math School Final-exam

    Directory of Open Access Journals (Sweden)

    - Mansyur

    2016-08-01

    This study aims to identify the characteristics of differential item functioning (DIF) and differential test functioning (DTF) on a school final exam in mathematics, based on item response theory (IRT). The subjects of this study were the exam questions and all of the students' answer sheets, chosen using a convenience sampling method; 286 responses were obtained, from 147 male and 149 female students. The data were collected using a documentation technique, transcribing the responses of the mathematics school final-exam participants. The data were analysed with an IRT approach, using a two-parameter (2P) model with Lord's chi-square DIF method. The study showed that of the 40 question items analysed under item response theory (IRT), ten items exhibited gender DIF and 13 items exhibited location (area) DIF. The differential test functioning (DTF) was found to favour female examinees.
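
    Lord's chi-square method named above compares an item's estimated parameters between two groups, weighting the difference by the combined sampling covariance. A hedged sketch for a two-parameter item; the estimates and covariances below are invented for illustration:

```python
import numpy as np

def lords_chi_square(est_ref, est_focal, cov_ref, cov_focal):
    """Lord's chi-square for one 2P item: d' * inv(S) * d, where d is the
    difference of the (a, b) estimates between groups and S is the sum of
    their covariance matrices. Compared against chi-square with 2 df."""
    d = np.asarray(est_ref, float) - np.asarray(est_focal, float)
    s = np.asarray(cov_ref, float) + np.asarray(cov_focal, float)
    return float(d @ np.linalg.inv(s) @ d)

# Invented (discrimination a, difficulty b) estimates per group:
chi2 = lords_chi_square((1.2, 0.3), (0.9, 0.8), np.eye(2) * 0.02, np.eye(2) * 0.02)
flagged = chi2 > 5.99  # exceeds the 0.05 critical value with 2 df -> DIF
```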

  16. An Empirical Investigation of Methods for Assessing Item Fit for Mixed Format Tests

    Science.gov (United States)

    Chon, Kyong Hee; Lee, Won-Chan; Ansley, Timothy N.

    2013-01-01

    Empirical information regarding performance of model-fit procedures has been a persistent need in measurement practice. Statistical procedures for evaluating item fit were applied to real test examples that consist of both dichotomously and polytomously scored items. The item fit statistics used in this study included PARSCALE's G²,…

  17. V-TECS Criterion-Referenced Test Item Bank for Radiologic Technology Occupations.

    Science.gov (United States)

    Reneau, Fred; And Others

    This Vocational-Technical Education Consortium of States (V-TECS) criterion-referenced test item bank provides 696 multiple-choice items and 33 matching items for radiologic technology occupations. These job titles are included: radiologic technologist, chief; radiologic technologist; nuclear medicine technologist; radiation therapy technologist;…

  18. The Impact of Discourse Features of Science Test Items on ELL Performance

    Science.gov (United States)

    Kachchaf, Rachel; Noble, Tracy; Rosebery, Ann; Wang, Yang; Warren, Beth; O'Connor, Mary Catherine

    2014-01-01

    Most research on linguistic features of test items negatively impacting English language learners' (ELLs') performance has focused on lexical and syntactic features, rather than discourse features that operate at the level of the whole item. This mixed-methods study identified two discourse features in 162 multiple-choice items on a standardized…

  19. Preliminary results of a proficiency testing of industrial CT scanners using small polymer items

    DEFF Research Database (Denmark)

    Angel, Jais Andreas Breusch; Cantatore, Angela; De Chiffre, Leonardo

    2012-01-01

    This work presents preliminary results concerning a proficiency testing for intercomparison of industrial CT scanners. Two audit items, similar to common industrial parts, were selected for circulation. The two items were a single polymer complex geometry part and a simple geometry item made of two...

  20. Evaluating innovative items for the NCLEX, part I: usability and pilot testing.

    Science.gov (United States)

    Wendt, Anne; Harmes, J Christine

    2009-01-01

    National Council of State Boards of Nursing (NCSBN) has recently conducted preliminary research on the feasibility of including various types of innovative test questions (items) on the NCLEX. This article focuses on the participants' reactions to and their strategies for interacting with various types of innovative items. Part 2 in the May/June issue will focus on the innovative item templates and evaluation of the statistical characteristics and the level of cognitive processing required to answer the examination items.

  1. PENGEMBANGAN DAN ANALISIS SOAL ULANGAN KENAIKAN KELAS KIMIA SMA KELAS X BERDASARKAN CLASSICAL TEST THEORY DAN ITEM RESPONSE THEORY

    Directory of Open Access Journals (Sweden)

    Mr Nahadi

    2011-10-01

    This research, titled "Test Development and Analysis of a First-Grade Senior High School Final Examination in Chemistry Based on Classical Test Theory and Item Response Theory", was conducted to develop a standardized test instrument for the first-grade senior high school final examination, analysed under both classical test theory and item response theory. The test is a multiple-choice test consisting of 75 items, each with five options. The research method is the research-and-development method, aimed at producing test items that fulfil item criteria such as validity, reliability, item discrimination, item difficulty, and distractor quality under classical test theory, and validity, reliability, item discrimination, item difficulty, and pseudo-guessing under item response theory. The three-parameter item response theory model is used in this research. The research-and-development method was carried out up to a preliminary field test with 102 first-grade senior high school students. Based on the results, the test fulfils the criteria for a good instrument under both classical test theory and item response theory. The final examination test items vary in quality, so some of them need revision of both the stem and the options. Of the 75 test items, 21 were rejected and 54 accepted.
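
    The classical-test-theory item statistics discussed above (difficulty as the proportion correct, discrimination as an item-total relationship) can be sketched as follows; the response matrix is invented, and a corrected point-biserial (item vs. rest-of-test score) stands in for whichever discrimination index the study used:

```python
import numpy as np

def item_analysis(responses):
    """responses: examinees x items matrix of 0/1 scores.
    Returns per-item difficulty p and a corrected point-biserial
    discrimination (item score vs. total score excluding that item)."""
    x = np.asarray(responses, float)
    p = x.mean(axis=0)                       # proportion correct per item
    disc = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        rest = x.sum(axis=1) - x[:, j]       # total without the item itself
        disc[j] = np.corrcoef(x[:, j], rest)[0, 1]
    return p, disc

p, disc = item_analysis([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]])
```

    Items with very high or very low p, or with discrimination near zero or negative, are the usual candidates for revision or rejection.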

  2. Development of the Basic Core Test Items of National Nurse's License Examination

    Directory of Open Access Journals (Sweden)

    Cho Ja Kim

    2004-01-01

    The purpose of this study is to develop a classification framework for the test elements of the National Registered Nurse's License Examination, to divide the test items into standard and basic core items on the basis of the RN's job descriptions, and to identify an adequate proportion of basic core test items. Method and results: In order to develop the classification framework of the National Registered Nurse's License Examination, RN job descriptions, nursing standards, and the specific learning objectives of nursing courses were reviewed, and a survey was used to identify which source would be appropriate as a reference for the basic core test items. 146 professors from schools of nursing and members of each division of the Korean Academic Society of Nursing (KASN) participated in the survey. The study showed that 98% of respondents agreed to use RN job descriptions in selecting the basic core test items and that 30% would be an appropriate proportion for the basic core test. The contents, the selection criteria, and the proportion of the basic core test items were developed by the members of this research team, the members of the National RN's License Examination subcommittee, and the presidents of each division of KASN. A total of 1990 standard test items were selected from among 3524 items that 3 out of 7 members of the research team agreed to choose. Duplicated items among the standard items were deleted. 205 items out of the 1990 standard items were selected as basic core test items, and 14 items were added in Medical Laws and Ethics, bringing the total to 219 basic core test items. In conclusion, 99 items, 30% of the current examination's total items, were chosen as the final basic core test items using the Delphi method. Further studies are needed to validate the current National License Examination for RNs on the basis of the 99 basic core test items.

  3. Difficulty and Discrimination Parameters of Boston Naming Test Items in a Consecutive Clinical Series

    Science.gov (United States)

    Pedraza, Otto; Sachs, Bonnie C.; Ferman, Tanis J.; Rush, Beth K.; Lucas, John A.

    2011-01-01

    The Boston Naming Test is one of the most widely used neuropsychological instruments; yet, there has been limited use of modern psychometric methods to investigate its properties at the item level. The current study used item response theory (IRT) to examine each item's difficulty and discrimination properties, as well as the test's measurement precision across the range of naming ability. Participants included 300 consecutive referrals to the outpatient neuropsychology service at Mayo Clinic in Florida. Results showed that successive items do not necessarily reflect a monotonic increase in psychometric difficulty, some items are inadequate to distinguish individuals at various levels of naming ability, multiple items provide redundant psychometric information, and measurement precision is greatest for persons within a low-average range of ability. These findings may be used to develop short forms, improve reliability in future test versions by replacing psychometrically poor items, and analyze profiles of intra-individual variability. PMID:21593059
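
    Under a two-parameter logistic (2PL) model of the kind used in such item-level analyses, an item's difficulty b and discrimination a determine both its response curve and where on the ability scale it measures most precisely. A brief sketch with invented parameters:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item; peaks where theta == b."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# A discriminating item (a = 1.5) centred at average ability (b = 0)
# contributes most measurement precision near theta = 0:
peak = info_2pl(0.0, 1.5, 0.0)
off_target = info_2pl(2.0, 1.5, 0.0)  # far from b: much less information
```

    Summing item information over a test shows where measurement precision concentrates, which is how findings like "precision is greatest in the low-average range" are established.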

  5. Caution warranted in extrapolating from Boston Naming Test item gradation construct.

    Science.gov (United States)

    Beattey, Robert A; Murphy, Hilary; Cornwell, Melinda; Braun, Thomas; Stein, Victoria; Goldstein, Martin; Bender, Heidi Allison

    2017-01-01

    The Boston Naming Test (BNT) was designed to present items in order of difficulty based on word frequency. Changes in word frequencies over time, however, would frustrate extrapolation in clinical and research settings based on the theoretical construct because performance on the BNT might reflect changes in ecological frequency of the test items, rather than performance across items of increasing difficulty. This study identifies the ecological frequency of BNT items at the time of publication using the American Heritage Word Frequency Book and determines changes in frequency over time based on the frequency distribution of BNT items across a current corpus, the Corpus of Contemporary American English. Findings reveal an uneven distribution of BNT items across 2 corpora and instances of negligible differentiation in relative word frequency across test items. As BNT items are not presented in order from least to most frequent, clinicians and researchers should exercise caution in relying on the BNT as presenting items in increasing order of difficulty. A method is proposed for distributing confrontation-naming items to be explicitly measured against test items that are normally distributed across the corpus of a given language.
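
    The gradation check described above amounts to asking how well an item sequence tracks corpus frequency rank, for example via a Spearman rank correlation against presentation order. A rough sketch with invented per-million frequencies (tie handling omitted):

```python
def frequency_ranks(freqs):
    """Rank items by corpus frequency, 1 = most frequent (no tie handling)."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    ranks = [0] * len(freqs)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def order_agreement(freqs):
    """Spearman rho between presentation order and frequency rank;
    rho near 1 means the items really do run from frequent to rare."""
    r = frequency_ranks(freqs)
    n = len(freqs)
    d2 = sum((r[i] - (i + 1)) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented frequencies, listed in the order the items are administered:
rho = order_agreement([120.0, 95.0, 40.0, 55.0, 11.0, 8.0, 1.5])
```

    Recomputing rho against a current corpus versus the corpus available at publication is one way to quantify how much the intended difficulty ordering has drifted.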

  6. Effects of Reducing the Cognitive Load of Mathematics Test Items on Student Performance

    Directory of Open Access Journals (Sweden)

    Susan C. Gillmor

    2015-01-01

    This study explores a new item-writing framework for improving the validity of math assessment items. The authors transfer insights from Cognitive Load Theory (CLT, traditionally used in instructional design, to educational measurement. Fifteen, multiple-choice math assessment items were modified using research-based strategies for reducing extraneous cognitive load. An experimental design with 222 middle-school students tested the effects of the reduced cognitive load items on student performance and anxiety. Significant findings confirm the main research hypothesis that reducing the cognitive load of math assessment items improves student performance. Three load-reducing item modifications are identified as particularly effective for reducing item difficulty: signalling important information, aesthetic item organization, and removing extraneous content. Load reduction was not shown to impact student anxiety. Implications for classroom assessment and future research are discussed.

  7. [Difference analysis among majors in medical parasitology exam papers by test item bank proposition].

    Science.gov (United States)

    Jia, Lin-Zhi; Ya-Jun, Ma; Cao, Yi; Qian, Fen; Li, Xiang-Yu

    2012-04-30

    The quality indices of the "Medical Parasitology" exam papers for students in three majors at the university in 2010, together with the measured data, were compared and analyzed. The exam papers were generated from a test item bank. The alpha reliability coefficients of the three exam papers were all above 0.70. The knowledge structure and ability structure of the exam papers were largely balanced, but the alpha reliability coefficient of the second major's paper was the lowest, mainly due to the quality of the test items in that paper and the failure to revise the indices of the test item bank in time. This observation demonstrates that revising the test items and their indices in the item bank according to the measured data can improve the quality of item bank based test assembly and reduce differences among exam papers.
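
    The alpha reliability coefficients compared above are Cronbach's alpha, computable from the item-score variances and the total-score variance. A small sketch with invented dichotomous scores:

```python
def cronbach_alpha(scores):
    """scores: examinees x items matrix (list of lists).
    alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(scores[0])

    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(pvar(col) for col in zip(*scores))
    total_var = pvar([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_vars / total_var)

alpha = cronbach_alpha([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```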

  8. A-Stratified Computerized Adaptive Testing with Unequal Item Exposure across Strata.

    Science.gov (United States)

    Deng, Hui; Chang, Hua-Hua

    The purpose of this study was to compare a proposed revised a-stratified, or alpha-stratified, USTR method of test item selection with the original alpha-stratified multistage computerized adaptive testing approach (STR) and the use of maximum Fisher information (FSH) with respect to test efficiency and item pool usage using simulated computerized…

  9. The construction of parallel tests from IRT-based item banks

    NARCIS (Netherlands)

    Boekkooi-Timminga, Ellen

    1989-01-01

    The construction of parallel tests from item response theory (IRT) based item banks is discussed. Tests are considered parallel whenever their information functions are identical. After the methods for constructing parallel tests are considered, the computational complexity of 0-1 linear programming
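
    Matching information functions can be posed as 0-1 linear programming over the item bank, as the abstract notes. The greedy one-ability-point sketch below is not the dissertation's method and uses an invented bank; it only illustrates the objective of driving summed information toward a target:

```python
def assemble(bank_info, target, length):
    """Greedy sketch: choose `length` items from `bank_info`
    (item id -> information at one reference ability point) so the summed
    information approaches `target`. Real parallel-test assembly solves
    this as a 0-1 linear program over several ability points at once."""
    remaining = dict(bank_info)
    chosen, total = [], 0.0
    for _ in range(length):
        # pick the item that brings the running total closest to target
        best = min(remaining, key=lambda i: abs(target - (total + remaining[i])))
        total += remaining.pop(best)
        chosen.append(best)
    return chosen, total

bank = {"i1": 0.5, "i2": 0.4, "i3": 0.3, "i4": 0.2, "i5": 0.1}
chosen, total = assemble(bank, target=1.0, length=3)
```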

  10. The Impact of Test Dimensionality, Common-Item Set Format, and Scale Linking Methods on Mixed-Format Test Equating

    Science.gov (United States)

    Öztürk-Gübes, Nese; Kelecioglu, Hülya

    2016-01-01

    The purpose of this study was to examine the impact of dimensionality, common-item set format, and different scale linking methods on preserving equity property with mixed-format test equating. Item response theory (IRT) true-score equating (TSE) and IRT observed-score equating (OSE) methods were used under common-item nonequivalent groups design.…

  11. Using cognitive interviewing for test items to assess physical function in children with cerebral palsy.

    Science.gov (United States)

    Dumas, Helene M; Watson, Kyle; Fragala-Pinkham, Maria A; Haley, Stephen M; Bilodeau, Nathalie; Montpetit, Kathleen; Gorton, George E; Mulcahey, M J; Tucker, Carole A

    2008-01-01

    The purpose of this study was to assess the content, format, and comprehension of test items and responses developed for use in a computer adaptive test (CAT) of physical function for children with cerebral palsy (CP). After training in cognitive interviewing techniques, investigators defined item intent and developed questions for each item. Parents of children with CP (n = 27) participated in interviews probing item meaning, item wording, and response choice adequacy and appropriateness. Qualitative analysis identified 3 themes: item clarity; relevance, context, and attribution; and problems with wording or tone. Parents reported the importance of delineating task components, assistance amount, and environmental context. Cognitive interviewing provided valuable information about the validity of new items and insight to improve relevance and context. We believe that the development of CATs in pediatric rehabilitation may ultimately reduce the impact of the issues identified.

  12. The Development, Validation and Application of an External Criterion Measure of Achievement Test Item Bias.

    Science.gov (United States)

    Harms, Robert A.

    Based on John Rawls' theory of justice as fairness, a nine-item rating scale was developed to serve as a criterion in studies of test item bias. Two principles underlie the scale: (1) Within a defined usage, test items should not affect students so that they are unable to do as well as their abilities would indicate; and (2) within the domain of a…

  14. Small-Item Vapor Test Method, FY11 Release

    Science.gov (United States)

    2012-07-01

    exposed. sessile drop: a liquid droplet that is firmly attached to a surface. If the droplet significantly spreads across the surface, it is better… following information regarding the observations: a written description of the applied drops as they appeared for each sample (e.g., sessile, spread)… techniques for this method. The airflow and air volume are key variables required to assess risk. Residual agent: because full-item extraction…

  15. Origin bias of test items compromises the validity and fairness of curriculum comparisons.

    Science.gov (United States)

    Muijtjens, Arno M M; Schuwirth, Lambert W T; Cohen-Schotanus, Janke; van der Vleuten, Cees P M

    2007-12-01

    To determine whether items of progress tests used for inter-curriculum comparison favour students from the medical school where the items were produced (i.e. whether the origin bias of test items is a potential confounder in comparisons between curricula). We investigated scores of students from different schools on subtests consisting of progress test items constructed by authors from the different schools. In a cross-institutional collaboration between 3 medical schools, progress tests are jointly constructed and simultaneously administered to all students at the 3 schools. Test score data for 6 consecutive progress tests were investigated. Participants consisted of approximately 5000 undergraduate medical students from 3 medical schools. The main outcome measure was the difference between the scores on subtests of items constructed by authors from 2 of the collaborating schools (subtest difference score). The subtest difference scores showed that students obtained better results on items produced at their own schools. This effect was more pronounced in Years 2-5 of the curriculum than in Year 1, and diminished in Year 6. Progress test items were subject to origin bias. As a consequence, all participating schools should contribute equal numbers of test items if tests are to be used for valid and fair inter-curriculum comparisons.
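
    The subtest difference score described above can be sketched in a few lines; the data layout, field names, and scoring below are invented for illustration and may differ from the study's actual computation:

```python
# Sketch: "subtest difference score" for one student, comparing
# proportion-correct on items authored at two schools. The data layout
# and field names are invented, not the study's actual pipeline.
def subtest_difference(responses, item_origin, own_school, other_school):
    """responses: item_id -> 0/1; item_origin: item_id -> authoring school."""
    def prop_correct(school):
        items = [i for i in responses if item_origin[i] == school]
        return sum(responses[i] for i in items) / len(items)
    return prop_correct(own_school) - prop_correct(other_school)

item_origin = {"q1": "A", "q2": "A", "q3": "B", "q4": "B"}
student = {"q1": 1, "q2": 1, "q3": 1, "q4": 0}  # a school-A student
print(subtest_difference(student, item_origin, "A", "B"))  # 0.5
```

    A positive value means the student scored better on items authored at their own school, which is the pattern the study reports.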

  16. Considering the Use of General and Modified Assessment Items in Computerized Adaptive Testing

    Science.gov (United States)

    Wyse, Adam E.; Albano, Anthony D.

    2015-01-01

    This article used several data sets from a large-scale state testing program to examine the feasibility of combining general and modified assessment items in computerized adaptive testing (CAT) for different groups of students. Results suggested that several of the assumptions made when employing this type of mixed-item CAT may not be met for…

  17. Applications of NLP Techniques to Computer-Assisted Authoring of Test Items for Elementary Chinese

    Science.gov (United States)

    Liu, Chao-Lin; Lin, Jen-Hsiang; Wang, Yu-Chun

    2010-01-01

    The authors report an implemented environment for computer-assisted authoring of test items and provide a brief discussion about the applications of NLP techniques for computer assisted language learning. Test items can serve as a tool for language learners to examine their competence in the target language. The authors apply techniques for…

  18. Effects of Using Modified Items to Test Students with Persistent Academic Difficulties

    Science.gov (United States)

    Elliott, Stephen N.; Kettler, Ryan J.; Beddow, Peter A.; Kurz, Alexander; Compton, Elizabeth; McGrath, Dawn; Bruen, Charles; Hinton, Kent; Palmer, Porter; Rodriguez, Michael C.; Bolt, Daniel; Roach, Andrew T.

    2010-01-01

    This study investigated the effects of using modified items in achievement tests to enhance accessibility. An experiment determined whether tests composed of modified items would reduce the performance gap between students eligible for an alternate assessment based on modified achievement standards (AA-MAS) and students not eligible, and the…

  19. The Value of Self-Test Items in Tape-Slide Instruction

    Science.gov (United States)

    Jones, Hazel C.

    1976-01-01

    Two tape-slide sequences in general pathology were used in an experiment to assess the value of self-test items and to determine whether it is better to intersperse the self-test items in the program or place them all at the end of the sequence. A highly favorable attitude to the latter method is reported. (Author/LBH)

  20. The Comparability of the Statistical Characteristics of Test Items Generated by Computer Algorithms.

    Science.gov (United States)

    Meisner, Richard; And Others

    This paper presents a study on the generation of mathematics test items using algorithmic methods. The history of this approach is briefly reviewed and is followed by a survey of the research to date on the statistical parallelism of algorithmically generated mathematics items. Results are presented for 8 parallel test forms generated using 16…

  1. A review of methods for evaluating the fit of item score patterns on a test

    NARCIS (Netherlands)

    Meijer, R.R.; Sijtsma, Klaas

    1999-01-01

    Methods are discussed that can be used to investigate the fit of an item score pattern to a test model. Model-based tests and personality inventories are administered to more than 100 million people a year and, as a result, individual fit is of great concern. Item Response Theory (IRT) modeling and

  3. A Preliminary Analysis of the Linguistic Complexity of Numeracy Skills Test Items for Pre Service Teachers

    Science.gov (United States)

    O'Keeffe, Lisa

    2016-01-01

    Language is frequently discussed as barrier to mathematics word problems. Hence this paper presents the initial findings of a linguistic analysis of numeracy skills test sample items. The theoretical perspective of multi-modal text analysis underpinned this study, in which data was extracted from the ten sample numeracy test items released by the…

  4. A Method for Generating Educational Test Items That Are Aligned to the Common Core State Standards

    Science.gov (United States)

    Gierl, Mark J.; Lai, Hollis; Hogan, James B.; Matovinovic, Donna

    2015-01-01

    The demand for test items far outstrips the current supply. This increased demand can be attributed, in part, to the transition to computerized testing, but, it is also linked to dramatic changes in how 21st century educational assessments are designed and administered. One way to address this growing demand is with automatic item generation.…

  5. Relationships among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models

    Science.gov (United States)

    Kohli, Nidhi; Koran, Jennifer; Henn, Lisa

    2015-01-01

    There are well-defined theoretical differences between the classical test theory (CTT) and item response theory (IRT) frameworks. It is understood that in the CTT framework, person and item statistics are test- and sample-dependent. This is not the perception with IRT. For this reason, the IRT framework is considered to be theoretically superior…

  7. Secondary Psychometric Examination of the Dimensional Obsessive-Compulsive Scale: Classical Testing, Item Response Theory, and Differential Item Functioning.

    Science.gov (United States)

    Thibodeau, Michel A; Leonard, Rachel C; Abramowitz, Jonathan S; Riemann, Bradley C

    2015-12-01

    The Dimensional Obsessive-Compulsive Scale (DOCS) is a promising measure of obsessive-compulsive disorder (OCD) symptoms but has received minimal psychometric attention. We evaluated the utility and reliability of DOCS scores. The study included 832 students and 300 patients with OCD. Confirmatory factor analysis supported the originally proposed four-factor structure. DOCS total and subscale scores exhibited good to excellent internal consistency in both samples (α = .82 to α = .96). Patient DOCS total scores reduced substantially during treatment (t = 16.01, d = 1.02). DOCS total scores discriminated between students and patients (sensitivity = 0.76, 1 - specificity = 0.23). The measure did not exhibit gender-based differential item functioning as tested by Mantel-Haenszel chi-square tests. Expected response options for each item were plotted as a function of item response theory and demonstrated that DOCS scores incrementally discriminate OCD symptoms ranging from low to extremely high severity. Incremental differences in DOCS scores appear to represent unbiased and reliable differences in true OCD symptom severity.
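
    Two of the statistics reported above, Cronbach's alpha and Cohen's d, can be computed from scratch. This is a generic sketch with invented toy data, not the study's analysis code, and the pooled SD in `cohens_d` is a simplification:

```python
# Generic internal-consistency (Cronbach's alpha) and effect-size
# (Cohen's d) computations; the data shapes below are invented.
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, same respondents in each."""
    k = len(item_scores)
    item_vars = sum(statistics.pvariance(s) for s in item_scores)
    totals = [sum(vals) for vals in zip(*item_scores)]
    return k / (k - 1) * (1 - item_vars / statistics.pvariance(totals))

def cohens_d(pre, post):
    """Standardized mean difference; uses a simplified common SD."""
    sd = statistics.stdev(list(pre) + list(post))
    return (statistics.mean(pre) - statistics.mean(post)) / sd

# Perfectly consistent items yield alpha = 1.0:
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))  # 1.0
```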

  8. Differential Item Functioning (DIF) among Spanish-Speaking English Language Learners (ELLs) in State Science Tests

    Science.gov (United States)

    Ilich, Maria O.

    Psychometricians and test developers evaluate standardized tests for potential bias against groups of test-takers by using differential item functioning (DIF). English language learners (ELLs) are a diverse group of students whose native language is not English. While they are still learning the English language, they must take their standardized tests for their school subjects, including science, in English. In this study, linguistic complexity was examined as a possible source of DIF that may result in test scores that confound science knowledge with a lack of English proficiency among ELLs. Two years of fifth-grade state science tests were analyzed for evidence of DIF using two DIF methods, Simultaneous Item Bias Test (SIBTest) and logistic regression. The tests presented a unique challenge in that the test items were grouped together into testlets, groups of items referring to a scientific scenario to measure knowledge of different science content or skills. Very large samples of 10,256 students in 2006 and 13,571 students in 2007 were examined. Half of each sample was composed of Spanish-speaking ELLs; the balance comprised native English speakers. The two DIF methods were in agreement about the items that favored non-ELLs and the items that favored ELLs. Logistic regression effect sizes were all negligible, while SIBTest flagged items with low to high DIF. A decrease in socioeconomic status and Spanish-speaking ELL diversity may have led to inconsistent SIBTest effect sizes for items used in both testing years. The DIF results for the testlets suggested that ELLs lacked sufficient opportunity to learn science content. The DIF results further suggest that those constructed response test items requiring the student to draw a conclusion about a scientific investigation or to plan a new investigation tended to favor ELLs.
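
    The matching logic underlying DIF screening can be illustrated with a much-simplified standardization-style index. This is neither SIBTest nor the logistic-regression procedure used in the study, and the data and group labels are invented:

```python
# Much-simplified standardization-style DIF index: match examinees on
# total score, compare item proportion-correct between groups within
# each stratum, and average with focal-group weights. This is neither
# SIBTest nor logistic-regression DIF; data and labels are invented.
from collections import defaultdict

def std_dif_index(records):
    """records: (group, total_score, item_correct) triples, group in
    {'ref', 'foc'}. Negative values mean the item disfavors 'foc'."""
    strata = defaultdict(lambda: {"ref": [], "foc": []})
    for group, total, correct in records:
        strata[total][group].append(correct)
    num = den = 0.0
    for cell in strata.values():
        if cell["ref"] and cell["foc"]:
            w = len(cell["foc"])
            num += w * (sum(cell["foc"]) / len(cell["foc"])
                        - sum(cell["ref"]) / len(cell["ref"]))
            den += w
    return num / den

records = [("ref", 5, 1), ("ref", 5, 1), ("foc", 5, 1), ("foc", 5, 0)]
print(std_dif_index(records))  # -0.5: item harder for the matched 'foc' group
```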

  9. The value of self-test items in tape--slide instruction.

    Science.gov (United States)

    Jones, H C

    1976-07-01

    Two tape-slide sequences in general pathology were used in an experiment to assess the value of self-test items and to determine whether it is better to intersperse the self-test items in the programme or place them all at the end of the sequence. The programmes were presented to a random sample of thirty-six students from a class of 149 in one of three forms: version 1, tape-slide programme without self-test items; version 2, tape-slide programme with self-test items interspersed between sections of the sequence; and version 3, tape-slide programme with all self-test items at the end of the sequence. Each student worked a pre-test before studying versions 1, 2 or 3 of a programme. A week later they worked through the post-test which was identical to the pre-test. At the same time they filled in a short attitude questionnaire on the teaching method. All students learned from the programmes. There was improvement in the post-test on the mean pre-test scores for all versions of both programmes. For one programme there was no significant difference between the mean post-test scores for students studying versions 1, 2 or 3, but for the other programme there was a significant difference between the versions. In this case the inclusion of self-test items was better for learning than no self-test items, and it was better if the self-test items were placed at the end of the sequence. A highly favourable attitude to the method is reported.

  10. The effects of linguistic modification on ESL students' comprehension of nursing course test items.

    Science.gov (United States)

    Bosher, Susan; Bowles, Melissa

    2008-01-01

    Recent research has indicated that language may be a source of construct-irrelevant variance for non-native speakers of English, or English as a second language (ESL) students, when they take exams. As a result, exams may not accurately measure knowledge of nursing content. One accommodation often used to level the playing field for ESL students is linguistic modification, a process by which the reading load of test items is reduced while the content and integrity of the item are maintained. Research on the effects of linguistic modification has been conducted on examinees in the K-12 population, but is just beginning in other areas. This study describes the collaborative process by which items from a pathophysiology exam were linguistically modified and subsequently evaluated for comprehensibility by ESL students. Findings indicate that in a majority of cases, modification improved examinees' comprehension of test items. Implications for test item writing and future research are discussed.

  11. The Role of Psychometric Modeling in Test Validation: An Application of Multidimensional Item Response Theory

    Science.gov (United States)

    Schilling, Stephen G.

    2007-01-01

    In this paper the author examines the role of item response theory (IRT), particularly multidimensional item response theory (MIRT) in test validation from a validity argument perspective. The author provides justification for several structural assumptions and interpretations, taking care to describe the role he believes they should play in any…

  12. A Stochastic Method for Balancing Item Exposure Rates in Computerized Classification Tests

    Science.gov (United States)

    Huebner, Alan; Li, Zhushan

    2012-01-01

    Computerized classification tests (CCTs) classify examinees into categories such as pass/fail, master/nonmaster, and so on. This article proposes the use of stochastic methods from sequential analysis to address item overexposure, a practical concern in operational CCTs. Item overexposure is traditionally dealt with in CCTs by the Sympson-Hetter…

  13. Restrictive Stochastic Item Selection Methods in Cognitive Diagnostic Computerized Adaptive Testing

    Science.gov (United States)

    Wang, Chun; Chang, Hua-Hua; Huebner, Alan

    2011-01-01

    This paper proposes two new item selection methods for cognitive diagnostic computerized adaptive testing: the restrictive progressive method and the restrictive threshold method. They are built upon the posterior weighted Kullback-Leibler (KL) information index but include additional stochastic components either in the item selection index or in…
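
    The information indices that such selection rules build on can be illustrated with plain maximum Fisher-information selection under the 2PL model. This baseline sketch is not the restrictive progressive or restrictive threshold method itself, and the item parameters are invented:

```python
# Maximum Fisher-information item selection under the 2PL model; a
# baseline sketch, not the restrictive progressive/threshold methods.
# The (a, b) item parameters below are invented.
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_item(theta, pool, administered):
    """Greedily pick the most informative unused item at ability theta."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info(theta, *pool[i]))

pool = [(1.0, -1.0), (1.2, 0.0), (0.8, 2.0)]  # (a, b) pairs
print(pick_item(0.0, pool, set()))  # 1: its difficulty b = 0 matches theta
```

    The stochastic methods in the paper perturb exactly this kind of greedy rule so that the most informative items are not overexposed.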

  14. Detection of aberrant item score patterns in computerized adaptive testing : An empirical example using the CUSUM

    NARCIS (Netherlands)

    Egberink, Iris J. L.; Meijer, Rob R.; Veldkamp, Bernard P.; Schakel, Lolle; Smid, Nico G.

    2010-01-01

    The scalability of individual trait scores on a computerized adaptive test (CAT) was assessed through investigating the consistency of individual item score patterns. A sample of N = 428 persons completed a personality CAT as part of a career development procedure. To detect inconsistent item score
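
    The CUSUM idea, accumulating residuals between observed and model-expected item scores until a bound is crossed, can be sketched as follows (the threshold and probabilities are illustrative, not the paper's calibrated values):

```python
# Sketch of a CUSUM person-fit check: accumulate residuals between
# observed item scores (0/1) and model-expected probabilities; flag the
# response pattern if either cumulative sum drifts past a bound.
def cusum_flag(observed, expected, threshold=2.0):
    upper = lower = 0.0
    for x, p in zip(observed, expected):
        resid = x - p
        upper = max(0.0, upper + resid)  # "better than expected" runs
        lower = min(0.0, lower + resid)  # "worse than expected" runs
        if upper > threshold or lower < -threshold:
            return True
    return False

# A person who keeps failing items the model says they should pass:
print(cusum_flag([0, 0, 0, 0, 0], [0.9] * 5))  # True
```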

  15. The Prediction of Item Parameters Based on Classical Test Theory and Latent Trait Theory

    Science.gov (United States)

    Anil, Duygu

    2008-01-01

    In this study, the power of experts' predictions of item characteristics, for conditions under which try-out administrations cannot be applied, was examined against item characteristics computed according to classical test theory and the two-parameter logistic model of latent trait theory. The study was carried out on 9914 randomly selected students…

  16. Criterion-Referenced Test (CRT) Items for Air Conditioning, Heating and Refrigeration.

    Science.gov (United States)

    Davis, Diane, Ed.

    These criterion-referenced test (CRT) items for air conditioning, heating, and refrigeration are keyed to the Missouri Air Conditioning, Heating, and Refrigeration Competency Profile. The items are designed to work with both the Vocational Instructional Management System and Vocational Administrative Management System. For word processing and…

  17. Predicting Item Difficulty in a Reading Comprehension Test with an Artificial Neural Network.

    Science.gov (United States)

    Perkins, Kyle; And Others

    1995-01-01

    This article reports the results of using a three-layer back propagation artificial neural network to predict item difficulty in a reading comprehension test. Three classes of variables were examined: text structure, propositional analysis, and cognitive demand. Results demonstrate that the networks can consistently predict item difficulty. (JL)

  18. Optimal stratification of item pools in α-stratified computerized adaptive testing

    NARCIS (Netherlands)

    Chang, Hua-Hua; Linden, van der Wim J.

    2003-01-01

    A method based on 0-1 linear programming (LP) is presented to stratify an item pool optimally for use in α-stratified adaptive testing. Because the 0-1 LP model belongs to the subclass of models with a network flow structure, efficient solutions are possible. The method is applied to a previous item
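
    For intuition, α-stratification itself can be sketched as a simple quantile split on the discrimination parameter; the paper's contribution is an optimal 0-1 LP formulation, which this sketch does not implement:

```python
# alpha-stratification as a plain quantile split on the discrimination
# parameter a: low-a strata feed early CAT stages, high-a strata later
# ones. This conveys the basic idea only, not the paper's 0-1 LP optimum.
def alpha_stratify(a_params, k):
    order = sorted(range(len(a_params)), key=lambda i: a_params[i])
    size = len(a_params) // k
    return [order[j * size:(j + 1) * size] for j in range(k)]

# Six items, three strata of two items each (a-values invented):
print(alpha_stratify([0.5, 1.5, 0.9, 2.0, 0.7, 1.2], 3))
# [[0, 4], [2, 5], [1, 3]]
```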

  19. Studying Differential Item Functioning via Latent Variable Modeling: A Note on a Multiple-Testing Procedure

    Science.gov (United States)

    Raykov, Tenko; Marcoulides, George A.; Lee, Chun-Lung; Chang, Chi

    2013-01-01

    This note is concerned with a latent variable modeling approach for the study of differential item functioning in a multigroup setting. A multiple-testing procedure that can be used to evaluate group differences in response probabilities on individual items is discussed. The method is readily employed when the aim is also to locate possible…

  20. Modeling and Testing Differential Item Functioning in Unidimensional Binary Item Response Models with a Single Continuous Covariate: A Functional Data Analysis Approach.

    Science.gov (United States)

    Liu, Yang; Magnus, Brooke E; Thissen, David

    2016-06-01

    Differential item functioning (DIF), referring to between-group variation in item characteristics above and beyond the group-level disparity in the latent variable of interest, has long been regarded as an important item-level diagnostic. The presence of DIF impairs the fit of the single-group item response model being used, and calls for either model modification or item deletion in practice, depending on the mode of analysis. Methods for testing DIF with continuous covariates, rather than categorical grouping variables, have been developed; however, they are restrictive in parametric forms, and thus are not sufficiently flexible to describe complex interaction among latent variables and covariates. In the current study, we formulate the probability of endorsing each test item as a general bivariate function of a unidimensional latent trait and a single covariate, which is then approximated by a two-dimensional smoothing spline. The accuracy and precision of the proposed procedure is evaluated via Monte Carlo simulations. If anchor items are available, we proposed an extended model that simultaneously estimates item characteristic functions (ICFs) for anchor items, ICFs conditional on the covariate for non-anchor items, and the latent variable density conditional on the covariate-all using regression splines. A permutation DIF test is developed, and its performance is compared to the conventional parametric approach in a simulation study. We also illustrate the proposed semiparametric DIF testing procedure with an empirical example.
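
    The permutation-testing idea can be conveyed with a generic permutation test for a difference in group means; the paper permutes a DIF statistic under a semiparametric model, so this sketch only illustrates the resampling logic, with invented data:

```python
# Generic permutation test for a difference in group means; conveys the
# resampling logic behind a permutation DIF test. Data and seed invented.
import random
import statistics

def perm_test(x, y, n_perm=2000, seed=7):
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(statistics.mean(px) - statistics.mean(py)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing of the p-value
```

    A large observed difference is rarely matched by shuffled group labels, so the p-value is small; identical groups give a p-value of 1.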

  1. Item selection and ability estimation in adaptive testing

    NARCIS (Netherlands)

    Linden, van der Wim J.; Pashley, Peter J.; Linden, van der Wim J.; Glas, Cees A.W.

    2010-01-01

    The last century saw a tremendous progression in the refinement and use of standardized linear tests. The first administered College Board exam occurred in 1901 and the first Scholastic Assessment Test (SAT) was given in 1926. Since then, progressively more sophisticated standardized linear tests have…

  2. Ability or Access-Ability: Differential Item Functioning of Items on Alternate Performance-Based Assessment Tests for Students with Visual Impairments

    Science.gov (United States)

    Zebehazy, Kim T.; Zigmond, Naomi; Zimmerman, George J.

    2012-01-01

    Introduction: This study investigated differential item functioning (DIF) of test items on Pennsylvania's Alternate System of Assessment (PASA) for students with visual impairments and severe cognitive disabilities and what the reasons for the differences may be. Methods: The Wilcoxon signed ranks test was used to analyze differences in the scores…

  3. Development of an item bank for computerized adaptive test (CAT) measurement of pain

    DEFF Research Database (Denmark)

    Petersen, Morten Aa.; Aaronson, Neil K; Chie, Wei-Chu

    2016-01-01

    PURPOSE: Patient-reported outcomes should ideally be adapted to the individual patient while maintaining comparability of scores across patients. This is achievable using computerized adaptive testing (CAT). The aim here was to develop an item bank for CAT measurement of the pain domain as measured by the EORTC QLQ-C30 questionnaire. METHODS: The development process consisted of four steps: (1) literature search, (2) formulation of new items and expert evaluations, (3) pretesting and (4) field-testing and psychometric analyses for the final selection of items. RESULTS: In step 1, we identified 337 pain… were obtained from 1103 cancer patients from five countries. Psychometric evaluations showed that 16 items could be retained in a unidimensional item bank. Evaluations indicated that use of the CAT measure may reduce sample size requirements by 15-25% compared to using the QLQ-C30 pain scale…

  4. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory

    Directory of Open Access Journals (Sweden)

    Tomasz Smoleń

    2015-08-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf test, the Raven test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.

  5. The quadratic relationship between difficulty of intelligence test items and their correlations with working memory.

    Science.gov (United States)

    Smolen, Tomasz; Chuderski, Adam

    2015-01-01

    Fluid intelligence (Gf) is a crucial cognitive ability that involves abstract reasoning in order to solve novel problems. Recent research demonstrated that Gf strongly depends on the individual effectiveness of working memory (WM). We investigated a popular claim that if the storage capacity underlay the WM-Gf correlation, then such a correlation should increase with an increasing number of items or rules (load) in a Gf-test. As often no such link is observed, on that basis the storage-capacity account is rejected, and alternative accounts of Gf (e.g., related to executive control or processing speed) are proposed. Using both analytical inference and numerical simulations, we demonstrated that the load-dependent change in correlation is primarily a function of the amount of floor/ceiling effect for particular items. Thus, the item-wise WM correlation of a Gf-test depends on its overall difficulty, and the difficulty distribution across its items. When the early test items yield huge ceiling, but the late items do not approach floor, that correlation will increase throughout the test. If the early items locate themselves between ceiling and floor, but the late items approach floor, the respective correlation will decrease. For a hallmark Gf-test, the Raven-test, whose items span from ceiling to floor, the quadratic relationship is expected, and it was shown empirically using a large sample and two types of WMC tasks. In consequence, no changes in correlation due to varying WM/Gf load, or lack of them, can yield an argument for or against any theory of WM/Gf. Moreover, as the mathematical properties of the correlation formula make it relatively immune to ceiling/floor effects for overall moderate correlations, only minor changes (if any) in the WM-Gf correlation should be expected for many psychological tests.
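
    The attenuation mechanism described above, item-ability correlations shrinking as an item approaches ceiling, can be demonstrated with a small simulation; the thresholds, sample size, and seed below are invented for illustration:

```python
# Simulation of the floor/ceiling attenuation mechanism: the correlation
# between latent ability and a pass/fail item score shrinks when nearly
# everyone passes the item. Thresholds, n, and seed are invented.
import random
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

def item_ability_corr(threshold, n=20000, seed=1):
    rng = random.Random(seed)
    ability = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scores = [1.0 if a > threshold else 0.0 for a in ability]
    return pearson(ability, scores)

r_mid = item_ability_corr(0.0)       # item of middling difficulty
r_ceiling = item_ability_corr(-2.0)  # near-ceiling item: ~98% pass
print(r_ceiling < r_mid)  # True: the ceiling attenuates the correlation
```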

  6. Air Force Officer Qualifying Test Form T: Initial Item-, Test-, Factor-,and Composite-Level Analyses

    Science.gov (United States)

    2016-12-01

    AFRL-RH-WP-TR-2016-0093: Air Force Officer Qualifying Test Form T: Initial Item-, Test-, Factor-, and Composite-Level Analyses. Reporting period: July 2016 – 28 Nov 2016. Contract number: FA8650-11-C-6158; program element: 62202F.

  7. Quantitative Penetration Testing with Item Response Theory (extended version)

    NARCIS (Netherlands)

    Arnold, F.; Pieters, W.; Stoelinga, M.I.A.

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Thus, penetration testing has so far been used as a qualitative research method. To enable quantitative approaches to security risk management, including

  8. Quantitative penetration testing with item response theory (extended version)

    NARCIS (Netherlands)

    Arnold, Florian; Pieters, Wolter; Stoelinga, Mariëlle

    2013-01-01

    Existing penetration testing approaches assess the vulnerability of a system by determining whether certain attack paths are possible in practice. Therefore, penetration testing has thus far been used as a qualitative research method. To enable quantitative approaches to security risk management, in

  9. Comparison of Procedures for Detecting Test-Item Bias with Both Internal and External Ability Criteria.

    Science.gov (United States)

    Shepard, Lorrie; And Others

    1981-01-01

    Sixteen approaches for detecting item bias were compared on samples of Black, White, and Chicano elementary school pupils using the Lorge-Thorndike and Raven's Coloured Progressive Matrices tests. Recommendations for practical use are made. (JKS)

  10. Can a Multidimensional Test Be Evaluated with Unidimensional Item Response Theory?

    Science.gov (United States)

    Wiberg, Marie

    2012-01-01

    The aim of this study was to evaluate possible consequences of using unidimensional item response theory (UIRT) on a multidimensional college admission test. The test consists of 5 subscales and can be divided into two sections, that is, it can be considered both as a unidimensional and a multidimensional test. The test was examined with both UIRT…

  11. [Reference Intervals of Standard Test Items in Ningen Dock Examination].

    Science.gov (United States)

    Yamakado, Minoru

    2016-03-01

    Reference intervals (RIs) were derived from records of 1,499,288 individuals who underwent ningen dock examination in 188 institutes which belong to the Japan Society of Ningen Dock in 2012. Targets were 27 basic laboratory tests, including the body mass index (BMI) and systolic and diastolic blood pressures (SBP, DBP). Individuals fulfilling strict criteria were chosen: SBP … dock results will enable the appropriate interpretation of test results in health screening, and promote the effective application of CDLs for therapeutic intervention, taking into account the sex, age, and other health attributes.
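
    A nonparametric reference interval of the kind derived in such studies is simply the central 95% of the reference sample. The sketch below uses linearly interpolated percentiles and is a generic illustration, not the study's exact derivation procedure:

```python
# Nonparametric reference interval: the central 95% of a reference
# sample, via linearly interpolated 2.5th and 97.5th percentiles.
def reference_interval(values, lower=0.025, upper=0.975):
    s = sorted(values)
    def pct(q):
        idx = q * (len(s) - 1)
        lo = int(idx)
        frac = idx - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])
    return pct(lower), pct(upper)

# For the values 0..100 the interval is approximately (2.5, 97.5):
print(reference_interval(list(range(101))))
```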

  12. Examining item-position effects in large-scale assessment using the Linear Logistic Test Model

    Directory of Open Access Journals (Sweden)

    CHRISTINE HOHENSINN

    2008-09-01

    Full Text Available When administering large-scale assessments, item-position effects are of particular importance because the applied test designs very often contain several test booklets with the same items presented at different test positions. Establishing such position effects would be most critical; it would mean that the estimated item parameters do not depend exclusively on the items’ difficulties due to content but also on their presentation positions. As a consequence, item calibration would be biased. By means of the linear logistic test model (LLTM), item-position effects can be tested. In this paper, the results of a simulation study demonstrating how the LLTM is indeed able to detect certain position effects in the framework of a large-scale assessment are presented first. Second, empirical item-position effects of a specific large-scale competence assessment in mathematics (4th grade students) are analyzed using the LLTM. The results indicate that a small fatigue effect seems to take place. The most important consequence of the given paper is that it is advisable to try pertinent simulation studies before an analysis of empirical data takes place; the reason is that, for the given example, the suggested likelihood-ratio test neither holds the nominal type-I risk nor qualifies as “robust”, and furthermore occasionally shows very low power.
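In the LLTM, each item's Rasch difficulty is decomposed into a weighted sum of basic parameters, which is how a position (fatigue) effect can enter the model alongside content difficulty. A minimal sketch, with made-up illustrative weights rather than the paper's estimates:

```python
import math

def lltm_difficulty(q_row, eta):
    """LLTM decomposition: an item's Rasch difficulty is a weighted sum
    of basic parameters eta, with weights q_row from the design matrix."""
    return sum(q * e for q, e in zip(q_row, eta))

def p_correct(theta, beta):
    """Rasch model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# Illustrative basic parameters: eta[0] = content difficulty,
# eta[1] = fatigue increment per test position (both values assumed).
eta = [0.5, 0.05]
early = lltm_difficulty([1, 1], eta)    # same item shown at position 1
late = lltm_difficulty([1, 20], eta)    # same item shown at position 20
```

Under a positive fatigue parameter, the identical item becomes harder late in the booklet, which is exactly the kind of position effect that would bias calibration if ignored.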

  13. Information-Processing on Intelligence Test Items: Some Response Components

    Science.gov (United States)

    Whitely, Susan E.

    1977-01-01

    A factor analysis was used to study the relationships among response time and accuracy scores for a verbal analogies test, as well as a number of experimental variables designed to measure a series of information processing stages of the analogies task. (CTM)

  14. Pretest Item Analyses Using Polynomial Logistic Regression: An Approach to Small Sample Calibration Problems Associated with Computerized Adaptive Testing.

    Science.gov (United States)

    Patsula, Liane N.; Pashley, Peter J.

    Many large-scale testing programs routinely pretest new items alongside operational (or scored) items to determine their empirical characteristics. If these pretest items pass certain statistical criteria, they are placed into an operational item pool; otherwise they are edited and re-pretested or simply discarded. In these situations, reliable…

  15. A New Method for Assessing the Statistical Significance in the Differential Functioning of Items and Tests (DFIT) Framework

    Science.gov (United States)

    Oshima, T. C.; Raju, Nambury S.; Nanda, Alice O.

    2006-01-01

    A new item parameter replication method is proposed for assessing the statistical significance of the noncompensatory differential item functioning (NCDIF) index associated with the differential functioning of items and tests framework. In this new method, a cutoff score for each item is determined by obtaining a (1-alpha ) percentile rank score…

  16. A comparison of item response theory-based methods for examining differential item functioning in object naming test by language of assessment among older Latinos

    Directory of Open Access Journals (Sweden)

    Frances M. Yang

    2011-12-01

    Full Text Available Object naming tests are commonly included in neuropsychological test batteries. Differential item functioning (DIF) in these tests due to cultural and language differences may compromise the validity of cognitive measures in diverse populations. We evaluated 26 object naming items for DIF due to Spanish and English language translations among Latinos (n=1,159), mean age of 70.5 years old (Standard Deviation (SD)±7.2), using the following four item response theory-based approaches: Mplus/Multiple Indicator, Multiple Causes (Mplus/MIMIC; Muthén & Muthén, 1998-2011), Item Response Theory Likelihood Ratio Differential Item Functioning (IRTLRDIF/MULTILOG; Thissen, 1991, 2001), difwithpar/Parscale (Crane, Gibbons, Jolley, & van Belle, 2006; Muraki & Bock, 2003), and Differential Functioning of Items and Tests/MULTILOG (DFIT/MULTILOG; Flowers, Oshima, & Raju, 1999; Thissen, 1991). Overall, there was moderate to near perfect agreement across methods. Fourteen items were found to exhibit DIF and 5 items observed consistently across all methods, which were more likely to be answered correctly by individuals tested in Spanish after controlling for overall ability.

  17. A comparison of item response theory-based methods for examining differential item functioning in object naming test by language of assessment among older Latinos.

    Science.gov (United States)

    Yang, Frances M; Heslin, Kevin C; Mehta, Kala M; Yang, Cheng-Wu; Ocepek-Welikson, Katja; Kleinman, Marjorie; Morales, Leo S; Hays, Ron D; Stewart, Anita L; Mungas, Dan; Jones, Richard N; Teresi, Jeanne A

    2011-01-01

    Object naming tests are commonly included in neuropsychological test batteries. Differential item functioning (DIF) in these tests due to cultural and language differences may compromise the validity of cognitive measures in diverse populations. We evaluated 26 object naming items for DIF due to Spanish and English language translations among Latinos (n=1,159), mean age of 70.5 years old (Standard Deviation (SD)±7.2), using the following four item response theory-based approaches: Mplus/Multiple Indicator, Multiple Causes (Mplus/MIMIC; Muthén & Muthén, 1998-2011), Item Response Theory Likelihood Ratio Differential Item Functioning (IRTLRDIF/MULTILOG; Thissen, 1991, 2001), difwithpar/Parscale (Crane, Gibbons, Jolley, & van Belle, 2006; Muraki & Bock, 2003), and Differential Functioning of Items and Tests/MULTILOG (DFIT/MULTILOG; Flowers, Oshima, & Raju, 1999; Thissen, 1991). Overall, there was moderate to near perfect agreement across methods. Fourteen items were found to exhibit DIF and 5 items observed consistently across all methods, which were more likely to be answered correctly by individuals tested in Spanish after controlling for overall ability.

  18. Overcoming the effects of differential skewness of test items in scale construction

    Directory of Open Access Journals (Sweden)

    Johann M. Schepers

    2004-10-01

    Full Text Available The principal objective of the study was to develop a procedure for overcoming the effects of differential skewness of test items in scale construction. It was shown that the degree of skewness of test items places an upper limit on the correlations between the items, regardless of the contents of the items. If the items are ordered in terms of skewness, the resulting intercorrelation matrix forms a simplex or a pseudo-simplex. Factoring such a matrix results in a multiplicity of factors, most of which are artifacts. A procedure for overcoming this problem was demonstrated with items from the Locus of Control Inventory (Schepers, 1995). The analysis was based on a sample of 1,662 first-year university students. Summary (translated from Afrikaans): The main aim of the study was to develop a procedure to counteract the effects of differential skewness of test items in scale construction. It was shown that the degree of skewness of test items places an upper limit on the correlations between the items, regardless of their content. If the items are arranged according to degree of skewness, their intercorrelation matrix will form a simplex or pseudo-simplex. If such a matrix is subjected to factor analysis, it yields a multiplicity of factors, most of which are artifacts. A procedure for overcoming this problem was demonstrated using the items of the Locus of Control Inventory (Schepers, 1995). The analyses were based on a sample of 1,662 first-year university students.

  19. Analysis Test of Understanding of Vectors with the Three-Parameter Logistic Model of Item Response Theory and Item Response Curves Technique

    Science.gov (United States)

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-01-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming…

  20. Item response theory analyses of the Cambridge Face Memory Test (CFMT).

    Science.gov (United States)

    Cho, Sun-Joo; Wilmer, Jeremy; Herzmann, Grit; McGugin, Rankin Williams; Fiset, Daniel; Van Gulick, Ana E; Ryan, Kaitlin F; Gauthier, Isabel

    2015-06-01

    We evaluated the psychometric properties of the Cambridge Face Memory Test (CFMT; Duchaine & Nakayama, 2006). First, we assessed the dimensionality of the test with a bifactor exploratory factor analysis (EFA). This EFA analysis revealed a general factor and 3 specific factors clustered by targets of CFMT. However, the 3 specific factors appeared to be minor factors that can be ignored. Second, we fit a unidimensional item response model. This item response model showed that the CFMT items could discriminate individuals at different ability levels and covered a wide range of the ability continuum. We found the CFMT to be particularly precise for a wide range of ability levels. Third, we implemented item response theory (IRT) differential item functioning (DIF) analyses for each gender group and 2 age groups (age ≤ 20 vs. age > 21). This DIF analysis suggested little evidence of consequential differential functioning on the CFMT for these groups, supporting the use of the test to compare older to younger, or male to female, individuals. Fourth, we tested for a gender difference on the latent facial recognition ability with an explanatory item response model. We found a significant but small gender difference on the latent ability for face recognition, which was higher for women than men by 0.184, at age mean 23.2, controlling for linear and quadratic age effects. Finally, we discuss the practical considerations of the use of total scores versus IRT scale scores in applications of the CFMT.

  1. Development of Abbreviated Eight-Item Form of the Penn Verbal Reasoning Test

    Science.gov (United States)

    Bilker, Warren B.; Wierzbicki, Michael R.; Brensinger, Colleen M.; Gur, Raquel E.; Gur, Ruben C.

    2014-01-01

    The ability to reason with language is a highly valued cognitive capacity that correlates with IQ measures and is sensitive to damage in language areas. The Penn Verbal Reasoning Test (PVRT) is a 29-item computerized test for measuring abstract analogical reasoning abilities using language. The full test can take over half an hour to administer, which limits its applicability in large-scale studies. We previously described a procedure for abbreviating a clinical rating scale and a modified procedure for reducing tests with a large number of items. Here we describe the application of the modified method to reducing the number of items in the PVRT to a parsimonious subset of items that accurately predicts the total score. As in our previous reduction studies, a split sample is used for model fitting and validation, with cross-validation to verify results. We find that an 8-item scale predicts the total 29-item score well, achieving a correlation of .9145 for the reduced form for the model fitting sample and .8952 for the validation sample. The results indicate that a drastically abbreviated version, which cuts administration time by more than 70%, can be safely administered as a predictor of PVRT performance. PMID:24577310

  2. Using set covering with item sampling to analyze the infeasibility of linear programming test assembly models

    NARCIS (Netherlands)

    Huitzing, HA

    2004-01-01

    This article shows how set covering with item sampling (SCIS) methods can be used in the analysis and preanalysis of linear programming models for test assembly (LPTA). LPTA models can construct tests, fulfilling a set of constraints set by the test assembler. Sometimes, no solution to the LPTA model…

  3. Item Transformation for Computer Assisted Language Testing: The Adaptation of the Spanish University Entrance Examination

    Science.gov (United States)

    Laborda, Jesus Garcia; Bakieva, Margarita; Gonzalez-Such, Jose; Pavon, Ana Sevilla

    2010-01-01

    Since the Spanish educational system is changing and promoting the use of online tests, it was necessary to study the transformation of test items in the "Spanish University Entrance Examination" (IB P.A.U.) so as to diminish the effect of test delivery changes (through its computerization) and affect the current model as little as possible. The…

  4. Psychometric equivalence of recorded spondaic words as test items.

    Science.gov (United States)

    Bilger, R C; Matthies, M L; Meyer, T A; Griffiths, S K

    1998-06-01

    In the determination of the speech-reception threshold (SRT), spondaic words are assumed to be homogeneous with respect to intelligibility; and the assumption of equal intelligibility requires that the words be comparable for all signal levels. Previous attempts to assess the equal intelligibility assumption using word thresholds as the sole criterion are not an adequate basis for specifying the equality of intelligibility. In the present study, the recorded spondaic words (Tillman recording) were analyzed in an attempt to create a more homogeneous set of spondaic words for future laboratory work. To achieve this goal, the data reported by Young, Dudley, and Gunter (1982) and data collected in our laboratory were fitted to a logistic function (psychometric function) from which a 50% point (threshold) and slope were obtained. To specify their acoustical parameters, the recorded spondaic words were digitized and the RMS level and duration of each syllable and word were calculated. None of the RMS or duration measures were correlated with word thresholds, so no attempt was made to equate level or duration. On the other hand, when the threshold of each word was adjusted to equal the mean threshold of the set (n = 36), the dispersion among word thresholds and slopes was greatly reduced. Further, we recommend that small sets of "equally intelligible" spondaic words not be used for clinical testing because set size is a strong factor in determining threshold for spondees (Meyer & Bilger, 1997; Punch & Howard, 1985).

  5. Classical item and test analysis with graphics: the ViSta-CITA program.

    Science.gov (United States)

    Ledesma, Rubén Daniel; Molina, J Gabriel

    2009-11-01

    Current advances in test development theory have mostly been influenced by item response theory. Notwithstanding this, classical test theory still plays a major part in the development of tests for applied educational and behavioral research. This article describes ViSta-CITA, a computer program that implements a set of classical item and test analysis methods that incorporate innovative graphics whose aim is to provide deeper insight into analysis results. Such an aim is achieved through the SpreadPlot, a graphical method designed to display multiple, simultaneous, interactive views of the analysis results. It behaves on a dynamic basis, so that users' changes (e.g., selecting a subset of items) are automatically updated in the graphical windows showing the analysis results. Moreover, ViSta-CITA is freely available, and its code is open to modifications or additions by the user. Features such as these constitute useful tools for research and teaching purposes related to test development.
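The core classical statistics that programs like ViSta-CITA report can be sketched in a few lines. The following is an illustrative pure-Python version (not ViSta-CITA's code) of two of them: item difficulty (proportion correct) and the corrected item-total point-biserial correlation, where "corrected" means the item is removed from the total score it is correlated with:

```python
from statistics import mean, pstdev

def item_analysis(responses):
    """Classical test theory item statistics (illustrative sketch).
    responses: list of per-examinee lists of 0/1 item scores.
    Returns per-item tuples of (difficulty, corrected point-biserial)."""
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    results = []
    for j in range(n_items):
        scores = [r[j] for r in responses]
        p = mean(scores)                                 # difficulty: proportion correct
        rest = [t - s for t, s in zip(totals, scores)]   # total with item j removed
        sx, sy = pstdev(scores), pstdev(rest)
        if sx == 0 or sy == 0:
            r_pb = 0.0                                   # zero-variance item: undefined, report 0
        else:
            cov = mean(s * t for s, t in zip(scores, rest)) - mean(scores) * mean(rest)
            r_pb = cov / (sx * sy)
        results.append((p, r_pb))
    return results

# Four examinees, three items; item 2 (index 1) is answered correctly by everyone.
data = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
stats = item_analysis(data)
```

An item everyone answers correctly (difficulty 1.0) carries no variance and therefore cannot discriminate, which is why such items are flagged during test development.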

  6. Detecting Differential Item Functioning and Differential Test Functioning on Math School Final-exam

    OpenAIRE

    - Mansyur; - Muliana

    2016-01-01

    This study aims at finding out the characteristics of Differential Item Functioning (DIF) and Differential Test Functioning (DTF) on the school final exam for the Math subject based on Item Response Theory (IRT). The subjects of this study were the questions and all of the students’ answer sheets, chosen by using a convenience sampling method; 286 responses were obtained, consisting of 147 male and 149 female students’ responses. The data of this study were collected using a documentation technique by quoting the resp...

  7. Peeking into personality test answers: inter- and intraindividual variety in item interpretations.

    Science.gov (United States)

    Arro, Grete

    2013-03-01

    Personality research today relies largely on inventories that have neither unambiguously interpretable items nor responses. The substantive process of generating the test answer is rarely investigated, and thus the possible field of meanings out of which the answer is created remains hidden. In order to investigate the possible array of spontaneous answers to personality test items, a situative open-ended personality inventory was developed to determine individuals' ways of interpreting personality test items and the personality descriptions relevant to them. The children's sample (N = 704 of 10-13 year olds) answered five free-response contextualized personality test questions, each related to one of the Five Factor Model personality dimensions. It was revealed that there is no universal interpretation of an item. First, different children's answers to the same question described different personality dimensions: a substantial number of the respondents' answers did not reflect the personality domain assumed in an item. So there are several ways to interpret test questions; answers may refer to different personality dimensions, and not necessarily the one assumed by the researcher. Second, a number of children mentioned more than one personality trait for one item, indicating that even within one person there may be several relevant interpretations of the same item. Considering personality traits as occurring one by one and mutually exclusively during personality test answering may be artificial; in reality, trait combinations may reflect the actual reaction. In sum, the results suggest there is no single predictable interpretational trajectory in the meaning construction process when semiotically mediated constructs, e.g., personality reflections, are assessed.

  8. Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

    Science.gov (United States)

    Rakkapao, Suttida; Prasitpong, Singha; Arayathanitkul, Kwan

    2016-12-01

    This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit with the three-parameter logistic model of IRT, using the parscale program. The TUV ability is an ability parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT analysis to be useful in assessing the test since its item parameters are independent of the ability parameters. The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test. Moreover, the IRC analysis can be used to assess the effectiveness of the test's distractors. Both IRT and IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests. Test developers can apply these methods to diagnose and evaluate the features of items at various ability levels of test takers.
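The three-parameter logistic model applied above has a standard closed form: the probability of a correct answer is the guessing floor c plus the remaining probability mass scaled by a logistic function of ability. A minimal sketch with illustrative parameter values (not the TUV estimates produced by parscale):

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) IRT model: probability that an
    examinee of ability theta answers correctly, given discrimination a,
    difficulty b, and pseudo-guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the logistic term equals 0.5, so the probability sits
# halfway between the guessing floor c and 1.0 (here 0.2 + 0.8 * 0.5).
p = p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)  # -> 0.6
```

This also shows why the guessing parameter matters for multiple-choice data: even very low-ability examinees keep a success probability near c, which classical proportion-correct indices conflate with item difficulty.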

  9. Development of Items for a Pedagogical Content Knowledge Test Based on Empirical Analysis of Pupils' Errors

    Science.gov (United States)

    Jüttner, Melanie; Neuhaus, Birgit J.

    2012-05-01

    In view of the lack of instruments for measuring biology teachers' pedagogical content knowledge (PCK), this article reports on a study about the development of PCK items for measuring teachers' knowledge of pupils' errors and ways of dealing with them. This study investigated 9th and 10th grade German pupils' (n = 461) drawings in an achievement test about the knee-jerk in biology, which were analysed using inductive qualitative content analysis. The empirical data were used for the development of the items in the PCK test. The validity of the items was assessed with think-aloud interviews of German secondary school teachers (n = 5). Once the items were settled, their reliability was tested using the results of German secondary school biology teachers (n = 65) who took the PCK test. The results indicated that these items are satisfactorily reliable (Cronbach's alpha values ranged from 0.60 to 0.65). We suggest that a larger sample size and American biology teachers be used in our further studies. The findings of this study about teachers' professional knowledge from the PCK test could provide new information about the influence of teachers' knowledge on their pupils' understanding of biology and their possible errors in learning biology.
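Cronbach's alpha values like those quoted above come from the standard formula relating the sum of item variances to the variance of total scores. A small illustrative sketch (not the study's computation), using toy data:

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a score matrix (rows = examinees,
    columns = items): (k/(k-1)) * (1 - sum(item variances) / total variance).
    Illustrative sketch; assumes at least two items and a nonzero
    total-score variance."""
    k = len(responses[0])
    item_vars = [pvariance([r[j] for r in responses]) for j in range(k)]
    total_var = pvariance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Perfectly parallel items (every examinee gets all or none correct)
# yield the maximum internal consistency, alpha = 1.0.
alpha = cronbach_alpha([[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]])
```

Values around 0.60-0.65, as reported for the PCK items, indicate that the items covary moderately but far from perfectly with the total score.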

  10. QUALITY TESTING OF HEAT TREATMENT OF MEDIUM-CARBON STEEL CONSTRUCTION ITEMS BASED ON THE BIPOLAR PULSED REMAGNETIZATION

    Directory of Open Access Journals (Sweden)

    V. F. Matyuk

    2014-01-01

    Full Text Available The features of bipolar pulsed remagnetization of medium-carbon construction steel items for testing the heat-treatment temperature and structure of these items are discussed, and methods of bipolar pulsed remagnetization that provide for the testing of items made of the steels considered are suggested.

  11. Power and Sample Size Calculations for Logistic Regression Tests for Differential Item Functioning

    Science.gov (United States)

    Li, Zhushan

    2014-01-01

    Logistic regression is a popular method for detecting uniform and nonuniform differential item functioning (DIF) effects. Theoretical formulas for the power and sample size calculations are derived for likelihood ratio tests and Wald tests based on the asymptotic distribution of the maximum likelihood estimators for the logistic regression model.…
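The logistic-regression DIF procedure whose power is analyzed here compares nested models for a single item: a reduced model with ability only, and a full model adding a group term (uniform DIF; a further group-by-ability interaction would capture nonuniform DIF). The following is a self-contained sketch on synthetic data; the gradient-ascent fit stands in for a statistics package, and the sample size, coefficients, and seed are all made up for illustration:

```python
import math
import random

def fit_logistic(X, y, lr=0.5, iters=2000):
    """Plain gradient-ascent logistic regression (pure-Python sketch).
    X: rows of features (no intercept column); y: 0/1 responses.
    Returns (weights, log-likelihood at the fitted weights)."""
    rows = [[1.0] + list(x) for x in X]        # prepend intercept
    w = [0.0] * len(rows[0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        w = [wj + lr * gj / len(y) for wj, gj in zip(w, grad)]
    ll = 0.0
    for xi, yi in zip(rows, y):
        p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
        ll += math.log(p if yi else 1.0 - p)
    return w, ll

# Synthetic responses to one item: ability drives correctness, and the
# focal group (group=1) finds the item harder at equal ability (uniform DIF).
random.seed(0)
data = []
for _ in range(400):
    ability, group = random.gauss(0, 1), random.randint(0, 1)
    z = 1.0 * ability - 1.2 * group            # illustrative true DIF effect
    data.append((ability, group, 1 if random.random() < 1 / (1 + math.exp(-z)) else 0))

ys = [y for _, _, y in data]
_, ll_reduced = fit_logistic([[a] for a, _, _ in data], ys)    # ability only
_, ll_full = fit_logistic([[a, g] for a, g, _ in data], ys)    # ability + group
lr_stat = 2 * (ll_full - ll_reduced)   # ~ chi-square(1) under no uniform DIF
```

Comparing `lr_stat` against the chi-square(1) critical value (3.84 at alpha = .05) gives the likelihood-ratio DIF test; the power formulas in the paper predict how often this statistic exceeds the critical value at a given sample size and effect size.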

  12. Cognitive Diagnostic Models for Tests with Multiple-Choice and Constructed-Response Items

    Science.gov (United States)

    Kuo, Bor-Chen; Chen, Chun-Hua; Yang, Chih-Wei; Mok, Magdalena Mo Ching

    2016-01-01

    Traditionally, teachers evaluate students' abilities via their total test scores. Recently, cognitive diagnostic models (CDMs) have begun to provide information about the presence or absence of students' skills or misconceptions. Nevertheless, CDMs are typically applied to tests with multiple-choice (MC) items, which provide less diagnostic…

  13. The Relative Importance of Persons, Items, Subtests, and Languages to TOEFL Test Variance.

    Science.gov (United States)

    Brown, James Dean

    1999-01-01

    Explored the relative contributions to Test of English as a Foreign Language (TOEFL) score dependability of various numbers of persons, items, subtests, languages, and their various interactions. Sampled 15,000 test takers, 1000 each from 15 different language backgrounds. (Author/VWL)

  14. Test-retest reliability of Eurofit Physical Fitness items for children with visual impairments

    NARCIS (Netherlands)

    Houwen, Suzanne; Visscher, Chris; Hartman, Esther; Lemmink, Koen A. P. M.

    2006-01-01

    The purpose of this study was to examine the test-retest reliability of physical fitness items from the European Test of Physical Fitness (Eurofit) for children with visual impairments. A sample of 21 children, ages 6-12 years, that were recruited from a special school for children with visual impairments…

  16. Fostering a student's skill for analyzing test items through an authentic task

    Science.gov (United States)

    Setiawan, Beni; Sabtiawan, Wahyu Budi

    2017-08-01

    Analyzing test items is a skill that must be mastered by prospective teachers in order to determine the quality of the test questions they have written. The main aim of this research was to describe the effectiveness of an authentic task in fostering students' skill at analyzing test items, covering validity, reliability, item discrimination index, level of difficulty, and distractor functioning. The participants were students of the science education study program, faculty of science and mathematics, Universitas Negeri Surabaya, enrolled in the assessment course. The research design was a one-group posttest design. The treatment was an authentic task that required the students to develop test items and then analyze the items like professional assessors using Microsoft Excel and the Anates software. The data obtained were analyzed descriptively: the students' skill levels were presented and then related to theories and previous empirical studies. The research showed that the task helped the students acquire the skills. Thirty-one students got a perfect score for the analysis, five students achieved 97% mastery, two students had 92% mastery, and another two students got 89% and 79% mastery. The implication of the finding is that when students are given authentic tasks that force them to perform like professionals, they are more likely to achieve professional skills by the end of learning.

  17. An Enhanced Automated Test Item Creation Based on Learners Preferred Concept Space

    Directory of Open Access Journals (Sweden)

    Mohammad AL-Smadi

    2016-03-01

    Full Text Available Recently, research has become increasingly interested in developing tools that are able to automatically create test items out of text-based learning contents. Such tools might not only support instructors in creating tests or exams but also learners in self-assessing their learning progress. This paper presents an enhanced automatic question-creation tool (EAQC) that has been recently developed. The EAQC extracts the most important key phrases (concepts) out of a textual learning content and automatically creates test items based on these concepts. Moreover, this paper discusses two studies for the evaluation of the EAQC's application in real learning settings. The first study showed that concepts extracted by the EAQC often, but not always, reflect the concepts extracted by learners. Learners typically extracted fewer concepts than the EAQC, and there was great inter-individual variation between learners with regard to which concepts they experienced as relevant. Accordingly, the second study investigated whether the functionality of the EAQC can be improved in such a way that valid test items are created when the tool is fed with concepts provided by learners. The results showed that the quality of the semi-automated creation of test items was satisfactory. Moreover, this demonstrates the EAQC's flexibility in adapting its workflow to the individual needs of learners.

  18. Effects of Item Parameter Drift on Vertical Scaling with the Nonequivalent Groups with Anchor Test (NEAT) Design

    Science.gov (United States)

    Ye, Meng; Xin, Tao

    2014-01-01

    The authors explored the effects of drifting common items on vertical scaling within the higher order framework of item parameter drift (IPD). The results showed that if IPD occurred between a pair of test levels, the scaling performance started to deviate from the ideal state, as indicated by bias of scaling. When there were two items drifting…

  19. Optimal Item Pool Design for a Highly Constrained Computerized Adaptive Test

    Science.gov (United States)

    He, Wei

    2010-01-01

    Item pool quality has been regarded as one important factor to help realize enhanced measurement quality for the computerized adaptive test (CAT) (e.g., Flaugher, 2000; Jensema, 1977; McBride & Wise, 1976; Reckase, 1976; 2003; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000; Xing & Hambleton, 2004). However, studies are…

  20. Improving Item Response Theory Model Calibration by Considering Response Times in Psychological Tests

    Science.gov (United States)

    Ranger, Jochen; Kuhn, Jorg-Tobias

    2012-01-01

    Research findings indicate that response times in personality scales are related to the trait level according to the so-called speed-distance hypothesis. Against this background, Ferrando and Lorenzo-Seva proposed a latent trait model for the responses and response times in a test. The model consists of two components, a standard item response…

  1. Developing and Validating Test Items for First-Year Computer Science Courses

    Science.gov (United States)

    Vahrenhold, Jan; Paul, Wolfgang

    2014-01-01

    We report on the development, validation, and implementation of a collection of test items designed to detect misconceptions related to first-year computer science courses. To this end, we reworked the development scheme proposed by Almstrum et al. ("SIGCSE Bulletin" 38(4):132-145, 2006) to include students' artifacts and to…

  2. Reading ability and print exposure: item response theory analysis of the author recognition test.

    Science.gov (United States)

    Moore, Mariah; Gordon, Peter C

    2015-12-01

    In the author recognition test (ART), participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, and this predictive ability is generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. In this large-scale study (1,012 college student participants), we used item response theory (IRT) to analyze item (author) characteristics in order to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and optimize scoring of the ART. Factor analysis suggested a potential two-factor structure of the ART, differentiating between literary and popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of the time spent encoding words, as measured using eyetracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Furthermore, they show that frequency data can be used to select items of appropriate difficulty, and that frequency data from corpora based on particular time periods and types of texts may allow adaptations of the test for different populations.
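The scoring variants discussed above are simple to state: the standard ART score is names selected minus foils selected, and the proposed alternative weights foil selections more heavily. A sketch of the two schemes (the penalty of 2.0 below is illustrative, not the value derived in the study):

```python
def art_score(names_selected, foils_selected, foil_penalty=1.0):
    """Author Recognition Test score (sketch). With foil_penalty=1.0
    this reproduces the standard names-minus-foils score; the study
    above argues that a heavier foil penalty better tracks reading
    behavior, since foil selections indicate guessing."""
    return names_selected - foil_penalty * foils_selected

standard = art_score(20, 3)                      # 20 - 3 = 17
penalized = art_score(20, 3, foil_penalty=2.0)   # 20 - 2*3 = 14
```

Two test takers who recognize the same number of authors are thus separated by how freely they endorse foils, which is the guessing behavior the heavier penalty is meant to discourage.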

  4. Some New Item Selection Criteria for Adaptive Testing. Research Report 94-6.

    Science.gov (United States)

    Veerkamp, Wim J. J.; Berger, Martijn P. F.

    In this study some alternative item selection criteria for adaptive testing are proposed. These criteria take into account the uncertainty of the ability estimates. A general weighted information criterion is suggested of which the usual maximum information criterion and the suggested alternative criteria are special cases. A simulation study was…

  5. A latent trait look at pretest-posttest validation of criterion-referenced test items

    NARCIS (Netherlands)

    van der Linden, Willem J.

    1981-01-01

    Since Cox and Vargas (1966) introduced their pretest-posttest validity index for criterion-referenced test items, a great number of additions and modifications have followed. All are based on the idea of gain scoring; that is, they are computed from the differences between proportions of pretest and

  6. A hierarchical framework for modeling speed and accuracy on test items

    NARCIS (Netherlands)

    van der Linden, Willem J.

    2005-01-01

    Current modeling of response times on test items has been influenced by the experimental paradigm of reaction-time research in psychology. For instance, some of the models have a parameter structure that was chosen to represent a speed-accuracy tradeoff, while others equate speed directly with respo

  7. A hierarchical framework for modeling speed and accuracy on test items

    NARCIS (Netherlands)

    van der Linden, Willem J.

    2007-01-01

    Current modeling of response times on test items has been strongly influenced by the paradigm of experimental reaction-time research in psychology. For instance, some of the models have a parameter structure that was chosen to represent a speed-accuracy tradeoff, while others equate speed directly w

  8. Developing and Validating Test Items for First-Year Computer Science Courses

    Science.gov (United States)

    Vahrenhold, Jan; Paul, Wolfgang

    2014-01-01

    We report on the development, validation, and implementation of a collection of test items designed to detect misconceptions related to first-year computer science courses. To this end, we reworked the development scheme proposed by Almstrum et al. ("SIGCSE Bulletin" 38(4):132-145, 2006) to include students' artifacts and to…

  9. Learning to Think Spatially: What Do Students "See" in Numeracy Test Items?

    Science.gov (United States)

    Diezmann, Carmel M.; Lowrie, Tom

    2012-01-01

    Learning to think spatially in mathematics involves developing proficiency with graphics. This paper reports on 2 investigations of spatial thinking and graphics. The first investigation explored the importance of graphics as 1 of 3 communication systems (i.e. text, symbols, graphics) used to provide information in numeracy test items. The results…

  10. ITEM ANALYSIS IN MULTIPLE-CHOICE LISTENING TESTS FROM CILS CERTIFICATE IN ITALIAN AS A FOREIGN LANGUAGE

    Directory of Open Access Journals (Sweden)

    Paulo Torresan

    2015-12-01

This paper analyses three multiple-choice listening tests from the CILS Certificate in Italian as a Foreign Language (level B1, summer sessions 2009 and 2012). Item analysis involves examining the behavior of each individual item based on statistical data on answers from a sample. It offers answers to questions such as: do the items allow for sufficient discrimination between candidates of different skill levels? Do the keys and distractors work appropriately? Our investigation reveals certain issues of undercalibration, non-correspondence between items and information present in the text, and item distribution. In one case, the item's construction risks misleading the test taker (item #2, summer session 2012). As well as providing an example of item analysis, this study allows the reader to gain awareness of how difficult it is to design an exercise widely used in both testing and teaching centers, that is, the multiple-choice question.

  11. Creation of New Items and Forms for the Project A Assembling Objects Test

    Science.gov (United States)

    1994-08-01

...correct; D, a measure of item discrimination - the rpbi between the item score (correct or incorrect) and the total score on the 36 original items...; D2, another measure of item discrimination - the rpbis between the item score and the total score on the 18 original items of the same type (marked...
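The rpbi in this record denotes the point-biserial correlation between a dichotomous item score and the total test score, a standard index of item discrimination. A sketch of how it can be computed (the function name is mine, and this is the textbook formula, not code from the report):

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and the total score:
    r_pbi = (M1 - M0) / s_total * sqrt(p * q),
    where M1/M0 are the mean totals of passers/failers, p is the pass rate,
    q = 1 - p, and s_total is the population SD of the totals."""
    n = len(item_scores)
    m1 = statistics.mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    m0 = statistics.mean(t for i, t in zip(item_scores, total_scores) if i == 0)
    p = sum(item_scores) / n
    s = statistics.pstdev(total_scores)
    return (m1 - m0) / s * (p * (1 - p)) ** 0.5

print(round(point_biserial([1, 1, 0, 0], [10, 8, 6, 4]), 3))  # 0.894
```

The result equals the ordinary Pearson correlation between the 0/1 item variable and the totals.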

  12. Plausibility Functions of Iowa Vocabulary Test Items Estimated by the Simple Sum Procedure of the Conditional P.D.F. Approach.

    Science.gov (United States)

    1984-12-01

...Vocabulary Test items. In so doing, the normal ogive model was adopted for the correct answers of those items, and those items were used as the substitute for...of informative distractors for certain test items. The model validation study accompanying it indicates that for most items the normal ogive model is...

  13. Cross-cultural development of an item list for computer-adaptive testing of fatigue in oncological patients

    Directory of Open Access Journals (Sweden)

    Oberguggenberger Anne S

    2011-03-01

Introduction: Within an ongoing project of the EORTC Quality of Life Group, we are developing computerized adaptive test (CAT) measures for the QLQ-C30 scales. These new CAT measures are conceptualised to reflect the same constructs as the QLQ-C30 scales. Accordingly, the Fatigue-CAT is intended to capture physical and general fatigue. Methods: The EORTC approach to CAT development comprises four phases (literature search, operationalisation, pre-testing, and field testing). Phases I-III are described in detail in this paper. A literature search for fatigue items was performed in major medical databases. After refinement through several expert panels, the remaining items were used as the basis for adapting items and/or formulating new items fitting the EORTC item style. To obtain feedback from patients with cancer, these English items were translated into Danish, French, German, and Spanish and tested in the respective countries. Results: Based on the literature search, a list containing 588 items was generated. After a comprehensive item selection procedure focusing on content, redundancy, item clarity and item difficulty, a list of 44 fatigue items was generated. Patient interviews (n = 52) resulted in 12 revisions of wording and translations. Discussion: The item list developed in phases I-III will be further investigated within a field-testing phase (IV) to examine psychometric characteristics and to fit an item response theory model. The Fatigue-CAT based on this item bank will provide scores that are backward-compatible to the original QLQ-C30 fatigue scale.

  14. Competency-based classification of COMLEX-USA cognitive examination test items.

    Science.gov (United States)

    Langenau, Erik; Pugliano, Gina; Roberts, William

    2011-06-01

The Comprehensive Osteopathic Medical Licensing Examination-USA (COMLEX-USA) currently assesses osteopathic medical knowledge via a series of 3 progressive cognitive examinations and 1 clinical skills assessment. In 2009, the National Board of Osteopathic Medical Examiners created the Fundamental Osteopathic Medical Competencies (FOMC) document to outline the essential competencies required for the practice of osteopathic medicine. The aim of this study was to measure the distribution and extent to which cognitive examination items of the current COMLEX-USA series assess knowledge of each of the medical competencies included in the FOMC document. Eight graduate medical education panelists with expertise in competency-based assessment reviewed 1046 multiple-choice examination items extracted from the 3 COMLEX-USA cognitive examinations (Level 1, Level 2-Cognitive Evaluation, and Level 3) used during the 2008-2009 testing cycle. The 8 panelists individually judged each item to classify it as 1 of the 6 fundamental osteopathic medical competencies described in the FOMC document. Panelists made 8368 judgments. The majority of the sample examination items were classified as either patient care (3343 [40%]) or medical knowledge (4236 [51%]). Panelists also reported these 2 competencies as being the easiest to define, teach, and assess. The frequency of medical knowledge examination items decreased throughout the COMLEX-USA series (69%, 43%, 40%); conversely, items classified as interpersonal and communication skills, systems-based practice, practice-based learning and improvement, and professionalism increased throughout the 3-examination series. Results indicate that knowledge of each of the 6 competencies is being assessed to some extent with the current COMLEX-USA format. These findings provide direction for the enhancement of existing examinations and development of new assessment tools.

15. What is the Ability Emotional Intelligence Test (MSCEIT) good for? An evaluation using item response theory.

    Directory of Open Access Journals (Sweden)

    Marina Fiori

The ability approach has been indicated as promising for advancing research in emotional intelligence (EI). However, there is a scarcity of tests measuring EI as a form of intelligence. The Mayer Salovey Caruso Emotional Intelligence Test, or MSCEIT, is among the few available and the most widespread measure of EI as an ability. This implies that conclusions about the value of EI as a meaningful construct and about its utility in predicting various outcomes mainly rely on the properties of this test. We tested whether individuals who have the highest probability of choosing the most correct response on any item of the test are also those who have the strongest EI ability. Results showed that this is not the case for most items: The answer indicated by experts as the most correct in several cases was not associated with the highest ability; furthermore, items appeared too easy to challenge individuals high in EI. Overall results suggest that the MSCEIT is best suited to discriminate persons at the low end of the trait. Results are discussed in light of applied and theoretical considerations.

  16. What is the Ability Emotional Intelligence Test (MSCEIT) good for? An evaluation using item response theory.

    Science.gov (United States)

    Fiori, Marina; Antonietti, Jean-Philippe; Mikolajczak, Moira; Luminet, Olivier; Hansenne, Michel; Rossier, Jérôme

    2014-01-01

The ability approach has been indicated as promising for advancing research in emotional intelligence (EI). However, there is a scarcity of tests measuring EI as a form of intelligence. The Mayer Salovey Caruso Emotional Intelligence Test, or MSCEIT, is among the few available and the most widespread measure of EI as an ability. This implies that conclusions about the value of EI as a meaningful construct and about its utility in predicting various outcomes mainly rely on the properties of this test. We tested whether individuals who have the highest probability of choosing the most correct response on any item of the test are also those who have the strongest EI ability. Results showed that this is not the case for most items: The answer indicated by experts as the most correct in several cases was not associated with the highest ability; furthermore, items appeared too easy to challenge individuals high in EI. Overall results suggest that the MSCEIT is best suited to discriminate persons at the low end of the trait. Results are discussed in light of applied and theoretical considerations.

  17. Prediction of true test scores from observed item scores and ancillary data.

    Science.gov (United States)

    Haberman, Shelby J; Yao, Lili; Sinharay, Sandip

    2015-05-01

    In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE(®) General Analytical Writing and until 2009 in the case of TOEFL(®) iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater(®). In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.
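The paper's estimator is multivariate and relies on error covariances estimated from repeat test takers; the simplest classical-test-theory special case of a best linear predictor of a true score is Kelley's regression estimate, sketched below with illustrative numbers (the function name is mine):

```python
def kelley_true_score(observed, reliability, group_mean):
    """Kelley's classical-test-theory best linear predictor of a true score:
    T_hat = reliability * X + (1 - reliability) * mean(X).
    The observed score is shrunk toward the group mean; the lower the
    reliability, the stronger the shrinkage."""
    return reliability * observed + (1 - reliability) * group_mean

print(kelley_true_score(30.0, 0.8, 25.0))  # shrinks 30 toward the mean of 25
```

At reliability 1.0 the predictor returns the observed score unchanged; at 0.0 it returns the group mean.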

  18. Domain-General and Domain-Specific Creative-Thinking Tests: Effects of Gender and Item Content on Test Performance

    Science.gov (United States)

    Hong, Eunsook; Peng, Yun; O'Neil, Harold F., Jr.; Wu, Junbin

    2013-01-01

    The study examined the effects of gender and item content of domain-general and domain-specific creative-thinking tests on four subscale scores of creative-thinking (fluency, flexibility, originality, and elaboration). Chinese tenth-grade students (234 males and 244 females) participated in the study. Domain-general creative thinking was measured…

  19. Item and associative memory in amnestic mild cognitive impairment: performance on standardized memory tests.

    Science.gov (United States)

    Troyer, Angela K; Murphy, Kelly J; Anderson, Nicole D; Hayman-Abello, Brent A; Craik, Fergus I M; Moscovitch, Morris

    2008-01-01

    The earliest neuroanatomical changes in amnestic mild cognitive impairment (aMCI) involve the hippocampus and entorhinal cortex, structures implicated in the integration and learning of associative information. The authors hypothesized that individuals with aMCI would have impairments in associative memory above and beyond the known impairments in item memory. A group of 29 individuals with aMCI and 30 matched control participants were administered standardized tests of object-location recall and symbol-symbol recall, from which both item and associative recall scores were derived. As expected, item recall was impaired in the aMCI group relative to controls. Associative recall in the aMCI group was even more impaired than was item recall. The best group discriminators were measures of associative recall, with which the sensitivity and specificity for detecting aMCI were 76% and 90% for symbol-symbol recall and were 86% and 97% for object-location recall. Associative recall may be particularly sensitive to early cognitive change in aMCI, because this ability relies heavily on the medial temporal lobe structures that are affected earliest in aMCI. Incorporating measures of associative recall into clinical evaluations of individuals with memory change may be useful for detecting aMCI.

  20. Writing multiple-choice test items that promote and measure critical thinking.

    Science.gov (United States)

    Morrison, S; Free, K W

    2001-01-01

    Faculties are concerned about measurement of critical thinking especially since the National League for Nursing Accrediting Commission cited such measurement as a requirement for accreditation (NLNAC, 1997). Some writers and researchers (Alfaro-LeFevre, 1995; Blat, 1989; McPeck, 1981, 1990) describe the need to measure critical thinking within the context of a specific discipline. Based on McPeck's position that critical thinking is discipline-specific, guidelines for developing multiple-choice test items as a means of measuring critical thinking within the discipline of nursing are discussed. Specifically, criteria described by Morrison, Smith, and Britt (1996) for writing critical-thinking multiple-choice test items are reviewed and explained for promoting and measuring critical thinking.

  1. Latent Class Analysis of Differential Item Functioning on the Peabody Picture Vocabulary Test-III

    Science.gov (United States)

    Webb, Mi-young Lee; Cohen, Allan S.; Schwanenflugel, Paula J.

    2008-01-01

    This study investigated the use of latent class analysis for the detection of differences in item functioning on the Peabody Picture Vocabulary Test-Third Edition (PPVT-III). A two-class solution for a latent class model appeared to be defined in part by ability because Class 1 was lower in ability than Class 2 on both the PPVT-III and the…

  2. Differential functioning of mini-mental test items according to disease.

    Science.gov (United States)

    Prieto, G; Delgado, A R; Perea, M V; Ladera, V

    2011-10-01

Comparing the height of males and females would be impossible if the measuring device did not have the same properties for both populations. In a similar way, the cognitive level of diverse groups of patients should not be compared if the test has different measurement properties for these groups. Lack of Differential Item Functioning (DIF) is a condition for measurement invariance between populations. The most internationally used screening test for dementia, the MMSE (or Mini-Mental State Examination), has been analysed using an advanced psychometric technique, the Rasch model. The objective was to determine the invariance of mini-mental measurements from diverse groups: Parkinson's disease patients, Alzheimer's type dementia patients and normal subjects. The hypothesis was that the scores would not show DIF against any of these groups. The total sample was composed of 400 subjects. Significant differences between groups were found. However, the quantitative comparison only makes sense if no evidence against measurement invariance is found: given the kind of items showing DIF against Parkinson's disease patients, the MMSE seems to underestimate the cognitive level of these patients. Despite the extended use of this test, 11 items out of 30 show DIF, and consequently score comparisons between groups are not justified. Copyright © 2010 Sociedad Española de Neurología. Published by Elsevier Espana. All rights reserved.
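Under the Rasch model used in this study, the probability of passing an item depends only on the gap between person ability θ and item difficulty b; DIF means the same item behaves as if it had a different difficulty in one group at equal ability. A hypothetical illustration (the difficulty values below are invented, not the study's estimates):

```python
import math

def rasch_p(theta, b):
    """Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An item showing DIF: at the same ability level it is effectively harder
# for one group than for the other (illustrative numbers only).
theta = 0.0
b_reference, b_focal = -0.5, 0.5
print(round(rasch_p(theta, b_reference), 2))  # 0.62
print(round(rasch_p(theta, b_focal), 2))      # 0.38
```

A score comparison between the groups would then understate the focal group's ability, which is the study's concern about the MMSE and Parkinson's disease patients.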

  3. The Effects of the Number of Options per Item and Student Ability on Test Validity and Reliability.

    Science.gov (United States)

    Trevisan, Michael S.; And Others

    1991-01-01

    The reliability and validity of multiple-choice tests were computed as a function of the number of options per item and student ability for 435 parochial high school juniors, who were administered the Washington Pre-College Test Battery. Results suggest the efficacy of the three-option item. (SLD)

  4. On the Relationship between Classical Test Theory and Item Response Theory: From One to the Other and Back

    Science.gov (United States)

    Raykov, Tenko; Marcoulides, George A.

    2016-01-01

    The frequently neglected and often misunderstood relationship between classical test theory and item response theory is discussed for the unidimensional case with binary measures and no guessing. It is pointed out that popular item response models can be directly obtained from classical test theory-based models by accounting for the discrete…

  5. PISA Test Items and School-Based Examinations in Greece: Exploring the Relationship between Global and Local Assessment Discourses

    Science.gov (United States)

    Anagnostopoulou, Kyriaki; Hatzinikita, Vassilia; Christidou, Vasilia; Dimopoulos, Kostas

    2013-01-01

    The paper explores the relationship of the global and the local assessment discourses as expressed by Programme for International Student Assessment (PISA) test items and school-based examinations, respectively. To this end, the paper compares PISA test items related to living systems and the context of life, health, and environment, with Greek…

  6. Salience of Guilty Knowledge Test items affects accuracy in realistic mock crimes.

    Science.gov (United States)

    Jokinen, Anne; Santtila, Pekka; Ravaja, Niklas; Puttonen, Sampsa

    2006-10-01

A Guilty Knowledge Test (GKT) measuring electrodermal reactions was carried out in order to investigate the quality of different questions and the validity of the test in a situation that resembled a true crime. Fifty participants were randomly assigned to commit one of two realistic mock crimes, and were later tested with GKTs concerning both the crime they had enacted and the one they had no knowledge of. Different scoring systems (SCRs and peak amplitudes, as well as raw and standardised scores) were employed and compared when analyzing the results. Although there were some false positives, the test was able to differentiate between the groups of guilty and innocent participants. With the best scoring systems, the test was able to classify up to 84% of the innocent and up to 76% of the guilty correctly according to a logistic regression analysis. ROC areas reflecting these same results reached values above .80. Questions on matters that demanded the participants' attention and were easier to remember had better discriminative power. With nearly all scoring methods, there was a significant interaction between the salience of the relevant items and the guilt of the participants. Participants reacted more strongly to salient relevant items when they were guilty, while no differing reactions were observed for the non-salient items between guilty and innocent participants. It is suggested that, although the Guilty Knowledge Test appears to be a valid measure of guilty knowledge even in crimes that are close to real crimes, the principles on which guilty knowledge test questions are constructed should be more clearly specified.

  7. A note on using alpha and stratified alpha to estimate the reliability of a test composed of item parcels.

    Science.gov (United States)

    Rae, Gordon

    2008-11-01

Several authors have suggested that prior to conducting a confirmatory factor analysis it may be useful to group items into a smaller number of item 'parcels' or 'testlets'. The present paper shows mathematically that coefficient alpha based on these parcel scores will only exceed alpha based on the entire set of items if W, the ratio of the average covariance of items between parcels to the average covariance of items within parcels, is greater than unity. If W is less than unity, however, and errors of measurement are uncorrelated, then stratified alpha will be a better lower bound to the reliability of a measure than the other two coefficients. Stratified alpha is also equal to the true reliability of a test when items within parcels are essentially tau-equivalent, provided that errors of measurement are uncorrelated.
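The coefficients compared in this note are straightforward to compute. A sketch of coefficient alpha and stratified alpha from persons-by-items score matrices (function names are mine; the formulas are the standard definitions, with stratified alpha discounting each parcel's variance by its unreliability):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a persons-by-items matrix (list of score rows)."""
    k = len(items[0])
    item_vars = [statistics.pvariance([row[i] for row in items]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def stratified_alpha(parcels):
    """Stratified alpha: 1 - sum_j var_j * (1 - alpha_j) / var_total, where
    var_j is the variance of parcel j's sum score and alpha_j its alpha.
    Each parcel is a persons-by-items matrix with the same persons (rows)."""
    n = len(parcels[0])
    totals = [sum(sum(p[person]) for p in parcels) for person in range(n)]
    total_var = statistics.pvariance(totals)
    penalty = sum(
        statistics.pvariance([sum(p[person]) for person in range(n)])
        * (1 - cronbach_alpha(p))
        for p in parcels
    )
    return 1 - penalty / total_var

A = [[1, 1], [1, 0], [0, 1], [0, 0]]  # an unreliable parcel (alpha = 0)
B = [[1, 2], [2, 3], [3, 4], [4, 5]]  # a perfectly consistent parcel (alpha = 1)
print(round(stratified_alpha([A, B]), 3))  # 0.8
```

Only the unreliable parcel's variance contributes to the penalty term, so the composite reliability estimate falls below one.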

  8. Testing Three-Item Versions for Seven of Young's Maladaptive Schema

    Science.gov (United States)

    Blau, Gary; DiMino, John; Sheridan, Natalie; Pred, Robert S.; Beverly, Clyde; Chessler, Marcy

    2015-01-01

The Young Schema Questionnaire (YSQ) in either long-form (205-item) or short-form (75-item or 90-item) versions has demonstrated its clinical usefulness for assessing early maladaptive schemas. However, even a 75- or 90-item "short form", particularly when combined with other measures, can represent a lengthy…

  9. Using Cochran's Z Statistic to Test the Kernel-Smoothed Item Response Function Differences between Focal and Reference Groups

    Science.gov (United States)

    Zheng, Yinggan; Gierl, Mark J.; Cui, Ying

    2010-01-01

    This study combined the kernel smoothing procedure and a nonparametric differential item functioning statistic--Cochran's Z--to statistically test the difference between the kernel-smoothed item response functions for reference and focal groups. Simulation studies were conducted to investigate the Type I error and power of the proposed…
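The kernel-smoothing half of this procedure estimates each group's item response function nonparametrically: the proportion correct at an ability value is a kernel-weighted average over examinees with nearby ability estimates, and DIF appears as a gap between the reference- and focal-group curves that Cochran's Z then tests. A minimal Nadaraya-Watson sketch with a Gaussian kernel (the function name, data, and bandwidth are illustrative assumptions, not the study's settings):

```python
import math

def smoothed_irf(theta, abilities, responses, bandwidth=0.5):
    """Kernel-smoothed item response function: a Gaussian-kernel weighted
    average of 0/1 responses, evaluated at ability value theta."""
    weights = [math.exp(-0.5 * ((a - theta) / bandwidth) ** 2) for a in abilities]
    return sum(w * r for w, r in zip(weights, responses)) / sum(weights)

# Examinees below theta = 0 failed the item; those at or above it passed:
abilities = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [0, 0, 1, 1, 1]
print(smoothed_irf(0.0, abilities, responses))
```

Evaluating the same function separately for the reference- and focal-group data at a grid of ability values yields the two curves whose pointwise differences the Z statistic assesses.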

  10. Two Test Items to Explore High School Students' Beliefs of Sample Size When Sampling from Large Populations

    Science.gov (United States)

    Bill, Anthony; Henderson, Sally; Penman, John

    2010-01-01

    Two test items that examined high school students' beliefs of sample size for large populations using the context of opinion polls conducted prior to national and state elections were developed. A trial of the two items with 21 male and 33 female Year 9 students examined their naive understanding of sample size: over half of students chose a…

  11. Developing Items to Measure Theory of Planned Behavior Constructs for Opioid Administration for Children: Pilot Testing.

    Science.gov (United States)

    Vincent, Catherine; Riley, Barth B; Wilkie, Diana J

    2015-12-01

    The Theory of Planned Behavior (TpB) is useful to direct nursing research aimed at behavior change. As proposed in the TpB, individuals' attitudes, perceived norms, and perceived behavior control predict their intentions to perform a behavior and subsequently predict their actual performance of the behavior. Our purpose was to apply Fishbein and Ajzen's guidelines to begin development of a valid and reliable instrument for pediatric nurses' attitudes, perceived norms, perceived behavior control, and intentions to administer PRN opioid analgesics when hospitalized children self-report moderate to severe pain. Following Fishbein and Ajzen's directions, we were able to define the behavior of interest and specify the research population, formulate items for direct measures, elicit salient beliefs shared by our target population and formulate items for indirect measures, and prepare and test our questionnaire. For the pilot testing of internal consistency of measurement items, Cronbach alphas were between 0.60 and 0.90 for all constructs. Test-retest reliability correlations ranged from 0.63 to 0.90. Following Fishbein and Ajzen's guidelines was a feasible and organized approach for instrument development. In these early stages, we demonstrated good reliability for most subscales, showing promise for the instrument and its use in pain management research. Better understanding of the TpB constructs will facilitate the development of interventions targeted toward nurses' attitudes, perceived norms, and/or perceived behavior control to ultimately improve their pain behaviors toward reducing pain for vulnerable children. Copyright © 2015 American Society for Pain Management Nursing. Published by Elsevier Inc. All rights reserved.

  12. Effects of three combinations of plyometric and weight training programs on selected physical fitness test items.

    Science.gov (United States)

    Ford, H T; Puckett, J R; Drummond, J P; Sawyer, K; Gantt, K; Fussell, C

    1983-06-01

    To determine the effects of prescribed training programs on 5 physical fitness test items, each of 50 high school boys participated for 10 wk. in one of three programs (wrestling, softball, and plyometrics; weight training; and weight training and plyometrics). (a) On the sit-ups, 40-yd. dash, vertical jump, and pull-ups, each group improved significantly from pre- to posttest. (b) On the shuttle run, none of the groups improved significantly from pre- to posttest. (c) On the vertical jump, groups had a significant effect, but the interaction was nonsignificant. No effects were significant.

  13. Science Library of Test Items. Volume Eleven. Mastery Testing Programme. [Mastery Tests Series 3.] Tests M27-M38.

    Science.gov (United States)

    New South Wales Dept. of Education, Sydney (Australia).

    As part of a series of tests to measure mastery of specific skills in the natural sciences, copies of tests 27 through 38 include: (27) reading a grid plan; (28) identifying common invertebrates; (29) characteristics of invertebrates; (30) identifying elements; (31) using scientific notation part I; (32) classifying minerals; (33) predicting the…

  14. Effects of L1 Definitions and Cognate Status of Test Items on the Vocabulary Size Test

    Science.gov (United States)

    Elgort, Irina

    2013-01-01

    This study examines the development and evaluation of a bilingual Vocabulary Size Test (VST, Nation, 2006). A bilingual (English-Russian) test was developed and administered to 121 intermediate proficiency EFL learners (native speakers of Russian), alongside the original monolingual (English-only) version of the test. A comparison of the bilingual…

  15. Does Test Item Performance Increase with Test-to-Standards Alignment?

    Science.gov (United States)

    Traynor, Anne

    2017-01-01

    Variation in test performance among examinees from different regions or national jurisdictions is often partially attributed to differences in the degree of content correspondence between local school or training program curricula, and the test of interest. This posited relationship between test-curriculum correspondence, or "alignment,"…

  16. Error analysis and passage dependency of test items from a standardized test of multiple-sentence reading comprehension for aphasic and non-brain-damaged adults.

    Science.gov (United States)

    Nicholas, L E; Brookshire, R H

    1987-11-01

Aphasic and non-brain-damaged adults were tested with two forms of the Nelson Reading Skills Test (NRST; Hanna, Schell, & Schreiner, 1977). The NRST is a standardized measure of silent reading for students in Grades 3 through 9 and assesses comprehension of information at three levels of inference (literal, translational, and higher level). Subjects' responses to NRST test items were evaluated to determine if their performance differed on literal, translational, and higher level items. Subjects' performance was also evaluated to determine the passage dependency of NRST test items--the extent to which readers had to rely on information in the NRST reading passages to answer test items. Higher level NRST test items (requiring complex inferences) were significantly more difficult for both non-brain-damaged and aphasic adults than literal items (not requiring inferences) or translational items (requiring simple inferences). The passage dependency of NRST test items for aphasic readers was higher than those reported by Nicholas, MacLennan, and Brookshire (1986) for multiple-sentence reading tests designed for aphasic adults. This suggests that the NRST is a more valid measure of the multiple-sentence reading comprehension of aphasic adults than the other tests evaluated by Nicholas et al. (1986).

  17. Developing energy and momentum conceptual survey (EMCS) with four-tier diagnostic test items

    Science.gov (United States)

    Afif, Nur Faadhilah; Nugraha, Muhammad Gina; Samsudin, Achmad

    2017-05-01

Students' conceptions of work and energy are important to support the learning process in the classroom. For that reason, a diagnostic test instrument is needed to diagnose students' conceptions of work and energy. As a result, the researchers decided to develop the Energy and Momentum Conceptual Survey (EMCS) test instrument into four-tier diagnostic test items. This research is organized as the first step of four-tier-test-formatted EMCS development as a diagnostic test instrument on work and energy. The research method used the 4D model (Defining, Designing, Developing and Disseminating). The developed instrument was tested on 39 students at a senior high school. The results showed that the four-tier-test-formatted EMCS is able to diagnose students' levels of conceptual understanding of work and energy. It can be concluded that the four-tier-test-formatted EMCS is a promising diagnostic test instrument for distinguishing students who understand the concepts, hold misconceptions, or do not understand the concepts of work and energy at all.

  18. Development of the four-item Letter and Shape Drawing test (LSD-4): A brief bedside test of visuospatial function.

    Science.gov (United States)

    Williams, Olugbenga Alaba; O'Connell, Henry; Leonard, Maeve; Awan, Fahad; White, Debbie; McKenna, Frank; Hannigan, Ailish; Cullen, Walter; Exton, Chris; Enudi, Walter; Dunne, Colum; Adamis, Dimitrios; Meagher, David

    2017-01-01

    Conventional bedside tests of visuospatial function such as the Clock Drawing Test (CDT) and Intersecting Pentagons Test (IPT) lack consistency in delivery and interpretation. We compared performance on a novel test of visuospatial ability - the LSD - with the IPT, CDT and MMSE in 180 acute elderly medical inpatients [mean age 79.7±7.1 (range 62-96); 91 females (50.6%)]. 124 (69%) scored ≤23 on the MMSE; 60 with mild (score 18-23) and 64 with severe (score ≤17) impairment. 78 (43%) scored ≥6 on the CDT, while for the IPT, 87 (47%) scored ≥4. The CDT and IPT agreed on the classification of 138 patients (77%), with modest to strong agreement with the MMSE categories. Correlation between the LSD and visuospatial tests was high. A four-item version of the LSD incorporating items 1, 10, 12 and 15 had high correlation with the LSD-15 and strong association with MMSE categories. The LSD-4 provides a brief and easily interpreted bedside test of visuospatial function that has high coverage of elderly patients with neurocognitive impairment, good agreement with conventional tests of visuospatial ability and favourable ability to identify significant cognitive impairment.

  19. Adaptation and validation into Portuguese language of the six-item cognitive impairment test (6CIT).

    Science.gov (United States)

    Apóstolo, João Luís Alves; Paiva, Diana Dos Santos; Silva, Rosa Carla Gomes da; Santos, Eduardo José Ferreira Dos; Schultz, Timothy John

    2017-07-25

    The six-item cognitive impairment test (6CIT) is a brief cognitive screening tool that can be administered to older people in 2-3 min. The aims were to adapt the 6CIT into European Portuguese and determine its psychometric properties based on a sample recruited from several contexts (nursing homes; universities for older people; day centres; primary health care units). The original 6CIT was translated into Portuguese and the draft Portuguese version (6CIT-P) was back-translated and piloted. The accuracy of the 6CIT-P was assessed by comparison with the Portuguese Mini-Mental State Examination (MMSE). A convenience sample of 550 older people from various geographical locations in the north and centre of the country was used. The test-retest reliability coefficient was high (r = 0.95). The 6CIT-P also showed good internal consistency (α = 0.88), and corrected item-total correlations ranged between 0.32 and 0.90. Total 6CIT-P and MMSE scores were strongly correlated. The proposed 6CIT-P threshold for cognitive impairment is ≥10 in the Portuguese population, which gives sensitivity of 82.78% and specificity of 84.84%. The accuracy of the 6CIT-P, as measured by area under the ROC curve, was 0.91. The 6CIT-P has high reliability and validity and is accurate when used to screen for cognitive impairment.
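
The cutoff logic behind sensitivity/specificity figures like these can be sketched as follows; the scores, diagnoses and counts below are toy data, not the study's sample:

```python
# Sketch of how a screening cutoff (e.g. score >= cutoff flags
# impairment) yields sensitivity and specificity against a reference
# diagnosis. Hypothetical data, not the 6CIT-P validation sample.

def screen_stats(scores, impaired, cutoff):
    """scores: screening scores; impaired: True/False reference diagnosis."""
    tp = sum(1 for s, d in zip(scores, impaired) if d and s >= cutoff)
    fn = sum(1 for s, d in zip(scores, impaired) if d and s < cutoff)
    tn = sum(1 for s, d in zip(scores, impaired) if not d and s < cutoff)
    fp = sum(1 for s, d in zip(scores, impaired) if not d and s >= cutoff)
    sensitivity = tp / (tp + fn)   # flagged among the truly impaired
    specificity = tn / (tn + fp)   # cleared among the truly unimpaired
    return sensitivity, specificity

scores = [12, 11, 9, 14, 3, 5, 10, 2]
impaired = [True, True, True, True, False, False, False, False]
print(screen_stats(scores, impaired, cutoff=10))  # (0.75, 0.75)
```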

  20. Development and Testing of a 3-Item Screening Tool for Problematic Internet Use.

    Science.gov (United States)

    Moreno, Megan A; Arseniev-Koehler, Alina; Selkie, Ellen

    2016-09-01

    To develop and validate the Problematic and Risky Internet Use Screening Scale (PRIUSS)-3, a short scale to screen for Problematic Internet Use. This scale development study applied standard processes using separate samples for training and testing datasets. We recruited participants from schools and colleges in 6 states and 2 countries. We selected 3 initial versions of a PRIUSS-3 using correlation to the PRIUSS-18 score. We evaluated these 3 potential screening scales for conceptual coherence, factor loading, sensitivity, and specificity. We selected a 3-item screening tool and evaluated it in 2 separate testing sets using receiver operating characteristic (ROC) curves. Our study sample included 1079 adolescents and young adults. The PRIUSS-3 included items addressing anxiety when away from the Internet, loss of motivation when on the Internet, and feelings of withdrawal when away from the Internet. This screening scale had a sensitivity of 100% and specificity of 69%. A score of ≥3 on the PRIUSS-3 was the threshold to follow up with the PRIUSS-18. Similar to other clinical screening tools, the PRIUSS-3 can be administered quickly in a clinical or research setting. Positive screens should be followed by administering the full PRIUSS-18. Given the pervasive presence of the Internet in youth's lives, screening and counseling for Problematic Internet Use can be facilitated by use of this validated screening tool.

  1. Sleep Can Reduce the Testing Effect: It Enhances Recall of Restudied Items but Can Leave Recall of Retrieved Items Unaffected

    Science.gov (United States)

    Bäuml, Karl-Heinz T.; Holterman, Christoph; Abel, Magdalena

    2014-01-01

    The testing effect refers to the finding that retrieval practice in comparison to restudy of previously encoded contents can improve memory performance and reduce time-dependent forgetting. Naturally, long retention intervals include both wake and sleep delay, which can influence memory contents differently. In fact, sleep immediately after…

  3. Development of an item bank for the EORTC Role Functioning Computer Adaptive Test (EORTC RF-CAT)

    DEFF Research Database (Denmark)

    Gamper, Eva-Maria; Petersen, Morten Aa.; Aaronson, Neil

    2016-01-01

    BACKGROUND: Role functioning (RF) as a core construct of health-related quality of life (HRQOL) comprises aspects of occupational and social roles relevant for patients in all treatment phases as well as for survivors. The objective of the current study was to improve its assessment by developing......, and evaluation of the psychometric performance of the RF-CAT. RESULTS: Phases I-III yielded a list of 12 items eligible for phase IV field-testing. The field-testing sample included 1,023 patients from Austria, Denmark, Italy, and the UK. Psychometric evaluation and item response theory analyses yielded 10 items...

  4. Too Good to be Used: Analyzing Utilization of the Test Program for Certain Commercial Items in the Air Force

    Science.gov (United States)

    2014-12-01

    openness of a combined synopsis-solicitation under the Test Program for Certain Commercial Items. c. Benefit #3: Greater Efficiencies. In terms of... radiological attack; or (3) the acquisition does not exceed the threshold and can be treated as an acquisition of commercial items in accordance with FAR... chemical, or radiological attack. (p. 56) This final sort shows the total actions eligible to use FAR Subpart 13.5, Test Program for Certain

  5. Performance on large-scale science tests: Item attributes that may impact achievement scores

    Science.gov (United States)

    Gordon, Janet Victoria

    , characteristics of test items themselves and/or opportunities to learn. Suggestions for future research are made.

  6. Interpreting gains and losses in conceptual test using Item Response Theory

    CERN Document Server

    Lamine, Brahim

    2015-01-01

    Conceptual tests are widely used by physics instructors to assess students' conceptual understanding and compare teaching methods. It is common to look at students' changes in their answers between a pre-test and a post-test to quantify a transition in students' conceptions. This is often done by looking at the proportion of incorrect answers in the pre-test that change to correct answers in the post-test -- the gain -- and the proportion of correct answers that change to incorrect answers -- the loss. By comparing theoretical predictions to experimental data on the Force Concept Inventory, we show that Item Response Theory (IRT) is able to predict the observed gains and losses fairly well. We then use IRT to quantify students' changes in a test-retest situation when no learning occurs and show that (i) up to 25% of total answers can change due to the non-deterministic nature of students' answers and that (ii) gains and losses can go from 0% to 100%. Still using IRT, we highlight the conditions tha...
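
The test-retest point above can be illustrated with a small sketch: if the two attempts are independent draws with the same IRT probability p of a correct response (no learning), an answer changes with probability 2p(1-p). The 3PL parameter values below are invented for illustration:

```python
# Hypothetical illustration of answer changes under IRT with no
# learning. p_correct_3pl is the standard three-parameter logistic
# item response function; the parameter values are made up.
import math

def p_correct_3pl(theta, a, b, c):
    """P(correct) for ability theta, discrimination a, difficulty b, guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_answer_change(p):
    """P(the answer flips between two independent attempts), = 2p(1-p)."""
    return 2.0 * p * (1.0 - p)

p = p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)   # = 0.6 here
print(round(p, 2), round(p_answer_change(p), 2))     # 0.6 0.48
```

Note that the change probability peaks at 50% when p = 0.5, which is why a sizeable fraction of answers can flip purely by chance.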

  7. Analysing Item Position Effects due to Test Booklet Design within Large-Scale Assessment

    Science.gov (United States)

    Hohensinn, Christine; Kubinger, Klaus D.; Reif, Manuel; Schleicher, Eva; Khorramdel, Lale

    2011-01-01

    For large-scale assessments, usually booklet designs administering the same item at different positions within a booklet are used. Therefore, the occurrence of position effects influencing the difficulty of the item is a crucial issue. Not taking learning or fatigue effects into account would result in a bias of estimated item difficulty. The…

  8. Evaluation of an Item Bank for a Computerized Adaptive Test of Activity in Children With Cerebral Palsy

    Science.gov (United States)

    Haley, Stephen M.; Fragala-Pinkham, Maria A.; Dumas, Helene M.; Ni, Pengsheng; Gorton, George E.; Watson, Kyle; Montpetit, Kathleen; Bilodeau, Nathalie; Hambleton, Ronald K.; Tucker, Carole A.

    2009-01-01

    Background: Contemporary clinical assessments of activity are needed across the age span for children with cerebral palsy (CP). Computerized adaptive testing (CAT) has the potential to efficiently administer items for children across wide age spans and functional levels. Objective: The objective of this study was to examine the psychometric properties of a new item bank and simulated computerized adaptive test to assess activity level abilities in children with CP. Design: This was a cross-sectional item calibration study. Methods: The convenience sample consisted of 308 children and youth with CP, aged 2 to 20 years (X=10.7, SD=4.0), recruited from 4 pediatric hospitals. We collected parent-report data on an initial set of 45 activity items. Using an Item Response Theory (IRT) approach, we compared estimated scores from the activity item bank with concurrent instruments, examined discriminate validity, and developed computer simulations of a CAT algorithm with multiple stop rules to evaluate scale coverage, score agreement with CAT algorithms, and discriminant and concurrent validity. Results: Confirmatory factor analysis supported scale unidimensionality, local item dependence, and invariance. Scores from the computer simulations of the prototype CATs with varying stop rules were consistent with scores from the full item bank (r=.93–.98). The activity summary scores discriminated across levels of upper-extremity and gross motor severity and were correlated with the Pediatric Outcomes Data Collection Instrument (PODCI) physical function and sports subscale (r=.86), the Functional Independence Measure for Children (Wee-FIM) (r=.79), and the Pediatric Quality of Life Inventory–Cerebral Palsy version (r=.74). Limitations: The sample size was small for such IRT item banks and CAT development studies. Another limitation was oversampling of children with CP at higher functioning levels. Conclusions: The new activity item bank appears to have promise for use in a CAT
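
The CAT loop behind simulations like those above can be sketched minimally; the item bank, the shrinking-step ability update, and the fixed-length stop rule below are simplified assumptions, not the study's algorithm:

```python
# Minimal CAT sketch: pick the unadministered item whose difficulty is
# nearest the current ability estimate, update the estimate from the
# response, and stop after a fixed number of items (one possible stop
# rule). All parameters are hypothetical.

def run_cat(difficulties, answer, max_items=5):
    """answer(b) -> 0/1 simulated response to an item of difficulty b."""
    theta = 0.0
    remaining = list(range(len(difficulties)))
    for step in range(max_items):
        # Item selection: difficulty closest to the ability estimate.
        i = min(remaining, key=lambda j: abs(difficulties[j] - theta))
        remaining.remove(i)
        correct = answer(difficulties[i])
        # Crude stand-in for maximum-likelihood updating: step up on a
        # correct answer, down on an incorrect one, with shrinking steps.
        theta += (1.0 if correct else -1.0) / (step + 1)
    return theta

bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
# Deterministic examinee: answers items easier than 0.8 correctly.
estimate = run_cat(bank, answer=lambda b: 1 if b < 0.8 else 0)
print(round(estimate, 2))  # 0.78
```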

  9. A Case Study on an Item Writing Process: Use of Test Specifications, Nature of Group Dynamics, and Individual Item Writers' Characteristics

    Science.gov (United States)

    Kim, Jiyoung; Chi, Youngshin; Huensch, Amanda; Jun, Heesung; Li, Hongli; Roullion, Vanessa

    2010-01-01

    This article discusses a case study on an item writing process that reflects on our practical experience in an item development project. The purpose of the article is to share our lessons from the experience aiming to demystify item writing process. The study investigated three issues that naturally emerged during the project: how item writers use…

  10. Assessing the discriminating power of item and test scores in the linear factor-analysis model

    Directory of Open Access Journals (Sweden)

    Pere J. Ferrando

    2012-01-01

    Full Text Available Rigorous, psychometric-model-based proposals for studying the imprecise concept of "discriminating power" are scarce and generally limited to nonlinear models for binary items. This article proposes a general framework for assessing the discriminating power of item and test scores that are calibrated with the common-factor model. The proposal is organized around three criteria: (a) type of score, (b) range of discrimination, and (c) specific aspect assessed. Within the proposed framework: (a) 16 measures are discussed, of which 6 appear to be new, and (b) the relations among them are studied. The usefulness of the proposal in psychometric applications that use the factor model is illustrated with an empirical example.

  11. The 20 item prosopagnosia index (PI20): relationship with the Glasgow face-matching test.

    Science.gov (United States)

    Shah, Punit; Sowden, Sophie; Gaule, Anne; Catmur, Caroline; Bird, Geoffrey

    2015-11-01

    The 20 item prosopagnosia index (PI20) was recently developed to identify individuals with developmental prosopagnosia. While the PI20's principal purpose is to aid researchers and clinicians, it was suggested that it may serve as a useful screening tool to identify people with face recognition difficulties in applied settings where face matching is a critical part of their occupation. Although the PI20 has been validated using behavioural measures of face recognition, it has yet to be validated against a measure of face-matching ability that is more representative of applied settings. In this study, the PI20 was therefore administered with the Glasgow face-matching test (GFMT). A strong correlation was observed between PI20 and GFMT scores, providing further validation for the PI20, indicating that it is likely to be of value in applied settings.

  12. Determination of radionuclides in environmental test items at CPHR: traceability and uncertainty calculation.

    Science.gov (United States)

    Carrazana González, J; Fernández, I M; Capote Ferrera, E; Rodríguez Castro, G

    2008-11-01

    Information about how the laboratory of Centro de Protección e Higiene de las Radiaciones (CPHR), Cuba establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten-spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed.

  13. Determination of radionuclides in environmental test items at CPHR: Traceability and uncertainty calculation

    Energy Technology Data Exchange (ETDEWEB)

    Carrazana Gonzalez, J. [Centro de Proteccion e Higiene de las Radiaciones, P.O. Box 6195, La Habana (Cuba)], E-mail: carrazana@cphr.edu.cu; Fernandez, I.M.; Capote Ferrera, E.; Rodriguez Castro, G. [Centro de Proteccion e Higiene de las Radiaciones, P.O. Box 6195, La Habana (Cuba)

    2008-11-15

    Information about how the laboratory of Centro de Proteccion e Higiene de las Radiaciones (CPHR), Cuba establishes its traceability to the International System of Units for the measurement of radionuclides in environmental test items is presented. A comparison among different methodologies of uncertainty calculation, including an analysis of the feasibility of using the Kragten-spreadsheet approach, is shown. In the specific case of the gamma spectrometric assay, the influence of each parameter, and the identification of the major contributor, in the relative difference between the methods of uncertainty calculation (Kragten and partial derivative) is described. The reliability of the uncertainty calculation results reported by the commercial software Gamma 2000 from Silena is analyzed.

  14. Developing a Numerical Ability Test for Students of Education in Jordan: An Application of Item Response Theory

    Science.gov (United States)

    Abed, Eman Rasmi; Al-Absi, Mohammad Mustafa; Abu shindi, Yousef Abdelqader

    2016-01-01

    The purpose of the present study was to develop a test to measure the numerical ability of students of education. The sample of the study consisted of 504 students from 8 universities in Jordan. The final draft of the test contains 45 items distributed among 5 dimensions. The results revealed acceptable psychometric properties of the test;…

  15. Variance Difference between Maximum Likelihood Estimation Method and Expected A Posteriori Estimation Method Viewed from Number of Test Items

    Science.gov (United States)

    Mahmud, Jumailiyah; Sutikno, Muzayanah; Naga, Dali S.

    2016-01-01

    The aim of this study is to determine the variance difference between the maximum likelihood and expected a posteriori estimation methods viewed from the number of test items of an aptitude test. The variance represents the accuracy achieved by both the maximum likelihood and Bayes estimation methods. The test consists of three subtests, each with 40 multiple-choice…

  16. The 15-item version of the Boston Naming Test as an index of English proficiency.

    Science.gov (United States)

    Erdodi, Laszlo A; Jongsma, Katherine A; Issa, Meriam

    2017-01-01

    The present study was designed to examine the potential of the Boston Naming Test - Short Form (BNT-15) to provide an objective estimate of English proficiency. A secondary goal was to examine the effect of limited English proficiency (LEP) on neuropsychological test performance. A brief battery of neuropsychological tests was administered to 79 bilingual participants (40.5% male, mean age = 26.9, mean education = 14.2 years). The majority (n = 56) were English dominant (EN), and the rest were Arabic dominant (AR). The BNT-15 was further reduced to 10 items that best discriminated between EN and AR (BNT-10). Participants were divided into low, intermediate, and high English proficiency subsamples based on BNT-10 scores (≤6, 7-8, and ≥9). Performance across groups was compared on neuropsychological tests with high and low verbal mediation. The BNT-15 and BNT-10 respectively correctly identified 89 and 90% of EN and AR participants. Level of English proficiency had a large effect (partial η² = .12-.34; Cohen's d = .67-1.59) on tests with high verbal mediation (animal fluency, sentence comprehension, word reading), but no effect on tests with low verbal mediation (auditory consonant trigrams, clock drawing, digit-symbol substitution). The BNT-15 and BNT-10 can function as indices of English proficiency and predict the deleterious effect of LEP on neuropsychological tests with high verbal mediation. Interpreting low scores on such measures as evidence of impairment in examinees with LEP would likely overestimate deficits.

  17. Performance of Certification and Recertification Examinees on Multiple Choice Test Items: Does Physician Age Have an Impact?

    Science.gov (United States)

    Shen, Linjun; Juul, Dorthea; Faulkner, Larry R

    2016-01-01

    The development of recertification programs (now referred to as Maintenance of Certification or MOC) by the members of the American Board of Medical Specialties provides the opportunity to study knowledge base across the professional lifespan of physicians. Research results to date are mixed with some studies finding negative associations between age and various measures of competency and others finding no or minimal relationships. Four groups of multiple choice test items that were independently developed for certification and MOC examinations in psychiatry and neurology were administered to certification and MOC examinees within each specialty. Percent correct scores were calculated for each examinee. Differences between certification and MOC examinees were compared using unpaired t tests, and logistic regression was used to compare MOC and certification examinee performance on the common test items. Except for the neurology certification test items that addressed basic neurology concepts, the performance of the certification and MOC examinees was similar. The differences in performance on individual test items did not consistently favor one group or the other and could not be attributed to any distinguishable content or format characteristics of those items. The findings of this study are encouraging in that physicians who had recently completed residency training possessed clinical knowledge that was comparable to that of experienced physicians, and the experienced physicians' clinical knowledge was equivalent to that of recent residency graduates. The role testing can play in enhancing expertise is described.

  18. [Relationship between recognition judgments and confidence ratings for repeated test items].

    Science.gov (United States)

    Takahashi, Akira

    2008-12-01

    Eighty-nine participants performed a set of recognition judgment and confidence rating tasks twice. Half of the new items presented in the second task had already been presented as old items to the participants in the first task. Analysis of the second-task responses showed a positive correlation between confidence and performance for the old items, and a negative correlation for the new items. In particular, a strong negative correlation was observed when items presented in the first task were presented as "new" in the second task. This negative correlation reflects a "source monitoring error," whereby the participants falsely recognized the items from the first task as those presented in the second, because they were unaware of making source misattributions.

  19. Specificity data for the b Test, Dot Counting Test, Rey-15 Item Plus Recognition, and Rey Word Recognition Test in monolingual Spanish-speakers.

    Science.gov (United States)

    Robles, Luz; López, Enrique; Salazar, Xavier; Boone, Kyle B; Glaser, Debra F

    2015-01-01

    The current study provides specificity data on a large sample (n = 115) of young to middle-aged, male, monolingual Spanish speakers of lower educational level and low acculturation to mainstream US culture for four neurocognitive performance validity tests (PVTs): the Dot Counting, the b Test, Rey Word Recognition, and Rey 15-Item Plus Recognition. Individuals with 0 to 6 years of education performed more poorly than did participants with 7 to 10 years of education on several Rey 15-Item scores (combination equation, recall intrusion errors, and recognition false positives), Rey Word Recognition total correct, and E-score and omission errors on the b Test, but no effect of educational level was observed for Dot Counting Test scores. Cutoff scores are provided that maintain approximately 90% specificity for the education subgroups separately. Some of these cutoffs match, or are even more stringent than, those recommended for use in US test takers who are primarily Caucasian, are tested in English, and have a higher educational level (i.e., Rey Word Recognition correct false-positive errors; Rey 15-Item recall intrusions and recognition false-positive errors; b Test total time; and Dot Counting E-score and grouped dot counting time). Thus, performance on these PVT variables in particular appears relatively robust to cultural/language/educational factors.

  20. Developing multiple-choices test items as tools for measuring the scientific-generic skills on solar system

    Science.gov (United States)

    Bhakti, Satria Seto; Samsudin, Achmad; Chandra, Didi Teguh; Siahaan, Parsaoran

    2017-05-01

    The aim of this research is to develop multiple-choice test items as tools for measuring scientific generic skills on the solar system. To achieve this aim, the researchers used the ADDIE model, consisting of Analyzing, Design, Development, Implementation, and Evaluation, as the research method. The scientific generic skills were limited to five indicators: (1) indirect observation, (2) awareness of scale, (3) logical inference, (4) causal relation, and (5) mathematical modeling. The participants were 32 students at a junior high school in Bandung. The results showed that the constructed multiple-choice test items were declared valid by the expert validator and, after testing, the developed multiple-choice test items proved able to measure scientific generic skills on the solar system.

  1. Item and test analysis to identify quality multiple choice questions (MCQs) from an assessment of medical students of Ahmedabad, Gujarat

    Directory of Open Access Journals (Sweden)

    Sanju Gajjar

    2014-01-01

    Full Text Available Background: Multiple choice questions (MCQs) are frequently used to assess students in different educational streams for their objectivity and wide reach of coverage in less time. However, the MCQs to be used must be of quality, which depends upon the difficulty index (DIF I), discrimination index (DI) and distracter efficiency (DE). Objective: To evaluate MCQs or items and develop a pool of valid items by assessing them with DIF I, DI and DE, and also to revise/store or discard items based on the obtained results. Settings: The study was conducted in a medical school of Ahmedabad. Materials and Methods: An internal examination in Community Medicine was conducted after 40 hours of teaching during 1st MBBS, which was attended by 148 out of 150 students. A total of 50 MCQs or items and 150 distractors were analyzed. Statistical Analysis: Data were entered and analyzed in MS Excel 2007; simple proportions, means, standard deviations and coefficients of variation were calculated, and the unpaired t test was applied. Results: Out of 50 items, 24 had "good to excellent" DIF I (31-60%) and 15 had "good to excellent" DI (> 0.25). Mean DE was 88.6%, considered ideal/acceptable, and non-functional distractors (NFDs) were only 11.4%. Mean DI was 0.14. Poor DI (< 0.15), with negative DI in 10 items, indicates poor preparedness of students and some issues with the framing of at least some of the MCQs. An increased proportion of NFDs (incorrect alternatives selected by < 5% of students) in an item decreases its DE and makes it easier. There were 15 items with 17 NFDs, while the rest of the items did not have any NFD, with a mean DE of 100%. Conclusion: The study emphasizes the selection of quality MCQs which truly assess the knowledge and are able to differentiate students of different abilities in the correct manner.
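
The three classical indices the study reports can be sketched in Python; the 27% group split is a common convention, and the scores, counts and thresholds below are illustrative toy data, not the study's:

```python
# Sketch of classical item analysis: difficulty index (DIF I),
# discrimination index (DI) and distracter efficiency (DE).

def item_indices(total_scores, item_correct, group_frac=0.27):
    """total_scores: per-student test totals; item_correct: 0/1 per student."""
    n = len(total_scores)
    # DIF I: percentage of students answering the item correctly.
    dif_i = 100.0 * sum(item_correct) / n
    # Rank students by total score; compare the upper vs. lower groups.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, round(n * group_frac))
    upper = sum(item_correct[i] for i in order[:k])
    lower = sum(item_correct[i] for i in order[-k:])
    di = (upper - lower) / k   # DI ranges over [-1, 1]
    return dif_i, di

def distracter_efficiency(distracter_counts, n_students, nfd_cutoff=0.05):
    """A distracter picked by < 5% of students is non-functional (NFD)."""
    nfd = sum(1 for c in distracter_counts if c / n_students < nfd_cutoff)
    return 100.0 * (len(distracter_counts) - nfd) / len(distracter_counts)

totals = [10, 9, 8, 8, 7, 6, 5, 5, 4, 2]   # toy totals for 10 students
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]      # toy responses to one item
print(item_indices(totals, item))           # (50.0, 1.0)
print(round(distracter_efficiency([20, 10, 2], 148), 1))  # 66.7 (one NFD)
```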

  2. Teachers' Use of Test-Item Banks for Student Assessment in North Carolina Secondary Agricultural Education Programs

    Science.gov (United States)

    Marshall, Joy Morgan

    2014-01-01

    Higher expectations are on all parties to ensure students successfully perform on standardized tests. Specifically in North Carolina agriculture classes, students are given a CTE Post Assessment to measure knowledge gained and proficiency. Prior to students taking the CTE Post Assessment, teachers have access to a test item bank system that…

  3. Set of Criteria for Efficiency of the Process Forming the Answers to Multiple-Choice Test Items

    Science.gov (United States)

    Rybanov, Alexander Aleksandrovich

    2013-01-01

    A set of criteria is offered for assessing the efficiency of the process of forming answers to multiple-choice test items. To increase the accuracy of computer-assisted testing results, it is suggested to assess the dynamics of the process of forming the final answer using the following factors: a loss-of-time factor and a correct-choice factor. The model…

  5. An Exploratory Study of the Applicability of Item Response Theory Methods to the Graduate Management Admission Test.

    Science.gov (United States)

    Kingston, Neal; And Others

    A necessary prerequisite to the operational use of item response theory (IRT) in any testing program is the investigation of the feasibility of such an approach. This report presents the results of such research for the Graduate Management Admission Test (GMAT). Despite the fact that GMAT data appear to violate a basic assumption of the…

  6. Using Necessary Information to Identify Item Dependence in Passage-Based Reading Comprehension Tests

    Science.gov (United States)

    Baldonado, Angela Argo; Svetina, Dubravka; Gorin, Joanna

    2015-01-01

    Applications of traditional unidimensional item response theory models to passage-based reading comprehension assessment data have been criticized based on potential violations of local independence. However, simple rules for determining dependency, such as including all items associated with a particular passage, may overestimate the dependency…

  7. Multiple-choice versus open-ended response formats of reading test items: A two-dimensional IRT analysis

    Directory of Open Access Journals (Sweden)

    Dominique P. Rauch

    2010-12-01

    Full Text Available The dimensionality of a reading comprehension assessment with non-stem-equivalent multiple-choice (MC) items and open-ended (OE) items was analyzed with German test data from 8523 9th-graders. We found that a two-dimensional IRT model with within-item multidimensionality, where MC and OE items load on a general latent dimension and OE items additionally load on a nested latent dimension, had a superior fit compared to a unidimensional model (p ≤ .05). Correlations of general cognitive abilities, orthography and vocabulary with the general latent dimension were significantly higher than with the nested latent dimension (p ≤ .05). Drawing on experimental studies of the effect of item format on reading processes, we suppose that the general latent dimension measures abilities necessary to master basic reading processes and the nested latent dimension captures abilities necessary to master higher reading processes. Including gender, language spoken at home, and school track as predictors in latent regression models showed that the well-known advantage of girls and mother-tongue students is found only for the nested latent dimension.

  8. The psychometric properties of the "Reading the Mind in the Eyes" Test: an item response theory (IRT) analysis.

    Science.gov (United States)

    Preti, Antonio; Vellante, Marcello; Petretto, Donatella R

    2017-05-01

    The "Reading the Mind in the Eyes" Test (hereafter: Eyes Test) is considered an advanced task of the Theory of Mind aimed at assessing the performance of the participant in perspective-taking, that is, the ability to sense or understand other people's cognitive and emotional states. In this study, item response theory analysis was applied to the adult version of the Eyes Test. The Italian version of the Eyes Test was administered to 200 undergraduate students of both genders (males = 46%). Modified parallel analysis (MPA) was used to test unidimensionality. Marginal maximum likelihood estimation was used to fit the 1-, 2-, and 3-parameter logistic (PL) models to the data. Differential Item Functioning (DIF) due to gender was explored with five independent methods. MPA provided evidence in favour of unidimensionality. The Rasch model (1-PL) was superior to the other two models in explaining participants' responses to the Eyes Test. There was no robust evidence of gender-related DIF in the Eyes Test, although some differences may exist for some items as a reflection of real differences by group. The study results support a one-factor model of the Eyes Test. Performance on the Eyes Test is defined by the participant's ability in perspective-taking. Researchers should cease using arbitrarily selected subscores in assessing the performance of participants on the Eyes Test. Lack of gender-related DIF favours the use of the Eyes Test in the investigation of gender differences concerning empathy and social cognition.
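
The Rasch (1-PL) model favoured by the analysis above has a simple closed form: the probability of a correct response depends only on the difference between ability and item difficulty. A minimal sketch (function names are ours):

```python
# Sketch of the Rasch (1-PL) item response model and the per-pattern
# log-likelihood that marginal maximum likelihood estimation builds on.
import math

def rasch_p(theta, b):
    """P(correct) for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    ll = 0.0
    for b, x in zip(difficulties, responses):
        p = rasch_p(theta, b)
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll

# An examinee matched to an item of equal difficulty answers it
# correctly with probability exactly 0.5.
print(rasch_p(0.0, 0.0))  # 0.5
```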

  9. Developing and testing items for the South African Personality Inventory (SAPI)

    Directory of Open Access Journals (Sweden)

    Carin Hill

    2013-01-01

    Full Text Available Orientation: A multicultural country like South Africa needs fair cross-cultural psychometric instruments. Research purpose: This article reports on the process of identifying items for, and provides a quantitative evaluation of, the South African Personality Inventory (SAPI) items. Motivation for the study: The study intended to develop an indigenous and psychometrically sound personality instrument that adheres to the requirements of South African legislation and excludes cultural bias. Research design, approach and method: The authors used a cross-sectional design. They measured the nine SAPI clusters identified in the qualitative stage of the SAPI project in 11 separate quantitative studies. Convenience sampling yielded 6735 participants. Statistical analysis focused on the construct validity and reliability of items. The authors eliminated items that showed poor performance, based on common psychometric criteria, and selected the best-performing items to form part of the final version of the SAPI. Main findings: The authors developed 2573 items from the nine SAPI clusters. Of these, 2268 items were valid and reliable representations of the SAPI facets. Practical/managerial implications: The authors developed a large item pool that measures personality in South Africa and that researchers can refine for the SAPI. Furthermore, the project illustrates an approach that researchers can use in projects that aim to develop culturally informed psychological measures. Contribution/value-add: Personality assessment is important for recruiting, selecting and developing employees. This study contributes to current knowledge about the early processes researchers follow when they develop a personality instrument that measures personality fairly across different cultural groups, as the SAPI does.

  10. Development of an item bank and computer adaptive test for role functioning

    DEFF Research Database (Denmark)

    Anatchkova, Milena D; Rose, Matthias; Ware, John E

    2012-01-01

    Role functioning (RF) is a key component of health and well-being and an important outcome in health research. The aim of this study was to develop an item bank to measure the impact of health on role functioning.

  11. Investigating Linguistic Sources of Differential Item Functioning Using Expert Think-Aloud Protocols in Science Achievement Tests

    Science.gov (United States)

    Roth, Wolff-Michael; Oliveri, Maria Elena; Dallie Sandilands, Debra; Lyons-Thomas, Juliette; Ercikan, Kadriye

    2013-03-01

    Even if national and international assessments are designed to be comparable, subsequent psychometric analyses often reveal differential item functioning (DIF). Central to achieving comparability is examining the presence of DIF and, if DIF is found, investigating its sources to ensure that differentially functioning items do not lead to bias. In this study, sources of DIF were examined using think-aloud protocols. Think-aloud protocols with expert reviewers were conducted to compare the English and French versions of 40 items previously identified as DIF (N = 20) and non-DIF (N = 20). Three highly trained and experienced experts in verifying and accepting/rejecting multilingual versions of curriculum and testing materials for government purposes participated in this study. Although there was a considerable amount of agreement in the identification of differentially functioning items, experts did not consistently identify and distinguish DIF and non-DIF items. Our analyses of the think-aloud protocols identified particular linguistic, general pedagogical, content-related, and cognitive factors as sources of DIF. Implications are provided for the process of identifying DIF prior to the actual administration of tests at national and international levels.

  12. Differential Item Functioning in While-Listening Performance Tests: The Case of the International English Language Testing System (IELTS) Listening Module

    Science.gov (United States)

    Aryadoust, Vahid

    2012-01-01

    This article investigates a version of the International English Language Testing System (IELTS) listening test for evidence of differential item functioning (DIF) based on gender, nationality, age, and degree of previous exposure to the test. Overall, the listening construct was found to be underrepresented, which is probably an important cause…

  13. A Multidimensional Partial Credit Model with Associated Item and Test Statistics: An Application to Mixed-Format Tests

    Science.gov (United States)

    Yao, Lihua; Schwarz, Richard D.

    2006-01-01

    Multidimensional item response theory (IRT) models have been proposed for better understanding the dimensional structure of data or to define diagnostic profiles of student learning. A compensatory multidimensional two-parameter partial credit model (M-2PPC) for constructed-response items is presented that is a generalization of those proposed to…

  14. Development and Reliability of Items Measuring the Nonmedical Use of Prescription Drugs for the Youth Risk Behavior Survey: Results From an Initial Pilot Test

    Science.gov (United States)

    Howard, Melissa M.; Weiler, Robert M.; Haddox, J. David

    2009-01-01

    Background: The purpose of this study was to develop and test the reliability of self-report survey items designed to monitor the nonmedical use of prescription drugs among adolescents. Methods: Eighteen nonmedical prescription drug items designed to be congruent with the substance abuse items in the US Centers for Disease Control and Prevention's…

  15. Probabilistic Approaches to Examining Linguistic Features of Test Items and Their Effect on the Performance of English Language Learners

    Science.gov (United States)

    Solano-Flores, Guillermo

    2014-01-01

    This article addresses validity and fairness in the testing of English language learners (ELLs)--students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that…

  16. The EORTC emotional functioning computerized adaptive test: phases I-III of a cross-cultural item bank development

    NARCIS (Netherlands)

    Gamper, E.M.; Groenvold, M.; Petersen, M.; Young, T.; Constantini, A.; Aaronson, N.; Giesinger, J.; Meraner, V.; Kemmler, G.; Holzner, B.

    2014-01-01

    Background: The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is currently developing computerized adaptive testing measures for the Quality of Life Questionnaire Core-30 (QLQ-C30) scales. The work presented here describes the development of an EORTC item bank for emotional functioning.

  17. Examining the Stability of the 7-Item Social Physique Anxiety Scale Using a Test-Retest Method

    Science.gov (United States)

    Scott, Lisa A.; Burke, Kevin L.; Joyner, A. Barry; Brand, Jennifer S.

    2004-01-01

    This study examined the stability of the 7-item Social Physique Anxiety Scale (SPAS-7) using a test-retest method. Collegiate undergraduate students (N = 201) completed two administrations of the SPAS-7, with a 14-day separation between the administrations. The scale was administered either at the beginning or end of the physical activity class.…
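
Test-retest stability of this kind is typically summarized by the Pearson correlation between the two administrations. A minimal sketch using hypothetical scores, not the study's data:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between paired score lists (e.g., time 1 vs. time 2)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical SPAS-7 totals for five respondents at the two administrations.
time1 = [20, 25, 30, 22, 28]
time2 = [21, 24, 31, 20, 29]
print(round(pearson_r(time1, time2), 3))  # → 0.963
```

A high correlation (conventionally around .70 or above) is taken as evidence of score stability over the retest interval.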

  18. Differential Item Functioning Assessment in Cognitive Diagnostic Modeling: Application of the Wald Test to Investigate DIF in the DINA Model

    Science.gov (United States)

    Hou, Likun; de la Torre, Jimmy; Nandakumar, Ratna

    2014-01-01

    Analyzing examinees' responses using cognitive diagnostic models (CDMs) has the advantage of providing diagnostic information. To ensure the validity of the results from these models, differential item functioning (DIF) in CDMs needs to be investigated. In this article, the Wald test is proposed to examine DIF in the context of CDMs. This study…
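
For a single scalar item parameter estimated separately in two groups, the Wald statistic takes a particularly simple form; the CDM application above uses a vector of item parameters and their covariance matrix, which this hypothetical sketch deliberately simplifies:

```python
def wald_statistic(est_ref, se_ref, est_focal, se_focal):
    """Wald chi-square (1 df) for H0: the parameter is equal across groups,
    assuming independent estimates with the given standard errors."""
    diff = est_ref - est_focal
    return diff ** 2 / (se_ref ** 2 + se_focal ** 2)

# Hypothetical estimates of one item parameter in reference vs. focal group.
w = wald_statistic(0.20, 0.02, 0.26, 0.02)
# Compare against the chi-square(1) critical value, 3.841 at alpha = .05.
print(round(w, 2), w > 3.841)  # → 4.5 True
```

Rejecting H0 flags the item as potentially functioning differently across groups; the full vector-valued test aggregates such differences over all of an item's parameters.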

  19. Comparison of the Air Force Officer Qualifying Test Form T and Form S: Initial Item- and Subtest-Level Analyses

    Science.gov (United States)

    2017-03-15

    Technical report presenting initial item- and subtest-level analyses comparing the Air Force Officer Qualifying Test Form T and Form S. Prepared by Imelda D. Aguilar, Air Force Personnel Center, Strategic Research and Assessment Branch (SRAB), for Laura G. Barron, Ph.D., March 2017.

  20. The EORTC emotional functioning computerized adaptive test: phases I-III of a cross-cultural item bank development

    NARCIS (Netherlands)

    E.M. Gamper; M. Groenvold; M. Petersen; T. Young; A. Constantini; N. Aaronson; J. Giesinger; V. Meraner; G. Kemmler; B. Holzner

    2013-01-01

    Background: The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is currently developing computerized adaptive testing measures for the Quality of Life Questionnaire Core-30 (QLQ-C30) scales. The work presented here describes the development of an EORTC item bank for emotional functioning.

  1. Assessment of chromium(VI) release from 848 jewellery items by use of a diphenylcarbazide spot test

    DEFF Research Database (Denmark)

    Bregnbak, David; Johansen, Jeanne D.; Hamann, Dathan;

    2016-01-01

    We recently evaluated and validated a diphenylcarbazide (DPC)-based screening spot test that can detect the release of chromium(VI) ions (≥0.5 ppm) from various metallic items and leather goods (1). We then screened a selection of metal screws, leather shoes, and gloves, as well as 50 earrings...

  2. An Analysis of Cross Racial Identity Scale Scores Using Classical Test Theory and Rasch Item Response Models

    Science.gov (United States)

    Sussman, Joshua; Beaujean, A. Alexander; Worrell, Frank C.; Watson, Stevie

    2013-01-01

    Item response models (IRMs) were used to analyze Cross Racial Identity Scale (CRIS) scores. Rasch analysis scores were compared with classical test theory (CTT) scores. The partial credit model demonstrated a high goodness of fit and correlations between Rasch and CTT scores ranged from 0.91 to 0.99. CRIS scores are supported by both methods.…

  3. Probabilistic Approaches to Examining Linguistic Features of Test Items and Their Effect on the Performance of English Language Learners

    Science.gov (United States)

    Solano-Flores, Guillermo

    2014-01-01

    This article addresses validity and fairness in the testing of English language learners (ELLs)--students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that…

  4. Evaluating the Wald Test for Item-Level Comparison of Saturated and Reduced Models in Cognitive Diagnosis

    Science.gov (United States)

    de la Torre, Jimmy; Lee, Young-Sun

    2013-01-01

    This article used the Wald test to evaluate the item-level fit of a saturated cognitive diagnosis model (CDM) relative to the fits of the reduced models it subsumes. A simulation study was carried out to examine the Type I error and power of the Wald test in the context of the G-DINA model. Results show that when the sample size is small and a…

  6. Psychometric evaluation of an item bank for computerized adaptive testing of the EORTC QLQ-C30 cognitive functioning dimension in cancer patients.

    Science.gov (United States)

    Dirven, Linda; Groenvold, Mogens; Taphoorn, Martin J B; Conroy, Thierry; Tomaszewski, Krzysztof A; Young, Teresa; Petersen, Morten Aa

    2017-07-13

    The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Group is developing computerized adaptive testing (CAT) versions of all EORTC Quality of Life Questionnaire (QLQ-C30) scales with the aim of enhancing measurement precision. Here we present the results of the field-testing and psychometric evaluation of the item bank for cognitive functioning (CF). In previous phases (I-III), 44 candidate items were developed measuring CF in cancer patients. In phase IV, these items were psychometrically evaluated in a large sample of international cancer patients. This evaluation included an assessment of dimensionality, fit to the item response theory (IRT) model, differential item functioning (DIF), and measurement properties. A total of 1030 cancer patients completed the 44 candidate items on CF. Of these, 34 items could be included in a unidimensional IRT model, showing an acceptable fit. Although several items showed DIF, these had a negligible impact on CF estimation. Measurement precision of the item bank was much higher than that of the two original QLQ-C30 CF items alone, across the whole continuum. Moreover, CAT measurement may on average reduce study sample sizes by about 35-40% compared to the original QLQ-C30 CF scale, without loss of power. A CF item bank for CAT measurement consisting of 34 items was established, applicable to various cancer patients across countries. This CAT measurement system will facilitate precise and efficient assessment of the HRQOL of cancer patients, without loss of comparability of results.
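
The efficiency gain described above comes from the core CAT loop: at each step, administer the item that is most informative at the current ability estimate. A minimal sketch of that selection rule under a 2-PL model, with made-up item parameters (a production CAT engine would add exposure control, content balancing, and stopping rules):

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2-PL item at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta, items, administered):
    """Pick the unadministered item with maximum information at theta."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta, *items[i]))

# Hypothetical item bank: (discrimination a, difficulty b) per item.
items = [(1.0, -2.0), (1.5, 0.0), (0.8, 2.0)]
print(next_item(theta=0.1, items=items, administered=set()))  # → 1
```

Here the moderately difficult, highly discriminating item 1 wins for an examinee near theta = 0; after it is administered and theta is re-estimated, the loop repeats.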

  7. A 6-item scale for overall, emotional and social loneliness: confirmatory tests on survey data

    NARCIS (Netherlands)

    de Jong Gierveld, J.; van Tilburg, T.

    2006-01-01

    Loneliness is an indicator of social well-being and pertains to the feeling of missing an intimate relationship (emotional loneliness) or missing a wider social network (social loneliness). The 11-item De Jong Gierveld Loneliness Scale has proved to be a valid and reliable measurement instrument for

  8. Sampling of Common Items: An Unrecognized Source of Error in Test Equating. CSE Report 636

    Science.gov (United States)

    Michaelides, Michalis P.; Haertel, Edward H.

    2004-01-01

    There is variability in the estimation of an equating transformation because common-item parameters are obtained from responses of samples of examinees. The most commonly used standard error of equating quantifies this source of sampling error, which decreases as the sample size of examinees used to derive the transformation increases. In a…

  9. Anatomy of a physics test: Validation of the physics items on the Texas Assessment of Knowledge and Skills

    Directory of Open Access Journals (Sweden)

    Jill A. Marshall

    2009-03-01

    Full Text Available We report the results of an analysis of the Texas Assessment of Knowledge and Skills (TAKS designed to determine whether the TAKS is a valid indicator of whether students know and can do physics at the level necessary for success in future coursework, STEM careers, and life in a technological society. We categorized science items from the 2003 and 2004 10th and 11th grade TAKS by content area(s covered, knowledge and skills required to select the correct answer, and overall quality. We also analyzed a 5000 student sample of item-level results from the 2004 11th grade exam, performing full-information factor analysis, calculating classical test indices, and determining each item's response curve using item response theory. Triangulation of our results revealed strengths and weaknesses of the different methods of analysis. The TAKS was found to be only weakly indicative of physics preparation and we make recommendations for increasing the validity of standardized physics testing.

  10. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination.

    Science.gov (United States)

    Koh, Bongyeun; Hong, Sunggi; Kim, Soon-Sim; Hyun, Jin-Sook; Baek, Milye; Moon, Jundong; Kwon, Hayran; Kim, Gyoungyong; Min, Seonggi; Kang, Gu-Hyun

    2016-01-01

    The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE), which requires examinees to select items randomly. The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01), as did 4 of the 5 items on the advanced skills test (P<0.05). In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty indices (P<0.01), as did all 3 of the advanced skills test items (P<0.01). In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.
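
In classical terms, the difficulty index being compared here is simply the proportion of examinees who answer an item correctly, so higher values mean easier items. A minimal sketch with made-up response data:

```python
def difficulty_index(responses):
    """Classical difficulty index: proportion correct (higher = easier)."""
    return sum(responses) / len(responses)

# Hypothetical 0/1 scores of ten examinees on one skills-test item.
item_scores = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
print(difficulty_index(item_scores))  # → 0.7
```

Comparing these proportions across the randomly assigned items is what reveals the unequal difficulty the study reports.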

  11. Effect of Adjusting Pseudo-Guessing Parameter Estimates on Test Scaling When Item Parameter Drift Is Present

    Directory of Open Access Journals (Sweden)

    Kyung T. Han

    2015-07-01

    Full Text Available In item response theory test scaling/equating with the three-parameter model, the scaling coefficients A and B have no impact on the c-parameter estimates of the test items, since the c-parameter estimates are not adjusted in the scaling/equating procedure. The main research question in this study concerned how serious the consequences would be if c-parameter estimates were not adjusted in the test equating procedure when item-parameter drift (IPD) is present. This drift is commonly observed in equating studies and hence has been the source of considerable research. The results from a series of Monte Carlo simulation studies conducted under 32 different combinations of conditions showed that some calibration strategies in the study, where the c-parameters were adjusted to be identical across two test forms, resulted in more robust equating performance in the presence of IPD. This paper discusses the practical effectiveness and the theoretical importance of appropriately adjusting c-parameter estimates in equating.
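
For context, a common linear scaling transformation rescales the discrimination and difficulty parameters with the coefficients A and B while, as the abstract notes, conventionally leaving the c (pseudo-guessing) parameter unchanged. A sketch with illustrative parameter values:

```python
def scale_3pl(a, b, c, A, B):
    """Put 3-PL item parameters (a, b, c) on the target scale:
    a* = a / A, b* = A * b + B, c* = c (conventionally unadjusted)."""
    return a / A, A * b + B, c

# Hypothetical item and scaling coefficients.
a_new, b_new, c_new = scale_3pl(a=1.2, b=0.5, c=0.2, A=1.1, B=-0.3)
print(round(a_new, 4), round(b_new, 2), c_new)  # → 1.0909 0.25 0.2
```

The study's question is precisely whether leaving c* = c, as in this conventional transformation, remains defensible when item-parameter drift is present.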

  12. Controlling Type I Error Rate in Evaluating Differential Item Functioning for Four DIF Methods: Use of Three Procedures for Adjustment of Multiple Item Testing

    Science.gov (United States)

    Kim, Jihye

    2010-01-01

    In DIF studies, a Type I error refers to the mistake of identifying non-DIF items as DIF items, and a Type I error rate refers to the proportion of Type I errors in a simulation study. The possibility of making a Type I error in DIF studies is always present and high possibility of making such an error can weaken the validity of the assessment.…

  13. The 20-item prosopagnosia index (PI20): relationship with the Glasgow face-matching test

    OpenAIRE

    Shah, Punit; Sowden, Sophie; Gaule, Anne; Catmur, Caroline; Bird, Geoffrey

    2015-01-01

    The 20 item prosopagnosia index (PI20) was recently developed to identify individuals with developmental prosopagnosia. While the PI20's principal purpose is to aid researchers and clinicians, it was suggested that it may serve as a useful screening tool to identify people with face recognition difficulties in applied settings where face matching is a critical part of their occupation. Although the PI20 has been validated using behavioural measures of face recognition, it has yet to be valida...

  14. Realizing a Rasch measurement through instructionally-sequenced domains of test items.

    Science.gov (United States)

    Schulz, E. Matthew

    2016-11-01

    This paper presents results from a project in which instructionally sequenced domains were defined for purposes of constructing measures that conform to an ideal in Guttman scaling and Rasch measurement. A fundamental idea in these measurement systems is that every person higher on the measurement scale can do everything that lower-level persons can do, plus at least one more thing. This idea has had limited application in educational measurement due to the stochastic nature of item response data and the sheer number of items needed to obtain reliable measures. However, it has been shown by Schulz, Lee, and Mullen [1] that this ideal can be realized at a higher level of abstraction, when items within a content strand are aggregated into a small number of domains that are ordered in instructional timing and difficulty. The present paper shows how this was done, and the results, in an achievement-level setting project for the 2007 Grade 12 NAEP Economics Assessment.
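
The Guttman ideal described above implies that, with domains ordered from easiest to hardest, a response vector should never show a success after a failure. A minimal conformance check on hypothetical 0/1 data:

```python
def is_guttman_pattern(responses):
    """True if 0/1 responses, ordered easiest-to-hardest, form a run of
    successes followed only by failures (no 1 after a 0)."""
    seen_failure = False
    for r in responses:
        if r == 1 and seen_failure:
            return False
        if r == 0:
            seen_failure = True
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # → True
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # → False
```

With stochastic item-level data such perfect patterns are rare, which is why the paper aggregates items into ordered domains before applying the ideal.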

  15. Testing the ruler with item response theory: increasing precision of measurement for relationship satisfaction with the Couples Satisfaction Index.

    Science.gov (United States)

    Funk, Janette L; Rogge, Ronald D

    2007-12-01

    The present study took a critical look at a central construct in couples research: relationship satisfaction. Eight well-validated self-report measures of relationship satisfaction, including the Marital Adjustment Test (MAT; H. J. Locke & K. M. Wallace, 1959), the Dyadic Adjustment Scale (DAS; G. B. Spanier, 1976), and an additional 75 potential satisfaction items, were given to 5,315 online participants. Using item response theory, the authors demonstrated that the MAT and DAS provided relatively poor levels of precision in assessing satisfaction, particularly given the length of those scales. Principal-components analysis and item response theory applied to the larger item pool were used to develop the Couples Satisfaction Index (CSI) scales. Compared with the MAT and the DAS, the CSI scales were shown to have higher precision of measurement (less noise) and correspondingly greater power for detecting differences in levels of satisfaction. The CSI scales demonstrated strong convergent validity with other measures of satisfaction and excellent construct validity with anchor scales from the nomological net surrounding satisfaction, suggesting that they assess the same theoretical construct as do prior scales. Implications for research are discussed.

  16. The effect of Trier Social Stress Test (TSST) on item and associative recognition of words and pictures in healthy participants

    OpenAIRE

    Jonathan Guez; Rotem Saar-Ashkenazy; Eldad Keha; Chen Tiferet

    2016-01-01

    Psychological stress, induced by the Trier Social Stress Test (TSST), has repeatedly been shown to alter memory performance. Although factors influencing memory performance such as stimulus nature (verbal/pictorial) and emotional valence have been extensively studied, results on whether stress impairs or improves memory are still inconsistent. This study aimed at exploring the effect of the TSST on item versus associative memory for neutral, verbal, and pictorial stimuli. 48 healthy subjects were r...

  17. What Do You Think You Are Measuring? A Mixed-Methods Procedure for Assessing the Content Validity of Test Items and Theory-Based Scaling.

    Science.gov (United States)

    Koller, Ingrid; Levenson, Michael R; Glück, Judith

    2017-01-01

    The valid measurement of latent constructs is crucial for psychological research. Here, we present a mixed-methods procedure for improving the precision of construct definitions, determining the content validity of items, evaluating the representativeness of items for the target construct, generating test items, and analyzing items on a theoretical basis. To illustrate the mixed-methods content-scaling-structure (CSS) procedure, we analyze the Adult Self-Transcendence Inventory, a self-report measure of wisdom (ASTI, Levenson et al., 2005). A content-validity analysis of the ASTI items was used as the basis of psychometric analyses using multidimensional item response models (N = 1215). We found that the new procedure produced important suggestions concerning five subdimensions of the ASTI that were not identifiable using exploratory methods. The study shows that the application of the suggested procedure leads to a deeper understanding of latent constructs. It also demonstrates the advantages of theory-based item analysis.

  18. Generalization of the Lord-Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real-Number Item Scores

    Science.gov (United States)

    Kim, Seonghoon

    2013-01-01

    With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item…
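
For dichotomous items, the original Lord-Wingersky recursion builds the conditional number-correct distribution one item at a time; the article generalizes this to real-number item scores. A sketch of the dichotomous base case, using illustrative per-item success probabilities at a fixed proficiency:

```python
def lord_wingersky(probs):
    """Conditional distribution of the summed (number-correct) score,
    given each item's P(correct) at a fixed proficiency level."""
    dist = [1.0]  # P(score = 0) before any items are added
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)  # item answered incorrectly
            new[score + 1] += mass * p      # item answered correctly
        dist = new
    return dist

# Two hypothetical items, each with P(correct) = .5 at this proficiency.
print(lord_wingersky([0.5, 0.5]))  # → [0.25, 0.5, 0.25]
```

Each pass convolves the running score distribution with one more item, so the cost grows only linearly in the number of items; the generalized algorithm applies the same convolution to non-integer item scores.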

  19. Improving Content Assessment for English Language Learners: Studies of the Linguistic Modification of Test Items. Research Report. ETS RR-14-23

    Science.gov (United States)

    Young, John W.; King, Teresa C.; Hauck, Maurice Cogan; Ginsburgh, Mitchell; Kotloff, Lauren; Cabrera, Julio; Cavalie, Carlos

    2014-01-01

    This article describes two research studies conducted on the linguistic modification of test items from K-12 content assessments. In the first study, 120 linguistically modified test items in mathematics and science taken by fourth and sixth graders were found to have a wide range of outcomes for English language learners (ELLs) and non-ELLs, with…

  20. Enhanced Automatic Question Creator--EAQC: Concept, Development and Evaluation of an Automatic Test Item Creation Tool to Foster Modern e-Education

    Science.gov (United States)

    Gutl, Christian; Lankmayr, Klaus; Weinhofer, Joachim; Hofler, Margit

    2011-01-01

    Research in automated creation of test items for assessment purposes became increasingly important during the recent years. Due to automatic question creation it is possible to support personalized and self-directed learning activities by preparing appropriate and individualized test items quite easily with relatively little effort or even fully…

  1. Re-Fitting for a Different Purpose: A Case Study of Item Writer Practices in Adapting Source Texts for a Test of Academic Reading

    Science.gov (United States)

    Green, Anthony; Hawkey, Roger

    2012-01-01

    The important yet under-researched role of item writers in the selection and adaptation of texts for high-stakes reading tests is investigated through a case study involving a group of trained item writers working on the International English Language Testing System (IELTS). In the first phase of the study, participants were invited to reflect in…

  2. "Are vocabulary tests measurement invariant between age groups? An item response analysis of three popular tests": Correction to Fox, Berry, and Freeman (2014).

    Science.gov (United States)

    2016-08-01

    Reports an error in "Are vocabulary tests measurement invariant between age groups? An item response analysis of three popular tests" by Mark C. Fox, Jane M. Berry and Sara P. Freeman (Psychology and Aging, 2014[Dec], Vol 29[4], 925-938). In the article, unneeded zeros were inadvertently included at the beginnings of some numbers in Tables 1–4. In addition, the right column in Table 4 includes three unnecessary zeros after asterisks. (The following abstract of the original article appeared in record 2014-49140-001.) Relatively high vocabulary scores of older adults are generally interpreted as evidence that older adults possess more of a common ability than younger adults. Yet, this interpretation rests on empirical assumptions about the uniformity of item-response functions between groups. In this article, we test item response models of differential responding against datasets containing younger-, middle-aged-, and older-adult responses to three popular vocabulary tests (the Shipley, Ekstrom, and WAIS–R) to determine whether members of different age groups who achieve the same scores have the same probability of responding in the same categories (e.g., correct vs. incorrect) under the same conditions. Contrary to the null hypothesis of measurement invariance, datasets for all three tests exhibit substantial differential responding. Members of different age groups who achieve the same overall scores exhibit differing response probabilities in relation to the same items (differential item functioning) and appear to approach the tests in qualitatively different ways that generalize across items. Specifically, younger adults are more likely than older adults to leave items unanswered for partial credit on the Ekstrom, and to produce 2-point definitions on the WAIS–R. Yet, older adults score higher than younger adults, consistent with most reports of vocabulary outcomes in the cognitive aging literature. In light of these findings, the most generalizable

  3. Varying levels of difficulty index of skills-test items randomly selected by examinees on the Korean emergency medical technician licensing examination

    Directory of Open Access Journals (Sweden)

    Bongyeun Koh

    2016-01-01

    Full Text Available Purpose: The goal of this study was to characterize the difficulty index of the items in the skills test components of the class I and II Korean emergency medical technician licensing examination (KEMTLE, which requires examinees to select items randomly. Methods: The results of 1,309 class I KEMTLE examinations and 1,801 class II KEMTLE examinations in 2013 were subjected to analysis. Items from the basic and advanced skills test sections of the KEMTLE were compared to determine whether some were significantly more difficult than others. Results: In the class I KEMTLE, all 4 of the items on the basic skills test showed significant variation in difficulty index (P<0.01, as well as 4 of the 5 items on the advanced skills test (P<0.05. In the class II KEMTLE, 4 of the 5 items on the basic skills test showed significantly different difficulty index (P<0.01, as well as all 3 of the advanced skills test items (P<0.01. Conclusion: In the skills test components of the class I and II KEMTLE, the procedure in which examinees randomly select questions should be revised to require examinees to respond to a set of fixed items in order to improve the reliability of the national licensing examination.

  4. A Comparison of Item-Selection Methods for Adaptive Tests with Content

    NARCIS (Netherlands)

    van der Linden, Wim J.

    2005-01-01

    In test assembly, a fundamental difference exists between algorithms that select a test sequentially or simultaneously. Sequential assembly allows us to optimize an objective function at the examinee's ability estimate, such as the test information function in computerized adaptive testing. But it l

  5. Reliability and Levels of Difficulty of Objective Test Items in a Mathematics Achievement Test: A Study of Ten Senior Secondary Schools in Five Local Government Areas of Akure, Ondo State

    Science.gov (United States)

    Adebule, S. O.

    2009-01-01

    This study examined the reliability and difficulty indices of Multiple Choice (MC) and True or False (TF) types of objective test items in a Mathematics Achievement Test (MAT). The instruments used were two variants of a 50-item Mathematics Achievement Test based on the multiple-choice and true-or-false test formats. A total of five hundred (500)…

  6. Performance of Accounting students on the Enade/2012 test: an application of the Item-Response Theory

    Directory of Open Access Journals (Sweden)

    Raphael Vinicius Weigert Camargo

    2016-08-01

    Full Text Available The objective of this study was to measure Accounting students’ performance (proficiency) on the Enade test using Item Response Theory (IRT). The students’ performance was measured using the three-parameter logistic model (3PL), based on data related to the Enade test/2012, taken from the website of the National Institute for Educational Studies and Research Anísio Teixeira (Inep), concerning 47,098 students. Through the scale, three levels of student performance could be distinguished. Level 1 students master the reading and interpretation of texts and quantitative reasoning. In addition, Level 2 students should present logical reasoning and a systemic and holistic perspective. Furthermore, at Level 3, students should present interdisciplinary knowledge, covering accounting contents, critical-analytic skills and practical application of the content mastered. The results also showed that the items of the Enade test were very difficult for the group that took the test. Independently of the student characteristics analyzed, overall, the proficiency scores were very low. This result suggests that the HEIs need to take action and that public policies are needed that can contribute to improving the students’ performance.
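The three-parameter logistic model referenced above gives the probability of a correct response as c + (1 - c) / (1 + e^(-a(theta - b))), where c is the pseudo-guessing floor. A minimal sketch (parameter values invented for illustration):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response.
    a = discrimination, b = difficulty, c = pseudo-guessing lower asymptote."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A difficult item (b = 2.0): even an above-average examinee (theta = 1.0)
# succeeds barely above the guessing floor, mirroring a "very difficult test".
print(round(p_3pl(1.0, 1.5, 2.0, 0.2), 3))   # → 0.346
print(round(p_3pl(-1.0, 1.5, 2.0, 0.2), 3))  # → 0.209, close to the floor c = 0.2
```

When most item difficulties sit far above the ability distribution, estimated proficiencies cluster low, which is the pattern the abstract reports.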

  7. Testing whether the DSM-5 personality disorder trait model can be measured with a reduced set of items: An item response theory investigation of the Personality Inventory for DSM-5.

    Science.gov (United States)

    Maples, Jessica L; Carter, Nathan T; Few, Lauren R; Crego, Cristina; Gore, Whitney L; Samuel, Douglas B; Williamson, Rachel L; Lynam, Donald R; Widiger, Thomas A; Markon, Kristian E; Krueger, Robert F; Miller, Joshua D

    2015-12-01

    The fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) includes an alternative model of personality disorders (PDs) in Section III, consisting in part of a pathological personality trait model. To date, the 220-item Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012) is the only extant self-report instrument explicitly developed to measure this pathological trait model. The present study used item response theory-based analyses in a large sample (n = 1,417) to investigate whether a reduced set of 100 items could be identified from the PID-5 that could measure the 25 traits and 5 domains. This reduced set of PID-5 items was then tested in a community sample of adults currently receiving psychological treatment (n = 109). Across a wide range of criterion variables including NEO PI-R domains and facets, DSM-5 Section II PD scores, and externalizing and internalizing outcomes, the correlational profiles of the original and reduced versions of the PID-5 were nearly identical (rICC = .995). These results provide strong support for the hypothesis that an abbreviated set of PID-5 items can be used to reliably, validly, and efficiently assess these personality disorder traits. The ability to assess the DSM-5 Section III traits using only 100 items has important implications in that it suggests these traits could still be measured in settings in which assessment-related resources (e.g., time, compensation) are limited.

  8. What is the Ability Emotional Intelligence Test (MSCEIT) good for? An evaluation using item response theory

    National Research Council Canada - National Science Library

    Fiori, Marina; Antonietti, Jean-Philippe; Mikolajczak, Moira; Luminet, Olivier; Hansenne, Michel; Rossier, Jérôme

    2014-01-01

    ...). However, there is a scarcity of tests measuring EI as a form of intelligence. The Mayer-Salovey-Caruso Emotional Intelligence Test, or MSCEIT, is among the few available and the most widespread measure of EI as an ability...

  9. A test of the International Personality Item Pool representation of the Revised NEO Personality Inventory and development of a 120-item IPIP-based measure of the five-factor model.

    Science.gov (United States)

    Maples, Jessica L; Guan, Li; Carter, Nathan T; Miller, Joshua D

    2014-12-01

    There has been a substantial increase in the use of personality assessment measures constructed using items from the International Personality Item Pool (IPIP) such as the 300-item IPIP-NEO (Goldberg, 1999), a representation of the Revised NEO Personality Inventory (NEO PI-R; Costa & McCrae, 1992). The IPIP-NEO is free to use and can be modified to accommodate its users' needs. Despite the substantial interest in this measure, there is still a dearth of data demonstrating its convergence with the NEO PI-R. The present study represents an investigation of the reliability and validity of scores on the IPIP-NEO. Additionally, we used item response theory (IRT) methodology to create a 120-item version of the IPIP-NEO. Using an undergraduate sample (n = 359), we examined the reliability, as well as the convergent and criterion validity, of scores from the 300-item IPIP-NEO, a previously constructed 120-item version of the IPIP-NEO (Johnson, 2011), and the newly created IRT-based IPIP-120 in comparison to the NEO PI-R across a range of outcomes. Scores from all 3 IPIP measures demonstrated strong reliability and convergence with the NEO PI-R and a high degree of similarity with regard to their correlational profiles across the criterion variables (rICC = .983, .972, and .976, respectively). The replicability of these findings was then tested in a community sample (n = 757), and the results closely mirrored the findings from Sample 1. These results provide support for the use of the IPIP-NEO and both 120-item IPIP-NEO measures as assessment tools for measurement of the five-factor model. (c) 2014 APA, all rights reserved.

  10. The Role of Item Models in Automatic Item Generation

    Science.gov (United States)

    Gierl, Mark J.; Lai, Hollis

    2012-01-01

    Automatic item generation represents a relatively new but rapidly evolving research area where cognitive and psychometric theories are used to produce tests that include items generated using computer technology. Automatic item generation requires two steps. First, test development specialists create item models, which are comparable to templates…

  11. Developing Testing Accommodations for English Language Learners: Illustrations as Visual Supports for Item Accessibility

    Science.gov (United States)

    Solano-Flores, Guillermo; Wang, Chao; Kachchaf, Rachel; Soltero-Gonzalez, Lucinda; Nguyen-Le, Khanh

    2014-01-01

    We address valid testing for English language learners (ELLs)--students in the United States who are schooled in English while they are still acquiring English as a second language. Also, we address the need for procedures for systematically developing ELL testing accommodations--changes in tests intended to support ELLs to gain access to the…

  12. On the Issue of Item Selection in Computerized Adaptive Testing With Response Times

    NARCIS (Netherlands)

    Veldkamp, Bernard P.

    2016-01-01

    Many standardized tests are now administered via computer rather than paper-and-pencil format. The computer-based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed com

  13. Impact of Accumulated Error on Item Response Theory Pre-Equating with Mixed Format Tests

    Science.gov (United States)

    Keller, Lisa A.; Keller, Robert; Cook, Robert J.; Colvin, Kimberly F.

    2016-01-01

    The equating of tests is an essential process in high-stakes, large-scale testing conducted over multiple forms or administrations. By adjusting for differences in difficulty and placing scores from different administrations of a test on a common scale, equating allows scores from these different forms and administrations to be directly compared…

  14. Specificity and false positive rates of the Test of Memory Malingering, Rey 15-item Test, and Rey Word Recognition Test among forensic inpatients with intellectual disabilities.

    Science.gov (United States)

    Love, Christopher M; Glassmire, David M; Zanolini, Shanna Jordan; Wolf, Amanda

    2014-10-01

    This study evaluated the specificity and false positive (FP) rates of the Rey 15-Item Test (FIT), Word Recognition Test (WRT), and Test of Memory Malingering (TOMM) in a sample of 21 forensic inpatients with mild intellectual disability (ID). The FIT demonstrated an FP rate of 23.8% with the standard quantitative cutoff score. Certain qualitative error types on the FIT showed promise and had low FP rates. The WRT obtained an FP rate of 0.0% with previously reported cutoff scores. Finally, the TOMM demonstrated low FP rates of 4.8% and 0.0% on Trial 2 and the Retention Trial, respectively, when applying the standard cutoff score. FP rates are reported for a range of cutoff scores and compared with published research on individuals diagnosed with ID. Results indicated that although the quantitative variables on the FIT had unacceptably high FP rates, the TOMM and WRT had low FP rates, increasing the confidence clinicians can place in scores reflecting poor effort on these measures during ID evaluations.
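A specificity/false-positive analysis like the one above reduces to counting honest responders who fall at or below a cutoff score. A sketch with invented FIT-style scores and cutoff (illustrative only, not the study's data):

```python
def specificity_and_fp_rate(scores, cutoff):
    """Among honest responders, a score at or below the cutoff is a false
    positive for malingering; specificity = 1 - false-positive rate."""
    positives = sum(1 for s in scores if s <= cutoff)
    fp_rate = positives / len(scores)
    return 1.0 - fp_rate, fp_rate

# Hypothetical scores for 21 honest examinees; flag if 8 or fewer items recalled.
scores = [12, 10, 9, 8, 7, 11, 13, 9, 8, 10, 12, 6, 9, 11, 8, 10, 13, 7, 9, 12, 10]
spec, fp = specificity_and_fp_rate(scores, cutoff=8)
print(round(fp, 3))  # → 0.286 (6 of 21 honest examinees misclassified)
```

The study's point is visible in this arithmetic: a cutoff calibrated on the general population misclassifies a large share of examinees with intellectual disability, so the cutoff, not the examinee, is what needs adjusting.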

  15. The differential item functioning and structural equivalence of a nonverbal cognitive ability test for five language groups

    Directory of Open Access Journals (Sweden)

    Pieter Schaap

    2011-03-01

    Full Text Available Orientation: For a number of years, eliminating the language component in testing by using nonverbal cognitive tests has been proposed as a possible solution to the effect of groups’ languages (mother tongues or first languages) on test performance. This is particularly relevant in South Africa with its 11 official languages. Research purpose: The aim of the study was to determine the differential item functioning (DIF) and structural equivalence of a nonverbal cognitive ability test (the PiB/SpEEx Observance test [401]) for five South African language groups. Motivation for study: Culturally and linguistically sensitive tests can lead to unfair discrimination and are a contentious workplace issue in South Africa today. Misconceptions about psychometric testing in industry can cause tests to lose credibility if industries do not use a scientifically sound test-by-test evaluation approach. Research design, approach and method: The researcher used a quasi-experimental design and factor-analytic and logistic regression techniques to meet the research aims. The study used a convenience sample drawn from industry and an educational institution. Main findings: The main findings of the study show structural equivalence of the test at a holistic level and nonsignificant DIF effect sizes for most of the comparisons that the researcher made. Practical/managerial implications: This research shows that the PIB/SpEEx Observance Test (401) is not completely language-insensitive. One should see it rather as a language-reduced test when people from different language groups need testing. Contribution/value-add: The findings provide supporting evidence that nonverbal cognitive tests are plausible alternatives to verbal tests when one compares people from different language groups.

  16. Evaluation of the box and blocks test, stereognosis and item banks of activity and upper extremity function in youths with brachial plexus birth palsy.

    Science.gov (United States)

    Mulcahey, Mary Jane; Kozin, Scott; Merenda, Lisa; Gaughan, John; Tian, Feng; Gogola, Gloria; James, Michelle A; Ni, Pengsheng

    2012-09-01

    One of the greatest limitations to measuring outcomes in pediatric orthopaedics is the lack of effective instruments. Computer adaptive testing, which uses large item banks, selects only items that are relevant to a child's function based on previous responses and filters out items that are too easy, too hard, or simply not relevant to the child. In this way, computer adaptive testing provides a meaningful, efficient, and precise method to evaluate patient-reported outcomes. Banks of items that assess activity and upper extremity (UE) function have been developed for children with cerebral palsy and have enabled computer adaptive tests that showed strong reliability, strong validity, and broader content range when compared with traditional instruments. Because of the void in instruments for children with brachial plexus birth palsy (BPBP) and the importance of having a UE and activity scale, we were interested in how well these items worked in this population. A cross-sectional, multicenter study involving 200 children with BPBP was conducted. The box and block test (BBT) and stereognosis tests were administered, and patient reports of UE function and activity were obtained with the cerebral palsy item banks. Differential item functioning (DIF) was examined. The predictive ability of the BBT and stereognosis was evaluated with a proportional odds logistic regression model. Spearman correlation coefficients (rs) were calculated to examine the correlation between stereognosis and the BBT and between individual stereognosis items and the total stereognosis score. Six of the 86 items showed DIF, indicating that the activity and UE item banks may be useful for computer adaptive tests for children with BPBP. The penny and the button were the strongest predictors of impairment level (odds ratio = 0.34 to 0.40). There was a good positive relationship between total stereognosis and BBT scores (rs = 0.60). The BBT had a good negative (rs = -0.55) and good positive (rs = 0.55) relationship with

  17. Evaluation of item candidates: the PROMIS qualitative item review.

    Science.gov (United States)

    DeWalt, Darren A; Rothrock, Nan; Yount, Susan; Stone, Arthur A

    2007-05-01

    One of the PROMIS (Patient-Reported Outcome Measurement Information System) network's primary goals is the development of a comprehensive item bank for patient-reported outcomes of chronic diseases. For its first set of item banks, PROMIS chose to focus on pain, fatigue, emotional distress, physical function, and social function. An essential step for the development of an item pool is the identification, evaluation, and revision of extant questionnaire items for the core item pool. In this work, we also describe the systematic process wherein items are classified for subsequent statistical processing by the PROMIS investigators. Six phases of item development are documented: identification of extant items, item classification and selection, item review and revision, focus group input on domain coverage, cognitive interviews with individual items, and final revision before field testing. Identification of items refers to the systematic search for existing items in currently available scales. Expert item review and revision was conducted by trained professionals who reviewed the wording of each item and revised as appropriate for conventions adopted by the PROMIS network. Focus groups were used to confirm domain definitions and to identify new areas of item development for future PROMIS item banks. Cognitive interviews were used to examine individual items. Items successfully screened through this process were sent to field testing and will be subjected to innovative scale construction procedures.

  18. [Repeated measurement of memory with valenced test items: verbal memory, working memory and autobiographic memory].

    Science.gov (United States)

    Kuffel, A; Terfehr, K; Uhlmann, C; Schreiner, J; Löwe, B; Spitzer, C; Wingenfeld, K

    2013-07-01

    A large number of questions in clinical and/or experimental neuropsychology require the multiple repetition of memory tests at relatively short intervals. Studies on the impact of the associated practice and interference effects on the validity of the test results are rare. Moreover, hardly any neuropsychological instruments exist to date to record memory performance with several parallel versions in which the emotional valence of the test material is also taken into consideration. The aim of the present study was to test whether a working memory test (WST; a digit-span task with neutral or negative distraction stimuli) devised by our workgroup can be used with repeated measurements. This question was also examined for parallel versions of a wordlist learning paradigm and an autobiographical memory test (AMT). Both tests contained stimuli with neutral, positive and negative valence. Twenty-four participants completed the memory testing, including the working memory test and three versions of the wordlist and the AMT, at intervals of one week (measurement points 1-3). The results reveal consistent performances across the three measurement points in the working and autobiographical memory tests. The valence of the stimulus material did not influence memory performance. In the delayed recall of the wordlist, an improvement in memory performance over time was seen. The working memory test presented and the parallel versions for declarative and autobiographical memory constitute economical instruments for repeated-measures designs. While the WST and AMT are appropriate for study designs with repeated measurements at relatively short intervals, longer intervals may be more favourable for the use of wordlist learning paradigms. © Georg Thieme Verlag KG Stuttgart · New York.

  19. Measuring Cognitive Load in Test Items: Static Graphics versus Animated Graphics

    Science.gov (United States)

    Dindar, M.; Kabakçi Yurdakul, I.; Inan Dönmez, F.

    2015-01-01

    The majority of multimedia learning studies focus on the use of graphics in learning process but very few of them examine the role of graphics in testing students' knowledge. This study investigates the use of static graphics versus animated graphics in a computer-based English achievement test from a cognitive load theory perspective. Three…

  20. The results of the "essential laboratory tests" applied to new outpatients--re-evaluation of diagnostic efficiencies of the test items.

    Science.gov (United States)

    Takemura, Y; Kobayashi, H; Kugai, N; Sekiguchi, S

    1996-06-01

    We have analyzed the diagnostic efficiencies of the individual "Essential laboratory test" items when these tests were applied to 520 new outpatients in the division of comprehensive medicine of a teaching hospital. The integration of these test results with history-taking and physical examination resulted in 544 primary clinical diagnoses that corresponded to the patients' presenting complaints, and in 361 additional diagnoses unrelated to their chief complaints but found by chance through the addition of the test results. The clinical usefulness of these test items was variable depending on the disease category, with superior diagnostic efficiency in infectious or inflammatory diseases, liver and biliary tract diseases, hematological disorders, and metabolic diseases such as hyperlipidemia and diabetes mellitus, but a lesser degree of usefulness in gastrointestinal or neurogenic diseases. Urine urobilinogen could not establish its clinical usefulness because of extremely low diagnostic sensitivity, even in liver diseases. The leukocyte differential count provided confirmatory information for infectious or inflammatory diseases and was helpful for estimating the etiologic nature of infectious diseases. This study failed to settle the controversy over the adoption of sialic acid instead of the erythrocyte sedimentation rate (ESR) among the "Essential laboratory test" items, since the former test showed lower sensitivity, though higher specificity, in infectious or inflammatory conditions than the ESR. A low albumin/globulin ratio (A/G) showed diagnostic sensitivity and specificity equivalent to elevated alpha 1 and/or alpha 2 globulin fractions in infectious or inflammatory conditions, and was helpful for evaluating the patient's general condition at a glance. Incidental analysis of the diagnostic value of cholinesterase and random blood glucose for the detection of fatty liver and diabetes mellitus, respectively, suggested that these two tests may be included in

  1. Item Replenishing in Cognitive Diagnostic Computerized Adaptive Testing

    Institute of Scientific and Technical Information of China (English)

    陈平; 辛涛

    2011-01-01

    Item replenishing is essential for item bank maintenance and development in cognitive diagnostic computerized adaptive testing (CD-CAT). Compared with item replenishing in regular CAT, item replenishing in CD-CAT is more complicated because it requires constructing the Q-matrix (Embretson, 1984; Tatsuoka, 1995) corresponding to the new items (denoted as Qnew_item). However, the Qnew_item is often constructed manually by content experts and psychometricians, which brings about two issues: first, it takes experts a lot of time and effort to discuss and complete the attribute identification task, especially when the number of new items is large; second, the Qnew_item identified by experts is not guaranteed to be totally correct because experts often disagree in the discussion. Therefore, this study borrowed the main idea of the joint maximum likelihood estimation (JMLE) method in unidimensional item response theory (IRT) to propose the joint estimation algorithm (JEA), which depends fully on the examinees' responses to the operational and new items to jointly and automatically estimate the Qnew_item and the item parameters of new items in the context of CD-CAT under the Deterministic Inputs, Noisy "and" Gate (DINA) model. The results show that JEA recovers the attribute vectors and item parameters of new items well when item parameters are relatively small and the sample is relatively large, and that sample size, the magnitude of the item parameters, and the initial values of the item parameters all affect JEA's performance. A simulation study was conducted to investigate whether the JEA algorithm could accurately and efficiently estimate the Qnew_item and the item parameters of new items under
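Under the DINA model named above, an examinee answers an item correctly with probability 1 - slip when they master every attribute the item's Q-matrix row requires, and with the guessing probability otherwise. A minimal sketch (attribute patterns and parameter values invented):

```python
def dina_p_correct(alpha, q_row, slip, guess):
    """DINA model: alpha = examinee's mastered attributes (0/1 vector),
    q_row = attributes the item requires (one Q-matrix row).
    eta = 1 iff every required attribute is mastered."""
    eta = all(a >= q for a, q in zip(alpha, q_row))
    return 1.0 - slip if eta else guess

# Hypothetical item requiring attributes 1 and 3, with slip = .1, guess = .2.
q_row = [1, 0, 1]
print(dina_p_correct([1, 0, 1], q_row, 0.1, 0.2))  # masters both → 0.9
print(dina_p_correct([1, 1, 0], q_row, 0.1, 0.2))  # missing attribute 3 → 0.2
```

Estimating a new item's Q-row, as JEA does, amounts to finding the required-attribute pattern that best explains which examinees answered it correctly.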

  2. Using Automated Processes to Generate Test Items And Their Associated Solutions and Rationales to Support Formative Feedback

    Directory of Open Access Journals (Sweden)

    Mark Gierl

    2015-08-01

    Full Text Available Automatic item generation is the process of using item models to produce assessment tasks using computer technology. An item model is similar to a template that highlights the elements in the task that must be manipulated to produce new items. The purpose of our study is to describe an innovative method for generating large numbers of diverse and heterogeneous items along with their solutions and associated rationales to support formative feedback. We demonstrate the method by generating items in two diverse content areas, mathematics and nonverbal reasoning

  3. Claims, Evidence and Achievement Level Descriptors as a Foundation for Item Design and Test Specifications

    Science.gov (United States)

    Hendrickson, Amy; Huff, Kristen; Luecht, Ric

    2009-01-01

    [Slides] presented at the Annual Meeting of National Council on Measurement in Education (NCME) in San Diego, CA in April 2009. This presentation describes how the vehicles for gathering student evidence--task models and test specifications--are developed.

  4. What Is the Ability Emotional Intelligence Test (MSCEIT) Good for? An Evaluation Using Item Response Theory

    OpenAIRE

    Marina Fiori; Jean-Philippe Antonietti; Moira Mikolajczak; Olivier Luminet; Michel Hansenne; Jérôme Rossier

    2014-01-01

    The ability approach has been indicated as promising for advancing research in emotional intelligence (EI). However, there is a scarcity of tests measuring EI as a form of intelligence. The Mayer-Salovey-Caruso Emotional Intelligence Test, or MSCEIT, is among the few available and the most widespread measure of EI as an ability. This implies that conclusions about the value of EI as a meaningful construct and about its utility in predicting various outcomes mainly rely on the properties of this...

  5. Clinical utility of a single-item test for DSM-5 alcohol use disorder among outpatients with anxiety and depressive disorders.

    Science.gov (United States)

    Bartoli, Francesco; Crocamo, Cristina; Biagi, Enrico; Di Carlo, Francesco; Parma, Francesca; Madeddu, Fabio; Capuzzi, Enrico; Colmegna, Fabrizia; Clerici, Massimo; Carrà, Giuseppe

    2016-08-01

    There is a lack of studies testing the accuracy of fast screening methods for alcohol use disorder in mental health settings. We aimed at estimating the clinical utility of a standard single-item test for case finding and screening of DSM-5 alcohol use disorder among individuals suffering from anxiety and mood disorders. We recruited adults consecutively referred, over a 12-month period, to an outpatient clinic for anxiety and depressive disorders. We assessed the National Institute on Alcohol Abuse and Alcoholism (NIAAA) single-item test, using the Mini-International Neuropsychiatric Interview (MINI), plus an additional item of the Composite International Diagnostic Interview (CIDI) for craving, as the reference standard to diagnose a current DSM-5 alcohol use disorder. We estimated the sensitivity and specificity of the single-item test, as well as positive and negative Clinical Utility Indexes (CUIs). 242 subjects with anxiety and mood disorders were included. The NIAAA single-item test showed high sensitivity (91.9%) and specificity (91.2%) for DSM-5 alcohol use disorder. The positive CUI was 0.601, whereas the negative one was 0.898, with excellent values also accounting for main individual characteristics (age, gender, diagnosis, psychological distress levels, smoking status). Testing the relevant indexes, we found excellent clinical utility of the NIAAA single-item test for screening out true negative cases. Our findings support the routine use of reliable methods for rapid screening in similar mental health settings. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
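The Clinical Utility Indexes reported above combine discrimination with predictive value: CUI+ = sensitivity × PPV (case finding) and CUI− = specificity × NPV (screening). A sketch using a 2×2 table consistent with the reported n = 242, sensitivity, and specificity; the cell counts themselves are reconstructed for illustration, not taken from the paper:

```python
def clinical_utility_indexes(tp, fp, fn, tn):
    """CUI+ = sensitivity * PPV (case finding);
    CUI- = specificity * NPV (screening)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return sens * ppv, spec * npv

# Reconstructed 2x2 table: single-item screen vs. diagnostic interview.
cui_pos, cui_neg = clinical_utility_indexes(tp=34, fp=18, fn=3, tn=187)
print(round(cui_pos, 3), round(cui_neg, 3))  # → 0.601 0.898
```

The asymmetry is informative: the lower CUI+ reflects the modest positive predictive value at this prevalence, while the high CUI− is what justifies using the item to rule out disorder.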

  6. The effect of the Trier Social Stress Test (TSST) on item and associative recognition of words and pictures in healthy participants

    Directory of Open Access Journals (Sweden)

    Jonathan Guez

    2016-04-01

    Full Text Available Psychological stress, induced by the Trier Social Stress Test (TSST), has repeatedly been shown to alter memory performance. Although factors influencing memory performance such as stimulus nature (verbal/pictorial) and emotional valence have been extensively studied, results on whether stress impairs or improves memory are still inconsistent. This study aimed at exploring the effect of the TSST on item versus associative memory for neutral verbal and pictorial stimuli. Forty-eight healthy subjects were recruited; 24 participants were randomly assigned to the TSST group and the remaining 24 participants were assigned to the control group. Stress reactivity was measured by psychological (subjective state anxiety ratings) and physiological (galvanic skin response recording) measurements. Subjects performed an item-association memory task for both stimulus types (words, pictures) simultaneously, before and after the stress/non-stress manipulation. The results showed that memory recognition for pictorial stimuli was higher than for verbal stimuli. Memory for both words and pictures was impaired following the TSST; while the source of this impairment was specific to associative recognition for pictures, a more general deficit was observed for verbal material, as expressed in decreased recognition of both items and associations following the TSST. Response latency analysis indicated that the TSST manipulation decreased response time but at the cost of memory accuracy. We conclude that stress does not uniformly affect memory; rather, it interacts with the task's cognitive load and stimulus type. Applying the current study results to patients diagnosed with disorders associated with traumatic stress, our findings in healthy subjects under acute stress provide further support for our assertion that patients' impaired memory originates in poor recollection processing following depletion of attentional resources.

  7. Gender differential item functioning on a national field-specific test: The case of PhD entrance exam of TEFL in Iran

    Directory of Open Access Journals (Sweden)

    Alireza Ahmadi

    2016-01-01

    Full Text Available Differential Item Functioning (DIF) exists when examinees of equal ability from different groups have different probabilities of successfully answering a given item. This study examined gender differential item functioning on the PhD Entrance Exam of TEFL (PEET) in Iran, using both logistic regression (LR) and one-parameter item response theory (1-p IRT) models. The PEET is a national test consisting of a centralized written examination designed to provide information on the eligibility of PhD applicants in TEFL to enter PhD programs. The 2013 administration of this test provided score data for a sample of 999 Iranian PhD applicants, consisting of 397 males and 602 females. First, the data were subjected to DIF analysis through the logistic regression (LR) model. Then, to triangulate the findings, a 1-p IRT procedure was applied. The results indicated (1) more items flagged for DIF by LR than by 1-p IRT, (2) DIF cancellation (the number of DIF items was equal for both males and females), as revealed through LR, (3) an equal number of uniform and non-uniform DIF items, as tracked via LR, and (4) female superiority in test performance, as revealed via IRT analysis. Overall, the findings of the study indicated that the PEET suffers from DIF. As such, test developers and policymakers (like NOET & MSRT) are recommended to take these findings into serious consideration and exercise care in fair test practice by dedicating effort to more unbiased test development and decision making.
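The LR procedure in the abstract tests DIF by predicting item success from total score, group, and their interaction, which needs an iterative model fit. A lighter classic screen for uniform DIF, standing in here for the LR fit, is the Mantel-Haenszel common odds ratio computed over 2×2 tables stratified by total score (all tables below are invented):

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.
    strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong)
    2x2 tables, one per total-score level."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical tables for one item at three score levels (reference vs. focal).
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha = mh_odds_ratio(strata)
delta = -2.35 * math.log(alpha)  # ETS delta scale; |delta| >= 1.5 flags large DIF
print(round(alpha, 2))  # → 2.39: the item favors the reference group at matched ability
```

An odds ratio near 1 (delta near 0) means no uniform DIF; note that MH, unlike LR, cannot detect the non-uniform DIF the study also reports.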

  8. Psychometric evaluation of the EORTC computerized adaptive test (CAT) fatigue item pool

    DEFF Research Database (Denmark)

    Petersen, Morten Aa; Giesinger, Johannes M; Holzner, Bernhard

    2013-01-01

    Fatigue is one of the most common symptoms associated with cancer and its treatment. To obtain a more precise and flexible measure of fatigue, the EORTC Quality of Life Group has developed a computerized adaptive test (CAT) measure of fatigue. This is part of an ongoing project developing a CAT v...

  9. Item and Test Construct Definition for the New Spanish Baccalaureate Final Evaluation: A Proposal

    Science.gov (United States)

    Laborda, Jesús García; Martin-Monje, Elena

    2013-01-01

    The current English section of the University Entrance Examination (PAU) has kept the same format for twenty years. The Bologna process has provided new reasons to vary its current format, since the majority of international reputed tests usually include oral sections with both listening and speaking tasks. Recently the Universidad de Alcalá…

  10. Development and Application of a New Approach to Testing the Bipolarity of Semantic Differential Items.

    Science.gov (United States)

    Cogliser, Claudia C.; Schriesheim, Chester A.

    1994-01-01

    A method of testing semantic differential scales for bipolarity is developed using a new conception of bipolarity that does not require unidimensionality. Assessment of Fiedler's Least Preferred Coworker instrument with 63 college student subjects using multidimensional scaling revealed significant departures from bipolarity. (SLD)

  11. Explanatory item response modeling of children's change on a dynamic test of analogical reasoning

    NARCIS (Netherlands)

    Stevenson, C.E.; Hickendorff, M.; Resing, W.C.M.; Heiser, W.J.; de Boeck, P.A.L.

    Dynamic testing is an assessment method in which training is incorporated into the procedure with the aim of gauging cognitive potential. Large individual differences are present in children's ability to profit from training in analogical reasoning. The aim of this experiment was to investigate

  12. Development of a lack of appetite item bank for computer-adaptive testing (CAT)

    NARCIS (Netherlands)

    Thamsborg, L.H.; Petersen, M.A.; Aaronson, N.K.; Chie, W.C.; Costantini, A.; Holzner, B.; Verdonck-de Leeuw, I.M.; Young, T.; Groenvold, M.

    2015-01-01

    Purpose: A significant proportion of oncological patients experiences lack of appetite. Precise measurement is relevant to improve the management of lack of appetite. The so-called computer-adaptive test (CAT) allows for adaptation of the questionnaire to the individual patient, thereby optimizing

  13. Chemical and Biological Contamination Survivability (CBCS), Large Item Exteriors. Test Operations Procedure

    Science.gov (United States)

    2011-03-21

    Keywords: mission-oriented protective posture, level IV; protective clothing; ME; mission-essential ... tested are measured while individuals and/or crew members are wearing normal uniforms and while wearing mission-oriented protective posture, level IV ... be conducted outdoors. Environmental conditions are monitored, the SUT is allowed to equilibrate with the ambient conditions, and any required

  14. Avanços na psicometria: da Teoria Clássica dos Testes à Teoria de Resposta ao Item

    Directory of Open Access Journals (Sweden)

    Laisa Marcorela Andreoli Sartes

    2013-01-01

    Full Text Available In the twentieth century, the development and evaluation of the psychometric properties of tests were based mainly on Classical Test Theory (CTT). Many tests are long and redundant, with measures influenced by the characteristics of the sample of individuals assessed during their development; some of these limitations are consequences of the use of CTT. Item Response Theory (IRT) emerged as a possible solution to some of the limitations of CTT, improving the quality of the evaluation of test structure. In this text we critically compare the characteristics of CTT and IRT as methods for evaluating the psychometric properties of tests. The advantages and limitations of each method are discussed.

  15. Item Banking with Embedded Standards

    Science.gov (United States)

    MacCann, Robert G.; Stanley, Gordon

    2009-01-01

    An item banking method that does not use Item Response Theory (IRT) is described. This method provides a comparable grading system across schools that would be suitable for low-stakes testing. It uses the Angoff standard-setting method to obtain item ratings that are stored with each item. An example of such a grading system is given, showing how…
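
    To make the mechanics concrete: a sketch, assuming a bank in which each item carries an Angoff rating (the judged probability that a minimally competent examinee answers it correctly), so that a form assembled from the bank inherits a cut score equal to the sum of its items' ratings. The data and field names below are hypothetical.

```python
# Hypothetical item bank: each item stores an Angoff rating, i.e. the judged
# probability that a minimally competent examinee answers it correctly.
item_bank = [
    {"id": "Q1", "angoff": 0.80},
    {"id": "Q2", "angoff": 0.55},
    {"id": "Q3", "angoff": 0.65},
    {"id": "Q4", "angoff": 0.40},
]

def cut_score(items):
    """Angoff cut score for a test form = sum of its item ratings."""
    return sum(item["angoff"] for item in items)

# A form assembled from the bank carries its passing standard with it.
form = item_bank[:3]
print(round(cut_score(form), 2))  # 2.0 of 3 possible points
```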


  17. Changes in Word Usage Frequency May Hamper Intergenerational Comparisons of Vocabulary Skills: An Ngram Analysis of Wordsum, WAIS, and WISC Test Items

    Science.gov (United States)

    Roivainen, Eka

    2014-01-01

    Research on secular trends in mean intelligence test scores shows smaller gains in vocabulary skills than in nonverbal reasoning. One possible explanation is that vocabulary test items become outdated faster compared to nonverbal tasks. The history of the usage frequency of the words on five popular vocabulary tests, the GSS Wordsum, Wechsler…

  18. easyCBM CCSS Math Item Scaling and Test Form Revision (2012-2013): Grades 6-8. Technical Report #1313

    Science.gov (United States)

    Anderson, Daniel; Alonzo, Julie; Tindal, Gerald

    2012-01-01

    The purpose of this technical report is to document the piloting and scaling of new easyCBM mathematics test items aligned with the Common Core State Standards (CCSS) and to describe the process used to revise and supplement the 2012 research version easyCBM CCSS math tests in Grades 6-8. For all operational 2012 research version test forms (10…

  19. Three Statistical Testing Procedures in Logistic Regression: Their Performance in Differential Item Functioning (DIF) Investigation. Research Report. ETS RR-09-35

    Science.gov (United States)

    Paek, Insu

    2009-01-01

    Three statistical testing procedures well-known in the maximum likelihood approach are the Wald, likelihood ratio (LR), and score tests. Although well-known, the application of these three testing procedures in the logistic regression method to investigate differential item function (DIF) has not been rigorously made yet. Employing a variety of…
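
    Of the three procedures the abstract names, the likelihood ratio and Wald tests are easy to sketch with a hand-rolled logistic fit; the score test is omitted for brevity. Everything below (data, effect sizes) is simulated for illustration, not taken from the report.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        hessian = X.T @ (X * W[:, None])          # observed information matrix
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    llf = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, llf, cov

# Simulated data: the item shows uniform DIF of 0.8 logits for the focal group.
rng = np.random.default_rng(0)
n = 2000
total = rng.normal(0.0, 1.0, n)               # matching variable (total score)
group = rng.integers(0, 2, n)                 # 0 = reference, 1 = focal
logit = 0.5 + 1.2 * total + 0.8 * group
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_reduced = np.column_stack([np.ones(n), total])          # no group term
X_full = np.column_stack([np.ones(n), total, group])      # with group term

_, ll0, _ = fit_logistic(X_reduced, y)
b1, ll1, cov1 = fit_logistic(X_full, y)

lr = 2.0 * (ll1 - ll0)                        # likelihood ratio statistic, df = 1
wald = b1[2] ** 2 / cov1[2, 2]                # Wald statistic for the group term
print(lr > 3.84, wald > 3.84)                 # both exceed the 5% chi-square(1) cutoff
```

    Asymptotically the two statistics agree; in small samples they can lead to different DIF decisions, which is the kind of performance difference the report investigates.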

  20. The EXCITE Trial: analysis of "noncompleted" Wolf Motor Function Test items.

    Science.gov (United States)

    Wolf, Steven L; Thompson, Paul A; Estes, Emily; Lonergan, Timothy; Merchant, Rozina; Richardson, Natasha

    2012-02-01

    This is the first study to examine Wolf Motor Function Test (WMFT) tasks among EXCITE Trial participants that could not be completed at baseline or 2 weeks later. Data were collected from participants who received constraint-induced movement therapy (CIMT) immediately at the time of randomization (CIMT-I, n = 106) and from those for whom there was a delay of 1 year in receiving this intervention (CIMT-D, n = 116). Data were collected at baseline and at a 2-week time point, during which the CIMT-I group received the CIMT intervention and the CIMT-D group did not. Generalized estimating equation (GEE) analyses were used to examine repeated binary data and count values. Group and visit interactions were assessed, adjusting for functional level, affected side, dominant side, age, and gender covariates. In CIMT-I participants, there was an increase in the proportion of completed tasks at posttest compared with CIMT-D participants, particularly with respect to those tasks requiring dexterity with small objects and total incompletes (P < .0033). Compared with baseline, 120 tasks governing distal limb use for CIMT-I and 58 tasks dispersed across the WMFT for CIMT-D could be completed after 2 weeks. Common movement components that may have contributed to incomplete tasks include shoulder stabilization and flexion, elbow flexion and extension, wrist pronation, supination and ulnar deviation, and pincer grip. CIMT training should emphasize therapy for those specific movement components in patients who meet the EXCITE criteria for baseline motor control.

  1. Test-retest reliability of selected items of Health Behaviour in School-aged Children (HBSC survey questionnaire in Beijing, China

    Directory of Open Access Journals (Sweden)

    Liu Yang

    2010-08-01

    Full Text Available Abstract Background Children's health and health behaviour are essential for their development, and it is important to obtain abundant and accurate information to understand young people's health and health behaviour. The Health Behaviour in School-aged Children (HBSC) study is among the first large-scale international surveys on adolescent health through self-report questionnaires. So far, more than 40 countries in Europe and North America have been involved in the HBSC study. The purpose of this study is to assess the test-retest reliability of selected items in the Chinese version of the HBSC survey questionnaire in a sample of adolescents in Beijing, China. Methods A sample of 95 male and female students aged 11 or 15 years participated in a test and retest with a three-week interval. Student identity numbers of respondents were utilized to permit matching of test-retest questionnaires. 23 items concerning physical activity, sedentary behaviour, sleep and substance use were evaluated by using the percentage of response shifts and the single-measure intraclass correlation coefficients (ICC) with 95% confidence intervals (CI) for all respondents and stratified by gender and age. Items on substance use were only evaluated for school children aged 15 years. Results The percentage of no response shift between test and retest varied from 32% for the item on computer use at weekends to 92% for the three items on smoking. Of all the 23 items evaluated, 6 items (26%) showed a moderate reliability, 12 items (52%) displayed a substantial reliability and 4 items (17%) indicated almost perfect reliability. No gender or age group difference in test-retest reliability was found except for a few items on sedentary behaviour. Conclusions The overall findings of this study suggest that most selected indicators in the HBSC survey questionnaire have satisfactory test-retest reliability for the students in Beijing. Further test-retest studies in a large
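
    The single-measure ICC used in such test-retest designs can be sketched under the assumption of a one-way random-effects formulation, ICC(1,1); the HBSC analysis may have used a different variant, and the data below are hypothetical.

```python
import numpy as np

def icc_oneway(ratings):
    """Single-measure one-way ICC(1,1) from an (n subjects x k occasions) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)               # between subjects
    msw = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))  # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical test-retest data: 5 students, 2 occasions, mostly stable answers.
scores = [[3, 3], [5, 4], [2, 2], [4, 4], [1, 2]]
print(round(icc_oneway(scores), 2))  # 0.88: substantial test-retest reliability
```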


  3. Robust associations between the 20-item prosopagnosia index and the Cambridge Face Memory Test in the general population.

    Science.gov (United States)

    Gray, Katie L H; Bird, Geoffrey; Cook, Richard

    2017-03-01

    Developmental prosopagnosia (DP) is a neurodevelopmental condition, characterized by lifelong face recognition deficits. Leading research groups diagnose the condition using complementary computer-based tasks and self-report measures. In an attempt to standardize the reporting of self-report evidence, we recently developed the 20-item prosopagnosia index (PI20), a short questionnaire measure of prosopagnosic traits suitable for screening adult samples for DP. Strong correlations between scores on the PI20 and performance on the Cambridge Face Memory Test (CFMT) appeared to confirm that individuals possess sufficient insight into their face recognition ability to complete a self-report measure of prosopagnosic traits. However, the extent to which people have insight into their face recognition abilities remains contentious. A lingering concern is that feedback from formal testing, received prior to administration of the PI20, may have augmented the self-insight of some respondents in the original validation study. To determine whether the significant correlation with the CFMT was an artefact of previously delivered feedback, we sought to replicate the validation study in individuals with no history of formal testing. We report highly significant correlations in two independent samples drawn from the general population, confirming: (i) that a significant relationship exists between PI20 scores and performance on the CFMT, and (ii) that this is not dependent on the inclusion of individuals who have previously received feedback. These findings support the view that people have sufficient insight into their face recognition abilities to complete a self-report measure of prosopagnosic traits.

  4. Item and test analysis to identify quality multiple choice questions (MCQS) from an assessment of medical students of Ahmedabad, Gujarat

    OpenAIRE

    Sanju Gajjar; Rashmi Sharma; Pradeep Kumar; Manish Rana

    2014-01-01

    Background: Multiple choice questions (MCQs) are frequently used to assess students in different educational streams for their objectivity and wide reach of coverage in less time. However, the MCQs to be used must be of quality, which depends upon their difficulty index (DIF I), discrimination index (DI) and distracter efficiency (DE). Objective: To evaluate MCQs or items and develop a pool of valid items by assessing them with DIF I, DI and DE, and also to revise/store or discard items based on obta...
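
    A sketch of the three indices, using one common set of conventions (definitions and cutoffs vary across texts; the values here are illustrative, not necessarily those of the study):

```python
# Illustrative item-analysis indices for one MCQ (data are hypothetical).
#   difficulty index      = % of examinees answering correctly
#   discrimination index  = (upper-group correct - lower-group correct) / group size
#   distracter efficiency = % of distracters chosen by at least 5% of examinees

def difficulty_index(correct, n):
    return 100.0 * correct / n

def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

def distracter_efficiency(option_counts, key, n, threshold=0.05):
    distracters = [c for opt, c in option_counts.items() if opt != key]
    functional = sum(1 for c in distracters if c / n >= threshold)
    return 100.0 * functional / len(distracters)

# 100 examinees; option C is the key; option E draws nobody (non-functional).
counts = {"A": 10, "B": 15, "C": 70, "D": 5, "E": 0}
print(difficulty_index(counts["C"], 100))       # 70.0
print(discrimination_index(24, 10, 27))         # ~0.52, using upper/lower 27% groups
print(distracter_efficiency(counts, "C", 100))  # 75.0
```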

  5. Examining Differential Item Functioning Trends for English Language Learners in a Reading Test: A Meta-Analytical Approach

    Science.gov (United States)

    Koo, Jin; Becker, Betsy Jane; Kim, Young-Suk

    2014-01-01

    In this study, differential item functioning (DIF) trends were examined for English language learners (ELLs) versus non-ELL students in third and tenth grades on a large-scale reading assessment. To facilitate the analyses, a meta-analytic DIF technique was employed. The results revealed that items requiring knowledge of words and phrases in…

  6. Gender and Language Differences on the Test of Workplace Essential Skills: Using Overall Mean Scores and Item-Level Differential Item Functioning Analyses

    Science.gov (United States)

    Kline, Theresa J. B.

    2004-01-01

    The Test of Workplace Essential Skills (TOWES) assesses cognitive skills in three areas using the following three separate subscales: Reading Text, Document Use, and Numeracy in Working-Age Adults. The sample was composed of 2,688 working-age English-speaking Canadians who came from a variety of settings (e.g., trades training programs, adult…

  7. Lord-Wingersky Algorithm Version 2.0 for Hierarchical Item Factor Models with Applications in Test Scoring, Scale Alignment, and Model Fit Testing.

    Science.gov (United States)

    Cai, Li

    2015-06-01

    Lord and Wingersky's (Appl Psychol Meas 8:453-461, 1984) recursive algorithm for creating summed score based likelihoods and posteriors has a proven track record in unidimensional item response theory (IRT) applications. Extending the recursive algorithm to handle multidimensionality is relatively simple, especially with fixed quadrature because the recursions can be defined on a grid formed by direct products of quadrature points. However, the increase in computational burden remains exponential in the number of dimensions, making the implementation of the recursive algorithm cumbersome for truly high-dimensional models. In this paper, a dimension reduction method that is specific to the Lord-Wingersky recursions is developed. This method can take advantage of the restrictions implied by hierarchical item factor models, e.g., the bifactor model, the testlet model, or the two-tier model, such that a version of the Lord-Wingersky recursive algorithm can operate on a dramatically reduced set of quadrature points. For instance, in a bifactor model, the dimension of integration is always equal to 2, regardless of the number of factors. The new algorithm not only provides an effective mechanism to produce summed score to IRT scaled score translation tables properly adjusted for residual dependence, but leads to new applications in test scoring, linking, and model fit checking as well. Simulated and empirical examples are used to illustrate the new applications.
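
    The core Lord-Wingersky recursion for dichotomous items is compact enough to sketch directly; this is the classic unidimensional version, not the dimension-reduced extension the paper develops:

```python
# Lord-Wingersky recursion: given each item's probability of a correct response
# at a fixed ability value, build the likelihood of every possible summed score.

def lord_wingersky(p_correct):
    """p_correct: list of P_i(theta) for dichotomous items at one theta value.
    Returns a list L where L[s] = P(summed score = s | theta)."""
    scores = [1.0]                       # zero items: score 0 with certainty
    for p in p_correct:
        new = [0.0] * (len(scores) + 1)
        for s, mass in enumerate(scores):
            new[s] += mass * (1.0 - p)   # item answered incorrectly
            new[s + 1] += mass * p       # item answered correctly
        scores = new
    return scores

# Three fair-coin items yield the binomial(3, 0.5) summed-score distribution.
likelihood = lord_wingersky([0.5, 0.5, 0.5])
print([round(x, 3) for x in likelihood])  # [0.125, 0.375, 0.375, 0.125]
```

    Running the recursion at each quadrature point and weighting by the prior gives the summed-score posteriors the paper builds on.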

  8. A unified factor-analytic approach to the detection of item and test bias: Illustration with the effect of providing calculators to students with dyscalculia

    Directory of Open Access Journals (Sweden)

    Lee, M. K.

    2016-01-01

    Full Text Available An absence of measurement bias against distinct groups is a prerequisite for the use of a given psychological instrument in scientific research or high-stakes assessment. Factor analysis is the framework explicitly adopted for the identification of such bias when the instrument consists of a multi-test battery, whereas item response theory is employed when the focus narrows to a single test composed of discrete items. Item response theory can be treated as a mild nonlinearization of the standard factor model, and thus the essential unity of bias detection at the two levels merits greater recognition. Here we illustrate the benefits of a unified approach with a real-data example, which comes from a statewide test of mathematics achievement where examinees diagnosed with dyscalculia were accommodated with calculators. We found that items that can be solved by explicit arithmetical computation became easier for the accommodated examinees, but the quantitative magnitude of this differential item functioning (measurement bias) was small.
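
    The "mild nonlinearization" point can be illustrated with the textbook conversion between a single-factor loading and a normal-ogive IRT discrimination (a sketch in the probit metric; the paper's own treatment is more general):

```python
import math

# Conversion between a factor loading (lambda) on a standardized latent trait
# and the normal-ogive 2PL discrimination (a), probit metric:
#     a = lambda / sqrt(1 - lambda^2)

def loading_to_discrimination(loading):
    return loading / math.sqrt(1.0 - loading ** 2)

def discrimination_to_loading(a):
    return a / math.sqrt(1.0 + a ** 2)

lam = 0.6
a = loading_to_discrimination(lam)
print(round(a, 3))                             # 0.75
print(round(discrimination_to_loading(a), 3))  # 0.6, i.e. the round trip recovers lambda
```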

  9. Developmental Validation of the ParaDNA® Screening System - A presumptive test for the detection of DNA on forensic evidence items.

    Science.gov (United States)

    Dawnay, Nick; Stafford-Allen, Beccy; Moore, Dave; Blackman, Stephen; Rendell, Paul; Hanson, Erin K; Ballantyne, Jack; Kallifatidis, Beatrice; Mendel, Julian; Mills, DeEtta K; Nagy, Randy; Wells, Simon

    2014-07-01

    Current assessment of whether a forensic evidence item should be submitted for STR profiling is largely based on the personal experience of the Crime Scene Investigator (CSI) and the submissions policy of the law enforcement authority involved. While there are chemical tests that can infer the presence of DNA through the detection of biological stains, the process remains mostly subjective and leads to many samples being submitted that give no profile, or not being submitted although DNA is present. The ParaDNA® Screening System was developed to address this issue. It consists of a sampling device, pre-loaded reaction plates and a detection instrument. The test uses direct PCR with fluorescent HyBeacon™ detection of PCR amplicons to identify the presence and relative amount of DNA on an evidence item, and also provides a gender identification result, in approximately 75 minutes. This simple-to-use design allows objective data to be acquired by both DNA analysts and non-specialist personnel, to enable a more informed submission decision to be made. The developmental validation study described here tested the sensitivity, reproducibility, accuracy, inhibitor tolerance, and performance of the ParaDNA Screening System on a range of mock evidence items. The data collected demonstrate that the ParaDNA Screening System identifies the presence of DNA on a variety of evidence items including blood, saliva and touch DNA items. Copyright © 2014 The Authors. Published by Elsevier Ireland Ltd. All rights reserved.

  10. Test of item-response bias in the CES-D scale: Experience from the New Haven EPESE study.

    Science.gov (United States)

    Cole, S R; Kawachi, I; Maller, S J; Berkman, L F

    2000-03-01

    We present results of item-response bias analyses of the exogenous variables age, gender, and race for all items from the Center for Epidemiologic Studies Depression (CES-D) scale using data (N = 2340) from the New Haven component of the Established Populations for Epidemiologic Studies of the Elderly (EPESE). The proportional odds of blacks responding higher on the CES-D items "people are unfriendly" and "people dislike me" were 2.29 (95% confidence interval: 1.74, 3.02) and 2.96 (95% confidence interval: 2.15, 4.07) times that of whites matched on overall depressive symptoms, respectively. In addition, the proportional odds of women responding higher on the CES-D item "crying spells" were 2.14 (95% confidence interval: 1.60, 2.82) times that of men matched on overall depressive symptoms. Our data indicate the CES-D would have greater validity among this diverse group of older men and women after removal of the crying item and two interpersonal items.
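
    The study fits proportional-odds models; as a simpler stand-in that conveys the same matching logic, the sketch below computes a Mantel-Haenszel common odds ratio for a dichotomized item across strata of the matching total symptom score. All counts are hypothetical, not the EPESE data.

```python
# Mantel-Haenszel common odds ratio for a dichotomized item, stratified by the
# matching variable (here, overall depressive-symptom score). An odds ratio
# well above 1 suggests the focal group endorses the item more often than
# reference-group members with the same overall symptom level.

def mantel_haenszel_or(strata):
    """strata: list of (a, b, c, d) tables per score stratum, where
    a = focal endorses, b = focal does not, c = reference endorses, d = does not."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

tables = [
    (20, 80, 10, 90),   # low total-score stratum
    (40, 60, 25, 75),   # middle stratum
    (60, 40, 45, 55),   # high stratum
]
print(round(mantel_haenszel_or(tables), 2))  # 1.98: focal group endorses more often
```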

  11. Estimation of ability and item parameters in mathematics testing by using the combination of 3PLM/ GRM and MCM/ GPCM scoring model

    Directory of Open Access Journals (Sweden)

    Abadyo Abadyo

    2015-06-01

    Full Text Available The main purpose of the study was to investigate the superiority of scoring with the combined MCM/GPCM model in comparison to the 3PLM/GRM model within mixed-item-format mathematics tests. To achieve this purpose, the impact of the two scoring models was investigated as a function of the test length, the sample size, and the M-C item proportion within the mixed-item-format test, with respect to: (1) estimation of ability and item parameters, (2) optimization of the test information function (TIF), (3) standard error rates, and (4) model fit to the data. The investigation made use of simulated data generated under a fixed-effects factorial design 2 x 3 x 3 x 3 with 5 replications, resulting in 270 data sets. The data were analyzed by means of fixed-effects MANOVA on the root mean square error (RMSE) of the ability estimates and the RMSE and root mean square deviation (RMSD) of the item parameters in order to identify the significant main effects at a level of α = .05; the interaction effects were incorporated into the error term for statistical testing. The -2LL statistics were also used in order to evaluate model fit to the data sets. The results of the study show that the combined MCM/GPCM model provides more accurate estimation than the 3PLM/GRM model. In addition, the test information given by the combined MCM/GPCM model is three times higher than that of the 3PLM/GRM model, although the test information does not support a solid conclusion about which sample size and M-C item proportion at each test length yield the optimal test information. Finally, the differences in fit statistics between the two scoring models favor the MCM/GPCM model over the 3PLM/GRM model.

  12. IRT Item Parameter Scaling for Developing New Item Pools

    Science.gov (United States)

    Kang, Hyeon-Ah; Lu, Ying; Chang, Hua-Hua

    2017-01-01

    Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. The three scaling procedures are considered: (a) concurrent…
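
    One of the simplest scaling procedures in this family, mean/sigma linking on common-item difficulties, can be sketched as follows (the report compares more elaborate procedures; the numbers here are constructed so the transformation is exact):

```python
import statistics

# Mean/sigma linking: place field-tested item difficulties from a new form (Y)
# onto the metric of the bank (X) using items common to both calibrations.

def mean_sigma_constants(b_common_X, b_common_Y):
    A = statistics.stdev(b_common_X) / statistics.stdev(b_common_Y)
    B = statistics.mean(b_common_X) - A * statistics.mean(b_common_Y)
    return A, B

def rescale(b_Y, A, B):
    """Transform difficulties: b* = A * b + B (discriminations become a / A)."""
    return [A * b + B for b in b_Y]

# Hypothetical common-item difficulties: Y's metric is shifted and shrunk.
b_X = [-1.0, 0.0, 1.0, 2.0]
b_Y = [0.0, 0.5, 1.0, 1.5]
A, B = mean_sigma_constants(b_X, b_Y)
print(round(A, 6), round(B, 6))                    # 2.0 -1.0
print([round(v, 6) for v in rescale(b_Y, A, B)])   # [-1.0, 0.0, 1.0, 2.0]
```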

  13. CUSUM Statistics for Large Item Banks: Computation of Standard Errors. Law School Admission Council Computerized Testing Report. LSAC Research Report Series.

    Science.gov (United States)

    Glas, C. A. W.

    A previous study by C. Glas (1998) examined how to evaluate whether adaptive testing data used for online calibration sufficiently fit the item response model. Three approaches were suggested, based on a Lagrange multiplier (LM) statistic, a Wald statistic, and a cumulative sum (CUSUM) statistic, respectively. For all these methods,…
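
    A minimal upper one-sided CUSUM, the ingredient for which the report develops standard errors, looks like this in sketch form (the residuals and slack value are illustrative, not LSAC's actual procedure):

```python
# Upper one-sided CUSUM for monitoring item drift: each x is a standardized
# residual comparing observed and model-predicted performance on an item in
# successive calibration windows.

def cusum(residuals, slack=0.5):
    c, path = 0.0, []
    for x in residuals:
        c = max(0.0, c + x - slack)   # accumulate only above-slack deviations
        path.append(c)
    return path

stable  = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]   # item behaves as calibrated
drifted = [0.1, -0.2, 1.4, 1.6, 1.2, 1.5]    # item starts to misfit
threshold = 3.0
print(max(cusum(stable)) > threshold)    # False: no alarm
print(max(cusum(drifted)) > threshold)   # True: drift flagged
```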

  14. An Item Response Theory-Based, Computerized Adaptive Testing Version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS)

    Science.gov (United States)

    Makransky, Guido; Dale, Philip S.; Havmose, Philip; Bleses, Dorthe

    2016-01-01

    Purpose: This study investigated the feasibility and potential validity of an item response theory (IRT)-based computerized adaptive testing (CAT) version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS; Fenson et al., 2007) vocabulary checklist, with the objective of reducing length while maintaining…
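
    The heart of any IRT-based CAT of this kind is selecting the next item by maximum Fisher information at the current ability estimate; a 2PL sketch with a made-up bank (the CDI:WS CAT uses its own calibrated parameters):

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta, bank, administered):
    """Pick the unused item with maximum information at the current theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: information(theta, *bank[i]))

bank = [(1.0, -2.0), (1.0, -0.1), (1.0, 1.5), (1.0, 3.0)]  # (a, b) pairs
print(next_item(0.0, bank, administered=set()))  # 1: difficulty closest to theta
```

    With equal discriminations, the selected item is simply the one whose difficulty lies closest to the current ability estimate, which is why a CAT can shorten a long checklist without losing precision.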

  15. Cuing effect of "all of the above" on the reliability and validity of multiple-choice test items.

    Science.gov (United States)

    Harasym, P H; Leong, E J; Violato, C; Brant, R; Lorscheider, F L

    1998-03-01

    It is generally acknowledged that alternatives such as "none of the above" and "all of the above" should be used sparingly in multiple-choice (MC) items. But the effect that "all of the above" has on the reliability and validity of an MC item is unclear. This study compared the results of a single-response (SRa) item format that included "all of the above" as the correct response to a multiple-response (MR) item format that required examinees to select all of the available alternatives for a correct response. A crossover design was used to compare the effect of formats on student performance while item content, scoring method, and student ability levels remained constant. Results indicated that the SRa format greatly distorted examinee performance by elevating their scores, because examinees who recognized two or more alternatives as being correct were cued to select "all of the above". In addition, the SRa format significantly reduced the reliability and concurrent validity of examinee scores. In summary, the MR format was found to be superior. Based upon new empirical evidence, this study recommends that whenever an educator wishes to evaluate student understanding of an issue that has multiple facts, the SRa format should be avoided and the MR format should be used instead.

  16. Multilevel Modeling of Item Position Effects

    Science.gov (United States)

    Albano, Anthony D.

    2013-01-01

    In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A…

  17. An investigation of the effects of relative probability of old and new test items on the neural correlates of successful and unsuccessful source memory.

    Science.gov (United States)

    Vilberg, Kaia L; Rugg, Michael D

    2009-04-01

    The present event-related fMRI study addressed the question whether retrieval-related neural activity in lateral parietal cortex is affected by the relative probability of test items. We varied the proportion of old to new items across two test blocks, with 25% of the items being old in one block and 75% being old in the other. Prior to each block, participants (N=18) completed one of two types of study judgment on each of 108 object images. They then performed a source memory test with four response options: studied in task 1, studied in task 2, old but unsure of the study task, and new. Retrieval-related activity in regions previously identified as recollection-sensitive, including the left inferior lateral parietal cortex and bilateral medial temporal cortex, was unaffected by old/new ratio. Generic retrieval success effects--retrieval-related effects common to recognized items attracting either a correct or an incorrect source judgment--were identified in several regions of left superior parietal cortex. These effects dissociated between a middle region of the intraparietal sulcus (IPS), where activity did not interact with ratio, and regions anterior and posterior to the middle IPS where activity was sensitive to old/new ratio. The findings are inconsistent with prior proposals that retrieval-related activity in and around the left middle IPS reflects the relative salience of old and new test items. Rather, they suggest that, as in the case of more inferior left parietal regions, retrieval-related activity in this region reflects processes directly linked to retrieval.

  18. Differential Item Functioning Analysis Using a Mixture 3-Parameter Logistic Model with a Covariate on the TIMSS 2007 Mathematics Test

    Science.gov (United States)

    Choi, Youn-Jeng; Alexeev, Natalia; Cohen, Allan S.

    2015-01-01

    The purpose of this study was to explore what may be contributing to differences in performance in mathematics on the Trends in International Mathematics and Science Study 2007. This was done by using a mixture item response theory modeling approach to first detect latent classes in the data and then to examine differences in performance on items…

  19. The use of the Rey 15-Item Test and recognition trial to evaluate noncredible effort after pediatric mild traumatic brain injury.

    Science.gov (United States)

    Green, Cassie M; Kirk, John W; Connery, Amy K; Baker, David A; Kirkwood, Michael W

    2014-01-01

    The Rey 15-Item Test (FIT) is a performance validity test commonly used in adult neuropsychological assessment. FIT classification statistics across studies have been variable, so a recognition trial was created to enhance the measure (Boone, K. B., Salazar, X., Lu, P., Warner-Chacon, K., & Razani, J. (2002). The Rey 15-Item recognition trial: A technique to enhance sensitivity of the Rey 15-Item Memorization Test. Journal of Clinical and Experimental Neuropsychology, 24(5), 561-573.). The current study assessed the utility of the FIT and recognition trial in a pediatric mild traumatic brain injury sample (N = 319, M = 14.57 years). All participants were administered the FIT and recognition trial as part of an abbreviated clinical neuropsychological evaluation. Failure on the Medical Symptom Validity Test was used as the criterion for noncredible effort. Fifteen percent of the sample met the criterion. The traditional adult cutoff score of <9 on the FIT recall trial yielded excellent specificity (98%), but very poor sensitivity (12%). When the recognition trial was utilized, a total score of <26 resulted in the best combined cutoff score (sensitivity = 55%, specificity = 91%). Results indicate that the FIT with recognition trial may be useful in the assessment of noncredible effort with children and adolescents, at least among relatively high-functioning populations.
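
    Classification statistics like the sensitivity and specificity quoted for the <26 cutoff are computed as sketched below (counts are hypothetical, not the study's data):

```python
# Cutoff evaluation for a performance validity test: a score below the cutoff
# is called noncredible (a "positive" classification).

def sensitivity_specificity(scores, noncredible, cutoff):
    tp = sum(1 for s, nc in zip(scores, noncredible) if nc and s < cutoff)
    fn = sum(1 for s, nc in zip(scores, noncredible) if nc and s >= cutoff)
    tn = sum(1 for s, nc in zip(scores, noncredible) if not nc and s >= cutoff)
    fp = sum(1 for s, nc in zip(scores, noncredible) if not nc and s < cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Eight hypothetical examinees: combined recall + recognition totals, plus the
# criterion classification (True = noncredible effort on the external measure).
scores      = [24, 28, 25, 22, 27, 29, 25, 30]
noncredible = [True, True, False, True, False, False, True, False]
sens, spec = sensitivity_specificity(scores, noncredible, cutoff=26)
print(round(sens, 2), round(spec, 2))  # 0.75 0.75
```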

  20. Item-Writing Guidelines for Physics

    Science.gov (United States)

    Regan, Tom

    2015-01-01

    A teacher learning how to write test questions (test items) will almost certainly encounter item-writing guidelines--lists of item-writing do's and don'ts. Item-writing guidelines usually are presented as applicable across all assessment settings. Table I shows some guidelines that I believe to be generally applicable and two will be briefly…

  1. Teoria de Resposta ao Item na análise de uma prova de estatística em universitários Item Response Theory to analyze a statistics test in university students

    Directory of Open Access Journals (Sweden)

    Claudette Maria Medeiros Vendramini

    2005-12-01

    Full Text Available This study aimed to apply Item Response Theory to the analysis of the 15 multiple-choice questions of a statistics test presented in the form of statistical graphs or tables. Participants were 413 university students, selected by convenience from two private higher-education institutions, predominantly from the Psychology program (91.5%). The students were 80% female, mostly from the daytime program (69.8%), with ages from 16 to 53 years, mean 24.4 and standard deviation 7.4. The test is predominantly unidimensional and the items are better fitted by the three-parameter logistic model. The indices of discrimination, difficulty and biserial correlation present acceptable values. The results show the difficulties the students have with mathematical and statistical concepts, difficulties also observed in other studies from elementary education onward. It is suggested that these concepts be treated in greater depth in higher education.

  2. The construct equivalence and item bias of the pib/SpEEx conceptualisation-ability test for members of five language groups in South Africa

    OpenAIRE

    Pieter Schaap; Theresa Vermeulen

    2008-01-01

    This study’s objective was to determine whether the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test displays construct equivalence and item bias for members of five selected language groups in South Africa. The sample consisted of a non-probability convenience sample (N = 6 261) of members of five language groups (speakers of Afrikaans, English, North Sotho, Setswana and isiZulu) working in the medical and beverage industries or ...

  3. When BAWE meets WELT: The use of a corpus of student writing to develop items for a proficiency test in grammar and English usage

    Directory of Open Access Journals (Sweden)

    Gerard Paul Sharpling

    2010-08-01

    Full Text Available This article reports on the use of the British Academic Written English (BAWE corpus as a source for developing test items for the Grammar and English Usage section of the Warwick English Language (WELT test in 2007. A key feature of this newly designed multiple choice grammar test was its use of student-generated writing. The extracts used for the re-designed test were derived directly from the BAWE corpus, as opposed to text books, published sources or indeed, simulated extracts of academic writing devised by test developers, which had been the case previously. The rationale for using the BAWE corpus for language test design is outlined, with a particular focus on the attributes of the students’ writing within the corpus, and the inclusion of both first and second language writing. The challenges involved in developing grammar test items based on BAWE corpus data are also presented. While the procedures set out in the paper were undertaken within a specifically British higher education setting, it is hoped that the research will be of interest to test developers and/or researchers in writing skills in other academic settings worldwide.

  4. Item Analysis of Short Tests in Educational Testing: A Comparative Study of Parametric and Non-parametric Item Response Theory

    Institute of Scientific and Technical Information of China (English)

    何壮; 袁淑莉; 赵守盈

    2012-01-01

    Topic-based papers and short tests are a major form of test construction in educational testing. Such tests can be analysed with both parametric and non-parametric item response theory. In this study, the Rasch model and the Mokken model were applied to a geography paper from a senior-three liberal-arts comprehensive examination. Winsteps and Xcalibre were used for the Rasch analysis, yielding difficulty, information, and differential item functioning estimates; MSP was used for the Mokken analysis, yielding proportion-correct statistics and homogeneity coefficients. Comparing the two sets of results led to four conclusions: (1) ordering items by proportion correct under non-parametric item response theory agrees with ordering them by difficulty under parametric item response theory; (2) some items that fail the parametric criteria are still valuable for improving test quality and should not be deleted; (3) for dimensionality checking and item screening, the non-parametric criteria are stricter than the parametric ones; (4) the differential item functioning results of the two approaches are consistent.
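The Rasch model used in the record above is the one-parameter special case of the logistic IRT family: one difficulty per item and a common discrimination. A minimal sketch with hypothetical difficulties, illustrating why ranking items by proportion correct agrees with ranking them by Rasch difficulty:

```python
import math

def p_rasch(theta, b):
    """Rasch model: probability of a correct response for ability theta
    and item difficulty b (one parameter per item, common discrimination)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical difficulties for three items. For any fixed ability,
# success probability falls as difficulty rises, so ranking items by
# observed accuracy rate mirrors ranking them by Rasch difficulty.
difficulties = [-1.0, 0.0, 1.5]
probs = [p_rasch(0.0, b) for b in difficulties]
print([round(p, 3) for p in probs])  # → [0.731, 0.5, 0.182]
```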

  5. The optimal sequence and selection of screening test items to predict fall risk in older disabled women: the Women's Health and Aging Study.

    Science.gov (United States)

    Lamb, Sarah E; McCabe, Chris; Becker, Clemens; Fried, Linda P; Guralnik, Jack M

    2008-10-01

    Falls are a major cause of disability, dependence, and death in older people. Brief screening algorithms may be helpful in identifying risk and leading to more detailed assessment. Our aim was to determine the most effective sequence of falls screening test items from a wide selection of recommended items, including self-report and performance tests, and to compare performance with other published guidelines. Data were from a prospective, age-stratified cohort study. Participants were 1002 community-dwelling women aged 65 years or older, all experiencing at least mild disability. Assessments of fall risk factors were conducted in participants' homes. Fall outcomes were collected at 6-monthly intervals. Algorithms were built for prediction of any fall over a 12-month period using tree classification with cross-set validation. Algorithms using performance tests provided the best prediction of fall events and achieved moderate to strong performance when compared to commonly accepted benchmarks. The items selected by the best-performing algorithm were the number of falls in the last year and, in selected subpopulations, frequency of difficulty balancing while walking, a 4 m walking-speed test, body mass index, and a test of knee extensor strength. The algorithm performed better than that from the American Geriatric Society/British Geriatric Society/American Academy of Orthopaedic Surgeons and other guidance, although these findings should be treated with caution. Suggestions are made on the type, number, and sequence of tests that could be used to maximize estimation of the probability of falling in older disabled women.

  6. Rasch analysis of the Pediatric Evaluation of Disability Inventory-computer adaptive test (PEDI-CAT) item bank for children and young adults with spinal muscular atrophy.

    Science.gov (United States)

    Pasternak, Amy; Sideridis, Georgios; Fragala-Pinkham, Maria; Glanzman, Allan M; Montes, Jacqueline; Dunaway, Sally; Salazar, Rachel; Quigley, Janet; Pandya, Shree; O'Riley, Susan; Greenwood, Jonathan; Chiriboga, Claudia; Finkel, Richard; Tennekoon, Gihan; Martens, William B; McDermott, Michael P; Fournier, Heather Szelag; Madabusi, Lavanya; Harrington, Timothy; Cruz, Rosangel E; LaMarca, Nicole M; Videon, Nancy M; Vivo, Darryl C De; Darras, Basil T

    2016-12-01

    In this study we evaluated the suitability of a caregiver-reported functional measure, the Pediatric Evaluation of Disability Inventory-Computer Adaptive Test (PEDI-CAT), for children and young adults with spinal muscular atrophy (SMA). PEDI-CAT Mobility and Daily Activities domain item banks were administered to 58 caregivers of children and young adults with SMA. Rasch analysis was used to evaluate test properties across SMA types. Unidimensional content for each domain was confirmed. The PEDI-CAT was most informative for type III SMA, with ability levels distributed close to 0.0 logits in both domains. It was less informative for types I and II SMA, especially for mobility skills. Item and person abilities were not distributed evenly across all types. The PEDI-CAT may be used to measure functional performance in SMA, but additional items are needed to identify small changes in function and best represent the abilities of all types of SMA. Muscle Nerve 54: 1097-1107, 2016. © 2016 Wiley Periodicals, Inc.

  7. The construct equivalence and item bias of the pib/SpEEx conceptualisation-ability test for members of five language groups in South Africa

    Directory of Open Access Journals (Sweden)

    Pieter Schaap

    2008-12-01

    Full Text Available This study’s objective was to determine whether the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test displays construct equivalence and item bias for members of five selected language groups in South Africa. The sample consisted of a non-probability convenience sample (N = 6 261) of members of five language groups (speakers of Afrikaans, English, North Sotho, Setswana and isiZulu) working in the medical and beverage industries or studying at higher-educational institutions. Exploratory factor analysis with target rotations confirmed the PIB/SpEEx 100’s construct equivalence for the respondents from these five language groups. No evidence of either uniform or non-uniform item bias of practical significance was found for the sample.

  8. Lord-Wingersky Algorithm Version 2.0 for Hierarchical Item Factor Models with Applications in Test Scoring, Scale Alignment, and Model Fit Testing. CRESST Report 830

    Science.gov (United States)

    Cai, Li

    2013-01-01

    Lord and Wingersky's (1984) recursive algorithm for creating summed score based likelihoods and posteriors has a proven track record in unidimensional item response theory (IRT) applications. Extending the recursive algorithm to handle multidimensionality is relatively simple, especially with fixed quadrature because the recursions can be defined…
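The Lord-Wingersky recursion referred to above builds the distribution of the summed score one item at a time, at a fixed ability (quadrature) point. A minimal sketch for dichotomous items, with hypothetical response probabilities:

```python
def summed_score_likelihood(p):
    """Lord-Wingersky recursion: distribution of the summed score over
    dichotomous items, given each item's probability of a correct
    response at a fixed ability level.

    p: list of per-item success probabilities P_i(theta).
    Returns L where L[s] = P(summed score == s | theta).
    """
    likelihood = [1.0]  # zero items: score 0 with certainty
    for p_i in p:
        q_i = 1.0 - p_i
        new = [0.0] * (len(likelihood) + 1)
        for s, mass in enumerate(likelihood):
            new[s] += mass * q_i      # item answered incorrectly
            new[s + 1] += mass * p_i  # item answered correctly
        likelihood = new
    return likelihood

# Hypothetical three-item example at one quadrature point:
dist = summed_score_likelihood([0.8, 0.6, 0.5])
print([round(x, 3) for x in dist])  # → [0.04, 0.26, 0.46, 0.24]
```

Summed-score posteriors follow by running this recursion at each quadrature point and weighting by the ability distribution; the multidimensional extension in the report generalizes exactly this step.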

  9. Continuous Online Item Calibration: Parameter Recovery and Item Utilization.

    Science.gov (United States)

    Ren, Hao; van der Linden, Wim J; Diao, Qi

    2017-06-01

    Parameter recovery and item utilization were investigated for different designs for online test item calibration. The design was adaptive in a double sense: it assumed both adaptive testing of examinees from an operational pool of previously calibrated items and adaptive assignment of field-test items to the examinees. Four criteria of optimality for the assignment of the field-test items were used, each of them based on the information in the posterior distributions of the examinee's ability parameter during adaptive testing as well as the sequentially updated posterior distributions of the field-test item parameters. In addition, different stopping rules based on target values for the posterior standard deviations of the field-test parameters and the size of the calibration sample were used. The impact of each of the criteria and stopping rules on the statistical efficiency of the estimates of the field-test parameters and on the time spent by the items in the calibration procedure was investigated. Recommendations as to the practical use of the designs are given.

  10. Confiabilidade teste-reteste do item único de saúde bucal percebida em uma população de adultos no Rio de Janeiro, Brasil

    Directory of Open Access Journals (Sweden)

    Gislaine Afonso-Souza

    2007-06-01

    Full Text Available This study assessed the test-retest reliability of a single-item measure of self-rated oral health included in the questionnaire of a longitudinal study (Pró-Saúde Study, 2001). The questionnaire was administered twice to a sample of 101 employees of a university in Rio de Janeiro, Brazil. Self-rated oral health was assessed with a single item offering five response options, from "very good" to "very poor". Agreement was estimated with the quadratic-weighted kappa statistic (k), stratified by sex, age, income, and schooling. The kappa coefficient for the whole population was 0.80. Higher point estimates were obtained for women (k = 0.84), young adults (k = 0.85), participants with secondary schooling (k = 0.86), and those earning more than six minimum wages (k = 0.91). Lower kappa estimates were found for individuals over 40 years of age (k = 0.67 for 40-49 years; k = 0.69 for 50 or older). The test-retest reliability of the single-item measure ranged from substantial to almost perfect across all population strata, suggesting that this item can be used in future analyses within the Pró-Saúde Study.
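The quadratic-weighted kappa used in this record penalizes disagreements by the squared distance between response categories. A minimal sketch computing it from a test-retest contingency table; the table below is invented for illustration and is not the study's data:

```python
def quadratic_weighted_kappa(table):
    """Quadratic-weighted kappa for a k x k test-retest contingency table.

    Disagreements are weighted by the squared distance between
    categories, normalised by the maximum possible distance.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * table[i][j]                    # observed disagreement
            den += w * row_tot[i] * col_tot[j] / n    # chance disagreement
    return 1.0 - num / den

# Hypothetical 3-category test-retest table with strong agreement:
table = [[20, 5, 0],
         [4, 30, 3],
         [1, 2, 15]]
print(round(quadratic_weighted_kappa(table), 3))  # → 0.788
```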

  11. GRE Verbal Analogy Items: Examinee Reasoning on Items.

    Science.gov (United States)

    Duran, Richard P.; And Others

    Information about how Graduate Record Examination (GRE) examinees solve verbal analogy problems was obtained in this study through protocol analysis. High- and low-ability subjects who had recently taken the GRE General Test were asked to "think aloud" as they worked through eight analogy items. These items varied factorially on the…

  12. What Is the Ability Emotional Intelligence Test (MSCEIT) Good for? An Evaluation Using Item Response Theory: e98827

    National Research Council Canada - National Science Library

    Marina Fiori; Jean-Philippe Antonietti; Moira Mikolajczak; Olivier Luminet; Michel Hansenne; Jérôme Rossier

    2014-01-01

    ...). However, there is a scarcity of tests measuring EI as a form of intelligence. The Mayer Salovey Caruso Emotional Intelligence Test, or MSCEIT, is among the few available and the most widespread measure of EI as an ability...

  13. A continuous-scale measure of child development for population-based epidemiological surveys: a preliminary study using Item Response Theory for the Denver Test.

    Science.gov (United States)

    Drachler, Maria de Lourdes; Marshall, Tom; de Carvalho Leite, José Carlos

    2007-03-01

    A method for translating research data from the Denver Test into individual scores of developmental status measured in a continuous scale is presented. It was devised using the Denver Developmental Screening Test (DDST) but can be used for Denver II. The DDST was applied in a community-based survey of 3389 under-5-year-olds in Porto Alegre, Brazil. The items of success were standardised by logistic regression on log chronological age. Each child's ability age was then estimated by maximum likelihood as the age in this reference population corresponding to the child's success and failures in the test. The score of developmental status is the natural logarithm of this ability age divided by chronological age and thus measures the delay or advance in the child's ability age compared with chronological age. This method estimates development status using both difficulty and discriminating power of each item in the reference population, an advantage over scores based on total number of items correctly performed or failed, which depend on difficulty only. The score corresponds with maternal opinion of child developmental status and with the 3-category scale of the DDST. It shows good construct validity, indicated by symmetrical and homogeneous variability from 3 months upwards, and reasonable results in describing gender differences in development by age, the mean score increasing with socio-economic conditions and diminishing among low-birthweight children. If a standardised measure of development status (z-scores) is required, this can be obtained by dividing the score by its standard deviation. Concurrent and discriminant validity of the score must be examined in further studies.
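The scoring method described above can be sketched as follows: each item's success probability is a logistic function of log chronological age, the child's "ability age" is the age at which the observed pattern of passes and failures is most likely, and the developmental score is the log ratio of ability age to chronological age. All item parameters and the age grid below are hypothetical, for illustration only:

```python
import math

def p_success(log_age, intercept, slope):
    """Per-item logistic model on log chronological age, as standardised
    in the reference population (parameters here are hypothetical)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * log_age)))

def ability_age(responses, params, grid=None):
    """Maximum-likelihood ability age: the age (in years) at which the
    child's pattern of passes (1) and failures (0) is most likely."""
    if grid is None:
        grid = [m / 12.0 for m in range(1, 61)]  # 1-60 months, in years

    def loglik(age):
        la = math.log(age)
        ll = 0.0
        for r, (b0, b1) in zip(responses, params):
            p = p_success(la, b0, b1)
            ll += math.log(p if r else 1.0 - p)
        return ll

    return max(grid, key=loglik)

# Hypothetical items (easy, medium, hard) and a 2.0-year-old child
# who passes the two easier items and fails the hardest one:
params = [(1.0, 2.0), (0.0, 2.0), (-2.0, 2.0)]
est = ability_age([1, 1, 0], params)
score = math.log(est / 2.0)  # developmental status: ln(ability/chronological age)
```

A negative score indicates an ability age below chronological age (delay), a positive score an advance; dividing by its standard deviation in the reference population would give the z-score form mentioned in the abstract.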

  14. Limited-Information Goodness-of-Fit Testing of Diagnostic Classification Item Response Theory Models. CRESST Report 840

    Science.gov (United States)

    Hansen, Mark; Cai, Li; Monroe, Scott; Li, Zhen

    2014-01-01

    It is a well-known problem in testing the fit of models to multinomial data that the full underlying contingency table will inevitably be sparse for tests of reasonable length and for realistic sample sizes. Under such conditions, full-information test statistics such as Pearson's X² and the likelihood ratio statistic…

  15. Developing Pairwise Preference-Based Personality Test and Experimental Investigation of Its Resistance to Faking Effect by Item Response Model

    Science.gov (United States)

    Usami, Satoshi; Sakamoto, Asami; Naito, Jun; Abe, Yu

    2016-01-01

    Recent years have shown increased awareness of the importance of personality tests in educational, clinical, and occupational settings, and developing faking-resistant personality tests is a very pragmatic issue for achieving more precise measurement. Inspired by Stark (2002) and Stark, Chernyshenko, and Drasgow (2005), we develop a pairwise…

  16. Towards an authoring system for item construction

    NARCIS (Netherlands)

    Rikers, Jos H.A.N.

    1988-01-01

    The process of writing test items is analyzed, and a blueprint is presented for an authoring system for test item writing to reduce invalidity and to structure the process of item writing. The developmental methodology is introduced, and the first steps in the process are reported. A historical revi

  17. Unfair items detection in educational measurement

    CERN Document Server

    Bakman, Yefim

    2012-01-01

    Measurement professionals cannot come to an agreement on the definition of the term 'item fairness'. In this paper a continuous measure of item unfairness is proposed. The more the unfairness measure deviates from zero, the less fair the item is. If the measure exceeds the cutoff value, the item is identified as definitely unfair. The new approach can identify unfair items that would not be identified with conventional procedures. The results are in accord with experts' judgments on the item qualities. Since no assumptions are made about score distributions or correlations, the method is applicable to any educational test. Its performance is illustrated through application to scores of a real test.

  18. Análise de itens de uma prova de raciocínio estatístico Analysis of items of a statistical reasoning test

    Directory of Open Access Journals (Sweden)

    Claudette Maria Medeiros Vendramini

    2004-12-01

    Full Text Available This study analysed the 18 multiple-choice questions of a test on basic statistical concepts using both classical and modern test theory. The test was taken by 325 undergraduate students, randomly selected from the humanities, exact sciences, and health sciences. The analysis indicated that the test is predominantly unidimensional and that the items fit the three-parameter logistic model best. The difficulty, discrimination, and biserial correlation indexes show acceptable values. It is suggested that new items be added to the test in order to establish reliability and validity for the educational context and to reveal undergraduate students' statistical reasoning when reading representations of statistical data.

  19. On-Demand Testing and Maintaining Standards for General Qualifications in the UK Using Item Response Theory: Possibilities and Challenges

    Science.gov (United States)

    He, Qingping

    2012-01-01

    Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high stakes examinations like the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues…

  20. Test-retest reliability of Antonovsky's 13-item sense of coherence scale in patients with hand-related disorders

    DEFF Research Database (Denmark)

    Hansen, Alice Ørts; Kristensen, Hanne Kaae; Cederlund, Ragnhild

    2017-01-01

    to be a powerful tool to measure the ICF component personal factors, which could have an impact on patients' rehabilitation outcomes. Implications for rehabilitation Antonovsky's SOC-13 scale showed test-retest reliability for patients with hand-related disorders. The SOC-13 scale could be a suitable tool to help...

  1. The six-item Clock Drawing Test – reliability and validity in mild Alzheimer’s disease

    DEFF Research Database (Denmark)

    Jørgensen, Kasper; Kristensen, Maria K; Waldemar, Gunhild

    2015-01-01

    This study presents a reliable, short and practical version of the Clock Drawing Test (CDT) for clinical use and examines its diagnostic accuracy in mild Alzheimer's disease versus elderly nonpatients. Clock drawings from 231 participants were scored independently by four clinical neuropsychologi...

  2. Cross-cultural development of an item list for computer-adaptive testing of fatigue in oncological patients

    DEFF Research Database (Denmark)

    Giesinger, Johannes M; Aa Petersen, Morten; Groenvold, Mogens;

    2011-01-01

    Within an ongoing project of the EORTC Quality of Life Group, we are developing computerized adaptive test (CAT) measures for the QLQ-C30 scales. These new CAT measures are conceptualised to reflect the same constructs as the QLQ-C30 scales. Accordingly, the Fatigue-CAT is intended to capture...

  3. An Evaluation of One- and Three-Parameter Logistic Tailored Testing Procedures for Use with Small Item Pools.

    Science.gov (United States)

    McKinley, Robert L.; Reckase, Mark D.

    A two-stage study was conducted to compare the ability estimates yielded by tailored testing procedures based on the one-parameter logistic (1PL) and three-parameter logistic (3PL) models. The first stage of the study employed real data, while the second stage employed simulated data. In the first stage, response data for 3,000 examinees were…

  4. How to Get Really Smart: Modeling Retest and Training Effects in Ability Testing using Computer-Generated Figural Matrix Items

    Science.gov (United States)

    Freund, Philipp Alexander; Holling, Heinz

    2011-01-01

    The interpretation of retest scores is problematic because they are potentially affected by measurement and predictive bias, which impact construct validity, and because their size differs as a function of various factors. This paper investigates the construct stability of scores on a figural matrices test and models retest effects at the level of…

  5. Cross-cultural development of an item list for computer-adaptive testing of fatigue in oncological patients

    NARCIS (Netherlands)

    Giesinger, J.M.; Petersen, M.A.; Groenvold, M.; Aaronson, N.K.; Arraras, J.I.; Conroy, T.; Gamper, E.M.; Kemmler, G.; King, M.T.; Oberguggenberger, A.S.; Velikova, G.; Young, T.; Holzner, B.

    2011-01-01

    Introduction Within an ongoing project of the EORTC Quality of Life Group, we are developing computerized adaptive test (CAT) measures for the QLQ-C30 scales. These new CAT measures are conceptualised to reflect the same constructs as the QLQ-C30 scales. Accordingly, the Fatigue-CAT is intended to c

  7. The Role of Different Cognitive Components in Predicting the Item Difficulty of Figural Reasoning Tests

    Institute of Scientific and Technical Information of China (English)

    李中权; 王力; 张厚粲; 周仁来

    2011-01-01

    Figural reasoning tests (as represented by Raven's tests) are widely applied as effective measures of fluid intelligence in recruitment and personnel selection. However, several studies have revealed that these tests are no longer appropriate because of high item-exposure rates. Computerized automatic item generation (AIG) has gradually been recognized as a promising technique for handling item exposure. Understanding the sources of item variation constitutes the initial stage of computerized AIG, that is, searching for the underlying processing components and the stimuli that significantly influence those components. Some studies have explored sources of item variation, but so far there are no consistent results. This study investigated the relation between item difficulties and stimulus factors (e.g., familiarity of figures, abstraction of attributes, perceptual organization, and memory load) and determined the relative importance of those factors in predicting item difficulties. Eight sets of figural reasoning tests (each containing 14 items imitating items from Raven's Advanced Progressive Matrices, APM) were constructed by manipulating the familiarity of figures, the degree of abstraction of attributes, the perceptual organization, and the types and number of rules. Using an anchor-test design, the tests were administered via the internet to 6323 participants, with 10 items drawn from the APM as anchor items; thus, each participant completed 14 items from one of the eight sets plus the 10 anchor items within half an hour. To prevent participants from using a response-elimination strategy, we presented the item stem first and then the alternatives in turn, asking participants to determine which alternative was the best. DIMTEST analyses were conducted on the participants' responses to each of the eight tests. Results showed that the items measure a single dimension on each test. A likelihood ratio test indicated that the data fit the two-parameter logistic model (2PL) best. Items were

  8. Assessment of free and cued recall in Alzheimer's disease and vascular and frontotemporal dementia with 24-item Grober and Buschke test.

    Science.gov (United States)

    Cerciello, Milena; Isella, Valeria; Proserpi, Alice; Papagno, Costanza

    2017-01-01

    Alzheimer's disease (AD), vascular dementia (VaD) and frontotemporal dementia (FTD) are the most common forms of dementia. It is well known that memory deficits in AD differ from those in VaD and FTD, especially with respect to cued recall. The aim of this clinical study was to compare memory performance in 15 AD, 10 VaD and 9 FTD patients and 20 normal controls by means of the 24-item Grober-Buschke test [8]. The patient groups were comparable in terms of severity of dementia. We considered free and total recall (free plus cued), both immediate and delayed, and computed an Index of Sensitivity to Cueing (ISC) [8] for the immediate and delayed trials. We assessed whether cued recall predicted subsequent free recall across the patient groups. We found that AD patients recalled fewer items from the beginning and were less sensitive to cueing, supporting the hypothesis that memory disorders in AD depend on an encoding and storage deficit. In immediate recall, VaD and FTD patients showed similar memory performance and stronger sensitivity to cueing than AD patients, suggesting that their memory disorders are due to a difficulty in spontaneously implementing efficient retrieval strategies. However, the ISC was lower in delayed than in immediate recall in VaD compared with FTD, owing to greater forgetting in VaD.

  9. Compreensão da leitura: análise do funcionamento diferencial dos itens de um Teste de Cloze Reading comprehension: differential item functioning analysis of a Cloze Test

    Directory of Open Access Journals (Sweden)

    Katya Luciane Oliveira

    2012-01-01

    Full Text Available This study investigated the fit of a Cloze test to the Rasch model and evaluated differential item functioning (DIF) by gender. Participants were 573 students from the 5th to 8th grades of state public schools in the states of São Paulo and Minas Gerais, Brazil. The Cloze test was administered collectively. The analysis showed a good fit of the instrument to the Rasch model, and the items were answered according to the expected pattern, likewise showing good fit. Regarding DIF, only three items functioned differentially by gender. Overall, the data indicated a balance between the answers given by boys and girls.

  10. Research on Test Items and Methods for Pressure Validators

    Institute of Scientific and Technical Information of China (English)

    林景星; 林勤; 王孔祥; 王永红

    2015-01-01

    Starting from the working principle of the pressure gauge validator, this paper analyses, from the standpoint of its metrological characteristics, test items and methods including appearance and functional checks, pressure-strength testing and sealing (leak-tightness) testing. Through actual trials of these test items, the study establishes technical requirements for the validator's pressure strength and for its sealing performance. The results show that the proposed inspection and test items for pressure validators are practicable and can serve as a test method for validators used in pressure metrological verification and calibration.

  11. Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions.

    Science.gov (United States)

    Haberman, Shelby J; Sinharay, Sandip; Chon, Kyong Hee

    2013-07-01

    Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
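A standardized residual of the kind discussed above (a simplified form of the Hambleton et al. residual, not the ratio estimator the paper proposes) compares the observed proportion correct in an ability group with the model-implied probability; it is approximately standard normal when the model fits. A sketch with invented numbers:

```python
import math

def standardized_residual(observed_correct, n, p_model):
    """Standardized residual for one item in one ability group:
    observed proportion correct versus the model-implied probability.
    Approximately standard normal under model fit (large n)."""
    p_obs = observed_correct / n
    se = math.sqrt(p_model * (1.0 - p_model) / n)  # binomial standard error
    return (p_obs - p_model) / se

# Hypothetical group of 200 examinees where the fitted IRT model
# predicts a 70% success rate and 135 answered correctly:
z = standardized_residual(135, 200, 0.70)
print(round(z, 2))  # → -0.77
```

Values of |z| well above 2 across several ability groups would flag the item as misfitting.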

  12. Real and Artificial Differential Item Functioning

    Science.gov (United States)

    Andrich, David; Hagquist, Curt

    2012-01-01

    The literature in modern test theory on procedures for identifying items with differential item functioning (DIF) among two groups of persons includes the Mantel-Haenszel (MH) procedure. Generally, it is not recognized explicitly that if there is real DIF in some items which favor one group, then as an artifact of this procedure, artificial DIF…

  13. The Multidimensionality of Verbal Analogy Items

    Science.gov (United States)

    Ullstadius, Eva; Carlstedt, Berit; Gustafsson, Jan-Eric

    2008-01-01

    The influence of general and verbal ability on each of 72 verbal analogy test items were investigated with new factor analytical techniques. The analogy items together with the Computerized Swedish Enlistment Battery (CAT-SEB) were given randomly to two samples of 18-year-old male conscripts (n = 8566 and n = 5289). Thirty-two of the 72 items had…

  14. Measuring outcomes in allergic rhinitis: psychometric characteristics of a Spanish version of the congestion quantifier seven-item test (CQ7)

    Directory of Open Access Journals (Sweden)

    Mullol Joaquim

    2011-03-01

    Full Text Available Abstract Background No control tools for nasal congestion (NC) are currently available in Spanish. This study aimed to adapt and validate the Congestion Quantifier Seven Item Test (CQ7) for Spain. Methods CQ7 was adapted from English following international guidelines. The instrument was validated in an observational, prospective study in allergic rhinitis patients with NC (N = 166) and a control group without NC (N = 35). Participants completed the CQ7, the MOS sleep questionnaire, and a measure of psychological well-being (PGWBI). Clinical data included NC severity rating, acoustic rhinometry, and total symptom score (TSS). Internal consistency was assessed using Cronbach's alpha and test-retest reliability using the intraclass correlation coefficient (ICC). Construct validity was tested by examining correlations with other outcome measures and the ability to discriminate between groups classified by NC severity. Sensitivity and specificity were assessed using the area under the receiver operating curve (AUC) and responsiveness over time using effect sizes (ES). Results Cronbach's alpha for the CQ7 was 0.92, and the ICC was 0.81, indicating good reliability. CQ7 correlated most strongly with the TSS (r = 0.60). Conclusions The Spanish version of the CQ7 is appropriate for detecting, measuring, and monitoring NC in allergic rhinitis patients.
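The internal-consistency statistic reported in records like this one, Cronbach's alpha, can be computed directly from item-level scores. A minimal sketch with made-up ratings (five hypothetical respondents on a 7-item scale, not the study's data):

```python
def cronbach_alpha(rows):
    """Cronbach's alpha from a respondents-by-items score matrix."""
    k = len(rows[0])          # number of items
    def var(xs):              # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical respondents answering a 7-item scale (scores 1-5).
data = [
    [1, 2, 1, 2, 1, 2, 1],
    [3, 3, 4, 3, 3, 4, 3],
    [5, 4, 5, 5, 4, 5, 5],
    [2, 2, 3, 2, 2, 2, 3],
    [4, 5, 4, 4, 5, 4, 4],
]
alpha = cronbach_alpha(data)
```

Alpha near 1 indicates that the items covary strongly relative to the variance of the total score, the sense in which the CQ7's 0.92 indicates good reliability.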

  15. [A factor analytic study of the items for the personality description based on the principle of the three traits theory for the work curve of addition of the Uchida-Kraepelin psychodiagnostic test].

    Science.gov (United States)

    Kashiwagi, S; Yanai, H; Aoki, T; Tamai, H; Tanaka, Y; Hokugoh, K

    1985-08-01

    An inventory form for personality description based on the principle of the three-traits theory for the work curve of the Uchida-Kraepelin psychodiagnostic test was investigated from a factor-analytic point of view. Three analyses were performed. First, the 66 items of the present form were administered and a tentative orthogonal factor solution was obtained. Second, 20 items forming a simplified pattern in the sense of "simple structure" were selected based on the result of the first factor rotation, and a further orthogonal factor rotation was applied to the data from the selected items, so that the assumptions for the three traits, primacy (A), variability (B), and recency (C), were confirmed factor-analytically. Finally, in order to provide an extended form for academic and practical use, 10 more items were added to the 20 items of the second analysis after applying a third orthogonal factor rotation, yielding a new form consisting of 30 items. Some relationships between the present work and that of Eysenck and Eysenck (1968) are discussed.

  16. Auditory Musical Reasoning Test (RAu): An Initial Study with Item Response Theory

    Directory of Open Access Journals (Sweden)

    Fernando Pessotto

    2012-12-01

    Full Text Available This study sought evidence of validity, based on internal structure and on a criterion, for an instrument assessing auditory processing of musical abilities (Auditory Processing Test with Musical Stimuli, RAu). A total of 162 people of both sexes were assessed, 56.8% men, aged 15 to 59 years (M = 27.5; SD = 9.01). Participants were divided into musicians (N = 24), amateurs (N = 62), and laypeople (N = 76) according to their level of musical knowledge. Full Information Factor Analysis was used to verify the dimensionality of the instrument, and the properties of the items were examined through Item Response Theory (IRT). In addition, the instrument's capacity to discriminate between the groups of musicians and non-musicians was investigated. The data provide evidence that the items measure one principal dimension (alpha = 0.92), with high capacity to differentiate professional musicians, amateurs, and laypeople, yielding a criterion validity coefficient of r = 0.68. The results indicate positive evidence of reliability and validity for the RAu.

  17. The Detection and Influence of Problematic Item Content in Ability Tests: An Examination of Sensitivity Review Practices for Personnel Selection Test Development

    Science.gov (United States)

    Grand, James A.; Golubovich, Juliya; Ryan, Ann Marie; Schmitt, Neal

    2013-01-01

    In organizational and educational practices, sensitivity reviews are commonly advocated techniques for reducing test bias and enhancing fairness. In the present paper, results from two studies are reported which investigate how effective individuals are at detecting problematic test content and the influence such content has on important testing…

  18. Validation of the Ten-Item Internet Gaming Disorder Test (IGDT-10) and evaluation of the nine DSM-5 Internet Gaming Disorder criteria.

    Science.gov (United States)

    Király, Orsolya; Sleczka, Pawel; Pontes, Halley M; Urbán, Róbert; Griffiths, Mark D; Demetrovics, Zsolt

    2017-01-01

    The inclusion of Internet Gaming Disorder (IGD) in the DSM-5 (Section 3) has given rise to much scholarly debate regarding the proposed criteria and their operationalization. The present study's aim was threefold: to (i) develop and validate a brief psychometric instrument (Ten-Item Internet Gaming Disorder Test; IGDT-10) to assess IGD using the definitions suggested in the DSM-5, (ii) contribute to the ongoing debate regarding the usefulness and validity of each of the nine IGD criteria (using Item Response Theory [IRT]), and (iii) investigate the cut-off threshold suggested in the DSM-5. An online sample of 4887 gamers (age range 14-64 years, mean age 22.2 years [SD = 6.4], 92.5% male) was collected through Facebook and a gaming-related website with the cooperation of a popular Hungarian gaming magazine. A shopping voucher worth approximately 300 Euros was raffled among participants to boost participation (i.e., a lottery incentive). Confirmatory factor analysis and a structural regression model were used to test the psychometric properties of the IGDT-10, and IRT analysis was conducted to test the measurement performance of the nine IGD criteria. Finally, latent class analysis along with sensitivity and specificity analysis were used to investigate the cut-off threshold proposed in the DSM-5. The analysis supported the IGDT-10's validity, reliability, and suitability for use in future research. Findings of the IRT analysis suggest IGD is manifested through a different set of symptoms depending on the severity of the disorder. More specifically, "continuation", "preoccupation", "negative consequences" and "escape" were associated with lower severity of IGD, while "tolerance", "loss of control", "giving up other activities" and "deception" were associated with more severe levels. "Preoccupation" and "escape" provided very little information for estimating IGD severity. Finally, the threshold suggested in the DSM-5 was supported by the statistical analyses. The IGDT-10 is a valid and reliable instrument for assessing IGD as proposed in the DSM-5.
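Evaluating a diagnostic cut-off as described above reduces to computing sensitivity and specificity against a criterion grouping. A minimal sketch with hypothetical data (the DSM-5 suggests endorsing at least 5 of the 9 criteria; the counts and labels below are invented for illustration):

```python
def sens_spec(counts, labels, cutoff=5):
    """Sensitivity and specificity of a symptom-count cut-off.

    counts: number of endorsed criteria per person (0-9);
    labels: True if the person is classified as disordered by the criterion.
    """
    tp = sum(1 for c, y in zip(counts, labels) if c >= cutoff and y)
    fn = sum(1 for c, y in zip(counts, labels) if c < cutoff and y)
    tn = sum(1 for c, y in zip(counts, labels) if c < cutoff and not y)
    fp = sum(1 for c, y in zip(counts, labels) if c >= cutoff and not y)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical endorsed-criteria counts and criterion-group labels.
counts = [0, 1, 2, 5, 6, 7, 4, 9, 3, 5]
labels = [False, False, False, True, True, True, False, True, False, False]
sens, spec = sens_spec(counts, labels)
```

A supported threshold is one where both values stay high; lowering the cut-off trades specificity for sensitivity.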

  19. Faculty development on item writing substantially improves item quality.

    Science.gov (United States)

    Naeem, Naghma; van der Vleuten, Cees; Alfaris, Eiad Abdelmohsen

    2012-08-01

    The quality of items written for in-house examinations in medical schools remains a cause for concern, and several faculty development programs aim to improve faculty's item writing skills. The purpose of this study was to evaluate the effectiveness of a faculty development program in item development; an objective method was developed and used to assess improvement in faculty's competence to develop high-quality test items. This was a quasi-experimental study with a pretest-midtest-posttest design. A convenience sample of 51 faculty members participated. Structured checklists were used to assess the quality of test items at each phase of the study, and group scores were analyzed using repeated-measures analysis of variance. The results showed a significant increase in participants' mean scores on Multiple Choice Question, Short Answer Question, and Objective Structured Clinical Examination checklists from pretest to posttest. The study shows that items written without faculty development are generally lacking in quality, and it provides evidence of the value of faculty development in improving the quality of items generated by faculty.

  20. Corn Breeding Test Investigation Record Items and Norms

    Institute of Scientific and Technical Information of China (English)

    高会林; 高玮; 杨桂英

    2003-01-01

    This article describes how an orderly, scientific combination of the investigation record items and norms used in routine corn breeding with the trait descriptions required when applying for new variety rights protection, identifying commonalities and distinguishing differences, makes the record items and norms complete as well as precise, and thus truly reflects the corn's characteristics and specific traits.

  1. Adapting Item Format for Cultural Effects in Translated Tests: Cultural Effects on Construct Validity of the Chinese Versions of the MBTI

    Science.gov (United States)

    Osterlind, Steven J.; Miao, Danmin; Sheng, Yanyan; Chia, Rosina C.

    2004-01-01

    This study investigated the interaction between different cultural groups and item type, and the ensuing effect on construct validity for a psychological inventory, the Myers-Briggs Type Indicator (MBTI, Form G). The authors analyzed 94 items from 2 Chinese-translated versions of the MBTI (Form G) for factorial differences among groups of…

  2. Het nut van item respons theorie bij de constructie en evaluatie van niet-cognitieve instrumenten voor selectie en assessment binnen organisaties. : (The usefulness of item response theory for the construction and evaluation of noncognitive tests in personnel selection and assessment.)

    NARCIS (Netherlands)

    Egberink, Iris J. L.; Meijer, Rob R.

    In this article we discuss the use of IRT for the development and application of noncognitive measures in personnel selection and career development. We introduce the basic principles of IRT and we discuss the usefulness of IRT to evaluate the quality of items and tests to assess the measurement

  3. Het nut van item respons theorie bij de constructie en evaluatie van niet-cognitieve instrumenten voor selectie en assessment binnen organisaties. : (The usefulness of item response theory for the construction and evaluation of noncognitive tests in personnel selection and assessment.)

    NARCIS (Netherlands)

    Egberink, Iris J. L.; Meijer, Rob R.

    2012-01-01

    In this article we discuss the use of IRT for the development and application of noncognitive measures in personnel selection and career development. We introduce the basic principles of IRT and we discuss the usefulness of IRT to evaluate the quality of items and tests to assess the measurement pre

  4. A Comparison of Item Selection Methods for Cognitive Diagnostic Computerized Adaptive Testing with Nonstatistical Constraints

    Institute of Scientific and Technical Information of China (English)

    毛秀珍; 辛涛

    2014-01-01

    It is well known that items in the bank of a computerized adaptive test (CAT) should be used about equally often. For one thing, much of the manpower and financial resources spent on constructing the item bank is wasted if a large proportion of items are seldom exposed or never used. For another, ensuring test security and maintaining the item bank becomes a serious burden for test practitioners if item exposure is extremely skewed. In addition to controlling item exposure, the tests assembled for different examinees usually must satisfy many constraints, such as (a) appropriate proportions of each content domain; (b) "enemy items" may not appear in the same test; and (c) an appropriate balance of item keys. If constraints are violated, examinees may react in unexpected ways during the test, resulting in inaccurate trait estimates. Therefore, both item exposure control and content constraints are important non-statistical constraints. They strongly influence test validity, measurement accuracy, and comparability among examinees, so they need to be incorporated into the design of item selection for CAT in practical settings. When cognitive diagnostic theory is used in CAT, examinees can receive more detailed diagnostic information regarding their mastery of each attribute. Cognitive diagnostic CAT (CD-CAT) is therefore a promising research area and has gained much attention because it integrates both cognitive diagnosis and adaptive testing. The present study compared the performance of five item selection methods in CD-CAT with item exposure control and content constraints. The item selection methods applied are (a) incorporating the Monte Carlo approach into the item eligibility approach (MC-IE); (b) incorporating the maximum priority index method into the Monte Carlo approach (MC-MPI); (c) incorporating the restrictive threshold method into the Monte

  5. Australian Item Bank Program: Social Science Item Bank.

    Science.gov (United States)

    Australian Council for Educational Research, Hawthorn.

    After rigorous review, editing, and trial testing, this item bank was compiled to help secondary school teachers construct objective tests in the social sciences. Anthropology, economics, ethnic and cultural studies, geography, history, legal studies, politics, and sociology are among the topics represented. The bank consists of multiple choice…

  6. New decision criteria for selecting delta check methods based on the ratio of the delta difference to the width of the reference range can be generally applicable for each clinical chemistry test item.

    Science.gov (United States)

    Park, Sang Hyuk; Kim, So-Young; Lee, Woochang; Chun, Sail; Min, Won-Ki

    2012-09-01

    Many laboratories use 4 delta check methods: delta difference, delta percent change, rate difference, and rate percent change. However, guidelines regarding decision criteria for selecting delta check methods have not yet been provided. We present new decision criteria for selecting delta check methods for each clinical chemistry test item. We collected 811,920 and 669,750 paired (present and previous) test results for 27 clinical chemistry test items from inpatients and outpatients, respectively. We devised new decision criteria for the selection of delta check methods based on the ratio of the delta difference to the width of the reference range (DD/RR). Delta check methods based on these criteria were compared with those based on the CV% of the absolute delta difference (ADD) as well as those reported in 2 previous studies. The delta check methods suggested by new decision criteria based on the DD/RR ratio corresponded well with those based on the CV% of the ADD except for only 2 items each in inpatients and outpatients. Delta check methods based on the DD/RR ratio also corresponded with those suggested in the 2 previous studies, except for 1 and 7 items in inpatients and outpatients, respectively. The DD/RR method appears to yield more feasible and intuitive selection criteria and can easily explain changes in the results by reflecting both the biological variation of the test item and the clinical characteristics of patients in each laboratory. We suggest this as a measure to determine delta check methods.
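The four delta check quantities named above, and the proposed DD/RR ratio, are simple arithmetic on a pair of results. A sketch with an illustrative serum potassium result and reference range (values invented for illustration, not taken from the paper):

```python
def delta_checks(prev, curr, days_between, ref_low, ref_high):
    """Return the four common delta check quantities plus the DD/RR ratio."""
    dd = curr - prev                   # delta difference
    dpc = 100.0 * dd / prev            # delta percent change
    rd = dd / days_between             # rate difference (per day)
    rpc = dpc / days_between           # rate percent change (per day)
    dd_rr = dd / (ref_high - ref_low)  # delta difference / reference-range width
    return dd, dpc, rd, rpc, dd_rr

# Hypothetical: potassium 4.0 -> 5.0 mmol/L over 2 days, reference range 3.5-5.1.
dd, dpc, rd, rpc, dd_rr = delta_checks(4.0, 5.0, 2, 3.5, 5.1)
```

The DD/RR ratio expresses the change relative to the reference-range width, which is what lets the criterion reflect both biological variation of the analyte and the clinical setting.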

  7. Unqualified Items and Cause Analysis in Hand-held Electric Tools Routine Testing

    Institute of Scientific and Technical Information of China (English)

    高家材; 陈其勇; 郑小龙

    2015-01-01

    Based on routine testing, in particular CCC certification testing, supervision spot checks, and various local spot checks, this paper introduces the nonconforming items commonly encountered when testing hand-held electric tools and briefly analyzes the causes of those nonconformities.

  8. Dynamic and Comprehensive Item Selection Strategies for Computerized Adaptive Testing Based on the Graded Response Model

    Institute of Scientific and Technical Information of China (English)

    罗芬; 丁树良; 王晓庆

    2012-01-01

    Item selection strategy (ISS) is a core component of Computerized Adaptive Testing (CAT). Polytomous items provide more information about an examinee than dichotomous items, and adopting polytomously scored items is a research direction of CAT. The most widely used ISS is the maximum Fisher information (MFI) criterion, which raises concerns about the cost-efficiency of pool utilization and poses security risks for CAT programs. Chang & Ying (1999) and Chang, Qian, & Ying (2001) proposed two alternative item selection procedures, the a-stratified method (a-STR) and the a-stratified with b-blocking method (b-STR), based on the dichotomous model, with the goal of remedying the item overexposure and underexposure produced by MFI. However, a-STR and b-STR are static techniques because the items are stratified according to the given information at the beginning of the test. Based on the graded response model (GRM), a technique reducing the dimensionality of the difficulty (or step) parameters was recently employed to construct some ISSs. The limitation of this dimension-reduction technique is that it loses a lot of information. Thus, in order to improve on MFI, two new item selection methods are proposed based on the GRM: (1) modify the technique of reducing the dimensionality of the difficulty (or step) parameters by integrating interval estimation; (2) implement dynamic a-STR and dynamic b-STR methods during the testing process. On one hand, these new ISSs avoid the limitations of MFI while making good use of the advantages of the Fisher information function (FIF); the FIF compresses all item parameters and ability parameters, so it is in nature a comprehensive tool for all parameters. On the other hand, the new ISSs employ the property that the FIF represents the inverse of the variance of the ability estimate: let £ be the square root of the reciprocal of the Fisher information, d be the absolute deviation between the
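For contrast with the stratified and dynamic strategies discussed above, the baseline MFI criterion simply administers the unused item with maximum Fisher information at the current ability estimate. A sketch using the dichotomous 2PL for brevity (the study itself works with the graded response model), where the item information is I(theta) = a^2 * P * (1 - P):

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(theta, pool, administered):
    """Return the index of the unused item with maximum information at theta."""
    return max((i for i in range(len(pool)) if i not in administered),
               key=lambda i: info_2pl(theta, *pool[i]))

# Hypothetical pool of (a, b) parameter pairs; item 3 was already given.
pool = [(0.8, -1.0), (1.5, 0.2), (1.2, 1.5), (2.0, 0.1)]
picked = select_item(theta=0.0, pool=pool, administered={3})
```

Because high-a items always win under this rule, they are overexposed, which is exactly the problem the a-stratified variants address by restricting early selections to low-a strata.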

  9. A Nonparametric Item Analysis of the Modified Bender Gestalt Test

    Directory of Open Access Journals (Sweden)

    César Merino Soto

    2009-05-01

    Full Text Available This research is a psychometric study of a new scoring system for the Bender Gestalt test modified for children, the Qualitative Scoring System (Brannigan & Brunner, 2002), in a sample of 244 first-grade children from four public schools in Lima. The approach applied is nonparametric item analysis using the TestGraf computer program (Ramsay, 1991). The findings indicate appropriate levels of internal consistency, unidimensionality, and good discriminative capacity of the scoring categories of the Qualitative Scoring System. No demographic differences by gender or age were found. The findings are discussed in the context of the potential use of the Qualitative Scoring System and of nonparametric item analysis in psychometric research.

  10. Principles and procedures of considering item sequence effects in the development of calibrated item pools: Conceptual analysis and empirical illustration

    Directory of Open Access Journals (Sweden)

    Safir Yousfi

    2012-12-01

    Full Text Available Item responses can be context-sensitive. Consequently, composing test forms flexibly from a calibrated item pool requires considering potential context effects. This paper focuses on context effects that are related to the item sequence. It is argued that sequence effects are not necessarily a violation of item response theory but that item response theory offers a powerful tool to analyze them. If sequence effects are substantial, test forms cannot be composed flexibly on the basis of a calibrated item pool, which precludes applications like computerized adaptive testing. In contrast, minor sequence effects do not thwart applications of calibrated item pools. Strategies to minimize the detrimental impact of sequence effects on item parameters are discussed and integrated into a nomenclature that addresses the major features of item calibration designs. An example of an item calibration design demonstrates how this nomenclature can guide the process of developing a calibrated item pool.

  11. Item Purification Does Not Always Improve DIF Detection: A Counterexample with Angoff's Delta Plot

    Science.gov (United States)

    Magis, David; Facon, Bruno

    2013-01-01

    Item purification is an iterative process that is often advocated as improving the identification of items affected by differential item functioning (DIF). With test-score-based DIF detection methods, item purification iteratively removes the items currently flagged as DIF from the test scores to get purified sets of items, unaffected by DIF. The…

  12. A Psychometric Analysis of the Italian Version of the eHealth Literacy Scale Using Item Response and Classical Test Theory Methods.

    Science.gov (United States)

    Diviani, Nicola; Dima, Alexandra Lelia; Schulz, Peter Johannes

    2017-04-11

    The eHealth Literacy Scale (eHEALS) is a tool to assess consumers' comfort and skills in using information technologies for health. Although evidence exists of reliability and construct validity of the scale, less agreement exists on structural validity. The aim of this study was to validate the Italian version of the eHealth Literacy Scale (I-eHEALS) in a community sample with a focus on its structural validity, by applying psychometric techniques that account for item difficulty. Two Web-based surveys were conducted among a total of 296 people living in the Italian-speaking region of Switzerland (Ticino). After examining the latent variables underlying the observed variables of the Italian scale via principal component analysis (PCA), fit indices for two alternative models were calculated using confirmatory factor analysis (CFA). The scale structure was examined via parametric and nonparametric item response theory (IRT) analyses accounting for differences between items regarding the proportion of answers indicating high ability. Convergent validity was assessed by correlations with theoretically related constructs. CFA showed a suboptimal model fit for both models. IRT analyses confirmed all items measure a single dimension as intended. Reliability and construct validity of the final scale were also confirmed. The contrasting results of factor analysis (FA) and IRT analyses highlight the importance of considering differences in item difficulty when examining health literacy scales. The findings support the reliability and validity of the translated scale and its use for assessing Italian-speaking consumers' eHealth literacy.

  13. Reversed item bias: an integrative model.

    Science.gov (United States)

    Weijters, Bert; Baumgartner, Hans; Schillewaert, Niels

    2013-09-01

    In the recent methodological literature, various models have been proposed to account for the phenomenon that reversed items (defined as items for which respondents' scores have to be recoded in order to make the direction of keying consistent across all items) tend to lead to problematic responses. In this article we propose an integrative conceptualization of three important sources of reversed item method bias (acquiescence, careless responding, and confirmation bias) and specify a multisample confirmatory factor analysis model with 2 method factors to empirically test the hypothesized mechanisms, using explicit measures of acquiescence and carelessness and experimentally manipulated versions of a questionnaire that varies 3 item arrangements and the keying direction of the first item measuring the focal construct. We explain the mechanisms, review prior attempts to model reversed item bias, present our new model, and apply it to responses to a 4-item self-esteem scale (N = 306) and the 6-item Revised Life Orientation Test (N = 595). Based on the literature review and the empirical results, we formulate recommendations on how to use reversed items in questionnaires.
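The recoding that defines a reversed item in this literature is a one-line transformation: on a 1-to-k Likert scale, a reverse-keyed response x is rescored as k + 1 - x so that all items point in the same direction before summing. A sketch with hypothetical responses on a 4-item, 5-point scale:

```python
def recode(responses, reversed_idx, k=5):
    """Recode reverse-keyed items on a 1..k scale so all items are keyed alike."""
    return [(k + 1 - x) if i in reversed_idx else x
            for i, x in enumerate(responses)]

# Hypothetical 4-item scale where items 1 and 3 are reverse-keyed.
raw = [4, 2, 5, 1]
recoded = recode(raw, reversed_idx={1, 3})
```

Method biases such as acquiescence and careless responding act on the raw responses, before this recoding, which is why reversed items can behave differently from regular items even after rescoring.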

  14. On-Line Item Attribute Identification in Cognitive Diagnostic Computerized Adaptive Testing

    Institute of Scientific and Technical Information of China (English)

    汪文义; 丁树良; 游晓锋

    2011-01-01

    Developing items for cognitive diagnostic tests is costly, and calibrating the attributes of a large number of items is time-consuming and laborious, a difficult task even for experts. No research on calibrating item attributes within computerized adaptive diagnostic testing has yet been reported. Building on an existing small item bank developed for diagnostic testing, this paper embeds raw (uncalibrated) items during cognitive diagnostic computerized adaptive testing and investigates attribute calibration, focusing on calibration methods and the factors that influence them. Besides the MMLE and MLE methods, a new "intersection" method applicable to all non-compensatory cognitive diagnostic models is proposed. Monte Carlo simulations show that MMLE outperforms MLE; when knowledge states are estimated accurately, adaptively embedding raw items has some advantage over embedding them at random. As knowledge-state estimation accuracy and the number of responses to raw items increase, the intersection method performs about as well as MLE, except under divergent and unstructured attribute hierarchies, while requiring no pre-specified item parameter values. Cognitive Diagnostic Assessment (CDA), combining psychometrics and cognitive science, has received increased attention recently, but it is still in its infancy (Leighton and Gierl, 2007). CDA based on the incidence Q-matrix (Tatsuoka, 1990) is quite different from traditional Item Response Theory. The entries in each column of the incidence Q-matrix indicate which skills and knowledge are involved in the solution of each item, so the Q-matrix plays an important role in establishing the relation between latent knowledge states and ideal response patterns, providing information about students' cognitive strengths and weaknesses. CDA also requires specifying which latent attributes are measured by the test items and how these attributes relate to one another. Leighton, Gierl and Hunka (2004) described the logic of the Attribute Hierarchy Method (AHM) as follows: first, the hierarchy of attributes must be specified through protocol techniques before test item construction

  15. Investigating the Population Sensitivity Assumption of Item Response Theory True-Score Equating across Two Subgroups of Examinees and Two Test Formats

    Science.gov (United States)

    von Davier, Alina A.; Wilson, Christine

    2008-01-01

    Dorans and Holland (2000) and von Davier, Holland, and Thayer (2003) introduced measures of the degree to which an observed-score equating function is sensitive to the population on which it is computed. This article extends the findings of Dorans and Holland and of von Davier et al. to item response theory (IRT) true-score equating methods that…

  16. Simplified scoring of the Actionable 8-item screening questionnaire for neurogenic bladder overactivity in multiple sclerosis: a comparative analysis of test performance at different cut-off points

    NARCIS (Netherlands)

    Jongen, Peter Joseph; Blok, Bertil F.; Heesakkers, John P.; Heerings, Marco; Lemmens, Wim A.; Donders, Rogier

    2015-01-01

    Background: The Actionable questionnaire is an 8-item tool to screen patients with multiple sclerosis (MS) for neurogenic bladder problems, identifying those patients who might benefit from urological referral and bladder-specific treatment. The original scoring yields a total score of 0 to 24 with

  17. Simplified scoring of the Actionable 8-item screening questionnaire for neurogenic bladder overactivity in multiple sclerosis: a comparative analysis of test performance at different cut-off points

    NARCIS (Netherlands)

    Jongen, P.J.; Blok, B.F.; Heesakkers, J.P.F.A.; Heerings, M.; Lemmens, W.A.J.G.; Donders, R.

    2015-01-01

    BACKGROUND: The Actionable questionnaire is an 8-item tool to screen patients with multiple sclerosis (MS) for neurogenic bladder problems, identifying those patients who might benefit from urological referral and bladder-specific treatment. The original scoring yields a total score of 0 to 24 with

  19. Differential Item Functioning Analysis of the Science and Mathematics Items in the University Entrance Examinations in Turkey

    Science.gov (United States)

    Kalaycioglu, Dilara Bakan; Berberoglu, Giray

    2011-01-01

    This study is aimed to detect differential item functioning (DIF) items across gender groups, analyze item content for the possible sources of DIF, and eventually investigate the effect of DIF items on the criterion-related validity of the test scores in the quantitative section of the university entrance examination (UEE) in Turkey. The reason…

  20. Psychometric Changes on Item Difficulty Due to Item Review by Examinees

    Directory of Open Access Journals (Sweden)

    Elena C. Papanastasiou

    2015-01-01

    Full Text Available If good measurement depends in part on the estimation of accurate item characteristics, it is essential that test developers become aware of discrepancies that may exist on the item parameters before and after item review. The purpose of this study was to examine the answer changing patterns of students while taking paper-and-pencil multiple choice exams, and to examine how these changes affect the estimation of item difficulty parameters. The results of this study have shown that item review by examinees does produce some changes to the examinee ability estimates and to the item difficulty parameters. In addition, these effects are more pronounced in shorter tests than in longer tests. In turn, these small changes produce larger effects when estimating the changes in the information values of each student's test score.

  1. Testing the Personal Wellbeing Index on 12-16-Year-Old Adolescents in 3 Different Countries with 2 New Items

    Science.gov (United States)

    Casas, Ferran; Sarriera, Jorge Castella; Alfaro, Jaime; Gonzalez, Monica; Malo, Sara; Bertran, Irma; Figuer, Cristina; da Cruz, Daniel Abs; Bedin, Livia; Paradiso, Angela; Weinreich, Karin; Valdenegro, Boris

    2012-01-01

    The 7-item adult version of the Personal Wellbeing scale (Cummins et al. Social Indic Res 64:159-190, 2003) was administered to two samples of adolescents aged 12-16 in Brazil (N = 1,588) and Spain (N = 2,900), and to a sample of adolescents aged 14-16 in Chile (N = 843). The results obtained were analyzed to determine its psychometric…

  3. Psychometric properties of WISC-III items

    Directory of Open Access Journals (Sweden)

    Vera Lúcia Marques de Figueiredo

    2008-09-01

    Full Text Available A test is improved through the selection, substitution or revision of its items, and analyzing each item increases the test's validity and precision. This article presents results on the psychometric properties of the items of the WISC-III subtests with respect to difficulty, discrimination and validity. The WISC-III is an instrument widely used in the assessment of intelligence, and knowing the quality of its items is essential for the professional who administers the test. The analyses were based on the scores of 801 test protocols, collected during the study that adapted the instrument to the Brazilian context. They showed that the adapted items have adequate psychometric characteristics, supporting the use of the instrument as a reliable diagnostic tool.

  4. A New Extension of the Binomial Error Model for Responses to Items of Varying Difficulty in Educational Testing and Attitude Surveys.

    Directory of Open Access Journals (Sweden)

    James A Wiley

    Full Text Available We put forward a new item response model which is an extension of the binomial error model first introduced by Keats and Lord. Like the binomial error model, the basic latent variable can be interpreted as a probability of responding in a certain way to an arbitrarily specified item. For a set of dichotomous items, this model gives predictions that are similar to other single parameter IRT models (such as the Rasch model) but has certain advantages in more complex cases. The first is that in specifying a flexible two-parameter Beta distribution for the latent variable, it is easy to formulate models for randomized experiments in which there is no reason to believe that either the latent variable or its distribution vary over randomly composed experimental groups. Second, the elementary response function is such that extensions to more complex cases (e.g., polychotomous responses, unfolding scales) are straightforward. Third, the probability metric of the latent trait allows tractable extensions to cover a wide variety of stochastic response processes.

  5. A New Extension of the Binomial Error Model for Responses to Items of Varying Difficulty in Educational Testing and Attitude Surveys.

    Science.gov (United States)

    Wiley, James A; Martin, John Levi; Herschkorn, Stephen J; Bond, Jason

    2015-01-01

    We put forward a new item response model which is an extension of the binomial error model first introduced by Keats and Lord. Like the binomial error model, the basic latent variable can be interpreted as a probability of responding in a certain way to an arbitrarily specified item. For a set of dichotomous items, this model gives predictions that are similar to other single parameter IRT models (such as the Rasch model) but has certain advantages in more complex cases. The first is that in specifying a flexible two-parameter Beta distribution for the latent variable, it is easy to formulate models for randomized experiments in which there is no reason to believe that either the latent variable or its distribution vary over randomly composed experimental groups. Second, the elementary response function is such that extensions to more complex cases (e.g., polychotomous responses, unfolding scales) are straightforward. Third, the probability metric of the latent trait allows tractable extensions to cover a wide variety of stochastic response processes.
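
The model described above treats an examinee's number-correct score on n items as binomial with success probability drawn from a two-parameter Beta distribution, which yields a beta-binomial score distribution. A minimal sketch of that distribution (the test length and Beta parameters below are illustrative, not taken from the paper):

```python
from math import comb, lgamma, exp

def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(score = k of n) when the success probability is Beta(a, b)."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

# Example: a 20-item test with latent probability ~ Beta(4, 2) (mean 2/3).
pmf = [beta_binomial_pmf(k, 20, 4.0, 2.0) for k in range(21)]
print(round(sum(pmf), 6))                     # probabilities sum to 1
print(max(range(21), key=lambda k: pmf[k]))   # most likely number-correct score
```

With a = b = 1 the latent distribution is uniform and every score 0..n is equally likely, which is one quick sanity check on the implementation.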

  6. Gating Items: Definition, Significance, and Need for Further Study

    Science.gov (United States)

    Judd, Wallace

    2009-01-01

    Over the past twenty years in performance testing a specific item type with distinguishing characteristics has arisen time and time again. It's been invented independently by dozens of test development teams. And yet this item type is not recognized in the research literature. This article is an invitation to investigate the item type, evaluate…

  7. Primary Science Assessment Item Setters' Misconceptions Concerning Biological Science Concepts

    Science.gov (United States)

    Boo, Hong Kwen

    2007-01-01

    Assessment is an integral and vital part of teaching and learning, providing feedback on progress through the assessment period to both learners and teachers. However, if test items are flawed because of misconceptions held by the question setter, then such test items are invalid as assessment tools. Moreover, such flawed items are also likely to…

  8. Influence of Item Direction on Student Responses in Attitude Assessment.

    Science.gov (United States)

    Campbell, Noma Jo; Grissom, Stephen

    To investigate the effects of wording in attitude test items, a five-point Likert-type rating scale was administered to 173 undergraduate education majors. The test measured attitudes toward college and self, and contained 38 positively-worded items. Thirty-eight negatively-worded items were also written to parallel the positive statements.…

  9. Using a Linear Regression Method to Detect Outliers in IRT Common Item Equating

    Science.gov (United States)

    He, Yong; Cui, Zhongmin; Fang, Yu; Chen, Hanwei

    2013-01-01

    Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter…
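
The screening idea in this record can be sketched as an ordinary least-squares fit of new-form difficulty estimates on old-form estimates, flagging common items whose standardized residual is large. The difficulty values and the 2-standard-deviation cutoff below are invented for illustration; the authors' exact criterion may differ:

```python
from statistics import mean, stdev

def flag_outlier_items(b_old, b_new, z_cut=2.0):
    """Regress new-form IRT b-parameters on old-form b-parameters and
    return indices of common items whose residual exceeds z_cut
    standard deviations in absolute value."""
    mx, my = mean(b_old), mean(b_new)
    sxx = sum((x - mx) ** 2 for x in b_old)
    sxy = sum((x - mx) * (y - my) for x, y in zip(b_old, b_new))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(b_old, b_new)]
    s = stdev(resid)
    return [i for i, r in enumerate(resid) if abs(r) > z_cut * s]

# Illustrative difficulties: item 4 drifts badly between forms.
b_old = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
b_new = [-1.9, -1.4, -0.9, -0.4, 2.1, 0.6, 1.1, 1.6, 2.1, 2.6]
print(flag_outlier_items(b_old, b_new))  # index of the drifting item
```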

  11. Evaluating the Consistency of Test Items Relative to the Cognitive Model for Educational Cognitive Diagnosis

    Institute of Scientific and Technical Information of China (English)

    丁树良; 毛萌萌; 汪文义; 罗芬; CUI Ying

    2012-01-01

    Attributes and their hierarchy can represent a cognitive model, and building a correct cognitive model is one of the keys to successful cognitive diagnosis: if a diagnostic test does not represent the cognitive model completely and accurately, the validity of the test is in question, so it is important to check whether the test specification coincides with the cognitive model before the test is administered. This paper proposes an explicit index, the theoretical construct validity (TCV), that measures the extent to which the cognitive model is represented by the items of a diagnostic test. In terms of the TCV, the test reported by Tatsuoka and her colleagues (1988) is reanalyzed: the theoretical cognitive model yields 24 knowledge states, but the test specification can distinguish only 9 of them, so the TCV of that test is only 9/24. The paper also analyzes equivalence classes containing more than one knowledge state and the reasons they arise, and proposes a modification of the hierarchy consistency index (HCI), the person-fit index established by Cui and her colleagues to detect the fit of an examinee's observed response pattern (ORP) to an expected response pattern (ERP). The original HCI is not well defined when an examinee has mastered only one attribute, only one item in the test measures that attribute, and the examinee answers it correctly, because the number of comparisons, and hence the denominator, is zero; it also accounts for slipping only. The modified index combines slipping and guessing, so as to better detect the consistency between the data and the expert-specified cognitive model.
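
The TCV idea described in this record can be sketched as the proportion of admissible knowledge states that yield distinct expected response patterns on the test. The attribute hierarchy and Q-matrix below are invented for illustration, not taken from the paper:

```python
from itertools import combinations

def closed_states(n_attrs, prereqs):
    """All knowledge states (attribute sets) consistent with the hierarchy:
    mastering an attribute requires mastering all its prerequisites."""
    states = []
    for r in range(n_attrs + 1):
        for subset in combinations(range(n_attrs), r):
            s = set(subset)
            if all(set(prereqs.get(a, ())) <= s for a in s):
                states.append(frozenset(s))
    return states

def tcv(n_attrs, prereqs, q_matrix):
    """Proportion of admissible knowledge states that the test items
    can tell apart via their expected response patterns (an item is
    expected correct iff the state contains all of its attributes)."""
    states = closed_states(n_attrs, prereqs)
    patterns = {s: tuple(set(item) <= s for item in q_matrix) for s in states}
    return len(set(patterns.values())) / len(states)

# Hypothetical example: 3 attributes, A0 prerequisite of A1 and A2;
# the test measures only A0 and A1, so states differing in A2 collapse.
prereqs = {1: (0,), 2: (0,)}
q = [(0,), (0, 1)]
print(tcv(3, prereqs, q))  # 3 distinguishable patterns over 5 states
```

Adding an item that measures A2 raises the index to 1.0, mirroring the paper's point that a low TCV signals items missing from the test specification.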

  12. Bookmark locations and item response model selection in the presence of local item dependence.

    Science.gov (United States)

    Skaggs, Gary

    2007-01-01

    The bookmark standard setting procedure is a popular method for setting performance standards on state assessment programs. This study reanalyzed data from an application of the bookmark procedure to a passage-based test that used the Rasch model to create the item ordered booklet. Several problems were noted in this implementation of the bookmark procedure, including disagreement among the SMEs about the correct order of items in the bookmark booklet, performance level descriptions of the passing standard being based on passage difficulty as well as item difficulty, and the presence of local item dependence within reading passages. Bookmark item locations were recalculated for the IRT three-parameter model and the multidimensional bifactor model. The results showed that the order of item locations was very similar for all three models when items of high difficulty and low discrimination were excluded. However, the items whose positions were the most discrepant between models were not the items that the SMEs disagreed about the most in the original standard setting. The choice of latent trait model did not address problems of item order disagreement. Implications for the use of the bookmark method in the presence of local item dependence are discussed.
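
Under the Rasch model used to order the bookmark booklet, the cut score implied by a bookmark placement is the ability θ at which the bookmarked item is answered correctly with the chosen response probability (commonly RP67). A sketch of that mapping, with invented difficulty values:

```python
from math import log, exp

def rasch_prob(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def bookmark_cut(b, rp=0.67):
    """Ability at which an item of difficulty b is answered correctly
    with probability rp (the RP67 convention)."""
    return b + log(rp / (1.0 - rp))

# Item difficulties in booklet order (illustrative); bookmark on item 5.
difficulties = [-1.4, -0.9, -0.3, 0.2, 0.8, 1.3]
cut = bookmark_cut(difficulties[4])
print(round(cut, 3))
print(round(rasch_prob(cut, difficulties[4]), 2))  # recovers rp
```

The same mapping under a 3PL or bifactor model would shift these locations, which is the comparison the study makes across models.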

  13. A Brief Analysis of a Hot Test Item Type in the NMT: Ion Inference

    Institute of Scientific and Technical Information of China (English)

    寸智

    2012-01-01

    Ion inference is one of the main item types on ion reactions and one on which students lose points heavily. Because such items cover a wide range of knowledge, are strongly comprehensive and demand a large thinking capacity, they can effectively assess students' ability to apply knowledge comprehensively and to reason, and are therefore favored by NMT item writers. This paper briefly summarizes problem-solving strategies for this kind of item and analyzes some classic NMT examples.

  14. Intelligent charging of additional fees for test items requested electronically at the doctor workstation

    Institute of Scientific and Technical Information of China (English)

    范久波; 刘海菊; 刘晓东

    2011-01-01

    Objective: To explore the implementation and value of intelligently charging additional fees when test items are requested electronically at the doctor workstation. Methods: Single test items and test panels were divided into groups. Within each group, the blood-collection fee and one materials fee are first added to the order; the sample type of each item is then compared with the sample types already added, and if it differs, a further materials fee is added. When several groups are involved, one materials fee is first added for each group, and the same comparison is then applied within each group. Results: When a clinician selects test items, the blood-collection and materials fees required for sampling are added to the order automatically. When an item is removed, the clinician only needs to delete the item itself; the charging program determines which additional fees must be retained for the remaining items and deletes the unnecessary ones automatically. Special cases are charged according to the actual situation. Conclusion: Intelligent charging of additional fees for electronically requested test items shows that hospital information system construction should attend not only to large, comprehensive functions but also to small, fine details, so as to facilitate daily work.

  15. Item response theory

    Directory of Open Access Journals (Sweden)

    Eutalia Aparecida Candido de Araujo

    2009-12-01

    Full Text Available The concern with measuring psychological traits is old, and many studies and methods have been developed to achieve this goal. Among them, Item Response Theory (IRT) stands out; it was initially proposed to overcome limitations of Classical Test Theory, which is still widely used in the measurement of psychological traits. The central point of IRT is that it considers each item individually, rather than relying on total scores; conclusions therefore depend not only on the test or questionnaire as a whole but on each item that composes it. This article presents this theory, which revolutionized measurement theory.

  16. Item response theory - A first approach

    Science.gov (United States)

    Nunes, Sandra; Oliveira, Teresa; Oliveira, Amílcar

    2017-07-01

    The Item Response Theory (IRT) has become one of the most popular scoring frameworks for measurement data, frequently used in computerized adaptive testing, cognitively diagnostic assessment and test equating. According to Andrade et al. (2000), IRT can be defined as a set of mathematical models (Item Response Models - IRM) constructed to represent the probability of an individual giving the right answer to an item of a particular test. The number of Item Response Models available for measurement analysis has increased considerably in the last fifteen years due to increasing computer power and due to a demand for accuracy and more meaningful inferences grounded in complex data. The developments in modeling with Item Response Theory were related to developments in estimation theory, most remarkably Bayesian estimation with Markov chain Monte Carlo algorithms (Patz & Junker, 1999). The popularity of Item Response Theory has also prompted numerous overviews in books and journals, and many connections between IRT and other statistical estimation procedures, such as factor analysis and structural equation modeling, have been made repeatedly (van der Linden & Hambleton, 1997). As stated before, Item Response Theory covers a variety of measurement models, ranging from basic one-dimensional models for dichotomously and polytomously scored items and their multidimensional analogues to models that incorporate information about cognitive sub-processes which influence the overall item response process. The aim of this work is to introduce the main concepts associated with one-dimensional models of Item Response Theory, to specify the logistic models with one, two and three parameters, to discuss some properties of these models and to present the main estimation procedures.
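
The logistic models with one, two and three parameters mentioned above differ only in which of the discrimination (a), difficulty (b) and lower-asymptote guessing (c) parameters are left free. A minimal sketch of the three-parameter logistic (3PL) response function, with illustrative parameter values:

```python
from math import exp

def p_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3PL probability of a correct response; with c=0 it reduces to
    the 2PL, and with a=1, c=0 to the one-parameter (Rasch) model."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))

# An item with moderate discrimination, difficulty 0.5, guessing 0.2:
for theta in (-2.0, 0.5, 2.0):
    print(round(p_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

At theta = b the 3PL probability is c + (1 - c)/2, which is why the curve's inflection point sits above 0.5 whenever guessing is present.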

  17. Item Dimension Identification of Psychological Tests Based on Statistical Variable Selection Methods

    Institute of Scientific and Technical Information of China (English)

    孙佳楠; 杨武岳; 陈秋

    2016-01-01

    Multidimensional psychological tests are widely used to evaluate examinees' latent traits in many kinds of assessment. Although the possible latent traits, or so-called dimensions, of a test may be known to some extent, the dimensions probed by each item still need to be identified before the test can be applied. Based on multidimensional item response theory and on shrinkage estimation methods for statistical variable selection, this study explored how to identify the item-dimension correspondence in some typical psychological tests statistically, exploiting the relationship between multidimensional item response models and generalized linear models. Simulation studies investigating the performance of the proposed method showed that the LASSO-based method identified the dimensions of test items more accurately than the elastic-net-based method across different types of multidimensional tests.
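
The LASSO used in this record selects, for each item, which dimensions carry a nonzero loading by shrinking small coefficients exactly to zero. A toy coordinate-descent sketch on simulated data (not the authors' implementation, and a linear stand-in for the IRT likelihood; all names and values are illustrative):

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the core of the LASSO update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    msq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            beta[j] = soft_threshold(rho, lam) / msq[j]
    return beta

# Simulate: the trait depends on dimension 0 only; dimension 1 is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [x0 * 1.5 + rng.gauss(0, 0.1) for x0, _ in X]
beta = lasso_cd(X, y, lam=0.2)
print([round(b, 2) for b in beta])  # dimension 1 shrunk exactly to zero
```

The exact-zero coefficients are what make the method usable for dimension identification: a dimension is declared relevant to an item only if its coefficient survives the shrinkage.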

  18. Study of the Match between Passage and Item Numbers on the Reading Comprehension Subsection of the Chinese Proficiency Test

    Institute of Scientific and Technical Information of China (English)

    柴省三

    2014-01-01

    The selection of passages, the design of passage-based multiple-choice items, and the match between the numbers of passages and items are among the most important factors affecting the reliability and validity of a reading comprehension test, and different combinations of passage and item numbers affect measurement error and reliability differently. This study applied generalizability theory, using a random two-facet nested design s×(i:p), to investigate the sources and structure of error in the reading comprehension section of the Chinese Proficiency Test (HSK) and to examine whether its combination of passage and item numbers is reasonable. The study sampled 500 test takers from the 7,238 participants in the HSK administration of April 2011 in mainland China, isolating the variance components due to persons, items and passages and their effects on score dependability. The results indicated that the largest variance component was items within passages, and that although increasing either the number of passages or the number of items raises the precision of the test, increasing the number of passages contributes more to the generalizability coefficient (Eρ2) than increasing the number of items. Taken together, the findings show that the current combination of passage and item numbers in the HSK satisfies the principles of error control and the requirements for reliability.
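
For the s×(i:p) design in this record, the D-study generalizability coefficient takes the standard nested form, with the person-by-passage and person-by-item-within-passage components divided by the numbers of passages and items. A sketch with invented variance components, reproducing the qualitative finding that adding passages helps more than adding items:

```python
def g_coefficient(var_s, var_sp, var_si_p, n_passages, n_items):
    """Generalizability coefficient for the s x (i:p) design
    (subjects crossed with items nested in passages), D-study form."""
    rel_err = var_sp / n_passages + var_si_p / (n_passages * n_items)
    return var_s / (var_s + rel_err)

# Invented variance components for illustration only.
var_s, var_sp, var_si_p = 0.30, 0.06, 0.40
base = g_coefficient(var_s, var_sp, var_si_p, n_passages=5, n_items=4)
more_items = g_coefficient(var_s, var_sp, var_si_p, n_passages=5, n_items=8)
more_passages = g_coefficient(var_s, var_sp, var_si_p, n_passages=10, n_items=4)
print(round(base, 3), round(more_items, 3), round(more_passages, 3))
```

Doubling the passages shrinks both error terms, while doubling the items per passage shrinks only the nested term, which is why the passage count dominates Eρ2.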

  19. Investigation and analysis of the current status of testing items in hospital clinical laboratories in China

    Institute of Scientific and Technical Information of China (English)

    钟堃; 王薇; 何法霖; 王治国

    2015-01-01

    Objective: To investigate the current status of hospital clinical laboratories in China, including the test items offered, the testing volume of each item, the charges for each item, and the turnaround time (TAT) of each item, and to identify the items accounting for the largest share of testing so as to better target quality control, financial investment and policy making. Methods: In each province, autonomous region and municipality (except Tibet and Taiwan), 3 grade-A tertiary hospitals, 3 tertiary hospitals and 3 secondary hospitals were randomly selected, 270 hospitals in total. Questionnaires were sent to the directors of the laboratory departments and returned through the network system of the National Center for Clinical Laboratories of the Ministry of Health, and the results were analyzed with Microsoft Excel 2007. The information collected included basic hospital information, the test items and panels offered, annual testing volume, charges and TAT. Results: All 270 hospitals returned valid results, a response rate of 100%. There were 628 single test items in all. Clinical immunology and serology had the largest number of item types, 230, or 36.62% (230/628), while clinical biochemistry had the highest share of total testing volume, 59.97%. The 100 single items with the largest testing volumes accounted for more than 90% of the total testing volume, and the 100 items with the highest charges accounted for more than 85% of the total charges. Conclusion: The information obtained in this survey on test items, testing volumes, charges and TAT in Chinese hospital laboratories provides a useful reference for quality control, financial investment and policy making in laboratory medicine. (Chinese Journal of Laboratory Medicine, 2015, 38: 637-641)

  20. The Development of Practical Item Analysis Program for Indonesian Teachers

    Directory of Open Access Journals (Sweden)

    Ali Muhson

    2017-04-01

    Full Text Available Item analysis has an essential role in learning assessment: an item analysis program is designed to measure student achievement and instructional effectiveness. This study aimed to develop an item analysis program and verify its feasibility. It used a Research and Development (R&D) model whose procedure includes designing and developing the product, validating it, and testing it; the data were collected through documentation, questionnaires, and interviews. The study successfully developed an item analysis program, named AnBuso, based on classical test theory (CTT), which proved practical and applicable for Indonesian teachers analyzing test items.
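
The classical-test-theory statistics such a program reports can be sketched in a few lines: item difficulty as the proportion correct, and discrimination as the corrected item-rest point-biserial correlation. The score matrix below is invented, and this is a generic CTT sketch rather than AnBuso's actual code:

```python
from statistics import mean, stdev

def item_analysis(responses):
    """responses: one 0/1 score vector per examinee. Returns, per item,
    (difficulty, discrimination): proportion correct and corrected
    item-rest point-biserial correlation."""
    n = len(responses)
    results = []
    for j in range(len(responses[0])):
        item = [r[j] for r in responses]
        rest = [sum(r) - r[j] for r in responses]  # total minus the item itself
        p = sum(item) / n
        si, sr = stdev(item), stdev(rest)
        if si == 0 or sr == 0:
            disc = 0.0  # undefined when either score is constant
        else:
            mi, mr = mean(item), mean(rest)
            cov = sum((a - mi) * (b - mr) for a, b in zip(item, rest)) / (n - 1)
            disc = cov / (si * sr)
        results.append((round(p, 2), round(disc, 2)))
    return results

scores = [  # 6 examinees x 4 items (invented data)
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(item_analysis(scores))
```

Using the rest score (total minus the item) avoids the inflation that comes from correlating an item with a total that already contains it.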

  1. Assessing the Psychometric Properties of Alternative Items for Certification.

    Science.gov (United States)

    Krogh, Mary Anne; Muckle, Timothy

    Alternative items were added as scored items to the National Certification Examination for Nurse Anesthetists (NCE) in 2010. A common concern about the new items has been their measurement attributes, so this study evaluated the psychometric impact of adding them to the examination. Candidates had a significantly higher ability estimate on alternative items than on multiple-choice questions (MCQs), and 6.7 percent of candidates performed significantly differently on alternative item formats; ability estimates from the two formats correlated at r = .58. The alternative items took significantly longer to answer than standard MCQs and discriminated to a higher degree. They exhibited unidimensionality to the same degree as MCQs, and the BIC confirmed the Rasch model as acceptable for scoring. The new item types were found to have acceptable attributes for inclusion in the certification program.

  2. Are Inferential Reading Items More Susceptible to Cultural Bias than Literal Reading Items?

    Science.gov (United States)

    Banks, Kathleen

    2012-01-01

    The purpose of this article is to illustrate a seven-step process for determining whether inferential reading items were more susceptible to cultural bias than literal reading items. The seven-step process was demonstrated using multiple-choice data from the reading portion of a reading/language arts test for fifth and seventh grade Hispanic,…

  3. Curriculum, Translation, and Differential Functioning of Measurement and Geometry Items

    Science.gov (United States)

    Emenogu, Barnabas C.; Childs, Ruth A.

    2005-01-01

    A test item exhibits differential item functioning (DIF) if students with the same ability find it differentially difficult. When the item is administered in French and English, differences in language difficulty and meaning are the most likely explanations. However, curriculum differences may also contribute to DIF. The responses of Ontario…

  4. A method for designing IRT-based item banks

    NARCIS (Netherlands)

    Boekkooi-Timminga, Ellen

    1990-01-01

    Since 1985 several procedures for computerized test construction using linear programing techniques have been described in the literature. To apply these procedures successfully, suitable item banks are needed. The problem of designing item banks based on item response theory (IRT) is addressed.

  5. SHIPPING OF RADIOACTIVE ITEMS

    CERN Multimedia

    TIS/RP Group

    2001-01-01

    The TIS-RP group informs users that shipping of small radioactive items is normally guaranteed within 24 hours from the time the material is handed in at the TIS-RP service. This time is imposed by the necessary procedures (identification of the radionuclides, determination of dose rate, preparation of the package and related paperwork). Large and massive objects require a longer procedure and will therefore take longer.

  6. Multidimensional CAT Item Selection Methods for Domain Scores and Composite Scores with Item Exposure Control and Content Constraints

    Science.gov (United States)

    Yao, Lihua

    2014-01-01

    The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had a higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle;…

  8. Do Images Influence Assessment in Anatomy? Exploring the Effect of Images on Item Difficulty and Item Discrimination

    Science.gov (United States)

    Vorstenbosch, Marc A. T. M.; Klaassen, Tim P. F. M.; Kooloos, Jan G. M.; Bolhuis, Sanneke M.; Laan, Roland F. J. M.

    2013-01-01

    Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in…

  9. A Comparison of Anchor-Item Designs for the Concurrent Calibration of Large Banks of Likert-Type Items

    Science.gov (United States)

    Garcia-Perez, Miguel A.; Alcala-Quintana, Rocio; Garcia-Cueto, Eduardo

    2010-01-01

    Current interest in measuring quality of life is generating interest in the construction of computerized adaptive tests (CATs) with Likert-type items. Calibration of an item bank for use in CAT requires collecting responses to a large number of candidate items. However, the number is usually too large to administer to each subject in the…

  10. [Perceptions on item disclosure for the Korean medical licensing examination].

    Science.gov (United States)

    Yang, Eunbae B

    2015-09-01

    This study analyzed the perceptions of medical students and faculty regarding disclosure of test items on the Korean medical licensing examination. I conducted a survey of medical students from medical colleges and professional medical schools nationwide; responses were analyzed from 718 students as well as from 69 faculty members who had participated in creating the licensing examination item sets. Data were analyzed using descriptive statistics and the chi-square test. Respondents considered it important to maintain test quality and to keep the test items unavailable to the public. Students were concerned that disclosure would prompt an increase in item difficulty (48.3%), and few found it desirable to disclose test items unconditionally (28.5%). The professors, who had experience in designing the test items, also opposed item disclosure (60.9%). It is therefore desirable not to disclose the items of the Korean medical licensing examination to the public, on the condition that students are provided with sufficient information about the examination, so that the exam can appropriately identify candidates with the required qualifications.

  12. Identification of candidate children for maturity-onset diabetes of the young type 2 (MODY2) gene testing: a seven-item clinical flowchart (7-iF).

    Science.gov (United States)

    Pinelli, Michele; Acquaviva, Fabio; Barbetti, Fabrizio; Caredda, Elisabetta; Cocozza, Sergio; Delvecchio, Maurizio; Mozzillo, Enza; Pirozzi, Daniele; Prisco, Francesco; Rabbone, Ivana; Sacchetti, Lucia; Tinto, Nadia; Toni, Sonia; Zucchini, Stefano; Iafusco, Dario

    2013-01-01

MODY2 is the most prevalent monogenic form of diabetes in Italy, with an estimated prevalence of about 0.5-1.5%. MODY2 is potentially indistinguishable from other forms of diabetes; however, its identification impacts patients' quality of life and healthcare resources. Unfortunately, direct DNA sequencing as a diagnostic test is expensive and not readily accessible. In addition, current guidelines, which aim to establish when the test should be performed, have shown a poor detection rate. The aim of this study is to propose a reliable, easy-to-use tool to identify candidate patients for MODY2 genetic testing. We designed and validated a diagnostic flowchart in an attempt to improve the detection rate and to increase the number of properly requested tests. The flowchart, called 7-iF, consists of 7 binary "yes or no" questions, and its unequivocal output is an indication of whether or not to test. We tested the 7-iF to estimate its clinical utility in comparison with clinical suspicion alone. In a prospective 2-year study (921 diabetic children), the 7-iF showed a precision of about 76%. Using retrospective data, the 7-iF showed a precision in identifying MODY2 patients of about 80%, compared with 40% for clinical suspicion alone. On the other hand, despite a relatively high number of missed MODY2 patients, the 7-iF would not suggest the test for 90% of the non-MODY2 patients, demonstrating that wide application of this method might 1) help less experienced clinicians suspect MODY2 and 2) reduce the number of unnecessary tests. With the 7-iF, a clinician can feel confident of identifying a potential case of MODY2 and suggest the molecular test without fear of wasting time and money. A QALY-type analysis estimated an increase in patients' quality of life and savings for the healthcare system of about 9 million euros per year.

  13. Identification of candidate children for maturity-onset diabetes of the young type 2 (MODY2) gene testing: a seven-item clinical flowchart (7-iF).

    Directory of Open Access Journals (Sweden)

    Michele Pinelli

Full Text Available MODY2 is the most prevalent monogenic form of diabetes in Italy, with an estimated prevalence of about 0.5-1.5%. MODY2 is potentially indistinguishable from other forms of diabetes; however, its identification impacts patients' quality of life and healthcare resources. Unfortunately, direct DNA sequencing as a diagnostic test is expensive and not readily accessible. In addition, current guidelines, which aim to establish when the test should be performed, have shown a poor detection rate. The aim of this study is to propose a reliable, easy-to-use tool to identify candidate patients for MODY2 genetic testing. We designed and validated a diagnostic flowchart in an attempt to improve the detection rate and to increase the number of properly requested tests. The flowchart, called 7-iF, consists of 7 binary "yes or no" questions, and its unequivocal output is an indication of whether or not to test. We tested the 7-iF to estimate its clinical utility in comparison with clinical suspicion alone. In a prospective 2-year study (921 diabetic children), the 7-iF showed a precision of about 76%. Using retrospective data, the 7-iF showed a precision in identifying MODY2 patients of about 80%, compared with 40% for clinical suspicion alone. On the other hand, despite a relatively high number of missed MODY2 patients, the 7-iF would not suggest the test for 90% of the non-MODY2 patients, demonstrating that wide application of this method might 1) help less experienced clinicians suspect MODY2 and 2) reduce the number of unnecessary tests. With the 7-iF, a clinician can feel confident of identifying a potential case of MODY2 and suggest the molecular test without fear of wasting time and money. A QALY-type analysis estimated an increase in patients' quality of life and savings for the healthcare system of about 9 million euros per year.

  14. 主观性试题网上评阅趋中评分控制研究初探%Research on Controlling Central Rating in Net-based Scoring of Subjective Test Items

    Institute of Scientific and Technical Information of China (English)

    彭恒利; 俞韫烨

    2013-01-01

  Researchers find that in the scoring of subjective test items such as compositions, some raters tend to assign central scores, avoiding the high and low ends of the rating scale; this is called "central rating." Central rating is a systematic scoring error that degrades test quality. This paper analyzes central rating in net-based scoring of subjective test items. It distinguishes three types of central rating and analyzes their causes. Central rating can be detected by two methods, the anchor-paper method and the statistical-index method, and throughout test development and scoring, multiple measures, both technical and non-technical, should be applied to control it.

  15. Exploring Differential Effects across Two Decoding Treatments on Item-Level Transfer in Children with Significant Word Reading Difficulties: A New Approach for Testing Intervention Elements

    Science.gov (United States)

    Steacy, Laura M.; Elleman, Amy M.; Lovett, Maureen W.; Compton, Donald L.

    2016-01-01

    In English, gains in decoding skill do not map directly onto increases in word reading. However, beyond the Self-Teaching Hypothesis, little is known about the transfer of decoding skills to word reading. In this study, we offer a new approach to testing specific decoding elements on transfer to word reading. To illustrate, we modeled word-reading…

  17. An Item Response Theory–Based, Computerized Adaptive Testing Version of the MacArthur–Bates Communicative Development Inventory: Words & Sentences (CDI:WS)

    DEFF Research Database (Denmark)

    Makransky, Guido; Dale, Philip S.; Havmose, Philip

    2016-01-01

    Purpose: To investigate the feasibility and potential validity of an IRT-based computerized adaptive testing (CAT) version of the MacArthur-Bates Communicative Development Inventory: Words & Sentences (CDI:WS) vocabulary checklist, with the objective of reducing length while maintaining measureme...

  18. Item Response Theory Using Hierarchical Generalized Linear Models

    Directory of Open Access Journals (Sweden)

    Hamdollah Ravand

    2015-03-01

Full Text Available Multilevel models (MLMs) are flexible in that they can be employed to obtain item and person parameters, test for differential item functioning (DIF), and capture both local item and local person dependence. Papers on the MLM analysis of item response data have focused mostly on theoretical issues, with applications serving as add-ons to simulation studies with a methodological focus. Although the methodological direction was necessary as a first step to show how MLMs can be utilized and extended to model item response data, the emphasis needs to shift towards providing evidence on how applications of MLMs in educational testing can deliver the benefits that have been promised. The present study uses foreign language reading comprehension data to illustrate the application of hierarchical generalized linear models to estimate person and item parameters, differential item functioning (DIF), and local person dependence in a three-level model.

  19. Identifying Unbiased Items for Screening Preschoolers for Disruptive Behavior Problems.

    Science.gov (United States)

    Studts, Christina R; Polaha, Jodi; van Zyl, Michiel A

    2016-10-25

OBJECTIVE: Efficient identification and referral to behavioral services are crucial in addressing early-onset disruptive behavior problems. Existing screening instruments for preschoolers are not ideal for pediatric primary care settings serving diverse populations. Eighteen candidate items for a new brief screening instrument were examined to identify those exhibiting measurement bias (i.e., differential item functioning, DIF) by child characteristics. METHOD: Parents/guardians of preschool-aged children (N = 900) from four primary care settings completed two full-length behavioral rating scales. Items measuring disruptive behavior problems were tested for DIF by child race, sex, and socioeconomic status using two approaches: item response theory-based likelihood ratio tests and ordinal logistic regression. RESULTS: Of 18 items, eight were identified with statistically significant DIF by at least one method. CONCLUSIONS: The bias observed in 8 of 18 items made them undesirable for screening diverse populations of children. These items were excluded from the new brief screening tool.
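The logistic-regression approach to DIF used in the record above can be sketched as a likelihood-ratio test: fit the item response on the matching variable (total score) alone, then again with a group term, and compare fits. This is a minimal illustration with a binary (rather than ordinal) item and hypothetical function names, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def _nll(beta, X, y):
    """Negative log-likelihood of a logistic regression."""
    z = X @ beta
    return float(np.sum(np.logaddexp(0.0, z) - y * z))

def _best_nll(X, y):
    """Minimized negative log-likelihood (BFGS from a zero start)."""
    res = minimize(_nll, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return res.fun

def lr_uniform_dif(total_score, group, item_correct):
    """Likelihood-ratio chi-square (1 df) for uniform DIF: does group
    membership predict the item response beyond total score?"""
    n = len(item_correct)
    X0 = np.column_stack([np.ones(n), total_score])          # matching only
    X1 = np.column_stack([np.ones(n), total_score, group])   # + group term
    return 2.0 * (_best_nll(X0, item_correct) - _best_nll(X1, item_correct))
```

The statistic is compared against the chi-square critical value (3.84 at the 5% level for 1 df); a significant value flags uniform DIF for the studied item.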

  20. The Determination of Hierarchies among TOEFL Vocabulary and Reading Comprehension Items.

    Science.gov (United States)

    Perkins, Kyle; And Others

    A study was undertaken to identify the prerequisite relations (or hierarchies among the items) existing in the item responses of a sample of 86 foreign students who took the Test of English as a Foreign Language (TOEFL) vocabulary and reading comprehension test, Form 3JTF1. The form contains 30 vocabulary items and 30 reading comprehension items.…

  1. Designing a Virtual Item Bank Based on the Techniques of Image Processing

    Science.gov (United States)

    Liao, Wen-Wei; Ho, Rong-Guey

    2011-01-01

A major weakness of figural items in Intelligence Quotient (IQ) tests is the inaccuracy that results from their high item exposure rates. In this study, a new approach is proposed and a useful test tool known as the Virtual Item Bank (VIB) is introduced. The VIB combines Automatic Item Generation theory and image processing theory with the concepts of…

  2. Faculty development on item writing substantially improves item quality.

    NARCIS (Netherlands)

    Naeem, N.; Vleuten, C.P.M. van der; Alfaris, E.A.

    2012-01-01

The quality of items written for in-house examinations in medical schools remains a cause of concern. Several faculty development programs aim to improve faculty members' item-writing skills. The purpose of this study was to evaluate the effectiveness of a faculty development program in item development.

  3. Evaluation of the Magnitude of Differential Item Functioning in Polytomous Items. Program Statistics Research Technical Report No. 94-2.

    Science.gov (United States)

    Zwick, Rebecca; Thayer, Dorothy T.

    Several recent studies have investigated the application of statistical inference procedures to the analysis of differential item functioning (DIF) in test items that are scored on an ordinal scale. Mantel's extension of the Mantel-Haenszel test is a possible hypothesis-testing method for this purpose. The development of descriptive statistics for…

  4. Feasibility of using training cases from International Spinal Cord Injury Core Data Set for testing of International Standards for Neurological Classification of Spinal Cord Injury items

    DEFF Research Database (Denmark)

    Liu, N; Hu, Z W; Zhou, M W;

    2014-01-01

STUDY DESIGN: Descriptive comparison analysis. OBJECTIVE: To evaluate whether five training cases of the International Spinal Cord Injury Core Data Set (ISCICDS) are appropriate for testing the facts within the International Standards for Neurological Classification of Spinal Cord Injury (ISNCSCI) and could thus be used for testing its training effectiveness. METHODS: The authors reviewed the five training cases from the ISCICDS and determined the sensory level (SL), motor level (ML) and American Spinal Injury Association Impairment Scale (AIS) for the training cases. The key points from the training cases were compared with our interpretation of the key aspects of the ISNCSCI. RESULTS: For determining SL, three principles of ML, sacral sparing, complete injury, classification of AIS A, B, C and D, determining motor incomplete status through sparing of motor function more than three levels below…

  5. Assessing normative cut points through differential item functioning analysis: An example from the adaptation of the Middlesex Elderly Assessment of Mental State (MEAMS) for use as a cognitive screening test in Turkey

    Directory of Open Access Journals (Sweden)

    Kutlay Sehim

    2006-03-01

Full Text Available Abstract Background The Middlesex Elderly Assessment of Mental State (MEAMS) was developed as a screening test to detect cognitive impairment in the elderly. It includes 12 subtests, each having a 'pass score'. A series of tasks were undertaken to adapt the measure for use in the adult population in Turkey and to determine the validity of existing cut points for passing subtests, given the wide range of educational level in the Turkish population. This study focuses on identifying and validating the scoring system of the MEAMS for the Turkish adult population. Methods After the translation procedure, 350 normal subjects and 158 acquired brain injury patients were assessed with the Turkish version of the MEAMS. Initially, appropriate pass scores for the normal population were determined through ANOVA post-hoc tests according to age, gender and education. Rasch analysis was then used to test the internal construct validity of the scale and the validity of the cut points for pass scores on the pooled data, using Differential Item Functioning (DIF) analysis within the framework of the Rasch model. Results Data with the initially modified pass scores were analyzed. DIF was found for certain subtests by age and education, but not for gender. Following this, pass scores were further adjusted and the data re-fitted to the model. All subtests were found to fit the Rasch model (mean item fit 0.184, SD 0.319; person fit -0.224, SD 0.557) and DIF was then found to be absent. Thus the final pass scores for all subtests were determined. Conclusion The MEAMS offers a valid assessment of cognitive state for the adult Turkish population, and the revised cut points accommodate age and education. Further studies are required to ascertain the validity in different diagnostic groups.

  6. Objective Tests versus Subjective tests

    Institute of Scientific and Technical Information of China (English)

    魏福林

    2007-01-01

An objective test item has only one correct answer, while a subjective test item has a range of possible answers. Because of this feature, reliability is not difficult to achieve in the marking of objective items, whereas the marking of subjective items is less reliable. On the whole, a good test should contain both subjective and objective test items.

  7. An algorithm for a computerized adaptive test based on item response theory for estimating the usability of e-commerce sites

    Directory of Open Access Journals (Sweden)

    Fernando de Jesus Moreira Junior

    2013-09-01

Full Text Available This article proposes an algorithm for a computerized adaptive test based on item response theory, developed to estimate the degree of usability of e-commerce sites. Five algorithms based on the maximum-information criterion were developed and tested via simulation. The best-performing algorithm was applied to real data from 361 e-commerce sites. The results showed that the algorithm obtains a good estimate of the usability of e-commerce sites with the administration of 13 items.
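The core of a maximum-information adaptive test like the one described in this record can be sketched in a few functions: at each step, administer the unused item with the highest Fisher information at the current ability estimate, then re-estimate ability. This is a generic 2PL sketch with hypothetical names, not the authors' algorithm:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct/positive response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def eap_estimate(responses, a, b, grid=np.linspace(-4.0, 4.0, 81)):
    """EAP ability estimate under a standard-normal prior, on a grid."""
    prior = np.exp(-grid ** 2 / 2.0)
    like = np.ones_like(grid)
    for ai, bi, x in zip(a, b, responses):
        p = p_2pl(grid, ai, bi)
        like = like * (p if x else 1.0 - p)
    post = prior * like
    return float(np.sum(grid * post) / np.sum(post))

def run_cat(bank_a, bank_b, true_theta, n_items=13, seed=0):
    """Simulate a maximum-information CAT of fixed length n_items.
    bank_a, bank_b: numpy arrays of item discriminations/difficulties."""
    rng = np.random.default_rng(seed)
    administered, responses = [], []
    theta = 0.0
    for _ in range(n_items):
        info = item_information(theta, bank_a, bank_b)
        info[administered] = -np.inf          # never reuse an item
        j = int(np.argmax(info))
        x = int(rng.random() < p_2pl(true_theta, bank_a[j], bank_b[j]))
        administered.append(j)
        responses.append(x)
        theta = eap_estimate(responses, bank_a[administered], bank_b[administered])
    return administered, theta
```

The fixed test length of 13 items mirrors the figure reported in the abstract; a production stopping rule would more likely use a standard-error threshold.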

  8. Applied Research on the Construction of the Textbook-based Test Item Bank for the Course of English in Vocational Colleges

    Institute of Scientific and Technical Information of China (English)

    邹园艳; 李志萍

    2012-01-01

Scientific testing is an essential guarantee of teaching quality, so the establishment of a scientific and effective test system is necessary. Based on an analysis of the necessity of constructing a textbook-based test item bank for the course of English in vocational colleges, this essay probes into the bank's test-design strategies. It also describes the bank's practical application at Chongqing College of Electronic Engineering and points out problems that deserve attention in its construction.

  9. Physics Items and Student's Performance at Enem

    CERN Document Server

    Gonçalves, Wanderley P

    2013-01-01

The Brazilian National Assessment of Secondary Education (ENEM, Exame Nacional do Ensino Médio) changed in 2009: from a self-assessment of competences at the end of high school to an assessment that grants access to college and student financing. From a single general exam, there are now tests in four areas: Mathematics, Language, Natural Sciences and Social Sciences. A new Reference Matrix was built with components such as cognitive domains, competencies, skills and knowledge objects; the methodological framework also changed, now using Item Response Theory to provide scores and allow longitudinal comparison of results across years, providing conditions for monitoring high school quality in Brazil. We present a study of the items in the Natural Science Test of ENEM over the years 2009, 2010 and 2011. Qualitative variables are proposed to characterize the items, and data from students' responses to Physics items were analysed. The qualitative analysis reveals the characteristics of the…

  10. The 12-item World Health Organization Disability Assessment Schedule II (WHO-DAS II): a nonparametric item response analysis

    Directory of Open Access Journals (Sweden)

    Fernandez Ana

    2010-05-01

Full Text Available Abstract Background Previous studies have analyzed the psychometric properties of the World Health Organization Disability Assessment Schedule II (WHO-DAS II) using classical omnibus measures of scale quality. These analyses are sample dependent and do not model item responses as a function of the underlying trait level. The main objective of this study was to examine the effectiveness of the WHO-DAS II items and their options in discriminating between changes in the underlying disability level by means of item response analyses. We also explored differential item functioning (DIF) in men and women. Methods The participants were 3615 adult general practice patients from 17 regions of Spain, with a first diagnosed major depressive episode. The 12-item WHO-DAS II was administered by the general practitioners during the consultation. We used a non-parametric item response method (kernel smoothing), implemented with the TestGraf software, to examine the effectiveness of each item (item characteristic curves) and its options (option characteristic curves) in discriminating between changes in the underlying disability level. We examined composite DIF to determine whether women had a higher probability than men of endorsing each item. Results Item response analyses indicated that the twelve items forming the WHO-DAS II perform very well. All items were determined to provide good discrimination across varying standardized levels of the trait. The items also had option characteristic curves that showed good discrimination, given that each increasing option became more likely than the previous one as a function of increasing trait level. No gender-related DIF was found on any of the items. Conclusions All WHO-DAS II items were very good at assessing overall disability. Our results supported the appropriateness of the weights assigned to response option categories and showed an absence of gender differences in item functioning.
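The kernel-smoothing idea behind TestGraf can be illustrated in a few lines: a nonparametric item characteristic curve is a Gaussian-weighted (Nadaraya-Watson) average of item scores among examinees whose trait estimates lie near each evaluation point. This is a generic sketch with a hypothetical function name, not TestGraf's implementation:

```python
import numpy as np

def kernel_icc(trait, item_scores, grid, bandwidth=0.4):
    """Kernel-smoothed item characteristic curve: for each grid point,
    a Gaussian-weighted mean of item scores of examinees whose trait
    estimates lie nearby. Works for binary or ordinal item scores."""
    curve = np.empty(len(grid))
    for k, t0 in enumerate(grid):
        w = np.exp(-0.5 * ((trait - t0) / bandwidth) ** 2)
        curve[k] = np.sum(w * item_scores) / np.sum(w)
    return curve
```

For a well-functioning item, the resulting curve should rise monotonically with the trait; option characteristic curves are obtained the same way, smoothing the indicator of each response option separately.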

  11. Multivariate Associations Among Health-Related Fitness, Physical Activity, and TGMD-3 Test Items in Disadvantaged Children From Low-Income Families.

    Science.gov (United States)

    Burns, Ryan; Brusseau, Tim; Hannon, James

    2016-10-04

Motor skills are needed for physical development and may be linked to health-related fitness and physical activity levels. No studies have examined the relationships among these constructs in large samples of disadvantaged children from low-income families using the Test for Gross Motor Development-3rd Edition (TGMD-3). The purpose of this study was to examine the multivariate associations among health-related fitness, physical activity, and motor skills assessed using the TGMD-3. Participants included 1460 school-aged children (730 boys, 730 girls; M age = 8.4 years, SD = 1.8 years) recruited from kindergarten through sixth grade at three low-income schools. Health-related fitness was assessed using the FITNESSGRAM battery, physical activity was assessed using accelerometers and pedometers, and motor skills were assessed using the TGMD-3. Canonical correlations revealed statistically significant correlations between the ball skills and health-related fitness variates (Rc = 0.43, Rc² = 17%, p…) in children from low-income families.

  12. Item analysis of the Mayer-Salovey-Caruso Emotional Intelligence Test: strategic area

    OpenAIRE

    Ana Paula Porto Noronha; Ricardo Primi; Fernanda Andrade de Freitas; Marilda Aparecida Dantas

    2007-01-01

The present study aimed to analyze the items of the Strategic Area of the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) in order to investigate the internal consistency of the subtests, the item-total correlation and the scoring of the instrument's item responses. The study included 522 participants with a mean age of 23.78 years, ranging from 16 to 65 years (SD = 7.543). Of these, 281 (53.3%) were female and 238 (45.2%) male. The instrument was administered…

  13. Helping Poor Readers Demonstrate Their Science Competence: Item Characteristics Supporting Text-Picture Integration

    Science.gov (United States)

    Saß, Steffani; Schütte, Kerstin

    2016-01-01

    Solving test items might require abilities in test-takers other than the construct the test was designed to assess. Item and student characteristics such as item format or reading comprehension can impact the test result. This experiment is based on cognitive theories of text and picture comprehension. It examines whether integration aids, which…

  14. Adaptation of an Instrument for Measuring the Cognitive Complexity of Organic Chemistry Exam Items

    Science.gov (United States)

    Raker, Jeffrey R.; Trate, Jaclyn M.; Holme, Thomas A.; Murphy, Kristen

    2013-01-01

    Experts use their domain expertise and knowledge of examinees' ability levels as they write test items. The expert test writer can then estimate the difficulty of the test items subjectively. However, an objective method for assigning difficulty to a test item would capture the cognitive demands imposed on the examinee as well as be…

  15. Algorithm of computerized adaptive testing to estimate the usability of e-commerce sites

    Directory of Open Access Journals (Sweden)

    Fernando de Jesus Moreira Junior

    2012-01-01

Full Text Available This paper proposes an algorithm for computerized adaptive testing based on Item Response Theory, designed to estimate the degree of usability of e-commerce sites. Five algorithms based on the maximum-information criterion were developed and tested by simulation. The algorithm with the best performance was applied to real data from 361 e-commerce sites. The results showed that the algorithm could obtain good estimates of the degree of usability of e-commerce sites with the application of 13 items.

  16. The Prediction of TOEFL Reading Comprehension Item Difficulty for Expository Prose Passages for Three Item Types: Main Idea, Inference, and Supporting Idea Items.

    Science.gov (United States)

    Freedle, Roy; Kostin, Irene

    Prediction of the difficulty (equated delta) of a large sample (n=213) of reading comprehension items from the Test of English as a Foreign Language (TOEFL) was studied using main idea, inference, and supporting statement items. A related purpose was to examine whether text and text-related variables play a significant role in predicting item…

  17. Item Response Theory with Covariates (IRT-C): Assessing Item Recovery and Differential Item Functioning for the Three-Parameter Logistic Model

    Science.gov (United States)

    Tay, Louis; Huang, Qiming; Vermunt, Jeroen K.

    2016-01-01

    In large-scale testing, the use of multigroup approaches is limited for assessing differential item functioning (DIF) across multiple variables as DIF is examined for each variable separately. In contrast, the item response theory with covariate (IRT-C) procedure can be used to examine DIF across multiple variables (covariates) simultaneously. To…

  18. Research on Item Development for the College Admission Test of Biology Based on Core Competence

    Institute of Scientific and Technical Information of China (English)

    吴成军

    2016-01-01

The components of core competence in biology, namely perception of life, rational thinking, scientific inquiry and social responsibility, are interdisciplinary in nature but have a unique value within the subject. A delineation of the definition and value of these components paves the way for developing items for the college admission test of the subject based on core competence. The test should not only tap different levels of the four components but also make a special effort to tap rational thinking and scientific inquiry, ideally in authentic contexts. The presentation styles of test items should be adapted as well, so as to give full play to the test as a "baton" for biology instruction and help cultivate students' core competence, the ultimate goal of instruction in the school subject.

  19. Item analysis of the Mayer-Salovey-Caruso Emotional Intelligence Test: strategic area

    Directory of Open Access Journals (Sweden)

    Ana Paula Porto Noronha

    2007-08-01

Full Text Available The present study aimed to analyze the items of the Strategic Area of the Mayer-Salovey-Caruso Emotional Intelligence Test (MSCEIT) in order to investigate the internal consistency of the subtests, the item-total correlation and the scoring of the instrument's item responses. The study included 522 participants with a mean age of 23.78 years, ranging from 16 to 65 years (SD = 7.543). Of these, 281 (53.3%) were female and 238 (45.2%) male. The instrument was administered collectively, by different administrators trained for the task, at universities and companies in the interior of the state of São Paulo, with a mean duration of 45 minutes. The results indicated that, in general, the subtests presented acceptable levels of internal consistency according to the standards established by the Federal Council of Psychology, although some problems were found, and that the consensus scoring method to some extent hinders the creation of difficult items.

  20. Estimating the Importance of Differential Item Functioning.

    Science.gov (United States)

    Rudas, Tamas; Zwick, Rebecca

    1997-01-01

    The mixture index of fit (T. Rudas et al, 1994) is used to estimate the fraction of a population for which differential item functioning (DIF) occurs, and this approach is compared to the Mantel Haenszel test of DIF. The proposed noniterative procedure provides information about data portions contributing to DIF. (SLD)
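The Mantel-Haenszel test that the record above compares against is based on a common odds ratio pooled over score strata. As a minimal sketch (generic, with hypothetical function names, not the authors' procedure), the estimator can be computed directly from 2x2 tables per total-score stratum:

```python
import numpy as np

def mh_common_odds_ratio(total_score, group, item_correct):
    """Mantel-Haenszel common odds ratio for a studied item, stratifying
    examinees by total test score. group: 1 = reference, 0 = focal;
    item_correct: 1/0 on the studied item. A value near 1 suggests no
    uniform DIF; values far from 1 favor one group."""
    num = den = 0.0
    for s in np.unique(total_score):
        m = total_score == s
        A = np.sum(m & (group == 1) & (item_correct == 1))  # ref, correct
        B = np.sum(m & (group == 1) & (item_correct == 0))  # ref, incorrect
        C = np.sum(m & (group == 0) & (item_correct == 1))  # focal, correct
        D = np.sum(m & (group == 0) & (item_correct == 0))  # focal, incorrect
        n = A + B + C + D
        if n > 0:
            num += A * D / n
            den += B * C / n
    return num / den
```

In operational DIF screening the estimate is usually transformed to the ETS delta scale and paired with the MH chi-square statistic; both are omitted here for brevity.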

  2. 探讨相关护理因素对凝血4项检测结果的影响%The Effect of Relative Nursing Factor on the Test Results About Four Items of Blood Coagulation

    Institute of Scientific and Technical Information of China (English)

    毛黎

    2015-01-01

Objective: To explore the effect of related nursing factors on the test results of the four items of blood coagulation. Methods: Based on the clinical nursing data of 860 inpatient cases in our hospital from June 2013 to June 2015, the effects of the relevant nursing factors on the results of the four coagulation items were discussed. Results: Unqualified samples accounted for 2.21%. The main nursing factors were hemolysis of blood samples, small clots or partial coagulation in anticoagulated samples, inappropriate collection containers, and blood collection volumes that were too high or too low. Conclusion: Nursing staff should improve their nursing skills to ensure the accuracy of blood sampling and of the test results.

  3. Item analysis of in use multiple choice questions in pharmacology

    Science.gov (United States)

    Kaur, Mandeep; Singla, Shweta; Mahajan, Rajiv

    2016-01-01

    Background: Multiple choice questions (MCQs) are a common method of assessment of medical students. The quality of MCQs is determined by three parameters: difficulty index (DIF I), discrimination index (DI), and distracter efficiency (DE). Objectives: The objective of this study is to assess the quality of MCQs currently in use in pharmacology and discard the MCQs which are not found useful. Materials and Methods: A class test of the central nervous system unit was conducted in the Department of Pharmacology. This test comprised 50 MCQs/items and 150 distracters. A correct response to an item was awarded one mark with no negative marking for incorrect responses. Each item was analyzed for three parameters: DIF I, DI, and DE. Results: DIF I of 38 (76%) items was in the acceptable range (P = 30–70%), 11 (22%) items were too easy (P > 70%), and 1 (2%) item was too difficult (P < 30%). DI of 31 (62%) items was excellent (d > 0.35), of 12 (24%) items was good (d = 0.20–0.34), and of 7 (14%) items was poor (d < 0.20). A total of 50 items had 150 distracters. Among these, 27 (18%) were nonfunctional distracters (NFDs) and 123 (82%) were functional distracters. Items with one NFD were 11 and with two NFDs were 8. Based on these parameters, 6 items were discarded, 17 were revised, and 27 were kept for subsequent use. Conclusion: Item analysis is a valuable tool as it helps us to retain the valuable MCQs and discard the items which are not useful. It also helps in increasing our skills in test construction and identifies the specific areas of course content which need greater emphasis or clarity. PMID:27563581
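The three indices used in this study are straightforward to compute. A minimal Python sketch (the function names, toy data, and the 27% upper/lower-group and 5% nonfunctional-distracter cutoffs are common conventions assumed here, not taken from the paper):

```python
import numpy as np

def item_analysis(score_matrix, item, group_frac=0.27):
    """Difficulty index (DIF I, in %) and discrimination index (DI) for one item.

    score_matrix : 0/1 array, shape (n_examinees, n_items)
    item         : column index of the item to analyze
    group_frac   : fraction used for the upper/lower groups (27% is a
                   common convention, assumed here)
    """
    total = score_matrix.sum(axis=1)
    order = np.argsort(total, kind="stable")       # rank examinees by total score
    k = max(1, int(round(group_frac * len(total))))
    lower = score_matrix[order[:k], item]          # bottom-scoring group
    upper = score_matrix[order[-k:], item]         # top-scoring group
    dif_i = 100.0 * score_matrix[:, item].mean()   # % answering correctly
    di = upper.mean() - lower.mean()               # upper-minus-lower discrimination
    return dif_i, di

def nonfunctional_distracters(choices, key, options="ABCD", cutoff=0.05):
    """Distracters chosen by fewer than `cutoff` of examinees (a common
    definition of a nonfunctional distracter)."""
    choices = np.asarray(choices)
    return [o for o in options
            if o != key and (choices == o).mean() < cutoff]

# Toy data: item 0 perfectly separates high and low scorers.
scores = np.array([
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 0, 0],
    [0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
])
dif_i, di = item_analysis(scores, item=0)

choices = list("AAAAAAAAAA" "BBBBBB" "CCCC")   # nobody picked distracter D
nfds = nonfunctional_distracters(choices, key="A")
```

On the toy data, item 0 has a difficulty index of 50% and a discrimination index of 1.0, and distracter D is flagged as nonfunctional.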

  4. Bayesian item fit analysis for unidimensional item response theory models.

    Science.gov (United States)

    Sinharay, Sandip

    2006-11-01

    Assessing item fit for unidimensional item response theory models for dichotomous items has always been an issue of enormous interest, but there exists no unanimously agreed item fit diagnostic for these models, and hence there is room for further investigation of the area. This paper employs the posterior predictive model-checking method, a popular Bayesian model-checking tool, to examine item fit for the above-mentioned models. An item fit plot, comparing the observed and predicted proportion-correct scores of examinees with different raw scores, is suggested. This paper also suggests how to obtain posterior predictive p-values (which are natural Bayesian p-values) for the item fit statistics of Orlando and Thissen that summarize numerically the information in the above-mentioned item fit plots. A number of simulation studies and a real data application demonstrate the effectiveness of the suggested item fit diagnostics. The suggested techniques seem to have adequate power and reasonable Type I error rate, and psychometricians will find them promising.
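The posterior predictive logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not Sinharay's implementation: it uses a Rasch model, stand-in "posterior draws" made by jittering the generating parameters (a real analysis would use MCMC output), and a simple item-total correlation discrepancy in place of the Orlando-Thissen statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items, n_draws = 500, 10, 200

def simulate(theta, beta):
    """Draw a 0/1 response matrix from a Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    return (rng.random(p.shape) < p).astype(int)

# "Observed" data (a stand-in for a real dataset).
theta = rng.normal(0.0, 1.0, n_persons)
beta = np.linspace(-1.5, 1.5, n_items)
obs = simulate(theta, beta)

# Stand-in posterior draws: jittered generating values.
theta_draws = theta + rng.normal(0.0, 0.3, (n_draws, n_persons))
beta_draws = beta + rng.normal(0.0, 0.1, (n_draws, n_items))

def discrepancy(data):
    """Item-total (rest-score) correlation, one value per item."""
    total = data.sum(axis=1)
    return np.array([np.corrcoef(data[:, j], total - data[:, j])[0, 1]
                     for j in range(data.shape[1])])

# Posterior predictive p-value: fraction of replicated datasets whose
# discrepancy is at least as large as the observed one, per item.
t_obs = discrepancy(obs)
ppp = np.zeros(n_items)
for d in range(n_draws):
    rep = simulate(theta_draws[d], beta_draws[d])
    ppp += (discrepancy(rep) >= t_obs)
ppp /= n_draws
# PPP values near 0 or 1 flag items the model fits poorly.
```

Because the "observed" data were generated from the fitted model here, the p-values should mostly be moderate; misfitting items would push them toward the extremes.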

  5. Three controversies over item disclosure in medical licensure examinations

    Directory of Open Access Journals (Sweden)

    Yoon Soo Park

    2015-09-01

    In response to views on the public's right to know, there is growing attention to item disclosure – the release of items, answer keys, and performance data to the public – in medical licensure examinations, and to its potential impact on a test's ability to measure competence and select qualified candidates. Recent debates on this issue have sparked legislative action internationally, including in South Korea, with prior discussions among North American countries dating back over three decades. The purpose of this study is to identify and analyze three issues associated with item disclosure in medical licensure examinations – (1) fairness and validity, (2) impact on passing levels, and (3) utility of item disclosure – by synthesizing the existing literature in relation to standards in testing. Historically, the controversy over item disclosure has centered on fairness and validity. Proponents of item disclosure stress test takers' right to know, while opponents argue from a validity perspective. Item disclosure may bias item characteristics, such as difficulty and discrimination, and has consequences for setting passing levels. To date, there has been limited research on the utility of item disclosure for large-scale testing. These issues require ongoing and careful consideration.

  6. Secondary Item Procurement Lead Time Study.

    Science.gov (United States)

    1984-03-01

    SECONDARY ITEM PROCUREMENT LEAD TIME STUDY. LOGISTICS SYSTEMS... ASSISTANT SECRETARY OF THE AIR FORCE (RD&L); DIRECTOR, DEFENSE LOGISTICS AGENCY. SUBJECT: Secondary Item Procurement Lead Time Study. A recent report by the... determination of procurement lead time. A plan for the study is enclosed. In order to achieve the objectives of the procurement lead time study as well as the...

  7. Editorial Changes and Item Performance: Implications for Calibration and Pretesting

    Directory of Open Access Journals (Sweden)

    Heather Stoffel

    2014-11-01

    Previous research on the impact of text and formatting changes on test-item performance has produced mixed results. This matter is important because it is generally acknowledged that any change to an item requires that it be recalibrated. The present study investigated the effects of seven classes of stylistic changes on item difficulty, discrimination, and response time for a subset of 65 items that make up a standardized test for physician licensure completed by 31,918 examinees in 2012. One of two versions of each item (original or revised) was randomly assigned to examinees such that each examinee saw only two experimental items, with each item being administered to approximately 480 examinees. The stylistic changes had little or no effect on item difficulty or discrimination; however, one class of edits – changing an item from an open lead-in (incomplete statement) to a closed lead-in (direct question) – did result in slightly longer response times. Data for nonnative speakers of English were analyzed separately, with nearly identical results. These findings have implications for the conventional practice of re-pretesting (or recalibrating) items that have been subjected to minor editorial changes.

  8. A Comparison of Mantel-Haenszel Differential Item Functioning Parameters. LSAC Research Report Series.

    Science.gov (United States)

    Schnipke, Deborah L.; Roussos, Louis A.; Pashley, Peter J.

    Differential item functioning (DIF) analyses are conducted to investigate how items function in various subgroups. The Mantel-Haenszel (MH) DIF statistic is used at the Law School Admission Council and other testing companies. When item functioning can be well-described in terms of a one- or two-parameter logistic item response theory (IRT) model…
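The MH DIF statistic mentioned here is simple to compute: examinees are stratified by a matching criterion (usually total test score), a 2x2 group-by-correctness table is formed per stratum, and the tables are pooled into a common odds ratio, often reported on the ETS delta scale as MH D-DIF = -2.35 ln(alpha). A minimal Python sketch (the function name and toy data are illustrative):

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Mantel-Haenszel common odds ratio and ETS MH D-DIF for one item.

    item  : 0/1 correct responses
    total : matching criterion (e.g., total test score)
    group : 0 = reference group, 1 = focal group
    """
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0
    for k in np.unique(total):
        s = total == k
        a = np.sum(s & (group == 0) & (item == 1))  # reference, correct
        b = np.sum(s & (group == 0) & (item == 0))  # reference, incorrect
        c = np.sum(s & (group == 1) & (item == 1))  # focal, correct
        d = np.sum(s & (group == 1) & (item == 0))  # focal, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha = num / den                   # MH common odds ratio
    mh_d = -2.35 * np.log(alpha)        # ETS delta scale
    return alpha, mh_d

# No-DIF example: within each score stratum, both groups have the same
# odds of success, so alpha == 1 and MH D-DIF == 0.
item  = np.concatenate([np.ones(30), np.zeros(10), np.ones(15), np.zeros(5),
                        np.ones(20), np.zeros(20), np.ones(10), np.zeros(10)])
total = np.concatenate([np.full(60, 1), np.full(60, 2)])
group = np.concatenate([np.zeros(40), np.ones(20), np.zeros(40), np.ones(20)])
alpha, mh_d = mantel_haenszel_dif(item, total, group)
```

In practice, alpha values far from 1 (MH D-DIF far from 0) indicate that one group finds the item harder than comparably able members of the other group.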

  9. A Framework for Examining the Utility of Technology-Enhanced Items

    Science.gov (United States)

    Russell, Michael

    2016-01-01

    Interest in and use of technology-enhanced items has increased over the past decade. Given the additional time required to administer many technology-enhanced items and the increased expense required to develop them, it is important for testing programs to consider the utility of technology-enhanced items. The Technology-Enhanced Item Utility…

  10. Measuring response styles in Likert items.

    Science.gov (United States)

    Böckenholt, Ulf

    2017-03-01

    The recently proposed class of item response tree models provides a flexible framework for modeling multiple response processes. This feature is particularly attractive for understanding how response styles may affect answers to attitudinal questions. Facilitating the disassociation of response styles and attitudinal traits, item response tree models can provide powerful process tests of how different response formats may affect the measurement of substantive traits. In an empirical study, 3 response formats were used to measure the 2-dimensional Personal Need for Structure traits. Different item response tree models are proposed to capture the response styles for each of the response formats. These models show that the response formats give rise to similar trait measures but different response-style effects. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  11. CORRELATION OF AATCC TEST METHOD 150 TO AATCC TEST METHOD 61 FOR USE WITH LAUNDERING DURABILITY STUDIES OF RETROREFLECTIVE ITEMS AS DEFINED IN PURCHASE DESCRIPTION CO/PD 06 05A

    Science.gov (United States)

    2017-06-02

    TECHNICAL REPORT NATICK/TR-17/015. Dates covered: October 2014 – April 2016. Title: Correlation of AATCC Test Method 150 to AATCC Test Method 61 for... data to support the correlation between 5 laundering cycles of the American Association of Textile Chemists & Colorists Test Method 61...

  12. NBC Contamination Survivability, Large Item Exteriors

    Science.gov (United States)

    2007-11-02

    Perform these tasks (timed) in the standard garment. (3) Perform these tasks (timed) in mission-oriented protective posture level 4 (MOPP4). (4)... bring the chamber to the environmental conditions specified for the test. Condition the test item until it has equilibrated at 30 ± 5 °C. Temperature and... condition that all essential operations can be continued in the lowest protective posture consistent with the mission and threat, and without long-term...

  13. A Mixed Effects Randomized Item Response Model

    Science.gov (United States)

    Fox, J.-P.; Wyrick, Cheryl

    2008-01-01

    The randomized response technique ensures that individual item responses, denoted as true item responses, are randomized before observing them and so-called randomized item responses are observed. A relationship is specified between randomized item response data and true item response data. True item response data are modeled with a (non)linear…

  14. Item Veto: Dangerous Constitutional Tinkering.

    Science.gov (United States)

    Bellamy, Calvin

    1989-01-01

    In theory, the item veto would empower the President to remove wasteful and unnecessary projects from legislation. Yet, despite its history at the state level, the item veto is a loosely defined concept that may not work well at the federal level. Much more worrisome is the impact on the balance of power. (Author/CH)

  15. The basics of item response theory using R

    CERN Document Server

    Baker, Frank B

    2017-01-01

    This graduate-level textbook is a tutorial for item response theory that covers both the basics of item response theory and the use of R for preparing graphical presentation in writings about the theory. Item response theory has become one of the most powerful tools used in test construction, yet one of the barriers to learning and applying it is the considerable amount of sophisticated computational effort required to illustrate even the simplest concepts. This text provides the reader access to the basic concepts of item response theory freed of the tedious underlying calculations. It is intended for those who possess limited knowledge of educational measurement and psychometrics. Rather than presenting the full scope of item response theory, this textbook is concise and practical and presents basic concepts without becoming enmeshed in underlying mathematical and computational complexities. Clearly written text and succinct R code allow anyone familiar with statistical concepts to explore and apply item response theory.
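The textbook works in R, but the flavor of the computation it abstracts away is easy to show in a few lines. A sketch of the two-parameter logistic item characteristic curve in Python (the parameter values are arbitrary examples):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve:
    P(correct | theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is the discrimination and b the difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (np.asarray(theta) - b)))

# At theta == b the probability of a correct response is exactly 0.5;
# the discrimination a controls the slope of the curve at that point.
theta = np.linspace(-3, 3, 7)
probs = icc_2pl(theta, a=1.2, b=0.0)
```

Plotting `probs` against `theta` reproduces the familiar S-shaped curve that the book's R examples are built around.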

  16. The Body Appreciation Scale-2: item refinement and psychometric evaluation.

    Science.gov (United States)

    Tylka, Tracy L; Wood-Barcalow, Nichole L

    2015-01-01

    Considered a positive body image measure, the 13-item Body Appreciation Scale (BAS; Avalos, Tylka, & Wood-Barcalow, 2005) assesses individuals' acceptance of, favorable opinions toward, and respect for their bodies. While the BAS has accrued psychometric support, we improved it by rewording certain BAS items (to eliminate sex-specific versions and body dissatisfaction-based language) and developing additional items based on positive body image research. In three studies, we examined the reworded, newly developed, and retained items to determine their psychometric properties among college and online community (Amazon Mechanical Turk) samples of 820 women and 767 men. After exploratory factor analysis, we retained 10 items (five original BAS items). Confirmatory factor analysis upheld the BAS-2's unidimensionality and invariance across sex and sample type. Its internal consistency, test-retest reliability, and construct (convergent, incremental, and discriminant) validity were supported. The BAS-2 is a psychometrically sound positive body image measure applicable for research and clinical settings.

  17. A nonparametric approach to the analysis of dichotomous item responses

    NARCIS (Netherlands)

    Mokken, R.J.; Lewis, C.

    1982-01-01

    An item response theory is discussed which is based on purely ordinal assumptions about the probabilities that people respond positively to items. It is considered as a natural generalization of both Guttman scaling and classical test theory. A distinction is drawn between construction and evaluation.
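The link to Guttman scaling mentioned above can be made concrete with Loevinger's scalability coefficient H, the central statistic of Mokken's nonparametric approach. A minimal Python sketch (the function name and toy data are illustrative):

```python
import numpy as np
from itertools import combinations

def mokken_h(data):
    """Loevinger's scalability coefficient H for dichotomous items.

    For each item pair, a "Guttman error" is passing the harder item while
    failing the easier one; H compares the observed number of such errors
    with the number expected under marginal independence.
    """
    data = np.asarray(data)
    n, m = data.shape
    p = data.mean(axis=0)                 # proportion passing each item
    obs_err = exp_err = 0.0
    for i, j in combinations(range(m), 2):
        easy, hard = (i, j) if p[i] >= p[j] else (j, i)
        obs_err += np.sum((data[:, easy] == 0) & (data[:, hard] == 1))
        exp_err += n * (1 - p[easy]) * p[hard]
    return 1.0 - obs_err / exp_err

# A perfect Guttman scale contains no Guttman errors, so H == 1.
guttman = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
h = mokken_h(guttman)
```

Real data fall short of the perfect pattern; in Mokken's convention, H >= 0.3 is usually taken as the minimum for a usable scale.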

  18. Detecting Local Item Dependence in Polytomous Adaptive Data

    Science.gov (United States)

    Mislevy, Jessica L.; Rupp, Andre A.; Harring, Jeffrey R.

    2012-01-01

    A rapidly expanding arena for item response theory (IRT) is in attitudinal and health-outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although initial investigations of…

  19. Distributed Item Review: Administrator User Guide. Technical Report #1603

    Science.gov (United States)

    Irvin, P. Shawn

    2016-01-01

    The Distributed Item Review (DIR) is a secure and flexible, web-based system designed to present test items to expert reviewers across a broad geographic area for evaluation of important dimensions of quality (e.g., alignment with standards, bias, sensitivity, and student accessibility). The DIR comprises essential features that allow system…

  20. An item factor analysis and item response theory-based revision of the Everyday Discrimination Scale.

    Science.gov (United States)

    Stucky, Brian D; Gottfredson, Nisha C; Panter, A T; Daye, Charles E; Allen, Walter R; Wightman, Linda F

    2011-04-01

    The Everyday Discrimination Scale (EDS), a widely used measure of daily perceived discrimination, is purported to be unidimensional, to function well among African Americans, and to have adequate construct validity. Two separate studies and data sources were used to examine and cross-validate the psychometric properties of the EDS. In Study 1, an exploratory factor analysis was conducted on a sample of African American law students (N = 589), providing strong evidence of local dependence, or nuisance multidimensionality within the EDS. In Study 2, a separate nationally representative community sample (N = 3,527) was used to model the identified local dependence in an item factor analysis (i.e., bifactor model). Next, item response theory (IRT) calibrations were conducted to obtain item parameters. A five-item, revised-EDS was then tested for gender differential item functioning (in an IRT framework). Based on these analyses, a summed score to IRT-scaled score translation table is provided for the revised-EDS. Our results indicate that the revised-EDS is unidimensional, with minimal differential item functioning, and retains predictive validity consistent with the original scale.