Testing and Assessment Research Paper

The endeavor of measuring an intangible construct such as an individual’s intelligence or aptitude is both fascinating and essential. Psychologists, educators, the government, and corporate America use the principles and measures of testing and assessment to solve challenging problems on a daily basis. For example, a psychologist may administer a wide variety of test batteries including intelligence and personality tests as well as measures to screen for neurological impairment. Educators may use tests to assess achievement and facilitate student placement decisions. Corporate America and government agencies have also embraced testing and frequently administer tests to make vocational decisions related to hiring, firing, and general utilization of personnel.

The field of testing and assessment in the United States is relatively young and consequently most of the major developments occurred during the 20th century. However, the origins of testing and assessment are neither recent nor American. Evidence suggests that the ancient Chinese had a relatively sophisticated civil service testing program. For example, written exams were introduced by the Han Dynasty (202 BCE-220 CE) to measure an individual’s potential to succeed in vocations such as civil law, military affairs, and agriculture.

Testing and assessment methods became quite well developed by the Ming Dynasty (1368-1644 CE) during which officials used a national multistage testing program to make vocational decisions in both local and regional venues. Individuals who performed well on local tests progressed to provincial capitals for more extensive examinations. For example, one relevant vocational measure included good penmanship, which was essential for clear and precise communication and therefore was viewed as a relevant predictor of suitability for civil service employment.

In the early 1800s, reports from British missionaries may have encouraged the English East India Company to copy the Chinese civil service system as a method of selecting employees for overseas duty. Because the testing programs worked well for the company, the British government eventually adopted and refined a similar system of testing for its civil service program. The French and German governments followed suit. In the late 1800s, the U.S. government established the American Civil Service Commission, which developed and administered competitive examinations for specified government jobs. The testing and assessment movement in the Western world grew rapidly from this point.

Most historians trace the early development of psychological testing to the investigation of individual differences that flourished in Europe (most notably Great Britain and Germany) in the late 1800s. There is no doubt that early experimentalists like Charles Darwin, Wilhelm Wundt, Francis Galton, and James McKeen Cattell laid an indelible foundation for 20th-century testing and assessment. Researchers believe that Darwin contributed one of the most basic concepts underlying psychological and educational measurement—individual differences.

Darwin’s (1859) On the Origin of Species by Means of Natural Selection argued that chance variation in species would be perpetuated or encumbered based on its ability to adapt or survive in nature and consequently that humans had descended from the apes as a result of such chance genetic variation. Popular topics in contemporary psychology that reveal a noticeably strong Darwinian influence include theories of learning, developmental psychology, animal behavior, psychobiology, theories of emotions, behavioral genetics, abnormal psychology, and testing and assessment.

Darwin’s research stimulated interest in the study of individual differences and demonstrated that studying human and animal behavior was at least as important as was studying the mind. Darwin’s work appears to have influenced and motivated Francis Galton’s research on heredity. Through his efforts to explore and quantify individual differences between people, Galton became a prominent contributor to the field of testing and measurement. Galton aspired to classify people according to their deviation from the average and eventually would be credited with contributing to the development of many contemporary psychological measures such as questionnaires, rating scales, and self-report inventories.

Galton and his assistant Karl Pearson pioneered the use of the correlation coefficient. This development was important for the field of testing and measurement because it gave researchers a method for obtaining an index of the relation between two variables. Through Galton’s research efforts and his insistence that educational institutions maintain anthropometric records (e.g., height, breathing capacity, and discrimination of color) on their students, he encouraged widespread interest in the measurement of psychologically related topics.

Charles Spearman, who espoused a theory of intelligence in which he believed there was a single, global mental ability (general intelligence or “g”), also contributed significantly to the development of testing and measurement. Despite challenges to the concept, Spearman’s “g” still permeates psychological thinking and research. Spearman has also been credited with discovering that independent measures of an individual’s characteristics (e.g., mental ability) vary in a random fashion from one measurement trial to another. In statistical terms, the correlation between such independent measures for a group of persons is not perfect. Because of this latter research, Spearman is often referred to as the father of classical reliability theory.

Wilhelm Wundt employed the early principles and methods of testing and measurement at his experimental psychology lab in Germany. For example, he and his students worked to formulate a general description of human abilities with respect to variables such as reaction time, perception, and attention span. Wundt believed that reaction time could supplement introspection as a technique for studying the elements and activities of the mind. Further, Wundt attempted to standardize his research methods and control extraneous variables in an effort to minimize measurement error. It appears he was ahead of his time in two respects. First, he attempted to control extraneous variables for the purpose of minimizing error, which is now a routine component of contemporary quantitative measurement. Second, he standardized research conditions, which is also a contemporary quantitative method used to ensure that differences in scores are the result of true differences among individuals.

Alfred Binet’s pioneering work, which led to the development of the first widely used intelligence test in 1905, also made an indelible impact on educational and psychological measurement. The French Ministry of Education commissioned Binet and Théophile Simon to devise a practical means to distinguish normal children from those with mental deficiencies. Binet concentrated his efforts on finding a way to measure higher mental processes and eventually devised a simple chronological age scale to determine a child’s level of mental functioning.

The inception of the Journal of Educational Psychology in 1910 and its publication of Edward Lee Thorndike’s seminal article, “The Contribution of Psychology to Education,” were invaluable for the field of testing and measurement because they introduced and encouraged a common research forum to discuss educational measurement and testing issues. The journal’s strict adherence to an experimental pedagogy had a significant impact on the credibility of research conducted in the field of educational psychology for the next 100 years. Thorndike and the other researchers played a vital role in establishing the fundamental theories and methods that would act to perpetuate and solidify the emergence of a separate field of psychological testing and assessment.

Theory

Psychological testing and assessment have grown and matured primarily within the parameters of two fundamental theories—classical test theory and modern test theory. Both theories rely heavily on the essential concepts of reliability and validity to guide and substantiate credible testing and measurement practices.

Classical Test Theory

In the early part of the 20th century, researchers utilizing psychometrics (the science of psychological and educational measurement) focused primarily on the concepts of true score and measurement error, which were predominately based on Charles Spearman’s correlation and reliability studies. These concepts were further strengthened by Lee Cronbach’s discussion of construct validity in 1955. Research has indicated that classical test theory has matured due to several remarkable achievements over the past 150 years, including (a) a recognition of the presence of error in measurement, (b) a conception of error as a random variable, and (c) a conception of correlation and the means for it to be indexed.

In 1904 Charles Spearman demonstrated how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index needed to make the correction. According to Traub (1997), Spearman’s demonstration marked the beginning of classical test theory. Additionally, Frederic Kuder, Marion Richardson, Louis Guttman, Melvin Novick, and Frederic Lord contributed important ideas. For example, the Frederic Kuder and Marion Richardson internal consistency formulas (KR20 and KR21) were published in 1937, and in 1945, Louis Guttman published an article titled “A Basis for Analyzing Test Reliability,” in which the lower bounds of reliability were explicitly derived (Traub, 1997). The culmination of these efforts to formalize classical test theory was realized best by the research of Melvin Novick and Frederic Lord in the 1960s.
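
Two of the results named above can be illustrated numerically. The following is a minimal sketch, not a reproduction of the original derivations: it assumes the standard textbook forms of Spearman's correction for attenuation and of KR20, and the function names and the small response matrix are hypothetical.

```python
import math
from statistics import pvariance


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman's correction: estimate the correlation between true scores.

    r_xy is the observed correlation between two measures; r_xx and r_yy
    are the reliability coefficients of each measure.
    """
    return r_xy / math.sqrt(r_xx * r_yy)


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.

    responses is a list of rows, one row per examinee, one column per item.
    Population variance is used for the total scores so that it matches
    the p * q item variances.
    """
    n_items = len(responses[0])
    n_examinees = len(responses)
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees  # proportion correct
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / pvariance(totals))


# Hypothetical data: 5 examinees by 4 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(responses), 3))                            # 0.8
print(round(correct_for_attenuation(0.30, 0.36, 1.00), 3))  # 0.5
```

In the second call, an observed correlation of .30 obtained with a measure whose reliability is only .36 corresponds to an estimated true-score correlation of .50, which shows how strongly unreliable measurement can mask a relation.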

Measurement Error

Classical measurement theory, commonly referred to as traditional or true score theory, has provided a strong foundation for psychometric methods since its origin. In classical test theory, an assumption is made that each examinee has a true score on a test that would be obtained if it were not for random measurement error (Cohen & Swerdlik, 2002). A true score is the score that measurement without error would yield; in practice, the aim is to reduce the discrepancy between an observed score (which contains error) and the true score on a given measure.
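
In symbols, this assumption is commonly written as shown here. The notation is the conventional one and is an assumption, since the text does not introduce symbols: X is the observed score, T the true score, E the random error, and \mathbb{E}[\cdot] the expected value over hypothetical repeated administrations.

```latex
X = T + E, \qquad \mathbb{E}[E] = 0, \qquad T = \mathbb{E}[X]
```

Because the error term averages to zero, the true score can be read as the long-run average of an examinee's observed scores on the same test.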

The principles of psychometrics can be used as an effective tool to reduce error in mental measures. Psychometricians (testing and measurement professionals) should understand they cannot completely eliminate error in mental measures, and their goal should therefore entail finding methods to reduce the known sources of error in as many testing contexts as possible. Error associated with the measurement process can be described as any factor not directly relevant to the construct or topic being measured. Although there are many ways of categorizing the types of error in test scores, Lyman (1978) presented one useful classification system, in which test score errors were related to the following five factors: (a) the influence of time, (b) test content, (c) the test examiner or scorer, (d) the testing situation, and (e) the examinee.

Osterlind (2005) emphasizes that mental measurements are imperfect and attributes this to two human fallibilities. The first fallibility is that humans do not always respond to a test item or exercise in a manner that typically reflects their best ability. For example, a sick or fatigued examinee may obtain a score that inaccurately reflects his or her true score. Similarly, individuals who are not motivated to do well on a test (or worse, purposely make mistakes or provide inaccurate information) can contribute error to their test scores.

The second human fallibility that may result in measurement error is our inability to produce a flawless testing instrument. Instruments may be imperfect because a test developer may not design an assessment with the precision needed for sound measurement. A testing instrument may also be imperfect because an item writer may create items or exercises that do not accurately represent the cognitive processes involved in measuring a specified task or construct. Further, instruments created to measure mental processes typically assess only a small component of a complex phenomenon. Consequently, we must infer meaning from a limited sample to a broader domain and this inferential process is not without error.

Random and Systematic Measurement Error

According to Nunnally and Bernstein (1994), it is common and appropriate to think of an obtained measure as deviating from a true value, and the resultant measurement error can incorporate a mixture of both systematic and random processes. When error is systematic, it can affect all observations equally and be a constant error or affect certain types of observations differently and consequently demonstrate bias. For example, a miscalibrated thermometer that always reads five degrees too low illustrates a constant error in the physical sciences.
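
The thermometer example can be simulated to show how the two kinds of error behave. This is an illustrative sketch only: the true temperature, the size of the random noise, and the function name are hypothetical choices, and normally distributed noise is an assumption made for convenience.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TRUE_TEMPERATURE = 70.0  # hypothetical true value
CONSTANT_BIAS = -5.0     # miscalibrated: always five degrees too low
NOISE_SD = 1.5           # hypothetical spread of the random error


def read_thermometer(true_value, bias, noise_sd):
    """One reading = true value + systematic (constant) error + random error."""
    return true_value + bias + rng.gauss(0.0, noise_sd)


biased = [read_thermometer(TRUE_TEMPERATURE, CONSTANT_BIAS, NOISE_SD)
          for _ in range(10_000)]
unbiased = [read_thermometer(TRUE_TEMPERATURE, 0.0, NOISE_SD)
            for _ in range(10_000)]

# Averaging many readings cancels most of the random error,
# but the constant error survives in full.
print(round(mean(biased), 1))    # close to 65
print(round(mean(unbiased), 1))  # close to 70
```

Repeating a measurement therefore helps with random error but does nothing for systematic error, which has to be found and corrected at its source.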

It is important to understand that random error applies to an individual response, whereas systematic error applies to a group’s response. More specifically, random error is the difference between a true score and an observed score for an individual, whereas systematic error comprises consistent differences between groups that are unrelated to the construct or skill being assessed. Random errors are considered much more common than systematic errors and are important because they limit the degree of lawfulness in nature by complicating relations (Osterlind, 2005).

Nunnally and Bernstein (1994) suggested that random error may influence scores on a particular classroom test by (a) the content studied (e.g., luck in studying the same information that is on the test), (b) luck in guessing, (c) state of alertness, (d) clerical errors, (e) distractions, (f) not giving your best effort, and (g) anxiety. Random measurement error can never be completely eliminated, but researchers should always strive to minimize the sources of this error. Essentially anything that prevents examinees from exhibiting their optimal score can be considered random error; keep in mind that random error always degrades the assessment of an individual’s true score.

Systematic error is often difficult to identify because its determination frequently relies on arbitrary and erratic judgments. According to Osterlind (2005), systematic error is generally associated with, but not limited to, differential performance on an exam by samples of one sex or a particular ethnic group. It can apply to any distinct subpopulation of an examinee group. In older terminology, systematic error is sometimes referred to as test bias (more recently, differential performance), a label that oversimplifies a complex measurement phenomenon. Osterlind explained that it is the consistency in error that makes it systematic. For example, a compass can provide an inaccurate reading because of influences unrelated to its intended purpose (error), and the compass can give the inaccurate reading every time it is used (systematic).

Shortcomings of Classical Test Theory

Perhaps the most important shortcoming of classical test theory is that examinee and test characteristics cannot be separated. More specifically, the examinee and test characteristics can only be interpreted within the context of the other. The examinee characteristics of interest usually pertain to the proficiency measured by the test. According to Hambleton, Swaminathan, and Rogers (1991), in classical test theory the notion of proficiency is expressed by the true score or the expected value of an observed performance on the test of interest. In classical test theory, an examinee’s proficiency can only be defined in terms of a particular test. The difficulty of a test item can be defined as the proportion of examinees in a group of interest who answer the item correctly (Hambleton, Swaminathan, & Rogers, 1991). Whether an item is difficult or easy depends on the proficiency of the examinees being measured, and the proficiency of the examinee depends on whether the test items are difficult or easy. In other words, a test item or exercise can only be defined in terms of a reference group. Due to the limitations associated with classical test theory, measurement professionals have sought out alternative modern test theory methods to address the problems inherent in classical test theory.
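
The dependence of item difficulty on the reference group is easy to see with numbers. In this small sketch the function name and both sets of responses are hypothetical; the only thing taken from the text is the definition of difficulty as the proportion of examinees answering correctly.

```python
def item_difficulty(item_responses):
    """Classical item difficulty: proportion of examinees answering correctly."""
    return sum(item_responses) / len(item_responses)


# Hypothetical responses to the SAME item from two different groups
# (1 = correct, 0 = incorrect).
advanced_group = [1, 1, 1, 1, 0, 1, 1, 1]
novice_group = [0, 1, 0, 0, 1, 0, 0, 0]

print(item_difficulty(advanced_group))  # 0.875
print(item_difficulty(novice_group))    # 0.25
```

The item itself has not changed, yet it looks easy in one group and hard in the other, which is the circularity the paragraph describes.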

Modern Test Theory

How has classical test theory influenced modern test theory? What are the differences between classical and modern test theory? How will testing and assessment professionals benefit from modern test theory beyond the limitations inherent within classical test theory? Considering and reflecting on these questions has challenged measurement professionals to examine their research methodology practices, data analysis procedures, and decision-making processes. The notion of a true score dates back to the time of early measurement theorists like Galton, Binet, and especially Spearman in the early 20th century. As one may anticipate, error is a feature that distinguishes various psychometric models or theories. Each measurement model defines error differently, and each approaches error from a distinct perspective.

Classical measurement theory is often referred to as a true score theory because of its emphasis on true scores, whereas many modern measurement theories (e.g., latent trait and generalizability) are referred to as theories of reliability or occasionally universe score theories (Osterlind, 2005). Nunnally and Bernstein (1994) suggested that contrasting what is classical test theory versus what is modern test theory is always a bit risky, but they consider measures based on linear combinations as classical. For example, Thurstone’s law of comparative judgment is generally regarded as classical because it appeared more than 60 years ago, but it is not based upon linear combinations. Conversely, the 1950 Guttman scale is considered modern, despite its long history, because it is based upon individual response profiles rather than sums.

Although classical measurement theory has served many practical testing problems well for more than a century with little change, Osterlind (2005) believes that classical measurement theory is not comprehensive enough to address all theoretical and practical test problems. For example, in classical test theory, random error is presumed to be equal throughout the entire range of the score distribution; this assumption is not realistic. Although the standard error of measurement is extremely useful, it does not reveal differing error rates at various points in the distribution. Moreover, classical measurement theory treats each item and exercise on a test as equal in difficulty to all other items and exercises; this assumption also is not realistic. Although classical test theory has its deficiencies, it does have the advantage of being simpler to understand and is more accessible to a wider measurement audience. Whether measurement professionals choose to utilize classical test theory, modern test theory, or both to address measurement issues, they will necessarily rely on foundational concepts such as reliability and validity to determine the consistency and meaningfulness of a test score or what the test score truly means.
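
The standard error of measurement is conventionally computed from the standard deviation of the scores and the reliability coefficient. The sketch below assumes that textbook formula, SEM = SD * sqrt(1 - reliability); the function name and the numbers are made up for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability): a single value for the whole scale."""
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical test with SD = 15 and reliability = 0.91.
sem = standard_error_of_measurement(15.0, 0.91)
print(round(sem, 2))  # 4.5

# The same error band is applied to a low, a middle, and a high observed
# score, which is the limitation described in the text.
for observed in (70, 100, 130):
    print(observed, round(observed - sem, 1), round(observed + sem, 1))
```

Because the formula has no term for the score level, it cannot express the possibility that very low or very high scores are measured less precisely than scores near the middle.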

Reliability

Reliability can be defined as the extent to which measurements are consistent or repeatable over time. Further, reliability may also be viewed as the extent to which measurements differ from occasion to occasion as a function of measurement error. Reliability is seldom an all-or-nothing matter, as there are different types and degrees of reliability. A reliability coefficient is an index of reliability or, more specifically, a proportion that indicates the ratio between the true score variance on a test and the total variance (Cohen & Swerdlik, 2002).
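
The ratio described above can be written compactly. The symbols are conventional rather than taken from the text: \sigma_T^2 is the true score variance, \sigma_E^2 the error variance, and \sigma_X^2 the total (observed score) variance. The second equality assumes that true scores and errors are uncorrelated.

```latex
r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

For instance, with a hypothetical true score variance of 80 and error variance of 20, the coefficient would be 80 / 100 = .80, meaning that 80% of the observed variance is attributed to real differences among examinees.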

Reliability has become an essential concept providing researchers with theoretical guidelines and mathematical measures to evaluate the quality of a psychological or educational construct. Stated differently, reliability measures help researchers identify and diminish the amount of error involved in measuring a psychological or educational construct. Error implies that there will always be some inaccuracy in our measurements, and measurement error is common in all fields of science. Psychological and educational specialists, however, have devoted a great deal of time and study to measurement error and its effects. More specifically, they have sought to identify the source and magnitude of such error and to develop methods by which it can be minimized. Generally, tests that are relatively free of measurement error are considered reliable, and tests containing measurement error are considered unreliable.

In testing and assessment settings, many factors complicate the measurement process because researchers are rarely interested in measuring simple concrete qualities such as height or length. Instead, researchers typically seek to measure complex and abstract traits such as intelligence or aptitude. Consequently, educational or psychological researchers must carefully assess the reliability and meaningfulness of their measurement tools (e.g., tests or questionnaires) in order to make better predictions or inferences regarding the phenomena they are studying.

When evaluating the reliability of a measure, researchers should first specify the source of measurement error they are trying to evaluate. If researchers are concerned about errors resulting from a test being administered at different times, they might consider employing a test-retest evaluation method, in which test scores obtained at two different points in time are correlated. On other occasions, researchers may be concerned about errors that arise because they have selected a small sample of items to represent a larger conceptualized domain. To evaluate this type of measurement error, researchers could use a method that assesses the internal consistency of the test, such as the split-half evaluation method.
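
Both evaluation methods reduce to computing correlations, as the following sketch suggests. It is a minimal illustration under stated assumptions: the scores are invented, the helper names are hypothetical, the split is an odd-even split, and the half-test correlation is stepped up with the Spearman-Brown formula, a customary adjustment that the text does not mention.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """Odd-even split: correlate the two half scores, then apply Spearman-Brown."""
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical total scores for the same five examinees on two occasions.
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]
test_retest = pearson_r(time_1, time_2)

# Hypothetical 0/1 responses: 5 examinees by 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
split_half = split_half_reliability(responses)

print(round(test_retest, 3), round(split_half, 3))
```

The test-retest coefficient speaks to error introduced by the passage of time, whereas the split-half coefficient speaks to error introduced by the particular sample of items, so the two numbers answer different questions and need not agree.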

Although reliability is a critical factor in determining the value of a test or assessment, it is not a sufficient condition in and of itself. Testing and measurement professionals must also evaluate the validity of a test or assessment.

Validity

The term validity may engender different interpretations based on the purpose of its use (e.g., everyday language or legal terminology). However, when the term is used to describe a test, validity typically refers to a judgment pertaining to how effectively the test measures what it purports to measure. More specifically, the term is used to express a judgment based on acquired evidence regarding the appropriateness of the inferences drawn from test scores (Cohen & Swerdlik, 2002). The most recent standards for educational and psychological testing, published in 1999, emphasize that validity is a unitary concept representing a compilation of evidence supporting the intended interpretation of a measure. Some commonly accepted forms of evidence that researchers may use to support the unitary concept of validity are (a) content validity, (b) criterion-related validity, and (c) construct validity.

Content validity evidence has typically been of greatest concern to educational testing and may be described as a judgment concerning the adequacy with which a test measures behavior that is representative of the universe of behavior it was designed to measure. Criterion-related validity evidence may be described as evidence demonstrating that a test score corresponds to an accurate measure of interest. Finally, construct validity may be described as a judgment related to the appropriateness of inferences drawn from test scores regarding individual standings on a variable referred to as a construct (e.g., intelligence).

Methods and Applications

Researchers can use classical and modern test theory methods and applications to develop, maintain, and revise tests or assessments intended to measure academic achievement, intelligence, and aptitude or potential to succeed in a specific academic or employment setting. The assessment of aptitude and achievement began sometime after the assessment of intelligence and was aimed primarily at identifying more specific abilities. Intelligence tests (e.g., the Stanford-Binet and the Wechsler) were useful because they produced valuable assessment information about overall intellectual level (global intelligence), but limited because they yielded little information about special abilities. The development of aptitude and achievement tests was an attempt to bridge this gap (Walsh & Betz, 2001). Aptitude tests were thought to measure people’s ability to learn if given the opportunity (future performance), whereas achievement tests were thought to measure what people had in fact learned (present performance).

Measuring Academic Achievement

Measuring academic achievement can be a daunting and challenging task that requires constructing test items that accurately measure student learning objectives. The process of constructing good test items is a hybrid of art and science. Typically, test creators who seek to measure academic achievement are interested in developing selected-response (e.g., multiple-choice) and supply-response (e.g., essay) test items. However, before developing such a test, test developers must intimately understand the important relation between classroom instruction and the subsequent assessment methods and goals. A necessary condition for effective and meaningful instruction involves the coexistence and continuous development of the instructional, learning, and assessment processes.

The relation of instruction and assessment becomes evident when instructors closely examine the roles of each process. For example, Gronlund (2003) emphasized this relation when he stated, “Instruction is most effective when directed toward a clearly defined set of intended learning outcomes and assessment is most effective when designed to assess a clearly defined set of intended learning outcomes” (p. 4). Essentially, the roles of instruction and assessment are inseparable.

Pedagogic research (e.g., Ory & Ryan, 1993) has encouraged instructors to use Bloom’s taxonomy to analyze the compatibility of their instructional process, their desired student outcomes or objectives, and their test items. Instructors typically use test scores to make inferences about student content mastery. Consequently, it is essential that these inferences be valid. The inferences made by instructors are more likely to be valid when the test items constitute a representative sample of course content, objectives, and difficulty. Therefore, when developing test items, instructors should revisit the relation between course objectives, instruction, and testing. The assessment process involves more than merely constructing test items. Instructors should also become adept at administering, scoring, and interpreting objective and essay type tests.

Multiple-Choice Test Items

Test items are typically presented as objective (multiple-choice, true and false, and matching) and essay type items. Multiple-choice items represent the most frequently used selected-response format in college classrooms (Jacobs & Chase, 1992), and instructors should become adept at creating and revising them. The benefits of using multiple-choice items include (a) accurate and efficient scoring, (b) improved score reliability, (c) a wide sampling of learning objectives, and (d) the ability to obtain diagnostic information from incorrect answers. The limitations include (a) the increased time necessary for creating items that accurately discriminate mastery from nonmastery performance, (b) the difficulty of creating items that measure complex learning objectives (e.g., items that measure the ability to synthesize information), (c) the difficulty and time-consuming nature of creating plausible distracters (i.e., incorrect response alternatives), (d) the increased potential for students to benefit from guessing, and (e) the possibility that the difficulty of the items becomes a function of reading ability, even though reading ability may not be the purpose of the assessment.

True-False Test Items

True-false item formats typically measure the ability to determine whether declarative statements are correct. Effective true-false items are difficult to construct because they usually reflect isolated statements with no (or limited) frame of reference (Thorndike, 1997). The benefits of using true-false items include (a) ease of construction (when compared with multiple-choice items), (b) accurate and efficient scoring, (c) flexibility in measuring learning objectives, and (d) the usefulness for measuring outcomes or objectives with two possible alternatives. The limitations include (a) an increased guessing potential, (b) the difficulty of creating unequivocally true or false items, (c) a measurement of typically trivial knowledge, and (d) a lack of diagnostic information from incorrect responses (unless students are required to change false statements into true statements).

Matching Test Items

Matching items typically measure associative learning or simple recall, but they can assess more complex learning objectives (Jacobs & Chase, 1992). The benefits of using matching items include (a) ease of construction, (b) accurate and efficient scoring, and (c) short reading and response times. The limitations include (a) measurement of simple recall and associations, (b) difficulty in selecting homogeneous or similar sets of stimuli and response choices, and (c) provision of unintended clues for response choices.

Essay Test Items

Essay questions are more useful than selection-type items when measuring the ability to organize, integrate, and express ideas (Gronlund, 2003). The benefits of using essay items include (a) ease of construction, (b) measurement of more complex learning objectives, and (c) effective measurement of the ability to organize, compose, and logically express relations or ideas. The limitations include (a) time consumption for grading, (b) decreased score reliability when compared with objective items, (c) limited ability to sample several learning objectives due to time constraints, and (d) the items’ typical dependence on language.

Measuring Intelligence

The measurement of intelligence is perhaps one of the most controversial topics that testing and assessment professionals encounter. To begin, there is a great deal of debate regarding a common definition of what entails intelligence and how it can best be measured. Intelligence may be defined as a multifaceted capacity that manifests itself in different ways across the developmental lifespan, but in general it includes the abilities and capabilities to acquire and apply knowledge, to reason logically, to plan effectively, to infer perceptively, to exhibit sound judgment and problem-solving ability, to grasp and visualize concepts, to be mentally alert and intuitive, to be able to find the right words and thoughts with facility, and to be able to cope, adjust, and make the most of new situations (Cohen & Swerdlik, 2002). Although this definition appears to be comprehensive, it also demonstrates the inherent difficulty involved with measuring an individual’s intelligence given the broad range of factors that can potentially be associated with intelligence.

Two common contemporary measures of intelligence are the Stanford-Binet Intelligence Scale and the Wechsler tests. The fourth edition of the Stanford-Binet contains 15 separate subtests yielding scores in the following four areas of cognitive ability: verbal reasoning, abstract/visual reasoning, quantitative reasoning, and short-term memory (Cohen & Swerdlik, 2002). The Wechsler tests were designed to assess the intellectual abilities of people from the preschool years (ages 3 to 7) through childhood (ages 6 to 16) and adulthood (ages 16 to 89). The tests are similar in structure, and each contains several verbal and performance scales.

Measuring Aptitude or Ability

Classical and modern test theory methods and applications can effectively be used to develop, maintain, and revise tests or assessments intended to measure aptitude or ability. For example, when researchers or admissions committees want to know which students should be selected for a graduate program (e.g., medical or law school), they often depend on aptitude measures to predict future behavior or inclinations. The forecasting function of a test is actually a type or form of criterion validity known as predictive validity evidence (Kaplan & Saccuzzo, 2001). For example, a student’s score on the Medical College Admission Test (MCAT) may serve as predictive validity evidence if it accurately predicts how well that particular student will perform in medical school. The purpose of the test is to predict the student’s likelihood of succeeding on the criterion, that is, successfully achieving or meeting the academic requirements set forth by the medical school. A valid test for this purpose would help admissions committees to make better decisions because it would provide evidence as to which students would typically succeed in an academic medical school setting.
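
Predictive validity evidence of this kind is commonly summarized as a correlation between scores on the predictor and a later criterion measure. The following Python sketch illustrates that idea under stated assumptions: the function name pearson_r, the choice of first-year grade point average as the criterion, and all of the data values are hypothetical and were invented for illustration; they are not drawn from any actual admissions data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only:
# admission test scores (predictor) and later first-year GPA (criterion).
admission_scores = [498, 502, 505, 509, 512, 515, 518, 521]
first_year_gpa = [2.7, 3.0, 2.9, 3.2, 3.1, 3.5, 3.4, 3.8]

validity_coefficient = pearson_r(admission_scores, first_year_gpa)
print(f"predictive validity coefficient: {validity_coefficient:.2f}")
```

A coefficient near 1 would indicate that higher test scores tend to go with better later performance; a coefficient near 0 would offer little support for using the test to forecast the criterion.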

Measuring Occupational Aptitude

Business corporations and the government typically use aptitude tests to facilitate their decision-making processes concerning employment recruitment, placement, and promotion. An example of an occupational assessment is the General Aptitude Test Battery (GATB), which is a reading ability test that purportedly measures aptitude for a variety of occupations. The U.S. Employment Service developed the GATB to help make employment decisions in government agencies. The GATB seeks to measure a wide array of aptitudes, ranging from general intelligence to manual dexterity (Kaplan & Saccuzzo, 2001). The Department of Defense utilizes the Armed Services Vocational Aptitude Battery (ASVAB). The ASVAB yields scores that apply to both educational and military settings. In the latter, the ASVAB results are used to facilitate the identification of students who may qualify for entry into the military, and they can potentially be used by military officials to recommend the assignment of soldiers to various occupational training programs (Kaplan & Saccuzzo, 2001).

Future Directions

Attempting to predict the future trends of educational or psychological testing and assessment can be beneficial for several reasons. The examination of trends can inform current practice, clarify future research goals, and identify areas of potential concern or danger. Before attempting to speculate on the future of testing and assessment, it seems prudent to contemplate several historical questions raised by Engelhard (1997). For example, what is the history of educational and social science measurement? Have we made progress, and if so, how should progress be defined within the context of measurement theory and practice? Who are the major measurement theorists in the social sciences, and what are their contributions? What are the major measurement problems in education and the social sciences, and how have our views and approaches to these problems (i.e., reliability, validity, test bias, and objectivity) changed over time?

Reflecting upon Engelhard’s questions is critical for measurement professionals in the 21st century; they surely will be confronted with an unprecedented array of social, legal, economic, technical, ethical, and educational issues. More important, measurement professionals will likely have the opportunity to influence the roles of localization, teacher education, computerized testing, assessment variation, and litigation pertaining to the field of testing and assessment by becoming an integral part of future policy debate.

Localization

Cizek’s (1993) thought-provoking article, “Some Thoughts on Educational Testing: Measurement Policy Issues into the Next Millennium,” discusses localization, a trend for elementary and secondary education programs to develop, administer, and interpret their own tests and assessments for use at the district, school, or classroom level. Localization represents a departure from reliance on national and commercially produced tests and assessments.

The trend to limit the use of national testing will inevitably place more responsibility and accountability on school districts, schools, and teachers to become more actively involved in the assessment process. Localization proponents believe it is valuable for teachers to be intimately involved in constructing, administering, and interpreting assessments relevant to their students’ educational needs. However, some measurement professionals have expressed concern that most teachers do not receive adequate formal training in testing and assessment, unless they pursue a master’s degree.

Rudman (1987) has discussed an alternative to the full localization approach; he believes that in the future some test publishers will offer tailored instruments to both local and state agencies. These tailored instruments would likely contain items from previous editions of standardized survey-type achievement tests and national test-item banks. These items would be combined with the so-called local items supplied by classroom teachers, making it possible to compare the local items with the previously standardized items and to examine whether their psychometric properties are appropriate.

Computer-Assisted Testing

Computer-assisted testing and measurement, with the convenience and economy of time it affords in administering, scoring, and interpreting tests, will continue to evolve and to play an important role in the work of testing and assessment professionals. Computer-assisted assessment is unlike conventional testing, in which all examinees receive all items. Rather, its focus is on providing each examinee with a unique assessment based on that examinee’s proficiency level. When this objective is accomplished, the result is known as a tailored or adaptive test. According to Nunnally and Bernstein (1994), if testing is under computer control, the tailored test is referred to as a computerized adaptive test (CAT). A CAT can be administered on even the least expensive personal computers now available. The test has the potential to be different for each examinee, depending on how he or she performs on the test items. For example, an examinee who displays a response pattern consistent with his or her proficiency level may take a shorter CAT, whereas an examinee who displays an aberrant or irregular response pattern (e.g., answering a difficult item correctly and an easy item incorrectly) may take a longer test because it will be more difficult to estimate his or her actual proficiency level. Each item on a CAT typically has a known standardized difficulty level and discrimination index.
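
To make the role of item difficulty and discrimination concrete, the following Python sketch shows one way an adaptive test might choose its next item. It assumes the two-parameter logistic (2PL) item response model, which is a common choice but is not named in the sources cited here. The function names, the three-item pool, and all parameter values are hypothetical.

```python
import math


def prob_correct(theta, difficulty, discrimination):
    """2PL probability that an examinee with proficiency theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta, difficulty, discrimination):
    """Information the item provides about proficiency theta under the 2PL model."""
    p = prob_correct(theta, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


def most_informative(pool, theta):
    """Return the item in the pool with the highest information at theta."""
    return max(
        pool,
        key=lambda item: item_information(
            theta, item["difficulty"], item["discrimination"]
        ),
    )


# Hypothetical item pool; the parameter values are invented for illustration.
pool = [
    {"id": "easy", "difficulty": -2.0, "discrimination": 1.0},
    {"id": "medium", "difficulty": 0.0, "discrimination": 1.0},
    {"id": "hard", "difficulty": 2.0, "discrimination": 1.0},
]

# For each provisional proficiency estimate, show which item would be chosen.
for theta in (-1.5, 0.2, 1.8):
    chosen = most_informative(pool, theta)
    print(f"theta={theta:+.1f} -> next item: {chosen['id']}")
```

Under this model an item is most informative when its difficulty is close to the examinee’s current proficiency estimate, which is why items that are far too easy or far too hard for a given examinee contribute little and can be skipped.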

Conventional paper-and-pencil tests employ many items that generate little information about the examinee, especially for examinees at the extreme ends of the measured construct. For example, proficient students are typically asked too many easy, time-wasting questions. An even worse scenario may involve low-ability students being asked too many difficult questions that may detrimentally affect their self-confidence. An advantage of CAT involves the administration of only a sample of the total items in the test bank item pool to any one examinee; this reduces the number of items that need to be administered by as much as 50 percent. On the basis of previous response patterns, items that have a high probability of being answered correctly (if it is a proficiency test) are not presented, thus providing economy in terms of testing time and total number of items presented (Embretson, 1996).

Another advantage of the CAT is the utilization of an item bank, which is a set of test items stored in computer memory and retrieved on demand when a test is prepared. Each item is stored and uniquely classified by several dimensions, such as (a) item type, (b) content measured, (c) difficulty level, and (d) date of last use. According to Cohen and Swerdlik (2002), another major advantage of the CAT is the capability for item branching—the ability of the computer to tailor the content and presentation order of the items on the basis of examinee responses to previous items. For example, a computer may be programmed to not present an item related to the next difficulty level until two consecutive items of the previous difficulty level are answered correctly. The computer can also be programmed to terminate an exam or a subcategory of an exam at specified levels.
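
The item bank and the branching rule described above can be sketched in Python as follows. This is a minimal illustration, not the design of any particular testing system: the class Item, the functions next_level and pick_item, the integer difficulty levels, and the sample bank entries are all hypothetical. Choosing the least recently used item at a level is likewise an assumption about how the date of last use might be applied.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    item_id: str
    item_type: str   # (a) item type
    content: str     # (b) content measured
    difficulty: int  # (c) difficulty level, 1 = easiest
    last_used: date  # (d) date of last use


def next_level(current_level, responses, max_level, streak_needed=2):
    """Move up one level only after `streak_needed` consecutive correct
    answers at the current level; otherwise stay at the current level.

    `responses` is a list of (level, is_correct) pairs in administration order.
    """
    streak = 0
    for level, is_correct in reversed(responses):
        if level == current_level and is_correct:
            streak += 1
        else:
            break
    if streak >= streak_needed and current_level < max_level:
        return current_level + 1
    return current_level


def pick_item(bank, level, used_ids):
    """Pick an unused item at the given level, least recently used first."""
    candidates = [
        item for item in bank
        if item.difficulty == level and item.item_id not in used_ids
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item.last_used)


# Hypothetical bank entries, invented for illustration.
bank = [
    Item("A1", "multiple-choice", "fractions", 1, date(2005, 3, 1)),
    Item("A2", "multiple-choice", "fractions", 1, date(2004, 9, 15)),
    Item("B1", "multiple-choice", "ratios", 2, date(2005, 1, 10)),
]

history = [(1, True), (1, True)]  # two consecutive correct answers at level 1
level = next_level(1, history, max_level=2)
item = pick_item(bank, level, used_ids={"A1", "A2"})
print(level, item.item_id if item else None)
```

A termination rule of the kind mentioned above could be added in the same style, for example by stopping once a fixed number of items has been administered at the highest level reached.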

In general, the potential advantages of CAT are attributed to the objectivity, accuracy, and efficiency that computers and software bring to various aspects of testing and assessment. According to Cohen and Swerdlik (2002), CAT may include these potential disadvantages: (a) A CAT may be an intimidating experience for an examinee; (b) test-taking strategies that have worked for examinees in the past, such as previewing, reviewing, and skipping around the test material to answer easier questions first are not possible; and (c) examinees are deprived of the option to purposefully omit items because they must enter a response before proceeding to the next item. Tailored testing appears to be moving out of its experimental phase and becoming more accessible to a larger measurement audience because computer technology is cheaper to purchase and because item response theory applications are becoming easier to use.

Testing and Assessment Litigation

Another important issue for measurement professionals in the 21st century pertains to testing litigation. Cizek (1993) explained that competency tests, licensure examinations, and personnel evaluations will undoubtedly continue to be challenged in courts when opportunities for advancement are denied or bias may be present. Cizek believes that the process of setting standards will probably receive the most scrutiny, given its arbitrary nature and debatable empirical grounding in the field of psychometrics.

A final legal or ethical consideration involves the role of test publishers in assessment. Rudman (1987) believed that test publishers would continue to be plagued by the improper use of their tests. For example, he indicated that once a test company sells a test to a client, it has limited or no control over the way the test is administered or interpreted. Measurement professionals should be conscientious regarding the improper administration or interpretation of a test or assessment that can deleteriously affect examinees.

Summary

During the last century, educational and psychological tests and assessments have demonstrated tremendous utility for addressing a wide range of applied problems. They have proved to be excellent tools for facilitating increased understanding through theory development and research. However, testing and measurement professionals should remain cautious because there is always the potential for tests to be used in harmful, inappropriate, or inaccurate ways. Consequently, these professionals must be familiar with, and guide their research practices in accordance with, the American Psychological Association’s (APA) ethical practices as well as the American Educational Research Association, APA, and National Council on Measurement in Education testing standards (1999). Further, they must also be intimately familiar with the possible negative effects of various test uses and with procedures by which those deleterious effects can be minimized. If tests and assessments are used cautiously, knowledgeably, thoughtfully, and ethically, their potential for great benefits and wide practical utility can be fully realized.

References:

American Educational Research Association, American Psychological Association, & the National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bloom, B. S. (1971). Mastery learning. In J. H. Block (Ed.), Mastery learning: Theory and practice (pp. 47-63). New York: Holt, Rinehart & Winston.
Cizek, G. J. (1993). Some thoughts on educational testing: Measurement policy issues into the next millennium. Educational Measurement: Issues and Practice, 12, 10-16.
Code of fair testing practices in education. (1988). Washington, DC: Joint Committee on Testing Practices.
Cohen, R. J., & Swerdlik, M. E. (2002). Psychological testing and assessment. Boston: McGraw-Hill.
Darwin, C. (1859). On the origin of species by means of natural selection. London: Murray.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349.
Engelhard, G., Jr. (1997). Introduction. Educational Measurement: Issues and Practice, 12, 5-7.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.
Garb, H. N. (2000a). Introduction to the special section on the use of computers for making judgments and decisions. Psychological Assessment, 12, 3-5.
Garb, H. N. (2000b). Computers will become increasingly important for psychological assessment: Not that there’s anything wrong with that! Psychological Assessment, 12, 31-39.
Glaser, R. (1994). Criterion-referenced tests: Origins. Educational Measurement: Issues and Practice, 12, 9-11.
Gronlund, N. E. (2003). Assessment of student achievement (7th ed.). Boston: Allyn & Bacon.
Hambleton, R. K., Swaminathan, H., & Rogers, J. H. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hofer, P. J., & Green, B. F. (1985). The challenge of competence and creativity in computerized psychological testing. Journal of Consulting and Clinical Psychology, 53, 826-838.
Honaker, L. M., & Fowler, R. D. (1990). Computer-assisted psychological assessment. In G. Goldstein & M. Hersen (Eds.), Handbook of psychological assessment (2nd ed., pp. 521-546). New York: Pergamon.
Jacobs, L. C., & Chase, C. I. (1992). Developing and using tests effectively: A guide for faculty. San Francisco: Jossey-Bass.
Kaplan, R. M., & Saccuzzo, D. P. (2001). Psychological testing: Principles, applications, and issues (5th ed.). Belmont, CA: Wadsworth/Thomson Learning.
Linn, R. L. (1989). Educational measurement (3rd ed.). New York: Macmillan.
Lyman, H. B. (1978). Test scores and what they mean (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Ory, J. C., & Ryan, K. E. (1993). Tips for improving testing and grading (4th ed.). Newbury Park, CA: Sage.
Osterlind, S. J. (2005). Modern measurement: Theory, principles and applications of mental appraisal. Upper Saddle River, NJ: Prentice Hall.
Paris, S. G., Lawton, T. A., Turner, J. C., & Roth, J. L. (1991). A developmental perspective on standardized achievement testing. Educational Researcher, 20, 12-20.
Popham, J. W. (1993). Educational testing in America: What’s right, what’s wrong? Educational Measurement: Issues and Practice, 12, 11-14.
Rudman, H. C. (1987). The future of testing is now. Educational Measurement: Issues and Practice, 6, 5-11.
Smith, M. L. (1991). Put to the test: The effects of external testing on teachers. Educational Researcher, 20, 8-11.
Sturges, J. W. (1998). Practical use of technology in professional practice. Professional Psychology: Research and Practice, 29, 183-188.
Thorndike, E. L. (1910). The contributions of psychology to education. The Journal of Educational Psychology, 1, 5-12.
Thorndike, R. M. (1997). Measurement and evaluation in psychology and education (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Traub, R. E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 12, 8-13.
Walsh, B. W., & Betz, N. E. (2001). Tests and assessment (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Ward, A. W., Stoker, H. W., & Murray-Ward, M. (1996). Educational measurement: Origins, theories, and explications: Vol. I. Basic concepts and theories. New York: University Press of America.
Weiss, D. J., & Vale, C. D. (1987). Computerized adaptive testing for measuring abilities and other psychological variables. In J. N. Butcher (Ed.), Computerized psychological assessment: A practitioner’s guide (pp. 325-343). New York: Basic.
Wilbrink, B. (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48.