Psychometrics Research Paper


The field of psychometrics aims to measure psychological phenomena. Researchers disagree as to whether and how psychological measurement is accomplished. In his letter to the Grand Duchess Christina, Galileo Galilei (1610/1957) stated that the universe has its own language and set of characters; if we wish to understand our universe, we must recognize that it “was written in the language of mathematics” (p. 237). This research paper introduces the field of psychometrics.

Definition

When we analyze the roots of the word psychometrics, we find psycho, which means “individual” or “mind,” and metric, which means “measurement.” The field of psychometrics attempts to measure psychological phenomena. A more specific working definition of psychometrics is derived from the field’s historical roots. Galileo’s philosophy about the importance of numbers is reiterated in the works of Sir Francis Galton. Galton (1879) defined psychometry as the “art of imposing measurement and number upon operations of the mind” (p. 149). Taken together, these philosophies suggest that psychometrics focuses its efforts on the quantification of psychological constructs. That is, as a discipline it assumes that all worthy human attributes can somehow be quantified. Today, this definition is applied more broadly to include more than mental operations. Modern psychometrics attempts to measure all aspects of the human condition. These include our cognitions—thoughts, beliefs, attitudes, and perceptions; our behaviors—overt (observable) and covert (within the individual), intentional and unintentional, and personal and social; and our emotions—positive, negative, primary, and secondary. Measuring these human attributes poses many challenges, as illustrated in the colorful history of psychometrics as a discipline.

History

Psychometrics has a home in the broader field of psychology, an arm of the social and behavioral sciences. To understand its role in this family of sciences, one must first understand its conception and birth, so a brief history of psychology is needed. Whereas 18th-century philosophers (Immanuel Kant, Erasmus Darwin, and Ernst Heinrich Weber) planted the idea of psychology as a science, it took the work of 19th-century physiologists and physicists (Gustav Theodor Fechner, Charles Darwin, Herbert Spencer, and Wilhelm Wundt) to develop the seedling field into a full-fledged discipline with true experimental methodology. The thread that united these diverse thinkers was physics, the exemplar science they hoped psychology would become. Psychology’s scientific path was altered, however, by the onset of World War II (1939) and the events and policy decisions that followed. Grave human atrocities were witnessed by Allied and Axis powers alike, and many of these infringements upon human rights were committed in the name of scientific advancement. With the world astonished, leading psychologists in both Germany and the United States were well positioned to effect social reform on a scale unparalleled today.

In this manner, modern psychometrics was conceived by two parents—physics and bureaucracy. From its ancestry in physics, psychometrics gained objectivity, quantification, parametric statistics, systematic methodology, and inductive and deductive reasoning. From its ancestry in bureaucracy, psychometrics gained pragmatism, censuses and surveys, nonparametric statistics, qualitative methodology, administration, and reform. Evidence of the field’s dual ancestry remains today, as mentors continue to train their apprentices in one approach or the other. Thus, emerging psychometric researchers tend to be skilled in either quantitative or qualitative methodology. Even though their viewpoints clearly differ, contemporary psychometricians are more compatible than combative. Both approaches have much to contribute to our understanding of psychology.

The Importance Of Psychometrics To Psychology And Individual Differences

Like any true science, psychology is a dynamic discipline. Individuals develop, societies evolve, and variables of interest change. Therefore, our measurement tools and procedures must continually change as well. The use of quantitative and qualitative methods has produced thousands of research findings that have advanced our understanding of personality, neurology, memory, intelligence, learning, emotion, and child development. Likewise, these psychometric methods have advanced our knowledge in areas outside of psychology such as medicine, pharmacology, engineering, education, sociology, and politics. Over the decades, consistent bodies of findings have emerged in each area, lending reliability (consistency over time or across samples) to our knowledge bases. In certain areas, efforts aimed at understanding aggregate effects across individuals have been largely exhausted. These research outcomes spurred the field of phenomenology (the study of individual human perceptions).

Where objective, aggregate approaches leave off, phenomenological approaches promise to add to our knowledge bases by looking inward, into human consciousness. Phenomenological research employs qualitative methods of inquiry, as it is more interested in how individual differences determine behavior than in one’s objective responses to stimuli. Consequently, phenomenologists rely on descriptive procedures rather than inferential ones. As history has often witnessed, the emergence of an integrated approach was inevitable. Thus was born the field of psychophysics, the study of the relation between physical stimuli and the subjective perceptions they produce. Both the quantitative and qualitative approaches have much to offer the evolving science of measurement. The quantitative approach has contributed objective procedures that allow us to estimate differences across individuals, and the qualitative approach has contributed procedures such as content analysis that yield a rich picture of individuals as they see the world.

Theory And Modeling

The vast majority of psychometricians have been trained in the quantitative tradition. Therefore, the majority of current psychometric procedures derive from two sets of theories—classical test theories and modern latent trait theories. Classical test theories share a history that stretches back to Sir Francis Galton, and they are referred to as “classic” because of their reliance on true score theory. True score theory posits that any measurement may be broken down into two parts—a true score and an error score. For example, if a metric measuring tape is used to measure a group of adults’ heights, each individual’s measured height consists of two portions: the individual’s true height in centimeters and some degree of measurement error, ideally only a few millimeters or fractions thereof. The primary goals of classical test theories (CTT) are to analyze individual differences across test scores and to improve the reliability of measurement tools. They do not aim to understand individual test scores. Therefore, results from studies rooted in CTT may be generalized (applied broadly) to populations but not to the level of the individual.
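
In notation, true score theory holds that an observed score X is the sum of a true score T and an error score E. Assuming errors are random and uncorrelated with true scores (the standard classical assumption), the variances add, and reliability is defined as the proportion of observed-score variance that is true-score variance:

    X = T + E, \quad \sigma_X^2 = \sigma_T^2 + \sigma_E^2, \quad \rho_{XX'} = \sigma_T^2 / \sigma_X^2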

Modern latent trait theories, more commonly referred to as item response theories, share a history reaching back to early 20th-century work on intelligence (Alfred Binet), but their popularity surged in the 1960s with the item response models developed by Georg Rasch. Item response theories (IRT) are models that endeavor to relate person characteristics and item characteristics to the probability of a discrete (categorical) outcome such as a “yes” or “no” answer to a test question. Similar to CTT, item response theory aims to improve the reliability of measurement tools. However, IRT specializes in improving reliability through its investigation into the psychometric properties of assessments. Both CTT and IRT use a series of quantitative procedures to determine measurement reliability, such as correlation (a number reflecting the linear relation between variables) and covariance (the degree to which variables vary together). The key element that differentiates the two test theories is how models (visual representations of relations among variables) are used to assess a measure’s psychometric properties. Whereas CTT strives to identify a model that best fits a set of data, IRT strives to identify or obtain data that fit a model. It is a simple case of inductive (from the specific to the general) versus deductive (from the general to the specific) reasoning, respectively.
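
For example, the simplest such model, Rasch’s one-parameter logistic model, expresses the probability that a person with ability θ answers an item of difficulty b correctly as

    P(X = 1 \mid \theta, b) = e^{\theta - b} / (1 + e^{\theta - b}),

so the probability of success is .50 when ability exactly matches item difficulty.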

The Model Fits the Data

With the goal of identifying the model that best fits the data, classical methods analyze matrices (rectangular arrays of numbers) of correlations and covariances. Several classical methods exist. The four most commonly used classical analyses are (a) factor analysis, modeling aimed at explaining variability among observed variables with a smaller set of unobserved factors; (b) cluster analysis, modeling that partitions a dataset into subsets with shared traits; (c) path analysis, a form of multiple regression that is causal modeling aimed at predicting endogenous variables (dependent variables) from exogenous variables (independent variables); and (d) structural equation modeling, a form of confirmatory factor analysis aimed at testing theory by modeling social behavioral constructs as latent variables.

Because each of the aforementioned methods uses datasets taken from a sample, results obtained from them may differ across samples. Another limitation of classical methods has been documented for over two decades by David Rogosa and colleagues. Rogosa (2004) asserts that because classical models assess between-subjects (across individuals) variability and assume that within-subjects (within individuals) variability has a constant rate of change, they inadequately measure individual differences in response to a treatment effect (intervention). For this reason he suggests that most classical test methods are misleading, as individual growth curves are not considered appropriately.

The Data Fit the Model

With the goal of obtaining data that fit a specified model, IRT statistical procedures differ dramatically from those employed in classical testing. The primary procedure of IRT modeling relates person characteristics and item characteristics to item responses, enabling improvement of the assessment tool’s reliability. IRT uses logistic models to estimate person parameters (e.g., reading comprehension level) and one or more item parameters (e.g., difficulty, discrimination). In this manner, IRT models have two advantages. First, unlike CTT, which ascertains only an assessment’s average reliability, IRT models allow reliability to be shaped carefully for different ranges of ability by including only the best items. Thus, IRT models yield stronger reliability findings. Second, because IRT models provide greater flexibility in situations where data are collected at different times, with different samples, and with varied test forms, measures derived from IRT are not dependent upon the particular sample tested.
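
As an illustration of such a logistic model, the sketch below computes a two-parameter logistic (2PL) item characteristic curve. The parameter values are hypothetical, chosen only to show how difficulty and discrimination shape the probability of a correct response.

    import math

    def icc_2pl(theta, a, b):
        """Two-parameter logistic (2PL) model: probability that a person
        with ability theta answers correctly an item with discrimination a
        and difficulty b."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # Hypothetical item: moderately discriminating (a = 1.2) and
    # slightly difficult (b = 0.5).
    for theta in (-2, -1, 0, 1, 2):
        print(f"ability {theta:+d}: P(correct) = {icc_2pl(theta, 1.2, 0.5):.2f}")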

Although item response theories are fairly new and still developing, two related disadvantages prevent them from having mainstream appeal. First, IRT relies on complex logistic modeling, and many psychometric researchers lack the mathematical and procedural skills to employ and interpret such models. Second, CTT analyses are readily available in user-friendly applications such as SAS (statistical analysis software) and SPSS (formerly, the Statistical Package for the Social Sciences), whereas IRT modeling applications are rarer, more expensive, and not currently available in SAS or SPSS. However, IRT methodology continues to develop, making the impact of these new modeling approaches promising.

Methods

All psychometric methods are one of two types, qualitative or quantitative. Qualitative methods investigate the reasoning behind human behaviors, and therefore attempt to answer how and why humans behave as they do. Quantitative methods assume that all constructs investigated can be measured numerically, regardless of whether they are overt or covert. Thus, quantitative methods allow us to answer the what, where, and when of human behavior. For these reasons, qualitative measures are considered subjective tests and consist primarily of open-ended (free response) items, whereas quantitative measures aspire to be objective and consist mostly of forced-choice (must select from available options) items.

Regardless of method, the very first step in every psychometric project is to define the target population (those to which results may be generalized) and the test sample (representative subset of the population). Next, the psychometrician must decide upon operational definitions of the constructs investigated. Operational definitions describe constructs by the specific manner in which they are measured. It is the operational definitions that determine all subsequent steps in the psychometric process. Let’s consider a psychometric project with the goal of scale construction (measurement tool development). The researchers wish to develop a tool that measures teenagers’ understanding and use of the Internet. The next step in a qualitative approach would be to write open-ended questions or items such as “For what purposes do you use the Internet?” or “Why is using the Internet helpful to you?” In using a quantitative approach, there is one more important consideration before measurement items are written. The researchers need to determine the variables’ scale of measurement.

Scales of Measurement

A variable’s level or scale of measurement refers to the nature of the values assigned to describe it. Because of its simplicity and ease of understanding, the research field has adopted the classification system developed by Stanley Smith Stevens (1946). Stevens’s scales of measurement consist of four levels, organized from simplest to most sophisticated: variables may be scaled at the nominal, ordinal, interval, or ratio level. A variable’s scale of measurement is important, as it determines how items will be written, how many items are needed, the data collection method, the adequate sample size, and ultimately what statistical analyses may be used to determine the tool’s psychometric properties. Each of the four scales of measurement is discussed in turn.

Nominal

Nominal variables are categorical. The values of the variable consist of names or labels and are mutually exclusive. The nominal scale is not continuous and merely reflects membership in a group. Examples of nominal variables are sex (female or male), political party affiliation (Democratic, Republican, Independent), and favorite ice cream (chocolate, strawberry, vanilla). The only comparisons that can be made between nominal variables are statements of equality or inequality. No mathematical operations are appropriate. In scale construction projects, many demographic questions regarding the sample are written at the nominal level. For example, “What is your sex?” or “Select your current grade in school.” In such cases, participants are forced to select either “female” or “male” and their grade level from a set list of options (i.e., 9th grade, 10th grade, 11th grade, or 12th grade). Only descriptive analyses such as frequencies (how often each value appears in the variable) and percentages are appropriate for assessing nominal variables.

Ordinal

Ordinal variables contain all the characteristics of nominal variables and also contain a rank order of the constructs measured. For this reason, the values of ordinal variables are called ordinals. The rank-order characteristic of these variables makes them more sophisticated than nominal variables, as greater than and less than comparisons can be made. Like the nominal scale, the ordinal scale is not truly continuous; therefore, higher mathematical operations are not appropriate. Examples of ordinal variables include marathon finalists (1st place, 2nd place, 3rd place), university classifications (freshman, sophomore, junior, senior), and socioeconomic status (lower, middle, upper). At a glance, ordinal variables look just like nominal categories; however, inherent within the values is a rank order. For example, university classification is more accurately depicted as ordinal because seniors have more completed credits than juniors, who have more completed credits than sophomores. In a scale construction project, researchers use ordinal variables to assess demographic information as well as attitudes or preferences. For example, “On a scale of one to five, how much do you prefer to shop online compared to shopping in a store?” Like nominal variables, ordinal variables are assessed by descriptive analyses such as frequencies and percentages, and the most typical values in the variable may be identified by two measures of central tendency, the median (the middle-most value) or the mode (most frequently appearing value).

Interval

The interval scale contains the characteristics of the nominal and ordinal scales as well as the assumption that the distances between anchors (values) are equivalent. These equal intervals allow meaningful comparisons among the variable’s values, and mathematical operations such as addition and subtraction are appropriate. The researcher assigns anchors that may consist of positive or negative values. If zero is used, the point it represents is arbitrary. That is, zero does not reflect the true absence or degree of the construct measured. Rather, zero is a meaningful placeholder. For example, Celsius temperature is an interval scale, as zero degrees reflects the point at which water freezes; it does not reflect the point at which temperature ceases to exist. Other interval scaled variables include standardized intelligence tests (IQ) and the SAT (Scholastic Aptitude Test). The interval scale is useful, as ratios of differences across individuals may be expressed; however, higher operations such as multiplication and division are still not appropriate. For this reason, they are not truly numeric, although many researchers choose to treat their interval measurements as such. The interval scale’s level of sophistication allows variables to be analyzed by numerous descriptive statistics such as the mode, median, and mean (arithmetic average), as well as standard deviation (average distance of values to the mean).

Ratio

The most sophisticated level of measurement is ratio. Ratio-scaled variables are continuous and truly numeric. The ratio scale contains all the characteristics of the prior three in addition to having a true zero point. That is, zero reflects the absence of the variable measured. Examples of ratio-scaled variables are time, money, length, and weight. Ratio statements are appropriate and meaningful, as it is accurate to conclude that a person who took 1.5 hours to complete her final exam was twice as fast as a person who completed the exam in 3 hours. All mathematical operations may be used, and the full range of descriptive statistics and inferential statistics (use of representative samples to estimate population parameters) may be employed for assessment of ratio variables.
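
The following sketch, using hypothetical survey data, illustrates the analyses each level of measurement supports: frequencies for nominal variables, the median and mode for ordinal variables, the mean and standard deviation for interval variables, and meaningful ratios for ratio variables.

    import pandas as pd

    # Hypothetical responses, one variable per measurement level.
    df = pd.DataFrame({
        "party":   ["Dem", "Rep", "Ind", "Dem", "Dem"],  # nominal
        "grade":   [9, 10, 10, 11, 12],                  # ordinal
        "iq":      [95, 100, 105, 110, 120],             # interval
        "time_hr": [1.5, 2.0, 2.5, 3.0, 1.0],            # ratio
    })

    print(df["party"].value_counts(normalize=True))   # frequencies/percentages
    print(df["grade"].median(), df["grade"].mode()[0])  # median and mode
    print(df["iq"].mean(), df["iq"].std())            # mean and standard deviation
    print(df["time_hr"].max() / df["time_hr"].min())  # ratios are meaningful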

Measurement Level Considerations

Sometimes the scale of variables is unclear. In such cases, the investigator defines the scale of measurement as she generates the variable’s operational definition. For example, in measuring attitudes or preferences, some consider the scale ordinal, and others, interval. Likewise, many rating scales that meet the characteristics of the interval level are treated as truly numeric, thus ratio, by the investigator. These ambiguities may occur when a Likert scale is used. Likert scales are usually considered ordinal and use numbers as anchors, each of which reflects a level of agreement or disagreement with the statement at hand. The scale was developed by Rensis Likert (1932), an educator and industrial-organizational psychologist, who used the scale to evaluate management styles. It is the most commonly used scale in survey research. In scoring a survey, many researchers sum responses across items. When this procedure is employed, the resulting sum or mean may be normally distributed (have a bell-shaped curve), allowing the new score to be treated as an interval variable.

Psychometrics in Scale Construction

Regardless of whether the measurement instrument is an interview, self-report survey, checklist, or standardized assessment, measurement is an integral part of the scale construction process. For these purposes, scale is defined as a collection of items to which participants respond in a meaningful way. The responses are scored per their level of measurement and combined to yield scale scores (aggregate values used for comparison). Scale scores are then evaluated to determine how accurately and reliably the instrument measured the construct under investigation. Thus, the scale construction process unfolds in three stages: scale design, scale development, and scale evaluation.

Scale Design

As mentioned, a clear theoretical definition of the construct at hand must first be in place. Next, in any scale construction project, certain assumptions are made. The scale developer assumes that all participants can read at the same level and interpret the items similarly. Additionally, the developer assumes minimal differences in responses due to uncontrollable individual characteristics such as motivation, personality traits, and self-presentation bias (the tendency to give socially desirable responses). The remaining steps in the scale design stage include (a) identifying and selecting the respondents; (b) determining the conditions in which the scale is administered; (c) identifying the appropriate analyses for scoring, scaling, and evaluating; (d) selecting content and writing items; and (e) sequencing the items.

Identifying and selecting respondents. Part of the operational definition of the construct depends on for whom the scale is intended, so identifying the target population comes first. This seems a straightforward process, but without a clearly defined population, it becomes ambiguous to whom the results can be generalized, leaving the matter open to reader interpretation. For the Internet use and understanding project, one may define the target population as all teenagers (ages 13 to 19 years) in the United States who use the Internet. That is fairly specific, but there are still some gray areas. How often must they have used the Internet? At least once? For what purposes must they have used the Internet? Why limit it to U.S. teenagers? How will the participants be selected?

To answer the last question, several methods are possible, but the goal of each is the same—to acquire a sample of individuals who accurately represent the target population from which they were drawn. The ideal method for acquiring such a sample is random sampling, a process that ensures that each and every member of the population has an equal chance of being selected into the sample. An example of true random sampling is placing all population members’ names into a hat and randomly drawing individuals until the desired sample size is obtained. Although it is the ideal, random sampling is rarely used in social science research, as it is often hard to find individuals who exhibit the constructs under investigation. For this reason, nonprobability sampling techniques are used, such as quota sampling (obtaining participants who reflect the numerical composition of subgroups on a trait in the population, e.g., depressed versus nondepressed individuals) and haphazard sampling (obtaining participants where they are conveniently available). Other questions may still emerge when one discerns how the scale will be administered.
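
A minimal sketch of simple random sampling, assuming (hypothetically) that a sampling frame of teenage Internet users is available as a list of identifiers:

    import random

    # Hypothetical sampling frame: every member of the target population.
    population = [f"teen_{i:05d}" for i in range(25_000)]

    # Simple random sampling without replacement: each member has an
    # equal chance of selection.
    random.seed(42)  # for a reproducible draw
    sample = random.sample(population, k=400)
    print(len(sample), sample[:3])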

How the scale is administered. The context and conditions in which the scale is administered have a profound effect on the scale’s design. Scales may be administered individually or in groups. The responses may be self-report (an individual’s perception of self) or other-report (an individual’s perception of another individual) accounts. Scales that measure ability may need to be timed and have very specific administration instructions. Scales may be administered orally, such as with interviews; in writing, such as with questionnaires and standardized tests; or physically, such as with performance tasks (e.g., finger tapping or puzzles). Finally, scales may be administered via different media, such as telephone, mail, or computer (e.g., e-mail or the Internet). A careful consideration of media is recommended, as the choice affects one’s response rate (the percentage of completed scales returned) and ultimately the validity and reliability of the information gathered.

Selecting analyses. It is not premature to consider potential analyses in the scale design stage, as the steps in the scale construction process affect one another. If the developer does not know which analyses will be employed to score, scale, and evaluate the information gathered, all previous efforts will have reduced value. The developer needs to identify the process, whether by hand or with a statistical software package, that she will use to score the items. Scoring may involve reversing the direction of some items and deciding how to extract meaning from the responses. Choices include summing or averaging items to produce scale scores or subscale scores (aggregate values across a subset of items). Scale scores are usually assessed with a statistical package such as SAS or SPSS. To that end, statistical analyses must be selected to assess the items’ and subscale scores’ structure, validity, and reliability. This decision is also affected by the purpose and the design used to collect the data.
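
As a sketch of such scoring (the items and the 1-to-5 response format here are hypothetical), a negatively worded item is reverse-scored before responses are summed into a scale score:

    import pandas as pd

    # Hypothetical responses of three participants to a 4-item scale
    # with 1-5 anchors; item_2 is negatively worded.
    responses = pd.DataFrame({
        "item_1": [4, 2, 5],
        "item_2": [2, 4, 1],   # negatively worded item
        "item_3": [5, 1, 4],
        "item_4": [4, 2, 5],
    })

    # Reverse a 1-5 item by subtracting from 6, then sum into scale scores.
    responses["item_2"] = 6 - responses["item_2"]
    responses["scale_score"] = responses[["item_1", "item_2",
                                          "item_3", "item_4"]].sum(axis=1)
    print(responses)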

Selecting content and writing items. It is best to restrict a scale to a single construct, and it is advantageous to start by searching the literature for measures with a similar purpose. If a valid and reliable measure already exists, it raises the question of whether a new one is needed. Next, the type of responses one wishes to capture depends on what is being measured. Do items ask for behavioral, cognitive, or affective responses to the construct? To address this question, many issues warrant consideration. Here we discuss five major issues.

First, item format is important. Items may be open-ended or closed (forced-choice). Open-ended questions are used when all possible answers are unknown, when the range of possible answers is very large, when the developer wishes to avoid suggesting answers to respondents, or when the developer wants answers in the respondents’ own words. Closed questions are advantageous when there is a large number of respondents or questions, when administration time is limited, when scoring will be conducted by a computer, or when responses are to be compared across time or groups (Sommer & Sommer, 1991).

Second, the response choices depend on the underlying measurement dimension. For example, in rating statements, are respondents asked the degree to which they agree or disagree with them or the level of importance the rating statements have to them personally?

Third, to determine the sensitivity of items, the number of scale points is important. Fewer scale points (three to four) are less sensitive but increase the likelihood that respondents interpret the anchors (scale-point labels) similarly. Conversely, more scale points (seven to nine) are more sensitive but may reduce accurate anchor comprehension, thus increasing error. Regardless of the number selected, the anchors should not overlap and should be inclusive, allowing an accurate choice for all respondents. Using an “other” category accomplishes the latter.

A fourth, related consideration is the number of items. The goal is to be thorough yet concise. Too few items will leave the construct underevaluated and less understood, whereas too many items may fatigue or bore respondents.

The fifth consideration in writing items is wording. The language used should be clear and meaningful. Items need to be written at the respondents’ reading level, but jargon and catch phrases should be avoided. Likewise, the developer wants to avoid double-barreled (asking about more than one topic), loaded (emotionally charged), or negatively worded questions, as they increase the likelihood that respondents interpret items differently and ultimately skew the results.

Sequencing the items. Once items are written, the developer needs to consider the order of their presentation. If the items have different purposes, then more general items and those used to establish rapport should be listed first. More specific, more personal, and more controversial items should be listed later. Items need to flow logically from one to the other, ensuring that answers to earlier items do not influence the answering of later items. If there is a mixture of dimensions measured, then factual and behavioral items should be asked first, with perceptual and attitudinal items asked subsequently. If reactivity (completing items changes respondent’s behavior) and/or self-presentation bias are unavoidable, then the presentation of items may be counterbalanced (presented in different order for some respondents) so that their possible effects may be determined in the scale-evaluation stage.

Scale Development

Now that the initial version of the scale has been produced, the next steps are taken to acquire preliminary findings that are used to edit the initial version until an acceptable final version of the scale is reached. Therefore, the scale-development stage is an iterative process that may use several different sample sizes and types. The first step may involve a pilot test (trial run with a small number of individuals). The goals of the pilot are to pretest the items for wording and clarity and to determine how easily instructions are followed and how long it takes to complete the scale. If the problems identified in the pilot test are minor, the scale is easily revised. If numerous problems are identified, a second pilot test may be warranted after the changes have been made.

With errors in respondent comprehension reduced, the second step in the scale-development stage is to administer the revised measure to a large sample of respondents who are representative of the target population. The goals of this second step depend upon the scale’s purpose. For example, if the scale is constructed to identify the best set of items to measure teenagers’ use and understanding of the Internet, then a data-reduction technique such as exploratory factor analysis may be employed. In exploratory factor analysis, scale items are intercorrelated, producing a correlation matrix. From these correlations an initial factor structure is extracted, and the structure may then be rotated to improve interpretability. Items with the highest factor loadings are retained, and items that have lower factor loadings and/or load on multiple factors are removed from the scale. In this manner, exploratory factor analysis is just one technique used to identify the fewest items that best measure an underlying construct.
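
A minimal sketch of this item-reduction logic, using simulated data and scikit-learn’s FactorAnalysis with varimax rotation (one of several tools that could be used; the .40 loading cutoff is an illustrative convention, not a fixed rule):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)

    # Simulate 300 respondents answering 8 items driven by 2 latent factors;
    # items 3 and 7 (zero-indexed) load weakly on both factors.
    factor_scores = rng.normal(size=(300, 2))
    true_loadings = np.array([[.8, 0], [.7, 0], [.6, 0], [.1, .1],
                              [0, .8], [0, .7], [0, .6], [.1, .1]])
    X = factor_scores @ true_loadings.T + rng.normal(scale=0.5, size=(300, 8))

    # Extract and rotate a 2-factor structure.
    fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
    loadings = fa.components_.T  # rows = items, columns = factors

    # Retain items that load at least .40 on exactly one factor.
    high = np.abs(loadings) >= 0.40
    keep = high.sum(axis=1) == 1
    print("retain items:", np.flatnonzero(keep))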

Scale Evaluation

Once a stable factor structure has been established, measurement instruments are evaluated on two criteria: reliability, the degree to which responses to the scale items are consistent, and validity, the degree to which the scale measures what it was developed to measure. In scale construction, reliability is determined by assessing the proportion of scale score variance that is not error variance (measurement noise). When constructing scales, many forms of validity may be determined; the most appropriate ones depend upon the purpose of the scale. For example, if the developer intends the new scale to be used as a diagnostic tool, then establishing both construct and criterion validity is important. Construct validity is the degree to which a measurement instrument accurately measures the theoretical construct it is designed to measure. In scale construction, construct validity is captured in the proportion of scale score variance that accurately represents the theoretical construct. Criterion validity (also referred to as predictive validity) is the degree to which a measurement instrument accurately predicts behavior on a criterion measure, where a criterion measure is any variable to which the construct should be related in a predictable manner. Therefore, in scale construction, criterion validity is represented by the proportion of criterion variance that is predicted by the scale. These forms of reliability and validity may be assessed by several different techniques. Specific techniques are discussed in the next two sections, as validity and reliability are important concepts beyond the task of scale construction; they are the primary goals of all psychometric endeavors.

Measuring and Improving Validity

Unlike research in the physical sciences, the constructs of interest in social and behavioral research are seldom physical, observable, or tangible phenomena. For example, psychologists want to understand concepts such as memory, prejudice, love, and job satisfaction, none of which is overt or contained in a Petri dish. For these reasons, social scientists have a vested interest in measuring and ensuring validity. Over the years, determining validity has become an ongoing process that includes several steps and activities. As mentioned previously, the first step is a clear definition of the construct under investigation, including how it is measured. Regardless of purpose, the social and behavioral researcher is interested in determining construct validity; without it, the construct essentially does not exist. The strategies used to determine construct validity require one to evaluate the correlations between the construct measured and variables to which it is known to relate theoretically in a meaningful way (Campbell & Fiske, 1959). Correlations that fit the expected pattern of relations provide evidence of the nature of the construct. In this manner, the construct validity, and thus value, of a theoretical construct is established through conclusions based upon an accumulation of correlations from various studies and samples in which the construct is measured. To this end, there are two primary forms of construct validity, convergent and discriminant.

Convergent validity is the degree to which the construct is positively related to other measures of the same construct or similar ones. For example, if one wished to develop a new measure of math ability for elementary schoolchildren, the new measure could be administered to a large sample of schoolchildren along with a test that required them to solve grade-level-appropriate math problems. The measure’s convergent validity is evaluated by examining the correlation between the children’s scale scores on the new math test and their performance on the problem-solving task. These correlations are usually referred to as validity coefficients. Positive correlations (commonly .40-.60) suggest high construct validity, but there are no set criteria as to what range constitutes “adequate” construct validity; it depends greatly on the research purpose and discipline (Cronbach, 1971; Kaplan & Saccuzzo, 2005).
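
A minimal sketch of computing such a validity coefficient, with hypothetical scores standing in for the new math scale and the problem-solving criterion task:

    from scipy.stats import pearsonr

    # Hypothetical scores for eight schoolchildren.
    new_math_scale = [55, 62, 47, 71, 66, 59, 80, 43]
    problem_solving = [12, 15, 10, 18, 16, 13, 19, 9]

    r, p = pearsonr(new_math_scale, problem_solving)
    print(f"validity coefficient r = {r:.2f} (p = {p:.3f})")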

Discriminant validity is defined as the degree to which the construct is not related to constructs for which there is no theoretical basis for a relation. Said differently, it is the degree of divergence between the construct and the variables to which it should not be systematically related. For example, to determine the discriminant validity of the new math ability test, one might administer it to school-age children along with measures of reading ability and writing ability. Again, correlations among the math, reading, and writing scale scores are evaluated. Although they may be positively related (e.g., one’s reading ability is related to interpreting math story problems), these validity coefficients should be low, indicating that the three constructs measured are distinct phenomena.

Taken together, indices of convergent and discriminant validity provide social researchers with confidence that the constructs investigated in fact measure what they purport to measure and, over time, confirm their value to all who use them. Similar to scale construction, evaluating and improving validity are ongoing processes. Accordingly, validity is the most important psychometric property of any scale, followed by the appropriate measures of reliability.

Measuring and Improving Reliability

At its most basic level, reliability reflects a measure’s consistency or stability. A scale is not reliable in all respects: reliability may be assessed over time, with different age groups, in different contexts, or with different types of people. Similar to measures of validity, the appropriate measures of reliability depend upon the purpose of the research and the nature of the construct. Here we discuss three reliability estimation procedures, with situations in which they may be used.

First, trait scales are those that measure a construct considered relatively stable over time, such as intelligence and personality types. To evaluate a trait scale one may wish to determine its test-retest reliability (consistency over time). To do so, one administers the measure to the same group of individuals at two different times; the scores across time should be similar. A correlation coefficient is computed between the time 1 and time 2 scale scores, and higher positive correlations suggest good test-retest reliability. Importantly, how the trait is defined is a key element. When personality is considered a stable set of traits, an objective assessment such as the MMPI (Minnesota Multiphasic Personality Inventory) is used. When personality is considered less stable, akin to mood, a projective assessment such as the Rorschach Inkblot Test may be used. In the latter case, test-retest reliability may be low and inappropriate.

Second, state scales are those that measure constructs expected to change in direction or intensity under different conditions. Our states (e.g., mood, motivation) are affected by many factors. To investigate state-scale reliability, one must answer one more question: Is the scale homogeneous (measuring a single construct) or heterogeneous (measuring several related constructs)? To assess homogeneous scales, a measure of internal reliability is appropriate. Internal reliability (also referred to as internal consistency) is the degree of variability that exists among the scale items. When the variability is low, suggesting that all items measure the same construct consistently, internal reliability indices are high. When variability is high, internal reliability is reduced.

There are two procedures for assessing internal reliability: the split-half method and Cronbach’s coefficient alpha. The split-half method consists of dividing the scale items into two equivalent sections. One might separate the odd- and even-numbered items or randomly place all items into two groups. Regardless, the same group of individuals completes both halves of the items, and their scores on the first half are compared to their scores on the second half with a correlation. If indeed only one construct is measured, scores on both halves should be similar, and the correlation coefficient should be positive and high, reflecting strong internal reliability.
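
A minimal sketch of the odd-even split, using simulated 0/1-scored item responses (real data from a homogeneous scale should yield a high positive correlation; purely random data, as here, will not):

    import numpy as np

    # Simulated responses: 100 respondents x 10 dichotomously scored items.
    rng = np.random.default_rng(1)
    items = rng.integers(0, 2, size=(100, 10))

    odd_scores = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_scores = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

    # Correlation between the two half-test scores.
    r_half = np.corrcoef(odd_scores, even_scores)[0, 1]
    print(f"split-half correlation: {r_half:.2f}")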

When scales are homogeneous, have no right or wrong answers, or are heterogeneous, Cronbach’s coefficient alpha may be computed to estimate internal reliability. Cronbach’s alpha measures variability among all scale items or among those within a subscale. To illustrate, most standard IQ tests measure general intelligence (innate, nonspecific mental ability) as well as specific intelligence (developed domain-specific mental abilities) via subscales. For instance, there may be subscales that measure reading comprehension, verbal ability, problem solving, and logical reasoning. In this case, an alpha coefficient may be computed across all general items to yield an internal reliability estimate of general intelligence or computed among the subscale items to estimate how consistently the problem-solving items relate to one another. Cronbach’s alpha coefficients need to be in the .70 or higher range to be considered acceptable, although opinions on this criterion vary quite a bit (Kaplan & Saccuzzo, 2005).
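
Coefficient alpha can be computed directly from a matrix of item responses using the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores). The sketch below implements it, with simulated Likert responses standing in for real data:

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha for a (respondents x items) response matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Simulated 5-item Likert responses for six respondents.
    responses = [[4, 5, 4, 4, 5],
                 [2, 2, 3, 2, 2],
                 [5, 4, 5, 5, 4],
                 [3, 3, 3, 2, 3],
                 [4, 4, 5, 4, 4],
                 [1, 2, 1, 2, 2]]
    print(f"alpha = {cronbach_alpha(responses):.2f}")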

As is true of determining validity, the process of ensuring the reliability of a scale is multifaceted and ongoing. The goal for both is continual improvement through the identification and reduction of error variance among the scale and related construct items. Once the field finds a scale’s estimates of validity and reliability acceptable, its mainstream adoption and use are usually forthcoming.

Applications

Where Psychometrics Is Practiced

You will find psychometricians working virtually anywhere—in the fields of psychology, counseling, government, marketing, forensics, education, medicine, and the military. Anywhere that research is conducted, psychometric experts are likely behind the scenes. In addition to the dimensions discussed previously (cognitions, behaviors, affects), field researchers are also interested in measuring individuals’ KSAs, the specific Knowledge, Skills, and Abilities related to work in their particular job or profession. There is no terminal degree required in order to use the title psychometrician, although most hold a graduate degree (master’s or doctorate) from a university. Most of those with formal graduate degrees received their psychometric training in an educational measurement program or quantitative psychology program. Consequently, specializations in experimental psychology are common among practicing psychometricians. For those wondering if a career in psychometrics might be a good personal fit, graduate internships in educational testing, healthcare consulting, and clinical applications are readily available.

Measuring Individual Differences

Much of the work conducted by the aforementioned professionals consists of assessing differences across individuals. Specifically, individual difference psychologists focus their examination on the ways individual people differ in their behaviors. Topics of investigation by these differential psychologists include intelligence, personality, motivation, self-esteem, and attitudes. For example, those interested in measuring intelligence are interested in understanding the differences in cognitive ability based upon age, sex, ethnicity, and other personal characteristics. It is simply not enough to gather scores from a large sample of diverse individuals, compute their average intelligence score, and generalize it back to the population with any meaning. For this reason, individual difference researchers believe that considering individual characteristics in our investigations is the way to contribute meaningfully to our knowledge bases. This rationale is why norms (means and standard deviations) for major standardized intelligence tests such as the Stanford-Binet Scale and Wechsler Intelligence Scales are available for specific age groups. Readers are encouraged to review Chapter 44, Intelligence, for more information on the theory and measurement of individual intelligence testing.

Group Testing

Individual test administration is a costly business, as a trained test administrator is required for each participant assessed; the time, effort, and resource costs accrue quickly. Therefore, a large percentage of ability measurement now occurs in groups. In group testing, a trained examiner may read the test instructions to all participants at once and impose a time limit for the test. Participants read their own test items and record their answers, usually in writing. Participant responses usually consist of objective answers to forced-choice items, so that scoring and analysis are easily completed by one individual using a statistical software package. Clearly, the primary advantage of group testing is cost-effectiveness. Group testing occurs in schools at every level, in the military, and in industry, and researchers use group administration extensively for a variety of reasons. In these fields, group testing is used for purposes including diagnostic screening, selection, assessing special abilities, and assessing interest and aptitude for particular jobs and occupational duties. Through group testing, measurement application is especially broad.

Summary

Because of the prevalence of their use among psychometricians, the latter half of this research paper focused on quantitative methods and applications as opposed to qualitative ones. Qualitative methodologists continue to develop and fine-tune robust qualitative means of measurement, and new developments emerge each year. For more information on qualitative methodology, readers should see Chapter 11, Qualitative Research.

Another area that promises psychometric advancement is item response theory modeling. Given the current trend of ever-increasing diversity in our society, models that aim to take into account individual growth curves are invaluable for an accurate understanding of human social behavior.

A final note on future directions is not novel: it is the importance of using multiple measurements in every research endeavor. The late Donald T. Campbell, considered by many the father of social research methodology, coauthored with Donald W. Fiske the paper titled “Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix.” In it, Campbell and Fiske contend that single quantitative and qualitative methods have strengths and weaknesses that, individually, leave investigations lacking. They also warn investigators of problems that may arise when a single measurement of a construct is used for diagnostic decision making, suggesting that even the most stable of traits is variable under certain conditions. Half a century later, their message remains important, which is why their 1959 paper is one of the most often cited articles in the social science research literature.

In this vein, convergent methodologies have emerged that center upon Campbell and Fiske’s multimethod/multitrait principle of establishing validity. The term triangulation has been coined to describe how both quantitative and qualitative methods are blended to measure constructs more effectively (Webb, Campbell, Schwartz, & Sechrest, 1966). In such efforts, our faith in the validity of social constructs is maximized, as methodological error variance is reduced, if not altogether eliminated. To this end, psychometricians can have confidence that the constructs we measure truly mirror the underlying human qualities we endeavor to understand.

References:

  1. Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42, 1-16.
  2. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
  3. Cozby, P. C. (2001). Methods in behavioral research (7th ed.). Mountain View, CA: Mayfield Publishing.
  4. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
  5. Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34(4), 481-489.
  6. Galilei, G. (1957). Discoveries and opinions of Galileo (S. Drake, Trans.). New York: Anchor Books. (Original work published 1610)
  7. Galton, F. (1879). Psychometric experiments. Brain: A Journal of Neurology, 2, 149-162.
  8. Jick, T. D. (1979). Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, 24(4), 602-611.
  9. Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 44-53.
  10. Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100(3), 398-407.
  11. Porter, T. M. (2003). Measurement, objectivity, and trust. Measurement, 1(4), 214-255.
  12. Professional Testing Inc. (2006). What is a psychometrician? Retrieved December 21, 2006, from http://www.proftesting.com/test_topics/pdfs/psychometrician.pdf
  13. Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-333.
  14. Rogosa, D. R. (2004). Some history on modeling the processes that generate the data. Measurement: Interdisciplinary Research and Perspectives, 2, 231-234.
  15. Sommer, B., & Sommer, R. (1991). A practical guide to behavioral research: Tools and techniques. New York: Oxford University Press.
  16. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
  17. Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 278-286.
  18. Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures. Chicago: Rand McNally.
