Glossary
 Assessment / Test / Instrument / Inventory
These terms all refer to the systematic procedures for obtaining information about the knowledge, skills, or other characteristics of the individuals being examined.
 Adaptive test
A form of testing in which the items administered to the test taker are selected on the basis of the correctness of the test taker’s responses to previous items. Test takers who perform well (poorly) on previous items tend to encounter harder (easier) questions afterward.
 Aptitude test
A test designed to predict a person’s future success or ability in a certain area.
 Classroom-based assessment
An assessment administered to evaluate students’ performance on a topic in the classroom.
 Cognitive assessment
An assessment to systematically measure cognitive domains to determine an individual’s capabilities to perform various mental activities, such as learning and problem-solving.
 Constructed-response test
A test in which the answer is not provided as an option. Instead of choosing it from a list, the test taker must construct it.
 Holistic scoring
A method of scoring responses on a constructed-response test, in which the quality of performance is evaluated using specific criteria and the overall performance is assigned a single score.
 Diagnostic assessment
A thorough process of gathering information about people’s current level of performance in order to place them in a category and then prescribe solutions that help them meet a set of learning objectives.
 Evidence-based assessment
The use of research and theory to guide the selection of assessment instruments to be used for a specific assessment purpose and to inform the methods and measures used in the assessment process.
 Formative assessment
A tool to measure students’ understanding as a course progresses. The information from the assessment can be used to provide feedback to the students throughout the course and/or to revise future class instruction for those students. See also summative assessment.
 Modified assessment
An assessment that has been adjusted for students who are unable to participate in the original assessment. Its scores are not supposed to be compared with or combined with the original assessment’s scores.
 Noncognitive assessment
An assessment designed to measure an examinee’s soft skills and attitudes.
 Objective test
A test whose items have answer keys established in advance of test administration and that can be scored without subjective judgment. See also subjective test.
 Self-report inventory
An instrument where participants respond to questions to rate their own behavior or performance, as opposed to ratings being made by observers. See also inventory.
 Standardized test
A test in which testing conditions are the same for all test takers, such as the test content, format, procedures, and scoring method.
 Subjective test
A test that requires some subjective judgment in the scoring process. See also objective test.
 Summative assessment
A tool to measure the extent to which students have met the overall goal(s) of a course. See also formative assessment.
 Battery of tests
A set of tests generally administered as a unit to comprehensively assess different aspects of a particular construct. The resulting scores are standardized so that they can be readily compared for decision-making.
 Complete battery
A collection of tests that contain a series of measures.
 Benchmark
A learning target for students at different developmental levels. It can be used to indicate a student’s progress toward meeting a specific standard.
 Bias
Any source of systematic error in an item or test. Bias can come from many sources including, but not limited to, test content, test administration, or test scoring.

 Mean
A value computed by adding up the scores of all test takers and then dividing the sum by the number of test takers.
 Median
The middle value in a sorted list of numbers. The value can separate the upper half of the list of numbers from the lower half.
 Mode
The value that occurs most frequently within a given set of numbers.
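As a small illustration of the three measures defined above, the following sketch uses Python's standard statistics module. The scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical raw scores for seven test takers (illustrative data only).
scores = [70, 85, 85, 90, 60, 75, 95]

mean_score = statistics.mean(scores)      # sum of scores / number of test takers
median_score = statistics.median(scores)  # middle value of the sorted list
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)
```

Here the mean (80) and the median (85) differ because the low score of 60 pulls the mean down more than it moves the middle of the sorted list.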
 Chance score
A test taker score defined as being at or below that which would be obtained by randomly selecting from a set of given options or by guessing.
 Chi-square test
A statistical hypothesis test that is performed when the test statistic is chi-square distributed under the null hypothesis. It compares the observed frequency of subjects in each group with the proposed or expected frequencies.
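The comparison of observed and expected frequencies can be sketched as follows. This is a minimal example of Pearson's chi-square statistic, assuming a null hypothesis of equal frequencies; the counts are hypothetical, and converting the statistic to a p-value (not shown) requires the chi-square distribution with the stated degrees of freedom.

```python
# Hypothetical observed counts of test takers choosing options A-D on one item,
# compared against a null hypothesis of equal choice frequencies.
observed = [30, 20, 25, 25]
expected = [sum(observed) / len(observed)] * len(observed)  # 25 per option

# Pearson's chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

print(chi_square, degrees_of_freedom)
```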
 Classical test theory
A theory of testing that is based on the idea that an individual’s observed or obtained test score is the sum of a true score component and an error score component. See also true score and error score.
 Error score
The difference between an observed score (measured score) and its true score. It includes random error and systematic error. See also random error, systematic error, and true score.
 True score
The score that would be obtained if the test were free of random error. In classical test theory, the true score is described as the average of the scores the test taker would receive over repeated administrations of the same test. See also error score and classical test theory.
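The relation among observed, true, and error scores in the three entries above is often written compactly. The symbols X (observed score), T (true score), and E (error score) are conventional notation assumed here for illustration; they are not used elsewhere in this glossary.

```latex
X = T + E, \qquad E = X - T, \qquad T = \mathbb{E}[X]
```

Here \mathbb{E}[X] denotes the expected (average) observed score over hypothetical repeated administrations of the same test.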
 Cluster analysis
Grouping data into clusters such that the objects within a cluster are similar to one another but different from the objects in the other clusters.
 Cognitive ability
The general mental capability of an individual to perform activities involving reasoning and problem-solving.
 Cognitive interview
A specific form of qualitative interview in which the researcher probes the cognitive understanding of a participant. This typically involves having a student respond to prompts or tasks and then explain out loud everything that they are thinking as they complete the requested prompt or task.
 Cohort
A group of people with a common statistical characteristic whose progress is followed by obtaining measurements at different points of time.
 Completion rates
Used to determine the perceived speediness or test difficulty by computing the percent of the entire test or a specific number of items completed.
 Confidence interval
An interval that provides a range of possible values indicating, with a specified probability, where an individual’s true score or a parameter of interest lies. See also confidence band.
 Confidence band
A confidence band is constructed using the test taker’s observed score and the test’s standard error of measurement: a person’s observed score ± the standard error of measurement.
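A minimal sketch of the band described above follows. The function names and the `multiplier` parameter are hypothetical; the multiplier defaults to 1 to match the definition (observed score ± one standard error of measurement). The SEM helper uses the standard classical test theory formula SEM = SD × sqrt(1 − reliability), which is not stated in this glossary, and the input values are illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    # Classical test theory formula (assumed here): SEM = SD * sqrt(1 - reliability).
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(observed_score, sem, multiplier=1.0):
    """Return (lower, upper) as observed score +/- multiplier * SEM."""
    return observed_score - multiplier * sem, observed_score + multiplier * sem

# Hypothetical values: score SD of 10 and reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
lower, upper = confidence_band(observed_score=75.0, sem=sem)

print(round(sem, 2), round(lower, 2), round(upper, 2))
```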
 Construct
A theoretical concept that is inferred from commonalities among observable phenomena and can be used to explain those phenomena.
 Construct irrelevance
Situations in which the scores of test takers are positively or negatively influenced by factors that are different from those the test is intended to measure.
 Construct underrepresentation
Occurs when certain important aspects of the construct of interest are not measured or test content is not reflective of what the test specifications require.
 Correlation
The relation between two sets of scores. See also correlation coefficient.
 Correlation coefficient
A statistic that indicates the strength of the relation between two variables for the same group of individuals. The correlation ranges from −1.00 (negatively correlated) to +1.00 (positively correlated), while a correlation of 0 (zero) indicates no relation. See also correlation.
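One common correlation coefficient is the Pearson product-moment correlation. The sketch below computes it from scratch for hypothetical score lists; the function name `pearson_r` is illustrative, and other coefficients (for example, rank-based ones) are computed differently.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

test_a = [1, 2, 3, 4, 5]
test_b = [2, 4, 6, 8, 10]   # rises in step with test_a
test_c = [10, 8, 6, 4, 2]   # falls in step with test_a

r_positive = pearson_r(test_a, test_b)
r_negative = pearson_r(test_a, test_c)
print(r_positive, r_negative)
```

The two results sit at the extremes of the range: a perfect positive relation and a perfect negative relation.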
 Intercorrelations
A matrix of correlation coefficients showing the correlation between each variable and every other variable. See also correlation and correlation coefficient.
 Criterion
The standard, guideline, or rule by which something is judged or evaluated.
 Criterion referencing
Each test taker's performance is measured by comparing it with a fixed standard. See also norm referencing.
 Norm referencing
Test takers’ performance is measured by comparing it with the performance of others, rather than against fixed criteria. This is in contrast with criterion referencing. See also criterion referencing.
 Local norms
Local norms are obtained from a local group of test takers, as opposed to a national population of interest, with the purpose of making norm-referenced score interpretations. See also national norms.
 National norms
National norms are obtained from a national population, as opposed to individual or group performance. They are used to make norm-referenced score interpretations. See also local norms.
 Criterion-referenced test
A test designed to measure individuals’ performance on specific concepts as compared to an expected level of criteria. It is in contrast with the norm-referenced test. See also norm-referenced test.
 Cut score
A specified point on the test score scale that differentiates the test takers into groups based on their scores above the point or below it.
 Mastery test
A criterion-referenced test designed to evaluate how well the test taker has mastered a domain of knowledge. Scores above a certain cut score are considered mastery. See also cut score and criterion-referenced test.
 Mastery level
The cut score for a mastery test. When test takers’ scores are higher than the cut score (above the mastery level), they are considered to have mastered the knowledge; when their scores are lower than the cut score (below the mastery level), they are considered not to have mastered the knowledge. See also cut score, criterion-referenced test, and mastery test.
 Descriptive statistics
Descriptive statistics use data to provide a description of a given data set.
 Dichotomous scoring
Scoring in which an item has only two possible scores (correct/incorrect). Most often the value of 1 is used for a correct answer and 0 for any other response. See also polytomous scoring.
 Differential item functioning (DIF)
A statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups.
 Difficulty / Item difficulty
The proportion of test takers who answer the item correctly. The higher the proportion, the easier the item.
 Discrimination / Item discrimination
A measure of the degree to which an item is able to distinguish test takers who possess much of the knowledge being measured from those who possess little of it.
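The two item statistics above can be sketched with a small scored-response matrix. The data are hypothetical, and the discrimination measure shown is the simple upper-lower index (proportion correct in the top-scoring half minus proportion correct in the bottom-scoring half), which is only one of several discrimination indices in use.

```python
# Hypothetical scored responses: rows are test takers, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

n_takers = len(responses)
n_items = len(responses[0])

# Item difficulty: proportion of test takers answering each item correctly.
difficulty = [sum(row[j] for row in responses) / n_takers for j in range(n_items)]

# Upper-lower discrimination index: proportion correct among the top-scoring
# half minus proportion correct among the bottom-scoring half.
ranked = sorted(responses, key=sum, reverse=True)
half = n_takers // 2
upper, lower = ranked[:half], ranked[-half:]
discrimination = [
    sum(r[j] for r in upper) / half - sum(r[j] for r in lower) / half
    for j in range(n_items)
]

print(difficulty)
print(discrimination)
```

In this example the second item is answered correctly by every test taker in the upper half and by none in the lower half, so it has the largest possible index of 1.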
 Distracters
Incorrect response options presented to the test taker along with the correct answer in a multiple-choice item.
 Distribution
A tabulation of ordered data showing the number of individuals in a group obtaining each score or contained within a specified fixed range of scores.
 Effect size
An estimate of the magnitude of difference in the population represented by a sample (small, medium, large).
 Error of measurement / Measurement error
The difference between an observed score (measured score) and its true score. It includes random error and systematic error. See also random error, systematic error, and true score.
 Random error
An error that arises from unknown or unpredictable causes and has no relation to any other variables. See also systematic error.
 Systematic error
The consistent, reproducible inaccuracy that systematically affects the measurement of the variable across the sample. See also random error.
 Evidence-centered design
An approach to designing and developing educational assessments that considers and collects evidence to reveal the reasoning underlying the test design.
 Factor
A variable that cannot be measured directly but can be estimated from a set of observable variables. See also factor analysis, factor loading, and factor scores.
 Factor analysis
A statistical technique that transforms the correlations among a set of observable variables into a smaller number of underlying factors. It helps identify groups of items that show similar response patterns and provides evidence that the underlying structure is consistent with the theoretical constructs.
 Confirmatory factor analysis
An analysis to test a hypothesis from theory or prior research and to assess how well the collected data fit the theoretical model. Factors are specified beforehand. See also exploratory factor analysis.
 Exploratory factor analysis
A data-driven empirical exploration of the relations among observed variables to identify the number of underlying factors that are not specified a priori. See also confirmatory factor analysis.
 Factor loading
The strength of the relationship between an item and a latent factor.
 Cross loading
The degree to which an item loads onto more than one factor.
 Factor scores
A linear combination of item scores and factor score coefficients in factor analysis. They estimate how a test taker would have scored on a factor. See also factor and factor analysis.
 Frequency
The number of times that a certain value occurs in a distribution of scores.
 Halo effect
The tendency of raters to rate people similarly on all different qualities, ignoring the fact that some people are high on some qualities and others are low on them.
 Inferential statistics
Inferential statistics are used to make inferences (generalizations) from a sample of the population to the population as a whole. Inferential statistical methods make the fundamental assumption that the sample from the population is representative of the whole population.
 Interquartile range
IQR = Q3 − Q1; The value is the difference between the 3rd quartile and the 1st quartile. See also quartile and standard deviation.
 Quartile
A quartile divides the number of data points in a distribution into four equal groups, each containing 25% of the data. The lower, middle, and upper quartiles correspond to the 25th, 50th (median), and 75th percentiles.
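The quartiles and the interquartile range can be computed with `statistics.quantiles` from Python's standard library, as in this sketch on hypothetical scores. Quantile conventions differ between tools; the function's default "exclusive" method is used here, so other software may report slightly different cut points for the same data.

```python
import statistics

# Hypothetical scores for eleven test takers (illustrative data only).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 returns the three cut points: lower quartile, median, upper quartile.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # IQR = Q3 - Q1

print(q1, q2, q3, iqr)
```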
 Item
A general term referring to a question, task, or statement on a test for which the test taker is asked to select or construct a response or perform an activity that will be scored.
 Item analysis
The statistical analysis of test takers' responses to test items in order to examine the quality of each test question.
 Item banking
The creation of a database of test items. The database stores the items together with associated information computed from the responses of test takers who have taken them.
 Item characteristic curve
The graph of the probability of answering an item correctly versus the student’s ability level. It can include up to three parameters: (a) the discrimination of the item, (b) the difficulty of the item, and (c) the guessing parameter of the item. These curves are commonly used to evaluate and interpret item response theory models, including Rasch models, which use only the difficulty parameter.
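A curve with these three parameters is commonly modeled with the three-parameter logistic function. The sketch below assumes that form; the function name `icc_3pl` and the parameter values are illustrative, and some formulations include an additional scaling constant that is omitted here.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at ability theta.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    Formula: c + (1 - c) / (1 + exp(-a * (theta - b))).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, 20% guessing.
p_at_difficulty = icc_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)

# Setting a = 1 and c = 0 reduces the curve to the one-parameter (Rasch) form.
p_rasch = icc_3pl(theta=0.0, a=1.0, b=0.0, c=0.0)

print(p_at_difficulty, p_rasch)
```

When ability equals the item difficulty, the probability is halfway between the guessing level and 1, which is why the first value is 0.6 and the Rasch value is 0.5.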
 Item response theory / Rasch analysis
A statistical theory of testing based on the relation between test takers' performances on the test item and their level of performance on an overall measure of their ability, trait, or proficiency being measured. The theory gives the probability that the test taker with a given ability level will achieve certain scores.
 Rasch model
A type of item response theory, also known as the one-parameter logistic model. It assumes the probability of answering a question correctly by a test taker depends on only one item parameter, difficulty. See also item response theory.
 Rasch scaling
A psychometric technique that is used to measure various attributes. The raw/sum score is used as an indicator of the attributes we need to measure.
 Likert (Likert scale)
An item response scale used to assess the respondents' level of "agreement" to understand psychological phenomena.
 Longitudinal data
Data in which measurements of the same individuals are taken repeatedly over time.
 Measurement invariance
A series of statistical tests that attempt to determine to what degree a given construct is validly measured across different groups. For example, before comparing the scores of Groups A and B on a particular test, one would first provide measurement invariance evidence that the constructs are robustly represented in both groups.
 Norm-referenced test
A test that provides information on test takers’ performance compared against the performance of hypothetical average test takers. A norm-referenced score typically indicates the test taker's relative position in the group. See also criterion-referenced test.
 Normal distribution
The bell-shaped distribution where values are spread symmetrically about the middle. There are many more scores concentrated in the middle than at the very high or very low scores.
 Norms
Statistics that describe the scores obtained by a specified group of test takers in order to understand how the group performed in the test.
 Percentile
A type of rank score that represents the test score below which a certain percentage of a norm group’s scores fall.
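The idea can be sketched by counting how many norm-group scores fall below a given score. The function name `percentile_rank` and the norm-group data are hypothetical, and the strict "below" convention used here is an assumption; some conventions also credit half of the tied scores.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of ten test takers.
norm_group = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
rank = percentile_rank(80, norm_group)

print(rank)
```

Six of the ten norm-group scores fall below 80, so under this convention a score of 80 sits at the 60th percentile.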
 Scaling
The process of transforming scores from one scale to another.
 Normalization
A type of scaling. Data are transformed onto a scale that produces an approximately normal distribution. See also scaling.
 Standardization
The process of obtaining norms from a representative sample of individuals under standard conditions of test administration and scoring.
 P-value
The probability, assuming the null hypothesis is true, of obtaining a value of the test statistic as extreme as or more extreme than the one computed.
 Statistically significant
Describes results that are unlikely to be due to chance alone.
 Pilot test
A pilot test is administered during the test development process. It helps make informed decisions on the future development of the test. In some cases, the pilot test is also called a field test. See also field test.
 Field test (tryout)
Administered during the test development process to ensure the credibility, dependability, and validity of the data collected with the test. In some cases, a field test is also called a pilot test. See also pilot test.
 Point-biserial correlation
A correlation between a continuous variable and a dichotomous variable (a variable with two possible values). See also point-biserial correlation coefficient.
 Point-biserial correlation coefficient
A measure of the strength of the association between a continuous variable and a dichotomous variable (a variable with two possible values). See also point-biserial correlation.
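A minimal sketch of this coefficient follows, using the standard formula r = (M1 − M0) / s × sqrt(p × q), where M1 and M0 are the mean continuous scores of the two groups, s is the population standard deviation of all continuous scores, and p and q are the proportions in each group. The function name and the data are hypothetical.

```python
import math
import statistics

def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and continuous scores."""
    ones = [t for i, t in zip(item, totals) if i == 1]
    zeros = [t for i, t in zip(item, totals) if i == 0]
    p = len(ones) / len(item)          # proportion scoring 1 on the item
    sd = statistics.pstdev(totals)     # population SD of all continuous scores
    return (statistics.mean(ones) - statistics.mean(zeros)) / sd * math.sqrt(p * (1 - p))

# Hypothetical data: item scored correct (1) or incorrect (0), plus total test scores.
item_scores = [1, 1, 1, 0, 0]
total_scores = [5, 4, 3, 2, 1]

r_pb = point_biserial(item_scores, total_scores)
print(round(r_pb, 3))
```

The result is numerically identical to the Pearson correlation computed on the same two lists with the item coded 0/1.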
 Polytomous scoring
Scoring in which an item has more than two possible scores. See also dichotomous scoring.
 Population
Represents the entire set of people/objects a researcher wants to study.
 Power analysis
A procedure that determines the smallest sample size able to detect an effect of a given size at a chosen level of significance with a desired probability (statistical power).
 Psychometrics
A field of study that focuses on how to properly measure psychological concepts. It is concerned with the design and construction of assessment tools, measurement instruments, and formalized models.
 Qualitative analysis
Aims to analyze the qualitative data gathered from individuals’ responses in interviews, focus groups, and/or open-response items, and allows researchers to evaluate data in great detail.
 Closed coding
The process of identifying and analyzing data using a pre‐established coding scheme.
 Open coding
The process of analyzing textual data by labeling salient features and developing categories based on their properties and dimensions.
 Quantitative analysis
Quantitative research is the systematic empirical investigation of phenomena via statistical, mathematical, or other techniques. It can quantify data and generalize results from a sample to the population of interest.
 Reliability
The extent to which the results of an administration can be reproduced when an assessment is repeated under the same conditions. It describes the precision of a measurement.
 Internal consistency
An approach based on the correlations between different items on the same test (or the same subscale of a test). It measures the extent to which items that purport to measure the same construct produce similar scores. See also reliability and single-administration reliability.
 Interrater reliability (IRR)
The degree of consistency between observers/coders of the data.
 Single-administration reliability
An estimate of the strength of the relation between responses to items from a single administration of a test. See also reliability.
 Coefficient Alpha (Cronbach’s Alpha)
A single-administration test score reliability coefficient. It is used to measure the internal consistency of a set of items. See also reliability, single-administration reliability, and internal consistency.
 KR-20 (Kuder-Richardson Formula 20)
A measure of internal consistency (i.e., reliability) for a test with dichotomous choices. See also reliability, single-administration reliability, and internal consistency.
 Split-half reliability
A type of internal consistency reliability to measure the consistency of the test scores by splitting the data into parts. See also reliability, single-administration reliability, and internal consistency.
 Test-retest reliability
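Coefficient alpha can be sketched with the standard formula alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data below are hypothetical; sample variances are used for both the items and the totals.

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha for a matrix with one row per respondent, one column per item."""
    k = len(item_scores[0])
    item_vars = [statistics.variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of four people to three Likert-type items.
item_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
]

alpha = cronbach_alpha(item_scores)
print(round(alpha, 3))
```

For items scored 0 or 1, the same formula gives the KR-20 value, provided the same variance convention is used throughout.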
A measure of consistency over repeated measures of the same test to the same sample at two different points in time.
 Reliability coefficient
The degree of the correlation between the scores of the same test takers on two occasions of testing with the same test.
 Sample
The subset of the population of interest.
 Sample size
The total number of cases (e.g., individuals or responses) in a sample.
 N-count (n)
The total number of data points in a particular group.
 Sampling
The selection of a subset of individuals from a population to estimate the characteristics of the whole population.
 Score (raw score)
A test score directly obtained by a test taker. It has not been adjusted and reflects the number of items that have been answered correctly.
 Semantic differential
A type of rating scale that contains opposite adjectives that measure students' attitudes toward an object.
 Semistructured interview
A tool used to gather qualitative data based on the participants’ unique perspective. Consists of main open-ended questions with follow-up probe questions.
 Structural equation modeling
A covariance-based multivariate statistical method to analyze the structural relations that may exist between constructs.
 T-score
A normalized standard score. It is a transformation of the z-score and typically ranges from 3 standard deviations below the mean to 3 standard deviations above it.
 Z-score
The number of standard deviations that a score lies away from the mean.
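The two standard scores above can be sketched as follows. The T-score conversion assumes the common convention of a mean of 50 and a standard deviation of 10, which this glossary does not state; the function names and the example values are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from the mean."""
    return (value - mean) / sd

def t_score(z):
    # Assumed convention: T-scores have a mean of 50 and a standard deviation of 10.
    return 50 + 10 * z

# Hypothetical test with a mean of 70 and a standard deviation of 5.
z = z_score(80, mean=70, sd=5)
t = t_score(z)

print(z, t)
```

Under this convention, scores within 3 standard deviations of the mean correspond to T-scores between 20 and 80.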
 Test equating
A statistical procedure by which scores from two or more alternative test forms are converted to a common score scale. The goal is to establish comparable scores on these different forms of the test, allowing them to be compared directly.
 Test protocol
Collections of test cases, which consist of the test purpose, test record, and test scores.
 Theoretical framework
A literature review section that defines the core and the theory of the study.
 Validation
The process of gathering evidence to support that it is appropriate to use the scores derived from a test in the way they were intended. See also validity.
 Validity
A body of evidence which provides support that the results of a test measure what they purport to measure.
 Concurrent evidence
Shows the extent to which the instrument results correlate with those from another well-established test implemented on the same set of data. See also validity.
 Consequential evidence
Provides information on the positive and negative implications of using scores from a test to support certain decisions. See also validity.
 Placement test
A test designed to determine which course would be optimal for a student to enroll in to begin the study.
 Readiness test
A type of prognostic test used to determine whether the test taker has adequate preparedness to handle an instructional program.
 Screening test
A test that is used to quickly and efficiently make broad categorizations of examinees or to identify individuals who deviate in a specified area.
 Construct-related evidence
Gives information about how well a test measures the construct it is claimed to measure. See also validity.
 Content-related evidence
The extent to which the instrument measures all aspects of the targeted construct. See also validity.
 Criterion-related evidence
Evidence for the validity of the data collected with a given test, obtained by comparing the scores from that test to those of another measure. See also validity.
 Criterion score
Scores used in obtaining criterion-related validity evidence for a set of scores. See also criterion-related evidence.
 Predictive evidence
The extent to which the designed instrument predicts future outcomes. See also validity.

 Evidence based on Consequences of Testing
Traditionally: n/a
Applies if you are suggesting some action or decision based on test score (e.g., remedial classes for students with a score of less than 23; like the ACT, GRE, other standardized tests). Less common in DBER.
What other information do I need to provide evidence that making these decisions based on the score is appropriate?
 Evidence based on Internal Structure
Traditionally: construct validity, discriminant validity, convergent validity, nomological validity.
What are the relations between items or groups of items within your test? What should they be? Should you measure one construct or two? Three?
 Evidence based on Relations to Other Variables
Traditionally: convergent validity, discriminant validity, predictive validity, concurrent validity, criterion validity, external validity.
How does your score relate to others? How should it be? What groups should perform differently than others?
Convergent/Discriminant evidence: What is the agreement between your test and another that should or should not be related?
Test-criterion relationships: How well do test scores predict some criteria?
Generalization: How well can evidence be generalized to other settings?
 Evidence based on Response Process
Traditionally: response process validity, cognitive validity.
To what extent do real thoughts and feelings from your participants line up with their responses? Is your test fully understood by your participants? What processes do your respondents take in answering the questions and is this an assumption you make when interpreting the results?
 Evidence based on Test Content
Traditionally: face validity, expert validity, content validity.
To what extent does the content of your test align with knowledge standards or theory? Is the content correct? Comprehensive? Are there extraneous questions? Are the questions understandable?
 Variance
A measure of the variability of a given dataset. The more that data are identical, the closer the variance is to zero. The more the data are different from each other, the greater the variance. See also standard deviation.
 Standard deviation
A measure of the variation of the values from the group mean, calculated from the deviations between each datum and the group mean score. It is equivalent to the square root of the variance. See also variance.
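The relation between the two measures above can be checked with Python's standard statistics module. The data are hypothetical, and the population forms (`pvariance`, `pstdev`) are used here as an assumption; the sample forms (`variance`, `stdev`) divide by n − 1 instead of n.

```python
import math
import statistics

# Hypothetical scores for eight test takers (illustrative data only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(scores)  # mean of squared deviations from the mean
std_dev = statistics.pstdev(scores)      # square root of the variance

# Identical data have no variability, so the variance is zero.
identical_variance = statistics.pvariance([5, 5, 5, 5])

print(variance, std_dev, identical_variance)
```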
 Weighting
A process of assigning a weight to a score to indicate its relative importance to a score distribution.