Glossary

  • Assessment (test, examination): These terms all refer to the systematic procedures for obtaining information about the knowledge, skills, or other characteristics of the individuals being examined.

    • Adaptive test: A form of testing in which the items administered to the test taker are selected on the basis of the correctness of the test taker’s responses to previous items. Test takers who perform well (poorly) on the previous items tend to encounter harder (easier) questions afterward.

    • Aptitude test: A test designed to predict a person’s ability in a certain area for future success.

    • Classroom assessment: An assessment administered to evaluate students’ performance on a topic in the classroom.

    • Cognitive assessment: An assessment that systematically measures cognitive domains to determine an individual’s capability to perform various mental activities, such as learning and problem-solving.

    • Constructed-response test: A test in which the answer is not provided as an option; instead of choosing it from a list, the test taker must construct it.

      • Holistic scoring: A method of scoring responses on a constructed-response test in which the quality of performance is evaluated using specific criteria and the overall performance is assigned a single score.

    • Diagnostic assessment: A thorough evaluation process of gathering information about a person’s current level of performance in order to categorize it and prescribe solutions that aid the person in meeting a set of learning objectives.

    • Evidence-based assessment: The use of research and theory to guide the selection of assessment instruments for a specific assessment purpose and to inform the methods and measures used in the assessment process.

    • Formative assessment: A tool to measure students’ understanding as a course progresses; the information from the assessment can be used to provide feedback to the students throughout the course and/or to revise future instruction for them. See also summative assessment.

    • Alternate assessment: An assessment that has been adjusted for students who are unable to participate in the original assessment. Its scores are not meant to be compared or combined with the original assessment’s scores.

    • Non-cognitive assessment: An assessment designed to measure an examinee’s soft skills and attitudes.

    • Objective test: A test whose items are keyed in advance of test administration, so it can be scored without personal subjective opinions. See also subjective test.

    • Self-report instrument: An instrument in which participants respond to questions to rate their own behavior or performance, as opposed to ratings being made by observers. See also inventory.

    • Standardized test: A test in which testing conditions, such as the test content, format, procedures, and scoring method, are the same for all test takers.

    • Subjective test: A test that requires some subjective judgment in the scoring process. See also objective test.

    • Summative assessment: A tool to measure the extent to which students have met the overall goal(s) of a course. See also formative assessment.

  • Battery: A set of tests generally administered as a unit to comprehensively assess different aspects of a particular construct. The resulting scores are standardized so that they can be readily compared for decision-making.

  • Benchmark: A learning target for students at different developmental levels; can be used to indicate a student’s progress toward meeting a specific standard.

  • Bias: Any source of systematic error in an item or test. Bias can come from many sources, including, but not limited to, test content, test administration, or test scoring.

  • Central tendency: The center or typical value of a score distribution. The mean, median, and mode are measures of central tendency (a short example follows the entries below). See also mean, median, and mode.

    • Mean: A value computed by adding up the scores of all test takers and then dividing the sum by the number of test takers.

    • Median: The middle value in a sorted list of numbers; it separates the upper half of the list from the lower half.

    • Mode: The value that occurs most frequently within a given set of numbers.
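
      A minimal sketch of the three measures of central tendency, using Python’s standard library; the score list is hypothetical.

          # Measures of central tendency for a hypothetical score list.
          import statistics

          scores = [70, 75, 75, 80, 85, 90, 95]

          print(statistics.mean(scores))    # sum / count -> about 81.43
          print(statistics.median(scores))  # middle of the sorted list -> 80
          print(statistics.mode(scores))    # most frequent value -> 75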

  • Chance score: A score at or below the score that would be obtained by randomly selecting from a set of given options, i.e., by guessing.

  • Chi-square test: A statistical hypothesis test performed when the test statistic is chi-squared distributed under the null hypothesis. It compares the observed frequency of subjects in each group with the proposed or expected frequencies, as sketched below.
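
    A minimal sketch of a chi-squared goodness-of-fit test with SciPy; the observed and expected counts are hypothetical.

        # Compare observed group frequencies with those expected under the null.
        from scipy.stats import chisquare

        observed = [18, 22, 20, 40]  # observed frequency in each group
        expected = [25, 25, 25, 25]  # frequencies proposed under the null hypothesis

        stat, p = chisquare(f_obs=observed, f_exp=expected)
        print(stat, p)  # a small p-value suggests observed and expected differ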

  • Classical test theory: A theory of testing based on the idea that an individual’s observed or obtained test score is the sum of a true score component and an error score component (see the toy simulation after the entries below). See also true score and error score.

    • Error score: The difference between an observed score (measured score) and the true score. It includes random error and systematic error. See also random error, systematic error, and true score.

    • True score: The score that would be obtained with no influence of random error in the test. In classical test theory, the true score is described as the average of the scores the test taker would receive over repeated administrations of the same test. See also error score and classical test theory.
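
      A toy simulation of the classical test theory decomposition (observed score = true score + error score); the true score and the error spread are made-up numbers.

          # Averaging many error-laden observed scores recovers the true score.
          import random

          true_score = 80.0
          observed = [true_score + random.gauss(0, 5) for _ in range(10_000)]

          print(sum(observed) / len(observed))  # close to 80.0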

  • Cluster analysis: Grouping data into clusters such that the members of a cluster are similar to one another but different from the members of other clusters.

  • Cognitive ability: The general mental capability of an individual to perform activities involving reasoning and problem-solving.

  • Cognitive interview: A specific form of qualitative interview in which the researcher probes the cognitive understanding of a participant. This typically involves having a student respond to prompts or tasks while explaining aloud everything they are thinking as they complete the requested prompt or task.

  • Cohort: A group of people with a common statistical characteristic whose progress is followed by obtaining measurements at different points in time.

  • Completion rate: Used to gauge perceived speededness or test difficulty by computing the percentage of the entire test, or of a specific number of items, that was completed.

  • Confidence interval: A range of possible values that indicates, with a specified probability, where an individual’s true score or parameter of interest lies.

    • Confidence band: A confidence band is constructed using the test taker’s observed score and the test’s standard error of measurement: a person’s observed score ± the standard error of measurement (see the sketch below).
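
      A minimal sketch of the confidence band described above; the observed score and standard error of measurement (SEM) are hypothetical.

          # Band: observed score +/- standard error of measurement.
          observed_score = 104.0
          sem = 3.5

          print((observed_score - sem, observed_score + sem))  # (100.5, 107.5)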

  • Construct: A theoretical concept that is inferred from commonalities among observable phenomena and can be used to explain those phenomena.

    • Construct-irrelevant variance: Situations in which the scores of test takers are positively or negatively influenced by factors different from those the test is intended to measure.

    • Construct underrepresentation: Occurs when certain important aspects of the construct of interest are not measured or the test content does not reflect what the test specifications require.

  • Correlation: The relation between two sets of scores. See also correlation coefficient.

    • Correlation coefficient: A statistic that indicates the strength of the relation between two sets of values for the same group of individuals. It ranges from -1.00 (perfectly negative) to +1.00 (perfectly positive), and a correlation of 0 (zero) indicates no relation. See also correlation.

    • Correlation matrix: A matrix of correlation coefficients giving the correlation between each variable and every other variable (see the sketch below). See also correlation and correlation coefficient.
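
      A minimal sketch of a correlation coefficient and a correlation matrix with NumPy; the two score sets are hypothetical.

          # Pairwise correlations for two hypothetical score sets.
          import numpy as np

          test_a = np.array([55, 60, 70, 80, 90])
          test_b = np.array([58, 62, 75, 78, 88])

          print(np.corrcoef(test_a, test_b)[0, 1])  # single coefficient, near +1.00
          print(np.corrcoef([test_a, test_b]))      # 2 x 2 correlation matrix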

  • Criterion: The standard, guideline, or rule by which something is judged or evaluated.

    • Criterion referencing: Each test taker’s performance is measured by comparing it with a fixed standard. See also norm referencing.

    • Norm referencing: Test takers’ performance is measured by comparing it with the performance of others, rather than against fixed criteria. See also criterion referencing.

      • Local norms: Norms obtained from a local group of test takers, as opposed to a national population of interest, for the purpose of making norm-referenced score interpretations. See also national norms.

      • National norms: Norms obtained from a national population, as opposed to a local group of test takers, for the purpose of making norm-referenced score interpretations. See also local norms.

  • Criterion-referenced test: A test designed to measure individuals’ performance on specific concepts against an expected criterion level, in contrast with the norm-referenced test. See also norm-referenced test.

    • Cut score: A specified point on the test score scale that divides test takers into groups according to whether their scores fall above or below it.

    • Mastery test: A criterion-referenced test designed to evaluate how well the test taker has mastered a domain of knowledge. Scores above a certain cut score indicate mastery. See also cut score and criterion-referenced test.

      • Mastery level: The cut score for a mastery test. Test takers whose scores are higher than the cut score are considered to have mastered the knowledge; those whose scores are lower are considered not to have mastered it. See also cut score, criterion-referenced test, and mastery test.

  • Descriptive statistics: Statistics that use data to provide a description of a given data set.

  • Dichotomous scoring: Scoring for an item that has only two possible scores (correct/incorrect). Most often the value 1 is used for a correct answer and 0 for any other response. See also polytomous scoring.

  • Differential item functioning: A statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups.

  • Difficulty: The proportion of test takers who answer the item correctly. The higher the proportion, the easier the item.

  • Discrimination: A measure of the degree to which an item is able to distinguish test takers who possess much of the knowledge being measured from those who possess little of it (both indices are sketched below).
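
    A minimal sketch of item difficulty (proportion correct) and a simple upper-minus-lower discrimination index for dichotomously scored items; the response matrix is hypothetical.

        # Rows = test takers sorted by total score (low to high); columns = items.
        import numpy as np

        responses = np.array([
            [0, 0, 1],
            [0, 1, 0],
            [1, 1, 0],
            [1, 1, 1],
            [1, 1, 1],
            [1, 1, 1],
        ])

        print(responses.mean(axis=0))  # difficulty: higher proportion = easier item

        half = len(responses) // 2
        lower, upper = responses[:half], responses[half:]
        print(upper.mean(axis=0) - lower.mean(axis=0))  # discrimination index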

  • Distractors: Incorrect response options presented to the test taker along with the correct answer in a multiple-choice item.

  • Distribution: A tabulation of ordered data showing the number of individuals in a group obtaining each score or falling within a specified range of scores.

  • Effect size: An estimate of the magnitude of the difference (small, medium, or large) in the population represented by a sample; one common measure, Cohen’s d, is sketched below.
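
    A minimal sketch of one common effect size measure, Cohen’s d with a pooled standard deviation; the two groups are hypothetical.

        # Cohen's d: standardized difference between two group means.
        import statistics

        group_a = [82, 85, 88, 90, 91]
        group_b = [75, 78, 80, 83, 84]

        n_a, n_b = len(group_a), len(group_b)
        pooled_var = ((n_a - 1) * statistics.variance(group_a)
                      + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
        d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_var ** 0.5
        print(d)  # rough guide: ~0.2 small, ~0.5 medium, ~0.8 large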

  • Error score: The difference between an observed score (measured score) and the true score. It includes random error and systematic error. See also random error, systematic error, and true score.

    • Random error: Error caused by unknown or unpredictable factors; it has no relation to any other variable. See also systematic error.

    • Systematic error: The consistent, reproducible inaccuracy that systematically affects the measurement of a variable across the sample. See also random error.

  • Evidence-centered design: An approach to designing and developing educational assessments that considers and collects evidence to reveal the reasoning underlying the test design.

  • Factor: A variable that cannot be measured directly but can be computed from a set of observable variables. See also factor analysis, factor loading, and factor scores.

    • Factor analysis: A statistical technique that transforms the correlations among a set of observable variables into a smaller number of underlying factors. It helps identify groups of items that show similar response patterns and provides evidence about whether the underlying structure is consistent with the theoretical constructs (see the sketch after the entries below).

    • Factor loading: The strength of the relationship between an item and a latent factor.

    • Factor scores: Linear combinations of item scores and factor score coefficients in factor analysis; they estimate how a test taker would have scored on a factor. See also factor and factor analysis.
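
      A minimal sketch of exploratory factor analysis with scikit-learn; the item scores here are random placeholders, so the loadings and factor scores are illustrative only.

          # Reduce 6 observed item scores to 2 underlying factors.
          import numpy as np
          from sklearn.decomposition import FactorAnalysis

          rng = np.random.default_rng(0)
          item_scores = rng.normal(size=(200, 6))  # 200 test takers x 6 items

          fa = FactorAnalysis(n_components=2).fit(item_scores)
          print(fa.components_)                 # factor loadings (factors x items)
          print(fa.transform(item_scores)[:3])  # factor scores, first 3 test takers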

  • Frequency: The number of times that a certain value occurs in a distribution of scores.

  • Halo effect: The tendency of raters to rate people similarly on all qualities, ignoring the fact that a person may be high on some qualities and low on others.

  • Inferential statistics: Statistics used to make inferences (generalizations) from a sample to the population as a whole. Inferential statistical methods rest on the fundamental assumption that the sample is representative of the whole population.

  • Interquartile range (IQR): IQR = Q3 − Q1; the difference between the 3rd quartile and the 1st quartile (see the sketch after the next entry). See also quartile and standard deviation.

    • Quartile: A quartile divides the data points in a distribution into four equal groups, each containing 25% of the data. The lower, middle, and upper quartiles correspond to the 25th, 50th (median), and 75th percentiles.
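
      A minimal sketch of quartiles and the interquartile range with NumPy; the score list is hypothetical.

          # Q1, median, Q3, and IQR = Q3 - Q1.
          import numpy as np

          scores = np.array([4, 7, 9, 11, 12, 15, 18, 21])

          q1, q2, q3 = np.percentile(scores, [25, 50, 75])
          print(q1, q2, q3)  # 25th, 50th (median), and 75th percentiles
          print(q3 - q1)     # interquartile range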

  • Item: A general term referring to a question, task, or statement on a test for which the test taker is asked to select or construct a response or perform an activity that will be scored.

  • Item analysis: The statistical analysis of test takers’ responses to test items in order to examine the quality of each test question.

  • Item bank: A database of test items, consisting of the items themselves and associated information computed from the responses of the test takers who have answered them.

  • Item characteristic curve: The graph of the probability of answering an item correctly versus the test taker’s ability level. Depending on the model, it involves up to three parameters: (a) the discrimination of the item, (b) the difficulty of the item, and (c) the guessing parameter of the item. Such curves are commonly used to evaluate and interpret item response theory models, including the Rasch model.

  • Item response theory: A statistical theory of testing based on the relation between test takers’ performance on a test item and their level on an overall measure of the ability, trait, or proficiency being measured. The theory gives the probability that a test taker with a given ability level will achieve a certain score.

    • Rasch model: A type of item response theory model, also known as the one-parameter logistic model. It assumes that the probability of a test taker answering a question correctly depends on only one item parameter, difficulty (see the sketch below). See also item response theory.
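
      A minimal sketch of the Rasch (one-parameter logistic) item response function; the ability and difficulty values are hypothetical.

          # P(correct) = exp(theta - b) / (1 + exp(theta - b)).
          import math

          def rasch_probability(theta, b):
              return 1.0 / (1.0 + math.exp(-(theta - b)))

          for theta in (-1.0, 0.0, 1.0):
              print(theta, rasch_probability(theta, b=0.0))
          # about 0.27, 0.50, 0.73: higher ability -> higher chance of success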

    • Rasch measurement: A psychometric technique used to measure various attributes, in which the raw/sum score is used as an indicator of the attribute being measured.

  • Likert scale: An item response scale used to assess respondents’ level of “agreement” in order to understand psychological phenomena.

  • Longitudinal study: A design in which measurements of the same individuals are taken repeatedly over time.

  • Measurement invariance: A series of statistical tests that attempt to determine to what degree a given construct is validly measured across different groups. For example, before comparing the scores of Groups A and B on a particular test, one would first provide measurement invariance evidence that the construct is robustly represented in both groups.

  • Norm-referenced test: A test that provides information on test takers’ performance compared against the performance of hypothetical average test takers. A norm-referenced score typically indicates the test taker’s relative position in the group. See also criterion-referenced test.

  • Normal distribution: The bell-shaped distribution in which values are spread symmetrically about the middle, with many more scores concentrated in the middle than at the very high or very low ends.

  • Norms: Statistics that describe the scores obtained by a specified group of test takers, used to understand how the group performed on the test.

    • Percentile rank: A type of rank score that represents the test score below which a certain percentage of a norm group’s scores fall.

    • Scaling: The process of transforming scores from one scale to another.

      • Normalized scaling: A type of scaling in which data are transformed onto a scale that produces an approximately normal distribution. See also scaling.

    • Norming: The process of obtaining norms from a representative sample of individuals under standard conditions of test administration and scoring.

  • p-value: The probability of obtaining, by pure chance, a value of the test statistic as extreme as or more extreme than the one computed.

  • Pilot test: A test administered during the test development process to help make informed decisions about the future development of the test. In some cases, the pilot test is also called a field test. See also field test.

    • Field test: A test administered during the test development process to ensure the credibility, dependability, and validity of the data collected with the test. In some cases, a field test is also called a pilot test. See also pilot test.

  • Point biserial correlation: A correlation between a continuous variable and a dichotomous variable (a variable with two possible values), as sketched below.
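
    A minimal sketch of the point biserial correlation with SciPy, here between a dichotomous item score and a continuous total score; the data are hypothetical.

        # Correlation between a 0/1 variable and a continuous variable.
        from scipy.stats import pointbiserialr

        item = [0, 0, 0, 1, 1, 1, 1]          # dichotomous scores
        total = [10, 12, 15, 18, 20, 22, 25]  # continuous scores

        r, p = pointbiserialr(item, total)
        print(r, p)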

  • Polytomous scoring: Scoring for an item that has more than two possible scores. See also dichotomous scoring.

  • Population: The entire set of people or objects a researcher wants to study.

  • Statistical power: The probability that a test will detect an effect of a given size at a certain level of significance; it is used to determine the smallest sample size able to detect the effect with reasonable scientific certainty.

  • Psychometrics: A field of study that focuses on how to properly measure psychological concepts. It is concerned with the design and construction of assessment tools, measurement instruments, and formalized models.

  • Qualitative data analysis: The analysis of qualitative data gathered from individuals’ responses in interviews, focus groups, and/or open-response items; it allows researchers to evaluate the data in great detail.

    • Deductive coding: The process of identifying and analyzing data using a pre‐established coding scheme.

    • Inductive coding: The process of analyzing textual data by labeling salient features and developing categories based on their properties and dimensions.

  • Quantitative research: The systematic empirical investigation of phenomena via statistical, mathematical, or other techniques. It can quantify data and generalize results from a sample to the population of interest.

  • Reliability: The extent to which the results of an administration can be reproduced when an assessment is repeated under the same conditions. It describes the precision of a measurement.

  • Sample: A subset of the population of interest.

  • Sample size: The total number of cases (e.g., individuals, responses, etc.) in a sample.

    • The total number of data points in a particular group.

  • Sampling: The selection of a subset of individuals from a population to estimate the characteristics of the whole population.

  • Raw score: A test score directly obtained by a test taker; it has not been adjusted and reflects the number of items answered correctly.

  • Semantic differential scale: A type of rating scale that uses pairs of opposite adjectives to measure students’ attitudes toward an object.

  • Semi-structured interview: A tool used to gather qualitative data based on the participants’ unique perspectives. It consists of main open-ended questions with follow-up probe questions.

  • Structural equation modeling: A covariance-based multivariate statistical method for analyzing the structural relations that may exist between constructs.

  • T-score: A normalized standard score, produced by transforming a z-score to a scale with a mean of 50 and a standard deviation of 10; scores typically fall within 3 standard deviations above and below the mean (see the sketch after the next entry).

    • z-score: The number of standard deviations that a score lies away from the mean.
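
      A minimal sketch of the z-score and the T-score transformation (T = 50 + 10z); the raw score and the norms are hypothetical.

          # Standardize a raw score, then rescale to the T-score metric.
          raw_score = 112
          mean, sd = 100, 15

          z = (raw_score - mean) / sd  # standard deviations from the mean
          t = 50 + 10 * z              # T-score: mean 50, standard deviation 10
          print(z, t)                  # 0.8 58.0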

  • Test equating: A statistical procedure by which scores from two or more alternative test forms are converted to a common score scale. The goal is to establish comparable scores on these different forms of the test, allowing them to be compared directly.

  • Collections of test cases, which consist of the test purpose, test record, and test scores.

  • Theoretical framework: A literature review section that defines the core concepts and theory of the study.

  • Validation: The process of gathering evidence to support that it is appropriate to use the scores derived from a test in the way they were intended. See also validity.

    • Validity: A body of evidence that provides support that the results of a test measure what they purport to measure.

      • Concurrent validity: Shows the extent to which the instrument’s results correlate with those from another well-established test administered to the same group. See also validity.

      • Consequential validity: Provides information on the positive and negative implications of using scores from a test to support certain decisions. See also validity.

        • Placement test: A test designed to determine which course would be optimal for a student to enroll in to begin their study.

        • Readiness test: A type of prognostic test used to determine whether the test taker is adequately prepared to handle an instructional program.

        • Screening test: A test used to quickly and efficiently make broad categorizations of examinees or to identify individuals who deviate in a specified area.

      • Construct validity: Gives information about how well an instrument measures the construct the test claims to measure. See also validity.

      • Content validity: The extent to which the instrument measures all aspects of the targeted construct. See also validity.

      • Criterion validity: Evidence for the validity of the data collected with a given test, obtained by comparing the scores from that test to those of another. See also validity.

      • Predictive validity: The extent to which the designed instrument predicts future outcomes. See also validity.

    • Evidence based on consequences of testing. Traditionally: n/a.

      Applies if you are suggesting some action or decision based on the test score (e.g., remedial classes for students with a score of less than 23, as with the ACT, GRE, and other standardized tests). Less common in DBER.

      What other information do I need to provide evidence that making these decisions based on the score is appropriate?

    • Evidence based on internal structure. Traditionally: construct validity, discriminant validity, convergent validity, nomological validity.

      What are the relations between items or groups of items within your test? What should they be? Should you measure one construct or two? Three?

    • Evidence based on relations to other variables. Traditionally: convergent validity, discriminant validity, predictive validity, concurrent validity, criterion validity, external validity.

      How does your score relate to others? How should it relate? What groups should perform differently from others?

      Convergent/discriminant evidence: What is the agreement between your test and another that should or should not be related?

      Test-criterion relationships: How well do test scores predict some criteria?

      Generalization: How well can evidence be generalized to other settings?

    • Evidence based on response processes. Traditionally: response process validity, cognitive validity.

      To what extent do real thoughts and feelings from your participants line up with their responses? Is your test fully understood by your participants? What processes do your respondents take in answering the questions and is this an assumption you make when interpreting the results?

    • Evidence based on test content. Traditionally: face validity, expert validity, content validity.

      To what extent does the content of your test align with knowledge standards or theory? Is the content correct? Comprehensive? Are there extraneous questions? Are the questions understandable?

  • Variance: A measure of the variability of a given dataset. The more the data are identical, the closer the variance is to zero; the more the data differ from each other, the greater the variance (see the sketch after the next entry). See also standard deviation.

    • Standard deviation: A measure of the variation of values from the group mean, calculated from the deviations between each datum and the group mean; it is equal to the square root of the variance. See also variance.
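
      A minimal sketch of variance and standard deviation computed from the deviations around the group mean; the score list is hypothetical.

          # Population variance and its square root, the standard deviation.
          scores = [2, 4, 4, 4, 5, 5, 7, 9]

          mean = sum(scores) / len(scores)
          variance = sum((x - mean) ** 2 for x in scores) / len(scores)
          print(variance, variance ** 0.5)  # 4.0 2.0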

  • Weighting: The process of assigning a weight to a score to indicate its relative importance in a score distribution.