Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades—for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost.
Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. In 1966, he argued for the possibility of scoring essays by computer, and in 1968 he published his successful work with a program called ‘Project Essay Grade’ (PEG). With the technology of that time, computerized essay scoring would not have been cost-effective, so Page set the work aside for about two decades. By 1990, AES was a practical possibility. As early as 1982, a UNIX program called ‘Writers Workbench’ was able to offer punctuation, spelling, and grammar advice. In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s.
Thomas Landauer developed a system using a scoring engine called ‘Knowledge Analysis Technologies.’ The ‘Intelligent Essay Assessor’ (IEA) is an implementation of that engine in a product from Pearson Educational Technologies. IEA was first used to score essays in 1997. IntelliMetric is Vantage Learning’s AES engine. Its development began in 1996, and it was first used commercially to score essays in 1998. ETS offers ‘e-rater,’ an automated essay scoring program, which was first used commercially in 1999. ETS’s ‘Criterion Online Writing Evaluation Service’ uses the e-rater engine to provide both scores and targeted feedback.
In 2012, the Hewlett Foundation sponsored a competition called the Automated Student Assessment Prize (ASAP). Nine vendors and seventy individuals attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so.
From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters – quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores for new essays. The various AES programs differ in which specific surface features they measure, in how many essays are required in the training set, and, most significantly, in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression alone or in combination with other statistical techniques such as latent semantic analysis (analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms) and Bayesian inference (a probability system).
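The classic pipeline described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual system: the surface features (word count, a crude subordinate-clause proxy, an uppercase ratio) and the tiny hand-scored training set are hypothetical, and ordinary least squares stands in for whatever regression technique a real engine uses.

```python
import re
import numpy as np

def surface_features(essay: str) -> list[float]:
    """Extract simple surface features measurable without human insight."""
    n_words = len(essay.split())
    # Crude proxy for subordinate clauses: count common subordinating conjunctions.
    n_subordinate = len(re.findall(r"\b(because|although|while|since|unless)\b",
                                   essay.lower()))
    n_upper = sum(c.isupper() for c in essay)
    n_lower = sum(c.islower() for c in essay)
    return [n_words, n_subordinate, n_upper / max(n_lower, 1)]

# Tiny hand-scored "training set" (scores on a 1-6 scale), purely illustrative.
train_essays = [
    "Dogs are nice. I like dogs.",
    "Although the evidence is mixed, the author argues persuasively "
    "because the data support the claim.",
]
train_scores = np.array([2.0, 5.0])

# Fit a linear model relating features to scores (with an intercept term).
X = np.array([surface_features(e) for e in train_essays])
X = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X, train_scores, rcond=None)

def predict(essay: str) -> float:
    """Apply the fitted model to a new essay, clamped to the 1-6 scale."""
    x = np.array([1.0] + surface_features(essay))
    return float(np.clip(x @ w, 1, 6))
```

A production system would use far more features, thousands of training essays, and often a richer model; the structure – features in, fitted weights, predicted score out – is the same.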
Any method of assessment must be judged on validity, fairness, and reliability. An instrument is valid if it actually measures the trait that it purports to measure. It is fair if it does not, in effect, penalize or privilege any one class of people. It is reliable if its outcome is repeatable, even when irrelevant external factors are altered. Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a third, more experienced rater would settle the disagreement. In this system, there is an easy way to measure reliability: by inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. If a rater consistently disagrees with whichever other raters look at the same essays, that rater probably needs more training.
Inter-rater agreement can now be applied to measuring the computer’s performance. A set of essays is given to two human raters and an AES program. If the computer-assigned scores agree with one of the human raters as well as the raters agree with each other, the AES program is considered reliable. Alternatively, each essay is given a ‘true score’ by taking the average of the two human raters’ scores, and the two humans and the computer are compared on the basis of their agreement with the true score. This is basically a form of Turing test: by their scoring behavior, can a computer and a human be told apart?
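Agreement between any two sets of scores – human-human or human-machine – can be quantified as sketched below. Exact and adjacent ("within one point") agreement are the simplest such statistics; the score lists here are made up for illustration, and evaluations such as ASAP have used more elaborate measures like quadratically weighted kappa.

```python
def agreement(scores_a: list[int], scores_b: list[int]) -> tuple[float, float]:
    """Return (exact agreement, adjacent agreement) as fractions of essays."""
    assert len(scores_a) == len(scores_b) and scores_a
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent

# Hypothetical scores from a human rater and an AES program on five essays.
human = [4, 3, 5, 2, 4]
machine = [4, 4, 5, 1, 2]
exact, adjacent = agreement(human, machine)
```

If the machine's agreement with each human matches the humans' agreement with each other, the program passes the reliability test described above.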
In current practice, high-stakes assessments such as the GMAT are always scored by at least one human. AES is used in place of a second rater, and a human rater resolves any disagreements of more than one point. AES has been criticized on various grounds, such as ‘the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies.’ Several critics are concerned that students’ motivation will be diminished if they know that no human will read their writing. Proponents of AES point out that computer scoring is more consistent than fallible human raters and can provide students with instant feedback for formative assessment.